Feature/fix issue 36 #39

migbro · 2024-02-02T20:24:45Z

My colleague @dmiller15 had a much simpler suggestion to simply skip * entries as a start in the annotator tool. It'd likely tidy up our current concerns...expanding to skip all non ACGT alleles (like N or ACN) could be tackled, but not sure it needs to immediately.

into feature/fix-issue-36

src/commands/annotate_cmd.rs

migbro · 2024-02-05T14:42:16Z

Hmmm, your suspicions were correct. Performance took an absolute beating. Before:

[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 54724
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 109576
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 136645
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820929
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820932
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820933
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820937
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820940
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820946
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820948
[echtvar] chr1:3856686 evaluated 10000 variants (20618 / second). wrote 10000 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (25348 / second). wrote 20000 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (24135 / second). wrote 30000 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (26082 / second). wrote 100000 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (26648 / second). wrote 200000 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (28639 / second). wrote 300000 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (28066 / second). wrote 1000000 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (28035 / second). wrote 2000000 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (28011 / second). wrote 3000000 variants.
[echtvar] chr10:5013757 evaluated 4000000 variants (27790 / second). wrote 4000000 variants.
[echtvar] chr12:131568036 evaluated 5000000 variants (27859 / second). wrote 5000000 variants.
[echtvar] chr17:26955403 evaluated 6000000 variants (27892 / second). wrote 6000000 variants.
[echtvar] chrX:44338293 evaluated 7000000 variants (27958 / second). wrote 7000000 variants.
[echtvar] evaluated 7168554 variants (27870 / second). wrote 7168554 variants.

real	4m18.420s
user	5m44.228s
sys	0m11.149s

After:

contig chr1 pos 54724 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 109576 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 136645 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820929 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820932 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820933 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820937 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820940 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820946 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820948 alt has * value, skipping annotation, outputting entry as-is
[echtvar] chr1:3856686 evaluated 10000 variants (3594 / second). wrote 9932 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (4405 / second). wrote 19871 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (4131 / second). wrote 29808 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (4478 / second). wrote 99431 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (4530 / second). wrote 198983 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (4948 / second). wrote 298617 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (4800 / second). wrote 995154 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (4830 / second). wrote 1990363 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (4885 / second). wrote 2985468 variants.

I didn't bother letting it finish...I'll see if we can think of something else. At that speed, it's no better than bcftools I don't think.

migbro · 2024-02-05T14:42:56Z

This was me testing on a trio germline dataset

brentp · 2024-02-05T14:45:09Z

hmm. it shouldn't have that much difference. Are you sure it wasn't somehow cached on the disk? In first run? And did you compile in release mode both times?

migbro · 2024-02-05T14:49:06Z

Hmmm, the first time was from a docker image that I had built (which is release mode), the second time built like it is in the big.sh script with very few flags. I can see if that makes a difference, it'd be nice if that were the case.

brentp · 2024-02-05T14:56:23Z

yeah, default is debug mode, so if you add --release, it should help a lot.

dmiller15 · 2024-02-05T15:01:39Z

I guess if we're hard set on avoiding additional vectors we could potentially move this into the update_expr_values function. We're already create the alleles vector there.

This eidx if/else https://github.com/brentp/echtvar/blob/main/src/lib/echtvar.rs#L360-L374 could get an additional check for alleles[1][0] being star and set Err(0).

Changes the behavior in that we're going to get the user defined missing values for these star records.

migbro · 2024-02-05T15:18:34Z

Ha, you weren't kidding. In release mode:

contig chr1 pos 54724 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 109576 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 136645 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820929 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820932 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820933 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820937 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820940 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820946 alt has * value, skipping annotation, outputting entry as-is
contig chr1 pos 820948 alt has * value, skipping annotation, outputting entry as-is
not reporting further warnings
[echtvar] chr1:3856686 evaluated 10000 variants (23809 / second). wrote 9932 variants.
[echtvar] chr1:7278646 evaluated 20000 variants (28735 / second). wrote 19871 variants.
[echtvar] chr1:11711930 evaluated 30000 variants (27700 / second). wrote 29808 variants.
[echtvar] chr1:41528737 evaluated 100000 variants (29895 / second). wrote 99431 variants.
[echtvar] chr1:88677065 evaluated 200000 variants (30280 / second). wrote 198983 variants.
[echtvar] chr1:123927955 evaluated 300000 variants (32761 / second). wrote 298617 variants.
[echtvar] chr2:191385523 evaluated 1000000 variants (32364 / second). wrote 995154 variants.
[echtvar] chr4:166499610 evaluated 2000000 variants (32402 / second). wrote 1990363 variants.
[echtvar] chr7:26761630 evaluated 3000000 variants (32299 / second). wrote 2985468 variants.
[echtvar] chr10:5013757 evaluated 4000000 variants (31676 / second). wrote 3980662 variants.
[echtvar] chr12:131568036 evaluated 5000000 variants (31154 / second). wrote 4975474 variants.
[echtvar] chr17:26955403 evaluated 6000000 variants (30839 / second). wrote 5970308 variants.
[echtvar] chrX:44338293 evaluated 7000000 variants (30891 / second). wrote 6964726 variants.
[echtvar] evaluated 7168554 variants (30847 / second). wrote 7132548 variants.

real	3m53.756s
user	5m33.645s
sys	0m9.607s

And to rule out disk caching, I grabbed another file:

...
[echtvar] chr1:3621740 evaluated 10000 variants (23866 / second). wrote 9923 variants.
[echtvar] chr1:6984629 evaluated 20000 variants (28694 / second). wrote 19863 variants.
[echtvar] chr1:11068271 evaluated 30000 variants (29013 / second). wrote 29818 variants.
[echtvar] chr1:36348307 evaluated 100000 variants (29682 / second). wrote 99474 variants.
[echtvar] chr1:80929174 evaluated 200000 variants (28731 / second). wrote 199011 variants.
[echtvar] chr1:122617572 evaluated 300000 variants (30117 / second). wrote 298533 variants.
[echtvar] chr2:161533605 evaluated 1000000 variants (30614 / second). wrote 995313 variants.
[echtvar] chr4:119815132 evaluated 2000000 variants (31118 / second). wrote 1990591 variants.
[echtvar] chr6:133804491 evaluated 3000000 variants (32149 / second). wrote 2985724 variants.
[echtvar] chr9:40358564 evaluated 4000000 variants (31931 / second). wrote 3980823 variants.
[echtvar] chr12:20835434 evaluated 5000000 variants (31591 / second). wrote 4975837 variants.
[echtvar] chr15:97834681 evaluated 6000000 variants (31408 / second). wrote 5970966 variants.
[echtvar] chr20:47014627 evaluated 7000000 variants (31129 / second). wrote 6965828 variants.
[echtvar] evaluated 7576734 variants (31222 / second). wrote 7539482 variants.
Skipped 37252 variants with * alt.

Lastly, cleared the disk cache with echo 3 | sudo tee /proc/sys/vm/drop_caches and reverted to pre-* skip:

[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 180951
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 631860
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 789516
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 789517
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 791549
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820931
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820940
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 820946
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 904491
[warning] found non ACGT ALT character '*', encoding as 'T' for (1-based) position: 904493
[echtvar] chr1:3621740 evaluated 10000 variants (18761 / second). wrote 10000 variants.
[echtvar] chr1:6984629 evaluated 20000 variants (21482 / second). wrote 20000 variants.
[echtvar] chr1:11068271 evaluated 30000 variants (20242 / second). wrote 30000 variants.
[echtvar] chr1:36348307 evaluated 100000 variants (24260 / second). wrote 100000 variants.
[echtvar] chr1:80929174 evaluated 200000 variants (25169 / second). wrote 200000 variants.
[echtvar] chr1:122617572 evaluated 300000 variants (26221 / second). wrote 300000 variants.
[echtvar] chr2:161533605 evaluated 1000000 variants (27826 / second). wrote 1000000 variants.
[echtvar] chr4:119815132 evaluated 2000000 variants (27360 / second). wrote 2000000 variants.
[echtvar] chr6:133804491 evaluated 3000000 variants (28070 / second). wrote 3000000 variants.
[echtvar] chr9:40358564 evaluated 4000000 variants (28185 / second). wrote 4000000 variants.
[echtvar] chr12:20835434 evaluated 5000000 variants (28432 / second). wrote 5000000 variants.
[echtvar] chr15:97834681 evaluated 6000000 variants (28554 / second). wrote 6000000 variants.
[echtvar] chr20:47014627 evaluated 7000000 variants (28659 / second). wrote 7000000 variants.
[echtvar] evaluated 7576734 variants (28659 / second). wrote 7576734 variants.

real	4m25.610s
user	6m0.630s
sys	0m11.247s

Perhaps surprisingly or not, it's actually faster. My guess is cutting out needing to encode the * values, and as @dmiller15 pointed out to me, also the annotation of the records all together.

migbro added 4 commits February 2, 2024 19:45

🔨

650687b

🧹 clean up output message

08ceffc

🔨 add skip * alleles

bade14c

Merge branch 'feature/fix-issue-36' of https://github.com/migbro/echtvar

8a96973

into feature/fix-issue-36

brentp reviewed Feb 3, 2024

View reviewed changes

src/commands/annotate_cmd.rs Outdated Show resolved Hide resolved

🔨 change vec eval, limit print warn

c5ec38b

migbro added 2 commits February 5, 2024 17:26

🔨 added skip final warn

2e5b08d

🐎 change scope of warn message

e02d006

brentp merged commit 45b468e into brentp:main Feb 6, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/fix issue 36 #39

Feature/fix issue 36 #39

migbro commented Feb 2, 2024

migbro commented Feb 5, 2024

migbro commented Feb 5, 2024

brentp commented Feb 5, 2024

migbro commented Feb 5, 2024

brentp commented Feb 5, 2024

dmiller15 commented Feb 5, 2024

migbro commented Feb 5, 2024 •

edited

Loading

Feature/fix issue 36 #39

Feature/fix issue 36 #39

Conversation

migbro commented Feb 2, 2024

migbro commented Feb 5, 2024

migbro commented Feb 5, 2024

brentp commented Feb 5, 2024

migbro commented Feb 5, 2024

brentp commented Feb 5, 2024

dmiller15 commented Feb 5, 2024

migbro commented Feb 5, 2024 • edited Loading

migbro commented Feb 5, 2024 •

edited

Loading