From Artem by email: "[T]he difference in number of reads mapping between CoV+ control libraries and mammal transcriptomes is very large."
I suspect this may be misleading. If I understand correctly, CoV+ datasets have known infections by known coronaviruses, mostly (all?) Cov-19, but we are looking for incidental infections by novel coronaviruses, which by definition are not in the pan-genome. Possibly, the number of viral transcripts will tend to be much lower with an incidental infection. Certainly, a novel virus will have lower identity to the pan-genome, and more uneven coverage of it, than a positive control with a Cov-19 infection.
To model what we might see in production, CoV+ datasets should be mapped to a pan-genome from which the genome of the known infection (Cov-19), its genes, and its close relatives have been excluded. This is hold-out validation, a simple form of cross-validation.
A hold-out pan-genome can be constructed using usearch as follows:
usearch -usearch_global pan_genome.fa \
-db cov19_genome.fa \
-strand both \
-id 0.95 \
-uc hits.uc \
-notmatched holdout_pan_genome.fa
Here, 0.95 is the identity threshold: all sequences with >=95% identity to cov19_genome.fa are removed from the reference. hits.uc is a tab-separated file with one record for each pan-genome sequence, indicating whether it matched or not.
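As a sanity check on the hold-out step, it may help to count how many pan-genome sequences were removed. The sketch below is only an illustration, not part of the pipeline: it assumes the usual UC layout (record type in column 1, with H for a hit and N for no hit, and the query label in column 9), and the function name and example labels are hypothetical.

```python
import io
from collections import Counter


def summarize_uc(handle):
    """Count UC record types and collect query labels by match status.

    Assumes tab-separated UC records with at least 10 fields, where
    field 1 is the record type and field 9 is the query label.
    """
    counts = Counter()
    matched, unmatched = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        record_type, query_label = fields[0], fields[8]
        counts[record_type] += 1
        if record_type == "H":    # hit: sequence would be held out
            matched.append(query_label)
        elif record_type == "N":  # no hit: sequence stays in the reference
            unmatched.append(query_label)
    return counts, matched, unmatched


# Made-up records for illustration only; real input would be hits.uc.
example = (
    "H\t0\t1200\t98.5\t+\t0\t0\t1200M\tseq_close_relative\tcov19\n"
    "N\t*\t*\t*\t.\t*\t*\t*\tseq_distant\t*\n"
)
counts, matched, unmatched = summarize_uc(io.StringIO(example))
print(f"held out: {counts['H']}, kept: {counts['N']}")
```

On real data, the function would be called as `summarize_uc(open("hits.uc"))`. The number of kept sequences should equal the number of records in holdout_pan_genome.fa.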