Perform CoV Cross-validation (hold-out) #48

Closed
rcedgar opened this issue Apr 24, 2020 · 1 comment
Assignees
Labels
Bioinformatics Bioinformatics task

Comments

@rcedgar
Collaborator

rcedgar commented Apr 24, 2020

From Artem by email: "[T]he difference in number of reads mapping between CoV+ control libraries and mammal transcriptomes is very large."

I suspect this may be misleading. If I understand correctly, Cov+ datasets have known infections by known coronaviruses, mostly (all?) Cov-19, but we are looking for incidental infections by novel coronaviruses, which by definition are not in the pan-genome. Possibly, the number of viral transcripts will tend to be much lower with an incidental infection. Certainly, a novel virus will have lower identity to the pan-genome, and more uneven coverage of it, than a positive control with a Cov-19 infection.

To model what we might see in production, Cov+ datasets should be mapped to a pan-genome from which the genome of the known infection (Cov-19), its genes, and its close relatives have been excluded. This is hold-out validation, also called cross-validation.

A hold-out pan-genome can be constructed using usearch as follows:

usearch -usearch_global pan_genome.fa \
-db cov19_genome.fa \
-strand both \
-id 0.95 \
-uc hits.uc \
-notmatched holdout_pan_genome.fa

Here, 0.95 is the identity threshold: all pan-genome sequences with >=95% identity to cov19_genome.fa are removed from the reference. hits.uc is a tab-separated file with one record for each pan-genome sequence, indicating whether it matched or not.
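
As a quick sanity check on the hold-out step, the records in hits.uc can be tallied. The sketch below assumes the usual .uc layout, where field 1 is the record type (H for a hit, N for no hit), and that each pan-genome sequence has exactly one record, as described above. The function name summarize_uc is hypothetical and is not part of usearch.

```shell
# Tally .uc records by type.
# Assumption: field 1 is the record type; "H" = matched (removed from the
# reference), "N" = not matched (kept in the hold-out pan-genome).
summarize_uc() {
    awk -F'\t' '
        $1 == "H" { removed++ }   # sequence hit cov19_genome.fa at >= threshold
        $1 == "N" { kept++ }      # sequence had no hit and stays in the reference
        END { printf "removed=%d kept=%d\n", removed, kept }
    ' "$1"
}
```

Running `summarize_uc hits.uc` prints the two counts. If the assumptions hold, the kept count should equal the number of sequences in holdout_pan_genome.fa, which can be checked with `grep -c '^>' holdout_pan_genome.fa`.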

@rcedgar rcedgar added the Bioinformatics Bioinformatics task label Apr 24, 2020
@ababaian ababaian changed the title from "Testing on Cov+ datasets" to "Perform CoV Cross-validation (hold-out)" Apr 24, 2020
@ababaian
Owner

This is very well outlined. @victorlin #27 this is a perfect example of the application of the script you're working on.

@rcedgar rcedgar self-assigned this Apr 27, 2020
@rcedgar rcedgar closed this as completed May 4, 2020