Multiple k-mer sizes confirmation and testing #20

Open
2 of 6 tasks
dkoslicki opened this issue Feb 6, 2020 · 2 comments

Comments

@dkoslicki
Owner

dkoslicki commented Feb 6, 2020

Definitions:
"new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values
"old method" = train and re-run CMash on each individual k-mer size.
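To make the two definitions concrete, here is a minimal sketch in plain Python. It is not CMash's implementation: it skips sketching entirely, uses a Python set of truncated k-mers as a stand-in for prefix queries against the ternary search trie, and assumes containment means the fraction of reference k-mers found in the query. The function names `old_method` and `new_method` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Fraction of the elements of a that also appear in b (assumed definition)."""
    return len(a & b) / len(a) if a else 0.0


def old_method(reference, query, k_sizes):
    """Recompute the reference k-mer set separately for every k-mer size."""
    return {k: containment(kmers(reference, k), kmers(query, k)) for k in k_sizes}


def new_method(reference, query, k_sizes):
    """Enumerate only the largest k-mers, then truncate them to prefixes of length k."""
    k_max = max(k_sizes)
    largest = kmers(reference, k_max)
    result = {}
    for k in k_sizes:
        # Stand-in for a TST prefix match: keep the first k characters of each k_max-mer.
        prefixes = {x[:k] for x in largest}
        result[k] = containment(prefixes, kmers(query, k))
    return result


reference = "ACGTACGTTGCA"
query = reference + "TTT"  # the query contains the whole reference
k_sizes = [3, 5, 7]

old = old_method(reference, query, k_sizes)
new = new_method(reference, query, k_sizes)
print(old)
print(new)

# The truncated prefixes are only a subset of the true 3-mers, because the
# k-mers starting in the last (k_max - k) positions are never a prefix.
prefixes_3 = {x[:3] for x in kmers(reference, 7)}
print(sorted(kmers(reference, 3) - prefixes_3))
```

The last lines show where the two methods can diverge: the prefix set misses the small k-mers near the end of each sequence, so the two estimates agree only when those k-mers behave like the rest.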

Tasks:

  • Address Multiple k-mer sizes bug #19 so we know nothing funky is happening with the current implementation.
  • Create testing environment to prepare for comparing old method to new method
  • Compare old to new method in estimating containment indexes between individual genomes (i.e. using this kind of command). Repeat over many different genomes to get an idea of the difference between old and new method
  • Compare old to new method in estimating presence/absence of training database organisms in many simulated metagenomes (i.e. run StreamingQueryDNADatabase.py with multiple k-mer size training database and compare to running StreamingQueryDNADatabase.py many times with training databases trained with a specific k-mer size).

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, would need to:

  • Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up)
  • Test the magnitude of the bias factor over many genomes (relatively straightforward task).
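One way the last task could be prototyped is sketched below. It is only an illustration under stated assumptions: random sequences stand in for real genomes, the query is a randomly mutated copy of the reference, no sketching is used, and the "gap" is simply the prefix-truncated containment minus the containment computed directly at size k. The helper names (`mutate`, `truncation_gap`, `mean_gap`) and all parameter values are hypothetical.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a, b):
    """Fraction of the elements of a that also appear in b (assumed definition)."""
    return len(a & b) / len(a) if a else 0.0


def mutate(seq, rate, rng):
    """Redraw each base with probability rate (the redraw may pick the same base)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


def truncation_gap(reference, query, k, k_max):
    """Prefix-truncated containment minus the containment computed directly at size k."""
    exact = containment(kmers(reference, k), kmers(query, k))
    prefixes = {x[:k] for x in kmers(reference, k_max)}
    truncated = containment(prefixes, kmers(query, k))
    return truncated - exact


def mean_gap(num_genomes=20, length=2000, k=5, k_max=21, rate=0.05, seed=0):
    """Average the gap over several random reference/query pairs."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(num_genomes):
        reference = "".join(rng.choice("ACGT") for _ in range(length))
        query = mutate(reference, rate, rng)
        gaps.append(truncation_gap(reference, query, k, k_max))
    return sum(gaps) / len(gaps)


gaps_by_k = {k: mean_gap(k=k) for k in (5, 11, 21)}
for k, gap in gaps_by_k.items():
    print(k, gap)
```

When k equals k_max the prefixes are the full k-mers, so the gap is zero by construction; the interesting numbers are the ones for the smaller sizes, and real genomes would replace the random sequences here.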
dkoslicki added a commit that referenced this issue Mar 18, 2020
…. Initial commit to add files. Scratch work, mainly for my local efforts as directories are hard coded.
dkoslicki added a commit that referenced this issue Mar 18, 2020
dkoslicki added a commit that referenced this issue Mar 18, 2020
dkoslicki added a commit that referenced this issue Mar 18, 2020
… altogether, and don't break if you don't see a TST match for a shorter kmer size. #2 and #20. Calling it a night
dkoslicki added a commit that referenced this issue Mar 18, 2020
… original (revcomped), realized this is actually issue #19 #20 and #2. Including break for now in StreamingQuery... line 261
dkoslicki added a commit that referenced this issue Mar 18, 2020
…pears to resolve the issue, but really need to think better about not adding *all the things* to the TST and BF. #2 #20 #19
dkoslicki added a commit that referenced this issue Mar 18, 2020
@dkoslicki
Owner Author

@ShaopengLiu1 #19 should be addressed now. I am not closing #19 or #2 until we have a better testing environment spun up, as all the tests I have done so far were run locally (far from optimal).

dkoslicki added a commit that referenced this issue Mar 20, 2020
dkoslicki added a commit that referenced this issue Mar 20, 2020
… to #20 and same change as in a4e1850 just on a different branch
@dkoslicki
Owner Author

@ShaopengLiu1 just a note: I added a class that will now compute the absolute ground truth containment indices. Recall that the last column of the StreamingQueryDNADatabase.py output is still an estimate of the containment index (just using un-truncated k-mers). The class to compute the ground truth is at /CMash/CMash/GroundTruth.py. You can see in this comment how the results from tests/script_tests/./run_small_tests.sh correspond quite nicely with the ground truth values.

If you would like to utilize this ground truth class, I strongly suggest you use my personal server (ping me if you forgot the IP address and login info) as it takes quite a bit of time and memory to brute-force calculate all the k-mers and their reverse complements.

To interact with the class, you can do something like:

import CMash.GroundTruth as G
training_database_file = "<snip>/TrainingDatabase.h5"
query_file = "<snip>/taxid_1192839_4_genomic.fna.gz"
g = G.TrueContainment(training_database_file=training_database_file, k_sizes="4-6-1")  # this step will take a long time if the k_sizes are realistically large 
df = g.return_containment_data_frame(query_file=query_file, location_of_thresh=-1, coverage_threshold=.1)

Note that the query_file need not be in the TrainingDatabase.h5 (as its k-mers will still be enumerated if it's not in the training database).
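For intuition about what a brute-force ground truth involves, here is a small standalone sketch. It is not the code in GroundTruth.py: it assumes a k-mer and its reverse complement are merged by keeping the lexicographically smaller of the two (the canonical form), assumes containment is the fraction of reference k-mers found in the query, and reads "4-6-1" as sizes 4 through 6 in steps of 1. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Enumerate every k-mer, keeping the smaller of it and its reverse complement."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases such as N
            out.add(min(kmer, reverse_complement(kmer)))
    return out


def true_containment(reference, query, k_sizes):
    """Exact containment of the reference k-mers in the query, for each k-mer size."""
    result = {}
    for k in k_sizes:
        ref = canonical_kmers(reference, k)
        qry = canonical_kmers(query, k)
        result[k] = len(ref & qry) / len(ref) if ref else 0.0
    return result


reference = "ACGTTGCAAC"
query = reverse_complement(reference)  # same genome, read from the other strand
truth = true_containment(reference, query, range(4, 7))
print(truth)
```

Every k-mer of every sequence is held in memory at once, which is why this kind of calculation gets slow and memory-hungry for realistic genomes and k-mer sizes.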
