Multiple k-mer sizes confirmation and testing #20

dkoslicki · 2020-02-06T21:27:36Z · edited by ShaopengLiu1

Definitions:
"new method" = use a single very large k-mer size, put the k-mers in a ternary search trie (TST), and use prefix matches to infer containment values for smaller k-mer sizes.
"old method" = train and re-run CMash separately for each individual k-mer size.

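To make the two definitions concrete, here is a minimal sketch on toy strings. It is not CMash code: it uses exact Python sets instead of sketches or a TST, ignores reverse complements, and the function names `containment_old` and `containment_new` are made up for illustration.

```python
def kmers(seq, k):
    """Set of all k-mers of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a_kmers, b_kmers):
    """Fraction of the k-mers of A that also occur in B."""
    return len(a_kmers & b_kmers) / len(a_kmers) if a_kmers else 0.0


def containment_old(a, b, k):
    """'Old method' analogue: enumerate k-mers directly at size k."""
    return containment(kmers(a, k), kmers(b, k))


def containment_new(a, b, k_max, k):
    """'New method' analogue: enumerate k_max-mers once, then
    truncate each one to its length-k prefix."""
    a_prefixes = {x[:k] for x in kmers(a, k_max)}
    b_prefixes = {x[:k] for x in kmers(b, k_max)}
    return containment(a_prefixes, b_prefixes)


a = "ACGTACGTTG"
b = "ACGTACGAAG"
print(containment_old(a, b, k=3))            # 4 shared of 6 distinct 3-mers
print(containment_new(a, b, k_max=6, k=3))   # 4 shared of 4 distinct prefixes
```

On these toy strings the two values differ (4/6 versus 1.0) because k-mers starting in the last k_max - k positions of a sequence are not the prefix of any k_max-mer. Whether that edge effect is a meaningful part of the discrepancy on real genomes is exactly what the tasks below are meant to check.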
Tasks:

Address Multiple k-mer sizes bug #19 so we know nothing funky is happening with the current implementation.
Create a testing environment to prepare for comparing the old method to the new method.
Compare the old and new methods in estimating containment indices between individual genomes (i.e. using this kind of command). Repeat over many different genomes to get an idea of the difference between the two methods.
Compare the old and new methods in estimating presence/absence of training database organisms in many simulated metagenomes (i.e. run StreamingQueryDNADatabase.py once with a multiple k-mer size training database, and compare to running StreamingQueryDNADatabase.py many times with training databases that were each trained with a specific k-mer size).
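For the presence/absence comparison in the last task, something along these lines could tabulate where the two methods agree. This is only a sketch: the dictionaries stand in for per-organism containment estimates parsed from the two kinds of runs, the organism names and numbers are placeholders rather than results, and the helper names are hypothetical.

```python
def presence_calls(containment_by_organism, threshold):
    """Organisms whose containment estimate meets the threshold (>=)."""
    return {name for name, c in containment_by_organism.items() if c >= threshold}


def compare_calls(old, new, threshold):
    """Split organisms by which method(s) call them present."""
    old_present = presence_calls(old, threshold)
    new_present = presence_calls(new, threshold)
    return {
        "both": sorted(old_present & new_present),
        "old_only": sorted(old_present - new_present),
        "new_only": sorted(new_present - old_present),
    }


# Placeholder values for illustration only, not measured output.
old = {"org_A": 0.92, "org_B": 0.10, "org_C": 0.04}
new = {"org_A": 0.95, "org_B": 0.08, "org_C": 0.12}
print(compare_calls(old, new, threshold=0.1))
# {'both': ['org_A'], 'old_only': ['org_B'], 'new_only': ['org_C']}
```

When the simulated metagenome's true composition is known, the same set operations against the truth set would give false positives and false negatives per method.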

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, we would need to:

Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up).
Test the magnitude of the bias factor over many genomes (relatively straightforward task).
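A rough sketch of what "over many genomes" could look like as a harness, using random synthetic sequences in place of real genomes. The quantity computed here (truncated-prefix containment minus direct containment) is only a stand-in for the bias factor in the write-up, which this thread does not define; all names and parameters below are assumptions.

```python
import random


def kmers(seq, k):
    """Set of all k-mers of seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(a_kmers, b_kmers):
    """Fraction of the k-mers of A that also occur in B."""
    return len(a_kmers & b_kmers) / len(a_kmers) if a_kmers else 0.0


def discrepancy(a, b, k_max, k):
    """Prefix-truncated containment minus direct containment at size k."""
    direct = containment(kmers(a, k), kmers(b, k))
    truncated = containment({x[:k] for x in kmers(a, k_max)},
                            {x[:k] for x in kmers(b, k_max)})
    return truncated - direct


def mutate(seq, rate, rng):
    """Redraw each base uniformly from ACGT with probability `rate`
    (the redrawn base may equal the original)."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


def mean_discrepancy(n_pairs, length, k_max, k, rate, seed=0):
    """Average discrepancy over random (sequence, mutated copy) pairs."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_pairs):
        a = "".join(rng.choice("ACGT") for _ in range(length))
        diffs.append(discrepancy(a, mutate(a, rate, rng), k_max, k))
    return sum(diffs) / len(diffs), diffs


mean, diffs = mean_discrepancy(n_pairs=20, length=2000, k_max=21, k=11, rate=0.02)
print(mean, min(diffs), max(diffs))
```

For the real experiment the random strings would be replaced by genome pairs, and the exact sets by whatever the old and new CMash runs report.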

dkoslicki · 2020-03-19T02:25:36Z

@ShaopengLiu1 #19 should be addressed now. I am not closing #19 or #2 until we have a better testing environment spun up, as all the tests I have run so far were local (far from optimal).

dkoslicki · 2020-03-27T18:38:57Z

@ShaopengLiu1 just a note: I added a class that will now compute the absolute ground truth containment indices. Recall that the last column of the StreamingQueryDNADatabase.py output is still an estimate of the containment index (just using un-truncated k-mers). The class to compute the ground truth is at /CMash/CMash/GroundTruth.py. You can see in this comment how the results produced by tests/script_tests/./run_small_tests.sh correspond quite nicely with the ground truth values.

If you would like to use this ground truth class, I strongly suggest you use my personal server (ping me if you forgot the IP address and login info), as it takes quite a bit of time and memory to brute-force calculate all the k-mers and their reverse complements.

To interact with the class, you can do something like:

import CMash.GroundTruth as G
training_database_file = "<snip>/TrainingDatabase.h5"
query_file = "<snip>/taxid_1192839_4_genomic.fna.gz"
# this step will take a long time if the k_sizes are realistically large
g = G.TrueContainment(training_database_file=training_database_file, k_sizes="4-6-1")
# pass the query_file defined above
df = g.return_containment_data_frame(query_file=query_file, location_of_thresh=-1, coverage_threshold=.1)

Note that the query_file need not be in the TrainingDatabase.h5 (as its k-mers will still be enumerated if it's not in the training database).
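For intuition about what "brute-force" means here, a toy version of exact containment over both strands might look like the following. It is not the code in GroundTruth.py: the function names are invented, the direction of the ratio (fraction of the first argument's k-mers found in the second) is an assumption, and it only handles uppercase A/C/G/T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers_both_strands(seq, k):
    """All k-mers of seq together with their reverse complements."""
    forward = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return forward | {revcomp(x) for x in forward}


def true_containment(a, b, k):
    """Exact fraction of a's k-mers (both strands) that occur in b."""
    a_set = kmers_both_strands(a, k)
    b_set = kmers_both_strands(b, k)
    return len(a_set & b_set) / len(a_set) if a_set else 0.0


s = "ACGTTG"
k = 3
# A length-k prefix of the reverse complement is the reverse
# complement of the original's length-k suffix.
assert revcomp(s)[:k] == revcomp(s[-k:])
print(true_containment(s, revcomp(s), k))  # 1.0: same k-mer set on both strands
```

Holding every k-mer and its reverse complement in memory at once is what makes this approach expensive at realistic genome sizes and k values.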

dkoslicki mentioned this issue Feb 7, 2020

Improved classification time with KMC #15

Open

dkoslicki added a commit that referenced this issue Mar 18, 2020

start working on diagnosing #20 which turns out to involve #2 as well…

26a9561

…. Initial commit to add files. Scratch work, mainly for my local efforts as directories are hard coded.

dkoslicki added a commit that referenced this issue Mar 18, 2020

add ground truth containment indicies for #2 and #20

3080025

dkoslicki added a commit that referenced this issue Mar 18, 2020

might have found the issue with the bloom filter. committing to track…

0ca50b7

… changes. #2 and #20

dkoslicki added a commit that referenced this issue Mar 18, 2020

almost, but for larger ranges, this is still an issue. #2 and #20

cccd5bb

dkoslicki added a commit that referenced this issue Mar 18, 2020

problem really is with the bloom filter. Remove it everywhere, and pr…

bf01bf3

…oblems in #2 and #20 go away

dkoslicki added a commit that referenced this issue Mar 18, 2020

yup yup, remove the BF, and while the performance is terrible wrt tim…

1ccf64f

…e, the results are accurate. #2 and #20

dkoslicki added a commit that referenced this issue Mar 18, 2020

ok, well, at least it works, but only if you rip out the bloom filter…

4f6378f

… altogether, and don't break if you don't see a TST match for a shorter kmer size. #2 and #20. Calling it a night

dkoslicki added a commit that referenced this issue Mar 18, 2020

ok, one last try: prefixes of reverse complements are suffixes of the…

e6a1a3e

… original (revcomped), realized this is actually issue #19 #20 and #2. Including break for now in StreamingQuery... line 261

dkoslicki added a commit that referenced this issue Mar 18, 2020

leaving it as pass for now in StreamingQuery... line 261-262. This ap…

0702d2a

…pears to resolve the issue, but really need to think better about not adding *all the things* to the TST and BF. #2 #20 #19

dkoslicki added a commit that referenced this issue Mar 18, 2020

forgot to increase BF size. #2 #20 #19

a49980f

dkoslicki added a commit that referenced this issue Mar 18, 2020

added a last couple of notes for my future reference #2 #20 #19

f7bee92

dkoslicki added a commit that referenced this issue Mar 18, 2020

commit output with no BF for checking. #2 #20 #19

0161b06

dkoslicki added a commit that referenced this issue Mar 18, 2020

ok, bug sufficiently squashed for now, just code cleanup remains. See #2

b7d4cb7

#20 #19

dkoslicki added a commit that referenced this issue Mar 18, 2020

code cleanup, prepping for incorporation into master #2 #20 #19

cf64b7a

dkoslicki added a commit that referenced this issue Mar 18, 2020

create new branch that will nuke all the non-essential things to merg…

4c53757

…e to master. #2 #19 #20

dkoslicki added a commit that referenced this issue Mar 18, 2020

remove all unecessary files before merge to master #2 #19 #20

13f3fd6

dkoslicki added a commit that referenced this issue Mar 19, 2020

Merge branch 'kmer_range_issue_to_merge' addresses #2 #20 and #19

c8c91f4

dkoslicki added a commit that referenced this issue Mar 19, 2020

make sure to add modifications to TST to the StreamingQueryDNADatabas…

115dec3

…e.py as well #2 #20 #19

dkoslicki added a commit that referenced this issue Mar 20, 2020

try greater than or equal to for #20

a4e1850

dkoslicki added a commit that referenced this issue Mar 20, 2020

change filter to >= instead of strict > for the -c option. Relevant…

319961b

… to #20 and same change as in a4e1850 just on a different branch

dkoslicki assigned ShaopengLiu1 Mar 23, 2020

dkoslicki added the priority label Mar 23, 2020

ShaopengLiu1 added a commit that referenced this issue May 2, 2020

Merge remote-tracking branch 'origin/master' into shaopeng

13ba97a

Add the ground_truth.py for #20

ShaopengLiu1 added a commit that referenced this issue May 2, 2020

Modify threads usage in Groundtruth.py to limit the MEM usage in Lab …

9c1e9a6

…server. Relavant to #20 testing

ShaopengLiu1 added a commit that referenced this issue May 10, 2020

temp code for rep1: compare 3 CI #20

702dc6f

ShaopengLiu1 added a commit that referenced this issue May 30, 2020

#20, pick random microbiome data for low/mid/high CI range

07082a3
