Ground truth computation is too slow for realistic data set sizes #30

Closed
3 tasks done
dkoslicki opened this issue Apr 6, 2020 · 1 comment

@dkoslicki
Owner

dkoslicki commented Apr 6, 2020

The current CMash/GroundTruth.py uses a naive Python set calculation, which is much too slow to produce results in a reasonable amount of time on larger data sets. It will need to switch to running KMC repeatedly to calculate the actual ground truth.

Work being done on groundtruth branch.

  • Use KMC to count the k-mers of all training database genomes for all k-mer sizes (and return the total number of distinct k-mers)
  • Use KMC to count the k-mers of the query genome for all k-mer sizes
  • Calculate the containment index by intersecting the two with kmc_tools intersect and dividing by the total number of distinct training database k-mers
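
For reference, the three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in CMash/GroundTruth.py: all function names are hypothetical, the pure-Python part is the slow set-based calculation that is only practical on tiny inputs, and the KMC command lines are assumptions about typical `kmc` and `kmc_tools simple ... intersect` usage that should be checked against the installed KMC version. The sketch uses canonical k-mers (the smaller of a k-mer and its reverse complement), on the assumption that this matches how KMC is run.

```python
from typing import List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq: str, k: int) -> Set[str]:
    """Distinct canonical k-mers of seq (naive, small inputs only)."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def containment_index(train_seq: str, query_seq: str, k: int) -> float:
    """|train ∩ query| divided by the number of distinct training k-mers."""
    train = kmer_set(train_seq, k)
    query = kmer_set(query_seq, k)
    return len(train & query) / len(train)


def kmc_count_cmd(fasta: str, out_db: str, k: int, tmp_dir: str = ".") -> List[str]:
    """Hypothetical helper: argument list for counting k-mers with KMC.

    Assumed flags: -k (k-mer size), -fm (multi-FASTA input),
    -ci1 (keep k-mers seen at least once), -r (RAM-only mode).
    """
    return ["kmc", f"-k{k}", "-fm", "-ci1", "-r", fasta, out_db, tmp_dir]


def kmc_intersect_cmd(db1: str, db2: str, out_db: str) -> List[str]:
    """Hypothetical helper: argument list for intersecting two KMC databases."""
    return ["kmc_tools", "simple", db1, db2, "intersect", out_db]
```

The command builders only return argument lists; they could be passed to `subprocess.run(cmd, check=True)` once per k-mer size, with the pure-Python function used as a cross-check on small genomes.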
@dkoslicki dkoslicki self-assigned this Apr 6, 2020
dkoslicki added a commit that referenced this issue Apr 6, 2020
dkoslicki added a commit that referenced this issue Apr 6, 2020
dkoslicki added a commit that referenced this issue Apr 6, 2020
…allelization since it doesn't give its temp files unique names. Got around this by using RAM only mode. Added ability to compute all training kmers via KMC. #30
dkoslicki added a commit that referenced this issue Apr 7, 2020
…ces observed between it and the pure python version. Investigating now #30
dkoslicki added a commit that referenced this issue Apr 7, 2020
@dkoslicki
Owner Author

Merged into master and completed; closing. @ShaopengLiu1, this should be ready for you to use.
