Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Question] Other uses of k-mer sketches created by skani? #23

Closed
jolespin opened this issue Dec 14, 2023 · 5 comments
Closed

[Question] Other uses of k-mer sketches created by skani? #23

jolespin opened this issue Dec 14, 2023 · 5 comments

Comments

@jolespin
Copy link

I'm wondering if there are other uses for the k-mer sketches for each genome. For example, let's say you had 1000 genomes and built an index/sketch for each one. Do you know of any methods to determine the relative abundance of a sample by aligning to these sketches instead of the full genomes? I may be thinking of the problem incorrectly but that's why I thought I would reach out to ask if this was a possibility using skani sketches or if there are any tools you know of that can do this with custom databases?

@bluenote-1577
Copy link
Owner

Hi @jolespin,

Great question: what you're describing is almost exactly what kraken does.

Kraken sketches each genome into k-mers (they don't call it sketching, but it's the same idea) and checks each read from a sample against their database of k-mers. They combine all of their "sketches" into one large database, though.

The key difference between "aligning" samples (e.g. using kraken) versus genomes (e.g. using skani) is that reads are much shorter than genomes. Thus, different parameters/methods are necessary. Of course, that's why the outputs differ between the algorithms as well.

To answer your question: skani can not be used for abundance calculations for short reads, unfortunately, due to how it's designed. It may be able to be hacked to do abundances for long-reads... but that's not something I've explored yet.

I recently designed a method called sylph for profiling metagenomic samples and giving abundances. It also uses the "sketching" idea for fast abundance profiling, for arbitrary, custom databases of fasta files, which seems important to you. See the associated manuscript in the README for info.

Thanks,

Jim

@jolespin
Copy link
Author

jolespin commented Dec 15, 2023

Kraken sketches each genome into k-mers (they don't call it sketching, but it's the same idea) and checks each read from a sample against their database of k-mers.

Thank you. I forgot about Kraken ever since I started testing out Metaphlan4. I forget that these tools have different ways in which they profile.

To answer your question: skani can not be used for abundance calculations for short reads, unfortunately, due to how it's designed. It may be able to be hacked to do abundances for long-reads... but that's not something I've explored yet.

Good to know when I get more long read datasets in the future.

I recently designed a method called sylph for profiling metagenomic samples and giving abundances. It also uses the "sketching" idea for fast abundance profiling, for arbitrary, custom databases of fasta files, which seems important to you. See the associated manuscript in the README for info.

This is great and definitely along the lines of what I'm looking for. And yes, definitely interested in building custom databases. It would be nice to have a gut, oral, soill, or marine-specific database built from MAGs.

A few follow up questions:

  1. Are these sketches the same sketch format that is used in skani?

  2. So in this usage: sylph sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d my_sketches -t 50 the output is a directory of sketches for each sample (i.e., fastq pair).

In this usage, sylph sketch -g genomes/*.fa.gz -o database -t 50 the database is a single file that contains all of the sketches for the genomes.

2a. Can you create a database from existing genome sketches? Or should you always create a sketch database from genomes? The reason why I'm asking is I'm wondering if it will be useful to create all the sketches you may need then mix and match when creating curated databases.

2b. Just so I'm understanding correctly, this is the recommended usage if you're operating on a per sample basis:
2b.1) Create a sketch database for each sample as a .sylsp using both forward and reverse read pairs
2b.2) Create a sketch database that includes all of the genomes into a .syldb file
2b.3) Use sylph profile for the .sylsp against the .syldb

  1. I noticed there is a Contig_name field. Is the operation only on the first contig or the entire fasta/assembly (assuming the former just want to be sure)?

  2. Let's say you have both prokaryotes and viruses in a dataset. Is it advised to build a separate .syldb database for prokaryotes and viruses or merged into one? If the former, if I adjust the sub-sampling rate will that give me a bias for either prokaryotes or viruses if I merge abundance tables?

I found the answer to (4) here: https://github.com/bluenote-1577/sylph/wiki/sylph-cookbook#profiling-small-genomes-such-as-viruses

  1. Just tried a run. In the current implementation possible to include the sample name if only a single sample is included or a manifest (i.e., [id_sample][r1][r2]) if multiple samples are included?
(VEBA-profile_env) [jespinoz@exp-15-35 TestVEBA]$ sylph sketch -1 veba_output/preprocess/S1/output/cleaned_1.fastq.gz -2 veba_output/preprocess/S1/output/cleaned_2.fastq.gz -d tmp/
2023-12-15T04:26:16.567Z INFO  [sylph::sketch] Sketching paired sequences...
2023-12-15T04:26:19.224Z INFO  [sylph::sketch] Sketching tmp/cleaned_1.fastq.gz.paired.sylsp complete.
2023-12-15T04:26:19.234Z INFO  [sylph::sketch] Finished.

Since the output .sylsp uses the basename, I'm unable to run multiple samples at the same time since all the files have the same basename and the sample name is encoded in the directory (e.g., S1) in this case. Not a huge deal because I can run them individually for each sample (or symlink if I need to run in batch) but was wondering if this is currently possible or will be included in future versions?

Apologies for all the questions but this and skani came at a very convenient time in my research.

Thanks again for your help and creating these incredible tools.

@bluenote-1577
Copy link
Owner

This is great and definitely along the lines of what I'm looking for. And yes, definitely interested in building custom databases. It would be nice to have a gut, oral, soill, or marine-specific database built from MAGs.

Great to know it's of use. Check https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases for some pre-built databases.

Answering questions:

1: No, the sketch format is different and not interchangeable.

2: Yes, when sketching reads, the output is a .sylsp file for each (pair) of reads. Whereas for genomes, all genomes get combined into one database sketch.

2a: You can't combine sketches to create a database, but you can use multiple databases
together for profiling. For example, if you do sylph profile db1.syldb db2.syldb sample.sylsp then you're profiling against db1.syldb + db2.syldb combined, not all-to-all

2b: Sounds correct. Make sure you use -1 and -2 options for sketching paired-end reads.

3: The operation is on the whole genome unless the -i option is used during sketching. The -i options treats each contig in a file as a genome.

6: You're correct in that this is a problem. This is a very reasonable feature request and a problem that I actually ran into as well. I'll definitely incorporate an option for specifying sample names in the next version.

Thanks for the questions. Let me know if you found any documentation insufficient or confusing. Or, if you find any command line usage confusing. I'm very interested in user feedback.

I may repost this issue to the sylph repo or add a FAQ sometime in the future with your questions if you don't mind!

@jolespin
Copy link
Author

I've been able to build a pretty good wrapper around this and will include in the version 2 manuscript I'm writing right now as the profile-taxonomy module.

The documentation and your cookbooks are really helpful, I feel like you covered pretty much all of my questions.

A couple of features that aren't essential but I feel could make this software more scalable is the following:

  1. A manifest file for the reads as you mentioned earlier
  2. An option for providing databases as a list of filepaths similar to the genomes.
  3. The ability to create individual sketches and merge them
  4. The ability to update a database given new genomes
  5. The ability to provide some summary of what's in a database similar to diamon'd dbinfo. For example, let's say there was a database someone sent me and I wanted to see what was inside, I could run something like sylph inspect genomes_database.syldb.
  6. It would also be useful if you could also provide a 2 column table for the genomes where the first column was the unique genome identifier and the second column was the path to assembly fasta. That way when the output files are being generated, it would be wrt to the cleaned up genome name and will be more streamline to use. That said, this is super easy to do post hoc so not much of an inconvenience.

The reason for 3,4,5 is the scenario where sylph is used to query hundreds of thousands of genomes rapidly and these databases would potentially be growing as time goes on with large scale project like TARA or HMP. Instead of recompiling the entire genome sketch database, it would be really useful to have these precomputed and then merge them to build a massive database. Either that or an option to add or remove an existing database. The ability to provide multiple databases is very useful but if one did this for individual genomes, there would be some system issues due to a really really long command line argument.

Anyways, feel free to take these with a grain of salt as the software is already great. I saw that it's up on biorxiv, good luck on your publication! I'll definitely being sharing this tool within my circle.

@bluenote-1577
Copy link
Owner

Thanks for the feedback! I'll look to incorporate some of these ideas in future releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants