ENH: action to build custom Diamond database #103

Sann5 · 2023-11-27T17:30:51Z

Closes #102 (ENH: action to build custom Diamond database)

Action to create a DIAMOND formatted reference database from a FASTA input file.
Why is this useful? Until now the only functionality supported in Qiime is to download the complete Diamond DB. It might be useful to allow users to create their own reference databases for downstream use.

Run it locally

Assuming you have a working environment, run the following.

First, clone the repo and checkout the PR branch:

git clone git@github.com:bokulich-lab/q2-moshpit.git
gh pr checkout 103

Now let's get you some data (the one from the Diamond tutorial) so you can run the action.

cd wherever_you_want_to_download_the_data_to

wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
mkdir sequences
awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $0}}' astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa > sequences/protein-sequences.fasta
qiime tools import --input-path sequences --output-path sequences.qza --type FeatureData\[ProteinSequence\]

Now we are ready to run the action locally.

qiime moshpit build-diamond-db --i-sequences sequences.qza --o-diamond-db custome_diamond.qza --verbose

Alternatively, you can run the tests for this action (nothing will be downloaded).

cd <you_local_path_to>/q2-moshpit/q2_moshpit/eggnog/tests
pytest -vv -W "ignore" -k build_diamond_db

q2_moshpit/plugin_setup.py

codecov · 2023-11-27T17:54:58Z

Codecov Report

Attention: 4 lines in your changes are missing coverage. Please review.

Comparison is base (0668991) 97.74% compared to head (b337326) 97.70%.
Report is 9 commits behind head on main.

❗ Current head b337326 differs from pull request most recent head ab14978. Consider uploading reports for the commit ab14978 to get more accurate results

Files	Patch %	Lines
q2_moshpit/eggnog/_utils.py	40.00%	3 Missing ⚠️
q2_moshpit/_utils.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
- Coverage   97.74%   97.70%   -0.04%     
==========================================
  Files          35       39       +4     
  Lines        2172     2267      +95     
==========================================
+ Hits         2123     2215      +92     
- Misses         49       52       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sann5 · 2023-12-01T15:32:42Z

I added taxon<> files as optional inputs in the last commits.
Related PR and issues regarding these optional inputs:
- ENH: Semantic Type for NCBI protein-taxonomy data q2-types-genomics#66
- ENH: Semantic Type for NCBI protein-taxonomy data q2-types-genomics#67 (Blocked until merged)

q2_moshpit/_examples.py

misialq

Hey @Sann5, here's the first round of code comments. I have not tried running it out yet though - did you get a chance to see what's up with those taxonomy annotations? Do they really get added to the final output?

q2_moshpit/citations.bib

q2_moshpit/eggnog/_method.py

q2_moshpit/eggnog/tests/test_method.py

q2_moshpit/plugin_setup.py

q2_moshpit/eggnog/_method.py

Sann5 · 2023-12-20T17:12:46Z

❗️❗️❗️❗️❗️❗️

So I checked whether the output diamond database was different when building it with the taxonomy information.... It turns out it's not.
More precisely the sequence headers and the sequence themselves are the same when calling diamond getseq diamond_db_with_taxainfo.dmdn than when calling diamond getseq diamond_db.dmdn.
It could be that the database still contains some taxa information somewhere and that the diamond getseq method is not accessing/displaying it.
I checked the diamond methods (e.g. diamond getseq, diamond makedb) available and none seem to be able to provide this information (whether a diamond database contains taxonomy info).
It could be that the sequence header needs to contain an accession ID to be mapped to the taxonomy info. So this is what I will try next.

Sann5 · 2023-12-21T09:27:29Z

I was right. The sequence headers (or IDs in Diamond terminology) are parsed and if found then the taxonomy procedure is triggered. In other words, unless the headers have access numbers the taxonomy thing won't work.
Taxonomy info is still not visible when retrieving the sequences via diamond getseq, but the output from running the program suggests that the procedure is, in fact, taking place. This is the output for a set of sequences that contain accession numbers in the sequence headers.

qiime moshpit build-custom-diamond-db --i-seqs sequences.qza --i-taxonomy ncbi_tax_data.qza --o-diamond-db diamond_db_w_taxa.qza --verbose                                                                          
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-jpf19f91/ref_db.dmnd --threads 1 --file-buffer-size 67108864 --taxonmap /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/prot.accession2taxid.gz --taxonnodes /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/nodes.dmp --taxonnames /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/names.dmp

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 4, letters = 843, average length = 210
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Writing accessions...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  4
   |xxx suffix  4
   .xxx suffix  0
  :PDB= suffix  0

Loading taxonomy names...  [0.959s]
Loaded taxonomy names for 2544505 taxon ids.
Loading taxonomy mapping file...  [273.777s]
Joining accession mapping...  [1.121s]
Writing taxon id list...  [0s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  1229668180
  :PDB= suffix  0

Building taxonomy nodes...  [0.019s]
3101783 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           796315
superkingdom      4
kingdom           13
subkingdom        1
superphylum       1
phylum            309
subphylum         31
superclass        6
class             520
subclass          169
infraclass        19
cohort            5
subcohort         3
superorder        57
order             1882
suborder          374
infraorder        135
parvorder         26
superfamily       896
family            10281
subfamily         3254
tribe             2357
subtribe          586
genus             108123
subgenus          1811
section           534
subsection        41
series            9
species group     360
species subgroup  134
species           2083492
subspecies        28923
varietas          9819
forma             682
strain            46154
biotype           17
clade             957
forma specialis   750
genotype          22
isolate           1305
morph             11
pathogroup        5
serogroup         153
serotype          1237
subvariety        0

Closing the input file...  [0s]
Closing the database file...  [0.002s]

Database sequences                   4
Database letters                     843
Accessions in database               4
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0
Database hash                        55bd90f89749443d4b0fb107c33a615f
Total time                           277s

And this is the output for a set of sequences that do not contain accession numbers in the sequence headers.

qiime moshpit build-custom-diamond-db --i-seqs simple_ids.qza --i-taxonomy ncbi_tax_data.qza --o-diamond-db diamond_db_w_taxa_simpleids.qza --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/100f6a9f-c9a6-4dfb-8440-a9c310d1d4f2/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-n6apnk1f/ref_db.dmnd --threads 1 --file-buffer-size 67108864 --taxonmap /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/prot.accession2taxid.gz --taxonnodes /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/nodes.dmp --taxonnames /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/5cae2374-57c9-4095-bf34-90b3824f44d7/data/names.dmp

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/100f6a9f-c9a6-4dfb-8440-a9c310d1d4f2/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 3, letters = 303, average length = 101
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Writing accessions...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]

Accession parsing rules triggered for database seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  0
  :PDB= suffix  0

Loading taxonomy names...  [0.978s]
Loaded taxonomy names for 2544505 taxon ids.
Loading taxonomy mapping file...  [267.32s]
Joining accession mapping...  [134.995s]
Writing taxon id list...  [0s]

Accession parsing rules triggered for mapping file seqids (use --no-parse-seqids to disable):
 UniRef prefix  0
gi|xxx| prefix  0
   xxx| prefix  0
   |xxx suffix  0
   .xxx suffix  1229668180
  :PDB= suffix  0

Building taxonomy nodes...  [0.01s]
3101783 taxonomy nodes processed.
Number of nodes assigned to rank:
no rank           796315
superkingdom      4
kingdom           13
subkingdom        1
superphylum       1
phylum            309
subphylum         31
superclass        6
class             520
subclass          169
infraclass        19
cohort            5
subcohort         3
superorder        57
order             1882
suborder          374
infraorder        135
parvorder         26
superfamily       896
family            10281
subfamily         3254
tribe             2357
subtribe          586
genus             108123
subgenus          1811
section           534
subsection        41
series            9
species group     360
species subgroup  134
species           2083492
subspecies        28923
varietas          9819
forma             682
strain            46154
biotype           17
clade             957
forma specialis   750
genotype          22
isolate           1305
morph             11
pathogroup        5
serogroup         153
serotype          1237
subvariety        0

Closing the input file...  [0s]
Closing the database file...  [0.001s]

Database sequences                   3
Database letters                     303
Accessions in database               3
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0
Database hash                        78c88ee380db113880e4d2ecac2eb3d6
Total time                           404s

Just for reference the output looks very different when building the DB without taxa info (and its much faster).


qiime moshpit build-custom-diamond-db --i-seqs sequences.qza --o-diamond-db diamond_db.qza --verbose
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: diamond makedb --verbose --in /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta --db /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/q2-DiamondDatabaseDirFmt-81r4s79z/ref_db.dmnd --threads 1 --file-buffer-size 67108864

diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

#CPU threads: 1
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Database input file: /var/folders/6p/vfbtsnkn1x738nqkqzcprf5w0000gn/T/qiime2/santiago/data/90568ca3-e55d-481f-9826-97eb787be9b4/data/protein-sequences.fasta
Opening the database file...  [0s]
Loading sequences... Sequences = 4, letters = 843, average length = 210
 [0s]
Masking sequences...  [0s]
Writing sequences...  [0s]
Hashing sequences...  [0s]
Loading sequences...  [0s]
Writing trailer...  [0s]
Closing the input file...  [0s]
Closing the database file...  [0s]

Database sequences  4
  Database letters  843
     Database hash  55bd90f89749443d4b0fb107c33a615f
        Total time  0.001000s

New files for dbs New files for dbs in eggnog. Further adjustments.

q2_moshpit/plugin_setup.py

misialq · 2024-01-03T12:55:06Z

Hey @Sann5, thanks for testing out the different options for including the taxonomy annotations. I'm only wondering now, what that really brings if you cannot see any difference when running the getseq command... Maybe we should use both of those databases (with and without taxonomy annotations) to perform some small functional annotation task to see whether there's any difference there?

misialq · 2024-01-03T13:00:29Z

Also, when you look carefully at the output of both commands you will notice that in neither of those the tax IDs were mapped to your query sequences, which may explain why the getseq command returns the same for both:
case 1:

Database sequences                   4
Database letters                     843
Accessions in database               4
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0

case 2:

Database sequences                   3
Database letters                     303
Accessions in database               3
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  0
Database sequences mapped to taxid   0

What do your database headers look like in both scenarios?

Co-authored-by: Michal Ziemski <mziemski@ethz.ch>

Sann5 · 2024-01-11T15:52:36Z

Hey @misialq :). So... I tried with sequences whose accession IDs are in fact in the database as we discussed.

✅ The good news is that now the output shows that indeed the sequences were mapped to the NCBI database:

Database sequences                   19
Database letters                     9707
Accessions in database               19
Entries in accession to taxid file   1229668180
Database accessions mapped to taxid  19
Database sequences mapped to taxid   19
Database hash                        1914c45d75c4d85b4dbfab7672dff245
Total time                           285s

❌ The bad news is that the diamond getseq your_custome_diamond_DB.dmnd command still returns the sequences with their original headers (they are not modified by the fact of constructing the database with the taxonomy information). As I mentioned before I think that the taxonomy info might still be there even if the getseq command does not reflect it.
❓ What should I do? Perform some a functional annotation task to see whether there's any difference (as you suggested).

misialq · 2024-01-12T13:36:46Z

Hey @Sann5, thanks for investigating! Finally, that makes sense now 😅 What I would suggest is that we just carry on with this PR and in parallel maybe you could test the annotation (as you suggested) when you're working on the tutorials to see what the difference is (if any). How does that sound?

misialq

Good to go! 🚀

Sann5 · 2024-01-12T19:10:47Z

Hey @Sann5, thanks for investigating! Finally, that makes sense now 😅 What I would suggest is that we just carry on with this PR and in parallel maybe you could test the annotation (as you suggested) when you're working on the tutorials to see what the difference is (if any). How does that sound?

Sounds good @misialq, I will do so :) Again I cannot merge this one because the test doesn't pass 😅. Can you do it?

Sann5 and others added 4 commits November 27, 2023 17:00

ENH: Build Diamond DB from fasta action

36921f6

Added parameter compatibility

d1315ee

Merge branch 'main' into build_diamond_db

f056d08

Linting corrections

b337326

Sann5 requested a review from misialq November 27, 2023 17:40

Sann5 marked this pull request as draft November 27, 2023 17:40

Sann5 self-assigned this Nov 27, 2023

Sann5 commented Nov 27, 2023

View reviewed changes

q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved

Sann5 added 2 commits December 1, 2023 11:44

Added parameters and optional input to build_diamond_db

1cfa0be

Merge branch 'main' into build_diamond_db

82a3afc

Sann5 mentioned this pull request Dec 1, 2023

ENH: Semantic Type for NCBI protein-taxonomy data bokulich-lab/q2-types-genomics#66

Closed

Sann5 added 2 commits December 1, 2023 14:46

Added example and citations for build_diamond_db

16e2ac8

Added compatibility with optional input

1dafa50

Sann5 added stat:blocked enhancement New feature or request labels Dec 1, 2023

Sann5 commented Dec 1, 2023

View reviewed changes

q2_moshpit/_examples.py Outdated Show resolved Hide resolved

Sann5 added 3 commits December 5, 2023 11:37

Taxonomy typed run enabled

c8dc737

Add test for run with taxon information

45f04e1

Update name action name to build_custom_diamond_db

ba763f1

Sann5 marked this pull request as ready for review December 5, 2023 13:41

misialq requested changes Dec 6, 2023

View reviewed changes

Sann5 added 7 commits December 11, 2023 10:37

Refactor TaxonomyNCBI to NCBITaxonomy

8a7ca00

remove file entry from citations

118b92e

remove pdb from eggnog._methods

3e78688

eggnog._method.py: moving relative imports to end of section

59a066f

eggnog.tests.test_method.py: reorganize imports

718fc97

refactor sequences to seqs

4b02097

Refactoring seqs and method name. further updates

bb248a7

Sann5 and others added 7 commits December 15, 2023 10:11

plugin_setup.py: update to taxonomy input_description

4d09730

Another update to description on in the function registration

0d61121

Set default threads to 1

0d9c678

Another update to parameter description in the function registration

10455b3

extend command inside

77bde67

write in/out paths directly in comand

19a0527

Merge branch 'main' into build_diamond_db

134e49c

Sann5 requested a review from misialq December 15, 2023 09:48

Move the db related code to its own files.

6c7a2e2

New files for dbs New files for dbs in eggnog. Further adjustments.

Sann5 closed this Dec 22, 2023

Sann5 deleted the build_diamond_db branch December 22, 2023 09:45

Sann5 restored the build_diamond_db branch December 22, 2023 09:49

Sann5 reopened this Dec 22, 2023

Adjust paths from imports since change of db related code

3103355

misialq reviewed Jan 3, 2024

View reviewed changes

q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved

q2_moshpit/plugin_setup.py Outdated Show resolved Hide resolved

misialq mentioned this pull request Jan 11, 2024

ENH: action to build custom Diamond database #102

Closed

Sann5 and others added 3 commits January 11, 2024 15:22

Merge branch 'main' into build_diamond_db

69156c6

set --log always to true

cf83dba

Update q2_moshpit/plugin_setup.py

ab14978

Co-authored-by: Michal Ziemski <mziemski@ethz.ch>

Sann5 requested a review from misialq January 11, 2024 15:52

misialq approved these changes Jan 12, 2024

View reviewed changes

misialq merged commit 492dc37 into bokulich-lab:main Jan 15, 2024
3 of 5 checks passed

Sann5 deleted the build_diamond_db branch February 27, 2024 12:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: action to build custom Diamond database #103

ENH: action to build custom Diamond database #103

Sann5 commented Nov 27, 2023 •

edited

Loading

codecov bot commented Nov 27, 2023 •

edited

Loading

Sann5 commented Dec 1, 2023

misialq left a comment

Sann5 commented Dec 20, 2023 •

edited

Loading

Sann5 commented Dec 21, 2023 •

edited

Loading

misialq commented Jan 3, 2024

misialq commented Jan 3, 2024

Sann5 commented Jan 11, 2024

misialq commented Jan 12, 2024

misialq left a comment

Sann5 commented Jan 12, 2024 •

edited

Loading

ENH: action to build custom Diamond database #103

ENH: action to build custom Diamond database #103

Conversation

Sann5 commented Nov 27, 2023 • edited Loading

Closes #102 (ENH: action to build custom Diamond database)

Run it locally

codecov bot commented Nov 27, 2023 • edited Loading

Codecov Report

Sann5 commented Dec 1, 2023

misialq left a comment

Choose a reason for hiding this comment

Sann5 commented Dec 20, 2023 • edited Loading

Sann5 commented Dec 21, 2023 • edited Loading

misialq commented Jan 3, 2024

misialq commented Jan 3, 2024

Sann5 commented Jan 11, 2024

misialq commented Jan 12, 2024

misialq left a comment

Choose a reason for hiding this comment

Sann5 commented Jan 12, 2024 • edited Loading

Sann5 commented Nov 27, 2023 •

edited

Loading

codecov bot commented Nov 27, 2023 •

edited

Loading

Sann5 commented Dec 20, 2023 •

edited

Loading

Sann5 commented Dec 21, 2023 •

edited

Loading

Sann5 commented Jan 12, 2024 •

edited

Loading