mmseqs2 prefilter failed: No k-mer could be extracted for the database genomad_db/genomad_db #25

haleyhallowell · 2023-06-30T16:49:41Z

Hello! I am trying to use the geNomad annotate module to annotate a .fna file of MEGAHIT-assembled contigs. After downloading the database and adding a unique identifier to each .fna header (because they all started with k127), I ran the following command:

genomad annotate final_vOTUs_numbered.fna ./genomad_output ./genomad_db

I get this error directly from genomad:

Traceback (most recent call last):
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 137, in run_mmseqs2
subprocess.run(command, stdout=fout, stderr=fout, check=True)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'search', PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db'), PosixPath('genomad_db/genomad_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp'), '--threads', '48', '-s', '4.2', '--cov-mode', '1', '-c', '0.2', '-e', '0.001', '--split', '0', '--split-mode', '0', '--max-seqs', '1000000', '--min-ungapped-score', '20', '--max-rejected', '225']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/hhallow1/.conda/envs/genomad/bin/genomad", line 8, in <module>
sys.exit(cli())
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main
rv = super().main(*args, standalone_mode=False, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate
genomad.annotate.main(
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 202, in main
mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits)
File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 140, in run_mmseqs2
raise Exception(f"'{command_str}' failed.") from e
Exception: 'mmseqs search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225' failed.

Here is the output from mmseqs2:

Converting sequences
[=====
Time for merging to query_db_h: 0h 0m 0s 79ms
Time for merging to query_db: 0h 0m 0s 31ms
Database type: Aminoacid
Time for processing: 0h 0m 1s 380ms
search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225

MMseqs Version: 13.45111
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 2
Alignment mode 0
Allow wrapped scoring false
E-value threshold 0.001
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0.2
Coverage mode 1
Max sequence length 65535
Compositional bias 1
Max reject 225
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Threads 48
Compressed 0
Verbosity 3
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 4.2
k-mer length 5
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 1000000
Split database 0
Split mode 0
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 20
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.1
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false

prefilter genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp/4444936417411739143/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4.2 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1000000 --split 0 --split-mode 0 --split-memory-limit 0 -c 0.2 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 20 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 48 --compressed 0 -v 3

Query database size: 56046 type: Aminoacid
Estimated memory consumption: 1G
Target database size: 227897 type: Profile
Process prefiltering step 1 of 1

Index table k-mer threshold: 104 at k-mer size 5
Index table: counting k-mers
[=================================================================] 227.90K 10s 479ms
Index table: Masked residues: 0
No k-mer could be extracted for the database genomad_db/genomad_db.
Maybe the sequences length is less than 14 residues.
Error: Prefilter died

The .fna file, the genomad_output directory, and the genomad_db directory are all in the same directory, and I am running the command from that directory as well. Any ideas on how to fix this? Thanks!

apcamargo · 2023-06-30T17:07:48Z

Hi @haleyhallowell. I don't think I've seen this error before. Do you still get it if you remove short sequences? Could you share the input with me?

haleyhallowell · 2023-06-30T17:53:43Z

Hey! Sure, I can share it with you; it is attached. I'm currently waiting on another run to go through the HPC queue. I noticed that when I used pip install genomad I had to install prodigal-gv and mmseqs2 manually, so I made a new environment using a conda install. I'll update you once that goes through!
final_vOTUs_numbered.txt.zip

Not sure about the short sequences; nothing should be that short, as I filter out anything <2000 bp.
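
For anyone who wants to double-check the length filter before rerunning, here is a minimal sketch that lists FASTA records shorter than a cutoff. It uses only the Python standard library; the function names `read_fasta` and `short_records` are hypothetical helpers, not part of geNomad or MMseqs2, and the 2000 bp default simply mirrors the cutoff mentioned above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def short_records(lines, min_length=2000):
    """Return (header, length) for every record shorter than min_length."""
    return [(h, len(s)) for h, s in read_fasta(lines) if len(s) < min_length]


if __name__ == "__main__":
    # Tiny in-memory example; a real run would pass an open file handle instead.
    demo = [">contig_1", "ACGT", "AC", ">contig_2", "A" * 10]
    for header, length in short_records(demo, min_length=8):
        print(f"{header}\t{length}")
```

An empty result from `short_records(open("final_vOTUs_numbered.fna"))` would suggest that input length is not the cause of the error.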

haleyhallowell · 2023-07-01T18:09:17Z

Hi! Just wanted to let you know I got it working. It seems the odd pip install I mentioned above was what caused this error; installing with conda the second time got it working. Thanks!

apcamargo · 2023-07-04T14:48:53Z

Thanks for letting me know! I'll close the issue.

Maybe it was a problem with the MMseqs2 version? The latest versions of geNomad are only compatible with version 14-7e284.
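
A quick way to test that guess is to read the version from the `MMseqs Version:` line of the log (13.45111 in the log posted above) and compare it with the expected release. This is only a rough sketch: the helper names are hypothetical, and it assumes the log writes the release as `14.7e284` where the tag is written `14-7e284`, so both separators are treated as equivalent.

```python
import re

# Release named in this thread as the compatible one (taken as an assumption).
REQUIRED_MMSEQS = "14-7e284"


def normalize(version):
    """Treat '14-7e284' and '14.7e284' as the same release string."""
    return version.strip().replace("-", ".")


def version_from_log(log_text):
    """Return the version on the 'MMseqs Version:' log line, or None if absent."""
    match = re.search(r"^MMseqs Version:\s*(\S+)", log_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def is_compatible(log_text, required=REQUIRED_MMSEQS):
    """True only if the log reports exactly the required release."""
    found = version_from_log(log_text)
    return found is not None and normalize(found) == normalize(required)


if __name__ == "__main__":
    sample = "MMseqs Version: 13.45111\nSubstitution matrix nucl:nucleotide.out"
    print(version_from_log(sample), is_compatible(sample))
```

With the log from this issue, the check reports a mismatch, which is consistent with the pip-based environment having picked up an older MMseqs2.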

apcamargo closed this as completed Jul 4, 2023
