# ictv-mmseqs2-protein-database

This repository contains instructions to generate a MMSeqs2 protein database with ICTV taxonomy. This database was not benchmarked. For taxonomic assignment of viral genomes you can try [geNomad](https://github.com/apcamargo/genomad).

## Dependencies:

- [`aria2`](https://github.com/aria2/aria2)
- [`ripgrep`](https://github.com/BurntSushi/ripgrep)
- [`csvtk`](https://github.com/shenwei356/csvtk)
- [`seqkit`](https://github.com/shenwei356/seqkit)
- [`taxonkit` (version 0.11.1)](https://github.com/shenwei356/taxonkit/releases/tag/v0.11.1)
- [`taxopy`](https://github.com/apcamargo/taxopy)
- [`mmseqs2`](https://github.com/soedinglab/MMseqs2)



## Instructions
First we actually install the dependencies....


In [2]:
import os
os.makedirs(f"{os.environ['HOME']}/silly_buisness",exist_ok=True)
os.chdir(f"{os.environ['HOME']}/silly_buisness")

In [None]:
%%bash
# Create and activate a new conda environment
# conda create -n ictv_mmseqs2 python=3.9 -y
conda activate ictv_mmseqs2

# Install conda packages
mamba install -c bioconda -c conda-forge \
    aria2 \
    ripgrep \
    csvtk \
    seqkit \
    taxonkit=0.11.0 \
    mmseqs2 \
    -y
### taxonkit has to be 0.11.0 !!
# Install taxopy using pip
pip install taxopy ipython ipython-kernel
# NOte- now change your i/pythonkernel to the one you just created

In [31]:
%%bash
echo $PWD

/home/neri/silly_buisness


### Verify installations


In [2]:
%%bash
echo "Checking installed tools:"
aria2c --version | head -n 1
rg --version | head -n 1
csvtk version | head -n 1
seqkit version | head -n 1
taxonkit version | head -n 1
python -c "import taxopy; print(f'taxopy {taxopy.__version__}')"
mmseqs version | head -n 1

echo -e "\nSetup complete. You can now run the rest of the thing."

Checking installed tools:
aria2 version 1.37.0
ripgrep 14.1.0 (rev 9477456963)
csvtk v0.30.0
seqkit v2.8.2
taxonkit v0.10.0


taxopy 0.13.0
15.6f452

Setup complete. You can now run the rest of the thing.


Now, download the latest VMR release from ICTV and convert it to a tabular file:

In [14]:
%%bash
mkdir ~/silly_buisness
cd ~/silly_buisness
aria2c -x 4 -o ictv.xlsx "https://ictv.global/vmr/current?fid=15873" --check-certificate=false
# wget https://ictv.global/vmr/current -O ictv.xlsx --no-check-certificate -o /dev/null

# convert xlsx to tsv
csvtk xlsx2csv ictv.xlsx \
    | csvtk csv2tab \
    | sed 's/\xc2\xa0/ /g' \
    | csvtk replace -t -F -f "*" -p "^\s+|\s+$" \
    > ictv.tsv

# choose columns, and remove duplicates
csvtk cut -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" ictv.tsv \
    | csvtk uniq -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
    | csvtk del-header -t \
    > ictv.taxonomy.tsv

mkdir: cannot create directory ‘/home/neri/silly_buisness’: File exists


Create a file that will store all the ICTV taxa names:

In [17]:
%%bash

csvtk cut -t -H -f 1,3,5,7,9,11,13,15 ictv.taxonomy.tsv \
    | sed 's/\t/\n/g' \
    | awk '!/^[[:blank:]]*$/' \
    | sort -u \
    > ictv.names.txt

Now let's get NCBI taxdump though I don't really think we need all cause we can try to get refseq entries of only examplers (and even they error)

In [23]:
%%bash
cd ~/silly_buisness
aria2c  -x 4 -o taxdump.tar.gz  ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz


Use `taxonkit create-taxdump` to create a custom taxdump for ICTV. Next, execute the `fix_taxdump.py` script, which will make the taxids sequential to make them compatible with MMSeqs2:

In [40]:
%%bash
# mkdir ~/.taxonkit/
# mv ./*dmp ~/.taxonkit/
# mv gc.prt ~/.taxonkit/
# mv readme* ~/.taxonkit/
taxonkit create-taxdump -K 1 -P 3 -C 5 -O 7 -F 9 -G 11 -S 13 -T 15 \
    --rank-names "realm","kingdom","phylum","class","order","family","genus","species" \
    ictv.taxonomy.tsv --out-dir ictv-taxdump --force


14:16:09.365 [INFO][0m 18665 records saved to ictv-taxdump/nodes.dmp
14:16:09.379 [INFO][0m 18665 records saved to ictv-taxdump/names.dmp
14:16:09.379 [INFO][0m 0 records saved to ictv-taxdump/merged.dmp
14:16:09.379 [INFO][0m 0 records saved to ictv-taxdump/delnodes.dmp


This next block used to be called "fix_taxdump.py"

In [41]:
from pathlib import Path

taxid_mapping_dict = {}
count = 1
with open(Path("ictv-taxdump").joinpath("names.dmp")) as fin:
    for line in fin:
        taxid = line.split("\t")[0]
        taxid_mapping_dict[taxid] = str(count)
        count += 1

with open(Path("ictv-taxdump").joinpath("names.dmp")) as fin, open(
    Path("ictv-taxdump").joinpath("names.dmp.new"), "w"
) as fout:
    for line in fin:
        line = line.strip().split("\t")
        line[0] = taxid_mapping_dict[line[0]]
        line = "\t".join(line)
        fout.write(f"{line}\n")

with open(Path("ictv-taxdump").joinpath("nodes.dmp")) as fin, open(
    Path("ictv-taxdump").joinpath("nodes.dmp.new"), "w"
) as fout:
    for line in fin:
        line = line.strip().split("\t")
        line[0] = taxid_mapping_dict[line[0]]
        line[2] = taxid_mapping_dict[line[2]]
        line = "\t".join(line)
        fout.write(f"{line}\n")

Path("ictv-taxdump").joinpath("names.dmp").unlink()
Path("ictv-taxdump").joinpath("nodes.dmp").unlink()
Path("ictv-taxdump").joinpath("nodes.dmp.new").rename("ictv-taxdump/nodes.dmp")
Path("ictv-taxdump").joinpath("names.dmp.new").rename("ictv-taxdump/names.dmp")

PosixPath('ictv-taxdump/names.dmp')

Download the NCBI taxdump and the `prot.accession2taxid` file. Then, filter `prot.accession2taxid` to keep only viral proteins:

In [45]:
%%bash
# Download the NCBI taxdump
# aria2c -x 4 "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz" --check-certificate=false
mkdir ncbi-taxdump
tar zxfv taxdump.tar.gz -C ncbi-taxdump
rm taxdump.tar.gz

# Download the protein → taxid association and filter for viruses
# aria2c -x 4 "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz" --check-certificate=false

gunzip prot.accession2taxid.FULL.gz

awk '{print $2}' prot.accession2taxid.FULL \
    | sort -u \
    | taxonkit --data-dir ncbi-taxdump lineage \
    | rg "\tViruses;" \
    | awk '{print $1}' \
    > virus_taxid.list

csvtk grep -t -f 2 -P virus_taxid.list prot.accession2taxid.FULL > virus.accession2taxid

# rm prot.accession2taxid.FULL # why antonio why I thought it crashed and then it deleted it now I need to redownload this wasteful legacy filE??

mkdir: cannot create directory ‘ncbi-taxdump’: File exists
tar (child): taxdump.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove 'taxdump.tar.gz': No such file or directory
gzip: prot.accession2taxid.FULL already exists;	not overwritten


#### Find the ICTV-compliant proteins and write a new table with the ICTV taxids
Execute the `get_ictv_taxids.py` (moved into code block) script to create a `accession2taxid` file with ICTV taxids.

In [3]:
import taxopy

# Read the official ICTV names
with open("ictv.names.txt") as fin:
    ictv_name_set = {i.strip() for i in fin.readlines()}

# Create the Taxopy NCBI and ICTV databases
ncbi_taxdb = taxopy.TaxDb(
    nodes_dmp="ncbi-taxdump/nodes.dmp",
    names_dmp="ncbi-taxdump/names.dmp",
    merged_dmp="ncbi-taxdump/merged.dmp",
    keep_files=True,
)
ictv_taxdb = taxopy.TaxDb(
    nodes_dmp="ictv-taxdump/nodes.dmp",
    names_dmp="ictv-taxdump/names.dmp",
    keep_files=True,
)

# Replace non-ICTV taxids
with open("virus.accession2taxid") as fin, open(
    "virus.accession2taxid.ictv", "w"
) as fout:
    next(fin)
    for line in fin:
        acc, taxid = line.split("\t")
        ncbi_taxon = taxopy.Taxon(int(taxid), ncbi_taxdb)
        for j in ncbi_taxon.name_lineage:
            if j in ictv_name_set:
                ictv_taxid = taxopy.taxid_from_name(j, ictv_taxdb)
                ictv_taxon = taxopy.Taxon(ictv_taxid[0], ictv_taxdb)
                fout.write(f"{acc}\t{ictv_taxon.taxid}\n")
                break



IndexError: list index out of range

In [25]:
j

'Preplasmiviricota'

Download the proteins from NCBI and filter the FASTA file to keep only the proteins associated with ICTV viruses:

In [None]:
%%bash
# Download and filter NR proteins
aria2c --check-certificate=false -x 4 "https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"

# Create a list containing the accessions of the proteins of ICTV viruses
cut -f 1 virus.accession2taxid.ictv > virus.accession.txt

# Filter the NR proteins to keep the proteins encoded by ICTV viruses
seqkit grep -j 4 -f virus.accession.txt nr.gz | seqkit seq -i -w 0 -o nr.virus.faa.gz

rm nr.gz

There will be proteins in `virus.accession2taxid.ictv` that are not in NR. So we will keep only the proteins that are present in the filtered NR FASTA file:

In [None]:
%%bash
# Filter the NR virus taxid table
seqkit fx2tab -n -i nr.virus.faa.gz > nr.virus.list.txt
csvtk grep -t -H -f 1 -P nr.virus.list.txt virus.accession2taxid.ictv > nr.virus.accession2taxid.ictv

Using the filtered NR FASTA, the ICTV taxdump, and the `virus.accession2taxid.ictv` tabular file, we will create a MMSeqs2 protein database with taxonomy information:

In [None]:
%%bash
# Create the MMSeqs2 database
mkdir virus_tax_db
mmseqs createdb --dbtype 1 nr.virus.faa.gz virus_tax_db/virus_tax_db
mmseqs createtaxdb virus_tax_db/virus_tax_db tmp --ncbi-tax-dump ictv-taxdump --tax-mapping-file nr.virus.accession2taxid.ictv
rm -rf tmp

Finally, to assign taxonomy to viral sequences in an input file (`input.fna`):

In [None]:
%%bash
mmseqs easy-taxonomy input.fna virus_tax_db/virus_tax_db taxonomy_results tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1