Skip to content

Searching proteins

Jamie Morton edited this page Apr 9, 2024 · 11 revisions

This search and alignment tutorial is building off of the previous tutorial on how to build search databases. Here, we will provide guidance on how to perform structured queries against protein sequence datasets. We will first show how to pull down prebuilt databases, and how to build your own queries.

To get started first download the TM-Vec CATH model and the prebuilt database

wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
wget https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-v3.ckpt

To search against the embeddings_cath_s100_final.npy database, you just need to provide the proteins you want to query in the form a of fasta formatted file. Say that you have a file called my_proteins.fasta that you want to query against the database. You can perform this query using the following command

tmvec-search \
    --query bagel.fa \
    --tm-vec-model tm_vec_cath_model.ckpt \
    --tm-vec-config tm_vec_cath_model_params.json \
    --database bagel_database/db.npy \
    --metadata bagel_database/meta.npy \
    --database-fasta bagel.fa \
    --device 'gpu' \
    --output-format tabular \
    --output tabular.txt \
    --output-embeddings test.npy

This will output a table of top K-nearest neighbors for each query protein, in addition to their TM-scores under tabular.txt. The first few lines of this file should look as follows

query_id	rank	database_id	tm-score
68-1147	0	68-1147	0.9999999
68-1147	1	Thiostrepton_B	0.98808634
68-1147	2	Siomycin_A_(Siomycin_B)_(Siomycin_C)_(Siomycin_D)	0.9833769
68-1147	3	Thiostrepton_A	0.96945214
68-1147	4	Nosiheptide	0.9661771
A10255_B_(A10255_E)_(A10255_G)	0	A10255_B_(A10255_E)_(A10255_G)	1.0000002
A10255_B_(A10255_E)_(A10255_G)	1	Thioxamycin	0.9883827
A10255_B_(A10255_E)_(A10255_G)	2	Promoinducin	0.9829995
A10255_B_(A10255_E)_(A10255_G)	3	A10255_J	0.97695875

The query id refers to the id of a protein in the query fasta file. The database id refers to the id of the protein in the database fasta file. The rank here specifies the ordering due to the TM-score. For instance 68-1147 vs 68-1147 have a TM-score close to 1 since they are identical (rank 0), and 68-1147 vs Thiostrepton_B have a TM-score of 0.98808634, which making Thiostrepton_B the second most structurally similar protein to the protein 68-1147 in the database. Because the query fasta file and the database fasta file are identical, we expect the top hits to correspond to identical proteins.

The --output-embeddings contains the embeddings of the query proteins. With this file, it is possible to compute TSNE-representations and directly visualize the similarity between proteins. Opening test.npy will require the numpy.load function.

The GPU flag is needed to specify that a GPU will be used for computation -- this will drastically reduce the runtime for encoding the proteins (which is the slowest operation).

Formatting options

There are three different formats that TM-vec can output, (1) a tabular format, (2) an embedding format and (3) an alignment format. Generating the tabular format was demonstrated above. The embedding format takes the input query sequences and converts them to protein vectors. The alignment format computes structural alignments amongst the top K nearest neighbors using DeepBLAST. If computing alignments, you will need to specify additional files in order to retrieve the original protein sequences (namely the fasta file and the fasta index file). Specifying alignments can be done as follows

tmvec-search \
    --query bagel.fa \
    --tm-vec-model tm_vec_cath_model.ckpt \
    --tm-vec-config tm_vec_cath_model_params.json \
    --database bagel_database/db.npy \
    --metadata bagel_database/meta.npy \
    --database-fasta bagel.fa \
    --database-faidx bagel.fai \
    --deepblast-model deepblast-pt-l8.ckpt \
    --device 'gpu' \
    --k-nearest-neighbors 1 \
    --output-format alignment \
    --output alignments.txt \
    --output-embeddings test.npy

The alignments file contains a list of gapped alignments (no statistics or TM-scores). In this tutorial the first 5 alignments are formatted below.

Tricyclic  CLGVGSCNNFAGCGYAIVCFW
Tricyclic  CLGVGSCNNFAGCGYAIVCFW

Capistruin  MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN
Capistruin  MVRLLAKLLRSTIHGSNGVSLDAVSSTHGTPGFQTPDARVISRFGFN

Chaxapeptin  MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF
Chaxapeptin  MEPQMTELQPEAYEAPSLIEVGEFSEDTLGFGSKPLDSFGLNFF

Enterocin  MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC
Enterocin  MGNSILNKMTVEEMEAVKGGNLVCPPMPDYIKRLSTGKGVSSVYMAWQIANCKSSGSCMKGQTNRTC

Caulonodin_V  MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP
Caulonodin_V  MTQVSPSPLRLIRVGRALDLTRSIGDSGLRESMSSQTYWP

Gassericin  MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK
Gassericin  MKNFNTLSFETLANIVGGRNNLAANIGGVGGATVAGWALGNAVCGPACGFVGAHYVPIAWAGVTAATGGFGKIRK

Because we are DeepBLASTing bagel.fa against itself, most of the proteins will align against themselves (see Other things to consider for more discussion).

One thing to note about these alignments is that if two proteins are too dissimilar, then DeepBLAST will not return an alignment. As a result, TM-vec and DeepBLAST will not always agree, it won't be uncommon to have hits from TM-vec that have no DeepBLAST alignments. This is especially the case when the TM-score is slow (around 0.5), DeepBLAST will likely not report any alignments.

Other things to consider

If you want to run these operations on a GPU, make sure to specify the --device flag (i.e. --device 'gpu'). Also keep in mind that performing alignments can be resource intensive (i.e. 4 alignments / second in this tutorial on a RTX 4090 GPU), so if you just want to identify remotely homologous proteins on large queries, you may want to avoid the alignment format.

Finally, we want to point out that DeepBLAST is not a perfect aligner -- it will not perform as well as traditional sequence aligners when the sequence identity is >90%. This is partially due to the training dataset, DeepBLAST was trained on proteins who have sequence identity <40% and TM-score > 0.6. As a result, there are edge cases where DeepBLAST will not be able to correctly rank proteins with high sequence identity (even if they are identical). This is apparent in this tutorial involving bacteriocins.