Skip to content

Latest commit

 

History

History
142 lines (96 loc) · 4.36 KB

README.md

File metadata and controls

142 lines (96 loc) · 4.36 KB

vcmsa

Python library to run vector clustering Multiple Sequence Alignment (vcMSA).

About

VcMSA uses protein language models to allow alignment of protein sets with conserved function/structure but poorly conserved sequence.

This is an alpha version of the algorithm code that functions as a proof-of-concept, expect later optimization for speed and efficiency

Project Github repo

Colab notebook

Authors

Claire D. McWhite

Isabel Armour-Garb

Mona Singh

Citation

CD McWhite, I Armour-Garb, M Singh, "Leveraging protein language models for accurate multiple sequence alignments", Genome Research, 2023

Inputs

Downloading a language model from Huggingface

  • Example for the prot_t5_xl_uniref50 model.
  • vcMSA was also tested with prot_bert_bfd model, however output alignments differ from the t5 model
from transformers import T5Tokenizer, T5Model
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
tokenizer.save_pretrained('prot_t5_xl_uniref50')
model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.save_pretrained('prot_t5_xl_uniref50')

Installation

vcmsa can be directly installed with [sudo permission] from pypi

conda create --prefix vcmsa_env --file vcmsa/environment.txt
conda activate vcmsa_env
pip install vcmsa 

or

conda create --prefix vcmsa_env --file vcmsa/environment.txt 

conda activate vcmsa_env
easy_install install vcmsa

Alternatively, vcmsa can also be installed from source.

git clone --branch main --quiet https://github.com/clairemcwhite/vcmsa.git

conda create --prefix vcmsa_env --file vcmsa/environment.txt

conda activate vcmsa_env
cd vcmsa
python setup.py install
#or 
sudo python setup.py install
#or 
python setup.py install --user   

As this code is in development, please submit cases where vcmsa fails as a github Issue : https://github.com/clairemcwhite/vcmsa/issues

Package usage

### Command line usage

$ vcmsa  


usage: vcmsa [-h] -i FASTA_PATH [-e EMBEDDING_PATH] -o OUT_PATH [-nb] [-sl SEQLIMIT] [-ex] [-fx]
             [-l LAYERS [LAYERS ...]] [-hd HEADS] [-st SEQSIMTHRESH] -m MODEL_NAME [-pca] [-p PADDING] [-l2]

optional arguments:
  -h, --help            show this help message and exit
  -i FASTA_PATH, --in FASTA_PATH
                        Path to fasta
  -e EMBEDDING_PATH, --emb EMBEDDING_PATH
                        Path to embeddings
  -o OUT_PATH, --outfile OUT_PATH
                        Path to outfile
  -bc, --batch_correct
                        If added, do batch correction on sequences
  -sl SEQLIMIT, --seqlimit SEQLIMIT
                        Limit to n sequences. For testing
  -ex, --exclude        Exclude outlier sequences from initial alignment process
  -fx, --fully_exclude  Additionally exclude outlier sequences from final alignment
  -l LAYERS [LAYERS ...], --layers LAYERS [LAYERS ...]
                        Which layers (of 30 in protbert) to select, default = '-16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1'
  -hd HEADS, --heads HEADS
                        File will one head identifier per line, format layer1_head3
  -st SEQSIMTHRESH, --seqsimthresh SEQSIMTHRESH
                        Similarity threshold for clustering sequences
  -m MODEL_NAME, --model MODEL_NAME
                        Model name or path to local model
  -pca, --pca_plot      If flagged, output 2D pca plot of amino acid clusters
  -p PADDING, --padding PADDING
                        Number of characters of X to add to start and end of sequence (can be important for
                        fragment sequences), default: 10
  -l2, --headnorm       Take L2 normalization of each head

ex.
```bash
fasta=LDLa.vie.20seqs.fasta
layers='-16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1'
model=prot_t5_xl_uniref50
suffix=.16layer.t5.aln
vcmsa  -i $fasta -o $fasta.$suffix --layers  $layers  -m $model --exclude --pca_plot