Parapred - PyTorch


This repo contains a PyTorch implementation of the original Parapred method from Liberis et al., 2018. Alchemab provides users with a faster, more modular tool for predicting paratope probabilities of CDR sequences. We also provide instructions on how to integrate Parapred into existing Python code.

We would also like to point users to the original GitHub repo for Parapred, which is based on Keras.


Setup

We provide three ways to set up Parapred.

Conda environment

$ conda env create -f environment.yml
$ make install

Pip/requirements.txt

$ pip install -r requirements.txt
$ make install

Dockerfile

Building the Docker image requires a bit of memory to download and install PyTorch in the Docker container. We advise that you allocate at least 4GB of RAM to the Docker daemon.

For Mac, this can be set by Docker > Preferences > Resources > Advanced.

$ docker build -t parapred_pytorch:latest .

Running instructions

Parapred takes as input any Chothia-defined CDR sequence plus 2 residues on either side. We provide a table (see Numbering below) to help translate between IMGT and Chothia numbers.

How you run Parapred depends on whether you installed it via Conda/Pip or Docker.

Conda/Pip

After install,

$ python cli.py predict <CDR_SEQUENCE> [OPTIONS]

e.g. to predict on the CDRH3 sequence (+2 residues on either side) of adalimumab (AKVSYLSTASSLDYWG) and save the results to output_adalimumab.json,

$ python cli.py predict AKVSYLSTASSLDYWG -v -o output_adalimumab.json

Output format

The output of Parapred is a JSON file:

{
  "AKVSYLSTASSLDYWG":[
    ["A",0.01034],["K",0.23327],["V",0.70882],["S",0.6835],
    ["Y",0.93412],["L",0.98127],["S",0.95116],["T",0.89977],
    ["A",0.77837],["S",0.83716],["S",0.29981],["L",0.03982],
    ["D",0.14933],["Y",0.10781],["W",0.00362],["G",0.00776]
  ]
}

From the command line, we advise using tools such as jq to parse the output. For example, to get the residues whose Parapred probabilities are at least 0.6,

$ jq -c '.[] | .[] | select(.[1] >= 0.6)' output.json
# output using AKVSYLSTASSLDYWG
["V",0.70882]
["S",0.6835]
["Y",0.93412]
["L",0.98127]
["S",0.95116]
["T",0.89977]
["A",0.77837]
["S",0.83716]

Running via Docker

The Docker equivalent is

docker run -v /tmp:/data parapred_pytorch:latest predict AKVSYLSTASSLDYWG -v -o /data/output_adalimumab.json
  • The first -v flag before the Docker image tag (parapred_pytorch:latest) mounts your /tmp folder to the /data folder of the Docker container
  • The second -v flag acts as the verbose flag for the predict command of the CLI tool
  • The -o flag sets the location where the prediction JSON is saved. This path is interpreted inside the Docker container. Thus, even though we write /data/ in the terminal, it refers to the /data/ folder of the Docker container, which is actually /tmp on your machine (if you've mounted it as suggested above)

Paratyping

We also include the capability for Paratyping, as described in Richardson et al. 2021. Briefly, sequence positions whose paratope probabilities are greater than 0.67 are aligned, and identity is calculated over those positions. Users can specify paratope probability thresholds using the -t option.

$ python cli.py paratype AKVSYLSTASSLDYWG AKVSYLSTWSSLDYWG

This will generate the following alignment:

AKVSYLSTASSLDYWG
  ::::::x:      
AKVSYLSTWSSLDYWG
Paratype Score=0.875

To interpret the alignment,

  • : corresponds to a match for two paratope positions
  • x corresponds to a mismatch for two paratope positions
  • ' corresponds to a position that is a paratope in the first sequence but not in the second
  • . corresponds to a position that is a paratope in the second sequence but not in the first
  • Gaps correspond to positions that were not predicted to be a paratope in either sequence
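
To make the scoring concrete, here is a minimal sketch of the calculation. It assumes the score is the fraction of matching residues over the union of positions predicted to be paratope in either sequence; the helper below is illustrative rather than the repo's exact implementation:

def paratype_score(seq_a, probs_a, seq_b, probs_b, threshold=0.67):
    # assumes equal-length, pre-aligned sequences and per-residue Parapred probabilities
    assert len(seq_a) == len(seq_b)
    paratope = [i for i in range(len(seq_a))
                if probs_a[i] > threshold or probs_b[i] > threshold]
    matches = sum(seq_a[i] == seq_b[i] for i in paratope)
    return matches / len(paratope) if paratope else 0.0

In the adalimumab example above, eight positions pass the 0.67 threshold and seven of them match, giving 7/8 = 0.875.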

Integration into a larger Python codebase

import torch
from parapred.model import Parapred, clean_output
from parapred.cnn import generate_mask
from parapred.preprocessing import encode_batch

# prepare your input
sequences = (["AKVSYLSTASSLDYWG", "YCQRYNRAPYTFG"])

# PyTorch's pack padded sequence function requires length sorting from longest to shortest
sorted_cdr_strings = [cdr for cdr in sorted(sequences, key = lambda z: len(z), reverse=True)]
sorted_cdr_lookup = dict(enumerate(sorted_cdr_strings))
lookup_cdr = dict([(v,i) for i, v in enumerate(sorted_cdr_strings)])

# encoded is a tensor of (batch_size x features x max_length); so is mask.
encoded, lengths = encode_batch(sorted_cdr_strings, max_length = 40)
mask = generate_mask(encoded, lengths)

# initialise the model and load pretrained weights
model = Parapred()
model.load_state_dict(torch.load("parapred/weights/parapred_pytorch.h5"))

# switch to evaluation mode and disable gradient tracking
_ = model.eval()
with torch.no_grad():
    probs = model(encoded, mask, lengths)

# this cleans up probabilities that would have been predicted for the
# padded positions, which we should ignore.
probs = [clean_output(pr, lengths[i]).tolist() for i, pr in enumerate(probs)]
 
# map back to CDR sequence; remember that we submitted length-sorted strings
mapped = [list(zip(sorted_cdr_strings[i], pr)) for i, pr in enumerate(probs)]

# we need to re-order `mapped` back to the original (unsorted) ordering 
mapped = [mapped[lookup_cdr[s]] for s in sequences]

Additional notes

How Parapred works

We encourage you to read the paper by Liberis et al., 2018 as it contains details on the training and testing that led up to the neural network architecture and parameters. In short,

  • Parapred first featurises sequences into a (B x F x L) tensor, where F = 28 and L = 40
  • Parapred then applies a masked 1D convolution; the output is still (B x F x L), but a mask is applied to "zero out" positions that are really padding.
  • The output tensor is then packed and processed by a bidirectional LSTM; the output is (B x * x 512), as the lengths of the CDRs are variable.
  • Finally, the output is passed to a fully-connected layer with a (hard) sigmoid activation to calculate per-residue probabilities (a minimal code sketch of this pipeline follows the workflow figure below).

(Workflow diagram: workflow.png)
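
The following is an illustrative sketch of that pipeline in PyTorch. The kernel width, activation after the convolution, and hidden size are assumptions made for illustration; see parapred.model.Parapred in this repo for the reference implementation.

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class ParapredSketch(nn.Module):
    """Illustrative sketch only; parapred.model.Parapred is the real model."""
    def __init__(self, n_features: int = 28, hidden: int = 256):
        super().__init__()
        # masked 1D convolution over the length dimension (kernel width assumed)
        self.conv = nn.Conv1d(n_features, n_features, kernel_size=3, padding=1)
        self.elu = nn.ELU()
        # bidirectional LSTM: 2 x 256 = 512 features per position
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x, mask, lengths):
        # x and mask are (B x F x L); zero out padded positions after the convolution
        x = self.elu(self.conv(x)) * mask
        # the LSTM consumes (B x L x F); packing skips the padded positions
        packed = pack_padded_sequence(x.transpose(1, 2), lengths,
                                      batch_first=True, enforce_sorted=False)
        out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(out, batch_first=True)  # (B x L' x 512)
        # per-residue probabilities (the real model defaults to a hard sigmoid here)
        return torch.sigmoid(self.fc(out)).squeeze(-1)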

Aims of this repo

Parapred-pytorch (this repo) is written to help researchers rapidly predict paratope probabilities of CDR sequences. It is a minimalist implementation that focuses on providing inference capabilities. We provide here the pre-trained weights from the original Parapred GitHub repo, translated to fit the PyTorch framework.

Numbering

The original Parapred method uses Chothia-defined CDRs under the Chothia numbering scheme. The table below maps the Chothia CDR boundaries to the corresponding IMGT numbers. Note that these are not identical to the IMGT definitions of the CDRs; e.g. CDRH3 according to the IMGT definition is IMGT H105-H117. To number sequences and convert between CDR definitions, we recommend using ANARCI.

| CDR | Chothia numbers | IMGT numbers | Parapred input (IMGT) |
|-----|-----------------|--------------|-----------------------|
| H1  | H26-H34         | H27-H37      | H25-H39               |
| H2  | H52-H56         | H57-H64      | H55-H66               |
| H3  | H95-H102        | H107-H117    | H105-H119             |
| L1  | L24-L34         | L24-L40      | L22-L42               |
| L2  | L50-L56         | L56-L69      | L54-L71               |
| L3  | L89-L97         | L105-L117    | L103-L119             |
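
As an example of using this table, here is a minimal sketch that pulls out the CDRH3 Parapred input from an IMGT-numbered heavy chain. The numbered_chain format below is a hypothetical list of (IMGT position, residue) pairs, e.g. derived from ANARCI output; insertion codes are ignored for simplicity.

def parapred_h3_input(numbered_chain):
    # Parapred input window for CDRH3 in IMGT numbers (see table above)
    start, end = 105, 119
    return "".join(res for pos, res in numbered_chain
                   if start <= pos <= end and res != "-")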

LSTM activations

Our implementation provides users the ability to test Parapred using sigmoid activations for the LSTM layer:

$ python cli.py predict AKVSYLSTASSLDYWG -s -v

The original Parapred method was based on TensorFlow 1.2/Keras, which used hard_sigmoid activations in the LSTM step. We retain the hard sigmoid as the default, though we would like to highlight that the hard sigmoid is a piecewise-linear approximation of the regular sigmoid that allows faster computation.
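
For reference, here is a minimal sketch of a Keras-style hard sigmoid next to the regular sigmoid. The piecewise definition below follows the Keras convention; note that PyTorch's built-in torch.nn.functional.hardsigmoid uses a slightly different slope.

import torch

def hard_sigmoid(x: torch.Tensor) -> torch.Tensor:
    # Keras-style hard sigmoid: a piecewise-linear approximation of sigmoid
    return torch.clamp(0.2 * x + 0.5, min=0.0, max=1.0)

x = torch.linspace(-3, 3, 7)
print(torch.sigmoid(x))   # smooth sigmoid
print(hard_sigmoid(x))    # piecewise-linear approximation (cheaper to compute)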