## mhcpredict library basics

mhcpredict provides a standardized programmatic interface for running several MHC binding prediction methods, so that the results from each method can be processed and visualized in a consistent manner. The Tepitope module implements the TEPITOPEpan method and requires no external program to run. netMHCIIpan, by contrast, must be downloaded separately from its website and installed on your system.

### References:

D. Farrell and S. V. Gordon, “Epitopemap: a web application for integrated whole proteome epitope prediction,” BMC Bioinformatics, vol. 16, no. 1, p. 221, 2015.

L. Zhang, Y. Chen, H.-S. Wong, S. Zhou, H. Mamitsuka, and S. Zhu, “TEPITOPEpan: extending TEPITOPE for peptide binding prediction covering over 700 HLA-DR molecules,” PLoS One, vol. 7, no. 2, p. e30483, Jan. 2012.

A. S. De Groot and W. Martin, “Reducing risk, improving outcomes: bioengineering less immunogenic protein therapeutics,” Clin. Immunol., vol. 131, no. 2, pp. 189–201, May 2009.

F. A. Chaves, A. H. Lee, J. L. Nayak, K. A. Richards, and A. J. Sant, “The utility and limitations of current Web-available algorithms to predict peptides recognized by CD4 T cells in response to pathogen infection,” J. Immunol., vol. 188, no. 9, pp. 4235–48, May 2012.

In [45]:
import os
import pandas as pd
from mhcpredict import base, sequtils, analysis

In [10]:
genbankfile = 'testing/zaire-ebolavirus.gb'
fastafile = 'testing/zaire-ebolavirus.faa'
savepath = 'testing'

## Methodology

Predictors for each method inherit from the `Predictor` class, and each implements a `predict` method for scoring a single sequence. This method may wrap functions from other modules and/or call command line predictors; for example, the `TepitopePredictor` uses the `mhcpredict.tepitope` module. `predict` should return a pandas `DataFrame`. The `predictProteins` method handles multiple proteins held in a dataframe of sequences in a standard format, which is created from a GenBank or FASTA file (see examples below). For large numbers of sequences, call `predictProteins` with `save=True` so that results are written to disk as each protein is completed. This avoids memory issues, since many alleles might be run for each protein. Results are saved as one file per protein in msgpack format.
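The pattern described above can be sketched with a toy class. This is a minimal, dependency-free illustration and not mhcpredict's implementation: `ToyPredictor`, `score`, `predict_proteins`, the hydrophobic-residue scoring and the JSON output are all assumptions made for the example (the library itself returns pandas DataFrames and saves msgpack files), and the two sequences are made up.

```python
import json
import os
import tempfile


class ToyPredictor(object):
    """Hypothetical stand-in for a Predictor subclass; not the mhcpredict API."""

    def score(self, peptide, allele):
        # Placeholder scoring (illustration only): count hydrophobic residues.
        return sum(1 for aa in peptide if aa in 'FILMVWY')

    def predict(self, sequence, name, allele, length=11):
        """Score every overlapping n-mer of a single sequence for one allele."""
        rows = []
        for pos in range(len(sequence) - length + 1):
            peptide = sequence[pos:pos + length]
            rows.append({'peptide': peptide, 'pos': pos, 'name': name,
                         'allele': allele,
                         'score': self.score(peptide, allele)})
        return rows

    def predict_proteins(self, proteins, alleles, length=11,
                         save=False, path='.'):
        """Run predict for each protein and allele.

        With save=True each protein's rows are written to their own file
        as soon as that protein is finished, and nothing is kept in memory.
        """
        results = {}
        for name, sequence in proteins.items():
            rows = []
            for allele in alleles:
                rows.extend(self.predict(sequence, name, allele, length))
            if save:
                with open(os.path.join(path, name + '.json'), 'w') as handle:
                    json.dump(rows, handle)
            else:
                results[name] = rows
        return results


# Made-up input: protein name -> amino acid sequence
proteins = {'protA': 'MDYHKILTAGLSVQQ', 'protB': 'QGIVRQRVIPVYQ'}
alleles = ['HLA-DRB1*0101', 'HLA-DRB1*0401']

predictor = ToyPredictor()
in_memory = predictor.predict_proteins(proteins, alleles, length=11)

outdir = tempfile.mkdtemp()
predictor.predict_proteins(proteins, alleles, length=11, save=True, path=outdir)
saved_files = sorted(os.listdir(outdir))
```

Writing one file per protein keeps peak memory proportional to a single protein's results rather than to the whole genome, which is the motivation given above for `save=True`.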

In [30]:
#get list of predictors
print(base.predictors)

['tepitope', 'netmhciipan', 'iedbmhc1', 'iedbmhc2', 'bcell']


In [51]:
#get data in genbank format into a dataframe
df = sequtils.genbank2Dataframe(genbankfile, cds=True)
#or get data in fasta format (this replaces the dataframe above)
df = sequtils.fasta2Dataframe(fastafile)

#create tepitope predictor
P = base.getPredictor('tepitope')

#run prediction for several alleles and save results to savepath
alleles = ["HLA-DRB1*0101", "HLA-DRB1*0305", "HLA-DRB1*0401"]
P.predictProteins(df, length=11, alleles=alleles, save=True, path=savepath)

ZEBOVgp1
ZEBOVgp2
ZEBOVgp3
ZEBOVgp4
ZEBOVgp4
ZEBOVgp4
ZEBOVgp5
ZEBOVgp6
ZEBOVgp7


In [54]:
#read previous results for a protein
filename = 'testing/ZEBOVgp1.mpk'
res = pd.read_msgpack(filename)
#set this data for the predictor
#assumes the data is for the right predictor, need to add checks...
P.data = res
print(res[:10])
#get promiscuous binders
print(P.getPromiscuousBinders(data=res, n=2))

         peptide       core  pos  score      name         allele  rank
198  VIFRLMRTNFL  FRLMRTNFL  198    3.4  ZEBOVgp1  HLA-DRB1*0101     1
199  IFRLMRTNFLI  FRLMRTNFL  199    3.4  ZEBOVgp1  HLA-DRB1*0101     1
200  FRLMRTNFLIK  FRLMRTNFL  200    3.4  ZEBOVgp1  HLA-DRB1*0101     1
709  NRFVTLDGQQF  FVTLDGQQF  709    2.5  ZEBOVgp1  HLA-DRB1*0101     4
710  RFVTLDGQQFY  FVTLDGQQF  710    2.5  ZEBOVgp1  HLA-DRB1*0101     4
711  FVTLDGQQFYW  FVTLDGQQF  711    2.5  ZEBOVgp1  HLA-DRB1*0101     4
70   DSFLLMLCLHH  FLLMLCLHH   70    2.0  ZEBOVgp1  HLA-DRB1*0101     7
71   SFLLMLCLHHA  FLLMLCLHH   71    2.0  ZEBOVgp1  HLA-DRB1*0101     7
72   FLLMLCLHHAY  FLLMLCLHH   72    2.0  ZEBOVgp1  HLA-DRB1*0101     7
32   QGIVRQRVIPV  IVRQRVIPV   32    1.7  ZEBOVgp1  HLA-DRB1*0101    10
        core  allele    score      peptide      name  pos
1  IVRQRVIPV       2  3.46104  QGIVRQRVIPV  ZEBOVgp1   32
0  FRLMRTNFL       3  5.60000  VIFRLMRTNFL  ZEBOVgp1  198
2  LLNLSGVNN       2  2.47414  RLLNLSGVNNL  Z

Here `name` is the protein identifier from the input file (a locus tag, for example) and `score` is the predicted binding score, whose scale differs between methods. MHC-II methods can be run for varying peptide lengths; the `core` is usually, but not always, the highest-scoring core within that peptide (n-mer).
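The relationship between a peptide and its core can be sketched as follows. This is only an illustration of the usual case, in which the core is the best-scoring 9-residue window inside a longer peptide: `toy_core_score` and `best_core` are hypothetical names, the hydrophobic-residue count stands in for a real method's scoring, and the 13-mer is made up.

```python
def toy_core_score(core):
    # Placeholder for a real method's scoring: count hydrophobic residues.
    return sum(1 for aa in core if aa in 'FILMVWY')


def best_core(peptide, core_length=9):
    """Return (core, score) for the highest-scoring window in a peptide."""
    cores = [peptide[i:i + core_length]
             for i in range(len(peptide) - core_length + 1)]
    scored = [(toy_core_score(c), c) for c in cores]
    # max keeps the first window when several share the top score
    score, core = max(scored, key=lambda item: item[0])
    return core, score


# A made-up 13-mer containing five candidate 9-mer cores
core, score = best_core('KDEFRLMRTNFLK')
```

Because neighbouring n-mers overlap, several of them can contain the same best window, which is why consecutive rows in the results table above share a core.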

## Global predictions for a genome

One application of immunoinformatics is to screen a genome for likely candidate antigens for further study. The approach used here is to perform predictions for all protein sequences and select potential antigens based on the promiscuity of their predicted binders, that is, binders predicted for multiple alleles. You can also choose a global percentage cut-off and a minimum number of alleles.
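The selection logic can be sketched in plain Python. This is a simplified illustration rather than the library's code: it assumes the cut-off is applied per allele as a score quantile and that binders are grouped by core, and the helper names (`cutoff_at`, `promiscuous_binders`), the nearest-rank quantile and all scores are made up for the example.

```python
from collections import defaultdict


def cutoff_at(scores, fraction):
    """Nearest-rank cut-off: the score found at `fraction` of the sorted list."""
    ordered = sorted(scores)
    index = min(int(fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]


def promiscuous_binders(rows, fraction=0.96, n=2):
    """Return {core: allele count} for cores that bind at least n alleles."""
    # 1. one score cut-off per allele
    by_allele = defaultdict(list)
    for row in rows:
        by_allele[row['allele']].append(row['score'])
    cutoffs = dict((a, cutoff_at(s, fraction)) for a, s in by_allele.items())

    # 2. collect the alleles for which each core passes the cut-off
    alleles_per_core = defaultdict(set)
    for row in rows:
        if row['score'] >= cutoffs[row['allele']]:
            alleles_per_core[row['core']].add(row['allele'])

    # 3. keep cores found in n or more alleles
    return dict((core, len(found)) for core, found in alleles_per_core.items()
                if len(found) >= n)


# Made-up scores: allele -> {core label: score}
scores = {
    'HLA-DRB1*0101': {'core1': 5.0, 'core2': 1.0, 'core3': 0.5, 'core4': 3.0},
    'HLA-DRB1*0305': {'core1': 4.0, 'core2': 0.2, 'core3': 3.5, 'core4': 0.1},
    'HLA-DRB1*0401': {'core1': 2.5, 'core2': 0.1, 'core3': 2.0, 'core4': 1.5},
}
rows = [{'allele': allele, 'core': core, 'score': score}
        for allele, cores in scores.items()
        for core, score in cores.items()]

# A low fraction is used only because the toy data set is tiny
result = promiscuous_binders(rows, fraction=0.5, n=2)
```

Using a per-allele cut-off rather than one absolute threshold means alleles whose scores run systematically higher or lower still contribute a comparable share of binders.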

In [56]:
path = 'testing'
method = 'tepitope'

#get binders from existing results (res is a DataFrame)
P.getBinders(data=res)
#get all promiscuous binders (>=n) for an entire set of related proteins, i.e. in a genome
b = analysis.getAllBinders(path, method=method, n=2)
print(b[:10])


getting binders..
score
testing/quantiles.csv
HLA-DRB1*0101    0.800
HLA-DRB1*0305    2.835
HLA-DRB1*0401    2.780
Name: 0.96, dtype: float64
        core  allele    score      peptide      name  pos
3  YHKILTAGL       2  2.50000  MDYHKILTAGL  ZEBOVgp1   18
2  IVRQRVIPV       3  3.46104  QGIVRQRVIPV  ZEBOVgp1   32
0  FLLMLCLHH       2  3.68000  DSFLLMLCLHH  ZEBOVgp1   70
1  FRLMRTNFL       3  5.60000  VIFRLMRTNFL  ZEBOVgp1  198
3  YRQSRSASQ       2  4.50000  GEYRQSRSASQ  ZEBOVgp5   36
1  LLLLIARKT       2  2.72170  LLLLIARKTCG  ZEBOVgp5   99
2  LRALLTLCA       2  4.60000  SKLRALLTLCA  ZEBOVgp5  164
0  CESSAVVVS       2  3.32760  LPCESSAVVVS  ZEBOVgp5  248
1  VVQQQTIAS       2  4.58000  ATVVQQQTIAS  ZEBOVgp2   93
0  FNNLNSTTS       2  3.70000  EAFNNLNSTTS  ZEBOVgp2  197


## Epitopes: binder clustering

Epitope clustering has been previously observed to be an indicator of T cell epitope regions. The `findClusters` function in the `analysis` module detects clusters automatically from a set of predicted binders from one or more proteins, so it can be applied to a whole genome.

The result is a table of sequence regions giving, for each cluster, the number of binders, the cluster size (`end` minus `start`) and the density (binders divided by cluster size).
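One simple way to detect such clusters is to sort the binder start positions and split them wherever the gap exceeds a distance threshold. The sketch below is an assumption about the general approach, not the `findClusters` source: the function name `find_clusters`, the rule that a cluster ends at the last binder's start plus the peptide length, and the positions themselves are illustrative.

```python
def find_clusters(positions, dist=10, minsize=2, length=11):
    """Group binder start positions whose neighbouring gaps are at most dist."""
    groups = []
    current = []
    for pos in sorted(positions):
        # a gap larger than dist closes the current group
        if current and pos - current[-1] > dist:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)

    rows = []
    for group in groups:
        if len(group) < minsize:
            continue  # too few binders to count as a cluster
        start = group[0]
        end = group[-1] + length  # assumed: cluster ends with the last peptide
        size = end - start
        rows.append({'start': start, 'end': end, 'binders': len(group),
                     'clustersize': size,
                     'density': len(group) / float(size)})
    return rows


# Illustrative binder positions within one protein (11-mer peptides)
clusters = find_clusters([50, 51, 52, 300, 784, 786])
```

With these inputs the isolated binder at position 300 is dropped, and the two remaining groups each span 13 residues, giving densities of 3/13 and 2/13.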

In [60]:
#find clusters of binders in these results
cl = analysis.findClusters(b, method, dist=10, minsize=2)
print(cl)

7 proteins with binders
11 clusters found in 3 proteins

        name  start   end  binders  clustersize   density    method
4   ZEBOVgp7     50    63        3           13  0.230769  tepitope
8   ZEBOVgp7   1801  1823        3           22  0.136364  tepitope
6   ZEBOVgp7   1376  1388        2           12  0.166667  tepitope
0   ZEBOVgp3    103   116        2           13  0.153846  tepitope
1   ZEBOVgp3    169   182        2           13  0.153846  tepitope
2   ZEBOVgp3    247   260        2           13  0.153846  tepitope
5   ZEBOVgp7    784   797        2           13  0.153846  tepitope
7   ZEBOVgp7   1652  1665        2           13  0.153846  tepitope
3   ZEBOVgp6     24    38        2           14  0.142857  tepitope
10  ZEBOVgp7   1997  2011        2           14  0.142857  tepitope
9   ZEBOVgp7   1882  1901        2           19  0.105263  tepitope
