# MHC Epitope Prediction

Reference:

- [`epitopepredict`](https://epitopepredict.readthedocs.io/en/latest/description.html#prediction-algorithms)
  - [Python API](https://epitopepredict.readthedocs.io/en/latest/examples.html)

In [1]:
import os
import warnings

In [2]:
warnings.simplefilter('ignore', FutureWarning)

In [3]:
import epitopepredict as ep
from epitopepredict import base, sequtils, analysis, plotting

## Get list of predictors

| name | description |
|:- | :- |
| basicmhc1 | built-in MHC class I predictor |
| tepitope | implements the TEPITOPEPan method, built in (MHC-II) |
| netMHCpan | http://www.cbs.dtu.dk/services/NetMHCpan/ (MHC-I) |
| netMHCIIpan | http://www.cbs.dtu.dk/services/NetMHCIIpan/ (MHC-II) |
| mhcflurry | https://github.com/openvax/mhcflurry (MHC-I) |
| IEDB MHC-I tools | http://tools.immuneepitope.org/mhci/download/ |

Of these, only `tepitope`, `netmhciipan`, `netmhcpan` and `mhcflurry` are installed locally.


In [4]:
print(base.predictors)

['basicmhc1', 'tepitope', 'netmhciipan', 'netmhcpan', 'mhcflurry', 'iedbmhc1', 'iedbmhc2']


## S. aureus analysis

Use the highly expressed proteins in `sa_highly_expressed_genes.fasta`.

In [5]:
pid = 'sa_highly_expressed_genes'

In [6]:
df = sequtils.fasta_to_dataframe(f'{pid}.fasta')

Run predictions for every protein in the FASTA file against four HLA-DRB1 alleles:

In [7]:
alleles = """HLA-DRB1*04:01
HLA-DRB1*04:02
HLA-DRB1*15:01
HLA-DRB1*12:01""".split()

In [8]:
%%time

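n_threads = 8  # named n_threads because `np` conventionally refers to numpy
for predictor in ['tepitope', 'netmhciipan']:
    # Instantiate each predictor by name
    p = base.get_predictor(predictor)
    p.predict_proteins(df,
                       length=11,
                       alleles=alleles,
                       save=True,
                       path=f'{pid}.{predictor}',
                       threads=n_threads)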

predictions done for 27 sequences in 4 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/sa_highly_expressed_genes.tepitope
predictions done for 27 sequences in 4 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/sa_highly_expressed_genes.netmhciipan
CPU times: user 78.5 ms, sys: 69.5 ms, total: 148 ms
Wall time: 2.52 s
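
With `length=11`, each protein is scored as a series of overlapping 11-residue peptides, one per start position. A minimal sketch of that windowing in plain Python, assuming a step of one residue. `sliding_peptides` is a hypothetical helper for illustration, not part of `epitopepredict`, and the sequence is a toy string:

```python
def sliding_peptides(sequence, length=11):
    """Yield (pos, peptide) for every window of `length` residues."""
    for pos in range(len(sequence) - length + 1):
        yield pos, sequence[pos:pos + length]

# Toy 14-residue sequence, made up for illustration
peptides = list(sliding_peptides("ACDEFGHIKLMNPQ", length=11))
for pos, peptide in peptides:
    print(pos, peptide)
```

A sequence of N residues yields N - 11 + 1 peptides, so neighbouring peptides differ by a single residue at each end.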


### Load and analyze

In [9]:
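predictor = 'netmhciipan'
path = f'{pid}.{predictor}'
# Re-create the predictor explicitly rather than relying on the loop variable
p = base.get_predictor(predictor)
p.load(path=path)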

Get all binders from the data currently loaded into the predictor:

In [10]:
# Default method is a percentile cutoff per allele; returns a DataFrame
binders = p.get_binders(cutoff=.95)

In [11]:
binders.shape

(2054, 7)
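
To make the cutoff concrete: a percentile cutoff of `.95` per allele means a peptide counts as a binder when its score is in the top 5% for that allele. A rough sketch of the idea using only the standard library and made-up scores. `percentile_binders` is a hypothetical helper, and the threshold here is derived from the scores passed in; the library may compute its thresholds from a different reference distribution.

```python
from collections import defaultdict

def percentile_binders(rows, cutoff=0.95):
    """Keep rows scoring in the top (1 - cutoff) fraction for their allele.

    rows: list of (allele, peptide, score) tuples; higher score = stronger binder.
    """
    scores_by_allele = defaultdict(list)
    for allele, _, score in rows:
        scores_by_allele[allele].append(score)

    thresholds = {}
    for allele, scores in scores_by_allele.items():
        ordered = sorted(scores)
        # Number of peptides to keep for this allele (at least one)
        n_keep = max(1, round((1 - cutoff) * len(ordered)))
        thresholds[allele] = ordered[-n_keep]

    return [row for row in rows if row[2] >= thresholds[row[0]]]

# Made-up data: allele_B scores are on a 10x larger scale than allele_A
rows = [("allele_A", f"pep{i}", float(i)) for i in range(20)]
rows += [("allele_B", f"pep{i}", 10.0 * i) for i in range(20)]
binders_toy = percentile_binders(rows, cutoff=0.95)
```

Because the threshold is set per allele, both toy alleles contribute their top-scoring peptide even though their score scales differ.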

In [12]:
binders.sort_values('score', ascending=False).head(25)

Unnamed: 0,allele,core,name,peptide,pos,rank,score
0,HLA-DRB1*1501,MRIIKYLTI,sp_Q2FHU8_ISDE_STAA3,MRIIKYLTILV,0,1.0,7.3
184,HLA-DRB1*1201,LVILLVITI,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,VLLVILLVITI,184,1.0,6.88208
186,HLA-DRB1*1201,LVILLVITI,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,LVILLVITILL,186,1.0,6.88208
185,HLA-DRB1*1201,LVILLVITI,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,LLVILLVITIL,185,1.0,6.88208
110,HLA-DRB1*1501,MVLFALILF,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,TMVLFALILFQ,110,1.0,6.8
111,HLA-DRB1*1501,MVLFALILF,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,MVLFALILFQG,111,1.0,6.8
109,HLA-DRB1*1501,MVLFALILF,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,ITMVLFALILF,109,1.0,6.8
39,HLA-DRB1*1501,LRKFILIIL,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,LRKFILIILVG,39,1.0,6.8
38,HLA-DRB1*1501,LRKFILIIL,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,RLRKFILIILV,38,1.0,6.8
37,HLA-DRB1*1501,LRKFILIIL,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,SRLRKFILIIL,37,1.0,6.8


Get binders for a single protein, selecting by rank (`cutoff_method='rank'`, cutoff of 5) instead of the default percentile cutoff:

In [13]:
name = df.iloc[0,0]
name

'sp_Q2FH96_GUAC_STAA3'

In [14]:
p.get_binders(name=name, cutoff=5, cutoff_method='rank').sort_values('rank')

Unnamed: 0,allele,core,name,peptide,pos,rank,score
222,HLA-DRB1*0401,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,ASMVMIGSLFA,222,1.0,5.0
224,HLA-DRB1*1201,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,MVMIGSLFAAH,224,1.0,4.79558
223,HLA-DRB1*1201,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,SMVMIGSLFAA,223,1.0,4.79558
222,HLA-DRB1*1201,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,ASMVMIGSLFA,222,1.0,4.79558
224,HLA-DRB1*1501,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,MVMIGSLFAAH,224,1.0,6.6
223,HLA-DRB1*1501,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,SMVMIGSLFAA,223,1.0,6.6
224,HLA-DRB1*0402,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,MVMIGSLFAAH,224,1.0,5.58
222,HLA-DRB1*1501,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,ASMVMIGSLFA,222,1.0,6.6
222,HLA-DRB1*0402,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,ASMVMIGSLFA,222,1.0,5.58
223,HLA-DRB1*0401,MVMIGSLFA,sp_Q2FH96_GUAC_STAA3,SMVMIGSLFAA,223,1.0,5.0


Get all promiscuous binders, i.e. peptides predicted to bind at least `n` alleles:

In [15]:
pb = p.promiscuous_binders(n=2, cutoff=.95)

In [16]:
pb.shape

(213, 8)

In [17]:
pb.head(3)

Unnamed: 0,peptide,pos,name,alleles,core,score,mean,median_rank
46,ASMVMIGSLFA,222,sp_Q2FH96_GUAC_STAA3,4,MVMIGSLFA,6.6,5.493895,1.0
58,AVQIMQTLKML,170,tr_A0A0H2XFV3_A0A0H2XFV3_STAA3,4,VQIMQTLKM,6.25442,5.623605,1.0
660,LLLLILLTIIS,10,tr_A0A0H2XIG1_A0A0H2XIG1_STAA3,4,LLILLTIIS,5.3,5.0366,1.0
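
The counting behind promiscuous binders can be pictured as follows: collect the alleles each peptide binds and keep peptides with at least `n` of them. A simplified sketch in plain Python with made-up allele and peptide labels. It assumes binders have already been selected per allele, and `promiscuous` is a hypothetical helper, not the library's implementation:

```python
from collections import defaultdict

def promiscuous(binder_rows, n=2):
    """Return {peptide: allele_count} for peptides bound by at least n alleles."""
    alleles_per_peptide = defaultdict(set)
    for allele, peptide in binder_rows:
        alleles_per_peptide[peptide].add(allele)
    return {
        peptide: len(alleles)
        for peptide, alleles in alleles_per_peptide.items()
        if len(alleles) >= n
    }

# Made-up (allele, peptide) pairs for illustration
toy_binders = [
    ("allele_1", "pep_a"), ("allele_2", "pep_a"), ("allele_3", "pep_a"),
    ("allele_1", "pep_b"),
    ("allele_2", "pep_c"), ("allele_3", "pep_c"),
]
counts = promiscuous(toy_binders, n=2)
```

Here `pep_b` is dropped because it binds only one allele. The `alleles` column in `pb` appears to hold the analogous count.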


Find clusters of promiscuous binders that lie close together within the same protein:

In [18]:
cl = analysis.find_clusters(pb, dist=9, min_size=3)

In [19]:
cl

Unnamed: 0,name,start,end,binders,length
4,sp_Q2FHU7_ISDF_STAA3,122,163,8,41
19,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,178,206,7,28
17,tr_A0A0H2XDN8_A0A0H2XDN8_STAA3,99,126,7,27
40,tr_A0A0H2XIG1_A0A0H2XIG1_STAA3,134,167,6,33
7,sp_Q2FHU8_ISDE_STAA3,0,30,6,30
36,tr_A0A0H2XI76_A0A0H2XI76_STAA3,0,27,6,27
31,tr_A0A0H2XEB0_A0A0H2XEB0_STAA3,0,22,6,22
38,tr_A0A0H2XIG1_A0A0H2XIG1_STAA3,4,24,5,20
39,tr_A0A0H2XIG1_A0A0H2XIG1_STAA3,89,119,4,30
23,tr_A0A0H2XE68_A0A0H2XE68_STAA3,8,34,4,26
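
One way to picture the clustering step is a gap-based grouping of binder start positions within a protein. The sketch below assumes `dist` is the largest allowed gap between neighbouring binders and `min_size` the smallest number of binders per cluster. That is an interpretation of the parameters, not the library's documented algorithm, which may differ. `cluster_positions` is a hypothetical helper and the positions are made up:

```python
def cluster_positions(positions, dist=9, min_size=3):
    """Group positions into runs whose neighbours are at most `dist` apart.

    Returns (start, end, count) for every run with at least `min_size` members.
    """
    clusters = []
    run = []
    for pos in sorted(positions):
        # A gap larger than `dist` closes the current run
        if run and pos - run[-1] > dist:
            if len(run) >= min_size:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    # Close the final run
    if len(run) >= min_size:
        clusters.append((run[0], run[-1], len(run)))
    return clusters

# Made-up binder start positions within one protein
toy_positions = [3, 5, 10, 40, 41, 100, 104, 109, 115]
toy_clusters = cluster_positions(toy_positions, dist=9, min_size=3)
```

The pair at 40 and 41 is discarded because it has fewer than `min_size` members.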
