# MHC Epitope Prediction

Reference:

- [`epitopepredict`](https://epitopepredict.readthedocs.io/en/latest/description.html#prediction-algorithms)
  - [Python API](https://epitopepredict.readthedocs.io/en/latest/examples.html)

In [1]:
import os
import warnings

In [2]:
warnings.simplefilter('ignore', FutureWarning)

In [3]:
import epitopepredict as ep
from epitopepredict import base, sequtils, analysis, plotting

# Get list of predictors

| name	| description |
|:- | :- |
| basicmhc1	| built-in MHC-class I predictor |
| tepitope | implements the TEPITOPEPan method, built in (MHC-II)|
| netMHCpan | http://www.cbs.dtu.dk/services/NetMHCpan/ (MHC-I) |
| netMHCIIpan | http://www.cbs.dtu.dk/services/NetMHCIIpan/ (MHC-II) |
| mhcflurry | https://github.com/openvax/mhcflurry (MHC-I) |
| IEDB MHC-I tools | http://tools.immuneepitope.org/mhci/download/ |

Only `tepitope`, `netmhciipan`, `netmhcpan`, `mhcflurry` are installed locally.


In [4]:
print(base.predictors)

['basicmhc1', 'tepitope', 'netmhciipan', 'netmhcpan', 'mhcflurry', 'iedbmhc1', 'iedbmhc2']


## S. aureus analysis

Proteomes · Staphylococcus aureus (strain NCTC 8325 / PS 47)

https://www.uniprot.org/proteomes/UP000008816

In [5]:
pid = 'UP000008816_93061'

In [6]:
df = sequtils.fasta_to_dataframe(f'{pid}.fasta')

run predictions for a protein sequence:

In [7]:
alleles = ["HLA-DRB1*01:01", "HLA-DRB1*03:05"]

In [8]:
%%time

np = 8
for predictor in ['tepitope', 'netmhciipan']:
    p = base.get_predictor()
    p.predict_proteins(df, 
                       length=11, 
                       alleles=alleles,
                       save=True, 
                       path=f'{pid}.{predictor}',
                       threads=np)

took 10.1 seconds
predictions done for 2889 sequences in 2 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/UP000008816_93061.tepitope
predictions done for 2889 sequences in 2 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/UP000008816_93061.netmhciipan
CPU times: user 50.2 ms, sys: 56.2 ms, total: 106 ms
Wall time: 20.1 s


### Load and analyze

In [9]:
predictor = 'netmhciipan'
path = f'{pid}.{predictor}'
p.load(path=path)

get all the binders using the current data loaded into the predictor

In [10]:
#default is to use percentile cutoff per allele, returns a dataframe
binders = p.get_binders(cutoff=.95)

In [11]:
binders.shape

(106176, 7)

In [12]:
binders.head(3)

Unnamed: 0,allele,core,name,peptide,pos,rank,score
41,HLA-DRB1*0101,CRFSRPIPS,tr_Q2FX82_Q2FX82_STAA8,SPCRFSRPIPS,41,1.0,2.4
42,HLA-DRB1*0101,CRFSRPIPS,tr_Q2FX82_Q2FX82_STAA8,PCRFSRPIPSA,42,1.0,2.4
43,HLA-DRB1*0101,CRFSRPIPS,tr_Q2FX82_Q2FX82_STAA8,CRFSRPIPSAG,43,1.0,2.4


get binders for only one protein by top median rank

In [13]:
name = df.iloc[0,0]
name

'sp_O34090_HEM3_STAA8'

In [14]:
p.get_binders(name=name, cutoff=5, cutoff_method='rank').sort_values('rank')

Unnamed: 0,allele,core,name,peptide,pos,rank,score
223,HLA-DRB1*0101,CVTAERTFL,sp_O34090_HEM3_STAA8,AKCVTAERTFL,223,1.0,0.89
224,HLA-DRB1*0101,CVTAERTFL,sp_O34090_HEM3_STAA8,KCVTAERTFLA,224,1.0,0.89
225,HLA-DRB1*0101,CVTAERTFL,sp_O34090_HEM3_STAA8,CVTAERTFLAE,225,1.0,0.89
215,HLA-DRB1*0305,VHNDEVAKC,sp_O34090_HEM3_STAA8,SKVHNDEVAKC,215,1.0,3.766
216,HLA-DRB1*0305,VHNDEVAKC,sp_O34090_HEM3_STAA8,KVHNDEVAKCV,216,1.0,3.766
217,HLA-DRB1*0305,VHNDEVAKC,sp_O34090_HEM3_STAA8,VHNDEVAKCVT,217,1.0,3.766
57,HLA-DRB1*0101,FVKEIQHEL,sp_O34090_HEM3_STAA8,GLFVKEIQHEL,57,4.0,0.5
58,HLA-DRB1*0101,FVKEIQHEL,sp_O34090_HEM3_STAA8,LFVKEIQHELF,58,4.0,0.5
59,HLA-DRB1*0101,FVKEIQHEL,sp_O34090_HEM3_STAA8,FVKEIQHELFE,59,4.0,0.5
125,HLA-DRB1*0305,LRRGAQILS,sp_O34090_HEM3_STAA8,SSLRRGAQILS,125,4.0,3.43064


get all promiscuous binders

In [15]:
pb = p.promiscuous_binders(n=2, cutoff=.95)

In [16]:
pb.shape

(6943, 8)

In [17]:
pb.head(3)

Unnamed: 0,peptide,pos,name,alleles,core,score,mean,median_rank
25168,GWRIIDPIISI,191,tr_Q2G1I9_Q2G1I9_STAA8,2,WRIIDPIIS,6.04834,5.17417,1.0
2631,ALVVLDGVSLI,190,tr_Q2FX22_Q2FX22_STAA8,2,LVVLDGVSL,5.96548,4.68274,1.0
19908,FVILPVVMSIG,268,tr_Q2FW78_Q2FW78_STAA8,2,FVILPVVMS,5.9426,4.8713,1.0


find clusters of binders in these results

In [18]:
cl = analysis.find_clusters(pb, dist=9, min_size=3)

In [19]:
cl

Unnamed: 0,name,start,end,binders,length
992,tr_Q2G129_Q2G129_STAA8,457,506,9,49
544,tr_Q2FWI3_Q2FWI3_STAA8,247,278,7,31
568,tr_Q2FWV7_Q2FWV7_STAA8,1,32,7,31
552,tr_Q2FWL9_Q2FWL9_STAA8,349,379,7,30
443,tr_Q2FVT9_Q2FVT9_STAA8,158,179,7,21
...,...,...,...,...,...
1280,tr_Q2G2N5_Q2G2N5_STAA8,225,237,2,12
1293,tr_Q2G2T5_Q2G2T5_STAA8,71,83,2,12
1297,tr_Q2G2U7_Q2G2U7_STAA8,150,162,2,12
1300,tr_Q2G2V7_Q2G2V7_STAA8,110,122,2,12
