# MHC Epitope Prediction

Reference:

- [`epitopepredict`](https://epitopepredict.readthedocs.io/en/latest/description.html#prediction-algorithms)
  - [Python API](https://epitopepredict.readthedocs.io/en/latest/examples.html)

- USA300
- NCTC8325

In [1]:
import os
import warnings

In [2]:
warnings.simplefilter('ignore', FutureWarning)

In [3]:
import epitopepredict as ep
from epitopepredict import base, sequtils, analysis, plotting

# Get list of predictors

| name	| description |
|:- | :- |
| basicmhc1	| built-in MHC-class I predictor |
| tepitope | implements the TEPITOPEPan method, built in (MHC-II)|
| netMHCpan | http://www.cbs.dtu.dk/services/NetMHCpan/ (MHC-I) |
| netMHCIIpan | http://www.cbs.dtu.dk/services/NetMHCIIpan/ (MHC-II) |
| mhcflurry | https://github.com/openvax/mhcflurry (MHC-I) |
| IEDB MHC-I tools | http://tools.immuneepitope.org/mhci/download/ |

Only `tepitope`, `netmhciipan`, `netmhcpan`, `mhcflurry` are installed locally.


In [4]:
print(base.predictors)

['basicmhc1', 'tepitope', 'netmhciipan', 'netmhcpan', 'mhcflurry', 'iedbmhc1', 'iedbmhc2']


## S. aureus analysis

Use hihgly expressedd proteins in `sa_highly_expressed_genes.fasta`

In [5]:
ls seqs

NCTC8325_UP000008816.fasta  USA300_UP000001939.fasta


run predictions for a protein sequence:

In [6]:
pids = ['NCTC8325_UP000008816', 'USA300_UP000001939']
alleles = """HLA-DRB1*04:01
HLA-DRB1*04:02
HLA-DRB1*15:01
HLA-DRB1*01:01""".split()
predictors = ['netmhciipan']

In [7]:
%%time

np = 8
for pid in pids:
    df = sequtils.fasta_to_dataframe(f'seqs/{pid}.fasta')
    for predictor in predictors:
        p = base.get_predictor()
        p.predict_proteins(df, 
                           length=11, 
                           alleles=alleles,
                           save=True, 
                           path=f'results/{pid}.{predictor}',
                           threads=np)

took 41.21 seconds
predictions done for 20021 sequences in 4 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/results/NCTC8325_UP000008816.netmhciipan
predictions done for 2607 sequences in 4 alleles
results saved to /Users/ccc14/learning/learn-immune-ds/hla/results/USA300_UP000001939.netmhciipan
CPU times: user 148 ms, sys: 73 ms, total: 221 ms
Wall time: 49 s


### Load and analyze

In [8]:
pid = 'NCTC8325_UP000008816'
predictor = 'netmhciipan'
path = f'results/{pid}.{predictor}'
p.load(path=path)

get all the binders using the current data loaded into the predictor

In [9]:
#default is to use percentile cutoff per allele, returns a dataframe
binders = p.get_binders(cutoff=.95)

In [10]:
binders.shape

(904697, 7)

In [11]:
binders.sort_values('score', ascending=False).head(25)

Unnamed: 0,allele,core,name,peptide,pos,rank,score
4,HLA-DRB1*1501,VRIFQNLII,tr_A0A2T4Q2K9_A0A2T4Q2K9_STAWA,YSVRIFQNLII,4,1.0,9.4
5,HLA-DRB1*1501,VRIFQNLII,tr_A0A2T4Q2K9_A0A2T4Q2K9_STAWA,SVRIFQNLIIN,5,1.0,9.4
6,HLA-DRB1*1501,VRIFQNLII,tr_A0A2T4Q2K9_A0A2T4Q2K9_STAWA,VRIFQNLIINN,6,1.0,9.4
425,HLA-DRB1*1501,LRMYGNIDI,sp_Q5HMP3_AACA_STAEQ,LRMYGNIDIEK,425,1.0,9.0
423,HLA-DRB1*1501,LRMYGNIDI,sp_Q5HMP3_AACA_STAEQ,DILRMYGNIDI,423,1.0,9.0
424,HLA-DRB1*1501,LRMYGNIDI,sp_Q5HMP3_AACA_STAEQ,ILRMYGNIDIE,424,1.0,9.0
97,HLA-DRB1*1501,MRLFARLSL,tr_Q2G049_Q2G049_STAA8,SMRLFARLSLD,97,1.0,9.0
98,HLA-DRB1*1501,MRLFARLSL,tr_Q2G049_Q2G049_STAA8,MRLFARLSLDS,98,1.0,9.0
96,HLA-DRB1*1501,MRLFARLSL,tr_Q2G049_Q2G049_STAA8,ASMRLFARLSL,96,1.0,9.0
269,HLA-DRB1*1501,VRLYVSLDI,tr_A0A4U9T8R6_A0A4U9T8R6_STACP,IVRLYVSLDID,269,1.0,8.9


get binders for only one protein by top median rank

In [12]:
name = df.iloc[0,0]
name

'sp_Q2FHP2_LSPA_STAA3'

In [13]:
p.get_binders(name=name, cutoff=5, cutoff_method='rank').sort_values('rank')

Unnamed: 0,allele,core,name,peptide,pos,rank,score
142,HLA-DRB1*0401,YVIQEFNKA,tr_A0A1F1BWF7_A0A1F1BWF7_9STAP,LFYVIQEFNKA,142,1.0,3.90
43,HLA-DRB1*1501,VLVYLLIQS,tr_A0A8I1BDL0_A0A8I1BDL0_STAEP,FVLVYLLIQSI,43,1.0,5.40
42,HLA-DRB1*1501,VLVYLLIQS,tr_A0A8I1BDL0_A0A8I1BDL0_STAEP,QFVLVYLLIQS,42,1.0,5.40
8,HLA-DRB1*0402,IVLLNSLSK,tr_A0A8I1BDL0_A0A8I1BDL0_STAEP,IVLLNSLSKYI,8,1.0,4.20
7,HLA-DRB1*0402,IVLLNSLSK,tr_A0A8I1BDL0_A0A8I1BDL0_STAEP,LIVLLNSLSKY,7,1.0,4.20
...,...,...,...,...,...,...,...
4,HLA-DRB1*0101,MKRERMLTI,tr_A0A0E1VN37_A0A0E1VN37_STAA3,IPMKRERMLTI,4,5.0,-0.40
5,HLA-DRB1*0101,MKRERMLTI,tr_A0A0E1VN37_A0A0E1VN37_STAA3,PMKRERMLTIR,5,5.0,-0.40
6,HLA-DRB1*0101,MKRERMLTI,tr_A0A0E1VN37_A0A0E1VN37_STAA3,MKRERMLTIRV,6,5.0,-0.40
0,HLA-DRB1*0401,VRKANYTLH,tr_A0A654CAD3_A0A654CAD3_9STAP,MEVRKANYTLH,0,5.0,-0.32


get all promiscuous binders

In [14]:
pb = p.promiscuous_binders(n=2, cutoff=.95)

In [15]:
pb.shape

(67136, 8)

In [16]:
pb.head(3)

Unnamed: 0,peptide,pos,name,alleles,core,score,mean,median_rank
215168,IRIYNTMCIEK,37,tr_A0A0H2WZD3_A0A0H2WZD3_STAAC,4,IRIYNTMCI,8.56,4.7375,1.0
33043,CVFRIYTNLSL,3,tr_A0A0H2WW19_A0A0H2WW19_STAAC,4,FRIYTNLSL,8.3,4.97,1.0
326205,LRLFMLLTLIS,39,sp_Q2FZT3_Y907_STAA8,4,LRLFMLLTL,8.2,4.945,1.0


find clusters of binders in these results

In [17]:
cl = analysis.find_clusters(pb, dist=9, min_size=3)

In [18]:
cl

Unnamed: 0,name,start,end,binders,length
5465,tr_A0A380DTS1_A0A380DTS1_STAAU,1,45,13,44
9186,tr_A0A7Z8E263_A0A7Z8E263_STACP,138,180,13,42
5506,tr_A0A380DUA4_A0A380DUA4_STAAU,0,49,12,49
2818,tr_A0A1F1C1P9_A0A1F1C1P9_9STAP,2,50,12,48
6113,tr_A0A380E9B4_A0A380E9B4_STAAU,13,61,12,48
...,...,...,...,...,...
13088,tr_Q93I83_Q93I83_STAAU,440,452,2,12
13089,tr_Q93IA9_Q93IA9_STAAU,8,20,2,12
13093,tr_Q9FDP4_Q9FDP4_STAAU,5,17,2,12
13097,tr_Q9FDP6_Q9FDP6_STAAU,315,327,2,12
