# MHCGlobe use cases
-----------

### 1. Predict peptide-MHC binding affinity

MHCGlobe predicts both a binding score and a binding affinity (nM), reported as `mhcglobe_score` and `mhcglobe_affinity`, respectively.

`mhcglobe.ensemble(train_type='full')` returns the MHCGlobe model trained on the full database of MHC-peptide binding instances in the present study, including both human and non-human pMHC data. 

In [1]:
import os
import sys
import pandas as pd

sys.path.append('./src/')
import mhcglobe, mhcperf

In [2]:
# Load MHCGlobe class object containing the fully trained model.
mhcglobe_model = mhcglobe.ensemble(train_type='full')

# Load binding data as CSV. Required columns are `allele` and `peptide`.
example_binding_data = './example/example_binding_data.csv'
prediction_cols = ['allele', 'peptide']
pMHC_data = pd.read_csv(example_binding_data, usecols=prediction_cols)

# Predict peptide-MHC binding with MHCGlobe
mhcglobe_model.predict_on_dataframe(pMHC_data)

Unnamed: 0,allele,peptide,mhcglobe_affinity,mhcglobe_score
0,HLA-B*15:01,AQMWSLMYF,64.439289,0.614990
1,HLA-A*33:01,DIDILQTNSR,396.716782,0.447011
2,HLA-B*38:01,QPKKAAAAL,20009.974714,0.084641
3,HLA-A*31:01,FLALGFFLR,117.473925,0.559490
4,HLA-B*07:02,RPQKRPSCI,104.042288,0.570712
...,...,...,...,...
95,HLA-A*30:01,KAFNHASVK,118.069387,0.559023
96,HLA-A*02:01,WMMAMKYPI,61.744092,0.618939
97,HLA-A*02:03,FLMGFNRDV,30.050617,0.685494
98,HLA-A*02:01,LLSTTEWQV,91.183992,0.582905
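
The two output columns appear to be related by a simple log transform. The rows shown above are consistent with `score = 1 - log(affinity) / log(50000)`. This is a minimal sketch of that relationship, inferred from the example output only: the 50,000 nM ceiling and the helper names `affinity_to_score` and `score_to_affinity` are assumptions and not part of the MHCGlobe API.

```python
import math

# Assumed affinity ceiling in nM. Inferred from the example output above,
# not taken from the MHCGlobe source.
MAX_AFFINITY_NM = 50000.0


def affinity_to_score(affinity_nm):
    """Map an affinity in nM onto a score where higher means stronger binding."""
    return 1.0 - math.log(affinity_nm) / math.log(MAX_AFFINITY_NM)


def score_to_affinity(score):
    """Invert affinity_to_score: recover the affinity in nM from a score."""
    return MAX_AFFINITY_NM ** (1.0 - score)


# (mhcglobe_affinity, mhcglobe_score) pairs copied from the table above.
reported = [
    (64.439289, 0.614990),
    (396.716782, 0.447011),
    (20009.974714, 0.084641),
]

for affinity, score in reported:
    print(f"{affinity:>12.3f} nM  reported={score:.6f}  recomputed={affinity_to_score(affinity):.6f}")
```

Under this assumption, a lower affinity value in nM (stronger binding) always corresponds to a higher score, so either column can be used to rank peptides.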


### 2. Re-train MHCGlobe on user-defined data

MHCGlobe contains an ensemble of deep neural network models, which can easily be trained on user-defined peptide-MHC binding data. We recommend training MHCGlobe from the initialized weights and biases that were used in the paper so that results are comparable. The recommended initialized MHCGlobe model can be accessed using `mhcglobe.ensemble(train_type='init')`. Note that the example below trains the MHCGlobe architecture on a small subset of the available pMHC binding data to demonstrate how users can re-train MHCGlobe on new user-curated datasets.


In [3]:
# Load example training data.
training_cols = ['allele', 'peptide', 'measurement_value', 'measurement_inequality']
binding_data = pd.read_csv(example_binding_data, usecols=training_cols)

# Example paths to save user re-trained MHCGlobe.
new_model_id   = 'mhcglobe_example'                                              # user defined model identifier.
new_model_path = f'./outputs/example_mhcglobe/{new_model_id}'  # path to save re-trained model.

# Re-train MHCGlobe
user_mhcglobe = mhcglobe.ensemble(train_type='init').train_ensemble(
        df_train          = binding_data,
        new_mhcglobe_path = new_model_path,
        verbose           = 0)

# Load re-trained MHCGlobe
user_mhcglobe = mhcglobe.ensemble(new_mhcglobe_path=new_model_path)

# Predict peptide-MHC binding with re-trained MHCGlobe.
user_mhcglobe.predict_on_dataframe(binding_data.loc[:, prediction_cols])

Training...
Training complete.


Unnamed: 0,allele,peptide,mhcglobe_affinity,mhcglobe_score
0,HLA-B*15:01,AQMWSLMYF,11.090725,0.777619
1,HLA-A*33:01,DIDILQTNSR,90.539889,0.583560
2,HLA-B*38:01,QPKKAAAAL,707.040463,0.393602
3,HLA-A*31:01,FLALGFFLR,27.750513,0.692854
4,HLA-B*07:02,RPQKRPSCI,61.596063,0.619161
...,...,...,...,...
95,HLA-A*30:01,KAFNHASVK,26.100112,0.698521
96,HLA-A*02:01,WMMAMKYPI,21.628262,0.715891
97,HLA-A*02:03,FLMGFNRDV,8.286035,0.804564
98,HLA-A*02:01,LLSTTEWQV,28.258902,0.691176
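
The cells above read two columns for prediction and four for training. A small pre-flight check can report a missing column before a long training run starts. This is a hedged sketch: `missing_columns` and `check_columns` are hypothetical helpers, not MHCGlobe functions, and they only verify column names, not the values in those columns.

```python
# Column names as used in the notebook cells above.
PREDICTION_COLS = ["allele", "peptide"]
TRAINING_COLS = ["allele", "peptide", "measurement_value", "measurement_inequality"]


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_columns(columns, required):
    """Raise a ValueError that names every missing column."""
    missing = missing_columns(columns, required)
    if missing:
        raise ValueError(f"missing required columns: {missing}")


# A header with only the prediction columns cannot be used for training.
print(missing_columns(["allele", "peptide"], TRAINING_COLS))
```

With pandas, pass the DataFrame header, for example `check_columns(binding_data.columns, TRAINING_COLS)`.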


# MHCPerf use cases
__________

The fully trained MHCPerf model predicts the allele-level performance of a given MHCGlobe model from a featurized representation of that model's training data.

### 1. Estimate allele-level PPV for query MHC alleles given a pMHC binding dataset.

MHCPerf features are created for every query allele from a data count dictionary built from the binding dataset, which relates the dataset to each query allele.

In [4]:
# Load MHCPerf model
mhcperf_model = mhcperf.model()

In [5]:
from paths import DataPaths

# Compute MHCPerf features for query alleles.
query_alleles = ['HLA-A*11:01', 'HLA-B*08:01','HLA-A*30:03','HLA-C*14:02']
mhcglobe_training_data = DataPaths().mhcglobe_full_training_data

perf_features = mhcperf.featurize_from_binding(mhcglobe_training_data, query_alleles)

# Predict PPV for each MHC allele in `query_alleles`
query_allele_ppv_est = mhcperf_model.predict_ppv(perf_features)

# Show MHCPerf allele-level PPV estimates.
pd.DataFrame(zip(query_alleles, query_allele_ppv_est), columns=['allele', 'PPV_est'])

Unnamed: 0,allele,PPV_est
0,HLA-A*11:01,0.710229
1,HLA-B*08:01,0.650147
2,HLA-A*30:03,0.563628
3,HLA-C*14:02,0.629275
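
For reference, positive predictive value (PPV) is the fraction of predicted binders that are true binders, TP / (TP + FP). The sketch below computes it per allele from binary labels. It illustrates the generic definition only: how binders were thresholded or ranked when PPV was measured for MHCGlobe is not stated here, and `ppv` and `ppv_by_allele` are hypothetical helper names.

```python
def ppv(y_true, y_pred):
    """PPV = TP / (TP + FP) for binary labels, where 1 marks a binder."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        return float("nan")  # undefined when nothing is predicted positive
    return tp / (tp + fp)


def ppv_by_allele(records):
    """Group (allele, true_label, predicted_label) records and score each allele."""
    grouped = {}
    for allele, true_label, pred_label in records:
        truths, preds = grouped.setdefault(allele, ([], []))
        truths.append(true_label)
        preds.append(pred_label)
    return {allele: ppv(t, p) for allele, (t, p) in grouped.items()}


# Toy labels for illustration only; these are not MHCGlobe predictions.
toy_records = [
    ("HLA-A*11:01", 1, 1),
    ("HLA-A*11:01", 0, 1),
    ("HLA-A*11:01", 1, 1),
    ("HLA-A*11:01", 1, 0),
    ("HLA-B*08:01", 1, 1),
    ("HLA-B*08:01", 0, 0),
]
print(ppv_by_allele(toy_records))
```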


### 2. MHCPerf Hyperparameter Tuning and Training

MHCPerf is a single neural network that can be retrained on new observations of MHCGlobe PPV performance in response to changes in the MHCGlobe training set. Re-training MHCPerf repeats hyperparameter optimization using grid search. This procedure can be run in the notebook (as shown below) or as a background process using the following script call:

    $ python3 /tf/mhcglobe/mhcperf/mhcperf.py {df_train_path} {new_mhcperf_path}

In [6]:
# MHCPerf training data
# Feature column meanings are shown in Supplementary Figure 1.
mhcperf_train_path  = mhcperf_model.train_df_path
mhcperf_train_df = pd.read_csv(mhcperf_train_path)

# New MHCPerf model save path
new_model_id = 'mhcperf_example'
new_mhcperf_path = f'./outputs/example_mhcperf/{new_model_id}' 

if not os.path.exists(new_mhcperf_path):
    # Train new MHCPerf
    new_mhcperf  = mhcperf.train(
        df_train_path  = mhcperf_train_path,
        model_savepath = new_mhcperf_path,
        verbose        = 0)
else:
    # Load New MHCPerf
    new_mhcperf = mhcperf.model(new_mhcperf_path, mhcperf_train_path)

Running grid search for hyperparameter selection using 3-fold CV.


100%|██████████| 48/48 [18:07<00:00, 22.66s/it]
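
The progress bar above counts 48 iterations, which would fit a grid of 48 hyperparameter combinations if each iteration evaluates one combination (an assumption). The following is a minimal sketch of grid search with k-fold cross-validation in general. The hyperparameter names and values are hypothetical and are not MHCPerf's actual search space, and `score_fn` stands in for training and validating a model on one fold.

```python
from itertools import product

# Hypothetical grid for illustration only: 4 * 3 * 4 = 48 combinations.
grid = {
    "hidden_units": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
}


def expand_grid(grid):
    """Return one dict per hyperparameter combination."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


def kfold_indices(n_samples, k=3):
    """Split range(n_samples) into k (train, validation) index pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits


def grid_search(grid, n_samples, score_fn, k=3):
    """Return the combination with the highest mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        scores = [score_fn(params, train, val)
                  for train, val in kfold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


print(len(expand_grid(grid)))
```

Each combination is scored k times, so the cost grows with both the grid size and the number of folds.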


### 3. Estimate new allele-level PPV values if more training data is collected for MHCGlobe.

This step updates an existing MHCPerf feature DataFrame to estimate how MHCPerf's PPV estimates would change if additional data were collected for a given allele of interest. It avoids recomputing the entire feature DataFrame for each query allele and instead updates the existing feature set that describes the full MHCGlobe training dataset.

In [7]:
from featurize_mhcperf import update_mhcperf_features

allele_gets_data = 'HLA-B*15:13'
add_n_data = 2000

# Compute updated feature set
df_all_update = update_mhcperf_features(allele_gets_data, add_n_data)

# Load the fully trained MHCPerf model
mhcperf_model = mhcperf.model()

# MHCPerf PPV estimates for all MHC alleles with unique pseudosequence
ppv_estimates = mhcperf_model.predict_ppv(df_all_update)

# Show updated PPV estimates given new data
df_all_update.insert(1, 'PPV_est', ppv_estimates)
df_all_update[['allele', 'PPV_est']].head(10)

Unnamed: 0,allele,PPV_est
0,HLA-A*01:01,0.740534
1,HLA-A*01:02,0.712476
2,HLA-A*01:03,0.705833
3,HLA-A*01:06,0.649255
4,HLA-A*01:07,0.714751
5,HLA-A*01:08,0.686687
6,HLA-A*01:10,0.74151
7,HLA-A*01:104,0.633132
8,HLA-A*01:105,0.741617
9,HLA-A*01:106,0.639161
