# MHCGlobe use cases
-----------

### 1. Predict peptide-MHC binding affinity

MHCGlobe predicts both a binding score and a binding affinity (nM), returned in the columns `mhcglobe_score` and `mhcglobe_affinity`, respectively.

`MHCGlobe(train_type='full')` returns the MHCGlobe model trained on the full database of MHC-peptide binding instances in the present study, including both human and non-human pMHC data. 

In [1]:
import os
import pandas as pd
import sys
sys.path.append('./src/')

from mhcglobe import MHCGlobe

# Load MHCGlobe class object containing the fully trained model.
mhcglobe = MHCGlobe(train_type='full')

# Load binding data as CSV. Required columns are `allele` and `peptide`.
example_binding_data = './example/example_binding_data.csv'
prediction_cols = ['allele', 'peptide']
pMHC_data = pd.read_csv(example_binding_data, usecols=prediction_cols)

# Predict peptide-MHC binding with MHCGlobe
mhcglobe.predict_on_dataframe(pMHC_data)

Unnamed: 0,allele,peptide,mhcglobe_affinity,mhcglobe_score
0,HLA-B*15:01,AQMWSLMYF,64.439289,0.614990
1,HLA-A*33:01,DIDILQTNSR,396.716782,0.447011
2,HLA-B*38:01,QPKKAAAAL,20009.974714,0.084641
3,HLA-A*31:01,FLALGFFLR,117.473925,0.559490
4,HLA-B*07:02,RPQKRPSCI,104.042288,0.570712
...,...,...,...,...
95,HLA-A*30:01,KAFNHASVK,118.069387,0.559023
96,HLA-A*02:01,WMMAMKYPI,61.744092,0.618939
97,HLA-A*02:03,FLMGFNRDV,30.050617,0.685494
98,HLA-A*02:01,LLSTTEWQV,91.183992,0.582905
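
The two output columns appear to be related by a log transform. The values in the table above are consistent with `mhcglobe_score = 1 - log(mhcglobe_affinity) / log(50000)`, a scaling commonly used by peptide-MHC binding predictors. The sketch below is inferred from the example output and is not taken from the MHCGlobe source; the function names and the 50,000 nM ceiling are assumptions.

```python
import math

MAX_AFFINITY_NM = 50000.0  # assumed ceiling, inferred from the example output


def affinity_to_score(affinity_nm):
    """Map an affinity in nM to a score: 1 nM maps to 1.0, 50,000 nM maps to 0.0."""
    return 1.0 - math.log(affinity_nm) / math.log(MAX_AFFINITY_NM)


def score_to_affinity(score):
    """Inverse mapping: recover the affinity in nM from a score."""
    return MAX_AFFINITY_NM ** (1.0 - score)


# Rows 0 and 2 of the example output above: (affinity in nM, reported score).
for affinity_nm, reported_score in [(64.439289, 0.614990), (20009.974714, 0.084641)]:
    print(round(affinity_to_score(affinity_nm), 4), reported_score)
```

Lower affinity values (in nM) indicate stronger binding, so stronger binders receive higher scores under this mapping.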


### 2. Re-train MHCGlobe

MHCGlobe contains an ensemble of deep neural network models, which can easily be trained on user-defined peptide-MHC binding data. For results comparable to the paper, we recommend training MHCGlobe from the same initialized weights and biases that were used there. The recommended initialized MHCGlobe model can be accessed using `MHCGlobe(train_type='init')`.


In [20]:
cols = ['allele', 'model_id', 'PPV', 'data_aa_pos_1', 'data_aa_pos_2',
       'data_aa_pos_3', 'data_aa_pos_4', 'data_aa_pos_5', 'data_aa_pos_6',
       'data_aa_pos_7', 'data_aa_pos_8', 'data_aa_pos_9', 'data_aa_pos_10',
       'data_aa_pos_11', 'data_aa_pos_12', 'data_aa_pos_13', 'data_aa_pos_14',
       'data_aa_pos_15', 'data_aa_pos_16', 'data_aa_pos_17', 'data_aa_pos_18',
       'data_aa_pos_19', 'data_aa_pos_20', 'data_aa_pos_21', 'data_aa_pos_22',
       'data_aa_pos_23', 'data_aa_pos_24', 'data_aa_pos_25', 'data_aa_pos_26',
       'data_aa_pos_27', 'data_aa_pos_28', 'data_aa_pos_29', 'data_aa_pos_30',
       'data_aa_pos_31', 'data_aa_pos_32', 'data_aa_pos_33', 'data_aa_pos_34',
       'N1_dist', 'N2_dist', 'N3_dist', 'N4_dist', 'N5_dist', 'N6_dist',
       'N7_dist', 'N8_dist', 'N9_dist', 'N10_dist', 'N1_data', 'N2_data',
       'N3_data', 'N4_data', 'N5_data', 'N6_data', 'N7_data', 'N8_data',
       'N9_data', 'N10_data', 'dist_bin_0.0', 'dist_bin_0.1', 'dist_bin_0.2',
       'dist_bin_0.3', 'dist_bin_0.4', 'dist_bin_0.5', 'dist_bin_0.6',
       'dist_bin_0.7', 'data_size']



In [4]:
# Load example training data.
training_cols = ['allele', 'peptide', 'measurement_value', 'measurement_inequality']
pMHC_data = pd.read_csv(example_binding_data, usecols=training_cols)

# Re-train
user_model_id = 'mhcglobe_example1'

new_mhcglobe_path = '/tf/mhcglobe/outputs/mhcglobe_models/'
new_mhcglobe = MHCGlobe('init').train_ensemble(
        df_train          = pMHC_data,
        new_mhcglobe_path = new_mhcglobe_path,
        verbose           = 0)

Training...
Training complete.
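
Before retraining on your own data, it can help to check that the table has the four columns listed in `training_cols`. The sketch below is a hypothetical pre-flight check, not part of MHCGlobe: the helper name `check_training_frame`, the assumption that `measurement_value` must be a positive number, and the assumption that `measurement_inequality` is encoded as `<`, `=` or `>` are all illustrative.

```python
import pandas as pd

REQUIRED_TRAINING_COLS = ['allele', 'peptide', 'measurement_value', 'measurement_inequality']
ALLOWED_INEQUALITIES = {'<', '=', '>'}  # assumed encoding, not confirmed by this document


def check_training_frame(df):
    """Return a list of problems found; an empty list means the frame looks usable."""
    missing = [c for c in REQUIRED_TRAINING_COLS if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if df[['allele', 'peptide']].isna().any().any():
        problems.append("allele and peptide must not be empty")
    values = pd.to_numeric(df['measurement_value'], errors='coerce')
    if (values.isna() | (values <= 0)).any():
        problems.append("measurement_value must be a positive number")
    unexpected = set(df['measurement_inequality'].dropna()) - ALLOWED_INEQUALITIES
    if unexpected:
        problems.append(f"unexpected measurement_inequality values: {sorted(unexpected)}")
    return problems


# Made-up measurements, for illustration only.
toy = pd.DataFrame({
    'allele': ['HLA-A*02:01', 'HLA-B*07:02'],
    'peptide': ['LLSTTEWQV', 'RPQKRPSCI'],
    'measurement_value': [91.2, 104.0],
    'measurement_inequality': ['=', '<'],
})
print(check_training_frame(toy))
```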


In [5]:
# Load user trained MHCGlobe
new_mhcglobe = MHCGlobe(new_mhcglobe_path=new_mhcglobe_path)

# Predict peptide-MHC binding with MHCGlobe re-trained by user.
new_mhcglobe.predict_on_dataframe(pMHC_data.loc[:, prediction_cols])

Unnamed: 0,allele,peptide,mhcglobe_affinity,mhcglobe_score
0,HLA-B*15:01,AQMWSLMYF,6.163512,0.831915
1,HLA-A*33:01,DIDILQTNSR,71.936123,0.604818
2,HLA-B*38:01,QPKKAAAAL,1106.494729,0.352209
3,HLA-A*31:01,FLALGFFLR,18.622461,0.729720
4,HLA-B*07:02,RPQKRPSCI,48.980856,0.640341
...,...,...,...,...
95,HLA-A*30:01,KAFNHASVK,16.242568,0.742357
96,HLA-A*02:01,WMMAMKYPI,13.742250,0.757807
97,HLA-A*02:03,FLMGFNRDV,3.591422,0.881832
98,HLA-A*02:01,LLSTTEWQV,17.712534,0.734350
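
Both prediction tables cover the same example rows, so the fully trained and retrained models can be compared pair by pair. The sketch below does this for the first three rows, with the values copied from the two outputs above; the variable names and the `fold_change` column are illustrative.

```python
import pandas as pd

keys = {
    'allele': ['HLA-B*15:01', 'HLA-A*33:01', 'HLA-B*38:01'],
    'peptide': ['AQMWSLMYF', 'DIDILQTNSR', 'QPKKAAAAL'],
}
# Affinities (nM) copied from the two example outputs shown in this document.
full_preds = pd.DataFrame({**keys, 'mhcglobe_affinity': [64.439289, 396.716782, 20009.974714]})
retrained_preds = pd.DataFrame({**keys, 'mhcglobe_affinity': [6.163512, 71.936123, 1106.494729]})

# Align the two prediction sets on the (allele, peptide) pair.
comparison = full_preds.merge(
    retrained_preds, on=['allele', 'peptide'], suffixes=('_full', '_retrained'))

# Ratio above 1 means the retrained model predicts stronger binding (lower nM).
comparison['fold_change'] = (
    comparison['mhcglobe_affinity_full'] / comparison['mhcglobe_affinity_retrained'])
print(comparison[['allele', 'peptide', 'fold_change']])
```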


# MHCPerf use cases
__________

The fully trained MHCPerf model predicts the allele-level performance (PPV) of a given MHCGlobe model from a featurized representation of that model's training data.

### 1. Estimate allele-level PPV for query MHC alleles given a pMHC binding dataset.

MHCPerf features are created for each query allele from a dictionary of per-allele data counts, which relates the binding dataset to each of the query alleles.

In [None]:
#sys.path.append('./mhcperf_BEST/') # make sure mhcglobe doesn't get trained with mhcperf scripts.

import mhcperf

# Load MHCPerf model
mhcperf_model = mhcperf.model()

# Compute MHCPerf features for query alleles.
query_alleles = ['HLA-A*11:01', 'HLA-B*08:01','HLA-A*30:03','HLA-C*14:02']
mhcglobe_training_data = '/tf/natmtd/data/mhcglobe_data/mhcglobe_full_train_data.csv'
perf_features = mhcperf.featurize_from_binding(mhcglobe_training_data, query_alleles)

# Predict PPV for each MHC allele in `query_alleles`
query_allele_ppv_est = mhcperf_model.predict_ppv(perf_features)

# Show MHCPerf allele-level PPV estimates.
pd.DataFrame(zip(query_alleles, query_allele_ppv_est), columns=['allele', 'PPV_est'])
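
PPV stands for positive predictive value: the fraction of predicted binders that are true binders. The sketch below shows that generic calculation on made-up labels. The exact procedure used to compute allele-level PPV for MHCGlobe is not described in this section, so the function name and the way predictions are turned into binder calls are assumptions.

```python
def positive_predictive_value(predicted_positive, actual_positive):
    """PPV = true positives / all predicted positives; None if nothing was predicted positive."""
    true_positives = sum(
        1 for p, a in zip(predicted_positive, actual_positive) if p and a)
    n_predicted = sum(1 for p in predicted_positive if p)
    return true_positives / n_predicted if n_predicted else None


# Made-up labels for one allele: was each peptide called a binder, and was it one?
predicted = [True, True, True, False, True, False]
actual = [True, False, True, False, True, True]

# Four peptides are predicted binders and three of those are real binders.
print(positive_predictive_value(predicted, actual))  # 0.75
```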

### 2. MHCPerf Hyperparameter Tuning and Training

MHCPerf is a single neural network that can be retrained on new observations of MHCGlobe PPV performance in response to changes in the MHCGlobe training set. Retraining MHCPerf repeats hyperparameter optimization using grid search. This procedure can be run in the notebook (as shown below) or in the background using the following script call.

    $ python3 /tf/mhcglobe/mhcperf/mhcperf.py {df_train_path} {new_mhcperf_path}
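
For readers unfamiliar with grid search, the idea is to evaluate every combination of candidate hyperparameter values and keep the best one. The sketch below is a generic illustration, not MHCPerf's implementation: the hyperparameter names, the candidate values, and the `toy_validation_loss` function are hypothetical stand-ins for training a model and scoring it on held-out data.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_loss)."""
    names = list(param_grid)
    best_params, best_loss = None, float('inf')
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Hypothetical grid; the real MHCPerf search space is not listed in this document.
param_grid = {'hidden_units': [16, 32, 64], 'learning_rate': [0.1, 0.01, 0.001]}


def toy_validation_loss(params):
    """Stand-in for training a model and scoring it on held-out data."""
    return abs(params['hidden_units'] - 32) / 32 + abs(params['learning_rate'] - 0.01)


best_params, best_loss = grid_search(param_grid, toy_validation_loss)
print(best_params, best_loss)
```

The cost grows with the product of the list lengths (here 3 x 3 = 9 evaluations), which is one reason to run the search in the background.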

In [None]:
# Training Data for MHCPerf
# Feature column meanings are shown in Supplementary Figure 1.

# MHCPerf training data
mhcperf_train_path  = '/tf/fairmhc/mhcglobe/data/mhcperf_data/mhcperf_data.csv'
mhcperf_train_df = pd.read_csv(mhcperf_train_path)

# New MHCPerf model save path
new_mhcperf_path = '/tf/local/mhcperf_models/mhcperf_example' 

if not os.path.exists(new_mhcperf_path):
    # Train new MHCPerf
    new_mhcperf  = mhcperf.train(
        df_train_path  = mhcperf_train_path,
        model_savepath = new_mhcperf_path,
        verbose        = 0)
else:
    # Load New MHCPerf
    new_mhcperf = mhcperf.model(new_mhcperf_path, mhcperf_train_path)

### 3. Estimate allele-level PPV with new binding data.

This step updates an existing MHCPerf feature DataFrame to estimate how MHCPerf's PPV estimates would change if additional data were collected for a given allele of interest. It avoids recomputing the entire feature DataFrame for each query allele and instead updates the existing feature set describing the full MHCGlobe training dataset.

In [None]:
import featurize_mhcperf

allele_gets_data = 'HLA-B*15:13'
add_n_data = 2000

# Compute updated feature set
df_all_update = featurize_mhcperf.update_mhcperf_features(allele_gets_data, add_n_data)

# Load the fully trained MHCPerf model.
mhcperf_model = mhcperf.model()

# MHCPerf PPV estimates for all MHC alleles with unique pseudosequence.
ppv_estimates = mhcperf_model.predict_ppv(df_all_update)
df_all_update.insert(1, 'PPV_est', ppv_estimates)
df_all_update[['allele', 'PPV_est']].head()