<a target="_blank" href="https://colab.research.google.com/github/giordamaug/HELP/blob/main/HELPpy/notebooks/prediction.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://www.kaggle.com/notebooks/welcome?src=https://github.com/giordamaug/HELP/blob/main/HELPpy/notebooks/prediction.ipynb">
  <img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle"/>
</a>

### 1. Install HELP from GitHub
Skip this cell if you have already installed HELP.

In [None]:
!pip install git+https://github.com/giordamaug/HELP.git

### 2. Download the input files
For a chosen tissue (here `Kidney`), download from GitHub the label file (here `Kidney_HELP.csv`, computed as in Example 1) and the attribute files (here BIO `Kidney_BIO.csv`, CCcfs `Kidney_CCcfs_1.csv`, ..., `Kidney_CCcfs_5.csv`, and N2V `Kidney_EmbN2V_128.csv`).  

Skip this step if you already have these input files locally.

In [None]:
tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/{tissue}_EmbN2V_128.csv

Note that the CCcfs file is split into 5 separate files because of storage limits on GitHub.

### 3. Load the input files and process the tissue attributes

+ The label file (`{tissue}_HELP.csv`, e.g. `Kidney_HELP.csv`) is loaded via `read_csv`; its three-class labels (`E`, `aE`, `sNE`) are converted to two classes, with `E` encoded as `1` and `aE`/`sNE` merged into the non-essential class `0`.

+ The tissue gene attributes are loaded and assembled via `feature_assemble_df` from the BIO, CCcfs and embedding data files. When CCcfs is stored as 5 subfiles, add `'nchunks': 5` to its entry; the cell below, run for `Brain`, reads a single `Brain_CCcfs.csv`. Missing values are not fixed (`'fixna': False`), while scaling (`'normalize': 'std'`) is applied to the BIO and CCcfs attributes but not to the embedding.
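The two preprocessing steps described above can be sketched on toy data. This is only an illustration: the gene names and values are made up, and treating `'normalize': 'std'` as a column-wise z-score is an assumption, not a statement about what `feature_assemble_df` does internally.

```python
import pandas as pd

# Toy label table; gene names and labels are illustrative only.
df_y = pd.DataFrame(
    {"label": ["E", "aE", "sNE", "sNE", "E"]},
    index=["g1", "g2", "g3", "g4", "g5"],
)

# Merge aE and sNE into class 0 and keep E as class 1.
binary = df_y["label"].map({"E": 1, "aE": 0, "sNE": 0})
counts = binary.value_counts().to_dict()
print(counts)

# Toy attribute table with one missing value.
df_X = pd.DataFrame({"f1": [1.0, None, 3.0], "f2": [10.0, 20.0, 30.0]},
                    index=["g1", "g2", "g3"])

# Assumption: 'std' scaling is a column-wise z-score.
# pandas skips NaN in mean() and std(), so missing values stay missing.
z = (df_X - df_X.mean()) / df_X.std()
print(z)
```

Using `Series.map` with a complete mapping yields an integer column directly, which is one way to sidestep dtype surprises when replacing strings with numbers.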

In [25]:
import os
import pandas as pd
from HELPpy.preprocess.loaders import feature_assemble_df

path = "../data"
tissue = 'Brain'

# Load the labels and merge aE and sNE into class 0, keeping E as class 1
df_y = pd.read_csv(os.path.join(path, f"{tissue}_HELP.csv"), index_col=0)
df_y = df_y.replace({'aE': 0, 'sNE': 0, 'E': 1})
print(df_y.value_counts(normalize=False))

# One entry per attribute file: BIO and CCcfs are scaled with 'std', the embedding is left as is
features = [{'fname': os.path.join(path, f'{tissue}_BIO.csv'), 'fixna' : False, 'normalize': 'std'},
            {'fname': os.path.join(path, f'{tissue}_CCcfs.csv'), 'fixna' : False, 'normalize': 'std'},
            {'fname': os.path.join(path, f'{tissue}_EmbN2V_128.csv'), 'fixna' : False, 'normalize': None}
            ]
df_X, df_y = feature_assemble_df(df_y, features=features, verbose=True)


label
0        16685
1         1246
Name: count, dtype: int64
Majority 0 16685 minority 1 1246
[Brain_BIO.csv] found 58547 Nan...
[Brain_BIO.csv] Normalization with std ...
[Brain_CCcfs.csv] found 6735590 Nan...
[Brain_CCcfs.csv] Normalization with std ...
[Brain_EmbN2V_128.csv] found 0 Nan...
[Brain_EmbN2V_128.csv] No normalization...
17244 labeled genes over a total of 17931
(17244, 3458) data input


### 4. Estimate the performance of EG prediction

Instantiate the prediction model described in the HELP paper, a soft-voting ensemble (here `VotingEnsembleLGBM` with `n_voters=13` classifiers), and estimate its performance via 5-fold cross-validation (`k_fold_cv` with `n_splits=5`). Then print the average performance across folds (`df_scores`)...

In [26]:
from HELPpy.models.prediction import VotingEnsembleLGBM, k_fold_cv
clf = VotingEnsembleLGBM(n_voters=13, learning_rate=0.1, n_estimators=200, boosting_type='gbdt', n_jobs=-1, random_state=42)
df_scores, scores, predictions = k_fold_cv(df_X, df_y, clf, n_splits=5, seed=0, show_progress=True, verbose=True)
df_scores

{0: 0, 1: 1}
label
0        16010
1         1234
Name: count, dtype: int64




| | measure |
|---|---|
| ROC-AUC | 0.9592±0.0051 |
| Accuracy | 0.8826±0.0067 |
| BA | 0.8927±0.0087 |
| Sensitivity | 0.8809±0.0070 |
| Specificity | 0.9044±0.0156 |
| MCC | 0.5327±0.0157 |
| CM | [[14104, 1906], [118, 1116]] |
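To give an intuition for soft voting on imbalanced data, here is a minimal sketch. It is not the HELP implementation: the splitting scheme (each voter trains on one chunk of the majority class plus all minority samples) is an assumption about the general idea, `DecisionTreeClassifier` is a stand-in for the LightGBM voters, and the helper names `fit_split_voters` and `soft_vote` are hypothetical. In the output above the class ratio is 16010 / 1234 ≈ 13, which is close to the `n_voters` used in the cell.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_split_voters(X, y, n_voters, seed=0):
    """Train one classifier per majority-class chunk plus all minority samples."""
    rng = np.random.default_rng(seed)
    maj_idx = rng.permutation(np.flatnonzero(y == 0))  # shuffled majority indices
    min_idx = np.flatnonzero(y == 1)                    # all minority indices
    voters = []
    for chunk in np.array_split(maj_idx, n_voters):
        idx = np.concatenate([chunk, min_idx])
        clf = DecisionTreeClassifier(max_depth=3, random_state=seed)
        voters.append(clf.fit(X[idx], y[idx]))
    return voters

def soft_vote(voters, X):
    """Average the class probabilities of all voters (soft voting)."""
    return np.mean([v.predict_proba(X) for v in voters], axis=0)

# Synthetic imbalanced data, roughly 12:1.
X, y = make_classification(n_samples=1300, n_features=10,
                           weights=[0.92, 0.08], random_state=0)
voters = fit_split_voters(X, y, n_voters=13)
proba = soft_vote(voters, X)      # shape (n_samples, 2)
y_pred = proba.argmax(axis=1)     # class with the highest average probability
print(proba[:3], y_pred[:3])
```

Because every voter sees a roughly balanced training set, the averaged probabilities are less dominated by the majority class than those of a single classifier trained on all the data.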


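... and recompute the same metrics from the pooled `predictions` (here exported as a LaTeX table); the per-fold values (`scores`) are shown in the following cell.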

In [28]:
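import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, confusion_matrix,
                             matthews_corrcoef, recall_score, roc_auc_score)
from imblearn.metrics import specificity_score

# Pooled out-of-fold predictions, probabilities and true labels
y_pred = predictions['prediction'].values.ravel()
y_prob = predictions['probabilities'].values.ravel()
y_true = predictions['label'].values.ravel()

# 1 - y_prob is used as the class-1 score ('probabilities' is close to 1 for rows predicted as 0).
# Metric names follow df_scores: 'Sensitivity' is the recall of class 0 (specificity_score with
# class 1 as positive) and 'Specificity' is the recall of class 1 (recall_score).
print(pd.DataFrame({'ROC-AUC' : [roc_auc_score(y_true, 1-y_prob, average='weighted')],
              'Accuracy' : [accuracy_score(y_true, y_pred)],
              'Sensitivity' : [specificity_score(y_true, y_pred)],
              'Specificity' : [recall_score(y_true, y_pred)],
              'BA' : [balanced_accuracy_score(y_true, y_pred)],
              'MCC' : [matthews_corrcoef(y_true, y_pred)],
              'CM': [confusion_matrix(y_true, y_pred)]
              }).T.to_latex())
print(confusion_matrix(y_true, y_pred))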

\begin{tabular}{ll}
\toprule
 & 0 \\
\midrule
ROC-AUC & 0.959160 \\
Accuracy & 0.882626 \\
Sensitivity & 0.880949 \\
Specificity & 0.904376 \\
BA & 0.892663 \\
MCC & 0.532446 \\
CM & [[14104  1906]
 [  118  1116]] \\
\bottomrule
\end{tabular}

[[14104  1906]
 [  118  1116]]
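As a cross-check, the metrics above can be derived by hand from the pooled confusion matrix. The sketch below uses only the four counts printed above, with rows as true classes and columns as predicted classes (the scikit-learn convention).

```python
import math

# Pooled confusion matrix from the output above.
tn, fp = 14104, 1906   # true class 0: predicted 0, predicted 1
fn, tp = 118, 1116     # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
recall_class0 = tn / (tn + fp)   # listed as 'Sensitivity' in the table above
recall_class1 = tp / (fn + tp)   # listed as 'Specificity' in the table above
balanced_accuracy = (recall_class0 + recall_class1) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"Accuracy    {accuracy:.6f}")
print(f"Recall (0)  {recall_class0:.6f}")
print(f"Recall (1)  {recall_class1:.6f}")
print(f"BA          {balanced_accuracy:.6f}")
print(f"MCC         {mcc:.6f}")
```

The low MCC relative to the balanced accuracy reflects the 1906 class-0 genes predicted as class 1, which outnumber the 1116 correctly predicted class-1 genes.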


In [22]:
scores

| index | ROC-AUC | Accuracy | BA | Sensitivity | Specificity | MCC | CM |
|---|---|---|---|---|---|---|---|
| 0 | 0.960777 | 0.881636 | 0.896712 | 0.879138 | 0.914286 | 0.533857 | [[2815, 387], [21, 224]] |
| 1 | 0.962056 | 0.885117 | 0.894816 | 0.88351 | 0.906122 | 0.53689 | [[2829, 373], [23, 222]] |
| 2 | 0.957221 | 0.876994 | 0.894213 | 0.874141 | 0.914286 | 0.525166 | [[2799, 403], [21, 224]] |
| 3 | 0.96532 | 0.880766 | 0.909436 | 0.876015 | 0.942857 | 0.545108 | [[2805, 397], [14, 231]] |
| 4 | 0.953931 | 0.887406 | 0.886408 | 0.88757 | 0.885246 | 0.531287 | [[2842, 360], [28, 216]] |


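Show the labels, predictions and probabilities (`predictions`) and save them to a CSV file. The genes labeled `1` and predicted as `1` are also written to a text file (`pred_csEG_{tissue}.txt`).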

In [27]:
predictions

| gene | label | prediction | probabilities |
|---|---|---|---|
| A4GNT | 0 | 0 | 0.999996 |
| AAAS | 0 | 1 | 0.000417 |
| AASDH | 0 | 0 | 0.984238 |
| ABCA2 | 0 | 0 | 0.928934 |
| ABCA3 | 0 | 0 | 0.999901 |
| ... | ... | ... | ... |
| ZSWIM7 | 0 | 0 | 0.910995 |
| ZSWIM8 | 0 | 0 | 0.803753 |
| ZXDA | 0 | 0 | 0.999902 |
| ZXDB | 0 | 0 | 0.999990 |
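In the rows shown above, genes predicted as `0` have `probabilities` close to 1 and the gene predicted as `1` has a value close to 0, which suggests that the column holds the probability of class 0. The sketch below, built from two of those rows, shows how a class-1 score and the predicted class could be recovered under that reading; the 0.5 cut-off is an assumption, not something stated by the library.

```python
import pandas as pd

# Two rows copied from the table above.
predictions = pd.DataFrame(
    {"label": [0, 0], "prediction": [0, 1], "probabilities": [0.999996, 0.000417]},
    index=pd.Index(["A4GNT", "AAAS"], name="gene"),
)

# Assumption: 'probabilities' is P(class 0), so the class-1 score is its complement.
score_class1 = 1.0 - predictions["probabilities"]

# Assumption: a 0.5 cut-off on the class-1 score.
recomputed = (score_class1 > 0.5).astype(int)
print(recomputed.tolist())
```

This reading is consistent with the use of `1-y_prob` as the score passed to `roc_auc_score` earlier in the notebook.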


In [29]:
import os
savepath = "../data"
# Genes labeled 1 and also predicted as 1
csEG = predictions[(predictions['label']==1) & (predictions['prediction']==1)].index.values
# Save the full predictions table, then the gene list (one gene per line)
predictions.to_csv(os.path.join(savepath, f"pred_{tissue}_EvsNE.csv"), index=True)
with open(os.path.join(savepath, f"pred_csEG_{tissue}.txt"), 'w', encoding='utf-8') as f:
    f.write('\n'.join(list(csEG)))

### 5. Compute TPR for ucsEGs and csEGs

For each tissue, read the precomputed result files for ucsEGs (`ucsEG_{tissue}.txt`, e.g. `ucsEG_Kidney.txt`) and csEGs (`csEGs_{tissue}_EvsNE.csv`, e.g. `csEGs_Kidney_EvsNE.csv`), compute the true positive rates (`tpr`) and show them as a bar plot.
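The TPR computation can be sketched on toy data before running it on the real files. Gene names, labels and the helper `tpr` are hypothetical; as in the cell below, label `0` is taken as the positive class.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions table; 0 is the positive class, as in the cell below.
predictions = pd.DataFrame(
    {"label": [0, 0, 0, 0, 1, 1], "prediction": [0, 0, 1, 0, 1, 0]},
    index=["g1", "g2", "g3", "g4", "g5", "g6"],
)
# Hypothetical gene subset; "g9" has no prediction and is dropped by the intersection.
ucs_genes = np.array(["g1", "g3", "g9"])

def tpr(df, positive=0):
    """Return (correctly predicted positives, total positives)."""
    pos = df[df["label"] == positive]
    hits = int((pos["prediction"] == positive).sum())
    return hits, len(pos)

shared = np.intersect1d(ucs_genes, predictions.index.values)
ucs_hits, ucs_total = tpr(predictions.loc[shared])
cs_hits, cs_total = tpr(predictions)

print(f"subset TPR = {ucs_hits}/{ucs_total} = {ucs_hits / ucs_total:.3f}")
print(f"full   TPR = {cs_hits}/{cs_total} = {cs_hits / cs_total:.3f}")
```

Intersecting the gene list with the table index first avoids a `KeyError` from `.loc` when a listed gene has no prediction.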

In [31]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
tissues = ['Kidney', 'Lung', 'Brain']
path = '../data'
labels = []
data = []
tpr = []
genes = {}
for tissue in tissues:
    #!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/ucsEG_{tissue}.txt
    ucsEGs = pd.read_csv(os.path.join(path,f"ucsEG_{tissue}.txt"), index_col=0, header=None).index.values
    #!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/csEGs_{tissue}_EvsNE.csv
    predictions = pd.read_csv(os.path.join(path,f"csEGs_{tissue}_EvsNE.csv"), index_col=0)
    # Keep only the ucsEGs that appear in the predictions table
    indices = np.intersect1d(ucsEGs, predictions.index.values)
    preds = predictions.loc[indices]
    # The filters below treat label 0 as the positive class when counting hits
    num1 = len(preds[preds['label'] == preds['prediction']])
    den1 = len(preds[preds['label'] == 0])
    den2 = len(predictions[predictions['label'] == 0])
    num2 = len(predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])])
    labels += [f"ucsEGs\n{tissue}", f"csEGs\n{tissue}"]
    data += [float(f"{num1 /den1:.3f}"), float(f"{num2 /den2:.3f}")]
    tpr += [f"{num1}/{den1}", f"{num2}/{den2}"]
    genes[f'ucsEGs_{tissue}_y'] = preds[preds['label'] == preds['prediction']].index.values
    genes[f'ucsEGs_{tissue}_n'] = preds[preds['label'] != preds['prediction']].index.values
    genes[f'csEGs_{tissue}_y'] = predictions[(predictions['label'] == 0) & (predictions['label'] == predictions['prediction'])].index.values
    genes[f'csEGs_{tissue}_n'] = predictions[(predictions['label'] == 0) & (predictions['label'] != predictions['prediction'])].index.values
    print(f"ucsEG {tissue} TPR = {num1 /den1:.3f} ({num1}/{den1}) csEG {tissue} TPR =  {num2/den2:.3f} ({num2}/{den2})")

f, ax = plt.subplots(figsize=(4, 4))
# Pastel palette with one color per bar
customPalette = sns.color_palette("pastel", n_colors=6)
g = sns.barplot(y = data, x = labels, ax=ax, hue= data, palette = customPalette, orient='v', legend=False)
ax.set_ylabel('TPR')
ax.set(yticklabels=[])
for i,l,t in zip(range(len(tissues)*2),labels,tpr):
    ax.text(-0.1 + (i * 1.0), 0.1, f"({t})", rotation='vertical')
for i in ax.containers:
    ax.bar_label(i,)


In [36]:
s = []
for tissue in tissues:
    #!wget https://raw.githubusercontent.com/giordamaug/HELP/main/data/ucsEG_{tissue}.txt
    s += [set(pd.read_csv(os.path.join(path,f"ucsEG_{tissue}.txt"), index_col=0, header=None).index.values)]
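# Genes shared by Lung and Brain, and genes found only in Brain (tissues = ['Kidney', 'Lung', 'Brain'])
s[1] & s[2], s[2] - (s[0] | s[1])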

({'CDK2', 'CKS1B', 'DDX11', 'NCAPH2'},
 {'ACTB',
  'FAM50A',
  'FDXR',
  'FXN',
  'GABPB1',
  'GFER',
  'GNB1L',
  'HSCB',
  'KTI12',
  'NOPCHAP1',
  'NUP54',
  'PGS1',
  'RBM48',
  'RPL39',
  'RPP25L',
  'SERBP1',
  'SSB',
  'TAMM41',
  'TIMM9',
  'TOMM20',
  'URM1',
  'VHL',
  'VRK1'})

The TPR code above can be used to reproduce Fig 5(B) of the HELP paper by running the loop over the `Kidney` and `Lung` tissues.

Finally, we print the ucsEGs of the chosen tissue that were predicted correctly (`_y`) and incorrectly (`_n`).

In [39]:
tissue = 'Brain'
genes[f'ucsEGs_{tissue}_y'], genes[f'ucsEGs_{tissue}_n']

(array(['ACTB', 'CDK2', 'CHMP7', 'CKS1B', 'DDX11', 'EMC3', 'EXOSC1',
        'FAM50A', 'FDXR', 'GFER', 'NCAPH2', 'NUP54', 'RBM48', 'RPL39',
        'SERBP1', 'SNRPB2', 'SRSF10', 'SSB', 'TAF1D', 'TIMM9', 'TOMM20',
        'URM1', 'USP10', 'VRK1'], dtype=object),
 array(['ARF4', 'ARFRP1', 'CDK6', 'FERMT2', 'FXN', 'GABPB1', 'GNB1L',
        'HSCB', 'ITGAV', 'KTI12', 'NHLRC2', 'PGS1', 'PTK2', 'RPP25L',
        'TAMM41', 'VHL', 'WDR25'], dtype=object))