<a target="_blank" href="https://colab.research.google.com/github/giordamaug/HELP/blob/main/help/notebooks/prediction.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a target="_blank" href="https://www.kaggle.com/notebooks/welcome?src=https://github.com/giordamaug/HELP/blob/main/help/notebooks/prediction.ipynb">
  <img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open In Kaggle"/>
</a>

# Install HELP from GitHub
Skip this cell if you have already installed HELP.

In [None]:
!pip install git+https://github.com/giordamaug/HELP.git

# Download the input files
In this cell we download the label file and the attribute files from the GitHub repository. Skip this step if you already have these input files locally.

In [None]:
tissue='Kidney'
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/help/datafinal/{tissue}_HELP.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/help/datafinal/{tissue}_BIO.csv
for i in range(5):
  !wget https://raw.githubusercontent.com/giordamaug/HELP/main/help/datafinal/{tissue}_CCcfs_{i}.csv
!wget https://raw.githubusercontent.com/giordamaug/HELP/main/help/datafinal/{tissue}_EmbN2V_128.csv
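
To skip files that are already present, the list of inputs can be built in plain Python first. This is only a sketch: the helper names `input_files` and `missing_urls` are hypothetical, and the file naming simply mirrors the `wget` commands above. No download is performed here.

```python
from pathlib import Path

# Base URL taken from the wget commands above.
BASE = "https://raw.githubusercontent.com/giordamaug/HELP/main/help/datafinal"

def input_files(tissue, n_chunks=5):
    """Return the file names used in this notebook for a tissue (hypothetical helper)."""
    names = [f"{tissue}_HELP.csv", f"{tissue}_BIO.csv"]
    names += [f"{tissue}_CCcfs_{i}.csv" for i in range(n_chunks)]
    names.append(f"{tissue}_EmbN2V_128.csv")
    return names

def missing_urls(tissue, folder="."):
    """Return the URLs of the input files not yet found in `folder` (hypothetical helper)."""
    return [f"{BASE}/{name}" for name in input_files(tissue)
            if not (Path(folder) / name).exists()]

for url in missing_urls("Kidney"):
    print(url)
```

Each printed URL could then be passed to `wget` or any other downloader.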

In [6]:
%cd ../../data

/Users/maurizio/HELP/data


# Process the tissue attributes
In this cell we load the gene attributes of the tissue from several data files. The `BIO` and `CCcfs` attributes are standardized (`'normalize': 'std'`, i.e. scaling with `sklearn.preprocessing.StandardScaler`), while the embedding attributes (`EmbN2V_128`) are left unnormalized. Missing-value fixing is controlled by the `fixna` flag of each file entry. The `feature_assemble_df` function then merges all attributes into a single matrix, which is the input for building the prediction model.

In [10]:
tissue='Kidney'
import pandas as pd
from HELPpy.preprocess.loaders import feature_assemble_df
import os
df_y = pd.read_csv(f"{tissue}_HELP.csv", index_col=0)
df_y = df_y.replace({'aE': 'NE', 'sNE': 'NE'})
print(df_y.value_counts(normalize=False))
features = [{'fname': f'{tissue}_BIO.csv', 'fixna' : False, 'normalize': 'std'},
            {'fname': f'{tissue}_CCcfs.csv', 'fixna' : False, 'normalize': 'std', 'nchunks' : 5},
            {'fname': f'{tissue}_EmbN2V_128.csv', 'fixna' : False, 'normalize': None}]
df_X, df_y = feature_assemble_df(df_y, features=features, saveflag=False, verbose=True)

label
NE       16678
E         1253
Name: count, dtype: int64
Majority NE 16678 minority E 1253
[Kidney_BIO.csv] found 52532 Nan...
[Kidney_BIO.csv] Normalization with std ...


Loading file in chunks: 100%|██████████| 5/5 [00:02<00:00,  1.99it/s]


[Kidney_CCcfs.csv] found 6676644 Nan...
[Kidney_CCcfs.csv] Normalization with std ...
[Kidney_EmbN2V_128.csv] found 0 Nan...
[Kidney_EmbN2V_128.csv] No normalization...
17236 labeled genes over a total of 17931
(17236, 3456) data input
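
The scale-then-merge step reported in the log above can be illustrated on toy data. This is a minimal sketch, not the internals of `feature_assemble_df`: the two small DataFrames and their column names are invented for illustration. It relies on the documented behaviour of `StandardScaler`, which ignores NaN values when fitting and leaves them in place when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for two attribute files, indexed by gene name (hypothetical data).
bio = pd.DataFrame({"length": [1200.0, np.nan, 800.0], "gc": [0.41, 0.55, 0.47]},
                   index=["G1", "G2", "G3"])
emb = pd.DataFrame({"e0": [0.1, -0.3, 0.2], "e1": [0.0, 0.5, -0.4]},
                   index=["G1", "G2", "G3"])

# Standardize the BIO-like block; NaN entries are skipped in fit and kept in the output.
bio_std = pd.DataFrame(StandardScaler().fit_transform(bio),
                       index=bio.index, columns=bio.columns)

# Leave the embedding block untouched and merge column-wise on the gene index.
X = pd.concat([bio_std, emb], axis=1)
print(X.shape)
```

Aligning on the gene index is what keeps the rows of different attribute files consistent after the merge.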


In [16]:
import os
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import *
from sklearn.model_selection import StratifiedKFold
from collections import Counter
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm
from tabulate import tabulate
from typing import List,Dict,Union,Tuple
def set_seed(seed=1):
    """
    Set random and numpy random seed for reproducibility

    :param int seed: initialization seed

    :returns None.
    """
    random.seed(seed)
    np.random.seed(seed)
def predict_cv_(X, Y, n_splits=10, method='LGBM', method_args = {}, balanced=False, saveflag: bool = False, outfile: str = 'predictions.csv', verbose: bool = False, display: bool = False,  seed: int = 42):
    """
    Perform cross-validated predictions using the selected classifier (LightGBM by default).

    :param DataFrame X: Features DataFrame.
    :param DataFrame Y: Target variable DataFrame.
    :param int n_splits: Number of folds for cross-validation.
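    :param str method: Classifier method, a key of ``methods`` ('LGBM' or 'SV'; default 'LGBM').
    :param dict method_args: Keyword arguments passed to the classifier constructor.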
    :param bool balanced: Whether to use class weights to balance the classes.
    :param bool saveflag: Whether to save the predictions to a CSV file.
    :param str or None outfile: File name for saving predictions.
    :param bool verbose: Whether to print verbose information.
    :param bool display: Whether to display a confusion matrix plot.
    :param int or None seed: Random seed for reproducibility.

    :returns: Summary statistics of the cross-validated predictions, single measures and label predictions
    :rtype: Tuple(pd.DataFrame,pd.DataFrame,pd.DataFrame)

    :example:
 
    .. code-block:: python

        # Example usage
        X_data = pd.DataFrame(...)
        Y_data = pd.DataFrame(...)
        result, _, _ = predict_cv_(X_data, Y_data, n_splits=5, balanced=True, saveflag=False, outfile=None, verbose=True, display=True, seed=42)
    """
    methods = {'LGBM': LGBMClassifier, 'SV': VotingSplitClassifier}

    # Silence tqdm when verbosity is off (currently disabled)
    #if not verbose: 
    #    def notqdm(iterable, *args, **kwargs): return iterable
    #    tqdm = notqdm
    # get the list of genes
    genes = Y.index

    # Encode target variable labels
    encoder = LabelEncoder()
    X = X.values
    y = encoder.fit_transform(Y.values.ravel())

    # Display class information
    classes_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
    if verbose: print(f'{classes_mapping}\n{Y.value_counts()}')

    # Set random seed
    set_seed(seed)

    # Initialize StratifiedKFold
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Initialize classifier
    #clf = LGBMClassifier(class_weight='balanced', verbose=-1) if balanced else LGBMClassifier(verbose=-1)
    clf = methods[method](**method_args)

    nclasses = len(np.unique(y))
    mm = np.array([], dtype=np.int64)
    gg = np.array([])
    yy = np.array([], dtype=np.int64)
    predictions = np.array([], dtype=np.int64)
    probabilities = np.array([])

    # Columns for result summary
    columns_names = ["ROC-AUC", "Accuracy", "BA", "Sensitivity", "Specificity", "MCC", 'CM']
    scores = pd.DataFrame()

    if verbose:
        print(f'Classification with {method}({method_args})...')

    # Iterate over each fold
    for fold, (train_idx, test_idx) in enumerate(tqdm(kf.split(np.arange(len(X)), y), total=kf.get_n_splits(), desc=f"{n_splits}-fold", disable=not verbose)):
        train_x, train_y, test_x, test_y = X[train_idx], y[train_idx], X[test_idx], y[test_idx]
        mm = np.concatenate((mm, test_idx))
        probs = clf.fit(train_x, train_y).predict_proba(test_x)
        preds = np.argmax(probs, axis=1)
        gg = np.concatenate((gg, genes[test_idx]))
        yy = np.concatenate((yy, test_y))
        cm = confusion_matrix(test_y, preds)
        predictions = np.concatenate((predictions, preds))
        probabilities = np.concatenate((probabilities, probs[:, 0]))

        # Calculate and store evaluation metrics for each fold
        roc_auc = roc_auc_score(test_y, probs[:, 1]) if nclasses == 2 else roc_auc_score(test_y, probs, multi_class="ovr", average="macro")
        scores = pd.concat([scores, pd.DataFrame([[roc_auc,
                                                    accuracy_score(test_y, preds),
                                                    balanced_accuracy_score(test_y, preds),
                                                    cm[0, 0] / (cm[0, 0] + cm[0, 1]),
                                                    cm[1, 1] / (cm[1, 0] + cm[1, 1]),
                                                    matthews_corrcoef(test_y, preds),
                                                    cm]],
                                                  columns=columns_names, index=[fold])],
                           axis=0)

    # Calculate mean and standard deviation of evaluation metrics
    df_scores = pd.DataFrame([f'{val:.4f}±{err:.4f}' for val, err in zip(scores.loc[:, scores.columns != "CM"].mean(axis=0).values,
                                                                     scores.loc[:, scores.columns != "CM"].std(axis=0))] +
                             [(scores[['CM']].sum()).values[0].tolist()],
                             columns=['measure'], index=scores.columns)

    # Display confusion matrix if requested
    if display:
        ConfusionMatrixDisplay(confusion_matrix=np.array(df_scores.loc['CM']['measure']), display_labels=encoder.inverse_transform(clf.classes_)).plot()

    # Create DataFrame for storing detailed predictions
    df_results = pd.DataFrame({'gene': gg, 'label': yy, 'prediction': predictions, 'probabilities': probabilities})
    df_results = df_results.set_index(['gene'])

    # Save detailed predictions to a CSV file if requested
    if saveflag:
        df_results.to_csv(outfile)

    # Return the summary statistics of cross-validated predictions, the single measures and the prediction results
    return df_scores, scores, df_results

In [159]:
pp

array([1, 1, 1, ..., 1, 1, 1])

In [35]:
predict_cv_(df_X, df_y, n_splits=5, method='SV', method_args = {'n_jobs' : 1, 'class_weight':'balanced', 'n_voters':10, 'verbose': True}, verbose=True)

{'E': 0, 'NE': 1}
label
NE       15994
E         1242
Name: count, dtype: int64
Classification with SV({'n_jobs': 1, 'class_weight': 'balanced', 'n_voters': 10, 'verbose': True})...


5-fold:   0%|          | 0/5 [00:00<?, ?it/s]

Majority 1 12795, minority 0 993


10-voter: 100%|██████████| 10/10 [00:30<00:00,  3.09s/it]
5-fold:  20%|██        | 1/5 [00:31<02:04, 31.16s/it]

Majority 1 12795, minority 0 994


10-voter: 100%|██████████| 10/10 [00:30<00:00,  3.08s/it]
5-fold:  40%|████      | 2/5 [01:02<01:33, 31.12s/it]

Majority 1 12795, minority 0 994


10-voter: 100%|██████████| 10/10 [00:30<00:00,  3.02s/it]
5-fold:  60%|██████    | 3/5 [01:32<01:01, 30.80s/it]

Majority 1 12795, minority 0 994


10-voter: 100%|██████████| 10/10 [00:30<00:00,  3.01s/it]
5-fold:  80%|████████  | 4/5 [02:02<00:30, 30.61s/it]

Majority 1 12796, minority 0 993


10-voter: 100%|██████████| 10/10 [00:31<00:00,  3.12s/it]
5-fold: 100%|██████████| 5/5 [02:34<00:00, 30.88s/it]


(                                  measure
 ROC-AUC                     0.9569±0.0049
 Accuracy                    0.8767±0.0038
 BA                          0.8890±0.0076
 Sensitivity                 0.9034±0.0127
 Specificity                 0.8746±0.0034
 MCC                         0.5220±0.0120
 CM           [[1122, 120], [2006, 13988]],
     ROC-AUC  Accuracy        BA  Sensitivity  Specificity       MCC  \
 0  0.957739  0.873840  0.889420     0.907631     0.871210  0.519149   
 1  0.964408  0.883087  0.901675     0.923387     0.879962  0.542884   
 2  0.956831  0.876704  0.885217     0.895161     0.875274  0.518009   
 3  0.951590  0.875544  0.886452     0.899194     0.873711  0.517726   
 4  0.953835  0.874093  0.882150     0.891566     0.872733  0.512279   
 
                          CM  
 0  [[226, 23], [412, 2787]]  
 1  [[229, 19], [384, 2815]]  
 2  [[222, 26], [399, 2800]]  
 3  [[223, 25], [404, 2795]]  
 4  [[222, 27], [407, 2791]]  ,
         label  prediction  probab
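
The summary measures can be checked against the aggregated confusion matrix printed above. In this short sketch the rows are true labels and the columns are predictions, with `E` encoded as 0 and `NE` as 1, so sensitivity refers to the `E` class, as in `predict_cv_`. Note that these are values pooled over all folds, whereas the table reports the mean of the per-fold values.

```python
import numpy as np

# Aggregated confusion matrix of the 5-fold run (E = 0, NE = 1).
cm = np.array([[1122, 120],
               [2006, 13988]])

# Fraction of true E genes predicted as E.
sensitivity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
# Fraction of true NE genes predicted as NE.
specificity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
# Balanced accuracy is the mean of the two per-class recalls.
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Sensitivity {sensitivity:.4f}, Specificity {specificity:.4f}, BA {balanced_accuracy:.4f}")
```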

In [33]:
from sklearn.base import clone, BaseEstimator, ClassifierMixin, RegressorMixin
from joblib import Parallel, delayed
from lightgbm import LGBMClassifier 
import numpy as np
from tqdm import tqdm



class VotingSplitClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, n_voters=10, voting='soft', n_jobs=-1, verbose=False, **kwargs):
        self.kwargs = kwargs
        # Initialize the ensemble of voters
        self.verbose = verbose
        self.n_jobs = n_jobs
        self.n_voters = n_voters
        self.estimators_ = [LGBMClassifier(**kwargs) for i in range(n_voters)]
    
    
    def _fit_single_estimator(self, i, X, y, index_ne, index_e):
        """Private function used to fit an estimator within a job."""
        df_X = np.append(X[index_ne], X[index_e], axis=0)
        df_y = np.append(y[index_ne], y[index_e], axis=0)
        clf = clone(self.estimators_[i])
        clf.fit(df_X, df_y)
        return clf
    
    def fit(self, X, y):
        # Find the majority and minority class
        assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray), "Only array input!"
        unique, counts = np.unique(y, return_counts=True)
        minlab = unique[np.argmin(counts)]
        maxlab = unique[np.argmax(counts)]

        if self.verbose:
            print(f"Majority {maxlab} {max(counts)}, minority {minlab} {min(counts)}")

        # Separate majority and minority class
        all_index_ne = np.where(y == maxlab)[0]
        index_e = np.where(y == minlab)[0]

        # Split majority class among voters
        splits = np.array_split(all_index_ne, self.n_voters)

        self.estimators_ = Parallel(n_jobs=self.n_jobs)(delayed(self._fit_single_estimator)(i,X, y, index_ne, index_e) for i,index_ne in enumerate(tqdm(splits, desc=f"{self.n_voters}-voter", disable = not self.verbose)))
        #for i,index_ne in enumerate(tqdm(splits, desc=f"{self.n_voters}-voter", disable = not self.verbose)):
        #    df_X = np.append(X[index_ne], X[index_e], axis=0)
        #    df_y = np.append(y[index_ne], y[index_e], axis=0)
        #    self.estimators_[i].fit(df_X,df_y)
        return self
    
    def predict_proba(self, X, y=None):
        # Average the class probabilities of all voters (soft voting)
        assert isinstance(X, np.ndarray), "Only array input!"
        probabilities = np.array([self.estimators_[i].predict_proba(X) for i in range(self.n_voters)])
        return np.sum(probabilities, axis=0)/self.n_voters
    
    def predict(self, X, y=None):
        assert isinstance(X, np.ndarray), "Only array input!"
        probabilities = np.array([self.estimators_[i].predict_proba(X) for i in range(self.n_voters)])
        return np.argmax(np.sum(probabilities, axis=0)/self.n_voters, axis=1)
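
The way `fit` distributes the majority class among the voters can be seen on a toy label vector. This is a sketch of the splitting step only, with invented labels; no classifier is trained.

```python
import numpy as np

# Toy labels (hypothetical): 10 majority samples (1) and 3 minority samples (0).
y = np.array([1] * 10 + [0] * 3)
n_voters = 3

# Identify majority and minority labels from the class counts.
labels, counts = np.unique(y, return_counts=True)
maj_label = labels[np.argmax(counts)]
min_label = labels[np.argmin(counts)]

maj_idx = np.where(y == maj_label)[0]
min_idx = np.where(y == min_label)[0]

# Each voter gets one chunk of the majority class plus the whole minority class.
chunks = np.array_split(maj_idx, n_voters)
train_sets = [np.concatenate([chunk, min_idx]) for chunk in chunks]

for i, idx in enumerate(train_sets):
    print(f"voter {i}: {len(idx)} samples, class counts {np.bincount(y[idx])}")
```

Because `np.array_split` tolerates uneven division, the chunk sizes differ by at most one sample, and every voter sees a far more balanced training set than the full data.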

In [183]:
df_y.values.ravel()

array(['NE', 'NE', 'NE', ..., 'NE', 'NE', 'NE'], dtype=object)

In [194]:
clf = VotingSplitClassifier(n_voters=3, class_weight='balanced', verbose=True)
pp = clf.fit(df_X.values, df_y.values.ravel()).predict_proba(df_X.values)

Majority NE 1242, minority E 15994


3-voter:   0%|          | 0/3 [00:00<?, ?it/s]

3-voter: 100%|██████████| 3/3 [01:02<00:00, 20.90s/it]


In [196]:
pp.shape

(17236, 2)

In [188]:
pp = clf.predict_proba(df_X.values)

In [107]:
x = np.argmax(np.sum(pp, axis=0)/3, axis=1)
indices1 = np.where(x == 1)[0]
indices0 = np.where(x == 0)[0]

array([    7,    14,    16, ..., 17225, 17226, 17227])

# Prediction with Soft Voting

In [6]:
import numpy as np
from help.models.prediction import predict_cv
seed=42
df_y_ne = df_y[df_y['label']=='NE']
df_y_e = df_y[df_y['label']=='E']
#df_y_ne = df_y_ne.sample(frac=1, random_state=seed)
n_voters = 7
splits = np.array_split(df_y_ne, n_voters) 
predictions_ne = pd.DataFrame()
predictions_e = pd.DataFrame(index=df_y_e.index)
d=np.empty((len(df_y_e.index),),object)
d[...]=[list() for _ in range(len(df_y_e.index))]
predictions_e['probabilities'] = d
predictions_e['label'] = np.array([0 for idx in df_y_e.index])
predictions_e['prediction'] = np.array([np.nan for idx in df_y_e.index])
for df_index_ne in splits:
    df_x = pd.concat([df_X.loc[df_index_ne.index], df_X.loc[df_y_e.index]])
    df_yy = pd.concat([df_y.loc[df_index_ne.index], df_y_e])
    _, _, preds = predict_cv(df_x, df_yy, n_splits=5, method='LGBM', balanced=True, verbose=True, seed=seed)
    predictions_ne = pd.concat([predictions_ne, preds.loc[df_index_ne.index]])
    r = np.empty((len(df_y_e.index),),object)
    r[...]=[predictions_e.loc[idx]['probabilities'] + [preds.loc[idx]['probabilities']]  for idx in df_y_e.index]
    predictions_e['probabilities'] = r
predictions_e['prediction'] = predictions_e['probabilities'].map(lambda x: 0 if sum(x)/n_voters > 0.5 else 1)
predictions_e['probabilities'] = predictions_e['probabilities'].map(lambda x: sum(x)/n_voters)
predictions = pd.concat([predictions_ne, predictions_e])
predictions.to_csv(f"pred_Kidney_SV_{n_voters}.csv", index=True)



{'E': 0, 'NE': 1}
label
NE       1600
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:14<00:00,  2.98s/it]


{'E': 0, 'NE': 1}
label
NE       1600
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:16<00:00,  3.21s/it]


{'E': 0, 'NE': 1}
label
NE       1600
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:15<00:00,  3.12s/it]


{'E': 0, 'NE': 1}
label
NE       1600
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:15<00:00,  3.18s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:14<00:00,  2.91s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:14<00:00,  2.96s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:15<00:00,  3.07s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:15<00:00,  3.03s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:13<00:00,  2.74s/it]


{'E': 0, 'NE': 1}
label
NE       1599
E        1242
Name: count, dtype: int64
Classification with LGBM...


5-fold: 100%|██████████| 5/5 [00:13<00:00,  2.63s/it]


In [113]:
predictions_e.to_csv("pred_Kidney_SV.csv", index=True)
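
The aggregation applied to the `E` genes in the soft-voting cell above reduces to averaging, per gene, the probability of class `E` (label 0) over all voters and thresholding at 0.5. A minimal sketch with invented probabilities:

```python
import numpy as np

# Hypothetical P(E) values: one row per voter, one column per gene.
probs = np.array([[0.90, 0.20, 0.60],
                  [0.80, 0.40, 0.40],
                  [0.70, 0.10, 0.55]])

# Soft voting: average the probabilities over the voters.
mean_p = probs.mean(axis=0)

# Predict E (0) when the averaged probability exceeds 0.5, otherwise NE (1).
pred = np.where(mean_p > 0.5, 0, 1)
print(mean_p, pred)
```

The third gene shows why soft voting differs from a majority vote over hard decisions only in borderline cases: one voter is below 0.5, yet the averaged probability still favours `E`.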