# Cox regression and hazard ratio model

Cox regression analyzes each given variable and estimates a hazard ratio for it. The hazard ratio is not a probability but a multiplicative effect on the hazard: a value >1 indicates that the risk of the event increases as the value of that variable increases (or when the variable is present, for boolean variables), while a value <1 indicates that the risk decreases.
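As a small illustration of how these values are read, the sketch below converts Cox coefficients (log hazard ratios) into hazard ratios via `exp(coefficient)`. The feature names and coefficient values are invented for the example and do not come from the fitted model.

```python
import math

# Hypothetical Cox coefficients (log hazard ratios); values are made up for illustration.
coefficients = {"age": 0.05, "mirna_a": -0.30, "mirna_b": 0.0}


def hazard_ratio(beta, delta=1.0):
    """Hazard ratio for an increase of `delta` units in a covariate with coefficient `beta`."""
    return math.exp(beta * delta)


hazard_ratios = {name: hazard_ratio(beta) for name, beta in coefficients.items()}

for name, hr in hazard_ratios.items():
    direction = "higher risk" if hr > 1 else "lower risk" if hr < 1 else "no effect"
    print(f"{name}: HR = {hr:.3f} ({direction})")
```

Because the features are Z-scaled in this notebook, one "unit" here would correspond to one standard deviation of the original variable.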


Input: feature vectors with the patient's age at diagnosis, days to last follow-up, a boolean death event, and a miRNA-seq vector with log- and quantile-normalized values.

Pipeline:
   - Scaling with **Z-scaler** on the age and miRNA-seq fields
   - Fit an elastic net model with ```sksurv.linear_model.CoxnetSurvivalAnalysis``` (scikit-survival at: https://scikit-survival.readthedocs.io/en/stable/user_guide/coxnet.html)
      - Apply grid search and K-fold cross-validation to find the best set of parameters
   - Compute the risk score with the predict function
        - the survival function or cumulative hazard function can also be predicted, but the model must then be fitted with the parameter ```fit_baseline_model=True```

Motivations:
   - Z-scaler to bring the predictors onto the same scale, with mean 0 and variance 1
   - Cox with penalization to perform feature selection and keep only the most relevant miRNAs
   - Elastic Net because plain Lasso-Cox is not optimal for two reasons: it cannot select more features than there are samples, and within a group of highly correlated features it picks only one, essentially at random. Elastic Net addresses both by combining the l1 and l2 penalties, which makes the selection more robust
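To make the l1/l2 combination concrete, the sketch below evaluates the elastic net penalty in the form given in the scikit-survival user guide, `alpha * (r * sum(|b_j|) + (1 - r) / 2 * sum(b_j^2))`, where `r` corresponds to `l1_ratio`. The coefficient values are made up, and the function name is hypothetical.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Elastic net penalty: a weighted mix of the l1 norm and the squared l2 norm."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


coefs = [0.5, -1.0, 0.0]  # illustrative coefficients, not fitted values

lasso_only = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=1.0)  # pure l1 (Lasso)
ridge_only = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.0)  # pure l2 (ridge)
mixed = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=0.9)       # mostly l1, some l2

print(lasso_only, ridge_only, mixed)
```

With `l1_ratio` close to 1 the penalty still drives many coefficients to exactly zero, while the small l2 share lets correlated features share weight instead of one being dropped arbitrarily.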

## Init

In [1]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Project root = parent of the current working directory
# (avoids shadowing the built-in `list` and hard-coding the Windows separator)
ROOT = os.path.dirname(os.getcwd())
print(ROOT)
DATA_PATH = os.path.join(ROOT, 'datasets', 'preprocessed')

SEED = 42

d:\Universita\2 anno magistrale\Progetto BioInf\miRNA_to_age


In [3]:
dataset = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_log.csv'))

In [4]:
print(dataset.shape)
print(dataset.columns)
# print(dataset.head())
print(type(dataset.iloc[0]['days_to_death']))

(760, 1896)
Index(['days_to_death', 'age_at_initial_pathologic_diagnosis',
       'days_to_last_followup', 'Death', 'pathologic_stage_Stage I',
       'pathologic_stage_Stage IA', 'pathologic_stage_Stage IB',
       'pathologic_stage_Stage II', 'pathologic_stage_Stage IIA',
       'pathologic_stage_Stage IIB',
       ...
       'hsa-mir-941-5', 'hsa-mir-942', 'hsa-mir-943', 'hsa-mir-944',
       'hsa-mir-95', 'hsa-mir-9500', 'hsa-mir-96', 'hsa-mir-98', 'hsa-mir-99a',
       'hsa-mir-99b'],
      dtype='object', length=1896)
<class 'numpy.float64'>


## Hyper-parameters

In [5]:
num_folds = 10
scoring = 'accuracy'
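Note that `'accuracy'` is a classification metric. With censored survival targets, a rank-based metric such as Harrell's concordance index is the usual choice, and it is what scikit-survival estimators return from `score`. The pure-Python sketch below shows the idea under a simplifying assumption: pairs with tied times are skipped. The function name and the toy values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: fraction of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i had the event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [5, 10, 15, 20]
events = [True, False, True, True]   # the second sample is censored
risk = [0.9, 0.4, 0.1, 0.6]

c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.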

In [43]:
from sksurv.datasets import load_breast_cancer
from sksurv.preprocessing import OneHotEncoder

X,y = load_breast_cancer()

Xt = OneHotEncoder().fit_transform(X)
Xt.round(2).head()
print(Xt.select_dtypes(include=['object', 'category']).columns.tolist())
print(Xt.var())  # call the method; printing `Xt.var` alone only shows the bound method

[]

## Data

In [30]:
unique, count = np.unique(dataset['age_at_initial_pathologic_diagnosis'], return_counts=True)
to_drop = [unique[u] for u in range(len(unique)) if count[u] < 5]
print(to_drop)
print(dataset.shape)

dataset=dataset[~dataset['age_at_initial_pathologic_diagnosis'].isin(to_drop)]
print(dataset.shape)

[np.int64(26), np.int64(27), np.int64(30), np.int64(31), np.int64(32), np.int64(83), np.int64(85), np.int64(86), np.int64(88), np.int64(89)]
(760, 1896)
(739, 1896)


In [47]:
np.isinf(dataset).sum()


days_to_death                          0
age_at_initial_pathologic_diagnosis    0
days_to_last_followup                  0
Death                                  0
pathologic_stage_Stage I               0
                                      ..
hsa-mir-9500                           0
hsa-mir-96                             0
hsa-mir-98                             0
hsa-mir-99a                            0
hsa-mir-99b                            0
Length: 1896, dtype: int64

In [31]:
# y = death event and days_to_death / days_to_last_followup
# X = all the rest
y_cols = ['Death', 'days_to_death', 'days_to_last_followup']
X_cols = [col for col in dataset.columns if col not in y_cols]

# Structured dtype expected by scikit-survival: (event indicator, time)
custom_dtype = np.dtype([
    ('death', np.bool_),      # or 'bool'
    ('days', np.float64)      # or 'float'
])

y = []
for index, row in dataset[y_cols].iterrows():
    if row['Death'] == 1:
        # observed event: the time is the number of days to death
        y.append(np.array((True, row['days_to_death'].item()), dtype=custom_dtype))
    elif row['Death'] == 0:
        # censored: the time is the number of days to the last follow-up
        record = (False, row['days_to_last_followup'].item())
        y.append(np.array(record, dtype=custom_dtype))
y = np.array(y)

X = dataset[X_cols]
# remove columns with zero-variance
# print(X.shape)
X = X.loc[:, X.var() != 0]
# print(X.shape)
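The row-by-row loop that builds `y` can also be written with array operations. The sketch below is one possible equivalent, assuming `Death` only takes the values 0 and 1; the helper name and the toy inputs are hypothetical.

```python
import numpy as np


def make_survival_array(death, days_to_death, days_to_last_followup):
    """Build a structured (event, time) array without iterating over rows."""
    death = np.asarray(death).astype(bool)
    # Use days to death for observed events, days to last follow-up for censored samples.
    days = np.where(death, days_to_death, days_to_last_followup).astype(np.float64)
    y = np.empty(death.shape[0], dtype=[("death", np.bool_), ("days", np.float64)])
    y["death"] = death
    y["days"] = days
    return y


# Toy input: NaN marks the column that does not apply to a given patient.
y_demo = make_survival_array(
    death=[1, 0, 1],
    days_to_death=[120.0, np.nan, 45.0],
    days_to_last_followup=[np.nan, 300.0, np.nan],
)
print(y_demo)
```

Unlike the loop, this version never silently skips a row, so `X` and `y` always keep the same length.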


## Z-scaling

In [32]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
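Z-scaling computes `(x - mean) / std` per column. The cell above fits the scaler on all samples; a common variant, shown here only as a sketch with made-up numbers, estimates the mean and standard deviation on the training rows and reuses them for the test rows, so that no test statistics leak into training. The helper names are hypothetical.

```python
import numpy as np


def fit_z_scaler(train):
    """Per-column mean and population standard deviation (ddof=0, as in StandardScaler)."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return mean, std


def apply_z_scaler(data, mean, std):
    """Scale data with statistics estimated elsewhere (for example, on the training set)."""
    return (data - mean) / std


train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[3.0, 10.0]])

mean, std = fit_z_scaler(train)
train_scaled = apply_z_scaler(train, mean, std)
test_scaled = apply_z_scaler(test, mean, std)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0), test_scaled)
```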

## Data splitting

In [33]:
from sklearn.model_selection import train_test_split

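# Split the Z-scaled features (the raw X was used before, leaving `scaled` unused);
# stratify on the raw age values and fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(scaled, y, test_size=0.2, stratify=X['age_at_initial_pathologic_diagnosis'], random_state=SEED)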

## K-fold

In [34]:
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=SEED)
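`StratifiedKFold` needs class labels to stratify on, and the structured survival array is not one. One option, assumed here rather than taken from the notebook, is to stratify on the event indicator so that every fold contains a similar share of observed events. The data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)

# Synthetic example: 20 samples, half with an observed event (assumption).
X_demo = rng.normal(size=(20, 3))
event_demo = np.array([True] * 10 + [False] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count the observed events that land in each test fold.
events_per_fold = [
    int(event_demo[test_idx].sum())
    for _, test_idx in skf.split(X_demo, event_demo)
]
print(events_per_fold)
```

With the notebook's own target, the label passed to `split` would be `y['death']`.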

## Elastic net (Lasso-Cox)

In [35]:
def plot_coefficients(coefs, n_highlight):
    _, ax = plt.subplots(figsize=(9, 6))
    alphas = coefs.columns
    for row in coefs.itertuples():
        ax.semilogx(alphas, row[1:], ".-", label=row.Index)

    alpha_min = alphas.min()
    top_coefs = coefs.loc[:, alpha_min].map(abs).sort_values().tail(n_highlight)
    for name in top_coefs.index:
        coef = coefs.loc[name, alpha_min]
        plt.text(alpha_min, coef, name + "   ", horizontalalignment="right", verticalalignment="center")

    ax.yaxis.set_label_position("right")
    ax.yaxis.tick_right()
    ax.grid(True)
    ax.set_xlabel("alpha")
    ax.set_ylabel("coefficient")

In [None]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

# Note: in CoxPHSurvivalAnalysis, alpha is a ridge (L2) penalty, so this loop traces a ridge path;
# the elastic net (L1 + L2) penalty is provided by CoxnetSurvivalAnalysis.
alphas = 10.0 ** np.linspace(-2, 2, 100)
coefficients = {}

print(alphas)

cph = CoxPHSurvivalAnalysis()
for alpha in alphas:
    cph.set_params(alpha=alpha)
    cph.fit(X_train, y_train)
    key = round(alpha, 5)
    coefficients[key] = cph.coef_
    print(f"Finished fitting for alpha coefficient : {alpha}")

# One row per feature, one column per alpha
coefficients = pd.DataFrame.from_dict(coefficients).rename_axis(index="feature", columns="alpha").set_index(X.columns)

[1.00000000e-02 1.20679264e-02 1.45634848e-02 1.75751062e-02
 2.12095089e-02 2.55954792e-02 3.08884360e-02 3.72759372e-02
 4.49843267e-02 5.42867544e-02 6.55128557e-02 7.90604321e-02
 9.54095476e-02 1.15139540e-01 1.38949549e-01 1.67683294e-01
 2.02358965e-01 2.44205309e-01 2.94705170e-01 3.55648031e-01
 4.29193426e-01 5.17947468e-01 6.25055193e-01 7.54312006e-01
 9.10298178e-01 1.09854114e+00 1.32571137e+00 1.59985872e+00
 1.93069773e+00 2.32995181e+00 2.81176870e+00 3.39322177e+00
 4.09491506e+00 4.94171336e+00 5.96362332e+00 7.19685673e+00
 8.68511374e+00 1.04811313e+01 1.26485522e+01 1.52641797e+01
 1.84206997e+01 2.22299648e+01 2.68269580e+01 3.23745754e+01
 3.90693994e+01 4.71486636e+01 5.68986603e+01 6.86648845e+01
 8.28642773e+01 1.00000000e+02]


In [None]:
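from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sksurv.linear_model import CoxnetSurvivalAnalysis

# Assumption: candidate alphas come from one initial fit, as in the scikit-survival user guide;
# X_train / y_train replace the breast cancer demo data (Xt, y) so the search runs on the miRNA dataset.
coxnet_pipe = make_pipeline(StandardScaler(), CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01))
estimated_alphas = coxnet_pipe.fit(X_train, y_train).named_steps["coxnetsurvivalanalysis"].alphas_

cv = KFold(n_splits=5, shuffle=True, random_state=0)
gcv = GridSearchCV(
    make_pipeline(StandardScaler(), CoxnetSurvivalAnalysis(l1_ratio=0.9)),
    param_grid={"coxnetsurvivalanalysis__alphas": [[v] for v in map(float, estimated_alphas)]},
    cv=cv,
    error_score=0.5,
    n_jobs=1,
).fit(X_train, y_train)

cv_results = pd.DataFrame(gcv.cv_results_)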
