# Cox regression and hazard ratio model

Cox regression analyses each input variable and estimates a hazard ratio for it. A hazard ratio is not a probability but a ratio of hazard rates: a value >1 indicates that the risk of the event increases as the value of the variable increases (or when the variable is present, for boolean variables), while a value <1 indicates that the risk decreases.
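
As a small illustration of how a hazard ratio is obtained from a Cox coefficient, the sketch below exponentiates a few coefficients. The feature names and coefficient values are invented for the example; they are not results from this dataset.

```python
import numpy as np

# Hypothetical Cox coefficients (log hazard ratios); values are invented.
coef = {"age": 0.30, "hsa-mir-example-a": -0.20, "hsa-mir-example-b": 0.0}

# Hazard ratio for a one-unit increase of each feature: HR = exp(beta)
hazard_ratios = {name: float(np.exp(beta)) for name, beta in coef.items()}

for name, hr in hazard_ratios.items():
    if hr > 1:
        effect = "higher risk"
    elif hr < 1:
        effect = "lower risk"
    else:
        effect = "no effect"
    print(f"{name}: HR = {hr:.3f} ({effect})")
```

Because the features are Z-scaled later in the notebook, "one unit" here would correspond to one standard deviation of the original variable.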


Input: feature vectors with the patient's age at diagnosis, days to last follow-up, a boolean death event, and a miRNA-seq vector with log- and quantile-normalized values.

Pipeline:
   - Scaling with **Z-scaler** on the age and miRNA-seq fields
   - Fit an elastic net model with ```sksurv.linear_model.CoxnetSurvivalAnalysis``` (scikit-survival at: https://scikit-survival.readthedocs.io/en/stable/user_guide/coxnet.html)
      - Apply grid search and K-fold cross-validation to find the best set of parameters
   - Compute the risk score with the predict function
        - the survival function or cumulative hazard function can also be predicted, but the model must then be fitted with the parameter ```fit_baseline_model=True```
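
To make the link between the risk score and the survival function concrete, here is a minimal NumPy sketch of the Cox model identity S(t | x) = exp(-H0(t) * exp(x·β)). The baseline cumulative hazard, the coefficients, and the patient vectors are all invented for illustration; in the actual pipeline the baseline would be estimated by the library, and its risk score may differ from the plain linear predictor by a constant offset.

```python
import numpy as np

# Assumed baseline cumulative hazard H0(t) on a small time grid (invented values)
times = np.array([100.0, 500.0, 1000.0, 2000.0])
baseline_cum_hazard = np.array([0.01, 0.05, 0.12, 0.30])

# Hypothetical coefficients and two standardized patient vectors
beta = np.array([0.4, -0.3])
X_new = np.array([[1.0, 0.0],    # patient A
                  [-1.0, 1.0]])  # patient B

# Risk score = linear predictor x . beta
risk_score = X_new @ beta

# Cox model: H(t | x) = H0(t) * exp(risk score), S(t | x) = exp(-H(t | x))
cum_hazard = np.outer(np.exp(risk_score), baseline_cum_hazard)
survival = np.exp(-cum_hazard)

print(risk_score)
print(survival.round(3))
```

A higher risk score gives a lower survival curve at every time point, which is why the risk score alone is enough for ranking patients, while absolute survival probabilities need the baseline as well.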

Motivations:
   - Z-scaler to bring the predictors onto the same scale, with mean 0 and variance 1
   - Penalized Cox regression to perform feature selection and keep only the most relevant miRNAs
   - Elastic net because plain Lasso-Cox is suboptimal for two reasons: it cannot select more features than there are samples, and within a group of highly correlated features it tends to pick only one of them, essentially arbitrarily. Elastic net addresses both by combining the l1 and l2 penalties, which makes the selection more robust
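
The combination of l1 and l2 mentioned above can be sketched numerically. The function below uses the common elastic net parameterization alpha * (l1_ratio * ||β||_1 + (1 - l1_ratio) / 2 * ||β||_2^2); the function name and the coefficient values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, l1_ratio):
    """alpha * (l1_ratio * ||beta||_1 + (1 - l1_ratio) / 2 * ||beta||_2^2)."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

beta = np.array([0.5, -0.5, 0.0, 2.0])  # invented coefficients

lasso = elastic_net_penalty(beta, alpha=0.1, l1_ratio=1.0)  # pure l1
ridge = elastic_net_penalty(beta, alpha=0.1, l1_ratio=0.0)  # pure l2
mixed = elastic_net_penalty(beta, alpha=0.1, l1_ratio=0.9)  # mostly l1
print(lasso, ridge, mixed)
```

With `l1_ratio` close to 1 the penalty still drives many coefficients to exactly zero, while the small l2 share helps spread weight across correlated features instead of keeping a single one.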

## Init

In [2]:
import pandas as pd
import os
import numpy as np

In [3]:
# Project root is the parent of the current working directory
ROOT = os.path.dirname(os.getcwd())
print(ROOT)
DATA_PATH = os.path.join(ROOT, 'datasets', 'preprocessed')

d:\Universita\2 anno magistrale\Progetto BioInf\miRNA_to_age


In [4]:
dataset = pd.read_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized.csv'))

In [5]:
print(dataset.shape)
print(dataset.columns)
# print(dataset.head())
print(type(dataset.iloc[0]['days_to_death']))

(760, 1896)
Index(['days_to_death', 'age_at_initial_pathologic_diagnosis',
       'days_to_last_followup', 'Death', 'pathologic_stage_Stage I',
       'pathologic_stage_Stage IA', 'pathologic_stage_Stage IB',
       'pathologic_stage_Stage II', 'pathologic_stage_Stage IIA',
       'pathologic_stage_Stage IIB',
       ...
       'hsa-mir-941-5', 'hsa-mir-942', 'hsa-mir-943', 'hsa-mir-944',
       'hsa-mir-95', 'hsa-mir-9500', 'hsa-mir-96', 'hsa-mir-98', 'hsa-mir-99a',
       'hsa-mir-99b'],
      dtype='object', length=1896)
<class 'numpy.float64'>


## Z-scaling

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

columns = ['age_at_initial_pathologic_diagnosis']
columns.extend([col for col in dataset.columns if col.startswith('hsa')])

# print(len(columns))

scaled = pd.DataFrame(scaler.fit_transform(dataset[columns]), columns=columns)
# print(scaled.head())

for col in scaled.columns:
    dataset[col] = scaled[col]

print(dataset.head())

# dataset.to_csv(os.path.join(DATA_PATH, 'clinical_miRNA_normalized_scaled.csv'), index=False)

   days_to_death  age_at_initial_pathologic_diagnosis  days_to_last_followup  \
0           -1.0                             1.027759                 1918.0   
1           -1.0                            -0.342319                 1309.0   
2           -1.0                            -0.190088                    0.0   
3           -1.0                             0.494951                  212.0   
4         2763.0                            -0.875128                 2763.0   

   Death  pathologic_stage_Stage I  pathologic_stage_Stage IA  \
0      0                       0.0                        0.0   
1      0                       1.0                        0.0   
2      0                       0.0                        0.0   
3      0                       0.0                        0.0   
4      1                       1.0                        0.0   

   pathologic_stage_Stage IB  pathologic_stage_Stage II  \
0                        0.0                        0.0   
1         

In [8]:
dataset.describe()

Unnamed: 0,days_to_death,age_at_initial_pathologic_diagnosis,days_to_last_followup,Death,pathologic_stage_Stage I,pathologic_stage_Stage IA,pathologic_stage_Stage IB,pathologic_stage_Stage II,pathologic_stage_Stage IIA,pathologic_stage_Stage IIB,...,hsa-mir-941-5,hsa-mir-942,hsa-mir-943,hsa-mir-944,hsa-mir-95,hsa-mir-9500,hsa-mir-96,hsa-mir-98,hsa-mir-99a,hsa-mir-99b
count,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0,...,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0,760.0
mean,137.390789,-2.126954e-16,793.760526,0.094737,0.086842,0.086842,0.006579,0.005263,0.353947,0.231579,...,0.0,-1.238775e-16,6.544473e-16,6.07701e-17,-8.239023e-17,0.0,-1.449133e-16,-1.454976e-16,-1.659491e-16,-1.706237e-16
std,538.622812,1.000659,967.188233,0.293044,0.281789,0.281789,0.080897,0.072404,0.478508,0.422119,...,0.0,1.000659,1.000659,1.000659,1.000659,0.0,1.000659,1.000659,1.000659,1.000659
min,-1.0,-2.397437,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-1.581033,-0.3917041,-1.029881,-1.403227,0.0,-1.732017,-1.78667,-1.786656,-1.786655
25%,-1.0,-0.7228968,133.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-0.7365276,-0.3917041,-0.5243329,-0.7546338,0.0,-0.7095448,-0.7123219,-0.7085968,-0.7083329
50%,-1.0,0.03825804,444.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-0.1882608,-0.3917041,-0.1999542,-0.1681678,0.0,-0.2021854,-0.2096399,-0.2077656,-0.208685
75%,-1.0,0.6471819,1172.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.4627578,-0.3917041,0.401548,0.4871473,0.0,0.4727309,0.464641,0.4712945,0.4712942
max,4456.0,2.473953,6796.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,5.338028,6.478936,5.380787,5.341693,0.0,5.33556,5.335512,5.335469,5.335465
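
The scaled columns above report a standard deviation of 1.000659 rather than exactly 1. A likely explanation is the different degrees of freedom: `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The sketch below reproduces the effect on a synthetic column with the same number of rows; the data are randomly generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd

n = 760  # number of rows in the dataset above
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=60.0, scale=12.0, size=n))  # synthetic "age" column

# Z-scaling with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)          # population std of the scaled column
sample_std = z.std()             # pandas default is the sample std (ddof=1)
expected = np.sqrt(n / (n - 1))  # ratio between the two estimators

print(pop_std, sample_std, expected)
```

The sample standard deviation equals sqrt(760 / 759), roughly 1.000659, which matches the `std` row of the table, so the scaling itself is behaving as intended.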


In [15]:
# Check that every value in the 'Death' column is 0 or 1
if dataset['Death'].isin([0, 1]).all():
    print("All values in the 'Death' column are 0 or 1.")
else:
    print("The 'Death' column contains values other than 0 or 1.")

All values in the 'Death' column are 0 or 1.


## Elastic net (Lasso-Cox)

In [None]:
# y = death event and days_to_death / days_to_last_followup
# X = all the rest
y_cols = ['Death', 'days_to_death', 'days_to_last_followup']
X_cols = [col for col in dataset.columns if col not in y_cols]

custom_dtype = np.dtype([
    ('death', np.bool_),   # or 'bool'
    ('days', np.float64),  # or 'float'
])

# Observed time: days_to_death for deaths, days_to_last_followup for censored patients
event = (dataset['Death'] == 1).to_numpy()
days = np.where(event, dataset['days_to_death'], dataset['days_to_last_followup'])

# Structured array with one (death, days) record per patient
y = np.empty(len(dataset), dtype=custom_dtype)
y['death'] = event
y['days'] = days

X = dataset[X_cols]
# remove columns with zero-variance
X = X.loc[:, X.var() != 0]

In [63]:
import matplotlib.pyplot as plt

def plot_coefficients(coefs, n_highlight):
    _, ax = plt.subplots(figsize=(9, 6))
    alphas = coefs.columns
    for row in coefs.itertuples():
        ax.semilogx(alphas, row[1:], ".-", label=row.Index)

    alpha_min = alphas.min()
    top_coefs = coefs.loc[:, alpha_min].map(abs).sort_values().tail(n_highlight)
    for name in top_coefs.index:
        coef = coefs.loc[name, alpha_min]
        plt.text(alpha_min, coef, name + "   ", horizontalalignment="right", verticalalignment="center")

    ax.yaxis.set_label_position("right")
    ax.yaxis.tick_right()
    ax.grid(True)
    ax.set_xlabel("alpha")
    ax.set_ylabel("coefficient")

In [91]:
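from sksurv.linear_model import CoxPHSurvivalAnalysis

# Note: CoxPHSurvivalAnalysis applies a ridge (l2) penalty through alpha;
# it is not the elastic net model (CoxnetSurvivalAnalysis) named in the pipeline.
alphas = 10.0 ** np.linspace(-4, 2, 50)
coefficients = {}

# Refit the model for each penalty strength and store its coefficients
cph = CoxPHSurvivalAnalysis()
for alpha in alphas:
    cph.set_params(alpha=alpha)
    cph.fit(X, y)
    key = round(alpha, 5)
    coefficients[key] = cph.coef_

# Rows are features, columns are alpha values
coefficients = pd.DataFrame.from_dict(coefficients).rename_axis(index="feature", columns="alpha").set_index(X.columns)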

  risk_set2 += np.exp(xw[k])
  res = np.abs(1 - (loss_new / loss))
  exp_xw = np.exp(offset + np.dot(x, w))
  risk_set_x2 += exp_xw[k] * xk
  risk_set_xx2 += exp_xw[k] * xx
  z = risk_set_x / risk_set
  a = risk_set_xx / risk_set


ValueError: LAPACK reported an illegal value in 5-th argument.
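
The exact cause of this error is not established here. One plausible contributor is that the design matrix has more features than samples, so with a very small penalty the second-derivative matrix is close to singular, and the linear predictor can grow until `np.exp` overflows. The toy sketch below only illustrates those two numerical effects with random data and made-up sizes; it is not a diagnosis of the failure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5, 8  # toy sizes with more features than samples
X_toy = rng.normal(size=(n_samples, n_features))

# An (n_features x n_features) matrix built from n_samples rows has rank <= n_samples
gram = X_toy.T @ X_toy
rank_unpenalized = np.linalg.matrix_rank(gram)

# Adding a ridge term alpha * I makes the matrix full rank
alpha = 1.0
rank_penalized = np.linalg.matrix_rank(gram + alpha * np.eye(n_features))

# exp() of a large linear predictor overflows to inf
with np.errstate(over="ignore"):
    overflowed = np.exp(1000.0)

print(rank_unpenalized, rank_penalized, overflowed)
```

If this is indeed what is happening, starting the alpha grid at a larger value, or switching to the elastic net estimator planned in the pipeline, would be natural things to try.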

In [None]:
# Check that every value in the 'Death' column is 0 or 1.
# X is built from X_cols, which excludes 'Death', so the check runs on dataset.
if dataset['Death'].isin([0, 1]).all():
    print("All values in the 'Death' column are 0 or 1.")
else:
    print("The 'Death' column contains values other than 0 or 1.")

In [89]:
print(X.shape)
X = X.loc[:, X.var() != 0]
print(X.shape)

(760, 1893)
(760, 1585)


In [None]:
plot_coefficients(coefficients, n_highlight=5)