<a href="https://colab.research.google.com/github/ambwhl/datasci_223/blob/Final-Project/Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Final Project: Kidney Renal Clear Cell Carcinoma (KIRC) Survival Prediction and Biomarker Identification

### Introduction
I am interested in building prognostic prediction models and identifying potential biomarkers for cancer patients. The data used in this project are a subset of the kidney renal clear cell carcinoma (KIRC) dataset, containing 243 patients (subjects) and 16,380 variables. The variables (predictors) span several data types: clinical covariates (e.g. age, gender, cancer stage, and tumor grade), messenger RNA (mRNA) expression, microRNA (miRNA) expression, and copy number variation (CNV). Survival status (`OS_vital_status`) is the outcome to predict.
### Strategy and methods
1. The dataset was downloaded from the public project below, then combined and subset: https://www.synapse.org/#!Synapse:syn1710282/wiki/27303


In [3]:
# Uncomment and install below packages if not already installed
#%pip install -q numpy pandas scikit-learn

In [None]:
## environment requirement
%reset -f
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


In [25]:
## 1. Read in the preprocessed KIRC survival data
file_id = '1TX0MHLrz51wpUBCalsTpivVnBSFBvIGJ'
link = f'https://drive.google.com/uc?export=download&id={file_id}'
data = pd.read_csv(link)
print('subject number:', len(data.index))
# Subtract the 'feature' (subject index) and 'OS_vital_status' (survival outcome) columns
print('variable number:', len(data.columns) - 2)
print('example of variables:', data.columns[1:11])


subject number: 243
variable number: 16380
example of variables: Index(['age', 'gender', 'grade', 'stage', 'OS_vital_status', 'CNV_20q',
       'CNV_20p', 'CNV_Xq11.2', 'CNV_14q', 'CNV_17q24.3'],
      dtype='object')


In [28]:
## 2. Remove variables whose values are all 0
all_zero_cols = data.columns[(data == 0).all()]
newdata = data.drop(columns=all_zero_cols)

# Use the subject ID as the index and mark the clinical factors as categorical
newdata = newdata.set_index('feature')
for col in ['gender', 'grade', 'stage']:
    newdata[col] = newdata[col].astype('category')
# Subtract 1 for the outcome column 'OS_vital_status'
print("variable numbers after cleaning:", len(newdata.columns) - 1)

variable numbers after cleaning: 15882
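The filter above tests every column, including the subject ID and the outcome. A slightly more defensive variant restricts the check to numeric predictor columns, so the outcome can never be dropped by accident. The sketch below uses a made-up toy frame (`toy`, `mRNA_A`, `mRNA_B` are illustrative names, not columns from the KIRC data):

```python
import pandas as pd

# Toy frame standing in for the KIRC table (all values are made up)
toy = pd.DataFrame({
    "feature": ["s1", "s2", "s3"],
    "OS_vital_status": [0, 1, 0],
    "mRNA_A": [0.0, 0.0, 0.0],   # all zeros: carries no information
    "mRNA_B": [1.2, 0.0, 3.4],   # some zeros: kept
})

# Only inspect numeric predictors; the ID and outcome columns are excluded up front
predictors = toy.drop(columns=["feature", "OS_vital_status"]).select_dtypes("number")
all_zero_cols = predictors.columns[(predictors == 0).all()]
cleaned = toy.drop(columns=all_zero_cols)

print("dropped:", list(all_zero_cols))
print("shape after cleaning:", cleaned.shape)
```

On the real data the result should be the same as long as the outcome column contains both classes.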


In [6]:
## 3. Split into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X = newdata.drop(columns=['OS_vital_status'])
Y = newdata['OS_vital_status']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
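With only 243 subjects, a purely random split can leave the training and test sets with noticeably different proportions of events. One option, not used in the cell above, is to pass `stratify=` to `train_test_split`. The sketch below illustrates the effect on synthetic data (`X_demo` and `y_demo` are made-up stand-ins for `X` and `Y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 subjects, 5 features, 30% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio (almost) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positive rate, train:", y_tr.mean())
print("positive rate, test: ", y_te.mean())
```

Applied here, this would mean adding `stratify=Y` to the call above. Doing so changes which subjects land in each set, so the metrics reported below would not be reproduced exactly.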

In [7]:
## 4. Train and cross-validate basic models
# Define classifiers
classifiers = {
    'Gradient Boosting': GradientBoostingClassifier(),
    #'Logistic Regression': LogisticRegression(),
    'Neural Network': MLPClassifier(),
    'SVM': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'kNN': KNeighborsClassifier()
}

# Define evaluation metrics
scoring = {
    'AUC': 'roc_auc',
    'Accuracy': 'accuracy',
    'F1 Score': 'f1',
    'Precision': 'precision',
    'Recall': 'recall'
}

# Perform cross-validation and calculate evaluation metrics for each classifier
for clf_name, clf in classifiers.items():
    print(f"Evaluation metrics for {clf_name}:")
    for metric_name, scoring_method in scoring.items():
        scores = cross_val_score(clf, X_train, Y_train, cv=KFold(n_splits=5, shuffle=True, random_state=42), scoring=scoring_method)
        print(f"{metric_name}: {np.mean(scores):.4f} (std: {np.std(scores):.4f})")
    print("\n")

Evaluation metrics for Gradient Boosting:
AUC: 0.7711 (std: 0.0823)
Accuracy: 0.7647 (std: 0.0720)
F1 Score: 0.6300 (std: 0.1077)
Precision: 0.7297 (std: 0.1735)
Recall: 0.6051 (std: 0.1317)


Evaluation metrics for Neural Network:
AUC: 0.6200 (std: 0.0940)
Accuracy: 0.6294 (std: 0.0941)
F1 Score: 0.1594 (std: 0.2386)
Precision: 0.4813 (std: 0.2574)
Recall: 0.5213 (std: 0.4306)


Evaluation metrics for SVM:
AUC: 0.6819 (std: 0.0637)
Accuracy: 0.6529 (std: 0.0506)
F1 Score: 0.2912 (std: 0.1357)
Precision: 0.7309 (std: 0.3393)
Recall: 0.2195 (std: 0.1525)


Evaluation metrics for Decision Tree:
AUC: 0.6146 (std: 0.0693)
Accuracy: 0.6235 (std: 0.0753)
F1 Score: 0.5433 (std: 0.1055)
Precision: 0.5104 (std: 0.1199)
Recall: 0.5118 (std: 0.1142)


Evaluation metrics for Random Forest:
AUC: 0.7707 (std: 0.0788)
Accuracy: 0.7412 (std: 0.0600)
F1 Score: 0.6171 (std: 0.1074)
Precision: 0.7143 (std: 0.1229)
Recall: 0.5585 (std: 0.1900)


Evaluation metrics for kNN:
AUC: 0.6097 (std: 0.1224)
Accuracy: 0.6412 (std: 0.0681)
F1 Score: 0.3193 (std: 0.1763)
Precision: 0.5413 (std: 0.3495)
Recall: 0.2469 (std: 0.1453)
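The loop above calls `cross_val_score` once per metric, so every model is refit five times for each of the five metrics. Because most classifiers are created without a `random_state`, the metrics for one model may come from different fits. A possible alternative is `cross_validate`, which computes all metrics from the same fits. The sketch below runs on synthetic data from `make_classification`. The stratified folds and the scaling step for the SVM are assumptions of this sketch, not choices made in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, Y_train)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, weights=[0.65, 0.35], random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Scaling inside the pipeline is refit on each training fold, avoiding leakage
    "SVM (scaled)": make_pipeline(StandardScaler(), SVC(random_state=0)),
}
scoring = ["roc_auc", "accuracy", "f1", "precision", "recall"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    # One call fits each fold once and scores every metric on that same fit
    res = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    results[name] = {m: (res[f"test_{m}"].mean(), res[f"test_{m}"].std()) for m in scoring}
    print(f"Evaluation metrics for {name}:")
    for m, (mean, std) in results[name].items():
        print(f"  {m}: {mean:.4f} (std: {std:.4f})")
```

Fixing `random_state` on each estimator also makes the numbers repeatable between runs.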




In [8]:
## 5. Stack the three models with the highest cross-validated AUC

# Base models
gradient_boosting = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(random_state=0))
random_forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Stacking classifier
stacked_model = StackingClassifier(
    estimators=[
        ('gradient_boosting', gradient_boosting),
        ('svm', svm),
        ('random_forest', random_forest)
    ],
    final_estimator=LogisticRegression()
)

# Fit the stacked model on the training data
stacked_model.fit(X_train, Y_train)

# Predict positive-class probabilities for the test set
y_proba = stacked_model.predict_proba(X_test)[:, 1]

# Predict classes for the test set
y_pred = stacked_model.predict(X_test)

# Compute metrics
auc = roc_auc_score(Y_test, y_proba)
accuracy = accuracy_score(Y_test, y_pred)
f1 = f1_score(Y_test, y_pred)
precision = precision_score(Y_test, y_pred)
recall = recall_score(Y_test, y_pred)

# Collect the metrics
metrics_stacked = {
    'AUC': auc,
    'Accuracy': accuracy,
    'F1 Score': f1,
    'Precision': precision,
    'Recall': recall
}
print(metrics_stacked)

{'AUC': 0.8126984126984127, 'Accuracy': 0.7397260273972602, 'F1 Score': 0.5777777777777777, 'Precision': 0.7647058823529411, 'Recall': 0.4642857142857143}
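The test recall (0.46) is low relative to the AUC (0.81), which suggests that the default 0.5 probability cutoff used by `predict` may not be the best operating point. The sketch below shows how lowering the cutoff applied to `predict_proba` output trades precision for recall. The labels and probabilities are made up for illustration and are not model output:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for ten subjects
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
proba = np.array([0.10, 0.20, 0.35, 0.60, 0.30, 0.45, 0.70, 0.90, 0.40, 0.55])

threshold_metrics = {}
for threshold in (0.5, 0.3):
    # Classify as positive when the probability reaches the cutoff
    pred = (proba >= threshold).astype(int)
    threshold_metrics[threshold] = {
        "precision": precision_score(y_true, pred),
        "recall": recall_score(y_true, pred),
    }
    print(threshold, threshold_metrics[threshold])
```

If a cutoff is tuned this way, it should be chosen on the training data (for example within cross-validation) rather than on `X_test`, so that the test metrics stay unbiased.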


In [16]:
## 6. Pin down candidate biomarkers by top feature importance
predictors = X.columns.tolist()

# Extract feature importances from base estimators
gb_feature_importances = stacked_model.named_estimators_['gradient_boosting'].feature_importances_
rf_feature_importances = stacked_model.named_estimators_['random_forest'].feature_importances_

# Average feature importances across the two tree-based base estimators
# (the SVM pipeline does not expose feature_importances_)
avg_feature_importances = (gb_feature_importances + rf_feature_importances) / 2

# Sort the feature importances in descending order and get the top 10
top_indices = np.argsort(avg_feature_importances)[::-1][:10]
top = [(predictors[i], avg_feature_importances[i]) for i in top_indices]

# Display the top features and their importances
for feature, importance in top:
    print(f"{feature}: {importance}")


mRNA_PSRC1|84722: 0.16786182986701073
mRNA_TPX2|22974: 0.05760733865569538
stage: 0.05708578005415763
mRNA_NCOR1|9611: 0.037316728895259416
mRNA_HK3|3101: 0.02725911232028246
mRNA_PMPCA|23203: 0.025781179652827047
mRNA_CYP26B1|56603: 0.02566115735037458
mRNA_CITED2|10370: 0.02443062193144756
miRNA_hsa-mir-335: 0.016659761568937028
mRNA_FAM177B|400823: 0.013477700423530253
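To look these candidates up in external resources, it can help to split each feature name into its data type, symbol, and numeric ID. The helper below is a hypothetical sketch (`parse_feature` is not part of the notebook). It assumes that names follow the `type_symbol|id` pattern seen in the output above and that the trailing number is a gene identifier, which should be checked against the data source.

```python
# Prefixes observed in the column names of this dataset
OMICS_PREFIXES = {"mRNA", "miRNA", "CNV"}

def parse_feature(name):
    """Split a feature name into (data_type, symbol, gene_id).

    Names without a known omics prefix are treated as clinical covariates.
    gene_id is None when the name carries no '|<id>' suffix.
    """
    prefix, sep, rest = name.partition("_")
    if not sep or prefix not in OMICS_PREFIXES:
        return ("clinical", name, None)
    symbol, _, gene_id = rest.partition("|")
    return (prefix, symbol, gene_id or None)

for feature in ["mRNA_PSRC1|84722", "miRNA_hsa-mir-335", "stage", "CNV_20q"]:
    print(feature, "->", parse_feature(feature))
```
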
