For this project, you will create a machine learning model to predict the stage of cancer (I, II, III,
or IV) from both RNA and protein-level gene expression for clear cell renal cell carcinoma
(CCRCC) in CPTAC. Stage of cancer can be found using the tumor_stage_pathological column
within the CPTAC clinical data. You can access the data the exact same way as BRCA,
substituting the accession code.

1) Select what features to include in the model by finding the top 5 most differentially
expressed proteins between Stage I and Stage III patients in CPTAC protein data. Repeat
this process to find the top 5 most differentially expressed RNAs between Stage I and Stage
III patients in the CPTAC RNA data.

    a) Use tumor_stage_pathological in the CPTAC clinical data

In [1]:
import pandas as pd
import numpy as np
import os
os.chdir('/Users/erikali/Desktop/QBIO/qbio_490_ErikaLi/analysis_data')

In [4]:
import cptac
cptac.download(dataset="Ccrcc")
ccrcc = cptac.Ccrcc()

                                          

In [134]:
protein_data = ccrcc.get_proteomics()
protein_data.columns = protein_data.columns.get_level_values(0) 
protein_data

Patient_ID,A1BG,A1CF,A2M,A4GALT,AAAS,AACS,AADAC,AADAT,AAED1,AAGAB,...,ZSWIM8,ZSWIM9,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
C3L-00004,-0.304302,0.641447,-0.000025,,0.207831,-0.364128,,-1.203886,-0.217934,0.216894,...,-0.064343,,-0.094441,,,-0.021827,0.133927,0.237280,0.114409,
C3L-00010,1.195915,0.194620,1.360294,,0.126956,-0.572843,,-1.596546,,0.221696,...,0.112064,,0.072262,,,-0.205642,0.182434,,0.201374,-0.068340
C3L-00011,-0.286155,-0.780455,-0.101089,,0.292629,0.035812,,,,0.300863,...,0.136957,,0.279732,0.695116,,0.316298,-0.009772,-0.019653,-0.095339,0.008961
C3L-00026,0.135730,0.404286,0.261384,,0.155568,0.336311,,,0.709046,0.244198,...,-0.013139,,0.157541,0.526188,,-0.120501,0.054559,-0.313236,0.062194,0.052825
C3L-00079,-0.123959,-0.677773,-0.362547,,0.187605,-0.320026,,-1.300148,-0.153216,0.229676,...,-0.058953,,0.152341,0.072886,0.068182,,0.178869,0.266290,-0.028647,0.003682
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C3N-01646.N,-0.533915,0.352304,-0.297937,,0.049507,0.421582,,,,0.108650,...,0.148773,,0.215333,,,0.042826,-0.071776,-0.189634,-0.078515,
C3N-01648.N,-0.732322,0.111213,-0.877605,,0.058466,-0.241223,,,,0.166769,...,0.390944,,0.059535,,,0.366396,-0.070845,0.248763,0.140119,0.200438
C3N-01649.N,0.318404,0.065235,-0.261260,,0.013386,0.648428,,,,-0.009946,...,0.198995,,0.074344,-0.633999,,-0.020812,0.162851,-0.571237,0.203220,0.111064
C3N-01651.N,-0.531707,0.525514,-1.065982,,0.022212,-0.064736,,,-0.335925,-0.017308,...,0.388512,,0.237719,-0.181989,,,-0.008758,0.198887,0.088718,0.185857


In [135]:
clinical_data = ccrcc.get_clinical()
clinical_data

Patient_ID,Sample_Tumor_Normal,tumor/normal,gender,age,height_in_cm,height_in_inch,weight_in_kg,weight_in_lb,BMI,race,...,histologic_type_of_normal_tissue,slide_is_free_of_tumor,consistent_with_local_pathology_report,findings_not_consistent_with_local_pathology_report,weight_in_mg,minutes_clamp_1_to_collection,minutes_clamp_2_to_collection,minutes_collection_to_frozen,consistent_with_diagnostic_report,patient_medications
C3L-00004,Tumor,TN,Male,72,170.0,67.0,66.0,145.0,22.80,White,...,,,Yes,,484.0,,,21,Yes,"rivaroxaban,tylenol,Aspirin,glycolax,colace,zocor"
C3L-00010,Tumor,TN,Male,30,177.0,70.0,107.0,236.0,34.15,White,...,,,Yes,,575.0,,,17,Yes,"Rivaroxaban,Esomeprazole ,Tramadol"
C3L-00011,Tumor,TN,Female,63,180.0,71.0,89.0,196.0,27.47,White,...,,,Yes,,272.0,,,18,Yes,"Multi Vitamin,Levothyroxine Sodium,Ibandronate..."
C3L-00026,Tumor,TN,Female,65,163.0,64.0,75.0,165.0,28.23,White,...,,,Yes,,212.0,13.0,,13,Yes,"Levothyroxine Sodium,Cyproheptadine HCL,Citrac..."
C3L-00079,Tumor,TN,Male,49,175.0,69.0,116.0,256.0,37.88,White,...,,,Yes,,675.0,,,28,Yes,"ibuprofen,Norco,miralax"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C3N-01646.N,Normal,,,,,,,,,,...,"renal cortex, medulla",Yes,Yes,,256.0,,,20,,
C3N-01648.N,Normal,,,,,,,,,,...,renal cortex,Yes,Yes,,288.0,,,8,,
C3N-01649.N,Normal,,,,,,,,,,...,"renal medulla, pelvis",Yes,Yes,,301.0,,,29,,
C3N-01651.N,Normal,,,,,,,,,,...,"renal cortex, medulla",Yes,Yes,,280.0,,,22,,


In [138]:
rna_data = ccrcc.get_transcriptomics()
rna_data

Patient_ID,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
C3L-00004,0.995336,16.677828,353.263362,0.046634,0.031027,17.196885,0.896109,15.831130,2.938550,0.019029,...,2.622184,5.415676,5.293887,3.731920,12.541358,2.677696,15.595194,11.401808,9.976986,23.334614
C3L-00010,0.679400,16.682712,359.078446,0.077350,0.068617,13.560508,1.743989,16.690257,3.154143,0.000000,...,2.873604,9.209695,3.669353,2.560578,13.570779,4.097483,15.449647,11.550727,9.432121,25.724814
C3L-00011,0.354549,0.245606,222.075350,0.060736,0.273536,1.321499,0.172369,18.757568,6.942752,0.000000,...,7.998655,28.780560,2.801800,2.503315,10.209840,0.178842,11.670596,11.342045,6.763858,32.090615
C3L-00026,2.543775,16.347532,228.282343,0.085684,0.152020,7.868391,1.448911,17.648610,6.175010,0.031078,...,2.754936,12.639323,5.262024,2.796869,10.718552,0.800663,15.887414,11.788588,8.169953,24.752283
C3L-00079,4.355205,4.858958,275.090167,0.106359,0.000000,6.863003,2.338081,15.480282,4.584445,0.000000,...,7.497914,14.400917,2.907591,2.417113,10.127549,3.442177,12.807428,17.494840,9.733803,24.528238
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C3N-01646.N,1.287655,11.076241,228.037113,0.106505,0.066692,8.195971,0.982361,18.086400,10.225658,0.040903,...,2.338694,3.274051,3.500041,3.223773,15.663483,3.219438,17.530849,6.471017,11.860328,25.489666
C3N-01648.N,1.435986,9.772280,291.393930,0.172457,0.214180,8.955806,1.838108,14.597847,6.598100,0.056297,...,1.983803,1.486905,3.865995,3.197000,15.581449,5.774722,17.131855,10.133624,15.256983,22.669748
C3N-01649.N,1.082318,8.378616,249.779349,0.100673,0.160751,7.419323,2.290458,17.666594,11.108717,0.000000,...,2.007450,2.590684,3.805016,2.943559,17.274857,5.856484,16.392044,6.775269,12.881048,25.672959
C3N-01651.N,0.770924,15.539566,273.743429,0.208978,0.185383,12.206999,2.377283,13.946563,6.920178,0.022739,...,1.861395,2.904921,3.089317,2.529058,14.839389,5.801733,18.372506,9.149521,13.918938,24.578885


In [136]:
merged_data = pd.merge(protein_data, clinical_data, on='Patient_ID')
merged_data

Patient_ID,A1BG,A1CF,A2M,A4GALT,AAAS,AACS,AADAC,AADAT,AAED1,AAGAB,...,histologic_type_of_normal_tissue,slide_is_free_of_tumor,consistent_with_local_pathology_report,findings_not_consistent_with_local_pathology_report,weight_in_mg,minutes_clamp_1_to_collection,minutes_clamp_2_to_collection,minutes_collection_to_frozen,consistent_with_diagnostic_report,patient_medications
C3L-00004,-0.304302,0.641447,-0.000025,,0.207831,-0.364128,,-1.203886,-0.217934,0.216894,...,,,Yes,,484.0,,,21,Yes,"rivaroxaban,tylenol,Aspirin,glycolax,colace,zocor"
C3L-00010,1.195915,0.194620,1.360294,,0.126956,-0.572843,,-1.596546,,0.221696,...,,,Yes,,575.0,,,17,Yes,"Rivaroxaban,Esomeprazole ,Tramadol"
C3L-00011,-0.286155,-0.780455,-0.101089,,0.292629,0.035812,,,,0.300863,...,,,Yes,,272.0,,,18,Yes,"Multi Vitamin,Levothyroxine Sodium,Ibandronate..."
C3L-00026,0.135730,0.404286,0.261384,,0.155568,0.336311,,,0.709046,0.244198,...,,,Yes,,212.0,13.0,,13,Yes,"Levothyroxine Sodium,Cyproheptadine HCL,Citrac..."
C3L-00079,-0.123959,-0.677773,-0.362547,,0.187605,-0.320026,,-1.300148,-0.153216,0.229676,...,,,Yes,,675.0,,,28,Yes,"ibuprofen,Norco,miralax"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C3N-01646.N,-0.533915,0.352304,-0.297937,,0.049507,0.421582,,,,0.108650,...,"renal cortex, medulla",Yes,Yes,,256.0,,,20,,
C3N-01648.N,-0.732322,0.111213,-0.877605,,0.058466,-0.241223,,,,0.166769,...,renal cortex,Yes,Yes,,288.0,,,8,,
C3N-01649.N,0.318404,0.065235,-0.261260,,0.013386,0.648428,,,,-0.009946,...,"renal medulla, pelvis",Yes,Yes,,301.0,,,29,,
C3N-01651.N,-0.531707,0.525514,-1.065982,,0.022212,-0.064736,,,-0.335925,-0.017308,...,"renal cortex, medulla",Yes,Yes,,280.0,,,22,,


In [140]:
import numpy as np
import pandas as pd

# Keep only patients with a recorded pathological stage
clinical_data = clinical_data[clinical_data['tumor_stage_pathological'].notna()]

# Restrict the protein data to those same patients
common_index = clinical_data.index
protein_data = protein_data.loc[common_index]

# Boolean masks for the two stages being compared
stage1_mask = clinical_data['tumor_stage_pathological'] == "Stage I"
stage3_mask = clinical_data['tumor_stage_pathological'] == "Stage III"

stage1_proteins = protein_data.loc[stage1_mask]
stage3_proteins = protein_data.loc[stage3_mask]

# Per-protein mean expression within each stage (NaN values are skipped by default)
mean_stage1 = stage1_proteins.mean()
mean_stage3 = stage3_proteins.mean()

# Rank proteins by the absolute difference in mean expression
expression_difference = np.abs(mean_stage1 - mean_stage3)
sorted_proteins = expression_difference.sort_values(ascending=False)

# Drop proteins whose difference could not be computed
sorted_proteins_filtered = sorted_proteins.dropna()

top_5_proteins = sorted_proteins_filtered.head(5)

print(top_5_proteins)

Name
LDB3      1.911659
BTBD7     1.814474
GDF6      1.646855
COX4I2    1.501131
SNCB      1.494579
dtype: float64


Top 5 Proteins: LDB3, BTBD7, GDF6, COX4I2, SNCB
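
The ranking above uses the absolute difference in means, which ignores how much each protein varies within a stage. As an alternative (not the method used in this notebook), the sketch below ranks features by the p-value of Welch's t-test. It runs on synthetic data; the names toy, toy_stage, and P1 to P4 are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for protein_data: 20 patients x 4 hypothetical proteins.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(20, 4)), columns=["P1", "P2", "P3", "P4"])
toy_stage = pd.Series(["Stage I"] * 10 + ["Stage III"] * 10, index=toy.index)

# Shift one protein in Stage III so there is a known difference to recover.
toy.loc[toy_stage == "Stage III", "P3"] += 3.0

group1 = toy[toy_stage == "Stage I"].to_numpy()
group3 = toy[toy_stage == "Stage III"].to_numpy()

# Welch's t-test per column; nan_policy="omit" skips missing measurements.
t_stat, p_val = stats.ttest_ind(
    group1, group3, axis=0, equal_var=False, nan_policy="omit"
)

# Smallest p-value first: the strongest evidence of a stage difference.
ranking = pd.Series(np.asarray(p_val, dtype=float), index=toy.columns).sort_values()
top_feature = ranking.index[0]
print(ranking)
```

On real data with thousands of proteins, the p-values would normally also be corrected for multiple testing before picking the top 5.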

In [141]:
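# Restrict the RNA data to patients with a recorded stage
rna_data = rna_data.loc[common_index]
# log2-transform the expression values. Note that log2(0) evaluates to -inf,
# so a gene with a zero value in any patient can produce an infinite mean
log_scaled_rna = np.log2(rna_data)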

stage1_genes = log_scaled_rna.loc[stage1_mask]
stage3_genes = log_scaled_rna.loc[stage3_mask]

mean_stage1 = stage1_genes.mean()
mean_stage3 = stage3_genes.mean()

expression_difference = np.abs(mean_stage1 - mean_stage3)

sorted_genes = expression_difference.sort_values(ascending=False)

sorted_genes_filtered = sorted_genes.dropna()

top_5_genes = sorted_genes_filtered.head(5)

print(top_5_genes)

Name
LRRC43      inf
SYNPO2L     inf
C1orf145    inf
MAST1       inf
SYTL5       inf
dtype: float64


  result = func(self.values, **kwargs)


Top 5 Genes: LRRC43, SYNPO2L, C1orf145, MAST1, SYTL5
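
All five differences above are inf because np.log2 maps zero expression values to -inf, so the ranking reflects which genes contain zeros rather than which differ most between stages. One common workaround, shown here as a sketch on a made-up toy_rna table, is to add a pseudocount of 1 before taking the log. The choice of pseudocount is an assumption, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical expression table with one zero value.
toy_rna = pd.DataFrame(
    {"G1": [0.0, 4.0, 8.0], "G2": [1.0, 3.0, 7.0]},
    index=["p1", "p2", "p3"],
)

# Plain log2: the zero becomes -inf (the warning is silenced for this demo).
with np.errstate(divide="ignore"):
    raw_log = np.log2(toy_rna)

# log2(x + 1): zeros map to 0 and every value stays finite.
pseudo_log = np.log2(toy_rna + 1)

n_infinite = int(np.isinf(raw_log.to_numpy()).sum())
all_finite = bool(np.isfinite(pseudo_log.to_numpy()).all())
print(n_infinite, all_finite)
```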

2. Create a new dataframe of your selected features, where the rows are the patients and the
columns are the expression values of genes you selected in step 1 (X data).

In [142]:
selected_genes = ['LRRC43', 'SYNPO2L', 'C1orf145', 'MAST1', 'SYTL5']

selected_genes_data = log_scaled_rna[selected_genes]

print(selected_genes_data)

Name          LRRC43   SYNPO2L  C1orf145     MAST1     SYTL5
Patient_ID                                                  
C3L-00004  -3.675676 -5.311829 -1.807802 -4.312687 -6.237873
C3L-00010  -0.978065 -4.944366 -1.077769 -6.167616 -2.922878
C3L-00011  -4.731917 -4.823750 -0.542115 -3.725072 -3.557149
C3L-00026   0.675903 -7.604106 -2.292724 -6.604964 -1.575954
C3L-00079  -3.130040 -7.444265  0.059762 -1.490926 -5.048381
...              ...       ...       ...       ...       ...
C3N-01646  -3.382436 -3.919053 -2.678061 -4.504874 -1.906499
C3N-01648  -3.127310 -3.634180 -1.352545 -3.535502 -6.367579
C3N-01649  -2.907623 -4.084344 -2.039749 -3.457171 -1.855110
C3N-01651  -2.990224 -4.474374 -1.970347 -2.890269 -4.230493
C3N-01808  -2.446471 -5.220127 -0.809210 -2.220985 -4.561209

[110 rows x 5 columns]
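
The task statement asks for both RNA and protein features, whereas the dataframe above holds only the RNA columns. A minimal sketch of combining the two sources into one X table follows. The tables toy_protein and toy_rna and the gene names are hypothetical, and the suffixes are only there to keep identically named genes apart.

```python
import pandas as pd

# Hypothetical protein and RNA tables that share some patients and gene names.
toy_protein = pd.DataFrame({"GENE_A": [0.1, -0.2, 0.3]}, index=["p1", "p2", "p3"])
toy_rna = pd.DataFrame(
    {"GENE_A": [5.0, 6.0, 7.0], "GENE_B": [1.0, 2.0, 3.0]},
    index=["p2", "p3", "p4"],
)

# Suffix the columns by data type, then keep only patients present in both tables.
X = toy_protein.add_suffix("_protein").join(toy_rna.add_suffix("_rna"), how="inner")
print(X)
```

With how="inner", patients missing from either table are dropped, so every row of X has both kinds of measurement.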


3. Create a separate list of the patients’ cancer stages, i.e., tumor_stage_pathological (y data).

In [143]:
cancer_stages = clinical_data['tumor_stage_pathological']

# Remove these patients because they have NA values in the selected_genes_data dataframe
patients_to_remove = ['C3L-00097', 'C3L-00448', 'C3L-00796', 'C3L-01607', 'C3N-00313', 'C3N-01178']

mask = ~cancer_stages.index.isin(patients_to_remove)

cancer_stages_filtered = cancer_stages[mask]

print(cancer_stages_filtered)

Patient_ID
C3L-00004    Stage III
C3L-00010      Stage I
C3L-00011     Stage IV
C3L-00026      Stage I
C3L-00079    Stage III
               ...    
C3N-01646    Stage III
C3N-01648     Stage II
C3N-01649    Stage III
C3N-01651     Stage II
C3N-01808      Stage I
Name: tumor_stage_pathological, Length: 104, dtype: object
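
Hard-coding the patient IDs to drop works for this run, but the list has to be rebuilt whenever the selected features change. A sketch of deriving the shared patients from the indexes instead, using made-up toy_X and toy_y objects:

```python
import numpy as np
import pandas as pd

# Hypothetical features and labels, each with one unusable patient.
toy_X = pd.DataFrame({"G1": [1.0, np.nan, 3.0, 4.0]}, index=["p1", "p2", "p3", "p4"])
toy_y = pd.Series(
    ["Stage I", "Stage II", "Stage III", None], index=["p1", "p2", "p3", "p4"]
)

# Drop rows with missing or infinite features, and patients with no stage.
toy_X_clean = toy_X.replace([np.inf, -np.inf], np.nan).dropna()
toy_y_clean = toy_y.dropna()

# Keep only the patients that survive in both, in the same order.
shared = toy_X_clean.index.intersection(toy_y_clean.index)
toy_X_aligned = toy_X_clean.loc[shared]
toy_y_aligned = toy_y_clean.loc[shared]
print(list(shared))
```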


4. Scale and encode your features and target.

In [144]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Replace infinite values (from log2 of zero) with NaN. Assigning the result
# instead of using inplace=True avoids pandas' SettingWithCopyWarning.
selected_genes_data = selected_genes_data.replace([np.inf, -np.inf], np.nan)

# Drop patients with any missing feature value
selected_genes_data = selected_genes_data.dropna()

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaled_features = scaler.fit_transform(selected_genes_data)

# Encode the stage labels as integers
label_encoder = LabelEncoder()
encoded_target = label_encoder.fit_transform(cancer_stages_filtered)

print(selected_genes_data)

Name          LRRC43   SYNPO2L  C1orf145     MAST1     SYTL5
Patient_ID                                                  
C3L-00004  -3.675676 -5.311829 -1.807802 -4.312687 -6.237873
C3L-00010  -0.978065 -4.944366 -1.077769 -6.167616 -2.922878
C3L-00011  -4.731917 -4.823750 -0.542115 -3.725072 -3.557149
C3L-00026   0.675903 -7.604106 -2.292724 -6.604964 -1.575954
C3L-00079  -3.130040 -7.444265  0.059762 -1.490926 -5.048381
...              ...       ...       ...       ...       ...
C3N-01646  -3.382436 -3.919053 -2.678061 -4.504874 -1.906499
C3N-01648  -3.127310 -3.634180 -1.352545 -3.535502 -6.367579
C3N-01649  -2.907623 -4.084344 -2.039749 -3.457171 -1.855110
C3N-01651  -2.990224 -4.474374 -1.970347 -2.890269 -4.230493
C3N-01808  -2.446471 -5.220127 -0.809210 -2.220985 -4.561209

[104 rows x 5 columns]


5) Create a train test split of your X and y data with train_size=0.7.

In [145]:
X_train, X_test, y_train, y_test = train_test_split(selected_genes_data, cancer_stages_filtered, train_size=0.7, random_state=42)
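
The split above is applied to the unscaled selected_genes_data and the unencoded stage labels, so the scaled and encoded arrays from step 4 are not used. One possible arrangement, sketched on synthetic data (toy_X and toy_stages are made up), is to encode the target, split with stratification, and fit the scaler on the training portion only so that no test-set information leaks into the scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in: 40 patients, 5 features, 4 balanced stages.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(40, 5))
toy_stages = np.array(["Stage I", "Stage II", "Stage III", "Stage IV"] * 10)

# Encode stage labels as integers 0..3.
toy_y = LabelEncoder().fit_transform(toy_stages)

# Stratify so each stage appears in both the training and the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.7, stratify=toy_y, random_state=0
)

# Fit the scaler on the training data, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
print(X_tr_scaled.shape, X_te_scaled.shape)
```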

6. Write code to test the accuracy of all 4 classification models we covered in this class (i.e.,
KNeighborsClassifier, DecisionTreeClassifier, MLPClassifier, and GaussianNB). Since
the accuracy of the models will change depending on the train-test split, you will need to
run each model 10 times and find the average accuracy between all runs.

In [146]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

knn_classifier = KNeighborsClassifier()
decision_tree_classifier = DecisionTreeClassifier()
mlp_classifier = MLPClassifier()
gaussian_nb_classifier = GaussianNB()

accuracy_scores = []

num_runs = 10

for _ in range(num_runs):

    X_train, X_test, y_train, y_test = train_test_split(selected_genes_data, cancer_stages_filtered, train_size=0.7, random_state=42)
   
    # KNeighborsClassifier
    knn_classifier.fit(X_train, y_train)
    knn_predictions = knn_classifier.predict(X_test)
    knn_accuracy = accuracy_score(y_test, knn_predictions)

    # DecisionTreeClassifier
    decision_tree_classifier.fit(X_train, y_train)
    decision_tree_predictions = decision_tree_classifier.predict(X_test)
    decision_tree_accuracy = accuracy_score(y_test, decision_tree_predictions)

    # MLPClassifier
    mlp_classifier.fit(X_train, y_train)
    mlp_predictions = mlp_classifier.predict(X_test)
    mlp_accuracy = accuracy_score(y_test, mlp_predictions)

    # GaussianNB
    gaussian_nb_classifier.fit(X_train, y_train)
    gaussian_nb_predictions = gaussian_nb_classifier.predict(X_test)
    gaussian_nb_accuracy = accuracy_score(y_test, gaussian_nb_predictions)

    accuracy_scores.append({
        'knn': knn_accuracy,
        'decision_tree': decision_tree_accuracy,
        'mlp': mlp_accuracy,
        'gaussian_nb': gaussian_nb_accuracy
    })

avg_accuracies = {model: np.mean([run[model] for run in accuracy_scores]) for model in ['knn', 'decision_tree', 'mlp', 'gaussian_nb']}

print("Average Accuracies:")
print(avg_accuracies)




Average Accuracies:
{'knn': 0.375, 'decision_tree': 0.29375, 'mlp': 0.38125, 'gaussian_nb': 0.5}
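
The task statement expects the accuracy to vary with the train-test split, which requires a different split on each run. A sketch of that pattern on synthetic data follows, passing the loop counter as the seed. The data, the class shift, and the choice of only two of the four models are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data: each class is shifted so there is some signal.
rng = np.random.default_rng(0)
toy_y = np.repeat([0, 1, 2, 3], 25)
toy_X = rng.normal(size=(100, 5)) + toy_y[:, None]

models = {"knn": KNeighborsClassifier, "gaussian_nb": GaussianNB}
scores = {name: [] for name in models}

for seed in range(10):
    # A new seed per run gives a new split while staying reproducible.
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy_X, toy_y, train_size=0.7, stratify=toy_y, random_state=seed
    )
    for name, make_model in models.items():
        model = make_model()  # fresh, unfitted model for every run
        model.fit(X_tr, y_tr)
        scores[name].append(accuracy_score(y_te, model.predict(X_te)))

mean_accuracy = {name: float(np.mean(vals)) for name, vals in scores.items()}
print(mean_accuracy)
```

DecisionTreeClassifier and MLPClassifier can be added to the models dictionary in the same way.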




7) Compare the 4 mean accuracies and identify which model is best.

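The GaussianNB classifier had the highest mean accuracy (0.50), followed by MLPClassifier (0.381), KNeighborsClassifier (0.375), and DecisionTreeClassifier (0.294). Note that every run reuses the same split (random_state=42), so these averages capture only the randomness inside each model, not variation between train-test splits.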