# PDAC Subtype Classification

Dataset: GSE71729

Goal: Use the gene expression data from these samples to predict the different subtypes of PDAC.

The following four tasks need to be accomplished:

1. Predict Cancer Subtypes Based on Gene Signatures: In this task, you use the gene expression data from the PDAC primary samples to predict cancer subtypes.

Objective: Build a machine learning model that can classify PDAC samples into their respective subtypes based on gene expression patterns.

2. Identify Top N Most Important Genes: In this task, you need to identify the most important genes that help distinguish between the different PDAC subtypes. Feature selection techniques like Random Forest feature importance can be used to identify the most relevant genes.

Objective: Select the top N genes (e.g., top 10, top 20) based on their importance scores.

3. Build Models Using Only Important Features

Objective: To simplify the model by reducing the number of features and to evaluate how well it performs with a smaller, more focused set of features.

4. Compare the Performance and Stability of Two Prediction Models: In this task, you train two machine learning algorithms (here, Random Forest and SVM) on the same data and compare them.

Objective: Compare these models in terms of accuracy, precision, recall, F1 score, and potentially other metrics. Also, assess their stability, meaning how consistently they perform across different random splits of the data.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load gene expression data
gene_expression_data = pd.read_csv(r'C:\Users\Disha\TU Braunschweig\Python Lab\Project\GSE71729_PData.csv')
metadata = pd.read_csv(r'C:\Users\Disha\TU Braunschweig\Python Lab\Project\GSE71729_phenotype_primary.csv')
gene_list_data = pd.read_csv(r'C:\Users\Disha\TU Braunschweig\Python Lab\Project\moffitt_signitures.csv')  

In [3]:
# Preprocessing
# Filter metadata to keep only primary samples
filtered_metadata = metadata[metadata['tissue type:ch2'] == 'Primary']

In [4]:
# Align gene expression data with filtered metadata
sample_ids = filtered_metadata['geo_accession'].values
# set_index returns a new DataFrame, which avoids modifying a slice in place
gene_expression_data_filtered = gene_expression_data[['ID'] + list(sample_ids)].set_index('ID')

In [5]:
# Filter gene expression data to include only genes in gene_list_data
if not gene_list_data.empty:
    genes_of_interest = gene_list_data['gene'].values
    gene_expression_data_filtered = gene_expression_data_filtered.loc[gene_expression_data_filtered.index.intersection(genes_of_interest)]

In [6]:
# Extract labels (target variable)
target_labels = filtered_metadata['tumor_subtype_0na_1classical_2basal:ch2'].astype(int).values

In [7]:
# Predict Cancer Subtypes
X = gene_expression_data_filtered.T  # Transpose to align samples (rows) with features (columns)
y = target_labels

In [8]:
# Split data into train-test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
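The two subtypes are imbalanced (the test split in this notebook contains 21 classical and 8 basal samples), so one option is to stratify the split so that both partitions keep roughly the same class ratio. The following is a minimal sketch on synthetic labels, not the notebook's actual data: `X_demo`, `y_demo`, and the sample count of 145 are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 145 samples, 10 features, imbalanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(145, 10))
y_demo = np.array([1] * 105 + [2] * 40)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fraction of class 2 samples in each partition
train_frac = np.mean(y_tr == 2)
test_frac = np.mean(y_te == 2)
print(f"class 2 fraction: train={train_frac:.2f}, test={test_frac:.2f}")
```

Applied to the real data, this would mean passing `stratify=y` to the existing `train_test_split` call; the reported metrics would then change, since the split itself changes.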

# Using Random Forest

In [9]:
# Train Random Forest to predict the cancer subtypes based on the gene signature
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train, y_train)

In [10]:
# Evaluate the model
y_pred = rf_model.predict(X_test)
print("Random Forest Performance (All Features):")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Random Forest Performance (All Features):
Accuracy: 0.8620689655172413
              precision    recall  f1-score   support

           1       0.84      1.00      0.91        21
           2       1.00      0.50      0.67         8

    accuracy                           0.86        29
   macro avg       0.92      0.75      0.79        29
weighted avg       0.88      0.86      0.85        29



In [11]:
# Identify Top N Important Genes
# Feature importance
feature_importances = rf_model.feature_importances_
important_genes = pd.DataFrame({
    'Gene': X.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

In [12]:
# Select Top 20 important genes
N = 20  # You can adjust N as needed
top_genes = important_genes.head(N)['Gene'].values
# Print the top N important genes
print(top_genes)

['GPR87' 'FAM83A' 'ANXA8L2' 'S100A2' 'CDH17' 'LGALS4' 'BTNL8' 'TFF3'
 'AGR3' 'KRT6A' 'TSPAN8' 'CLRN3' 'ANXA10' 'SCEL' 'KRT6C' 'KRT15' 'VGLL1'
 'ATAD4' 'REG4' 'CST6']


In [13]:
# Build New Models with Only Important Features
X_train_top = X_train[top_genes]
X_test_top = X_test[top_genes]

In [14]:
# Train Random Forest with top N genes
rf_model_top = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model_top.fit(X_train_top, y_train)

In [15]:
# Evaluate the new model
y_pred_top = rf_model_top.predict(X_test_top)
print("Random Forest Performance (Top N Genes):")
print("Accuracy:", accuracy_score(y_test, y_pred_top))
print(classification_report(y_test, y_pred_top))

Random Forest Performance (Top N Genes):
Accuracy: 0.896551724137931
              precision    recall  f1-score   support

           1       0.88      1.00      0.93        21
           2       1.00      0.62      0.77         8

    accuracy                           0.90        29
   macro avg       0.94      0.81      0.85        29
weighted avg       0.91      0.90      0.89        29



In [16]:
# Compare Models
print("Confusion Matrix (All Features):")
print(confusion_matrix(y_test, y_pred))
print("Confusion Matrix (Top N Genes):")
print(confusion_matrix(y_test, y_pred_top))

Confusion Matrix (All Features):
[[21  0]
 [ 4  4]]
Confusion Matrix (Top N Genes):
[[21  0]
 [ 3  5]]


# Using Support Vector Machine (SVM)

In [17]:
# accuracy_score, classification_report, confusion_matrix and train_test_split
# are already imported in the first cell; only SVC is new here
from sklearn.svm import SVC

In [18]:
# X = gene expression data (all signature genes), y = tumor subtypes (labels)
# Same test_size and random_state as before, so SVM is evaluated on the same samples as Random Forest
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [19]:
# Train SVM with all features
svm_model = SVC(kernel='linear', probability=True, random_state=42)  # Linear kernel for interpretability
svm_model.fit(X_train, y_train)
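With a linear kernel, a fitted `SVC` exposes its weights through `coef_`, which is what makes the model inspectable: for a two-class problem there is one weight per feature, and ranking by absolute weight gives a rough view of which genes drive the decision. The sketch below illustrates this on synthetic data; `svm_demo`, `X_demo`, `y_demo`, and the `gene_i` names are hypothetical and not part of the notebook. The weights depend on feature scale, so the ranking is only indicative.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical stand-in data: 100 samples, 6 "genes".
rng = np.random.default_rng(0)
gene_names = [f"gene_{i}" for i in range(6)]
X_demo = pd.DataFrame(rng.normal(size=(100, 6)), columns=gene_names)
# Labels depend only on gene_0, so it should receive the largest weight.
y_demo = np.where(X_demo["gene_0"] > 0, 2, 1)

svm_demo = SVC(kernel="linear", random_state=42)
svm_demo.fit(X_demo, y_demo)

# For a binary linear SVC, coef_ has shape (1, n_features).
svm_weights = pd.Series(np.abs(svm_demo.coef_[0]), index=gene_names)
svm_ranking = svm_weights.sort_values(ascending=False)
print(svm_ranking.head(3))
```

On the notebook's data the same pattern would read `svm_model.coef_[0]` with `X.columns` as the index, which would allow comparing the SVM ranking with the Random Forest importance ranking.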

In [20]:
# Predict and Evaluate
y_pred_svm = svm_model.predict(X_test)
print("SVM Performance (All Features):")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

SVM Performance (All Features):
Accuracy: 0.9310344827586207
              precision    recall  f1-score   support

           1       0.95      0.95      0.95        21
           2       0.88      0.88      0.88         8

    accuracy                           0.93        29
   macro avg       0.91      0.91      0.91        29
weighted avg       0.93      0.93      0.93        29



In [21]:
# Use Top N Genes identified earlier
X_train_top = X_train[top_genes]
X_test_top = X_test[top_genes]

In [22]:
# Train SVM with Top N Features
svm_model_top = SVC(kernel='linear', probability=True, random_state=42)
svm_model_top.fit(X_train_top, y_train)

In [23]:
# Predict and Evaluate
y_pred_svm_top = svm_model_top.predict(X_test_top)
print("SVM Performance (Top N Genes):")
print("Accuracy:", accuracy_score(y_test, y_pred_svm_top))
print(classification_report(y_test, y_pred_svm_top))

SVM Performance (Top N Genes):
Accuracy: 0.896551724137931
              precision    recall  f1-score   support

           1       0.95      0.90      0.93        21
           2       0.78      0.88      0.82         8

    accuracy                           0.90        29
   macro avg       0.86      0.89      0.88        29
weighted avg       0.90      0.90      0.90        29



In [24]:
# Compare Models
print("Confusion Matrix (All Features):")
print(confusion_matrix(y_test, y_pred_svm))
print("Confusion Matrix (Top N Genes):")
print(confusion_matrix(y_test, y_pred_svm_top))

Confusion Matrix (All Features):
[[20  1]
 [ 1  7]]
Confusion Matrix (Top N Genes):
[[19  2]
 [ 1  7]]


1. Based on Accuracy

The SVM model performs better when using all features, with an accuracy of 93.10% compared to Random Forest's accuracy of 86.21%.
When using only the top N genes, both models achieve the same accuracy of 89.66%.

2. Precision, Recall and F1 Score

a. Class 1

SVM outperforms Random Forest in terms of precision for class 1 (classical subtype), with a precision of 0.95 compared to 0.84 for Random Forest (all features).

Recall for class 1 is perfect (1.00) for Random Forest, meaning it doesn't miss any of the class 1 samples. SVM has a recall of 0.95 with all features (0.90 with the top N genes), meaning it misclassifies 1 of the 21 class 1 samples (2 with the top N genes).

The F1-score for class 1 is higher for SVM when using all features (0.95) compared to Random Forest (0.91). However, when using top N genes, both models have a similar F1-score (0.93).

b. Class 2

Random Forest achieves perfect precision (1.00) for class 2 with both all features and the top N genes, meaning it never labels a class 1 sample as class 2.
SVM has lower precision for class 2 (0.88 with all features, 0.78 with the top N genes).

Recall for class 2 is much higher for SVM (0.88 in both settings) than for Random Forest (0.50 with all features, 0.62 with the top N genes).

F1-score for class 2 is higher for SVM than for Random Forest in both settings (0.88 vs. 0.67 with all features, 0.82 vs. 0.77 with the top N genes), reflecting a more balanced performance.

3. Confusion Matrix

Random Forest shows more false negatives for class 2 (4 with all features, 3 with the top N genes), meaning it misses a large share of the 8 class 2 samples. It produces no false positives for class 2.

SVM shows fewer false negatives for class 2 (1 in both settings), so it recovers most class 2 samples, at the cost of a few false positives (1 with all features, 2 with the top N genes).
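These counts can be read directly off the confusion matrices printed in cells 16 and 24: scikit-learn puts the true labels in rows and the predicted labels in columns, here in label order [1, 2]. As a quick check, the sketch below recomputes the class 2 false negatives, false positives, precision, recall, and the overall accuracy from the four matrices reported in this notebook. The helper `class2_summary` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Confusion matrices copied from the notebook output
# (rows = true label, columns = predicted label, label order [1, 2]).
matrices = {
    "RF, all features": np.array([[21, 0], [4, 4]]),
    "RF, top N genes": np.array([[21, 0], [3, 5]]),
    "SVM, all features": np.array([[20, 1], [1, 7]]),
    "SVM, top N genes": np.array([[19, 2], [1, 7]]),
}

def class2_summary(cm):
    """Treat class 2 as the positive class and derive its metrics."""
    tn, fp = cm[0]  # true class 1: predicted 1, predicted 2
    fn, tp = cm[1]  # true class 2: predicted 1, predicted 2
    return {
        "false_negatives": int(fn),
        "false_positives": int(fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tn + tp) / cm.sum(),
    }

summaries = {name: class2_summary(cm) for name, cm in matrices.items()}
for name, s in summaries.items():
    print(name, s)
```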

4. Stability and Generalization

For class 1 (classical subtype), SVM has higher precision while Random Forest has higher recall. For class 2 (basal subtype), SVM shows better recall and a more balanced F1-score.

Random Forest, while showing perfect precision for class 2, has much lower recall for that class. It struggles more with identifying class 2 samples, especially when using all features. All results here come from a single train-test split with 29 test samples, so how consistently each model performs across different random splits has not been measured.
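One way to assess stability in the sense of Task 4 is repeated stratified cross-validation: each model is trained and scored on many different splits, and the spread of the scores serves as a rough stability measure. The following is a minimal sketch under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the expression matrix, and `X_demo`, `y_demo`, `stability`, the fold counts, and the choice of macro F1 are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the samples-by-genes matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=145, n_features=50, n_informative=10,
    weights=[0.72, 0.28], random_state=0,
)

# 5 folds repeated 5 times gives 25 different train/test splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (linear)": SVC(kernel="linear", random_state=42),
}

stability = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    # Mean = typical performance, standard deviation = variation across splits
    stability[name] = (scores.mean(), scores.std())
    print(f"{name}: macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the real data, `X` and `y` would replace the synthetic arrays. If the top N genes are to be evaluated this way, the gene selection would need to be repeated inside each fold so that the test folds do not influence which genes are chosen.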

Final Thoughts

Best Model for Class 1: With all features, SVM has higher precision and F1-score for class 1 (0.95 each), while Random Forest has perfect recall (1.00). With the top N genes, the two models reach the same F1-score (0.93).

Best Model for Class 2: While Random Forest excels in precision for class 2, SVM is better at identifying more class 2 samples (higher recall and F1-score), which could be more important depending on the context (e.g., minimizing false negatives).

Overall, SVM appears to be the better model when considering the balance between precision, recall, and F1-score for both classes. Random Forest has an advantage in class 2 precision and class 1 recall, but SVM provides better overall performance on this test split, particularly when it comes to recall for class 2.

Thus, SVM is recommended as the better model for this task.