<a href="https://colab.research.google.com/github/dzeko5959/AI/blob/main/ML/UDEM/A3_1_Seunghyeon_Lee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Import the CSV file**

Loads the data, keeps only samples from classes 2 and 4, and computes the difference in mean expression between these two classes for every gene. The 10 genes with the largest absolute mean difference are then listed; they may help distinguish the two cancer types and are candidates for biomarkers.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, f_oneway
from statsmodels.stats.multitest import multipletests
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

sns.set(style="whitegrid")

In [None]:
from google.colab import files
uploaded = files.upload()

Saving A3.1 Khan.csv to A3.1 Khan.csv


In [None]:
df = pd.read_csv("A3.1 Khan.csv")

In [None]:
print(df.isnull().sum().sum())

df_24 = df[df['y'].isin([2, 4])]

mean_diff = df_24[df_24['y'] == 2].mean() - df_24[df_24['y'] == 4].mean()
mean_diff = mean_diff.drop('y')
top10_genes = mean_diff.abs().sort_values(ascending=False).head(10)
print(top10_genes)

0
X187     3.323151
X509     2.906537
X2046    2.424515
X2050    2.401783
X129     2.165185
X1645    2.065460
X1319    2.045941
X1955    2.037340
X1003    2.011337
X246     1.837830
dtype: float64


**2. t-test & multiple testing correction (2 vs 4)**

Performs t-tests on each gene to compare mean expression between class 2 and class 4, and then applies multiple testing correction methods to identify genes with statistically significant differences.

In [None]:
# Collect one p-value per gene (two-sample t-test, class 2 vs class 4)
p_values = []

for col in df.columns[:-1]:  # exclude 'y'
    stat, p = ttest_ind(df_24[df_24['y'] == 2][col], df_24[df_24['y'] == 4][col])
    p_values.append(p)

p_values = np.array(p_values)

methods = ['bonferroni', 'holm', 'fdr_bh']
significant_genes = {}

for method in methods:
    reject, pvals_corrected, _, _ = multipletests(p_values, alpha=0.05, method=method)
    significant = np.array(df.columns[:-1])[reject]
    significant_genes[method] = significant
    print(f"\nNumber of significant genes with {method.upper()}: {len(significant)}")
    print(significant[:10])


Number of significant genes with BONFERRONI: 74
['X2' 'X36' 'X67' 'X129' 'X174' 'X187' 'X229' 'X246' 'X251' 'X338']

Number of significant genes with HOLM: 74
['X2' 'X36' 'X67' 'X129' 'X174' 'X187' 'X229' 'X246' 'X251' 'X338']

Number of significant genes with FDR_BH: 297
['X2' 'X3' 'X29' 'X36' 'X52' 'X67' 'X80' 'X89' 'X119' 'X129']


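Bonferroni and Holm both flagged 74 significant genes. Both methods control the family-wise error rate, and Holm never rejects fewer hypotheses than Bonferroni, so the identical counts mean the step-down procedure gained nothing extra on this data.
Benjamini-Hochberg (FDR) flagged 297 genes because it controls the expected proportion of false discoveries rather than the family-wise error rate, which allows more discoveries at the cost of some expected false positives among them.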
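To see why the three corrections can disagree, the sketch below computes adjusted p-values by hand on a made-up list of six p-values. The function names and `toy_pvals` are illustrative assumptions, not part of the notebook, and the code follows the textbook definitions rather than the internals of `multipletests`.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each p by m, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)  # descending p
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):  # k = 0 is the largest p
        rank = m - k  # 1-based rank in ascending order
        candidate = pvals[i] * m / rank
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, chosen only to illustrate the ordering of the methods
toy_pvals = [0.001, 0.008, 0.012, 0.020, 0.041, 0.30]
alpha = 0.05

rejections = {}
for name, fn in [("bonferroni", bonferroni), ("holm", holm), ("fdr_bh", benjamini_hochberg)]:
    rejections[name] = sum(p_adj <= alpha for p_adj in fn(toy_pvals))
    print(f"{name}: {rejections[name]} rejected")
```

On these toy values Holm rejects at least as many hypotheses as Bonferroni, and Benjamini-Hochberg rejects the most, which matches the ordering of the gene counts reported above.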

**3. ANOVA**

Uses one-way ANOVA to test whether the mean expression of each gene differs across all four cancer types, then applies FDR correction to account for multiple testing.
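As a rough illustration of what the test measures for a single gene, the sketch below computes the one-way ANOVA F statistic by hand: the variance between group means divided by the variance within groups. The helper `one_way_f` and the toy groups are assumptions for illustration only; the notebook itself relies on `f_oneway`, which also returns the p-value.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    # Within-group sum of squares: spread of samples around their own group mean
    ss_within = sum((x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical expression values for one gene in three classes
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(toy_groups)
print(f_stat)
```

A large F value means the group means differ by more than the within-group noise would explain, which is what drives a small p-value.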

In [None]:
grouped = df.groupby('y')
pvals_anova = []

for col in df.columns[:-1]:
    groups = [group[col].values for _, group in grouped]
    stat, p = f_oneway(*groups)
    pvals_anova.append(p)

pvals_anova = np.array(pvals_anova)

reject_anova, pvals_corrected_anova, _, _ = multipletests(pvals_anova, alpha=0.05, method='fdr_bh')
significant_anova_genes = np.array(df.columns[:-1])[reject_anova]

print(f"Number of significant genes in ANOVA: {len(significant_anova_genes)}")
print(significant_anova_genes[:10])

Number of significant genes in ANOVA: 1162
['X1' 'X2' 'X3' 'X9' 'X12' 'X17' 'X21' 'X22' 'X27' 'X29']


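A total of 1162 genes showed statistically significant differences across the four classes after FDR correction, suggesting that a large portion of the genes may help distinguish between cancer types. Note that the genes printed above (X1, X2, X3, ...) are simply the first significant genes in column order; they are not ranked by p-value, so the list alone does not show which genes are the most relevant.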
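To list the most relevant genes rather than the first ones in column order, the significant genes can be sorted by adjusted p-value. The sketch below is a minimal version on made-up values; `rank_significant`, the gene names, and the p-values are illustrative assumptions.

```python
def rank_significant(gene_names, adjusted_pvals, alpha=0.05):
    """Return the significant genes ordered from smallest to largest adjusted p-value."""
    pairs = [(p, g) for g, p in zip(gene_names, adjusted_pvals) if p <= alpha]
    return [g for p, g in sorted(pairs)]


# Hypothetical adjusted p-values for four genes
toy_names = ["X1", "X2", "X3", "X4"]
toy_adjusted = [0.03, 0.20, 0.0004, 0.01]

ranked = rank_significant(toy_names, toy_adjusted)
print(ranked)
```

In the notebook, the same idea could be applied by pairing `pvals_corrected_anova` with `df.columns[:-1]` and sorting before taking the first 10 entries.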

**4. SVM Model Learning**

Trains SVM models with linear, polynomial, and RBF kernels on the 10 genes selected in step 1 to classify all four cancer types, and compares which kernel performs best for this task.

In [None]:
selected_genes = top10_genes.index.tolist()

X = df[selected_genes]
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

kernels = ['linear', 'poly', 'rbf']
models = {}

for kernel in kernels:
    if kernel == 'poly':
        model = SVC(kernel=kernel, degree=3)
    else:
        model = SVC(kernel=kernel)
    model.fit(X_train, y_train)
    models[kernel] = model

**5. Model Performance Comparison and Conclusion**

In [None]:
for kernel, model in models.items():
    print(f"\nSVM with {kernel} kernel:")
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))


SVM with linear kernel:
[[2 0 0 0]
 [0 6 0 0]
 [0 0 4 0]
 [0 0 0 5]]
              precision    recall  f1-score   support

           1       1.00      1.00      1.00         2
           2       1.00      1.00      1.00         6
           3       1.00      1.00      1.00         4
           4       1.00      1.00      1.00         5

    accuracy                           1.00        17
   macro avg       1.00      1.00      1.00        17
weighted avg       1.00      1.00      1.00        17


SVM with poly kernel:
[[1 0 1 0]
 [0 6 0 0]
 [0 0 4 0]
 [0 0 0 5]]
              precision    recall  f1-score   support

           1       1.00      0.50      0.67         2
           2       1.00      1.00      1.00         6
           3       0.80      1.00      0.89         4
           4       1.00      1.00      1.00         5

    accuracy                           0.94        17
   macro avg       0.95      0.88      0.89        17
weighted avg       0.95      0.94      0.93        17

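Both linear and RBF kernels achieved perfect accuracy on the 17 test samples, while the polynomial kernel misclassified one class 1 sample as class 3, lowering its accuracy to 0.94. On this split, the linear and RBF kernels are therefore the better choices, although a test set of 17 samples is too small to separate the kernels reliably.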
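One way to gauge how much the 17-sample test set can tell us is a confidence interval on each accuracy. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the notebook computes; `wilson_interval` is a hypothetical helper and the 95% level (`z=1.96`) is an assumption.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate Wilson score interval for a proportion (default about 95%)."""
    p_hat = correct / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clip to [0, 1] to guard against floating-point overshoot
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the confusion matrices above: 17/17 and 16/17 correct
perfect_low, perfect_high = wilson_interval(17, 17)
poly_low, poly_high = wilson_interval(16, 17)

print(f"17/17 correct: [{perfect_low:.2f}, {perfect_high:.2f}]")
print(f"16/17 correct: [{poly_low:.2f}, {poly_high:.2f}]")
```

The two intervals overlap heavily, so a single misclassification on this split is weak evidence that the polynomial kernel is actually worse.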