## Prediction of cancer types from gene expression data

You are given a dataset of gene expression levels and cancer types for more than 6,000 patients.

The data has already been pre-processed to keep only the 150 gene expressions that are believed to correlate with the response.

Your goal is to fit several classification models on this data.


Answer the questions at the following link: https://forms.cloud.microsoft/e/pDz5AUes92

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
import warnings
warnings.filterwarnings("ignore") 

In [None]:
# Load dataset
data = pd.read_csv('tcga_150gene.csv')

X = data.drop('label', axis=1).values
y = data['label']

First we consider a binary classification problem, where we predict the most common cancer type against all other cancer types.

Define y_binary to be 1 if the patient is affected by the most common cancer type and 0 otherwise.

In [None]:
# Identify the most frequent cancer type
most_common_label = y.value_counts().idxmax()
y_binary = (y == most_common_label).astype(int).values

# Split into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y_binary, test_size=0.3, random_state=42, stratify=y_binary)

The test set should not be used for any statistical analysis except for comparing different models.

First, let's scale the features so that they have zero mean and unit variance. Work only on the scaled features from now on.

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

### Fit a logistic regression with a mild Ridge penalty (shrinkage parameter = 1e-6) and max_iter=100000

NB: be careful about what the shrinkage parameter represents in sklearn and what it means for us!
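
As a hedged illustration of this caveat: scikit-learn's `LogisticRegression` takes `C`, the inverse of the regularization strength, so a small shrinkage parameter corresponds to a large `C`. The sketch below assumes the penalty is written as (λ/2)‖w‖² added to the summed log-loss, in which case C = 1/λ; if the loss is averaged over the samples instead, the mapping picks up a factor of n. The synthetic data and the names `lam`, `X_demo`, `y_demo` and `model` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled training data (not the TCGA data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

lam = 1e-6          # shrinkage parameter in the "lambda" convention (assumed)
C = 1.0 / lam       # scikit-learn expects the inverse regularization strength

model = LogisticRegression(C=C, max_iter=100000)  # L2 (ridge) penalty is the default
model.fit(X_demo, y_demo)

print(f"lambda = {lam:g}  ->  C = {C:g}")
print("coefficient shape:", model.coef_.shape)
```

With such a large `C` the fit is close to an unpenalized logistic regression, which is what "mild" means here.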

In [None]:
logreg = None

Q1: What is the F1-score for the predictions on the training set?

Q2: What is the recall for the predictions on the test set?
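
Both metrics are available in `sklearn.metrics`. A small sketch on toy labels (the arrays below are made up for illustration and are not taken from the dataset) shows how they could be computed once predictions are available:

```python
from sklearn.metrics import f1_score, recall_score

# Toy labels, purely illustrative (not taken from the dataset)
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

# Here TP = 2, FP = 1, FN = 1, so precision = recall = 2/3
recall_demo = recall_score(y_true_demo, y_pred_demo)
f1_demo = f1_score(y_true_demo, y_pred_demo)

print(f"recall = {recall_demo:.3f}, F1 = {f1_demo:.3f}")
```

Recall is TP / (TP + FN), and the F1-score is the harmonic mean of precision and recall, so both equal 2/3 in this toy case.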

In [None]:
y_pred_train = None
y_pred_test = None

## On to variable selection.

First, perform forward step-wise variable selection based on the BIC score. The model is the same as before (logistic regression with a mild ridge penalty).
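
For reference, one common definition is BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of samples and L the maximized likelihood; lower is better. The sketch below is one possible way to evaluate it for a fitted probabilistic classifier, assuming the log-likelihood is obtained from `predict_proba` through `sklearn.metrics.log_loss` and that the mildly penalized fit is treated as if it were the maximum-likelihood fit. The helper name `bic_from_loglik` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss


def bic_from_loglik(log_lik, n_params, n_samples):
    """BIC = k * ln(n) - 2 * ln(L); lower values indicate a better trade-off."""
    return n_params * np.log(n_samples) - 2.0 * log_lik


# Synthetic stand-in data (not the TCGA data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
clf = LogisticRegression(C=1e6, max_iter=100000).fit(X_demo, y_demo)

# log_loss with normalize=False returns the summed negative log-likelihood
log_lik = -log_loss(y_demo, clf.predict_proba(X_demo), normalize=False)
n_params = clf.coef_.size + clf.intercept_.size

bic_demo = bic_from_loglik(log_lik, n_params, X_demo.shape[0])
print(f"log-likelihood = {log_lik:.2f}, k = {n_params}, BIC = {bic_demo:.2f}")
```

The ln(n) factor penalizes each extra parameter more heavily than the AIC does, which is why BIC-based selection tends to favour smaller models.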

In [None]:
# Some helper functions used to compute the BIC

def count_params(model):
    """
    Return the total number of parameters in the given fitted model.
    
    Supports:
      - LogisticRegression
      - LinearDiscriminantAnalysis (LDA)
      - QuadraticDiscriminantAnalysis (QDA)
    """
    # Logistic Regression
    if isinstance(model, LogisticRegression):
        # coef_: shape (n_classes, n_features) or (1, n_features) for binary
        # intercept_: shape (n_classes,) or (1,)
        total = model.coef_.size + model.intercept_.size
        return int(total)

    # Linear Discriminant Analysis
    elif isinstance(model, LinearDiscriminantAnalysis):
        raise NotImplementedError

    # Quadratic Discriminant Analysis
    elif isinstance(model, QuadraticDiscriminantAnalysis):
        raise NotImplementedError

    else:
        raise ValueError(f"Model of type {type(model)} is not supported.")

        

def bic_score(estimator, X, y):
    raise NotImplementedError

    return bic    

In [None]:
def forward_selection(X, y, base_model, max_features=25):    
    remaining = set(np.arange(X.shape[1]))
    selected, history = [], []
    
    while remaining and len(selected) < max_features:
        # try adding each unused feature
        bic_candidates = {}
        
        for feat in remaining:
            feats = np.array(selected + [feat])
            base_model.fit(X[:, feats], y)
            bic = bic_score(base_model, X[:, feats], y) 
            bic_candidates[feat] = bic

        # pick the best feature
        feat, bic = None # TODO

        if len(selected) < max_features:
            selected.append(feat)
            remaining.remove(feat)
            history.append({"step": len(selected),
                            "added_feature": feat,
                            "bic": bic})
        else:
            break

    return pd.DataFrame(history)

In [None]:
lr_model = None #TODO

forward_features = forward_selection(X_train_s, y_train, lr_model)

In [None]:
plt.plot(forward_features.step, forward_features.bic)
plt.xticks(np.arange(1, 26))
plt.show()

Q3: What is the optimal number of features according to the forward selection? Use the elbow method.

In [None]:
forward_n_features = None
selected_forward_features = forward_features.added_feature[:forward_n_features] 
selected_forward_features

Q4: Write the indices of the first 4 selected features in order of importance.

## Now we perform Lasso regression. 

Use cross-validation to select the optimal shrinkage parameter based on the accuracy of the classifier.
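
The core of one cross-validation pass, for a single candidate value of C, could be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's solution: the value `C=0.1`, the `liblinear` solver (chosen here only because it is fast on small data) and all `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the TCGA data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # One L1-penalized model per fold, for a single illustrative value of C
    model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    model.fit(X_demo[train_idx], y_demo[train_idx])    # fit on the training folds
    y_val_pred = model.predict(X_demo[val_idx])        # predict the held-out fold
    fold_scores.append(accuracy_score(y_demo[val_idx], y_val_pred))

print(f"mean accuracy = {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```

Repeating this loop over a grid of C values and keeping the one with the highest mean score gives the selection rule described above.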

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


def cross_validate_lasso_logistic(X, y, Cs, cv=5, scoring=accuracy_score, random_state=None):
    """
    Perform manual cross-validation to select the optimal L1 regularization parameter C
    for a logistic regression model.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Feature matrix.
    y : array-like, shape (n_samples,)
        Target vector.
    Cs : list or array of floats
        Candidates for the inverse regularization strength (C).
    cv : int
        Number of cross-validation folds.
    scoring : callable
        A function with signature scoring(y_true, y_pred) -> float.
    random_state : int or None
        Seed for reproducibility (passed to StratifiedKFold).

    Returns
    -------
    best_C : float
        The value of C that achieved the highest mean cross-validation score.
    cv_results : dict
        Dictionary with keys:
          - 'C': array of candidate Cs
          - 'mean_score': array of mean validation scores
          - 'std_score': array of std deviations across folds
    """

    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    mean_scores = []
    std_scores = []

    # Loop over candidate Cs
    for C in Cs:
        fold_scores = []
        for train_idx, val_idx in skf.split(X, y):
            model = LogisticRegression(
                penalty='l1',
                C=C,
                solver='saga',
                max_iter=5000,
                random_state=random_state
            )
            
            # TODO: fit model on appropriate shard of data
            
            # TODO: score the model on appropriate shard of data

            score = None # TODO
            fold_scores.append(score)

        mean_scores.append(np.mean(fold_scores))
        std_scores.append(np.std(fold_scores))

    mean_scores = np.array(mean_scores)
    std_scores  = np.array(std_scores)

    best_idx = np.argmax(mean_scores)
    best_C   = Cs[best_idx]

    cv_results = {
        'C': Cs,
        'mean_score': mean_scores,
        'std_score': std_scores
    }

    return best_C, cv_results

In [None]:
Cs = np.logspace(-4, 4, 20)


best_C, results = cross_validate_lasso_logistic(X_train_s, y_train, Cs, cv=5, random_state=42)

print(f"Best C (inverse regularization strength): {best_C:.6f}")


plt.errorbar(results['C'], results['mean_score'], yerr=results['std_score'],
             fmt='o-', capsize=5)
plt.xscale('log')
plt.xlabel('C (inverse regularization strength)')
plt.ylabel('Mean validation accuracy')
plt.title('Lasso Logistic Regression CV Curve')
plt.axvline(best_C, color='red', linestyle='--', label=f'Best C = {best_C:.2e}')
plt.legend()
plt.show()

Q5: What is the optimal parameter? Report the value of C found by cross-validation.

Now we re-fit the model with C=0.01 to see which features have been selected.

In [None]:
lasso_lr = None
lasso_lr.fit(X_train_s, y_train)
lasso_idx = np.where(lasso_lr.coef_[0] != 0)[0]

Q6: What is the Jaccard similarity between the first 7 features selected via step-wise variable selection and the features selected by the lasso with C=0.01?
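
The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch with Python sets, using a hypothetical helper `jaccard` and toy index sets rather than the actual selected features:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Toy index sets, purely illustrative
print(jaccard([1, 2, 3, 4], [3, 4, 5]))  # 2 shared out of 5 distinct -> 0.4
```

A value of 1 means the two selection methods picked exactly the same features, and 0 means they share none.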

In [None]:
jaccard_similarity = None

print(f'Jaccard similarity between selected feature sets: {jaccard_similarity:.3f}')

## Now let's fit LDA and QDA

Q7: Using step-wise forward feature selection based on the BIC score, what is the optimal number of features for LDA?

Q8: What about QDA?
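
Since the BIC depends on the parameter count, it may help to recall one common way of counting the free parameters of the two Gaussian models with K classes and p features: both estimate K mean vectors and K - 1 free class priors, while LDA shares a single covariance matrix and QDA estimates one per class. The function names below are hypothetical, and other counting conventions exist (for example, counting only the coefficients of the decision function).

```python
def lda_n_params(n_classes, n_features):
    """Class means + one shared symmetric covariance + free class priors."""
    means = n_classes * n_features
    shared_cov = n_features * (n_features + 1) // 2
    priors = n_classes - 1
    return means + shared_cov + priors


def qda_n_params(n_classes, n_features):
    """Class means + one symmetric covariance per class + free class priors."""
    means = n_classes * n_features
    class_covs = n_classes * n_features * (n_features + 1) // 2
    priors = n_classes - 1
    return means + class_covs + priors


# Binary problem with 3 features
print(lda_n_params(2, 3), qda_n_params(2, 3))  # 13 19
```

The QDA count grows much faster with the number of features, so the BIC penalty is expected to bite earlier for QDA than for LDA.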

In [None]:
lda_model = None
forward_features = forward_selection(X_train_s, y_train, lda_model)

plt.plot(forward_features.step, forward_features.bic)
plt.xticks(np.arange(1, 26))
plt.show()

In [None]:
qda_model = None
forward_features = forward_selection(X_train_s, y_train, qda_model)

plt.plot(forward_features.step, forward_features.bic)
plt.xticks(np.arange(1, 26))
plt.show()

## The multi-class classification problem

Now we move to the multi-class classification problem. Use scikit-learn to implement the one-vs-rest classification approach based on logistic regression and compare it with the standard multinomial model based on the softmax link function.

For both models, use 5-fold cross-validation (K=5) to estimate the optimal lasso shrinkage parameter.

Hint: use LogisticRegressionCV to perform the cross-validation automatically.

Report the accuracies on the test set.
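
The one-vs-rest versus softmax comparison could be set up along the following lines. This is a sketch on synthetic 3-class data, not the notebook's solution: the grid, the `saga` solver, the helper `make_lasso_cv` and all `*_demo` names are assumptions, and recent scikit-learn releases may emit deprecation warnings for the `penalty` argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class stand-in data (not the TCGA data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)


def make_lasso_cv():
    # Small grid and 5 folds; saga supports the L1 penalty
    return LogisticRegressionCV(Cs=np.logspace(-2, 2, 5), cv=5, penalty="l1",
                                solver="saga", max_iter=5000, random_state=0)


# One-vs-rest: one binary lasso logistic regression per class
ovr_demo = OneVsRestClassifier(make_lasso_cv()).fit(X_tr, y_tr)
# Given a multi-class target directly, saga fits a single softmax model
# (the default behaviour in recent scikit-learn releases)
softmax_demo = make_lasso_cv().fit(X_tr, y_tr)

acc_ovr_demo = accuracy_score(y_te, ovr_demo.predict(X_te))
acc_softmax_demo = accuracy_score(y_te, softmax_demo.predict(X_te))
print(f"OvR accuracy = {acc_ovr_demo:.3f}, softmax accuracy = {acc_softmax_demo:.3f}")
```

Note that in the one-vs-rest wrapper each binary model runs its own cross-validation, so different classes may end up with different values of C, whereas the softmax model selects C jointly across classes.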

#### NB: don't scale the features this time!

In [None]:
data = pd.read_csv('tcga_150gene.csv')

X = data.drop('label', axis=1)
y = data['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, f1_score


Cs = np.logspace(2, -4, 25)  # search grid (high → low C)

ovr_lasso = OneVsRestClassifier() # TODO
ovr_lasso.fit(X_train, y_train)  # unscaled features from the multi-class split

y_pred_ovr = ovr_lasso.predict(X_test)
acc_ovr = accuracy_score(y_test, y_pred_ovr)

In [None]:
Cs = np.logspace(2, -4, 25)  # search grid (high → low C)

multi_lr = None # TODO

multi_lr.fit(X_train, y_train)  # unscaled features from the multi-class split

y_pred_multi = multi_lr.predict(X_test)

acc_multi = accuracy_score(y_test, y_pred_multi)