<a href="https://colab.research.google.com/github/a1ren-code/PyWGCNA/blob/main/ML_for_Biomedical_Informatics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

We'll be using a powerful library called "scikit-learn" (sklearn), which has already implemented all the machine learning models we've talked about. It is one of the most widely used Python libraries for classical machine learning, so it's well worth getting comfortable with!

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import datasets
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_diabetes

# Import ML packages
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron, Lasso
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Supervised Learning



Generally speaking, there are two types of supervised learning tasks you will encounter: classification and regression. This notebook will give you a taste of each.

- Classification: Want to predict what "class" a data point belongs to (ex: malignant or benign tumor)
- Regression: Want to predict a quantitative value (ex: diabetes risk score)

# Classification

For this example, we'll use the breast cancer dataset from sklearn. Each row is a tumor sample, and each column is a feature related to the tumor (ex: radius, area). The target variable is whether the tumor is malignant or benign. Note that sklearn encodes malignant as 0 and benign as 1.

We want to find the model, and combination of features, that leads to the best prediction of whether someone has a malignant (AKA cancerous) tumor!

In [None]:
# Load data
data_cancer = load_breast_cancer()

# Preview data
df_cancer = pd.DataFrame(data_cancer.data, columns=data_cancer.feature_names)
print(df_cancer.shape)
print(list(data_cancer.target_names))
df_cancer.head()

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    data_cancer.data, data_cancer.target, test_size=0.2, random_state=42
)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
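
Notice that the scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into training. Here is a minimal sketch of what that means, using made-up toy arrays (the `_toy` names and values are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not from the breast cancer dataset)
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 40.0]])

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # learns mean and std from the training rows
X_test_scaled = toy_scaler.transform(X_test_toy)        # reuses the training mean and std

# The same z-score computed by hand with the training statistics
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses
manual_test_scaled = (X_test_toy - train_mean) / train_std

print(X_test_scaled)
print(manual_test_scaled)
```

Both printouts should match, since `transform` applies the statistics learned during `fit_transform`.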

# Logistic Regression

Logistic regression is usually the simplest baseline for binary classification, so it's a good first model to try.

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

print(classification_report(y_test, y_pred_lr, target_names=data_cancer.target_names))
ConfusionMatrixDisplay.from_estimator(lr_model, X_test, y_test, display_labels=data_cancer.target_names, cmap='Blues')
plt.title("Logistic Regression: Confusion Matrix")
plt.show()

# Perceptron

In [None]:
perc_model = Perceptron(max_iter=1000, tol=1e-3, random_state=42)
perc_model.fit(X_train, y_train)
y_pred_perc = perc_model.predict(X_test)

print(classification_report(y_test, y_pred_perc, target_names=data_cancer.target_names))
ConfusionMatrixDisplay.from_estimator(perc_model, X_test, y_test, display_labels=data_cancer.target_names, cmap='Greens')
plt.title("Perceptron: Confusion Matrix")
plt.show()

# Decision Tree

In [None]:
dt_model = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

plt.figure(figsize=(15, 7))
plot_tree(dt_model, feature_names=data_cancer.feature_names, class_names=data_cancer.target_names, filled=True)
plt.title("Decision Tree Logic Flow")
plt.show()

# Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Show Feature Importance, what matters the most?
importances = pd.Series(rf_model.feature_importances_, index=data_cancer.feature_names).sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=importances.head(10), y=importances.head(10).index)
plt.title("Top 10 Clinical Features (Random Forest)")
plt.show()

# Best Model?


In [None]:
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(8, 6))

# Plot each model's ROC curve
RocCurveDisplay.from_estimator(lr_model, X_test, y_test, ax=ax, name="Logistic Regression")
RocCurveDisplay.from_estimator(dt_model, X_test, y_test, ax=ax, name="Decision Tree")
RocCurveDisplay.from_estimator(rf_model, X_test, y_test, ax=ax, name="Random Forest")
RocCurveDisplay.from_estimator(perc_model, X_test, y_test, ax=ax, name="Perceptron")

plt.plot([0, 1], [0, 1], 'k--', label="Chance (AUC = 0.5)")
plt.title("Comparison of Diagnostic Models (ROC Curve)")
plt.legend()
plt.show()
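
The legend of an ROC plot reports each model's AUC (area under the curve). If you want that number on its own, one option is `roc_auc_score`. The sketch below is self-contained and refits a logistic regression with the same split settings as above; the `_demo` names are just for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same dataset and split settings as the classification section
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr = scaler_demo.fit_transform(X_tr)
X_te = scaler_demo.transform(X_te)

lr_demo = LogisticRegression().fit(X_tr, y_tr)

# AUC needs continuous scores, not hard 0/1 predictions.
# Column 1 is the predicted probability of class 1 (benign in this dataset).
scores = lr_demo.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)
print(f"Logistic Regression AUC: {auc:.3f}")
```

An AUC of 0.5 corresponds to random guessing, and 1.0 to a perfect ranking of the two classes.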

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Assumes predictions exist for all models:
# y_pred_lr, y_pred_perc, y_pred_dt, y_pred_rf

# sklearn encodes malignant as 0 in this dataset, so pos_label=0 makes
# recall and F1 measure how well each model catches malignant tumors
class_results = [
    {
        "Model": "Logistic Regression",
        "Accuracy": accuracy_score(y_test, y_pred_lr),
        "Recall (Sensitivity)": recall_score(y_test, y_pred_lr, pos_label=0),
        "F1-Score": f1_score(y_test, y_pred_lr, pos_label=0)
    },
    {
        "Model": "Perceptron",
        "Accuracy": accuracy_score(y_test, y_pred_perc),
        "Recall (Sensitivity)": recall_score(y_test, y_pred_perc, pos_label=0),
        "F1-Score": f1_score(y_test, y_pred_perc, pos_label=0)
    },
    {
        "Model": "Decision Tree",
        "Accuracy": accuracy_score(y_test, y_pred_dt),
        "Recall (Sensitivity)": recall_score(y_test, y_pred_dt, pos_label=0),
        "F1-Score": f1_score(y_test, y_pred_dt, pos_label=0)
    },
    {
        "Model": "Random Forest",
        "Accuracy": accuracy_score(y_test, y_pred_rf),
        "Recall (Sensitivity)": recall_score(y_test, y_pred_rf, pos_label=0),
        "F1-Score": f1_score(y_test, y_pred_rf, pos_label=0)
    }
]

# Create and sort by recall (in medicine, we want to catch every case!)
summary_df = pd.DataFrame(class_results).sort_values(by="Recall (Sensitivity)", ascending=False).reset_index(drop=True)

print("--- Clinical Classification Leaderboard ---")
print(summary_df)

# Identify the winner based on the highest recall
best_model_name = summary_df.iloc[0]['Model']
print(f"The Best Model is: {best_model_name}")
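
Which class counts as "positive" matters a lot for recall. By default, `recall_score` treats label 1 as the positive class, and in this dataset label 1 is benign. To measure how many malignant tumors are caught, you would pass `pos_label=0`. A small sketch with made-up toy labels (not real predictions) shows the difference:

```python
from sklearn.metrics import recall_score

# Toy labels using the dataset's encoding: 0 = malignant, 1 = benign
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Default: positive class is 1 (benign), 5 of 6 benign cases found
recall_benign = recall_score(y_true_toy, y_pred_toy)

# Sensitivity for malignant tumors, 2 of 4 malignant cases found
recall_malignant = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print(f"Recall (benign as positive):    {recall_benign:.2f}")
print(f"Recall (malignant as positive): {recall_malignant:.2f}")
```

Here the model misses half of the malignant cases even though the default recall looks much healthier.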

# Regression

In this example, we'll use sklearn's diabetes dataset. Each row is a patient with diabetes, and each column is a baseline feature relevant to diabetes (ex: age, BMI, blood pressure).

The goal is to predict a quantitative measure of disease progression one year after a set of baseline measurements were taken!

In [None]:
# Load data
data_diabetes = load_diabetes()

# Preview data
df_diabetes = pd.DataFrame(data_diabetes.data, columns=data_diabetes.feature_names)
print(df_diabetes.shape)
print(data_diabetes.target[:10])  # first 10 target values (the full list has one value per patient)
df_diabetes.head()


In [None]:
# Split
X, y = data_diabetes.data, data_diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Linear Regression

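Linear regression is pretty much always our first choice for a regression task: it's fast, easy to interpret, and a good baseline to compare other models against.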

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)

print(f"Linear Regression R^2 Score: {r2_score(y_test, y_pred_lin):.2f}")
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred_lin):.2f} points")

# Plotting Actual vs Predicted
plt.scatter(y_test, y_pred_lin, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.xlabel("Actual Progression")
plt.ylabel("Predicted Progression")
plt.title("Linear Regression: Actual vs Predicted")
plt.show()
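
To make the two metrics above less of a black box, here is a minimal sketch that computes MAE and R^2 by hand and checks them against sklearn. The toy values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy values (hypothetical, not from the diabetes dataset)
y_true_toy = np.array([100.0, 150.0, 200.0, 250.0])
y_pred_toy = np.array([110.0, 140.0, 190.0, 260.0])

# MAE: average size of the errors, in the same units as the target
mae_manual = np.mean(np.abs(y_true_toy - y_pred_toy))

# R^2: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: manual={mae_manual:.2f}, sklearn={mean_absolute_error(y_true_toy, y_pred_toy):.2f}")
print(f"R^2: manual={r2_manual:.3f}, sklearn={r2_score(y_true_toy, y_pred_toy):.3f}")
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean.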

# Regularized Linear Regression

In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score

lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)

y_pred_lasso = lasso_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred_lasso)
r2 = r2_score(y_test, y_pred_lasso)

print("Lasso Regression Results:")
print(f"R^2 Score: {r2:.3f}")
print(f"Mean Absolute Error: {mae:.2f}")

In [None]:

# Show which features Lasso kept (non-zero coefficients)
feature_importance = pd.Series(lasso_model.coef_, index=data_diabetes.feature_names)
print("Lasso Coefficients (importance of each feature):")
print(feature_importance.sort_values(ascending=False))

plt.figure(figsize=(8, 4))
feature_importance.plot(kind='barh')
plt.title("Lasso Regression: Feature Impact")
plt.show()
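
Lasso's L1 penalty can push coefficients all the way to exactly zero, and the `alpha` parameter controls how strong that push is. As a rough sketch (the list of alpha values is an arbitrary choice for illustration), you could count how many features survive at each setting:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_diabetes(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

# Hypothetical alpha values, chosen only to span weak to very strong regularization
kept_counts = {}
for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count the features whose coefficient was not shrunk to exactly zero
    kept_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:>5}: {kept_counts[alpha]} of {X_demo.shape[1]} features kept")
```

Once alpha is large enough, every coefficient is driven to zero and the model just predicts the mean.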

# Decision Tree



In [None]:
dt_reg = DecisionTreeRegressor(max_depth=3, random_state=42)
dt_reg.fit(X_train, y_train)

y_pred_dt = dt_reg.predict(X_test)

print(f"Decision Tree R^2 Score: {r2_score(y_test, y_pred_dt):.2f}")

plt.figure(figsize=(20, 8))
plot_tree(dt_reg, feature_names=data_diabetes.feature_names, filled=True, fontsize=10)
plt.title("Decision Tree Regression Logic")
plt.show()

# Random Forest

In [None]:
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)

print(f"Random Forest R^2 Score: {r2_score(y_test, y_pred_rf):.2f}")

# Best Model?


In [None]:
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Assuming you have run predictions for all models:
# y_pred_lin, y_pred_lasso, y_pred_dt, y_pred_rf

model_results = [
    {"Model": "Linear Regression", "MAE": mean_absolute_error(y_test, y_pred_lin), "R2": r2_score(y_test, y_pred_lin)},
    {"Model": "Lasso Regression", "MAE": mean_absolute_error(y_test, y_pred_lasso), "R2": r2_score(y_test, y_pred_lasso)},
    {"Model": "Decision Tree", "MAE": mean_absolute_error(y_test, y_pred_dt), "R2": r2_score(y_test, y_pred_dt)},
    {"Model": "Random Forest", "MAE": mean_absolute_error(y_test, y_pred_rf), "R2": r2_score(y_test, y_pred_rf)}
]

# Create DataFrame
summary_df = pd.DataFrame(model_results)

# Sort by R2 (higher is better)
summary_df = summary_df.sort_values(by="R2", ascending=False).reset_index(drop=True)
print(summary_df)

# Automatically identify the winner
best_model_name = summary_df.iloc[0]['Model']
print(f"The Best Performing Model is: {best_model_name}")

# Hyperparameter Tuning

That is the general shape of an ML workflow: train several models on the same data and pick the best-performing one. There is one more thing we couldn't cover that I'll briefly mention.

Other than checking different models against each other, you can also check the same model against itself, but with different parameters. For example, would a random forest regressor with n_estimators=1000 perform better than the one we used with n_estimators=100? There are so many things we could test to optimize our machine learning model; far too many for this workshop.

This process of systematically searching for the best parameters is known as "hyperparameter tuning," and it is a standard step in most real-world ML projects.
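
One common way to do this in sklearn is `GridSearchCV`, which tries every combination in a parameter grid and scores each one with cross-validation. The sketch below is only an illustration: the grid values and `cv=3` are arbitrary choices kept small so it runs quickly, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical grid, kept small on purpose
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation on the training split only
    scoring="r2",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Cross-validated R^2: {search.best_score_:.3f}")
# The test split is only touched once, after tuning is finished
print(f"Test R^2: {search.score(X_te, y_te):.3f}")
```

Tuning on cross-validation folds of the training data, rather than on the test set, keeps the final test score an honest estimate.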

# That's all folks!