<a href="https://colab.research.google.com/github/eyaguirat10/CoWin-Breast-Cancer-Detection/blob/eya/Eya_Guirat_Breast_Cancer_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I. Business Understanding

## **Problem Statement**

Breast cancer is one of the most common and life-threatening diseases affecting women worldwide. According to the World Health Organization, it accounts for a significant proportion of cancer-related deaths among women, with **approximately 2.3 million new cases diagnosed each year**. While the highest incidence is usually observed in women aged **40 to 70**, younger women are increasingly being diagnosed, highlighting the need for vigilance across all age groups. Men can also develop breast cancer, though it is much rarer.

Early detection of breast cancer is crucial because survival rates are significantly higher when the disease is identified at an initial stage. However, **late diagnosis remains common**, particularly in regions with limited access to screening programs, diagnostic facilities, or trained specialists. Factors contributing to delayed detection include:
- Socio-economic barriers  
- Lack of awareness about breast health  
- Cultural stigmas  
- Shortages of radiologists or screening centers  

In some areas, women may delay seeking care due to fear, misinformation, or limited healthcare infrastructure — leading to advanced diagnoses when treatment is more difficult and costly.

Breast cancer can manifest in different forms, including **ductal carcinoma, lobular carcinoma, and other subtypes**, each with distinct characteristics and progression rates. The disease may present **asymptomatically** in its early stages, which is why imaging techniques such as **mammography** are critical for screening. Even on mammograms, subtle signs like **microcalcifications** or small lesions can be overlooked by experienced radiologists.

This issue is particularly pressing in regions where incidence is high and **medical resources are limited**, such as parts of **North Africa** and the **Middle East**, as well as **some areas of the United States**. Public mammography datasets such as **CBIS-DDSM** (collected in the United States) and **MIAS** (collected in the United Kingdom) illustrate the variability in breast tissue density and lesion appearance, which makes accurate detection harder.

## Stakeholders Affected

- **Doctors / Radiologists**: Responsible for interpreting mammograms and providing medical decisions. Face diagnostic complexity, heavy workloads, and pressure to avoid errors.
- **Hospitals and Clinics**: Handle screening, treatment, and patient management. Diagnostic support tools can improve outcomes and reduce costs.
- **Public Health Organizations**: Lead screening campaigns and healthcare policy. Better tools improve allocation of resources and reduce mortality.
- **Researchers**: Use datasets and AI models to develop new diagnostic tools and improve early detection performance.


Given these challenges, there is a **critical need for computer-assisted systems** that help healthcare professionals detect breast cancer early, accurately, and efficiently.

Such systems can:
- Generate **predictions** and **personalized recommendations** based on mammography or clinical data  
- Improve clinical **decision-making** while minimizing risk to patients  
- Ensure **role-based access** to data: doctors get detailed results, patients receive **validated, understandable recommendations**  

Examples of AI-guided outputs:
- Follow-up imaging requests
- Lifestyle or risk-reduction recommendations
- Alerts for uncertain or borderline diagnosis cases

In summary, breast cancer is a complex issue involving **medical, social, and economic factors**. Our project integrates **artificial intelligence** into the diagnostic workflow to:
- **Enhance early detection**
- **Reduce late diagnoses**
- **Provide safe, actionable guidance for patients**

Ultimately, the project aims to assist all stakeholders and improve breast cancer management outcomes.


---

## **Business Objective (BO)**

1.  ### Determine whether a tumor is malignant or benign based on morphological data extracted from breast tissue imagery, in order to assist clinicians in initial diagnosis.


2.  ### Characterize and differentiate malignant tumor profiles into distinct groups based on their severity or aggressiveness, enabling more targeted follow-up strategies.
   
3.  ### Propose individualized recommendations for new patients by comparing their physiological and biochemical profiles to previously identified cancer risk patterns.  

---

## **Data Science Objectives (DSO)**

1.  ### Develop a classification system that distinguishes between benign and malignant tumors using morphological features from the WDBC dataset and evaluate its accuracy using appropriate performance metrics.


2.  ### Identify natural groupings within malignant tumor cases using unsupervised clustering techniques in order to define clinically relevant cancer subtypes.
   
3.  ### Implement a recommendation system that maps new patient profiles (from the Coimbra dataset) to known cluster patterns and delivers personalized insights based on proximity to known cancer risk profiles.

### DSO1: Predict the diagnosis type — **M (Malignant)** or **B (Benign)**

| Models              | List of variables   | List of parameters    |
|---------------------|---------------------|-----------------------|
| GRU SVM             |                     |                       |
| SVM                 |                     |                       |
| Linear Regression   |                     |                       |
| MLP                 |                     |                       |
| Nearest Neighbor    |                     |                       |
| Softmax Regression  |                     |                       |
| XGBOOST (new)       |                     |                       |
| Random Forest (new) |                     |                       |


### DSO2: Cluster diagnosis patterns

| Models                 | List of variables   | List of parameters    |
|------------------------|---------------------|-----------------------|
| Kmeans                 |                     |                       |
| DBSCAN                 |                     |                       |
| Gaussian Mixture Model |                     |                       |


### DSO3: Cluster-based recommendation system

| Models                 | List of variables   | List of parameters    |
|------------------------|---------------------|-----------------------|
| KNeighborsClassifier   |                     |                       |
| DecisionTreeClassifier |                     |                       |
| Apriori                |                     |                       |


# II. Data Understanding

## II.1. DSO1: Predict the diagnosis type: M (Malignant) or B (Benign)

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
path = "/content/drive/MyDrive/Datasets/CancerData1.csv"
df = pd.read_csv(path)


In [None]:
print("Dimensions:", df.shape)
print("Columns:", df.columns.tolist())
print("\nTypes:", df.dtypes)

In [None]:
print(df.head())
print(df.tail())
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

In [None]:
df.info()

In [None]:
print("Missing values")
missing_data = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})
print(missing_df[missing_df['Missing Count'] > 0])


plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

In [None]:
print("Basic statistics")
print(df.describe())

# For the diagnosis column specifically
print("\n Diagnosis distribution")
print(df['diagnosis'].value_counts())
print(f"Malignant Percentage: {(df['diagnosis'] == 'M').mean()*100:.2f}%")

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm", center=0)
plt.title("Correlation between numerical variables")
plt.show()

In [None]:
print("Outlier detection")

# Method 1: IQR for numerical columns
def detect_outliers_iqr(column):
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = column[(column < lower_bound) | (column > upper_bound)]
    return outliers

# Check outliers in key numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

print("Outliers in key features:")
for col in numerical_cols[:5]:  # Check first 5 numerical columns
    outliers = detect_outliers_iqr(df[col])
    print(f"{col}: {len(outliers)} outliers")

# Visualize outliers for key features
key_features = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean']
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, feature in enumerate(key_features):
    df.boxplot(column=feature, ax=axes[i])
    axes[i].set_title(f'Outliers in {feature}')

plt.tight_layout()
plt.show()

In [None]:
print("Data quality summary")
print(f"Total records: {len(df)}")
print(f"Total features: {len(df.columns)}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check for constant columns (columns with only one value)
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print(f"Constant columns: {constant_cols}")

# Check for columns with too many zeros
print("\n columns with many zeros")
for col in numerical_cols:
    zero_percent = (df[col] == 0).mean() * 100
    if zero_percent > 50:  # Show columns with more than 50% zeros
        print(f"{col}: {zero_percent:.2f}% zeros")

In [None]:
print("Target variable analysis")
diagnosis_counts = df['diagnosis'].value_counts()

plt.figure(figsize=(15, 5))

# Subplot 1: Count plot
plt.subplot(1, 3, 1)
sns.countplot(data=df, x='diagnosis')
plt.title('Diagnosis Distribution')

# Subplot 2: Pie chart
plt.subplot(1, 3, 2)
plt.pie(diagnosis_counts.values, labels=diagnosis_counts.index, autopct='%1.1f%%')
plt.title('Diagnosis Proportion')

# Subplot 3: Statistics by diagnosis
plt.subplot(1, 3, 3)
diagnosis_stats = df.groupby('diagnosis')['radius_mean'].agg(['mean', 'std', 'min', 'max'])
diagnosis_stats.plot(kind='bar', ax=plt.gca())
plt.title('Radius Mean by Diagnosis')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

print(f"Diagnosis value counts:\n{diagnosis_counts}")

In [None]:
print("Correlation analysis")

# Convert diagnosis to numerical for correlation
df_corr = df.copy()
df_corr['diagnosis_numeric'] = df_corr['diagnosis'].map({'M': 1, 'B': 0})

# Correlation with target
corr_with_target = df_corr.corr(numeric_only=True)['diagnosis_numeric'].sort_values(ascending=False)
print("Top 10 features correlated with diagnosis:")
print(corr_with_target.head(10))

print("\n Bottom 10 features correlated with diagnosis:")
print(corr_with_target.tail(10))

# Plot correlation heatmap for top features
top_corr_features = corr_with_target.index[1:11]  # Exclude diagnosis itself
plt.figure(figsize=(10, 8))
sns.heatmap(df_corr[top_corr_features].corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Top 10 Features')
plt.show()

In [None]:
sns.countplot(x="diagnosis", data=df)
plt.title("Diagnosis distribution (Benign/Malignant)")
plt.show()

In [None]:
# Compare a key feature between malignant and benign
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='diagnosis', y='radius_mean')
plt.title('Tumor Radius: Malignant vs Benign')
plt.ylabel('Mean Radius')
plt.xlabel('Diagnosis')
plt.show()

In [None]:
# Selecting few key features to compare
features_to_plot = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, feature in enumerate(features_to_plot):
    sns.boxplot(data=df, x='diagnosis', y=feature, ax=axes[i])
    axes[i].set_title(f'{feature} by Diagnosis')

plt.tight_layout()
plt.show()

In [None]:
# Find two features most correlated with diagnosis
top_two = corr_with_target.index[1:3]

plt.figure(figsize=(10, 8))
sns.scatterplot(data=df, x=top_two[0], y=top_two[1], hue='diagnosis', style='diagnosis')
plt.title(f'Relationship between {top_two[0]} and {top_two[1]}')
plt.show()

## II.2. Dataset 2: DSO2 + DSO3

In [None]:
path = "/content/drive/MyDrive/Dataset_ML/dataR2.csv"
df2 = pd.read_csv(path)

In [None]:
df2.head()

In [None]:
df2.tail()

In [None]:
df2.dtypes

In [None]:
df2.info()

In [None]:
df2.describe()

 Missing Values Analysis

In this step, we calculate how many missing (null) values exist in each column of the dataset. This information is important because columns with many missing values may need special handling, such as imputation or removal, to ensure the quality of our analysis and models.

In [None]:
info_df = pd.DataFrame({
        'Column': df2.columns,
        'Non_Null_Count': df2.count(),
        'Null_Count': df2.isnull().sum(),
        'Missing_%': (df2.isnull().sum() / len(df2) * 100).round(2)
    })
print(info_df.to_string(index=False))


Duplicate Analysis

In this step, we check the dataset for duplicate rows. Duplicate records can skew analysis and models by overrepresenting certain data points.

In [None]:
duplicate_count = df2.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_count}")
if duplicate_count > 0:
    print(f"Percentage: {(duplicate_count/len(df2)*100):.2f}%")
print()

Data Visualization

Boxplots are used to visualize the distribution of numerical data and detect outliers. They display the median, quartiles, and potential extreme values for each feature, helping us understand variability and spot anomalies that may need further investigation or treatment.
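
The whisker rule behind these boxplots can be checked by hand. The sketch below is illustrative only: the `values` list is made-up data, not taken from the dataset, and `iqr_fences` is a hypothetical helper. It relies on the standard library's `statistics.quantiles` with `method="inclusive"`, which should give the same quartiles as the linear interpolation pandas uses by default.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up measurements with one extreme value (not from the dataset)
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_fences(values)

# Points outside the fences are the ones a boxplot draws individually
outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)
```

Any point outside the two fences is drawn as an individual marker on the boxplot; whether it is an error or a genuine extreme value remains a judgment call.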

Create boxplots for numerical columns

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import math

numerical_cols = df2.select_dtypes(include=['int64', 'float64']).columns

numerical_cols = [col for col in numerical_cols if df2[col].dropna().shape[0] > 0]
cols_per_fig = 9  # number of boxplots per figure

for start in range(0, len(numerical_cols), cols_per_fig):
    end = start + cols_per_fig
    batch = numerical_cols[start:end]

    plt.figure(figsize=(14, 10))
    for i, col in enumerate(batch, 1):
        plt.subplot(3, 3, i)  # 3 rows, 3 columns
        sns.boxplot(y=df2[col])
        plt.title(col)

    plt.tight_layout()
    plt.show()


This dataset was built for research purposes, so the outliers are treated as disease-related anomalies rather than measurement errors.

Correlation Visualization:

In [None]:
# correlation matrix between variables
df_features = df2.drop(columns=['Classification'])
plt.figure(figsize=(18, 12))
# Use the feature-only frame so the target is excluded from this matrix
sns.heatmap(df_features.corr(), annot=False, cmap='coolwarm')
plt.show()

In [None]:
# correlation of each variable with the target
# (Classification is already numeric, so no encoding is needed)

df2.corr()['Classification'].sort_values(ascending=False)

Principal Component Analysis (PCA)

In [None]:
# standardisation
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Select only numeric variables
X = df2.select_dtypes(include=[np.number])

# --- 2. Standardisation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- 3. PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_scaled)

# Explained variance
print("Variance explained by PC1 and PC2:", pca.explained_variance_ratio_)


In [None]:
# --- correlation circle ---
def plot_correlation_circle(pca, features, dim1=1, dim2=2):
    pcs = pca.components_
    pc1 = pcs[dim1-1]
    pc2 = pcs[dim2-1]

    fig, ax = plt.subplots(figsize=(10,10))

    # Circle
    circle = plt.Circle((0,0), 1, color='grey', fill=False, linestyle='--')
    ax.add_artist(circle)

    # Axes
    ax.axhline(0, color='black', lw=1)
    ax.axvline(0, color='black', lw=1)

    # Variables
    for i, feature in enumerate(features):
        x = pc1[i]
        y = pc2[i]
        ax.arrow(0, 0, x, y, head_width=0.03, head_length=0.03, linewidth=1.2)
        ax.text(x*1.08, y*1.08, feature, fontsize=11)

    ax.set_xlabel(f"PC{dim1} ({round(pca.explained_variance_ratio_[dim1-1]*100,2)}%)", fontsize=13)
    ax.set_ylabel(f"PC{dim2} ({round(pca.explained_variance_ratio_[dim2-1]*100,2)}%)", fontsize=13)
    ax.set_title("PCA correlation circle", fontsize=16)
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    plt.grid(False)
    plt.show()

# --- Function call ---
plot_correlation_circle(pca, X.columns)


# III. Data Preparation

## III.1. DSO1: Predict the diagnosis type: M (Malignant) or B (Benign)

In [None]:
df.drop(columns=['Unnamed: 32', 'id'], inplace=True, errors='ignore')
df.columns

In [None]:
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
scaler = StandardScaler()
X = df.drop('diagnosis', axis=1)
y = df["diagnosis"]
X_scaled = scaler.fit_transform(X)
scores = X_scaled

In [None]:
pca = PCA()
pca.fit(X_scaled)

In [None]:
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

plt.plot(cumulative_variance)
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Explained Variance")
plt.grid(True)
plt.show()

In [None]:
eigvals = pca.explained_variance_
loadings_corr = pca.components_.T * np.sqrt(eigvals)  # shape (n_features, n_components); only the first two columns are plotted

fig, ax = plt.subplots(figsize=(7, 7))
theta = np.linspace(0, 2*np.pi, 500)
ax.plot(np.cos(theta), np.sin(theta), 'b--', alpha=0.6)

names = X.columns
for i, name in enumerate(names):
    x, y_ = loadings_corr[i, 0], loadings_corr[i, 1]
    ax.arrow(0, 0, x, y_, color='crimson', alpha=0.75,
             head_width=0.02, length_includes_head=True)
    ax.text(x*1.07, y_*1.07, name, fontsize=8)

ax.axhline(0, color='grey', lw=1, ls='--')
ax.axvline(0, color='grey', lw=1, ls='--')
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% var)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% var)")
ax.set_title("PCA Correlation Circle")
ax.set_aspect('equal', 'box')
plt.show()

In [None]:
pcs = [f"PC{i}" for i in range(1, scores.shape[1] + 1)]
scores_df = pd.DataFrame(scores @ pca.components_.T, columns=pcs)
scores_df["diagnosis"] = df["diagnosis"]

In [None]:
plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=scores_df, x="PC1", y="PC2",
    hue="diagnosis",
    palette={"M": "crimson", "B": "steelblue"},
    alpha=0.75, edgecolor="white", s=45
)
plt.axhline(0, color="grey", ls="--", lw=1)
plt.axvline(0, color="grey", ls="--", lw=1)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% var)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% var)")
plt.title("Samples in PC space (PC1 vs PC2)")
plt.legend(title="Diagnosis")
plt.tight_layout()
plt.show()

##### 1. Data preparation for the model

- I start from the cleaned dataframe `df` (after removing `id` and `Unnamed: 32`).
- I create:
  - `X_model` = copy of all 30 feature columns  
  - `y_model` = copy of the `diagnosis` column
- I encode the target so that:
  - `B` → 0 (Benign)  
  - `M` → 1 (Malignant)
- The input features used by the model are the **standardised features** `X_scaled` (output of `StandardScaler`), so each variable has mean 0 and variance 1.
- I split the data into **70% training / 30% test** using `train_test_split` with `stratify=y_model` to keep the same proportion of benign and malignant cases in both sets.  
  This reproduces the protocol used in the research paper.
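
To make the "mean 0 and variance 1" statement concrete, here is a minimal sketch of the z-score computation on a made-up matrix. `X_demo` is a hypothetical stand-in, not the real data. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Made-up feature matrix: 5 samples, 2 features on very different scales
X_demo = np.array([[10.0, 200.0],
                   [12.0, 220.0],
                   [14.0, 260.0],
                   [16.0, 300.0],
                   [18.0, 320.0]])

# Column-wise z-score: subtract the mean, divide by the standard deviation
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # ddof=0 (population standard deviation)
X_demo_scaled = (X_demo - mean) / std

print(X_demo_scaled.mean(axis=0))  # close to 0 for each column
print(X_demo_scaled.std(axis=0))   # close to 1 for each column
```

After this transformation both columns contribute on the same scale, which matters for distance-based and gradient-based models.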

In [None]:
# Defensive copies (good practice)
X_model = X.copy()
y_model = y.copy()

In [None]:
# Encode target: Malignant = 1, Benign = 0
y_model = y_model.map({'M': 1, 'B': 0})

In [None]:
# 70% train / 30% test, stratified by the target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_model, test_size=0.3, random_state=42, stratify=y_model
)

In this part, we build and evaluate a **Random Forest classifier** to predict whether a tumour is **benign (0)** or **malignant (1)** using the 30 numerical features of the dataset (data.csv)

## III.2. Dataset 2: DSO2 + DSO3

DSO2

In [None]:
features = [col for col in df2.columns if col != "Classification"]
X2 = df2[features].values
y2 = df2["Classification"].values


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled2 = scaler.fit_transform(X2)

# IV. Modeling

## IV.1. DSO1: Predict the diagnosis type: M (Malignant) or B (Benign)

### Random Forest Model

##### Training the Random Forest

- I create a `RandomForestClassifier` with:
  - `n_estimators = 100` trees,
  - `class_weight = 'balanced'` to compensate for the slight class imbalance,
  - `random_state = 42` for reproducibility.
- I fit the model **only on the training set** (`X_train`, `y_train`).
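
For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python. The 6/4 label split is made up for illustration and is not the dataset's real proportion, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up labels: 6 benign (0) and 4 malignant (1)
y_demo = [0] * 6 + [1] * 4
weights = balanced_class_weights(y_demo)
print(weights)
```

The minority class receives the larger weight, so errors on malignant cases cost more during training.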

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')
# Fit on training data only
rf.fit(X_train, y_train)

##### Visualisation of one tree

- The Random Forest is an ensemble of 100 trees, which is hard to visualise directly.
- For interpretability, I plot **one tree** from the forest (`rf.estimators_[5]`) with a maximum depth of 3.
- This shows how the model uses the most important features (for example `area_worst`, `concave points_mean`, etc.) to split between benign and malignant tumours.

In [None]:
from sklearn.tree import plot_tree
features = X.columns
estimator = rf.estimators_[5]
plt.figure(figsize=(24, 10))
plot_tree(
    estimator,
    feature_names=features,
    class_names=["Benign", "Malignant"],
    filled=True,
    rounded=True,
    max_depth=3,
    fontsize=10
)
plt.title("Visualisation of one tree from the Random Forest (depth limited to 3)")
plt.show()

### MLP

In this part, we build and tune a **Multi-Layer Perceptron (MLP) classifier** to predict whether a tumour is **benign (0)** or **malignant (1)** using the same 30 numerical features as for the Random Forest.


#### Data preparation for the MLP

- Start from the cleaned dataframe `df`.
- Create:
  - `X_model_2` = copy of the 30 feature columns  
  - `y_model_2` = copy of the `diagnosis` column
- Encode the target:
  - `B` → 0 (Benign)
  - `M` → 1 (Malignant)
- Use the standardised features `X_scaled` as input to the MLP.
- Split into **70% train / 30% test**, stratified by the target, to keep the same class proportions.

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

In [None]:
X_model_2 = X.copy()
y_model_2 = y.copy()
y_model_2 = y_model_2.map({'M': 1, 'B': 0})

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,
    y_model_2,
    test_size=0.3,
    random_state=42,
    stratify=y_model_2
)

- I use `predict_proba` to get the predicted probability of the malignant class on the test set.

The ROC curve of the Random Forest on the test set is very close to the top-left
corner. The Area Under the Curve is **AUC = 0.997**, which indicates an excellent
ability to separate benign (0) from malignant (1) tumours.  
The model keeps a very high true positive rate (sensitivity) while maintaining a
very low false positive rate, which is desirable in a medical screening context.
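
As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below computes it that way on four made-up scores. `auc_from_scores` is a hypothetical helper, not the scikit-learn implementation, although the value should agree with `roc_auc_score` on the same inputs.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities of the malignant class
y_demo = [0, 0, 1, 1]
proba_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = auc_from_scores(y_demo, proba_demo)
print(auc_demo)
```

Here three of the four (malignant, benign) pairs are ordered correctly, so the AUC is 0.75; a value near 1 means almost every pair is ordered correctly.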

#### Base MLP architecture

- Define a base `MLPClassifier` with:
  - activation `relu`
  - optimiser `adam`
  - batch size 32
  - `max_iter = 500`
  - `early_stopping = True` with `validation_fraction = 0.2` and `n_iter_no_change = 20`  
    → adds an anti-overfitting safety: training stops when validation score stagnates.
- The exact hidden layer sizes and regularisation will be chosen by grid search.

In [None]:
mlp_base = MLPClassifier(
    activation='relu',
    solver='adam',
    batch_size=32,
    max_iter=500,
    early_stopping=True,      # keep a safeguard against overfitting
    validation_fraction=0.2,
    n_iter_no_change=20,
    random_state=42
)

### GRU-SVM

We implement a **GRU-SVM model**: a GRU network that outputs a single linear score,
trained with **hinge loss** (SVM style) instead of cross-entropy.
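
To illustrate the loss itself, the hinge loss for a label `y` in {−1, +1} and a raw score `s` is `max(0, 1 − y·s)`: it is zero once the sample is on the correct side with a margin of at least 1. A minimal sketch on made-up scores, where `hinge_loss` is a hypothetical helper written in plain Python rather than the Keras implementation:

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss for labels in {-1, +1} and raw (unbounded) scores."""
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)

# Made-up labels and model scores
y_demo = [1, -1, 1, -1]
scores_demo = [2.0, -0.5, 0.3, 1.0]

# Per-sample losses: 0.0 (confident and correct), 0.5 and 0.7 (correct but
# inside the margin), 2.0 (wrong side)
loss_demo = hinge_loss(y_demo, scores_demo)
print(round(loss_demo, 3))
```

Unlike cross-entropy, the loss ignores samples that are already classified with a comfortable margin and concentrates on those near or across the decision boundary.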

#### Data preparation

- The starting point is still `X_scaled` (30 standardised features) and `y`.
- For the SVM formulation we encode the labels as:
  - `M` → +1
  - `B` → −1
- GRUs expect a 3D input `(samples, timesteps, features)`.  
  Here we treat the 30 features as a “sequence” of length 30 with 1 feature per step:
  `X_gru.shape = (n_samples, 30, 1)`.
- We perform a **70% / 30%** train–test split, stratified on `y_gru`.
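
The reshape described above only adds a trailing axis and does not change any value. A small sketch with a made-up array, where `X_flat` is a hypothetical stand-in for the standardised matrix:

```python
import numpy as np

# Made-up stand-in for the standardised matrix: 4 samples, 30 features
X_flat = np.arange(4 * 30, dtype="float32").reshape(4, 30)

# Add a trailing axis: each feature becomes one "time step" holding one value
X_seq = X_flat.reshape(X_flat.shape[0], X_flat.shape[1], 1)

print(X_seq.shape)
# The values are unchanged, only the layout differs
print(np.array_equal(X_seq[:, :, 0], X_flat))
```

The order of the 30 columns carries no temporal meaning, so the "sequence" is a modelling convenience rather than a real time series.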

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_curve, roc_auc_score
import matplotlib.pyplot as plt

In [None]:
y_gru = y.map({'M': 1, 'B': -1}).astype('float32')

In [None]:
n_samples, n_features = X_scaled.shape
X_gru = X_scaled.reshape(n_samples, n_features, 1)

In [None]:
X_train_gru, X_test_gru, y_train_gru, y_test_gru = train_test_split(
    X_gru,
    y_gru,
    test_size=0.3,
    random_state=42,
    stratify=y_gru
)

#### GRU-SVM architecture

- Custom metric `svm_accuracy`:
  - takes the **sign** of the model output,
  - compares it to the true labels in {−1, +1}.
- Model structure:
  - one `GRU(32)` layer (sequence → 32-dim embedding),
  - one `Dropout(0.3)` layer to reduce overfitting,
  - one `Dense(1, activation='linear')` output → raw SVM score.
- We compile with:
  - optimiser: `Adam(learning_rate=1e-3)`
  - loss: `'hinge'` (SVM margin loss)
  - metric: `svm_accuracy`.

In [None]:
def svm_accuracy(y_true, y_pred):
    # sign of the prediction
    y_pred_sign = tf.sign(y_pred)
    # if the model outputs exactly 0, force the sign to +1
    y_pred_sign = tf.where(tf.equal(y_pred_sign, 0.0),
                           tf.ones_like(y_pred_sign),
                           y_pred_sign)
    equal = tf.equal(y_true, y_pred_sign)
    return tf.reduce_mean(tf.cast(equal, tf.float32))

gru_svm = Sequential([
    GRU(32, input_shape=(n_features, 1), return_sequences=False),
    Dropout(0.3),
    Dense(1, activation='linear')    # linear output for the hinge loss
])

gru_svm.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='hinge',
    metrics=[svm_accuracy]
)


#### Training the GRU-SVM

- Use `EarlyStopping` on `val_loss` with:
  - `patience = 30`
  - `restore_best_weights = True`
- Training configuration:
  - validation split: 20% of the training set,
  - `epochs = 300` (training may stop earlier thanks to early stopping),
  - `batch_size = 32`,
  - `verbose = 0` for a clean notebook.
- At the end, we print the epoch where training actually stopped.
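
The patience mechanism can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation (it ignores options such as `min_delta`). The validation losses are made up and `simulate_early_stopping` is a hypothetical helper.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses.

    best_epoch is 0-based. Training stops once `patience` consecutive
    epochs fail to improve on the best loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch + 1, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses; a short patience is used to keep the trace readable
val_losses_demo = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.69]
epochs_run, best_epoch = simulate_early_stopping(val_losses_demo, patience=3)
print(epochs_run, best_epoch)
```

In this trace training stops after 6 epochs and the best epoch is index 2; with `restore_best_weights=True` the weights from that epoch would be the ones kept. The later, lower loss of 0.69 is never reached, which is why a generous patience such as 30 can be worthwhile.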

In [None]:
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=30,
    restore_best_weights=True
)

history_gru = gru_svm.fit(
    X_train_gru, y_train_gru,
    validation_split=0.2,
    epochs=300,
    batch_size=32,
    callbacks=[early_stop],
    verbose=0
)

print(f"Training stopped at epoch {len(history_gru.history['loss'])}")

## IV.2. DSO2: Cluster diagnosis patterns

In [None]:
# 1) Keep only the cancer patients
df_cancer = df2[df2["Classification"] == 2].copy()      # patients only
df_non_cancer = df2[df2["Classification"] == 1].copy()  # non-patients only
print(df_cancer.shape)  # number of cancer patients

### **Model 1 (K-means)**

In [None]:
# 2) Features = all explanatory columns WITHOUT Classification
features = [col for col in df_cancer.columns if col != "Classification"]
X_cancer = df_cancer[features].values

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_cancer = StandardScaler()
X_cancer_scaled = scaler_cancer.fit_transform(X_cancer)


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

inertias = []
sil_scores = []
K_range = range(2, 5)  # K = 2, 3, 4 for cancer subtypes

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_cancer_scaled)
    inertias.append(km.inertia_)
    labels = km.labels_
    sil = silhouette_score(X_cancer_scaled, labels)
    sil_scores.append(sil)
    print(f"K={k}: silhouette={sil:.3f}")

# Elbow
plt.figure(figsize=(6,4))
plt.plot(K_range, inertias, marker='o')
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow method – KMeans (cancer patients)")
plt.grid(True)
plt.show()

# Silhouette
plt.figure(figsize=(6,4))
plt.plot(K_range, sil_scores, marker='o')
plt.xlabel("Number of clusters K")
plt.ylabel("Mean silhouette score")
plt.title("Silhouette – KMeans (cancer patients)")
plt.grid(True)
plt.show()


In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hierarchical linkage with Ward's criterion (suitable for standardized continuous data)
Z = linkage(X_cancer_scaled, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode=None, color_threshold=None)
plt.title("Dendrogram – hierarchical clustering (cancer patients)")
plt.xlabel("Patients")
plt.ylabel("Distance (Ward)")
plt.show()


For k=2

In [None]:
best_k = 2

kmeans_cancer = KMeans(n_clusters=best_k, random_state=42, n_init=10)
kmeans_cancer.fit(X_cancer_scaled)

cluster_labels_cancer = kmeans_cancer.labels_

# Add cluster labels in df_cancer
df_cancer["cluster_kmeans_cancer"] = cluster_labels_cancer

# If you want to reintegrate into the complete df (NaN for non-patients)
df2["cluster_kmeans_cancer"] = np.nan
df2.loc[df_cancer.index, "cluster_kmeans_cancer"] = cluster_labels_cancer


In [None]:
# Average of variables per cluster
cluster_profile = df_cancer.groupby("cluster_kmeans_cancer")[features].mean()
print(cluster_profile)

# Number of patients per cluster
print(df_cancer["cluster_kmeans_cancer"].value_counts())


In [None]:
from sklearn.decomposition import PCA

# 2D PCA projection for patients with cancer
pca_vis = PCA(n_components=2)
X_cancer_pca = pca_vis.fit_transform(X_cancer_scaled)

# IMPORTANT: add columns to df_cancer
df_cancer = df_cancer.copy()
df_cancer["PC1"] = X_cancer_pca[:, 0]
df_cancer["PC2"] = X_cancer_pca[:, 1]


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=df_cancer,
    x="PC1", y="PC2",
    hue="cluster_kmeans_cancer",
    palette="viridis",
    s=60,
    edgecolor="k"
)
plt.xlabel(f"PC1 ({pca_vis.explained_variance_ratio_[0]*100:.1f} %)")
plt.ylabel(f"PC2 ({pca_vis.explained_variance_ratio_[1]*100:.1f} %)")
plt.title("K-Means (K=2) – Cancer patients, PCA projection")
plt.legend(title="Cluster")
plt.grid(True)
plt.show()


Cluster 0:

Younger patients with a lower BMI and markedly lower glucose, insulin, and HOMA than cluster 1 → less disturbed metabolic profile.

Leptin, resistin, and MCP.1 are also lower, consistent with more moderate obesity and inflammation.

A fitting name: "Cancers with moderate metabolic risk" or "Moderate metabolic subtype."

Cluster 1:

Higher average age (63 years vs. ~53), markedly higher BMI (31), very high glucose (~123), and insulin and HOMA roughly 3–4 times higher → high insulin resistance or metabolic syndrome. Leptin and MCP.1 are much higher and resistin is also higher → profile of marked obesity and greater systemic inflammation.

A fitting name: "High metabolic risk cancers (obesity and insulin resistance)" or, more briefly, "Severe metabolic subtype."

For k=3

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# We start with df_cancer, features, and X_cancer_scaled already defined.
best_k = 3

kmeans_cancer_3 = KMeans(n_clusters=best_k, random_state=42, n_init=10)
kmeans_cancer_3.fit(X_cancer_scaled)

cluster_labels_cancer_3 = kmeans_cancer_3.labels_
df_cancer["cluster_kmeans_cancer_3"] = cluster_labels_cancer_3

# Internal quality
sil_3 = silhouette_score(X_cancer_scaled, cluster_labels_cancer_3)
print("Silhouette K=3 (cancers) :", sil_3)


In [None]:
df_cancer.head(50)

In [None]:
df_cancer.tail()

In [None]:
# Number of patients per cluster
print(df_cancer["cluster_kmeans_cancer_3"].value_counts())

# Average of variables per cluster
cluster_profile_3 = df_cancer.groupby("cluster_kmeans_cancer_3")[features].mean()
print(cluster_profile_3)


In [None]:
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

pca_vis_3 = PCA(n_components=2)
X_cancer_pca_3 = pca_vis_3.fit_transform(X_cancer_scaled)

df_cancer["PC1_3"] = X_cancer_pca_3[:, 0]
df_cancer["PC2_3"] = X_cancer_pca_3[:, 1]

plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=df_cancer,
    x="PC1_3", y="PC2_3",
    hue="cluster_kmeans_cancer_3",
    palette="viridis",
    s=60,
    edgecolor="k"
)
plt.xlabel(f"PC1 ({pca_vis_3.explained_variance_ratio_[0]*100:.1f} %)")
plt.ylabel(f"PC2 ({pca_vis_3.explained_variance_ratio_[1]*100:.1f} %)")
plt.title("K-Means (K=3) – Cancer patients, PCA projection")
plt.legend(title="Cluster")
plt.grid(True)
plt.show()


Looking at the figure:

Cluster 0 (purple) is mainly on the left, with lower PC1 values → cancers with a relatively less disturbed metabolic profile, closer to “normal” in terms of glucose/insulin/HOMA.

Cluster 1 (green) occupies the central/right area, with higher PC1 than cluster 0 → cancers with intermediate metabolic abnormalities (higher glucose/HOMA), but not as extreme as cluster 2.

Cluster 2 (yellow) contains a few patients at the far right of PC1 → the most extreme cases on the metabolic axis (marked hyperglycemia and/or insulin resistance, probably also high BMI/leptin).

Cluster 0: “Moderate metabolic cancers”

Least disturbed metabolic profile among patients.

Cluster 1: “Intermediate metabolic risk cancers”

Clear but not extreme metabolic abnormalities, central group.

Cluster 2: “Severe metabolic cancers (extreme profiles)”

A few patients with very marked disorders, located at the extreme end of PC1.
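The reading above treats PC1 as a metabolic axis. One way to check that assumption is to inspect the PCA loadings, i.e. how strongly each standardized feature contributes to each component. The sketch below runs on synthetic data with a hypothetical subset of feature names (`toy_features`, `X_toy`), not on the Coimbra values; in the notebook the same idea would apply to `pca_vis_3.components_` together with the real `features` list.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: three correlated "metabolic" columns plus one unrelated column.
n = 200
latent = rng.normal(size=n)
toy_features = ["Glucose", "Insulin", "HOMA", "Age"]  # hypothetical subset of names
X_toy = np.column_stack([
    latent + 0.3 * rng.normal(size=n),
    latent + 0.3 * rng.normal(size=n),
    latent + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

X_toy_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=2).fit(X_toy_scaled)

# components_ has shape (n_components, n_features); transpose to get one row per feature.
loadings = pd.DataFrame(pca_toy.components_.T, index=toy_features, columns=["PC1", "PC2"])
print(loadings.round(2))

# Features ranked by the absolute size of their contribution to PC1.
pc1_ranking = loadings["PC1"].abs().sort_values(ascending=False)
print(pc1_ranking)
```

If glucose, insulin, and HOMA do not dominate the PC1 loadings on the real data, the "metabolic axis" interpretation of the scatter plot would need to be revisited.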

### **Model 2 (Agglomerative)**

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Agglomerative (hierarchical) clustering with k = 3
agg_cancer_3 = AgglomerativeClustering(
    n_clusters=3,
    linkage="ward"       # Ward linkage (uses Euclidean distance)
)

labels_agg_3 = agg_cancer_3.fit_predict(X_cancer_scaled)

# Add labels to df_cancer
df_cancer["cluster_agg_3"] = labels_agg_3

# (optional) put back into the complete df
df2["cluster_agg_3"] = np.nan
df2.loc[df_cancer.index, "cluster_agg_3"] = labels_agg_3

# Silhouette to evaluate quality (optional but useful for comparison with KMeans)
sil_agg_3 = silhouette_score(X_cancer_scaled, labels_agg_3)
print("Silhouette Agglomerative k=3 :", sil_agg_3)


In [None]:
df_cancer.head(50)

In [None]:
cluster_profile_agg_3 = df_cancer.groupby("cluster_agg_3")[features].mean()
print(cluster_profile_agg_3)

print(df_cancer["cluster_agg_3"].value_counts())


In [None]:
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

# We start with X_cancer_scaled (cancer patients, standardized)
pca_agg = PCA(n_components=2)
X_cancer_pca_agg = pca_agg.fit_transform(X_cancer_scaled)

# Add the components to df_cancer
df_cancer["PC1_agg"] = X_cancer_pca_agg[:, 0]
df_cancer["PC2_agg"] = X_cancer_pca_agg[:, 1]

plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=df_cancer,
    x="PC1_agg", y="PC2_agg",
    hue="cluster_agg_3",        # agglomerative labels (0, 1, 2)
    palette="Set1",
    s=60,
    edgecolor="k"
)
plt.xlabel(f"PC1 ({pca_agg.explained_variance_ratio_[0]*100:.1f} %)")
plt.ylabel(f"PC2 ({pca_agg.explained_variance_ratio_[1]*100:.1f} %)")
plt.title("Agglomerative (k=3) – Cancer patients, PCA projection")
plt.legend(title="Agglomerative cluster")
plt.grid(True)
plt.show()
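Silhouette scores alone do not say whether K-Means and the agglomerative model recover the same groups. One common way to quantify agreement between two labelings is the adjusted Rand index (`sklearn.metrics.adjusted_rand_score`), which ignores the arbitrary numbering of clusters. The label vectors below are made up for illustration; in the notebook the comparison would be between `df_cancer["cluster_kmeans_cancer_3"]` and `df_cancer["cluster_agg_3"]`.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up label vectors for eight patients (not results from the notebook).
labels_a = [0, 0, 0, 1, 1, 2, 2, 2]
labels_b = [2, 2, 2, 0, 0, 1, 1, 1]   # same partition as labels_a, different cluster ids
labels_c = [0, 0, 1, 1, 1, 2, 2, 0]   # two patients assigned differently

ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(f"ARI, identical partitions: {ari_same:.2f}")
print(f"ARI, differing partitions: {ari_diff:.2f}")
```

A value of 1 means the two methods produce the same partition; values near 0 mean the agreement is no better than chance.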


### **Model 3 (GMM)**


In [None]:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Test k = 2, 3, 4
K_range = range(2, 5)
sil_scores_gmm = []

for k in K_range:
    gmm = GaussianMixture(
        n_components=k,
        covariance_type="full",
        random_state=42
    )
    gmm.fit(X_cancer_scaled)

    labels = gmm.predict(X_cancer_scaled)
    sil = silhouette_score(X_cancer_scaled, labels)
    sil_scores_gmm.append(sil)

    print(f"GMM k={k}: silhouette={sil:.3f}")

# Silhouette chart
plt.figure(figsize=(6, 4))
plt.plot(K_range, sil_scores_gmm, marker='o')
plt.xlabel("Number of clusters K")
plt.ylabel("Silhouette score")
plt.title("Silhouette – GMM (cancer patients)")
plt.grid(True)
plt.show()


In [None]:
import numpy as np

# Choose the best k
best_k_gmm = 3   # (or the best according to silhouette)

gmm_final = GaussianMixture(
    n_components=best_k_gmm,
    covariance_type="full",
    random_state=42
)
gmm_final.fit(X_cancer_scaled)

# Labels clusters
labels_gmm = gmm_final.predict(X_cancer_scaled)

# Add to df_cancer
df_cancer["cluster_gmm"] = labels_gmm

# Integrate into complete df
df2["cluster_gmm"] = np.nan
df2.loc[df_cancer.index, "cluster_gmm"] = labels_gmm

# Clustering quality
sil_gmm = silhouette_score(X_cancer_scaled, labels_gmm)
print("Silhouette GMM :", sil_gmm)

# Number of patients per cluster
print(df_cancer["cluster_gmm"].value_counts())

# Average cluster profile
cluster_profile_gmm = df_cancer.groupby("cluster_gmm")[features].mean()
print(cluster_profile_gmm)


In [None]:
df_cancer.head(50)

In [None]:
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

# PCA 2D
pca_gmm = PCA(n_components=2)
X_cancer_pca_gmm = pca_gmm.fit_transform(X_cancer_scaled)

df_cancer["PC1_gmm"] = X_cancer_pca_gmm[:, 0]
df_cancer["PC2_gmm"] = X_cancer_pca_gmm[:, 1]

# Scatterplot
plt.figure(figsize=(7, 6))
sns.scatterplot(
    data=df_cancer,
    x="PC1_gmm",
    y="PC2_gmm",
    hue="cluster_gmm",
    palette="viridis",
    s=60,
    edgecolor="k"
)
plt.xlabel(f"PC1 ({pca_gmm.explained_variance_ratio_[0]*100:.1f} %)")
plt.ylabel(f"PC2 ({pca_gmm.explained_variance_ratio_[1]*100:.1f} %)")
plt.title("GMM – Cancer patients, PCA projection")
plt.legend(title="Cluster GMM")
plt.grid(True)
plt.show()


In [None]:
cols_to_drop = ["PC1", "PC2", "PC1_3", "PC2_3", "PC1_agg", "PC2_agg", "PC1_gmm","PC2_gmm"]
df_cancer = df_cancer.drop(columns=cols_to_drop, errors="ignore")


In [None]:
df_cancer.head()

In [None]:
# Assign label 3 to the non-cancer (control) group in every cluster column
df_non_cancer["cluster_kmeans_cancer"] = 3
df_non_cancer["cluster_kmeans_cancer_3"] = 3
df_non_cancer["cluster_agg_3"] = 3
df_non_cancer["cluster_gmm"] = 3

In [None]:
df_non_cancer.head()

In [None]:
# Merge the cancer and non-cancer DataFrames
df_complet = pd.concat([df_cancer, df_non_cancer], axis=0)
df_complet = df_complet.sort_index()


In [None]:
print(df_complet["cluster_kmeans_cancer"].value_counts())
print(df_complet["cluster_kmeans_cancer_3"].value_counts())
print(df_complet["cluster_agg_3"].value_counts())
print(df_complet["cluster_gmm"].value_counts())


In [None]:
df_complet.head()

In [None]:
df_complet.tail()

In [None]:
df_complet.dtypes

## IV.3. DSO3

*We are going to prepare the Coimbra dataset for the recommendation system.*

In [None]:
df_complete = df_complet.copy()

In [None]:
# Drop the target column and the cluster assignments that are not kept
df_complete = df_complete.drop(
    columns=["Classification", "cluster_kmeans_cancer_3", "cluster_agg_3", "cluster_gmm"]
)

In [None]:
df_complete = df_complete.rename(columns={
    "cluster_kmeans_cancer": "cluster_final"
})

In [None]:
df_complete.columns


In [None]:
df_complete.dtypes

In [None]:
df_complete.head()

In [None]:
df_complete.tail()

In [None]:
features = [col for col in df_complete.columns if col != "cluster_final"]
df_cancer_rec = df_complete[df_complete["cluster_final"].isin([0, 1, 2])].copy()
df_non_cancer_rec = df_complete[df_complete["cluster_final"] == 3].copy()

In [None]:
df_non_cancer_rec.tail()

In [None]:
baseline_stats = df_non_cancer_rec[features].describe().T
baseline_stats = baseline_stats[['mean', 'std', 'min', 'max']]
baseline_stats['5th_percentile'] =  df_non_cancer_rec[features].quantile(0.05)
baseline_stats['95th_percentile'] = df_non_cancer_rec[features].quantile(0.95)

In [None]:
print("=== HEALTHY REFERENCE RANGES ===")
print(baseline_stats)

- **Mean** → typical healthy value
- **Std** → natural variability
- **Min / Max** → extreme observed healthy values
- **5th percentile** → the lower boundary of what is still considered normal
- **95th percentile** → the upper boundary of what is still considered normal

**The healthy controls in the dataset are women around 58 years old and slightly overweight (BMI ≈ 28), with a normal fasting glucose (≈ 88 mg/dL) and a normal fasting insulin level (≈ 6.9 µU/mL).**
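As a small illustration of how the 5th and 95th percentiles can act as a reference range, the sketch below flags values that fall outside the band. The glucose values and the names `healthy_glucose` and `new_values` are invented for the example and are not taken from the Coimbra data.

```python
import pandas as pd

# Invented fasting-glucose values (mg/dL) standing in for the healthy reference group.
healthy_glucose = pd.Series([78, 82, 85, 86, 88, 90, 91, 93, 95, 102])

low = healthy_glucose.quantile(0.05)
high = healthy_glucose.quantile(0.95)
print(f"Reference range: {low:.1f} to {high:.1f} mg/dL")

# Invented values to check against the range.
new_values = pd.Series([70, 88, 130], index=["patient_a", "patient_b", "patient_c"])
inside = new_values.between(low, high)   # inclusive on both ends by default
print(inside)
```

With a small reference group the percentiles are interpolated between neighbouring observations, so the band should be read as approximate.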

## How far each patient is from the healthy average

In [None]:
def calculate_z_scores(df, baseline):
    z_scores = pd.DataFrame(index=df.index, columns=features)

    for feature in features:
        mean = baseline.loc[feature, 'mean']
        std = baseline.loc[feature, 'std']
        z_scores[feature] = (df[feature] - mean) / std

    return z_scores

z_scores_all = calculate_z_scores(df_complete, baseline_stats)


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


# The 9 features to plot (Age, BMI, and seven blood biomarkers)
features = ["Age","BMI","Glucose","Insulin","HOMA","Leptin",
            "Adiponectin","Resistin","MCP.1"]


df_z_with_cluster = z_scores_all.copy()
df_z_with_cluster["cluster"] = df_complete["cluster_final"]


# One boxplot per feature
for feature in features:
    plt.figure(figsize=(6,4))
    sns.boxplot(data=df_z_with_cluster, x="cluster", y=feature)
    plt.title(f"{feature} Z-score Distribution per Cluster")
    plt.show()


**What is shown in these plots**

Each boxplot represents the **distribution of Z-scores** for a given variable  
(**Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, MCP-1**) across the different clusters.

Z-scores indicate how far a patient’s value deviates from the healthy (non-cancer) group average, in units of that group's standard deviation:
- **Z = 0** → equal to the healthy average  
- **Z > 0** → higher than the healthy average  
- **Z < 0** → lower than the healthy average  
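To make the definition concrete, here is a minimal worked example with invented numbers. The mean and standard deviation come from the healthy reference group only, and each patient is then expressed in units of that healthy standard deviation.

```python
import pandas as pd

# Invented values: a healthy reference group and two patients to score.
healthy = pd.DataFrame({"Glucose": [80.0, 90.0, 100.0], "BMI": [22.0, 26.0, 30.0]})
patients = pd.DataFrame(
    {"Glucose": [110.0, 90.0], "BMI": [26.0, 18.0]},
    index=["patient_a", "patient_b"],
)

healthy_mean = healthy.mean()
healthy_std = healthy.std()   # sample standard deviation (ddof=1), as in DataFrame.describe()

# Subtraction and division align on column names, so no loop over features is needed.
z = (patients - healthy_mean) / healthy_std
print(z)
```

Here the healthy glucose mean is 90 with a standard deviation of 10, so a glucose of 110 gives Z = +2: two healthy standard deviations above the healthy average.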


**Variable-specific interpretation logic**

The following interpretations apply to all boxplots:

- **Age**  
  Positive Z-scores indicate **older** patients, while negative Z-scores indicate **younger** patients.

- **BMI**  
  Higher Z-scores correspond to **higher body mass index** relative to the healthy average.

- **Insulin & HOMA**  
  Higher Z-scores indicate **higher insulin levels** and **greater insulin resistance**.

- **Leptin & Resistin**  
  Higher Z-scores suggest **altered adipokine levels**, often associated with metabolic imbalance.

- **Adiponectin**  
  Lower Z-scores indicate **reduced levels**, commonly linked to metabolic dysfunction.

- **MCP-1**  
  Higher Z-scores reflect **increased inflammatory activity**.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


plt.figure(figsize=(12, 8))
sns.heatmap(z_scores_all, cmap="coolwarm", center=0)
plt.title("Z-score Heatmap for All Biomarkers")
plt.show()


**Most patients are near normal (light colors)**: their biomarker values are close to those of healthy individuals.

How to read the heatmap:

* 0 → the patient is close to the healthy average
* Blue → the patient is below the healthy average
* Red → the patient is above the healthy average
* Darker colors → stronger deviation
* Rows = individual patients
* Columns = biomarkers

The biggest deviation is in glucose regulation:

* High insulin
* High glucose
* High HOMA
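A visual impression from a heatmap can be backed by a number. One simple summary, sketched here on invented Z-scores, is the mean absolute Z-score per biomarker; in the notebook the same expression could be applied to `z_scores_all`.

```python
import pandas as pd

# Invented Z-scores for four patients (rows) and three biomarkers (columns).
z_toy = pd.DataFrame({
    "Glucose": [0.5, 2.0, -0.5, 3.0],
    "BMI": [-0.5, 0.5, 0.0, 1.0],
    "Adiponectin": [0.0, -1.0, 0.5, -1.5],
})

# Mean absolute deviation from the healthy average, largest first.
mean_abs_z = z_toy.abs().mean().sort_values(ascending=False)
print(mean_abs_z)
```

Taking the absolute value first matters: positive and negative deviations would otherwise cancel out in the average.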




#### Convert Z-scores to interpretable severity labels

In [None]:
def label_deviation(z_value):
    """Map a Z-score to a severity label based on its absolute value."""
    abs_z = abs(z_value)
    if abs_z < 1.0:
        return "Normal"
    elif abs_z < 1.5:
        return "Mild"
    elif abs_z < 2.5:
        return "Moderate"
    else:
        return "Severe"


deviation_labels = z_scores_all.applymap(label_deviation)
deviation_labels.columns = [f"{col}_deviation" for col in deviation_labels.columns]
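The same thresholds can also be applied in a vectorized way. The sketch below is an alternative, not the notebook's method: it uses `pandas.cut` on the absolute Z-scores with left-closed bins so that the boundaries match `label_deviation` (for example, |Z| = 1.0 counts as "Mild"). The Z-scores are invented.

```python
import numpy as np
import pandas as pd

z_values = pd.Series([0.2, -1.0, 1.49, -2.0, 2.5, -4.1])  # invented Z-scores

labels = pd.cut(
    z_values.abs(),
    bins=[0.0, 1.0, 1.5, 2.5, np.inf],
    labels=["Normal", "Mild", "Moderate", "Severe"],
    right=False,   # intervals are [0, 1), [1, 1.5), [1.5, 2.5), [2.5, inf)
)
print(labels.tolist())
```

One behavioural difference is worth knowing: with `pandas.cut` a missing Z-score stays missing, whereas `label_deviation` would return "Severe" for a NaN because every comparison with NaN is false.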

In [None]:
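import pandas as pd
import matplotlib.pyplot as plt

# Count how many patients fall in each deviation level, per biomarker
# (levels that never occur for a biomarker become 0 instead of NaN)
deviation_counts = deviation_labels.apply(pd.Series.value_counts).fillna(0)

# DataFrame.plot creates its own figure, so no separate plt.figure() call is needed
deviation_counts.T.plot(kind="bar", figsize=(14, 6), colormap="viridis")
plt.title("Deviation Distribution per Biomarker")
plt.ylabel("Number of Patients")
plt.xticks(rotation=45)
plt.legend(title="Deviation Level")
plt.show()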


Create the biological meaning dictionary

In [None]:
biological_meaning = {
    "Age": {
        "full_name": "Age",
        "meaning": "Patient's age in years; cancer risk increases with age",
        "unit": "years",
        "clinical_significance": "Age >50 associated with higher breast cancer risk",
        "mechanism": "Accumulation of genetic mutations over time",
        "normal_range": None,
        "interpretation": {
            "High": "Increased cancer risk; prioritize screening",
            "Low": "Lower baseline risk but not protective"
        }
    },
    "BMI": {
        "full_name": "Body Mass Index",
        "meaning": "Body fat measure; obesity linked to estrogen production and inflammation",
        "unit": "kg/m²",
        "clinical_significance": "BMI >30 increases breast cancer risk by 30-50%",
        "mechanism": "Adipose tissue produces estrogen and inflammatory cytokines",
        "normal_range": (18.5, 24.9),
        "interpretation": {
            "High": "Weight reduction reduces cancer risk and improves metabolic health",
            "Low": "Rule out malnutrition or cachexia"
        }
    },
    "Glucose": {
        "full_name": "Fasting Blood Glucose",
        "meaning": "Primary energy source; elevated in insulin resistance and diabetes",
        "unit": "mg/dL",
        "clinical_significance": "Hyperglycemia fuels cancer cell growth (Warburg effect)",
        "mechanism": "Cancer cells preferentially use glucose for rapid proliferation",
        "normal_range": (70, 100),
        "interpretation": {
            "High": "Target glycemic control to reduce cancer progression risk",
            "Low": "Risk of hypoglycemia; adjust medications"
        }
    },
    "Insulin": {
        "full_name": "Fasting Insulin",
        "meaning": "Hormone that regulates glucose uptake; elevated in insulin resistance",
        "unit": "µU/mL",
        "clinical_significance": "Hyperinsulinemia activates IGF-1 pathway promoting cancer growth",
        "mechanism": "Insulin binds to IGF-1 receptors on cancer cells, stimulating proliferation",
        "normal_range": (2.6, 24.9),
        "interpretation": {
            "High": "Insulin-sensitizing drugs may reduce cancer risk and progression",
            "Low": "Check C-peptide; may indicate beta-cell dysfunction"
        }
    },
    "HOMA": {
        "full_name": "Homeostatic Model Assessment for Insulin Resistance",
        "meaning": "Calculated index of insulin resistance; HOMA = (Glucose × Insulin) / 405",
        "unit": "ratio",
        "clinical_significance": "HOMA >2.5 indicates insulin resistance, linked to poor cancer outcomes",
        "mechanism": "Insulin resistance creates pro-growth, pro-inflammatory environment",
        "normal_range": (0.5, 2.5),
        "interpretation": {
            "High": "Lifestyle modification and metformin can improve insulin sensitivity",
            "Low": "Normal insulin sensitivity; maintain healthy habits"
        }
    },
    "Leptin": {
        "full_name": "Leptin",
        "meaning": "Satiety hormone from adipose tissue; signals energy stores to brain",
        "unit": "ng/mL",
        "clinical_significance": "Elevated leptin in obesity promotes cancer via JAK/STAT pathway",
        "mechanism": "Leptin activates pro-survival and anti-apoptotic signals in cancer cells",
        "normal_range": (4, 25),
        "interpretation": {
            "High": "Leptin resistance in obesity; target visceral fat reduction",
            "Low": "May indicate low body fat or leptin deficiency"
        }
    },
    "Adiponectin": {
        "full_name": "Adiponectin",
        "meaning": "Anti-inflammatory adipokine; inversely related to body fat",
        "unit": "µg/mL",
        "clinical_significance": "Low adiponectin linked to cancer risk; protective molecule",
        "mechanism": "Activates AMPK pathway, inhibits mTOR (cancer suppression)",
        "normal_range": (5, 30),
        "interpretation": {
            "High": "Protective; associated with better metabolic health",
            "Low": "Metabolic dysfunction; increase through exercise and weight loss"
        }
    },
    "Resistin": {
        "full_name": "Resistin",
        "meaning": "Pro-inflammatory protein from adipose tissue; promotes insulin resistance",
        "unit": "ng/mL",
        "clinical_significance": "Elevated in inflammation and obesity; linked to cancer progression",
        "mechanism": "Induces NF-κB pathway activation, promoting inflammation and angiogenesis",
        "normal_range": (4, 12),
        "interpretation": {
            "High": "Anti-inflammatory interventions critical; monitor CRP",
            "Low": "Low inflammatory state; favorable"
        }
    },
    "MCP.1": {
        "full_name": "Monocyte Chemoattractant Protein-1",
        "meaning": "Chemokine that recruits immune cells to sites of inflammation",
        "unit": "pg/dL",
        "clinical_significance": "Elevated MCP-1 promotes tumor-associated macrophage infiltration",
        "mechanism": "Creates pro-tumor microenvironment via M2 macrophage polarization",
        "normal_range": (300, 600),
        "interpretation": {
            "High": "Chronic inflammation; consider anti-inflammatory diet and omega-3s",
            "Low": "Reduced inflammatory signaling; favorable"
        }
    }
}


print("✓ Biological meaning dictionary created for", len(biological_meaning), "features")
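The HOMA formula quoted in the dictionary can be checked by hand. The sketch below applies it to the approximate healthy averages mentioned earlier (glucose ≈ 88 mg/dL, insulin ≈ 6.9 µU/mL); the function name `homa_ir` is introduced here only for illustration.

```python
def homa_ir(glucose_mg_dl: float, insulin_uU_ml: float) -> float:
    """HOMA-IR with glucose in mg/dL and insulin in µU/mL."""
    return glucose_mg_dl * insulin_uU_ml / 405.0


value = homa_ir(88.0, 6.9)
print(f"HOMA-IR = {value:.2f}")
```

The result, about 1.5, sits below the 2.5 cut-off listed under `clinical_significance`, which is consistent with describing the healthy group as insulin-sensitive.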



Add biological interpretations to baseline stats

In [None]:
# Add new columns to baseline_stats
baseline_stats['biological_meaning'] = ""
baseline_stats['unit'] = ""
baseline_stats['clinical_significance'] = ""


for feature in features:
    if feature in biological_meaning:
        baseline_stats.loc[feature, 'biological_meaning'] = biological_meaning[feature]['meaning']
        baseline_stats.loc[feature, 'unit'] = biological_meaning[feature]['unit']
        baseline_stats.loc[feature, 'clinical_significance'] = biological_meaning[feature]['clinical_significance']


# Display enriched baseline
print("=" * 100)
print("ENRICHED BASELINE STATISTICS (Healthy Population Reference)")
print("=" * 100)
print(baseline_stats.head())
print("\n✓ Baseline stats enriched with biological meaning")

Add Z-scores and deviation labels to df_complete

In [None]:
for col in z_scores_all.columns:
    df_complete[f"{col}_zscore"] = z_scores_all[col]


print("✓ Z-scores added to df_complete")
print(f"  New shape: {df_complete.shape}")
print(f"  Z-score columns: {[col for col in df_complete.columns if '_zscore' in col][:3]}...")


df_complete = pd.concat([df_complete, deviation_labels], axis=1)


print("✓ Deviation labels added to df_complete")
print(f"  New shape: {df_complete.shape}")
print(f"  Deviation columns: {[col for col in df_complete.columns if '_deviation' in col][:3]}...")



In [None]:
df_complete.head()

Calculate overall severity score

**Now the question is: “Overall, how sick does this patient look?”**

**This is intensity-based.**

Two patients can be in the same cluster, yet:

*  one is mildly abnormal
*  the other is very abnormal
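A small worked example, with invented labels for two patients assumed to share a cluster, shows how such a score separates them: each label is mapped to a number from 0 to 3 and the per-patient mean is taken across biomarkers. The names `toy_labels` and `toy_severity_map` are hypothetical.

```python
import pandas as pd

# Invented deviation labels for two patients.
toy_labels = pd.DataFrame(
    {
        "Glucose_deviation": ["Mild", "Severe"],
        "Insulin_deviation": ["Normal", "Severe"],
        "HOMA_deviation": ["Normal", "Moderate"],
        "BMI_deviation": ["Mild", "Mild"],
    },
    index=["patient_a", "patient_b"],
)

toy_severity_map = {"Normal": 0, "Mild": 1, "Moderate": 2, "Severe": 3}

# Map each column's labels to numeric scores, then average across biomarkers per patient.
toy_scores = toy_labels.apply(lambda col: col.map(toy_severity_map))
toy_overall = toy_scores.mean(axis=1)
print(toy_overall)
```

Patient A averages (1 + 0 + 0 + 1) / 4 = 0.5 and patient B averages (3 + 3 + 2 + 1) / 4 = 2.25, so the score distinguishes them even though both carry the same cluster label.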

In [None]:
# Create severity mapping
severity_map = {
    "Normal": 0,
    "Mild": 1,
    "Moderate": 2,
    "Severe": 3
}


# Convert deviation labels to numeric scores
severity_scores = deviation_labels.applymap(lambda x: severity_map[x])


# Calculate average severity across all biomarkers for each patient
df_complete['overall_severity'] = severity_scores.mean(axis=1)


print(f"✓ Overall severity score calculated")
print(f"  Range: {df_complete['overall_severity'].min():.2f} - {df_complete['overall_severity'].max():.2f}")
print(f"  Mean: {df_complete['overall_severity'].mean():.2f}")


# Show distribution by cluster
print("\nSeverity by Cluster:")
print(df_complete.groupby('cluster_final')['overall_severity'].agg(['mean', 'std', 'min', 'max']))


# Preview some patients
print("\nExample patients with different severity levels:")
print(df_complete[['cluster_final', 'overall_severity']].head(10))




## Add Biological Interpretations for Each Patient

**We convert numerical biomarker deviations into human-readable biological
interpretations by evaluating deviation magnitude, direction (high/low),
and predefined medical knowledge for each biomarker.**




```text
┌───────────────────────────────────────────────┐
│            INPUT (Per Patient)                │
│                                               │
│  Biomarker Z-score (e.g. Glucose_zscore)      │
│                                               │
│  Example: Z = +2.1                             │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│   Is |Z-score| < 1 ?                           │
│                                               │
│   YES → "Within normal range"                  │
│   NO  → Continue                              │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│        Determine Direction                    │
│                                               │
│   Z-score > 0  → HIGH                         │
│   Z-score < 0  → LOW                          │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│   Lookup Biological Meaning                   │
│                                               │
│   Feature = Glucose / Insulin / HOMA / ...    │
│   Direction = HIGH or LOW                     │
│                                               │
│   Use predefined medical interpretation       │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│        OUTPUT (Human Explanation)              │
│                                               │
│   "High glucose → impaired regulation"        │
│   "Low adiponectin → metabolic risk"           │
│                                               │
│   One sentence per biomarker                  │
└───────────────────────────────────────────────┘
```


In [None]:
def interpret_deviation(feature, z_score):
    """Get biological interpretation for a specific deviation"""
    if abs(z_score) < 1.0:
        return "Within normal range"


    direction = "High" if z_score > 0 else "Low"


    if feature in biological_meaning:
        interpretation = biological_meaning[feature]["interpretation"].get(
            direction,
            "Abnormal value - consult specialist"
        )
        return f"{direction}: {interpretation}"


    return f"{direction} value detected"


# Add interpretation columns for each feature

In [None]:
print("Adding interpretation columns...")
for feature in features:
    df_complete[f"{feature}_interpretation"] = df_complete.apply(
        lambda row: interpret_deviation(feature, row[f"{feature}_zscore"]),
        axis=1
    )


print(f"✓ Biological interpretations added for {len(features)} features")


In [None]:
print("\n" + "-"*80)
print("EXAMPLE: Patient Interpretations")
print("-"*80)
sample_patient_idx = df_cancer_rec.index[0]  # First cancer patient
print(f"\nPatient ID: {sample_patient_idx}")
print(f"Cluster: {df_complete.loc[sample_patient_idx, 'cluster_final']}")
print(f"Overall Severity: {df_complete.loc[sample_patient_idx, 'overall_severity']:.2f}")


print("\nBiomarker Interpretations:")
for feature in ['Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin']:
    value = df_complete.loc[sample_patient_idx, feature]
    zscore = df_complete.loc[sample_patient_idx, f"{feature}_zscore"]
    deviation = df_complete.loc[sample_patient_idx, f"{feature}_deviation"]
    interpretation = df_complete.loc[sample_patient_idx, f"{feature}_interpretation"]


    print(f"\n{feature}:")
    print(f"  Value: {value:.2f} | Z-score: {zscore:.2f} | Severity: {deviation}")
    print(f"  → {interpretation}")

# Profile clusters with biological meaning

In [None]:
# Calculate mean Z-scores for each cluster (cancer patients only)
z_scores_cancer = z_scores_all.loc[df_cancer_rec.index]


cluster_z_profiles = z_scores_cancer.groupby(
    df_complete.loc[df_cancer_rec.index, 'cluster_final']
).mean()


print("Mean Z-Scores by Cluster:")
print(cluster_z_profiles.round(2))


# Detailed cluster profiling
cluster_profiles = []


for cluster_id in sorted(df_cancer_rec['cluster_final'].unique()):
    print("\n" + "="*80)
    print(f"CLUSTER {cluster_id} PROFILE")
    print("="*80)


    # Get patients in this cluster
    cluster_patients = df_complete[df_complete['cluster_final'] == cluster_id]
    n_patients = len(cluster_patients)


    print(f"Number of patients: {n_patients}")
    print(f"Average severity: {cluster_patients['overall_severity'].mean():.2f}")


    # Get mean Z-scores for this cluster
    cluster_z_mean = cluster_z_profiles.loc[cluster_id]


    # Identify top 3 most abnormal features (by absolute Z-score)
    top_features = cluster_z_mean.abs().nlargest(3)


    print(f"\nTop 3 Defining Abnormalities:")
    print("-" * 80)


    profile_data = {
        "cluster_id": int(cluster_id),
        "n_patients": n_patients,
        "avg_severity": float(cluster_patients['overall_severity'].mean()),
        "top_abnormalities": []
    }


    for rank, (feature, abs_z) in enumerate(top_features.items(), 1):
        z_score = cluster_z_mean[feature]
        direction = "Elevated" if z_score > 0 else "Reduced"


        print(f"\n{rank}. {feature} ({direction})")
        print(f"   Z-score: {z_score:.2f} standard deviations from healthy mean")
        print(f"   Meaning: {biological_meaning[feature]['meaning']}")
        print(f"   Mechanism: {biological_meaning[feature]['mechanism']}")
        print(f"   Clinical Significance: {biological_meaning[feature]['clinical_significance']}")


        profile_data["top_abnormalities"].append({
            "feature": feature,
            "z_score": float(z_score),
            "direction": direction,
            "meaning": biological_meaning[feature]['meaning'],
            "mechanism": biological_meaning[feature]['mechanism']
        })


    cluster_profiles.append(profile_data)


# Store cluster profiles for later use
print("\n" + "="*80)
print("✓ Cluster profiles created with biological context")
print(f"  {len(cluster_profiles)} cancer clusters characterized")



**CLUSTER 0: "Metabolically Favorable Cancer Patients" (n=39, severity=0.30)**

What the data shows:
- Slightly elevated glucose (0.71σ) - mild concern
- Lower BMI (-0.70σ) - actually protective
- Lower leptin (-0.55σ) - less body fat
- Nearly normal insulin (0.03σ) and HOMA (0.10σ)
- Normal adiponectin (0.01σ)

Clinical interpretation: this is the "good metabolic health despite cancer" group. These patients have:
- Lower body fat (hence low BMI and leptin)
- Normal insulin sensitivity
- Mild glucose elevation (probably stress-related or an early metabolic change)

Why this is interesting: these cancer patients have better metabolic profiles than expected. This might represent:
- Early-stage cancer patients
- Patients who were metabolically healthy before diagnosis
- Potential for better treatment response

**CLUSTER 1: "Classic Insulin Resistance" (n=22, severity=0.76)**

What the data shows:
- HOMA severely elevated (3.01σ) - severe insulin resistance
- Insulin very high (2.50σ) - compensatory hyperinsulinemia
- Glucose high (2.20σ) - losing glycemic control
- Leptin elevated (0.83σ) - moderate obesity
- Slightly higher BMI (0.52σ)

Clinical interpretation: this is the "metabolic syndrome + cancer" group, with the classic insulin resistance pattern:
- High insulin trying to overcome resistance
- Glucose starting to rise (pre-diabetic range)
- Moderate obesity driving the metabolic dysfunction

This is the target group for aggressive metabolic interventions.

**CLUSTER 2: "Severe Metabolic Crisis" (n=3, severity=1.67)**

What the data shows:
- HOMA extremely elevated (12.86σ) - off the charts
- Glucose severely elevated (10.84σ) - likely diabetic range
- Insulin very high (5.82σ) - body failing to compensate
- Resistin very high (2.86σ) - severe inflammation
- MCP-1 very high (3.39σ) - inflammatory storm

Clinical interpretation: this is the "metabolic emergency" group. There are only 3 patients, but they are in crisis:
- Likely have Type 2 Diabetes
- Severe insulin resistance plus an inflammatory state
- May have the worst cancer prognosis
- Probably older (Age Z=0.79) with longer-standing metabolic dysfunction


#### Knowledge base from actual Coimbra cluster profiles

In [None]:
print("="*80)
print("BUILDING KNOWLEDGE BASE FROM ACTUAL CLUSTER ANALYSIS")
print("="*80)

# Based on the cluster profiles computed above:
cluster_strategies = {

    # ========================================================================
    # CLUSTER 0: Metabolically Favorable Cancer Patients
    # ========================================================================
    0: {
        "name": "Metabolically Favorable Profile",
        # WHY: Low severity (0.30), near-normal insulin markers, lower BMI

        "description": """Patients with relatively preserved metabolic health despite cancer diagnosis.
        Lower BMI and leptin suggest less adipose tissue dysfunction. Mild glucose elevation may be
        stress-related or early metabolic adaptation to cancer.""",

        "key_characteristics": {
            "Glucose": "Mildly elevated (0.71σ) - slight concern",
            "BMI": "Below average (-0.70σ) - protective factor",
            "Leptin": "Below average (-0.55σ) - less adiposity",
            "Insulin_Sensitivity": "Preserved (HOMA 0.10σ, Insulin 0.03σ)"
        },

        "priority_interventions": [
            # Focus: MAINTAIN good metabolic status, address mild glucose elevation

            "Monitor fasting glucose monthly to detect early insulin resistance",
            # WHY: Glucose slightly elevated; want to catch progression early

            "Maintain healthy weight - avoid both weight loss AND weight gain",
            # WHY: BMI is already lower; excessive loss could indicate cachexia

            "Balanced Mediterranean diet with emphasis on anti-cancer foods",
            # WHY: Preserve metabolic health while supporting cancer treatment

            "Moderate exercise: 150 min/week moderate-intensity aerobic + strength 2x/week",
            # WHY: Maintain insulin sensitivity without metabolic stress

            "Stress management (yoga, meditation) to address potential stress-induced hyperglycemia",
            # WHY: Mild glucose elevation may be stress-related in otherwise healthy metabolism

            "NO aggressive metabolic drugs needed at this stage"
            # WHY: Metabolic parameters are near-normal; avoid polypharmacy
        ],

        "monitoring": [
            "Fasting glucose monthly (watch for progression toward Cluster 1 pattern)",
            "HbA1c every 3 months (target <5.7%)",
            "Weight and BMI monthly (avoid cachexia)",
            "Comprehensive metabolic panel every 6 months"
        ],

        "dietary_specifics": [
            "Adequate protein 1.0-1.2 g/kg body weight (prevent muscle loss during cancer treatment)",
            "Complex carbohydrates with low glycemic index",
            "Colorful vegetables 5-7 servings/day (phytonutrients for cancer)",
            "Healthy fats from fish, nuts, olive oil",
            "Maintain caloric intake - avoid unintentional weight loss"
        ],

        "supplement_recommendations": [
            "Vitamin D3 2000 IU/day if deficient",
            "Omega-3 (EPA+DHA) 1-2g/day (anti-inflammatory, cardioprotective during chemo)",
            "Multivitamin if dietary gaps exist",
            "NO intensive metabolic supplements needed"
        ],

        "prognosis_note": "Best metabolic prognosis group. Focus on maintaining health status."
    },

    # ========================================================================
    # CLUSTER 1: Classic Insulin Resistance
    # ========================================================================
    1: {
        "name": "Insulin Resistance Dominant",
        # WHY: HOMA 3.01σ, Insulin 2.50σ, Glucose 2.20σ - classic triad

        "description": """Patients with established insulin resistance and compensatory hyperinsulinemia.
        This is the metabolic syndrome phenotype: moderate obesity, high insulin, rising glucose.
        Strong evidence that insulin-sensitizing interventions can improve cancer outcomes in this group.""",

        "key_characteristics": {
            "HOMA": "Severely elevated (3.01σ) - established insulin resistance",
            "Insulin": "Very high (2.50σ) - compensatory hyperinsulinemia",
            "Glucose": "High (2.20σ) - pre-diabetic range likely",
            "BMI": "Elevated (0.52σ) - overweight/obese",
            "Leptin": "Elevated (0.83σ) - leptin resistance from adiposity"
        },

        "priority_interventions": [
            # Focus: AGGRESSIVE insulin sensitization

            "Metformin 500mg BID, titrate to 1000mg BID over 2 weeks (REQUIRES ONCOLOGIST APPROVAL)",
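            # WHY: First-line insulin sensitizer; note that evidence for an anti-cancer benefit is mixed
            # EVIDENCE: The randomized MA.32 trial (Goodwin PJ et al. JAMA. 2022) found no improvement
            # in invasive disease-free survival with adjuvant metformin in patients without diabetes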

            "Target 7-10% body weight reduction over 6 months",
            # WHY: Weight loss improves insulin sensitivity by ~30-40%
            # EVIDENCE: Wing RR et al. Diabetes Care. 2011;34(7):1481-1486

            "Low glycemic index diet (<100g carbs/day initially)",
            # WHY: Reduces glucose spikes that worsen insulin resistance
            # EVIDENCE: Esposito K et al. Diabetologia. 2015;58(4):773-780

            "Exercise prescription: 3x/week resistance training + 150min/week brisk walking",
            # WHY: Muscle contractions increase GLUT4 translocation (insulin-independent glucose uptake)
            # EVIDENCE: Strasser B et al. Diabetes Care. 2013;36(4):872-877

            "Consider SGLT2 inhibitor (e.g., Empagliflozin 10mg/day) if glucose >120 mg/dL persistently",
            # WHY: Forces glucose excretion through urine, lowers insulin demand
            # EVIDENCE: Deng L et al. Cancer Res. 2021;81(13):3480-3493

            "Intermittent fasting 16:8 protocol (if tolerated during cancer treatment)",
            # WHY: Improves insulin sensitivity and may enhance chemotherapy efficacy
            # EVIDENCE: de Groot S et al. BMC Cancer. 2015;15:652
        ],

        "monitoring": [
            "HbA1c monthly (target <6.0%, ideally <5.7%)",
            "Fasting insulin every 2 months (target <10 µU/mL)",
            "HOMA-IR monthly (target <2.0)",
            "Fasting glucose weekly initially, then biweekly (target <100 mg/dL)",
            "Lipid panel quarterly (insulin resistance worsens dyslipidemia)",
            "Liver function tests if on metformin (every 3 months)"
        ],

        "dietary_specifics": [
            "Carbohydrates: <30% of calories (or <100g/day)",
            "Focus on: non-starchy vegetables, lean proteins, healthy fats",
            "AVOID: white bread, pasta, rice, potatoes, sugary foods, fruit juice",
            "Meal timing: Concentrate carbs around exercise",
            "Protein: 1.2-1.5 g/kg body weight (preserve muscle during weight loss)"
        ],

        "supplement_recommendations": [
            "Berberine 500mg 3x/day (insulin sensitizer, comparable to metformin)",
            # EVIDENCE: Yin J et al. Metabolism. 2008;57(5):712-717

            "Chromium picolinate 200-400 mcg/day (enhances insulin signaling)",
            # EVIDENCE: Kleefstra N et al. Diabetes Care. 2006;29(8):1826-1832

            "Alpha-lipoic acid 600mg/day (improves glucose metabolism)",
            # EVIDENCE: Ziegler D et al. Diabetes Care. 2004;27(10):2365-2371

            "Inositol 2-4g/day (improves insulin sensitivity)",
            # EVIDENCE: Pintaudi B et al. Int J Endocrinol. 2016;2016:9132052

            "Magnesium 400mg/day (cofactor for insulin signaling)",
            # EVIDENCE: Rodríguez-Morán M et al. Diabetes Care. 2003;26(4):1147-1152
        ],

        "prognosis_note": "Moderate metabolic risk. Aggressive intervention can normalize insulin sensitivity."
    },

    # ========================================================================
    # CLUSTER 2: Severe Metabolic Crisis
    # ========================================================================
    2: {
        "name": "Severe Metabolic Dysfunction with Inflammation",
        # WHY: HOMA 12.86σ (!), Glucose 10.84σ, Resistin 2.86σ, MCP-1 3.39σ

        "description": """CRITICAL: Only 3 patients but with extreme metabolic derangement.
        Likely undiagnosed or poorly controlled Type 2 Diabetes with severe insulin resistance.
        High inflammatory markers (Resistin, MCP-1) suggest chronic systemic inflammation.
        These patients need URGENT endocrinology referral and aggressive intervention.""",

        "key_characteristics": {
            "HOMA": "EXTREME elevation (12.86σ) - severe insulin resistance",
            "Glucose": "EXTREME elevation (10.84σ) - likely diabetic range (>126 mg/dL fasting)",
            "Insulin": "Very high (5.82σ) - pancreatic exhaustion imminent",
            "Resistin": "Very high (2.86σ) - severe inflammation",
            "MCP-1": "Very high (3.39σ) - inflammatory storm",
            "Age": "Older (0.79σ) - chronic metabolic disease"
        },

        "priority_interventions": [
            # Focus: URGENT medical stabilization + aggressive metabolic correction

            "⚠️ IMMEDIATE ENDOCRINOLOGY REFERRAL - likely need diabetes diagnosis and management",
            # WHY: Glucose Z-score of 10.84 suggests fasting glucose likely >150 mg/dL

            "⚠️ Check HbA1c IMMEDIATELY - likely >7.0% (diabetic range)",
            # WHY: Need to confirm diabetes diagnosis and assess chronic glycemic control

            "Metformin 1000mg BID PLUS second-line agent (GLP-1 agonist preferred)",
            # WHY: Single agent insufficient for this level of dysfunction
            # EVIDENCE: GLP-1 agonists reduce CV risk and may have anti-cancer effects
            # Davies MJ et al. N Engl J Med. 2017;377(13):1228-1239

            "Consider insulin therapy if glucose >200 mg/dL or HbA1c >9%",
            # WHY: May have pancreatic beta-cell failure; need immediate glycemic control

            "AGGRESSIVE anti-inflammatory protocol:",
            "  - Omega-3 fatty acids 4g/day (high dose)",
            "  - Curcumin 1500mg/day with piperine",
            "  - Low-dose aspirin 81mg daily (if no contraindications)",
            # WHY: Resistin and MCP-1 extremely elevated; need to dampen inflammatory cascade

            "STRICT low-carb diet (<50g carbs/day initially) - essentially ketogenic approach",
            # WHY: Need immediate glucose reduction; standard low-GI diet insufficient

            "Daily blood glucose monitoring (4x/day: fasting, post-meals, bedtime)",
            # WHY: Need tight monitoring to prevent hyperglycemic crisis

            "Assess for diabetic complications: retinopathy, nephropathy, neuropathy",
            # WHY: With this level of hyperglycemia, may already have end-organ damage
        ],

        "monitoring": [
            "⚠️ URGENT: HbA1c immediately, then monthly until <7.0%",
            "Blood glucose 4x/day with log",
            "Fasting insulin biweekly (assess pancreatic reserve)",
            "CRP weekly (monitor inflammation)",
            "Comprehensive metabolic panel biweekly (watch kidney function - metformin contraindicated if eGFR <30)",
            "Lipid panel monthly",
            "Urinalysis for proteinuria (diabetic nephropathy screening)",
            "Ophthalmology referral for diabetic retinopathy screening"
        ],

        "dietary_specifics": [
            "STRICT carbohydrate restriction: <50g/day (ketogenic approach)",
            "Focus: leafy greens, cruciferous vegetables, lean proteins, healthy fats",
            "ELIMINATE: all refined carbs, sugar, fruit (except berries in small amounts)",
            "Meal timing: 3 meals/day, no snacking (avoid insulin spikes)",
            "Consider medical nutrition therapy with registered dietitian",
            "Adequate hydration (2-3L/day) - prevent diabetic complications"
        ],

        "supplement_recommendations": [
            "ALL supplements from Cluster 1 PLUS:",
            "High-dose omega-3: EPA+DHA 4g/day (anti-inflammatory)",
            "Curcumin 1500mg/day with piperine (NF-κB inhibition)",
            "Vitamin D3 4000-5000 IU/day (immune modulation)",
            "Coenzyme Q10 200mg/day (mitochondrial support)",
            "Probiotics high-CFU strain (gut-inflammation axis)"
        ],

        "prognosis_note": "⚠️ HIGHEST METABOLIC RISK GROUP. Require urgent medical intervention. Worst cancer outcomes predicted if metabolic dysfunction not corrected."
    },

    # ========================================================================
    # CLUSTER 3: Healthy Control Group
    # ========================================================================
    3: {
        "name": "Healthy Reference Group",
        "description": "Patients without breast cancer diagnosis. Metabolic parameters within normal range.",

        "priority_interventions": [
            "Continue current healthy lifestyle habits",
            "Annual breast cancer screening per guidelines (mammography)",
            "Annual metabolic screening to detect early changes",
            "Maintain healthy weight (BMI 18.5-24.9)",
            "Regular exercise per CDC guidelines (150min/week moderate intensity)"
        ],

        "monitoring": [
            "Annual comprehensive metabolic panel",
            "Fasting glucose and lipids annually",
            "BMI and waist circumference at each medical visit",
            "Mammography per age-appropriate guidelines"
        ],

        "dietary_specifics": [
            "Balanced Mediterranean-style diet",
            "5-7 servings fruits and vegetables daily",
            "Limit processed foods and added sugars",
            "Moderate alcohol consumption (if any)"
        ],

        "supplement_recommendations": [
            "Standard multivitamin if dietary gaps exist",
            "Vitamin D if deficient (<30 ng/mL)",
            "Calcium if dietary intake insufficient"
        ],

        "prognosis_note": "Low metabolic risk. Focus on prevention and maintenance."
    }
}

print("✓ Knowledge base created from actual cluster analysis")
print("  Cluster 0 (n=39): Metabolically Favorable - MAINTAIN health")
print("  Cluster 1 (n=22): Insulin Resistance - AGGRESSIVE intervention")
print("  Cluster 2 (n=3):  Metabolic Crisis - ⚠️ URGENT referral")
print("  Cluster 3:        Healthy Controls - PREVENTION focus")
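Since the recommendation code later indexes each cluster entry directly by key, a quick structural check can catch a missing key before it surfaces as a `KeyError`. The following is a minimal sketch; `REQUIRED_KEYS`, `find_missing_keys`, and `demo_strategies` are hypothetical names introduced for illustration, and the demo entries are placeholders rather than the real knowledge base.

```python
# Keys that generate_recommendations reads with direct indexing (assumed from the class code).
REQUIRED_KEYS = {
    "name",
    "description",
    "priority_interventions",
    "monitoring",
    "dietary_specifics",
}


def find_missing_keys(strategies):
    """Map each cluster id to the sorted list of required keys it lacks."""
    problems = {}
    for cluster_id, entry in strategies.items():
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems[cluster_id] = missing
    return problems


# Placeholder entries, not the real knowledge base.
demo_strategies = {
    0: {
        "name": "A",
        "description": "placeholder",
        "priority_interventions": [],
        "monitoring": [],
        "dietary_specifics": [],
    },
    1: {"name": "B", "description": "placeholder"},
}

print(find_missing_keys(demo_strategies))
# {1: ['dietary_specifics', 'monitoring', 'priority_interventions']}
```

An empty result means every cluster entry carries the keys that are accessed without a default.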



### Building the Recommendation System

**Methods Applied to NEW Patients**

**1. Calculate Z-Scores**

Each Z-score expresses a biomarker's deviation from the healthy reference mean, in units of the healthy standard deviation.

In [None]:

def calculate_z_scores(self, patient_data):
    """
    Compute Z-scores for each biomarker relative to healthy reference values.

    INPUT: patient_data (pandas Series)
    OUTPUT: pandas Series of Z-scores per feature
    """
    z_scores = pd.Series(index=self.features, dtype=float)

    for feature in self.features:
        mean = self.baseline_stats.loc[feature, 'mean']
        std = self.baseline_stats.loc[feature, 'std']
        z_scores[feature] = (patient_data[feature] - mean) / std

    return z_scores
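As a worked illustration of the formula z = (value - mean) / std, here is a dependency-free sketch. The reference statistics and patient values are hypothetical round numbers chosen to make the arithmetic easy to follow; they are not taken from the dataset.

```python
# Hypothetical healthy-reference statistics (illustrative values only).
baseline_stats = {
    "Glucose": {"mean": 90.0, "std": 10.0},
    "BMI": {"mean": 25.0, "std": 5.0},
}


def z_score(value, mean, std):
    """Return how many reference standard deviations `value` lies from `mean`."""
    if std == 0:
        raise ValueError("standard deviation must be non-zero")
    return (value - mean) / std


# Hypothetical patient measurements.
patient = {"Glucose": 110.0, "BMI": 20.0}

z_scores = {
    name: z_score(patient[name], stats["mean"], stats["std"])
    for name, stats in baseline_stats.items()
}
print(z_scores)  # {'Glucose': 2.0, 'BMI': -1.0}
```

A positive score means the value is above the healthy mean, a negative score below it. The explicit zero check is an addition in this sketch; the notebook method divides without it.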


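**2. Label Deviation Severity**

The cell below sets up the class constructor. The severity thresholds themselves are applied by `label_deviations`, which appears in the full class under Main Recommendation Generation.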

In [None]:
class HybridMetabolicRecommendationSystem:

    def __init__(self, baseline_stats, cluster_strategies, biological_meaning,
                 kmeans_model, scaler):

        self.baseline_stats = baseline_stats
        self.cluster_strategies = cluster_strategies
        self.biological_meaning = biological_meaning
        self.kmeans_model = kmeans_model
        self.scaler = scaler
        self.features = baseline_stats.index.tolist()
        # ^ Extract feature names (Age, BMI, Glucose, etc.)

        print("✓ Recommendation system initialized")
        print(f"  Baseline stats loaded for {len(self.features)} features")
        print(f"  Knowledge base contains {len(cluster_strategies)} cluster strategies")
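The labelling rule maps the absolute Z-score to one of four bands, with cut-offs at 1.0, 1.5, and 2.5 as used by `label_deviations` in the full class. A compact standalone sketch of the same rule follows; `label_deviation`, `THRESHOLDS`, and `LABELS` are hypothetical names, and using `bisect` is a stylistic choice of this example rather than the notebook's approach.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of the Normal, Mild, and Moderate bands.
THRESHOLDS = [1.0, 1.5, 2.5]
LABELS = ["Normal", "Mild", "Moderate", "Severe"]


def label_deviation(z):
    """Return the severity label for a single Z-score, ignoring its sign."""
    return LABELS[bisect_right(THRESHOLDS, abs(z))]


for z in (0.3, -1.2, 2.0, -3.4):
    print(z, label_deviation(z))
# 0.3 Normal
# -1.2 Mild
# 2.0 Moderate
# -3.4 Severe
```

Values exactly on a cut-off fall into the higher band (for example, |z| = 1.0 is labelled Mild), which matches the strict `<` comparisons in the class.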


**3. Assign a Cluster to a New Patient (K-means)**

In [None]:
def assign_cluster(patient_data, is_cancer, scaler, kmeans_model, features):
    """
    Assign a metabolic cluster to a new patient.

    INPUT:
    - patient_data: pandas Series with biomarker values
    - is_cancer: Boolean (True if patient has breast cancer)
    - scaler: StandardScaler fitted on training data
    - kmeans_model: Trained KMeans model (trained only on cancer patients)
    - features: list of features used in training

    OUTPUT:
    - cluster: Integer
        0,1,2 -> cancer clusters
        3     -> healthy patient
    """

    # Step 1: Check if patient has cancer
    if not is_cancer:
        return 3  # Healthy patient, skip KMeans

    # Step 2: Extract features in same order as training
    patient_values = patient_data[features].values.reshape(1, -1)

    # Step 3: Scale patient data using training scaler
    patient_scaled = scaler.transform(patient_values)

    # Step 4: Predict cluster using trained KMeans model
    cluster = kmeans_model.predict(patient_scaled)[0]

    return int(cluster)
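Conceptually, the two library calls above standardize the patient's values with the training means and standard deviations, then pick the centroid with the smallest Euclidean distance. The sketch below reproduces that logic without scikit-learn. The two-feature `means`, `stds`, and `centroids` are hypothetical numbers for illustration, not the fitted model.

```python
import math

# Hypothetical training statistics for two features (illustrative only).
means = [90.0, 25.0]
stds = [10.0, 5.0]

# Hypothetical centroids in standardized space, one per cancer cluster.
centroids = [
    [0.0, -0.5],   # cluster 0
    [2.0, 0.5],    # cluster 1
    [10.0, 1.0],   # cluster 2
]


def standardize(values):
    """Apply (x - mean) / std feature by feature."""
    return [(v - m) / s for v, m, s in zip(values, means, stds)]


def nearest_centroid(point):
    """Return the index of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def assign_cluster_sketch(values, is_cancer):
    """Healthy patients go to cluster 3; others to the nearest centroid."""
    if not is_cancer:
        return 3
    return nearest_centroid(standardize(values))


print(assign_cluster_sketch([112.0, 28.0], is_cancer=True))   # 1
print(assign_cluster_sketch([112.0, 28.0], is_cancer=False))  # 3
```

The feature order must match the order used during training, which is why the notebook function indexes `patient_data[features]` before scaling.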


**4. Calculate Overall Severity Score**

Average the per-biomarker severity labels into a single overall score for the patient.

In [None]:
def calculate_severity_score(self, deviation_labels):
    """
    Aggregate severity across all biomarkers into a single score.

    INPUT:
    - deviation_labels: pandas Series of "Normal"/"Mild"/"Moderate"/"Severe"

    OUTPUT:
    - severity_score: float between 0.0 (all normal) and 3.0 (all severe)
    """
    severity_map = {
        "Normal": 0,
        "Mild": 1,
        "Moderate": 2,
        "Severe": 3
    }

    # Convert labels to numeric, ignoring unknown labels
    numeric_severities = deviation_labels.map(severity_map).dropna()

    # Compute average severity
    severity_score = numeric_severities.mean()

    return float(severity_score)
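A small worked example may help show how the average behaves. This sketch uses plain lists instead of a pandas Series; `severity_score` is a hypothetical helper, and returning 0.0 for an empty input is an assumption of this example (the notebook method would yield NaN in that case).

```python
SEVERITY_MAP = {"Normal": 0, "Mild": 1, "Moderate": 2, "Severe": 3}


def severity_score(labels):
    """Average the numeric severities, skipping labels that are not in the map."""
    known = [SEVERITY_MAP[label] for label in labels if label in SEVERITY_MAP]
    if not known:
        return 0.0  # assumption: treat "no usable labels" as no deviation
    return sum(known) / len(known)


mixed = ["Normal", "Normal", "Mild", "Severe"]
print(severity_score(mixed))  # (0 + 0 + 1 + 3) / 4 = 1.0

# One Severe marker among nine biomarkers is diluted by the average.
one_severe = ["Severe"] + ["Normal"] * 8
print(round(severity_score(one_severe), 3))  # 0.333
```

Because the score is a mean, a single Severe biomarker among nine yields about 0.33, which the class's interpretation bands would read as "Low Risk/Well-Controlled". It may be worth reviewing the per-biomarker labels alongside the aggregate score.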


**5. Generate Feature-Specific Recommendations**

Return a list of specific interventions for a single biomarker.

Recommendations are meant to be actionable, so they are generated only for significant abnormalities (Moderate or Severe deviations).

In [None]:
def get_feature_recommendations(self, feature, z_score, deviation):
    """
    Generate biomarker-specific recommendations based on deviation.

    INPUT:
    - feature: biomarker name, e.g. "Insulin"
    - z_score: float, deviation from healthy mean
    - deviation: "Normal", "Mild", "Moderate", "Severe"

    OUTPUT:
    - recommendations: list of strings with actionable advice
    """
    recommendations = []

    # Only provide recommendations for moderate/severe deviations
    if deviation not in ["Moderate", "Severe"]:
        return recommendations

    # Determine direction
    # Positive Z-score = above healthy mean (High)
    # Negative Z-score = below healthy mean (Low)
    direction = "High" if z_score > 0 else "Low"


    # Look up recommendations from knowledge base
    if feature in self.biological_meaning:
        interpretation = self.biological_meaning[feature]["interpretation"].get(
            direction,
            "Abnormal value detected - consult specialist"
        )
        recommendations.append(interpretation)

    return recommendations
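To make the lookup behaviour concrete, here is a standalone sketch with a one-entry knowledge base. The `biological_meaning` content below is a hypothetical placeholder (the real dictionary is defined elsewhere in the notebook), and `feature_recommendations` is an illustrative free function rather than the class method.

```python
# Hypothetical knowledge-base entry; only the "High" direction is filled in.
biological_meaning = {
    "Insulin": {
        "interpretation": {
            "High": "Possible hyperinsulinemia - review insulin resistance markers",
        }
    }
}

DEFAULT_MESSAGE = "Abnormal value detected - consult specialist"


def feature_recommendations(feature, z_score, deviation):
    """Return advice for one biomarker, or an empty list if none applies."""
    if deviation not in ("Moderate", "Severe"):
        return []
    direction = "High" if z_score > 0 else "Low"
    entry = biological_meaning.get(feature)
    if entry is None:
        return []
    return [entry["interpretation"].get(direction, DEFAULT_MESSAGE)]


print(feature_recommendations("Insulin", 2.1, "Moderate"))   # uses the "High" text
print(feature_recommendations("Insulin", -2.1, "Moderate"))  # falls back to the default
print(feature_recommendations("Insulin", 1.2, "Mild"))       # [] (below the cut-off)
print(feature_recommendations("Unknown", 3.0, "Severe"))     # [] (no entry)
```

The fallback message covers a known biomarker with no text for the observed direction, while an unknown biomarker produces no recommendation at all.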


## **Main Recommendation Generation**

In [None]:
class HybridMetabolicRecommendationSystem:

    def __init__(self, baseline_stats, cluster_strategies, biological_meaning,
                 kmeans_model, scaler):
        self.baseline_stats = baseline_stats
        self.cluster_strategies = cluster_strategies
        self.biological_meaning = biological_meaning
        self.kmeans_model = kmeans_model
        self.scaler = scaler
        self.features = baseline_stats.index.tolist()

        print("✓ Recommendation system initialized")
        print(f"  Baseline stats loaded for {len(self.features)} features")
        print(f"  Knowledge base contains {len(cluster_strategies)} cluster strategies")

    def calculate_z_scores(self, patient_data):
        z_scores = pd.Series(index=self.features, dtype=float)

        for feature in self.features:
            mean = self.baseline_stats.loc[feature, 'mean']
            std = self.baseline_stats.loc[feature, 'std']
            z_scores[feature] = (patient_data[feature] - mean) / std

        return z_scores

    def label_deviations(self, z_scores):
        labels = pd.Series(index=z_scores.index, dtype=str)

        for feature in z_scores.index:
            abs_z = abs(z_scores[feature])

            if abs_z < 1.0:
                labels[feature] = "Normal"
            elif abs_z < 1.5:
                labels[feature] = "Mild"
            elif abs_z < 2.5:
                labels[feature] = "Moderate"
            else:
                labels[feature] = "Severe"

        return labels

    def assign_cluster(self, patient_data, is_cancer=True):
        if not is_cancer:
            return 3
        patient_values = patient_data[self.features].values.reshape(1, -1)
        patient_scaled = self.scaler.transform(patient_values)
        cluster = self.kmeans_model.predict(patient_scaled)[0]
        return int(cluster)

    def calculate_severity_score(self, deviation_labels):
        severity_map = {"Normal": 0, "Mild": 1, "Moderate": 2, "Severe": 3}
        numeric_severities = deviation_labels.map(severity_map).dropna()
        return float(numeric_severities.mean())

    def get_feature_recommendations(self, feature, z_score, deviation):
        recommendations = []
        if deviation not in ["Moderate", "Severe"]:
            return recommendations
        direction = "High" if z_score > 0 else "Low"
        if feature in self.biological_meaning:
            interpretation = self.biological_meaning[feature]["interpretation"].get(
                direction,
                "Abnormal value detected - consult specialist"
            )
            recommendations.append(interpretation)
        return recommendations

    def _interpret_severity(self, severity_score):
        if severity_score < 0.5:
            return "Low Risk/Well-Controlled"
        elif severity_score < 1.0:
            return "Moderate Risk/Needs Attention"
        elif severity_score < 2.0:
            return "High Risk/Urgent Intervention"
        else:
            return "Critical Risk/Emergency"

    def generate_recommendations(self, patient_data, is_cancer=True):
        z_scores = self.calculate_z_scores(patient_data)
        deviations = self.label_deviations(z_scores)
        severity_score = self.calculate_severity_score(deviations)
        cluster = self.assign_cluster(patient_data, is_cancer)
        cluster_info = self.cluster_strategies[cluster]

        recommendations = list(cluster_info["priority_interventions"])
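        # Collapse the indentation and line breaks of the triple-quoted description
        description = " ".join(cluster_info["description"].split())
        explanations = [f"Patient assigned to '{cluster_info['name']}' based on metabolic profile. Primary strategy: {description}"]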

        for feature in self.features:
            z = z_scores[feature]
            deviation = deviations[feature]
            if deviation in ["Moderate", "Severe"]:
                feature_recs = self.get_feature_recommendations(feature, z, deviation)
                recommendations.extend(feature_recs)
                direction = "elevated" if z > 0 else "reduced"
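                # Use .get so a biomarker missing from the knowledge base does not raise KeyError
                significance = self.biological_meaning.get(feature, {}).get("clinical_significance", "")
                explanation = (
                    f"{feature}: {patient_data[feature]:.2f} "
                    f"({deviation}, {z:.2f}σ {direction}) - "
                    f"{significance}"
                )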
                explanations.append(explanation)

        # Remove duplicates while keeping priority order (set() would shuffle the list)
        recommendations = list(dict.fromkeys(recommendations))
        output = {
            "patient_profile": {
                "cluster": int(cluster),
                "cluster_name": cluster_info["name"],
                "severity_score": float(severity_score),
                "cancer_status": "Diagnosed" if is_cancer else "Healthy",
                "interpretation": self._interpret_severity(severity_score)
            },
            "biomarker_analysis": {
                feature: {
                    "value": float(patient_data[feature]),
                    "z_score": float(z_scores[feature]),
                    "deviation": deviations[feature],
                    "interpretation": self.biological_meaning.get(feature, {}).get("meaning", ""),
                    "clinical_significance": self.biological_meaning.get(feature, {}).get("clinical_significance", "")
                } for feature in self.features
            },
            "recommendations": {
                "priority_interventions": recommendations,
                "monitoring_plan": cluster_info["monitoring"],
                "dietary_guidance": cluster_info["dietary_specifics"],
                "supplement_protocol": cluster_info.get("supplement_recommendations", [])
            },
            "explanations": explanations,
            "clinical_notes": {
                "prognosis": cluster_info.get("prognosis_note", ""),
                "cluster_description": cluster_info["description"]
            }
        }
        return output
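Because `generate_recommendations` merges cluster-level and biomarker-level items, and later steps pass only the first five to the LLM prompt, the order of the merged list matters. The sketch below shows order-preserving de-duplication with `dict.fromkeys`; the recommendation strings are hypothetical placeholders.

```python
# Hypothetical recommendation strings, listed in priority order.
cluster_level = ["Refer to endocrinology", "Check HbA1c", "Start low-carb diet"]
feature_level = ["Check HbA1c", "Review inflammatory markers"]

# dict keys are unique and remember insertion order, so the first occurrence wins.
merged = list(dict.fromkeys(cluster_level + feature_level))

print(merged)
# ['Refer to endocrinology', 'Check HbA1c', 'Start low-carb diet', 'Review inflammatory markers']
```

A plain `set` also removes duplicates but gives no ordering guarantee, so an urgent referral could end up anywhere in the list, and multi-line items such as a header followed by its sub-bullets could be separated.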

## LLM

In [None]:
!pip install groq

In [None]:
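import os

# Read the key from the environment instead of hard-coding it in the notebook
GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")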

In [None]:
def format_biomarkers_for_doctor(biomarker_analysis):
    """Format biomarker data for doctor prompt (technical)"""
    lines = []
    for feature, data in biomarker_analysis.items():
        if data['deviation'] in ['Moderate', 'Severe']:
            lines.append(
                f"- {feature}: {data['value']:.2f} "
                f"(Z-score: {data['z_score']:.2f}, {data['deviation']}) "
                f"- {data['clinical_significance']}"
            )
    return "\n".join(lines) if lines else "All biomarkers within normal range"

In [None]:
def format_biomarkers_for_patient(biomarker_analysis):
    """Format biomarker data for patient prompt (simple language)"""
    lines = []

    # Map medical terms to patient-friendly terms
    friendly_names = {
        'Glucose': 'Blood Sugar',
        'Insulin': 'Insulin Hormone',
        'HOMA': 'Insulin Resistance Score',
        'BMI': 'Body Weight Index',
        'Age': 'Age',
        'Leptin': 'Appetite Hormone',
        'Adiponectin': 'Protective Hormone',
        'Resistin': 'Inflammation Marker (Resistin)',
        'MCP.1': 'Inflammation Marker (MCP-1)'
    }

    for feature, data in biomarker_analysis.items():
        if data['deviation'] in ['Moderate', 'Severe']:
            friendly = friendly_names.get(feature, feature)
            direction = "high" if data['z_score'] > 0 else "low"
            lines.append(f"- Your {friendly} is {direction} ({data['deviation']} concern)")

    return "\n".join(lines) if lines else "All your test results look good!"

In [None]:
# Doctor Prompt (Technical & Detailed)
DOCTOR_PROMPT_TEMPLATE = """You are an expert medical AI assistant providing clinical recommendations to a healthcare professional.

**Patient Profile:**
- Cluster: {cluster_name}
- Cancer Status: {cancer_status}
- Severity Score: {severity_score}/3.0 ({interpretation})

**Biomarker Analysis:**
{biomarker_summary}

**Evidence-Based Recommendations:**
{raw_recommendations}

**Task:** Rewrite these recommendations as a professional clinical report for a doctor. Include:

1. **Clinical Summary**: Synthesize the metabolic status in 2-3 sentences
2. **Risk Stratification**: Explain why this patient is in this cluster
3. **Intervention Priorities**: List top 3-5 interventions with rationale
4. **Monitoring Protocol**: Specify tests and frequency
5. **Red Flags**: Any urgent concerns requiring immediate attention

Use medical terminology. Be precise, evidence-based, and actionable. Format as a structured clinical note."""


# Patient Prompt (Simple & Empathetic)
PATIENT_PROMPT_TEMPLATE = """You are a compassionate medical AI assistant explaining health recommendations to a patient in simple, clear terms.

**Your Health Profile:**
- Health Category: {cluster_name}
- Cancer Diagnosis: {cancer_status}
- Overall Health Score: {severity_score}/3.0 ({interpretation})

**Your Test Results:**
{biomarker_summary_simple}

**Medical Recommendations:**
{raw_recommendations}

**Task:** Rewrite these recommendations in patient-friendly language:

1. **What This Means**: Explain the test results in simple terms (avoid medical jargon)
2. **Why It Matters**: Help them understand how this affects their health and cancer treatment
3. **What You Can Do**: Give clear, specific action steps they can start today
4. **Encouragement**: End with supportive, motivating words

IMPORTANT:
- Use simple language (8th grade reading level)
- Be empathetic and supportive
- Avoid scary or alarming words
- Focus on what they CAN control
- Make it feel personal and caring"""
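Both templates are filled with `str.format` and named placeholders. The sketch below uses a hypothetical shortened template (`DEMO_TEMPLATE`) to show how a format spec inside the placeholder keeps the severity score to two decimals. The notebook templates use a bare `{severity_score}`, so this is one possible variation rather than the existing behaviour.

```python
# Hypothetical shortened template; ":.2f" rounds the score when it is inserted.
DEMO_TEMPLATE = (
    "Cluster: {cluster_name}\n"
    "Severity Score: {severity_score:.2f}/3.0 ({interpretation})"
)

filled = DEMO_TEMPLATE.format(
    cluster_name="Insulin Resistance Dominant",
    severity_score=1.4444,
    interpretation="High Risk/Urgent Intervention",
)
print(filled)
# Cluster: Insulin Resistance Dominant
# Severity Score: 1.44/3.0 (High Risk/Urgent Intervention)
```

Literal braces in a template must be doubled (`{{` and `}}`) when using `str.format`, which is worth remembering if JSON examples are ever added to a prompt.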

In [None]:
from groq import Groq

def generate_llm_interpretation(prompt, max_tokens=300):
    """
    Generate an interpretation with the Groq chat-completions API.
    Returns the generated text, or a string starting with "[Error" if the call fails.
    """
    try:
        client = Groq(api_key=GROQ_API_KEY)

        # Truncate if needed (Groq has generous limits)
        if len(prompt) > 3000:
            prompt = prompt[:3000] + "\n\n[Content truncated due to length. Summarize key points.]"

        print(f"🔄 Generating with Groq (Llama 3.1)...")

        response = client.chat.completions.create(
            model="llama-3.1-8b-instant",  # Fast, free, excellent quality
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert medical AI assistant providing clear, accurate clinical recommendations."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            max_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9
        )

        generated_text = response.choices[0].message.content
        print(f"✅ Generated {len(generated_text)} characters")

        return generated_text

    except Exception as e:
        error_msg = str(e)
        print(f"❌ Groq Error: {error_msg}")

        # Provide helpful error messages
        if "api_key" in error_msg.lower():
            return "[Error: Invalid API key. Get a free key at https://console.groq.com/keys]"
        elif "rate limit" in error_msg.lower():
            return "[Error: Rate limit exceeded. Groq free tier: 30 requests/min. Please wait.]"
        else:
            return f"[Error: {error_msg[:200]}]"

In [None]:
def generate_llm_interpretations(recsys_instance, recommendations_dict, target_audience="doctor"):
    """
    Generate human-like interpretation - now using Groq!
    """
    profile = recommendations_dict['patient_profile']
    biomarkers = recommendations_dict['biomarker_analysis']
    raw_recs = recommendations_dict['recommendations']['priority_interventions']

    # Build prompt
    if target_audience == "doctor":
        prompt = DOCTOR_PROMPT_TEMPLATE.format(
            cluster_name=profile['cluster_name'],
            cancer_status=profile['cancer_status'],
            severity_score=f"{profile['severity_score']:.2f}",
            interpretation=profile['interpretation'],
            biomarker_summary=format_biomarkers_for_doctor(biomarkers),
            raw_recommendations="\n".join([f"- {r}" for r in raw_recs[:5]])
        )
    else:
        prompt = PATIENT_PROMPT_TEMPLATE.format(
            cluster_name=profile['cluster_name'],
            cancer_status=profile['cancer_status'],
            severity_score=f"{profile['severity_score']:.2f}",
            interpretation=profile['interpretation'],
            biomarker_summary_simple=format_biomarkers_for_patient(biomarkers),
            raw_recommendations="\n".join([f"- {r}" for r in raw_recs[:5]])
        )

    print(f"\n{'='*60}")
    print(f"🎯 Generating {target_audience.upper()} interpretation")
    print(f"📝 Prompt length: {len(prompt)} characters")
    print(f"{'='*60}\n")

    # Generate
    llm_output = generate_llm_interpretation(prompt)

    # Record the failure BEFORE the fallback text replaces the error string;
    # otherwise generation_status below would always read 'success'
    llm_failed = llm_output.startswith("[Error")
    if llm_failed:
        print("⚠️ LLM generation failed, using structured fallback...")
        llm_output = f"""
**Automated Clinical Summary** (LLM temporarily unavailable)

**Patient Profile:**
- Classification: {profile['cluster_name']}
- Severity Score: {profile['severity_score']:.2f}/3.0 ({profile['interpretation']})
- Cancer Status: {profile['cancer_status']}

**Priority Interventions:**
{chr(10).join([f"{i+1}. {r}" for i, r in enumerate(raw_recs[:5])])}

**Monitoring Plan:**
{chr(10).join([f"• {m}" for m in recommendations_dict['recommendations']['monitoring_plan'][:3]])}

*Please check API configuration and try again.*
"""

    # Note: this mutates and returns the same dictionary that was passed in
    recommendations_dict['llm_interpretation'] = {
        'audience': target_audience,
        'generated_text': llm_output,
        'generation_status': 'fallback' if llm_failed else 'success',
        'model_used': 'groq-llama-3.1-8b'
    }

    return recommendations_dict
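One detail worth noting is that `generate_llm_interpretations` writes into the dictionary it receives and returns that same object, so a second call for a different audience overwrites the first interpretation. If both versions need to be kept, working on a copy is one option. The sketch below illustrates the idea with a hypothetical stand-in function (`annotate`) and placeholder data; it does not call the LLM.

```python
import copy


def annotate(report, audience):
    """Return a copy of `report` with an audience-specific entry attached."""
    annotated = copy.deepcopy(report)  # leave the caller's dictionary untouched
    annotated["llm_interpretation"] = {"audience": audience}
    return annotated


# Placeholder report standing in for the output of generate_recommendations.
base_report = {"patient_profile": {"cluster": 1}}

doctor_report = annotate(base_report, "doctor")
patient_report = annotate(base_report, "patient")

print(doctor_report["llm_interpretation"]["audience"])   # doctor
print(patient_report["llm_interpretation"]["audience"])  # patient
print("llm_interpretation" in base_report)               # False
```

With the in-place version, `result_doctor` and `result_patient` in the test cell refer to the same object. The printed text is still correct there only because each interpretation is printed before the next call replaces it.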

In [None]:
# ============================================
# TEST 1: Simple test
# ============================================
print("Testing Groq API...")
test_result = generate_llm_interpretation(
    "Summarize in 2 sentences: Patient has high glucose and needs diet changes.",
    max_tokens=100
)

print(f"\n{'='*60}")
print("TEST RESULT:")
print(f"{'='*60}")
print(test_result)
print(f"{'='*60}\n")

# ============================================
# TEST 2: Full patient test (only if test 1 works)
# ============================================

if not test_result.startswith("[Error"):
    print("✅ API working! Running full patient test...\n")

    # Your test patient
    single_patient = pd.Series({
        'Age': 62, 'BMI': 31, 'Glucose': 145, 'Insulin': 28,
        'HOMA': 10, 'Leptin': 35, 'Adiponectin': 5,
        'Resistin': 15, 'MCP.1': 450
    })

    # Generate base recommendations
    result = recsys.generate_recommendations(single_patient, is_cancer=True)

    # Generate doctor interpretation
    result_doctor = generate_llm_interpretations(recsys, result, target_audience="doctor")

    print("\n" + "="*80)
    print("👨‍⚕️ DOCTOR INTERPRETATION:")
    print("="*80)
    print(result_doctor['llm_interpretation']['generated_text'])

    # Generate patient interpretation
    result_patient = generate_llm_interpretations(recsys, result, target_audience="patient")

    print("\n" + "="*80)
    print("👤 PATIENT INTERPRETATION:")
    print("="*80)
    print(result_patient['llm_interpretation']['generated_text'])
else:
    print("❌ Please set your GROQ_API_KEY first!")
    print("Get one free at: https://console.groq.com/keys")

# V. Evaluation

## V.1. DSO1: Predict the diagnosis type, M (Malignant) or B (Benign)

#### Evaluation on the test set (Random Forest)

- I predict the labels on `X_test` and print:
  - the **classification report** (precision, recall, f1-score for Benign and Malignant),
  - the **confusion matrix** (number of correct and incorrect predictions).
- I also compute the **test accuracy**, which is 0.9766 in my run.
- To check that the result is not due to a lucky split, I run a **5-fold cross-validation** on all the data (`X_scaled`, `y_model`).  
  The mean cross-validation accuracy is very close to the test accuracy, which suggests that the model generalises well and is not strongly overfitting.


In [None]:
y_pred = rf.predict(X_test)
print("=== Classification Report ===")
print(classification_report(y_test, y_pred))
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))
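For a screening task, the most informative numbers in the confusion matrix are usually the sensitivity (recall on Malignant) and the specificity (recall on Benign). The sketch below shows how they can be read off a 2×2 matrix laid out the way scikit-learn prints it (rows = true class, columns = predicted class, with 0 = Benign and 1 = Malignant). The function name and the counts are illustrative placeholders, not results from this notebook.

```python
def screening_metrics(cm):
    """Derive screening metrics from a 2x2 confusion matrix.

    cm follows scikit-learn's layout: cm[i][j] is the number of samples of
    true class i predicted as class j, with 0 = Benign and 1 = Malignant.
    """
    (tn, fp), (fn, tp) = cm
    return {
        "sensitivity": tp / (tp + fn),  # recall on Malignant
        "specificity": tn / (tn + fp),  # recall on Benign
        "precision": tp / (tp + fp),    # positive predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts, for illustration only
example_cm = [[50, 2],
              [3, 45]]
metrics = screening_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice `confusion_matrix(y_test, y_pred).tolist()` could be passed in place of `example_cm`. A missed malignant case (false negative) is typically the costlier error, so sensitivity deserves particular attention.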

In [None]:
# 3. Evaluation on the test set
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

In [None]:
cv_mean = cross_val_score(rf, X_scaled, y_model, cv=5).mean()
print(f"Mean Cross-Validation Accuracy (5-fold): {cv_mean:.4f}")
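One caveat with cross-validating on `X_scaled`: if the scaler was fitted on the full dataset beforehand, each validation fold has already influenced the scaling. A hedged sketch of a leakage-free variant wraps the scaler and the classifier in a pipeline, so that scaling is refitted inside every fold. It uses scikit-learn's built-in `load_breast_cancer` data as a stand-in (an assumption, the notebook's own features and labels may differ), and the forest hyperparameters are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's copy of the Wisconsin dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is part of the estimator, so it is refitted on each training fold
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

# Stratified, shuffled folds keep the class ratio stable across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(4)}")
print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
```

Reporting the standard deviation alongside the mean gives a rough idea of how much the score depends on the split.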

#### ROC curve and AUC (Random Forest)


In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Probabilities for class 1 (Malignant)
y_test_proba = rf.predict_proba(X_test)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)
auc_rf = roc_auc_score(y_test, y_test_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Random Forest ROC (AUC = {auc_rf:.3f})", color="darkred", linewidth=2)
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")

plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve – Random Forest (Test set)")
plt.legend(loc="lower right")
plt.grid(True)
plt.tight_layout()
plt.show()

The ROC curve of the Random Forest on the test set is very close to the top-left
corner. The Area Under the Curve is **AUC = 0.997**, which indicates an excellent
ability to separate benign (0) from malignant (1) tumours.  
The model keeps a very high true positive rate (sensitivity) while maintaining a
very low false positive rate, which is desirable in a medical screening context.
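To make the AUC value more concrete: it equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. The small sketch below computes it directly from that definition on toy data. The helper name and the toy values are illustrative, and the quadratic loop is meant for explanation only, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so this is for illustration rather than large datasets.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, for illustration only
y_toy = [0, 0, 1, 1, 0, 1]
s_toy = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

# 8 of the 9 (malignant, benign) pairs are ordered correctly
print(f"Pairwise AUC: {pairwise_auc(y_toy, s_toy):.3f}")
```

Read this way, an AUC of 0.997 means that almost every malignant/benign pair in the test set is ordered correctly by the model's probability.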

#### Hyperparameter search with GridSearchCV (MLP)

- Search space:
  - `hidden_layer_sizes`:
    - `(32,)`, `(64,)`  → one hidden layer
    - `(32, 16)`, `(64, 32)` → two hidden layers
  - `alpha` ∈ {1e-4, 1e-3, 1e-2} (L2 regularisation)
  - `learning_rate_init` ∈ {1e-3, 5e-4}
- Use `GridSearchCV` with:
  - estimator: `mlp_base`
  - scoring: `"accuracy"`
  - `cv = 5`
- Fit the grid on all data (`X_scaled`, `y_model_2`) and display:
  - best hyperparameters
  - best mean CV accuracy.

In [None]:
param_grid = {
    "hidden_layer_sizes": [
        (32,),         # one hidden layer
        (64,),
        (32, 16),      # two hidden layers
        (64, 32)
    ],
    "alpha": [1e-4, 1e-3, 1e-2],          # L2 regularisation
    "learning_rate_init": [1e-3, 5e-4]    # learning rate
}

grid = GridSearchCV(
    estimator=mlp_base,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1
)

In [None]:
grid.fit(X_scaled, y_model_2)

In [None]:
print("=== Best hyperparameters found ===")
print(grid.best_params_)
print(f"Best mean CV accuracy (5-fold): {grid.best_score_:.4f}")
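Because the grid is fitted on all data, the test samples take part in choosing the hyperparameters, which can make a later test accuracy slightly optimistic. A hedged alternative, sketched here on synthetic data with a reduced grid (the data, the grid and `max_iter` are illustrative assumptions), tunes on the training split only and keeps the scaler inside the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split once into train and test
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
small_grid = {
    "mlp__hidden_layer_sizes": [(32,), (32, 16)],
    "mlp__alpha": [1e-4, 1e-2],
}

search = GridSearchCV(pipe, small_grid, scoring="accuracy", cv=3)
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

print("Best params:", search.best_params_)
print(f"CV accuracy (train only): {search.best_score_:.4f}")
print(f"Held-out test accuracy:   {search.score(X_te, y_te):.4f}")
```

With this layout, the held-out accuracy remains an estimate on data that influenced neither the scaling nor the hyperparameter choice.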

#### Evaluation on the test set (MLP)

- Take the best estimator returned by `GridSearchCV` and refit it on the **training set**.
- On the test set:
  - print the **classification report** (precision, recall, f1-score),
  - print the **confusion matrix**,
  - compute the **test accuracy**.
- Also compute the **training accuracy** to check overfitting.
- In my run:
  - Test accuracy ≈ 0.977
  - Train accuracy ≈ 0.993  
  The values are close, which indicates good generalisation and limited overfitting.

In [None]:
best_mlp = grid.best_estimator_
best_mlp.fit(X_train, y_train)

In [None]:
y_pred = best_mlp.predict(X_test)
print("\n=== Best MLP - Classification Report (Test) ===")
print(classification_report(y_test, y_pred))

print("=== Best MLP - Confusion Matrix (Test) ===")
print(confusion_matrix(y_test, y_pred))

test_acc = accuracy_score(y_test, y_pred)
print(f"Best MLP Test Accuracy: {test_acc:.4f}")

# Train accuracy to check for overfitting
y_train_pred = best_mlp.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
print(f"Best MLP Train Accuracy: {train_acc:.4f}")

#### ROC curve and AUC (MLP)

- Use `predict_proba` to get the probability of the malignant class on the test set.
- Compute and plot the **ROC curve** and the **AUC** for the tuned MLP.

In [None]:
y_test_proba = best_mlp.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_test_proba)
auc_mlp = roc_auc_score(y_test, y_test_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"Best MLP ROC (AUC = {auc_mlp:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve - Best MLP (Test set)")
plt.legend(loc="lower right")
plt.grid(True)
plt.tight_layout()
plt.show()

The ROC curve of the tuned MLP on the test set is very close to the top-left
corner, with an AUC of about **0.997**. This shows an excellent ability to
separate benign from malignant tumours. Together with a test accuracy around
97.7% and a train accuracy slightly higher, the model appears powerful and only
mildly overfitted.
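The accuracy figures above rely on `predict`, which for a binary classifier amounts to a 0.5 cut-off on the malignant probability. In a screening setting one may prefer a cut-off chosen to reach a target sensitivity. The sketch below shows one simple way to pick it from predicted probabilities. The helper name, the target value and the toy data are assumptions, and such a threshold should be chosen on a validation split rather than on the test set.

```python
import numpy as np


def threshold_for_sensitivity(y_true, proba, target=0.95):
    """Return the highest threshold whose sensitivity reaches `target`,
    together with the sensitivity and specificity obtained at it."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    # Scan candidate thresholds from the strictest to the most lenient
    for t in np.sort(np.unique(proba))[::-1]:
        pred = proba >= t
        sens = np.mean(pred[y_true == 1])       # share of malignant cases flagged
        if sens >= target:
            spec = np.mean(~pred[y_true == 0])  # share of benign cases cleared
            return float(t), float(sens), float(spec)
    raise ValueError("target sensitivity is not reachable")


# Toy labels and probabilities, for illustration only
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
p_toy = [0.05, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80, 0.95]

t, sens, spec = threshold_for_sensitivity(y_toy, p_toy, target=1.0)
print(f"threshold={t:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On the toy data, catching every malignant case requires lowering the threshold to 0.45, at the price of one benign case being flagged. This is the trade-off the ROC curve summarises.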

#### Evaluation on the test and train sets (GRU-SVM)

- The GRU-SVM outputs a real score in ℝ.  
  We convert it to labels in {−1, +1} using a **sign threshold at 0**:
  - score ≥ 0 → +1 (Malignant)
  - score  < 0 → −1 (Benign)
- For compatibility with scikit-learn metrics, we also build 0/1 versions:
  - `1` = Malignant, `0` = Benign.
- On the **test set**, we print:
  - classification report,
  - confusion matrix,
  - test accuracy.
- On the **train set**, we compute the **training accuracy** to check for overfitting (train vs test).

In [None]:
y_test_scores = gru_svm.predict(X_test_gru).flatten()
y_test_pred_svm = np.where(y_test_scores >= 0, 1.0, -1.0)

# for the usual scikit-learn metrics, convert back to 0/1 labels
y_test_bin = (y_test_gru == 1).astype(int)
y_pred_bin = (y_test_pred_svm == 1).astype(int)

print("\n=== GRU-SVM - Classification Report (Test) ===")
print(classification_report(y_test_bin, y_pred_bin, target_names=['Benign', 'Malignant']))

print("=== GRU-SVM - Confusion Matrix (Test) ===")
print(confusion_matrix(y_test_bin, y_pred_bin))

test_acc = accuracy_score(y_test_bin, y_pred_bin)
print(f"GRU-SVM Test Accuracy: {test_acc:.4f}")

In [None]:
from sklearn.metrics import accuracy_score

# --- Accuracy on the TRAIN set to detect overfitting ---
y_train_scores = gru_svm.predict(X_train_gru).flatten()
y_train_pred_svm = np.where(y_train_scores >= 0, 1.0, -1.0)

y_train_bin = (y_train_gru == 1).astype(int)
y_train_pred_bin = (y_train_pred_svm == 1).astype(int)

train_acc = accuracy_score(y_train_bin, y_train_pred_bin)
print(f"GRU-SVM Train Accuracy: {train_acc:.4f}")
print(f"GRU-SVM Test  Accuracy: {test_acc:.4f}")


#### ROC curve and AUC (GRU-SVM)

- The SVM score is a signed distance to the decision boundary.  
  We convert this score to a pseudo-probability using a logistic function:
  $$ p = \frac{1}{1 + e^{-\text{score}}} $$
- With these probabilities, we compute:
  - the **ROC curve**,
  - the **AUC** for the GRU-SVM on the test set.

In [None]:
y_test_proba = 1 / (1 + np.exp(-y_test_scores))

fpr, tpr, _ = roc_curve(y_test_bin, y_test_proba)
auc_gru = roc_auc_score(y_test_bin, y_test_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"GRU-SVM ROC (AUC = {auc_gru:.3f})", color="navy", linewidth=2)
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve – GRU-SVM (Test set)")
plt.legend(loc="lower right")
plt.grid(True)
plt.tight_layout()
plt.show()

The ROC curve for GRU-SVM is close to the top-left corner with a high AUC
(≈ 0.99 in my run). Combined with similar train and test accuracies, this
indicates that the GRU-SVM model also achieves very good separation between
benign and malignant tumours, without strong overfitting.
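A side note on the logistic mapping used for the GRU-SVM scores: because it is strictly increasing, it does not change how the samples are ranked, so the ROC curve and the AUC are the same as those computed from the raw scores. The sketch below checks this on hypothetical scores (the random data and variable names are assumptions, not the notebook's outputs).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels and decision scores: malignant cases (1) are
# shifted towards positive values, benign cases (0) towards negative ones
y_demo = rng.integers(0, 2, size=200)
scores_demo = rng.normal(loc=2.0 * y_demo - 1.0, scale=1.5)

# Same logistic squashing as in the notebook
proba_demo = 1.0 / (1.0 + np.exp(-scores_demo))

auc_raw = roc_auc_score(y_demo, scores_demo)
auc_sigmoid = roc_auc_score(y_demo, proba_demo)
print(f"AUC on raw scores:     {auc_raw:.6f}")
print(f"AUC on sigmoid scores: {auc_sigmoid:.6f}")
```

The mapping is therefore a convenience for plotting on a [0, 1] scale rather than a calibration step. The resulting values should not be read as calibrated probabilities of malignancy.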

## V.2. DSO2: Cluster diagnosis patterns


In [None]:
from sklearn.metrics import (
    silhouette_score, davies_bouldin_score, calinski_harabasz_score,
    adjusted_rand_score, normalized_mutual_info_score,
    homogeneity_score, completeness_score, v_measure_score
)

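def evaluate_clustering(X2, labels):
    """Return internal clustering metrics for `labels` computed on features `X2`."""
    results = {}

    # Internal metrics (no ground-truth labels required)
    results["silhouette"] = silhouette_score(X2, labels)                # higher is better
    results["davies_bouldin"] = davies_bouldin_score(X2, labels)        # lower is better
    results["calinski_harabasz"] = calinski_harabasz_score(X2, labels)  # higher is better

    return results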


In [None]:
results_kmeans_3 = evaluate_clustering(
    X_cancer_scaled,
    df_cancer["cluster_kmeans_cancer_3"]
)
print("KMeans k=3 evaluation:")
print(results_kmeans_3)


In [None]:
results_agg = evaluate_clustering(
    X_cancer_scaled,
    df_cancer["cluster_agg_3"]
)
print("Agglomerative clustering evaluation:")
print(results_agg)


In [None]:
results_gmm = evaluate_clustering(
    X_cancer_scaled,
    df_cancer["cluster_gmm"]
)
print("GMM evaluation:")
print(results_gmm)
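The three result dictionaries are easier to compare side by side. The sketch below illustrates one way to tabulate the same internal metrics for several clusterings. It runs on synthetic blobs (an assumption, not the notebook's `X_cancer_scaled`), so the printed numbers say nothing about the actual patient clusters.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score, davies_bouldin_score, silhouette_score
)
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

labelings = {
    "KMeans (k=3)": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo),
    "Agglomerative": AgglomerativeClustering(n_clusters=3).fit_predict(X_demo),
    "GMM": GaussianMixture(n_components=3, random_state=0).fit_predict(X_demo),
}

rows = {
    name: {
        "silhouette": silhouette_score(X_demo, labels),                # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels),        # >= 0, lower is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels),  # higher is better
    }
    for name, labels in labelings.items()
}

print(f"{'method':<15}{'silhouette':>12}{'davies_bouldin':>16}{'calinski_harabasz':>19}")
for name, m in rows.items():
    print(f"{name:<15}{m['silhouette']:>12.3f}"
          f"{m['davies_bouldin']:>16.3f}{m['calinski_harabasz']:>19.1f}")
```

When the metrics disagree, it helps to remember that all three favour compact, well-separated clusters, which can disadvantage a GMM whose components are elongated or overlapping.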


## V.3. DSO3: Recommendation system


In [None]:
recsys = HybridMetabolicRecommendationSystem(
    baseline_stats=baseline_stats,
    cluster_strategies=cluster_strategies,
    biological_meaning=biological_meaning,
    kmeans_model=kmeans_cancer_3,
    scaler=scaler_cancer,
)


In [None]:
import pandas as pd

single_patient = pd.Series({
    'Age': 55,
    'BMI': 28,
    'Glucose': 120,
    'Insulin': 25,
    'HOMA': 7,
    'Leptin': 15,
    'Adiponectin': 8,
    'Resistin': 10,
    'MCP.1': 3
})

# Define cancer status for this patient
is_cancer = False  # set to True for a patient diagnosed with cancer

# Generate recommendations
result = recsys.generate_recommendations(single_patient, is_cancer=is_cancer)

# View main output
print("Patient Profile:")
print(result['patient_profile'])

print("\nPriority Interventions:")
for r in result['recommendations']['priority_interventions']:
    print("-", r)

print("\nFeature Explanations:")
for e in result['explanations']:
    print("-", e)


# VI. Deployment