# 02 - Model Selection 🤖

**Goal:** Train predictive models to classify the optimal ADC conjugation platform based on synthetic features.

---

## 🔍 Objectives
- Preprocess features
- Train and evaluate multiple ML models (LogReg, RF, GBM)
- Compare performance metrics
- Save best model for use in decision support

---


In [36]:
# 🚀 Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import joblib


## 📥 Load Data


In [37]:
adc_conjugation_synthetic_advanced_url = "https://drive.google.com/file/d/1DV8tx5akh8WqB739Vd_Wft0YkPCt9R9j/view?usp=sharing"

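def import_csv(url):
    """Read a CSV from a Google Drive share link of the form .../file/d/<FILE_ID>/view."""
    # The file ID is the second-to-last path segment of the share link.
    path = "https://drive.google.com/uc?export=download&id=" + url.split("/")[-2]
    return pd.read_csv(path)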

df = import_csv(adc_conjugation_synthetic_advanced_url)  # ✅ Load into df
df.head()


| | Technology_Category | Platform | Vendor | DAR_Mean | DAR_Std | DAR_CV | Homogeneity | Stability_Score | Expression_Ease | Cost_Index | CMC_Risk | Scalability | Latency_to_Clinic_yrs | Approved_Usage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Random | Lysine-Based | Generic | 4.2 | 0.15 | 0.04 | 0.27 | 0.74 | 0.74 | 0.32 | Medium | 0.95 | 4.95 | Adcetris, Kadcyla |
| 1 | Random | Lysine-Based | Generic | 3.89 | 0.17 | 0.04 | 0.42 | 0.71 | 0.24 | 0.87 | Low | 0.88 | 4.36 | Adcetris, Kadcyla |
| 2 | Random | Lysine-Based | Generic | 4.03 | 0.24 | 0.06 | 0.39 | 0.81 | 0.74 | 0.37 | Medium | 0.9 | 5.75 | Adcetris, Kadcyla |
| 3 | Random | Lysine-Based | Generic | 3.73 | 0.15 | 0.04 | 0.39 | 0.62 | 0.53 | 0.51 | Medium | 0.91 | 2.2 | Adcetris, Kadcyla |
| 4 | Random | Lysine-Based | Generic | 3.93 | 0.36 | 0.09 | 0.48 | 0.73 | 0.67 | 0.37 | High | 0.91 | 4.27 | Adcetris, Kadcyla |
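
`import_csv` depends on the shape of a Drive share link: the file ID sits between `/d/` and `/view`. The sketch below isolates that string rewrite so it can be checked without any network access. The helper name `drive_share_to_download` is hypothetical and not part of the notebook.

```python
def drive_share_to_download(url: str) -> str:
    """Rewrite a Drive share link into a direct-download link (hypothetical helper)."""
    # Share links look like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file ID is the second-to-last segment when splitting on "/".
    file_id = url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1DV8tx5akh8WqB739Vd_Wft0YkPCt9R9j/view?usp=sharing"
download_url = drive_share_to_download(share_url)
print(download_url)
```

A link in a different layout (for example one without the trailing `/view` segment) would yield the wrong segment, so this approach only fits links shaped like the one used here.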


## 🔧 Preprocessing

We split the data into features (`X`) and target (`y`) and keep `X` as a DataFrame so the pipeline can apply column-based transformations. The first cell below also fits and saves a baseline Random Forest pipeline.


In [40]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
import joblib

# Features and target
X = df.drop(columns=['Platform'])
y = df['Platform']

# Specify categorical and numeric columns
categorical_features = ['Technology_Category', 'CMC_Risk', 'Approved_Usage', 'Vendor']
numeric_features = [col for col in X.columns if col not in categorical_features]

# Define preprocessing for numeric and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create the pipeline with preprocessing + classifier
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the pipeline on raw data (strings included)
model_pipeline.fit(X_train, y_train)

# Save the pipeline for later use
joblib.dump(model_pipeline, 'rf_adc_pipeline.pkl')

print("✅ Model pipeline trained and saved!")


✅ Model pipeline trained and saved!


In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=['Platform'])  # Keep as DataFrame
y = df['Platform'].astype('category').cat.codes  # Encode target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
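
Encoding the target with `.cat.codes` replaces platform names with integers, so predictions come back as codes. A minimal sketch of keeping a lookup table to decode them, using a toy series: only `Lysine-Based` comes from the data shown above, and the other platform names are placeholders.

```python
import pandas as pd

# Toy target column; 'Platform-A' and 'Platform-B' are made-up names.
platform = pd.Series(["Lysine-Based", "Platform-B", "Lysine-Based", "Platform-A"])

cat = platform.astype("category")
codes = cat.cat.codes                                # integer labels used for training
code_to_name = dict(enumerate(cat.cat.categories))   # lookup table for decoding

# Pretend these codes came back from model.predict(...)
predicted_codes = [0, 2]
predicted_names = [code_to_name[c] for c in predicted_codes]

print(code_to_name)
print(predicted_names)
```

Categories are ordered alphabetically by default, so the mapping is stable as long as the same set of platform names is present when the encoding is built.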


## 🔗 Pipeline Construction

We use a pipeline to bundle preprocessing (scaling + encoding) and the classifier together for consistent processing and deployment.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define column types
numeric_features = ['DAR_Mean', 'DAR_Std', 'DAR_CV', 'Stability_Score',
                    'Expression_Ease', 'Homogeneity', 'Cost_Index', 'CMC_Risk']
categorical_features = ['Approved_Usage', 'Vendor', 'Technology_Category']

# Transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Final pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
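
To see what these transformers do at prediction time, here is a small sketch on a toy frame (the values are invented, and `sparse_threshold=0` is set only to force a dense array for easy inspection). It fits the same kind of preprocessor and then transforms one new row that has a missing numeric value and a category level not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with one missing value per column.
train = pd.DataFrame({
    "DAR_Mean": [2.0, 4.0, np.nan, 3.0],
    "CMC_Risk": ["Low", "High", "Low", np.nan],
})
# New row: missing numeric value and an unseen category level.
new = pd.DataFrame({"DAR_Mean": [np.nan], "CMC_Risk": ["Medium"]})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(
    [("num", numeric, ["DAR_Mean"]), ("cat", categorical, ["CMC_Risk"])],
    sparse_threshold=0,  # always return a dense array
)

pre.fit(train)
learned_median = pre.named_transformers_["num"].named_steps["imputer"].statistics_[0]
row = pre.transform(new)

print("median learned from training data:", learned_median)
print("transformed new row:", row)
```

The missing `DAR_Mean` is filled with the training median before scaling, and the unseen `Medium` level is encoded as all zeros rather than raising an error, which is the effect of `handle_unknown='ignore'`.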


## 🏋️ Train and Evaluate

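We train the pipeline on the training split and evaluate it on the held-out test split using accuracy and a per-class classification report. Before fitting, we check that every column listed in `numeric_features` is actually numeric.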
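
The objectives mention comparing LogReg, RF, and GBM, and a single small split gives only a coarse estimate. The sketch below shows one way such a comparison could look, using stratified 5-fold cross-validation and macro-F1. It runs on stand-in data from `make_classification`, not the ADC dataset, and the names `X_demo`, `y_demo`, and `candidates` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in the notebook this would be the ADC features and Platform labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

candidates = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
}

# Stratified folds keep class proportions similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    scores[name] = fold_scores.mean()
    print(f"{name}: macro-F1 = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")

best_name = max(scores, key=scores.get)
print("best by mean macro-F1:", best_name)
```

With the real data, each candidate would replace the `classifier` step of `model_pipeline` so that preprocessing is refit inside every fold. Stratified folds also require each class to have at least as many samples as there are folds.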

In [None]:
X[numeric_features].dtypes


| Column | dtype |
|---|---|
| DAR_Mean | float64 |
| DAR_Std | float64 |
| DAR_CV | float64 |
| Stability_Score | float64 |
| Expression_Ease | float64 |
| Homogeneity | float64 |
| Cost_Index | float64 |
| CMC_Risk | object |


In [None]:
for col in numeric_features:
    print(f"{col}: {X[col].unique()}")


DAR_Mean: [4.2  3.89 4.03 3.73 3.93 3.75 3.92 3.58 2.78 3.18 2.21 2.93 2.47 3.38
 2.8  2.43 2.81 2.3  2.36 2.73 2.46 2.23 2.02 2.05 2.19 2.18 2.06 2.44
 3.01 3.44 3.74 3.3  3.26 3.12 3.59 3.87 3.86 3.56 3.97 3.55 3.62]
DAR_Std: [0.15 0.17 0.24 0.36 0.22 0.29 0.32 0.11 0.18 0.19 0.35 0.14 0.08 0.21
 0.16 0.25 0.12 0.3  0.07 0.23 0.13 0.37 0.27 0.38]
DAR_CV: [0.04 0.06 0.09 0.07 0.12 0.1  0.02 0.03 0.05 0.13 0.14 0.08 0.11 0.16]
Stability_Score: [0.74 0.71 0.81 0.62 0.73 0.65 0.77 0.68 0.69 0.8  0.79 0.64 0.75 0.93
 0.76 0.7  0.82 0.86 0.84 0.88 0.66 0.63 0.72 0.96 0.61]
Expression_Ease: [0.74 0.24 0.53 0.67 0.66 0.7  0.63 0.44 0.5  0.58 0.49 0.2  0.54 0.81
 0.88 0.22 0.95 0.52 0.94 0.38 0.8  0.85 0.43 0.42 0.93 0.91 0.37 0.51
 0.9  0.46 0.92 0.62 0.72 0.78 0.36 0.56 0.35 0.89]
Homogeneity: [0.27 0.42 0.39 0.48 0.5  0.46 0.63 0.61 0.65 0.71 0.72 0.69 0.75 0.83
 0.87 0.86 0.78 0.74 0.89 0.98 0.97 0.88 0.95 0.92 0.99 0.96]
Cost_Index: [0.32 0.87 0.37 0.51 0.43 0.42 0.48 0.63 0.6  0.44 0.52

In [None]:
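# CMC_Risk holds strings ('Low', 'Medium', 'High'), so treat it as categorical.
# The lists are edited in place, so the ColumnTransformer built above sees the change.
# The membership checks make the cell safe to re-run.
if 'CMC_Risk' in numeric_features:
    numeric_features.remove('CMC_Risk')
if 'CMC_Risk' not in categorical_features:
    categorical_features.append('CMC_Risk')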


In [None]:
numeric_features = [
    'DAR_Mean', 'DAR_Std', 'DAR_CV',
    'Stability_Score', 'Expression_Ease',
    'Homogeneity', 'Cost_Index'
]

categorical_features = [
    'CMC_Risk',  # 'Low', 'Medium', 'High'
    'Technology_Category',
    'Approved_Usage',
    'Vendor'
]
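
Hand-maintained column lists are how `CMC_Risk` ended up among the numeric features. One possible alternative, sketched here on a toy frame with a few of the notebook's column names, is to derive both lists from the column dtypes. The frame `X_demo` and its values are illustrative only.

```python
import pandas as pd

# Toy frame mirroring a few of the notebook's columns.
X_demo = pd.DataFrame({
    "DAR_Mean": [4.2, 3.89],
    "Cost_Index": [0.32, 0.87],
    "CMC_Risk": ["Medium", "Low"],
    "Vendor": ["Generic", "Generic"],
})

# Numeric dtypes go to the scaler, everything else to the one-hot encoder.
numeric_cols = X_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = X_demo.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Dtype-based selection still needs a manual check for columns that are numeric in storage but categorical in meaning, such as integer IDs or a leftover index column.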


In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Train
model_pipeline.fit(X_train, y_train)

# Predict
y_pred = model_pipeline.predict(X_test)

# Evaluate
print("🔍 Accuracy:", accuracy_score(y_test, y_pred))
print("\n📄 Classification Report:\n", classification_report(y_test, y_pred))


🔍 Accuracy: 1.0

📄 Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1
           2       1.00      1.00      1.00         1
           3       1.00      1.00      1.00         1
           4       1.00      1.00      1.00         4
           5       1.00      1.00      1.00         2

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10
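
A confusion matrix complements the report above by showing which classes get mistaken for which. A minimal sketch with made-up labels for three classes (these are not results from this notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1, 2])
print(cm)
```

In this notebook the equivalent call would be `confusion_matrix(y_test, y_pred)`. With perfect predictions every off-diagonal entry is zero.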



## 💾 Save Trained Model

We'll save the full pipeline (preprocessing + model) as a `.pkl` file for downstream use in the decision support notebook.
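
Before relying on the saved file downstream, it can be worth checking that a pipeline survives the save and load round trip. The sketch below does this with a tiny stand-in pipeline and a temporary directory. The file name `demo_pipeline.pkl` and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline; the notebook would use model_pipeline instead.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print("round trip preserved predictions:", same)
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that saved them, so recording that version alongside the file is a reasonable precaution.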


In [41]:
import joblib
from google.colab import files

# Save the pipeline (model + preprocessing)
joblib.dump(model_pipeline, "rf_adc_pipeline.pkl")

# Trigger download
files.download("rf_adc_pipeline.pkl")



## ✅ Summary

- Successfully trained a **Random Forest** classifier using both **numerical** and **categorical** features, including previously problematic categorical fields such as `CMC_Risk`.
- Applied appropriate preprocessing:
  - **Numerical**: imputed missing values with median and scaled.
  - **Categorical**: imputed with most frequent value and one-hot encoded.
- Evaluated model performance with:
  - **Accuracy** and a per-class **Classification Report**.
- The model reaches 100% accuracy on the held-out test set, but that set has only 10 samples and the data is synthetic, so this is weak evidence of generalization. Columns such as `Vendor` and `Approved_Usage` may also identify the platform directly.
- Model and preprocessing pipeline saved as `rf_adc_pipeline.pkl` for deployment in the **03_decision_support.ipynb** notebook.

✅ Ready to use the model for **platform recommendations** and **interactive predictions** in decision support scenarios.

---
