# 🧬 Machine Learning Workshop: From Raw Data to Best Model
## A Beginner's Guide to ML Workflow (100 minutes)

---

### 📋 Workshop Overview

Welcome! In this tutorial, you'll learn the **complete machine learning workflow** from scratch. We'll work with a biotech dataset and go through every step a data scientist takes when building a model.

### 🎯 What You'll Learn

1. **Data Exploration** (15 min) - Understanding your dataset
2. **Data Cleaning** (20 min) - Handling messy real-world data
3. **Feature Engineering** (15 min) - Creating useful features
4. **Data Preparation** (10 min) - Getting ready for ML
5. **Model Training** (25 min) - Building multiple models
6. **Model Evaluation** (10 min) - Comparing performance
7. **Model Selection** (5 min) - Choosing the best model

---

### 📊 About Our Dataset

We're working with biotech medical data containing:
- Patient demographics (age, sex, height, weight)
- Lab measurements (protein, glucose, albumin, pH)
- Clinical information (diagnosis, treatment, sample quality)

**Goal**: Predict patient diagnosis (Healthy vs Tumor) based on available features.

---

### ⚙️ Prerequisites Check

First, let's verify all required libraries are installed.

In [None]:
import sys

# Check required libraries and their versions
required_libs = {
    "pandas": "2.2",
    "numpy": "2.0",
    "scikit-learn": "1.7.2",
    "matplotlib": "3.9",
    "seaborn": "0.13.2"
}

print("Checking installed libraries...\n")

for lib_name, min_version in required_libs.items():
    # scikit-learn is installed as "scikit-learn" but imported as "sklearn"
    module_name = "sklearn" if lib_name == "scikit-learn" else lib_name
    try:
        lib = __import__(module_name)
        print(f"✓ {lib_name}: {lib.__version__} (required: >={min_version})")
    except ImportError:
        print(f"✗ {lib_name} is NOT installed. Please install it using:")
        # Quote the requirement so the shell does not treat ">" as a redirect
        print(f'   pip install "{lib_name}>={min_version}"')
        

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from pathlib import Path
import os

print("✅ Libraries imported successfully!")

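# Move two levels up from the current working directory so relative data paths resolve.
# Note: re-running this cell moves up two more levels each time.
base_dir = Path().resolve().parent.parent
os.chdir(base_dir)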

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

## 🔍 STEP 1: Data Exploration (15 minutes)

Let's understand our dataset before diving into modeling!


In [None]:
# Load the dataset
data_path = os.path.join("data", "refinement", "biotech_preprocessing_refined.csv")
try:
    df = pd.read_csv(data_path, index_col=0)
    print(f"✅ Loaded dataset: {data_path} with shape {df.shape}")
except FileNotFoundError:
    print(f"✗ Could not find {data_path}. Check the data folder or use the raw CSV instead.")


In [None]:
# Let's peek at the first few rows
print("\n👀 First Look at the Data:")
print("=" * 70)
display(df.head(10))

print(f"\n📏 Dataset Dimensions:")
print(f"   Rows (samples): {df.shape[0]}")
print(f"   Columns (features): {df.shape[1]}")

In [None]:
# Let's examine the data types and missing values
print("\n📝 Column Information:")
print("=" * 70)
df.info()  # info() prints directly and returns None, so it needs no print() wrapper

print("\n🔢 Data Types Summary:")
print(df.dtypes.value_counts())

print("\n❓ Missing Values Summary:")
print("=" * 70)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})

print("Columns with missing values:")
display(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

print(f"\n✅ Complete columns (no missing): {(missing == 0).sum()}/{len(missing)}")

In [None]:
# Check for obvious outliers in numerical columns
print("\n🔍 Quick Outlier Check:")
print("=" * 70)

numerical_cols = df.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    outliers = ((df[col] < (q1 - 1.5 * iqr)) | (df[col] > (q3 + 1.5 * iqr))).sum()
    if outliers > 0:
        print(f"   ⚠️  {col}: {outliers} potential outliers detected")

In [None]:
# Explore correlations between numerical features
print("\n🔗 Feature Correlations:")
print("=" * 70)

numerical_df = df.select_dtypes(include=[np.number])
correlation_matrix = numerical_df.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Correlation Matrix of Numerical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 High absolute correlations (|r| > 0.7) might indicate redundant features!")

In [None]:
# Explore our target variable (what we want to predict)
print("\n🎯 Target Variable Distribution:")
print("=" * 50)
print(df['Diagnosis'].value_counts())
print("\nPercentage:")
print(df['Diagnosis'].value_counts(normalize=True) * 100)

# Visualize the distribution
plt.figure(figsize=(8, 5))
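# sort_index() orders the bars alphabetically by label, so the colors map to the same class on every run
df['Diagnosis'].value_counts().sort_index().plot(kind='bar', color=['green', 'red'])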
plt.title('Distribution of Diagnosis', fontsize=14, fontweight='bold')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Get statistical summary of numerical features
print("\n📈 Statistical Summary of Numerical Features:")
print("=" * 70)
display(df.describe())

### 💡 Key Observations from Exploration:
- Our dataset has mixed data types (numerical and categorical)
- There are missing values that need handling
- Some columns have inconsistent formats (units, text variations)
- The target variable (Diagnosis) has two classes; their balance affects how we split and evaluate
- We need to clean and prepare this data before modeling!
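
The outlier check used during exploration follows the 1.5 × IQR rule. As a minimal sketch on made-up values (the helper name `iqr_outlier_mask` and the toy readings are illustrative assumptions, not part of the workshop dataset), the rule can be isolated like this:

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for entries outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy glucose readings (made up); 25.0 is deliberately extreme.
toy_glucose = pd.Series([4.8, 5.1, 5.3, 5.0, 5.6, 4.9, 25.0])
mask = iqr_outlier_mask(toy_glucose)
print(toy_glucose[mask].tolist())
```

A flagged value is only a candidate for review: in clinical data an extreme reading may be a real measurement rather than an error.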

---
## 🧹 STEP 2: Data Cleaning (20 minutes)

Real-world data is messy! Let's clean it systematically.
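
One way to make unit handling systematic is to parse the number and the unit separately with a regular expression. The sketch below is an illustration only: `parse_height_cm` is a hypothetical helper, and it assumes that values without a unit are already in centimetres and that a comma may be used as the decimal separator.

```python
import math
import re

# Number (with "." or "," as decimal separator), then an optional unit.
_HEIGHT_PATTERN = re.compile(r"^\s*([0-9]+(?:[.,][0-9]+)?)\s*(cm|m)?\s*$", re.IGNORECASE)


def parse_height_cm(raw) -> float:
    """Convert a raw height entry to centimetres; return NaN if it cannot be parsed."""
    if raw is None:
        return math.nan
    match = _HEIGHT_PATTERN.match(str(raw))
    if match is None:
        return math.nan
    number = float(match.group(1).replace(",", "."))
    unit = (match.group(2) or "cm").lower()  # assumption: no unit means cm
    return number * 100 if unit == "m" else number


for raw in ["172 cm", "1.80 m", "165", "1,75m", "tall", None]:
    print(repr(raw), "->", parse_height_cm(raw))
```

Anchoring the pattern at both ends means entries with unexpected extra text become NaN instead of being silently misread.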

In [None]:
# Create a working copy of our data
df_clean = df.copy()

print("🔧 Starting Data Cleaning Process...\n")

# Reduce the dataframe to only necessary columns for modeling
columns_to_drop = ["Age_months", "Center", "Device_ID", "Collection_Date", "Notes", "ConstantFlag"]
df_clean = df_clean.drop(columns=columns_to_drop)

In [None]:

# 1. Clean Age column (strip the "years" suffix and convert to numeric)
print("1️⃣ Cleaning Age column...")
df_clean['Age [years]'] = df_clean['Age [years]'].astype(str).str.replace(' years', '').str.strip()
df_clean['Age [years]'] = pd.to_numeric(df_clean['Age [years]'], errors='coerce')
print(f"   ✓ Age column cleaned")

# 2. Clean Height column (standardize to cm)
print("\n2️⃣ Cleaning Height column...")
def clean_height(height):
    """Return height in cm; unparseable values become NaN."""
    if pd.isna(height):
        return np.nan
    height_str = str(height).strip()
    try:
        if 'cm' in height_str:
            return float(height_str.replace('cm', '').strip())
        if 'm' in height_str:
            # Convert meters to cm
            return float(height_str.replace('m', '').strip()) * 100
        # No unit given: assume the value is already in cm
        return float(height_str)
    except ValueError:
        return np.nan

df_clean['Height'] = df_clean['Height'].apply(clean_height)
print(f"   ✓ Height standardized to cm")

# 3. Clean Weight column (standardize to kg)
print("\n3️⃣ Cleaning Weight column...")
def clean_weight(weight):
    """Return weight in kg; unparseable values become NaN."""
    if pd.isna(weight):
        return np.nan
    weight_str = str(weight).strip()
    try:
        if 'lb' in weight_str:
            # Convert pounds to kg (also handles the "lbs" spelling)
            return float(weight_str.replace('lbs', '').replace('lb', '').strip()) * 0.453592
        if 'kg' in weight_str:
            return float(weight_str.replace('kg', '').strip())
        # No unit given: assume the value is already in kg
        return float(weight_str)
    except ValueError:
        return np.nan

df_clean['Weight'] = df_clean['Weight'].apply(clean_weight)
print(f"   ✓ Weight standardized to kg")

print("\n✅ Basic cleaning completed!")

In [None]:
# 4. Clean Protein Concentration (convert decimal commas to dots, strip the unit)
print("4️⃣ Cleaning Protein Concentration...")
df_clean['Protein Concentration [mg/ml]'] = df_clean['Protein Concentration [mg/ml]'].astype(str).str.replace(',', '.').str.replace('mg/ml', '').str.strip()
df_clean['Protein Concentration [mg/ml]'] = pd.to_numeric(df_clean['Protein Concentration [mg/ml]'], errors='coerce')
print(f"   ✓ Protein concentration cleaned")

# 5. Clean pH column (remove "pH" text and "NA" placeholders)
print("\n5️⃣ Cleaning pH column...")
df_clean['pH'] = df_clean['pH'].astype(str).str.replace('pH', '').str.replace('NA', '').str.strip()
df_clean['pH'] = pd.to_numeric(df_clean['pH'], errors='coerce')
print(f"   ✓ pH column cleaned")

# 6. Clean Albumin column (strip the unit)
print("\n6️⃣ Cleaning Albumin column...")
df_clean['Albumin [g/L]'] = df_clean['Albumin [g/L]'].astype(str).str.replace('g/L', '').str.strip()
df_clean['Albumin [g/L]'] = pd.to_numeric(df_clean['Albumin [g/L]'], errors='coerce')
print(f"   ✓ Albumin column cleaned")

# 7. Standardize Sex column
print("\n7️⃣ Standardizing Sex column...")
df_clean['Sex'] = df_clean['Sex'].str.strip().str.upper()
df_clean['Sex'] = df_clean['Sex'].replace({
    'M': 'Male',
    'F': 'Female',
    'MALE': 'Male',
    'FEMALE': 'Female'
})
print(f"   ✓ Sex column standardized")
print(f"   Values: {df_clean['Sex'].value_counts().to_dict()}")

print("\n✅ All columns cleaned!")

In [None]:
# Check the impact of cleaning
print("\n📊 Data Quality After Cleaning:")
print("=" * 70)
print(f"Original dataset: {df.shape}")
print(f"Cleaned dataset: {df_clean.shape}")

print("\n❓ Remaining Missing Values:")
missing_after = df_clean.isnull().sum()
missing_pct_after = (missing_after / len(df_clean) * 100).round(2)
missing_summary = pd.DataFrame({
    'Missing Count': missing_after,
    'Percentage': missing_pct_after
})
print(missing_summary[missing_summary['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

In [None]:
# Handle missing values strategically
# Note: medians and modes are computed on the full dataset here for simplicity;
# a stricter workflow would compute them on the training split only.
print("🔧 Handling Missing Values...\n")

# Strategy 1: Fill numerical missing values with median (robust to outliers)
# 'Hydrophobicity [score]' is included because it is used as a model feature in Step 4
numerical_cols = ['Age [years]', 'Height', 'Weight', 'BMI',
                  'Protein Concentration [mg/ml]', 'Hydrophobicity [score]',
                  'pH', 'Glucose [mmol/L]', 'Albumin [g/L]']

for col in numerical_cols:
    if col in df_clean.columns:
        missing_count = df_clean[col].isnull().sum()
        if missing_count > 0:
            median_value = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_value)
            print(f"   ✓ Filled {missing_count} missing values in '{col}' with median: {median_value:.2f}")

# Strategy 2: Fill categorical missing values with mode (most common)
categorical_cols = ['Sex', 'Treatment', 'Sample_Quality']

for col in categorical_cols:
    if col in df_clean.columns:
        missing_count = df_clean[col].isnull().sum()
        if missing_count > 0:
            mode_value = df_clean[col].mode()[0] if not df_clean[col].mode().empty else 'Unknown'
            df_clean[col] = df_clean[col].fillna(mode_value)
            print(f"   ✓ Filled {missing_count} missing values in '{col}' with mode: {mode_value}")

print("\n✅ Missing value handling completed!")
print(f"\nRemaining missing values: {df_clean.isnull().sum().sum()}")

---
## 🔬 STEP 3: Feature Engineering (15 minutes)

Feature engineering means creating new useful features from existing ones. This can significantly improve model performance!
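
Binning a continuous measurement into categories is a common way to derive a feature, and `pd.cut` has a subtlety at the bin edges: by default each interval includes its right edge. The toy BMI values below are made up to show how the `right` argument changes which category a boundary value falls into; this is a sketch of the behaviour, not a recommendation for specific cut-offs.

```python
import pandas as pd

bmi = pd.Series([18.4, 18.5, 24.9, 25.0, 30.0])  # made-up values on and near the edges
labels = ["Underweight", "Normal", "Overweight", "Obese"]
bins = [0, 18.5, 25, 30, 100]

# Default right=True: intervals are (0, 18.5], (18.5, 25], (25, 30], (30, 100]
right_closed = pd.cut(bmi, bins=bins, labels=labels)
# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), [30, 100)
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({"BMI": bmi, "right=True": right_closed, "right=False": left_closed})
print(comparison)
```

Values outside the outermost bin edges become NaN, so it is worth checking the new column for missing values after binning.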

In [None]:
print("🎨 Creating New Features...\n")

# 1. Age groups (can reveal age-related patterns)
print("1️⃣ Creating age groups...")
df_clean['Age_Group'] = pd.cut(df_clean['Age [years]'], 
                                bins=[0, 30, 45, 60, 100], 
                                labels=['Young', 'Middle', 'Senior', 'Elderly'])
print(f"   ✓ Age groups created: {df_clean['Age_Group'].value_counts().to_dict()}")

# 2. BMI categories (WHO standard: <18.5, 18.5-24.9, 25-29.9, >=30)
# right=False makes each bin include its lower edge, so a BMI of exactly 25 is 'Overweight'
print("\n2️⃣ Creating BMI categories...")
df_clean['BMI_Category'] = pd.cut(df_clean['BMI'], 
                                   bins=[0, 18.5, 25, 30, 100], 
                                   labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
                                   right=False)
print(f"   ✓ BMI categories created")

# 3. Protein-to-Albumin ratio (medical significance)
print("\n3️⃣ Creating protein-to-albumin ratio...")
df_clean['Protein_Albumin_Ratio'] = df_clean['Protein Concentration [mg/ml]'] / df_clean['Albumin [g/L]']
# Division by zero yields +/-inf; treat those as missing and fill with the median
df_clean['Protein_Albumin_Ratio'] = df_clean['Protein_Albumin_Ratio'].replace([np.inf, -np.inf], np.nan)
df_clean['Protein_Albumin_Ratio'] = df_clean['Protein_Albumin_Ratio'].fillna(df_clean['Protein_Albumin_Ratio'].median())
print(f"   ✓ Protein-Albumin ratio created")

# 4. pH status (acidic, normal, or alkaline)
print("\n4️⃣ Creating pH categories...")
df_clean['pH_Status'] = pd.cut(df_clean['pH'], 
                                bins=[0, 6.8, 7.2, 14], 
                                labels=['Acidic', 'Normal', 'Alkaline'])
print(f"   ✓ pH status created")

# 5. Glucose level categories
print("\n5️⃣ Creating glucose categories...")
df_clean['Glucose_Level'] = pd.cut(df_clean['Glucose [mmol/L]'], 
                                    bins=[0, 5.5, 7, 20], 
                                    labels=['Normal', 'Prediabetic', 'Diabetic'])
print(f"   ✓ Glucose levels categorized")

# 6. Is sample quality good?
print("\n6️⃣ Creating binary quality indicator...")
df_clean['Is_Good_Quality'] = df_clean['Sample_Quality'].apply(lambda x: 1 if x == 'Good' else 0)
print(f"   ✓ Quality indicator created")

print("\n✅ Feature engineering completed!")
print(f"New dataset shape: {df_clean.shape}")

In [None]:
# Visualize some of our new features
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Age Group vs Diagnosis
age_diag = pd.crosstab(df_clean['Age_Group'], df_clean['Diagnosis'])
age_diag.plot(kind='bar', ax=axes[0, 0], color=['green', 'red'])
axes[0, 0].set_title('Diagnosis by Age Group', fontweight='bold')
axes[0, 0].set_xlabel('Age Group')
axes[0, 0].set_ylabel('Count')
axes[0, 0].legend(title='Diagnosis')
axes[0, 0].tick_params(axis='x', rotation=45)

# Plot 2: BMI Category vs Diagnosis
bmi_diag = pd.crosstab(df_clean['BMI_Category'], df_clean['Diagnosis'])
bmi_diag.plot(kind='bar', ax=axes[0, 1], color=['green', 'red'])
axes[0, 1].set_title('Diagnosis by BMI Category', fontweight='bold')
axes[0, 1].set_xlabel('BMI Category')
axes[0, 1].set_ylabel('Count')
axes[0, 1].legend(title='Diagnosis')
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Protein-Albumin Ratio distribution
df_clean.boxplot(column='Protein_Albumin_Ratio', by='Diagnosis', ax=axes[1, 0])
axes[1, 0].set_title('Protein-Albumin Ratio by Diagnosis', fontweight='bold')
axes[1, 0].set_xlabel('Diagnosis')
axes[1, 0].set_ylabel('Protein-Albumin Ratio')
plt.suptitle('')

# Plot 4: Glucose Level vs Diagnosis
glucose_diag = pd.crosstab(df_clean['Glucose_Level'], df_clean['Diagnosis'])
glucose_diag.plot(kind='bar', ax=axes[1, 1], color=['green', 'red'])
axes[1, 1].set_title('Diagnosis by Glucose Level', fontweight='bold')
axes[1, 1].set_xlabel('Glucose Level')
axes[1, 1].set_ylabel('Count')
axes[1, 1].legend(title='Diagnosis')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("💡 These visualizations help us understand which features might be useful for prediction!")

---
## 🎯 STEP 4: Data Preparation for ML (10 minutes)

Before training models, we need to prepare our data in the right format.
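
Preparation steps such as imputation and scaling learn statistics (medians, means, standard deviations) from the data they are fitted on. A common way to keep test data out of those statistics is to wrap the steps in a scikit-learn `Pipeline`, which fits every step on the training split only. The sketch below uses synthetic data and hypothetical names (`X_demo`, `demo_pipeline`); it illustrates the pattern and is not the workflow this notebook itself follows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 4 features, label driven by the first two features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians learned from X_tr only
    ("scale", StandardScaler()),                   # mean/std learned from X_tr only
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_pipeline.fit(X_tr, y_tr)
print(f"Held-out accuracy: {demo_pipeline.score(X_te, y_te):.3f}")
```

Because the fitted pipeline carries its own imputer and scaler, saving that single object is enough to preprocess new samples consistently.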

In [None]:
print("🎯 Preparing Data for Machine Learning...\n")

# Step 1: Select features for modeling
print("1️⃣ Selecting features...")

# Select numerical features
numerical_features = [
    'Age [years]', 'Height', 'Weight', 'BMI',
    'Protein Concentration [mg/ml]', 'Hydrophobicity [score]',
    'pH', 'Glucose [mmol/L]', 'Albumin [g/L]',
    'Protein_Albumin_Ratio', 'Is_Good_Quality'
]

# Select categorical features that we'll encode
categorical_features = ['Sex', 'Age_Group', 'BMI_Category', 'pH_Status', 'Glucose_Level']

# Combine all features
all_features = numerical_features + categorical_features

print(f"   ✓ Selected {len(numerical_features)} numerical features")
print(f"   ✓ Selected {len(categorical_features)} categorical features")
print(f"   Total features: {len(all_features)}")

# Step 2: Prepare target variable
print("\n2️⃣ Preparing target variable...")
# Convert target to binary (0 = Healthy, 1 = Tumor)
# Note: any value other than 'Tumor', including a missing diagnosis, is mapped to 0
df_clean['Target'] = df_clean['Diagnosis'].apply(lambda x: 1 if x == 'Tumor' else 0)
print(f"   ✓ Target created: 0=Healthy, 1=Tumor")
print(f"   Distribution: {df_clean['Target'].value_counts().to_dict()}")

In [None]:
# Step 3: Encode categorical variables
print("\n3️⃣ Encoding categorical variables...")

# One-hot encoding for categorical features
df_encoded = df_clean.copy()

for col in categorical_features:
    if col in df_encoded.columns:
        # Create dummy variables
        dummies = pd.get_dummies(df_encoded[col], prefix=col, drop_first=True)
        df_encoded = pd.concat([df_encoded, dummies], axis=1)
        print(f"   ✓ Encoded '{col}' into {len(dummies.columns)} features")

print(f"\n   Dataset shape after encoding: {df_encoded.shape}")

# Step 4: Create feature matrix (X) and target vector (y)
print("\n4️⃣ Creating feature matrix and target vector...")

# Get all feature columns (numerical + encoded categorical)
feature_columns = numerical_features.copy()
for col in categorical_features:
    # Add the encoded dummy columns
    feature_columns.extend([c for c in df_encoded.columns if c.startswith(col + '_')])

# Keep only the columns that actually exist in the encoded dataframe
feature_columns = [c for c in feature_columns if c in df_encoded.columns]

X = df_encoded[feature_columns].copy()
y = df_encoded['Target'].copy()

print(f"   ✓ Feature matrix X: {X.shape}")
print(f"   ✓ Target vector y: {y.shape}")
print(f"   ✓ Total features for modeling: {X.shape[1]}")

In [None]:
# Step 5: Split data into training and testing sets
print("\n5️⃣ Splitting data into train and test sets...")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing, 80% for training
    random_state=42,     # For reproducibility
    stratify=y          # Maintain class proportions
)

print(f"   ✓ Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   ✓ Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\n   Training set class distribution:")
print(f"      Healthy: {(y_train == 0).sum()} | Tumor: {(y_train == 1).sum()}")
print(f"   Test set class distribution:")
print(f"      Healthy: {(y_test == 0).sum()} | Tumor: {(y_test == 1).sum()}")

# Step 6: Scale features (important for many ML algorithms)
print("\n6️⃣ Scaling features...")

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"   ✓ Features scaled using StandardScaler")
print(f"   ✓ Training features now have mean ≈ 0 and standard deviation ≈ 1")

print("\n✅ Data preparation completed! Ready for modeling!")

---
## 🤖 STEP 5: Model Training (25 minutes)

Now the exciting part! We'll train multiple ML models and see which performs best.

### Why Multiple Models?
Different algorithms have different strengths. By trying several, we find the best fit for our data!
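
A single train/test split can favour one algorithm by chance, so comparisons are often made with cross-validation. The following is a minimal sketch on synthetic data from `make_classification`; the names `X_cv`, `y_cv`, and `candidates` are illustrative, and only two of the workshop's model types are shown to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the workshop dataset).
X_cv, y_cv = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    # Scaling lives inside the pipeline so it is re-fitted on each training fold.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_scores = {
    name: cross_val_score(estimator, X_cv, y_cv, cv=cv, scoring="f1")
    for name, estimator in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.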

In [None]:
# Import ML models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve

print("✅ All required models and metrics imported!")

In [None]:
# Define our models to train
print("🤖 Initializing Machine Learning Models...\n")

models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Support Vector Machine': SVC(probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

print(f"📋 Models to train: {len(models)}")
for i, name in enumerate(models.keys(), 1):
    print(f"   {i}. {name}")

print("\n💡 Brief explanation of each model:")
print("   • Logistic Regression: Linear model, fast and interpretable")
print("   • Decision Tree: Rule-based model, easy to visualize")
print("   • Random Forest: Ensemble of decision trees, robust")
print("   • Gradient Boosting: Sequential tree learning, powerful")
print("   • SVM: Finds optimal decision boundary")
print("   • K-Nearest Neighbors: Classification by similarity")
print("   • Naive Bayes: Probabilistic model based on Bayes theorem")

In [None]:
# Train all models and collect results
print("\n🏋️ Training Models...\n")
print("=" * 70)

results = {}
trained_models = {}

import time

for name, model in models.items():
    print(f"\n🔄 Training {name}...")
    
    # Record training time
    start_time = time.time()
    
    # Train the model
    model.fit(X_train_scaled, y_train)
    
    training_time = time.time() - start_time
    
    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Store results
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'Training Time': training_time
    }
    
    # Store trained model
    trained_models[name] = model
    
    print(f"   ✓ Completed in {training_time:.3f} seconds")
    print(f"   Accuracy: {accuracy:.3f} | Precision: {precision:.3f} | Recall: {recall:.3f} | F1: {f1:.3f}")

print("\n" + "=" * 70)
print("✅ All models trained successfully!")

---
## 📊 STEP 6: Model Evaluation (10 minutes)

Let's compare all our models and understand their performance!
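
Precision, recall, and F1 all derive from the four cells of the confusion matrix. As a small worked sketch on made-up labels (`y_true_demo` and `y_hat_demo` are illustrative, with 1 standing for the positive class), the metrics can be computed by hand and checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 5 negatives and 5 positives, with one false positive and one false negative.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

precision_manual = tp / (tp + fp)   # of predicted positives, the share that is correct
recall_manual = tp / (tp + fn)      # of actual positives, the share that was found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision_manual:.2f} recall={recall_manual:.2f} f1={f1_manual:.2f}")
print("matches sklearn:",
      np.isclose(precision_manual, precision_score(y_true_demo, y_hat_demo)),
      np.isclose(recall_manual, recall_score(y_true_demo, y_hat_demo)),
      np.isclose(f1_manual, f1_score(y_true_demo, y_hat_demo)))
```

In a diagnostic setting the false negatives (`fn`) are the missed cases, which is why recall gets particular attention.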

In [None]:
# Create a comprehensive results table
print("📊 Model Performance Comparison\n")
print("=" * 90)

results_df = pd.DataFrame(results).T
results_df = results_df.round(4)
results_df = results_df.sort_values('Accuracy', ascending=False)

# Display the results
display(results_df)

print("\n💡 Understanding the Metrics:")
print("   • Accuracy: Overall correctness (how many predictions were right)")
print("   • Precision: Of predicted positives, how many were actually positive")
print("   • Recall: Of actual positives, how many did we catch")
print("   • F1-Score: Balance between Precision and Recall (harmonic mean)")
print("   • Training Time: How long it took to train the model")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Plot 1: Accuracy Comparison
results_df['Accuracy'].plot(kind='barh', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_xlabel('Accuracy Score')
axes[0, 0].set_title('Model Accuracy Comparison', fontweight='bold', fontsize=12)
axes[0, 0].set_xlim([0, 1])
axes[0, 0].axvline(x=results_df['Accuracy'].mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].legend()

# Plot 2: Precision vs Recall
axes[0, 1].scatter(results_df['Recall'], results_df['Precision'], s=200, alpha=0.6, c='coral')
for idx, name in enumerate(results_df.index):
    axes[0, 1].annotate(name, 
                        (results_df['Recall'].iloc[idx], results_df['Precision'].iloc[idx]),
                        fontsize=8, ha='center')
axes[0, 1].set_xlabel('Recall')
axes[0, 1].set_ylabel('Precision')
axes[0, 1].set_title('Precision vs Recall Trade-off', fontweight='bold', fontsize=12)
axes[0, 1].set_xlim([0, 1])
axes[0, 1].set_ylim([0, 1])
axes[0, 1].grid(alpha=0.3)

# Plot 3: F1-Score Comparison
results_df['F1-Score'].plot(kind='barh', ax=axes[1, 0], color='lightgreen')
axes[1, 0].set_xlabel('F1-Score')
axes[1, 0].set_title('Model F1-Score Comparison', fontweight='bold', fontsize=12)
axes[1, 0].set_xlim([0, 1])
axes[1, 0].axvline(x=results_df['F1-Score'].mean(), color='red', linestyle='--', label='Mean')
axes[1, 0].legend()

# Plot 4: Training Time
results_df['Training Time'].plot(kind='barh', ax=axes[1, 1], color='plum')
axes[1, 1].set_xlabel('Training Time (seconds)')
axes[1, 1].set_title('Model Training Time', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

print("\n💡 These visualizations help us see performance trade-offs at a glance!")

In [None]:
# Detailed analysis of the best model
best_model_name = results_df.index[0]
best_model = trained_models[best_model_name]

print(f"\n🏆 Best Performing Model (by accuracy): {best_model_name}")
print("=" * 70)

# Make predictions with best model
y_pred_best = best_model.predict(X_test_scaled)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot confusion matrix
im = axes[0].imshow(cm, cmap='Blues')
axes[0].set_title(f'Confusion Matrix - {best_model_name}', fontweight='bold', fontsize=12)
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['Healthy', 'Tumor'])
axes[0].set_yticklabels(['Healthy', 'Tumor'])

# Add text annotations
for i in range(2):
    for j in range(2):
        text = axes[0].text(j, i, cm[i, j], ha="center", va="center", 
                           color="white" if cm[i, j] > cm.max()/2 else "black",
                           fontsize=16, fontweight='bold')

plt.colorbar(im, ax=axes[0])

# Classification Report
report = classification_report(y_test, y_pred_best, target_names=['Healthy', 'Tumor'], output_dict=True)
report_df = pd.DataFrame(report).transpose()

axes[1].axis('off')
table = axes[1].table(cellText=report_df.round(2).values,
                     colLabels=report_df.columns,
                     rowLabels=report_df.index,
                     cellLoc='center',
                     loc='center',
                     bbox=[0, 0, 1, 1])
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1, 2)
axes[1].set_title(f'Classification Report - {best_model_name}', 
                 fontweight='bold', fontsize=12, pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Confusion Matrix shows:")
print(f"   • True Negatives (Healthy→Healthy): {cm[0,0]}")
print(f"   • False Positives (Healthy→Tumor): {cm[0,1]}")
print(f"   • False Negatives (Tumor→Healthy): {cm[1,0]}")
print(f"   • True Positives (Tumor→Tumor): {cm[1,1]}")

---
## 🏆 STEP 7: Model Selection & Final Thoughts (5 minutes)

Time to make our final decision and understand what we learned!
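
The selection rule used in this step can be stated compactly: take the model with the best F1-score, unless its recall falls below the upper quartile of all recalls, in which case take the model with the best recall. The sketch below isolates that rule on a made-up results table; `pick_model`, `toy_results`, and the scores are hypothetical and only illustrate the logic.

```python
import pandas as pd

# Made-up scores for four hypothetical models.
toy_results = pd.DataFrame(
    {"F1-Score": [0.90, 0.88, 0.85, 0.80], "Recall": [0.84, 0.93, 0.90, 0.78]},
    index=["Model A", "Model B", "Model C", "Model D"],
)


def pick_model(results: pd.DataFrame, recall_quantile: float = 0.75) -> str:
    """Prefer the best-F1 model unless its recall is below the given quantile of recalls."""
    best_f1 = results["F1-Score"].idxmax()
    best_recall = results["Recall"].idxmax()
    recall_floor = results["Recall"].quantile(recall_quantile)
    return best_f1 if results.loc[best_f1, "Recall"] >= recall_floor else best_recall


print(pick_model(toy_results))
```

Here "Model A" has the best F1-score but a recall below the upper quartile, so the rule falls back to the best-recall model. Writing the rule as a function keeps the recommendation and the saved model in agreement.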

In [None]:
# Final Model Selection
print("🎯 MODEL SELECTION CRITERIA")
print("=" * 70)
print("\nWhen choosing the best model, consider:\n")
print("1️⃣ ACCURACY: Overall performance - highest is often best")
print("2️⃣ F1-SCORE: Balance of precision and recall - important for imbalanced data")
print("3️⃣ RECALL: Critical in medical diagnosis - we don't want to miss tumors!")
print("4️⃣ PRECISION: Avoid false alarms - but less critical than recall here")
print("5️⃣ TRAINING TIME: Efficiency matters for deployment")
print("6️⃣ INTERPRETABILITY: Can stakeholders understand the model?")

print("\n" + "=" * 70)
print("🏆 FINAL MODEL RECOMMENDATION")
print("=" * 70)

# Get top 3 models by accuracy
top_3 = results_df.head(3)

print("\n📊 Top 3 Models by Accuracy:")
for idx, (model_name, metrics) in enumerate(top_3.iterrows(), 1):
    print(f"\n{idx}. {model_name}")
    print(f"   Accuracy:  {metrics['Accuracy']:.4f}")
    print(f"   F1-Score:  {metrics['F1-Score']:.4f}")
    print(f"   Recall:    {metrics['Recall']:.4f}")
    print(f"   Precision: {metrics['Precision']:.4f}")
    print(f"   Time:      {metrics['Training Time']:.4f}s")

# Select best model based on F1-score (balanced metric)
best_by_f1 = results_df.sort_values('F1-Score', ascending=False).index[0]
best_by_recall = results_df.sort_values('Recall', ascending=False).index[0]

print("\n" + "=" * 70)
print("🎯 RECOMMENDED MODEL:")
print("=" * 70)
print(f"\n✅ Best Overall (F1-Score): {best_by_f1}")
print(f"   → Balanced performance across all metrics")
print(f"\n⚕️  Best for Medical Use (Recall): {best_by_recall}")
print(f"   → Minimizes missed tumor cases (false negatives)")

print("\n" + "=" * 70)
print("💡 KEY TAKEAWAYS:")
print("=" * 70)
print("""
For our biotech diagnostic task, we recommend: {}

REASONING:
• Medical diagnosis requires HIGH RECALL (catching all tumors is critical)
• F1-Score ensures we balance precision and recall
• The model shows consistent performance across metrics
• Training time is reasonable for practical deployment

NEXT STEPS:
• Validate on additional test data
• Fine-tune hyperparameters for even better performance
• Deploy in a production environment
• Monitor performance over time
""".format(best_by_f1 if results_df.loc[best_by_f1, 'Recall'] >= results_df['Recall'].quantile(0.75) else best_by_recall))

In [None]:
# Save the best model and preprocessing components
import joblib
import os

# Create models directory if it doesn't exist
os.makedirs('models', exist_ok=True)

final_model_name = best_by_f1 if results_df.loc[best_by_f1, 'Recall'] >= results_df['Recall'].quantile(0.75) else best_by_recall
final_model = trained_models[final_model_name]

# Save model
model_filename = 'models/best_model.joblib'
joblib.dump(final_model, model_filename)

# Save scaler (important for preprocessing new data!)
scaler_filename = 'models/scaler.joblib'
joblib.dump(scaler, scaler_filename)

# Save feature names (important to know what features the model expects!)
feature_names_filename = 'models/feature_names.joblib'
joblib.dump(X_train.columns.tolist(), feature_names_filename)

# Save model metadata
from datetime import date

metadata = {
    'model_name': final_model_name,
    'model_type': type(final_model).__name__,
    'n_features': X_train.shape[1],
    'feature_names': X_train.columns.tolist(),
    'performance': results_df.loc[final_model_name].to_dict(),
    'training_date': date.today().isoformat()  # record the date the cell is actually run
}
metadata_filename = 'models/model_metadata.joblib'
joblib.dump(metadata, metadata_filename)

print(f"✅ Model saved as '{model_filename}'")
print(f"✅ Scaler saved as '{scaler_filename}'")
print(f"✅ Feature names saved as '{feature_names_filename}'")
print(f"✅ Metadata saved as '{metadata_filename}'")
print(f"\n📦 Model Details:")
print(f"   Name: {final_model_name}")
print(f"   Type: {type(final_model).__name__}")
print(f"   Features: {X_train.shape[1]}")
print(f"\n💾 To load this model later, use:")
print(f"   loaded_model = joblib.load('{model_filename}')")
print(f"   scaler = joblib.load('{scaler_filename}')")
print(f"   feature_names = joblib.load('{feature_names_filename}')")

---
# Appendix: BONUS

This appendix contains optional bonus cells: feature importance analysis and a short example showing how to make a single prediction with the saved artifacts.
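
Not every model exposes `feature_importances_` (K-Nearest Neighbors and SVMs, for example, do not). One model-agnostic alternative is permutation importance from `sklearn.inspection`, which measures how much the score drops when a single feature is shuffled. The sketch below runs on synthetic data with illustrative names (`X_pi`, `knn`, `pi_result`); it is a different technique from the tree-based importances used in the bonus cell and is shown only as an option.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 6 features, of which 3 carry signal.
X_pi, y_pi = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_pi_train, X_pi_test, y_pi_train, y_pi_test = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0, stratify=y_pi
)

# KNN has no feature_importances_ attribute, so it is a natural example here.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_pi_train, y_pi_train)

# Shuffle each feature 10 times on held-out data and record the drop in accuracy.
pi_result = permutation_importance(knn, X_pi_test, y_pi_test, n_repeats=10, random_state=0)

for idx in pi_result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {pi_result.importances_mean[idx]:.3f} "
          f"+/- {pi_result.importances_std[idx]:.3f}")
```

Computing the importances on held-out data shows which features the model relies on for generalization rather than for fitting the training set.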

In [None]:
# BONUS: Feature Importance Analysis (if time permits)
# Use the final model if it exposes feature_importances_; otherwise fall back to the Decision Tree.
importance_model, importance_label = None, None
if 'final_model' in globals() and hasattr(final_model, 'feature_importances_'):
    importance_model, importance_label = final_model, final_model_name
elif 'trained_models' in globals() and 'Decision Tree' in trained_models:
    print("⚠️ Feature importances not available for the current model.")
    importance_model, importance_label = trained_models['Decision Tree'], 'Decision Tree (used instead)'

if importance_model is not None and 'X_train' in globals():
    print(f"🔍 BONUS: Feature Importance Analysis for {importance_label}\n")
    importance_df = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': importance_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    display(importance_df.head(15))
    # DataFrame.plot creates its own figure, so pass figsize here instead of calling plt.figure()
    ax = importance_df.head(15).plot(kind='barh', x='Feature', y='Importance',
                                     legend=False, color='steelblue', figsize=(10, 8))
    ax.invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ No model with feature importances is available.")

In [None]:
# BONUS: Quick prediction example using saved artifacts
if os.path.exists('models/best_model.joblib'):
    print('🔮 BONUS: Making a single prediction with the saved model')
    loaded_model = joblib.load('models/best_model.joblib')
    loaded_scaler = joblib.load('models/scaler.joblib')
    feature_names = joblib.load('models/feature_names.joblib')
    # Create a template using the first row of X_train if available
    if 'X_train' in globals():
        # Select columns by the saved names so their order matches what the model expects
        new_sample = X_train[feature_names].iloc[0:1].copy()
        new_sample_scaled = loaded_scaler.transform(new_sample)
        pred = loaded_model.predict(new_sample_scaled)
        # Not every estimator implements predict_proba
        proba = loaded_model.predict_proba(new_sample_scaled) if hasattr(loaded_model, 'predict_proba') else None
        print(f"Predicted class: {pred[0]}")
        if proba is not None:
            print(f"Probabilities: {proba[0]}")
    else:
        print('No X_train available to build a quick sample. Use the prediction demo notebook instead.')
else:
    print('Saved model artifacts not found in models/. Run the training cells first.')
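
The cell above reuses a training row as its sample. For a genuinely new record, the same three artifacts apply in the same order: align the columns to the saved feature names, scale, then predict. The sketch below illustrates this on synthetic data. The column names (`age`, `glucose`, `albumin`), the temporary directory, and the logistic regression model are stand-ins chosen for the example, not the workshop's actual features or final model.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (hypothetical column names)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.integers(20, 80, size=100),
    "glucose": rng.normal(100, 15, size=100),
    "albumin": rng.normal(4.0, 0.5, size=100),
})
y_demo = (X_demo["glucose"] > 100).astype(int)

# Fit and save the three artifacts, mirroring the save cell
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression().fit(demo_scaler.transform(X_demo), y_demo)

artifact_dir = Path(tempfile.mkdtemp())
joblib.dump(demo_model, artifact_dir / "best_model.joblib")
joblib.dump(demo_scaler, artifact_dir / "scaler.joblib")
joblib.dump(X_demo.columns.tolist(), artifact_dir / "feature_names.joblib")

# Later (for example, in another notebook): load the artifacts
loaded_model = joblib.load(artifact_dir / "best_model.joblib")
loaded_scaler = joblib.load(artifact_dir / "scaler.joblib")
feature_names = joblib.load(artifact_dir / "feature_names.joblib")

# A new record; the key order does not need to match the training columns
new_record = {"glucose": 130.0, "albumin": 3.8, "age": 55}

# Reindex to the saved column order; a missing feature would appear as NaN
new_sample = pd.DataFrame([new_record]).reindex(columns=feature_names)
if new_sample.isna().any().any():
    raise ValueError("New record is missing one or more expected features")

new_sample_scaled = loaded_scaler.transform(new_sample)
pred = loaded_model.predict(new_sample_scaled)
proba = loaded_model.predict_proba(new_sample_scaled)
print(f"Predicted class: {pred[0]}")
print(f"Probabilities: {proba[0]}")
```

Reindexing by the saved names is what makes `feature_names.joblib` useful: the scaler and model only see positions, so a silently reordered column would produce a wrong prediction rather than an error.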

---
## 🎓 Workshop Summary & Key Learnings

### 🌟 Congratulations! You've completed a full ML workflow!

---

### 📚 What You Learned:

#### 1. **Data Exploration** (15 min)
- How to load and inspect datasets
- Identifying data types and missing values
- Understanding target variable distribution
- Basic statistical analysis

#### 2. **Data Cleaning** (20 min)
- Standardizing units and formats
- Handling inconsistent text data
- Converting data types
- Dealing with missing values using median/mode strategies
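
As a reminder of the median/mode strategy, here is a minimal sketch on a made-up five-row table. The column names and values are illustrative only. In a real project the fill values would normally be computed on the training split and reused on the test split.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with gaps in a numeric and a categorical column
df_demo = pd.DataFrame({
    "glucose": [90.0, np.nan, 110.0, 250.0, 95.0],
    "sex": ["F", "M", None, "F", "F"],
})

# Median is robust to the 250.0 outlier; the mean would be pulled upward
glucose_median = df_demo["glucose"].median()
df_demo["glucose"] = df_demo["glucose"].fillna(glucose_median)

# Mode (most frequent value) for the categorical column
sex_mode = df_demo["sex"].mode()[0]
df_demo["sex"] = df_demo["sex"].fillna(sex_mode)

print(f"Median used for glucose: {glucose_median}")  # 102.5
print(f"Mode used for sex: {sex_mode}")              # F
print(df_demo)
```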

#### 3. **Feature Engineering** (15 min)
- Creating categorical features (age groups, BMI categories)
- Computing derived features (ratios, indicators)
- Using domain knowledge to enhance data
- Visualizing feature relationships
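
The two kinds of engineered features listed above can be sketched in a few lines with `pd.cut`. The column names, units, and bin edges below are assumptions made for the example; the workshop dataset may use different ones.

```python
import pandas as pd

# Hypothetical columns and units
df_demo = pd.DataFrame({
    "age": [25, 47, 68],
    "height_cm": [170.0, 160.0, 180.0],
    "weight_kg": [65.0, 80.0, 75.0],
})

# Derived feature: BMI = weight (kg) / height (m) squared
df_demo["bmi"] = df_demo["weight_kg"] / (df_demo["height_cm"] / 100) ** 2

# Categorical feature from age ranges: (0, 40], (40, 60], (60, 120]
df_demo["age_group"] = pd.cut(
    df_demo["age"], bins=[0, 40, 60, 120], labels=["young", "middle", "senior"]
)

# Categorical feature from BMI ranges
df_demo["bmi_category"] = pd.cut(
    df_demo["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals are [low, high), so a BMI of exactly 25 is "overweight"
)

print(df_demo[["age", "age_group", "bmi", "bmi_category"]])
```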

#### 4. **Data Preparation** (10 min)
- Selecting relevant features
- Encoding categorical variables
- Splitting data into train/test sets
- Scaling features so that large-valued columns do not dominate distance- and gradient-based algorithms
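
One detail worth keeping in mind from this step is the order of operations: split first, then fit the scaler on the training split only, so no information from the test set leaks into preprocessing. A minimal sketch on synthetic data follows; the array shapes and the 75/25 class ratio are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y_demo = np.array([0] * 150 + [1] * 50)               # imbalanced labels (75% / 25%)

# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(f"Train shape: {X_tr.shape}, positive rate: {y_tr.mean():.2f}")
print(f"Test shape:  {X_te.shape}, positive rate: {y_te.mean():.2f}")
```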

#### 5. **Model Training** (25 min)
- Training 7 different ML algorithms
- Understanding algorithm differences
- Measuring training time
- Making predictions

#### 6. **Model Evaluation** (10 min)
- Computing multiple metrics (accuracy, precision, recall, F1)
- Creating performance visualizations
- Analyzing confusion matrices
- Understanding metric trade-offs
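
To make the trade-offs concrete, the four metrics can be computed by hand from confusion-matrix counts. The counts below are invented for illustration and are not results from the workshop models.

```python
# Hypothetical confusion-matrix counts for a Tumor-vs-Healthy classifier
tp, fn = 40, 10   # tumors caught / tumors missed
fp, tn = 5, 145   # healthy flagged as tumor / healthy correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted tumors, how many were real
recall = tp / (tp + fn)     # of the real tumors, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.925
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1:        {f1:.3f}")         # 0.842
```

Accuracy looks strong at 0.925, yet one tumor in five is missed (recall 0.800). This gap is why accuracy alone can mislead on imbalanced data.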

#### 7. **Model Selection** (5 min)
- Comparing models systematically
- Choosing based on problem requirements
- Documenting decisions
- Saving final model
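
The selection rule used in the save cell (take the best model by F1, unless its recall falls below the 75th percentile of recall across all models, in which case take the best model by recall) can be traced on a small table. The scores below are invented, and the `F1` column name is an assumption; only the `Recall` column and the model-name index are taken from the notebook's `results_df`.

```python
import pandas as pd

# Hypothetical scores; the index holds model names, as in results_df
results_demo = pd.DataFrame(
    {"F1": [0.84, 0.86, 0.80], "Recall": [0.90, 0.78, 0.92]},
    index=["Random Forest", "SVM", "Logistic Regression"],
)

best_by_f1 = results_demo["F1"].idxmax()
best_by_recall = results_demo["Recall"].idxmax()
recall_floor = results_demo["Recall"].quantile(0.75)

# Keep the F1 winner only if its recall clears the floor
if results_demo.loc[best_by_f1, "Recall"] >= recall_floor:
    final_name = best_by_f1
else:
    final_name = best_by_recall

print(f"Best by F1: {best_by_f1}, best by recall: {best_by_recall}")
print(f"Recall floor (75th percentile): {recall_floor:.2f}")
print(f"Selected: {final_name}")
```

Here the F1 winner (SVM) has a recall of 0.78, below the floor of 0.91, so the rule falls back to the model with the highest recall.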

---

### 🎯 Key Takeaways:

✅ **ML is iterative**: Real projects involve multiple cycles of improvement

✅ **Data quality matters**: Clean, well-prepared data is crucial for success

✅ **No single best model**: Different algorithms work better for different problems

✅ **Context is king**: In medical diagnosis, recall usually comes first, because missing a tumor costs more than a false alarm

✅ **Evaluate comprehensively**: Use multiple metrics, not just accuracy

✅ **Document everything**: Clear documentation helps reproducibility

---

### 🚀 Next Steps to Continue Learning:

1. **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV
2. **Cross-Validation**: Use k-fold CV for more robust evaluation
3. **Feature Selection**: Identify the most important features
4. **Handle Class Imbalance**: Try SMOTE, class weights, or threshold tuning
5. **Ensemble Methods**: Combine multiple models (stacking, voting)
6. **Deep Learning**: Explore neural networks for more complex patterns
7. **Model Deployment**: Learn Flask, FastAPI, or cloud platforms
8. **MLOps**: Study model monitoring and continuous improvement
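
As a starting point for the first, second, and fourth items, the sketch below combines a grid search with stratified k-fold cross-validation and tries class weights as one of the tuned options. It runs on synthetic data from `make_classification`; the parameter grid, the choice of logistic regression, and scoring by recall are assumptions made for the example rather than recommendations for the workshop dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the workshop data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefix "clf__" routes each parameter to the classifier
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "clf__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X_demo, y_demo)

print("Best params:", search.best_params_)
print(f"Best mean CV recall: {search.best_score_:.3f}")
```

Optimizing recall alone can favor models that flag almost everything as positive, so it is worth checking precision or F1 for the winning configuration as well.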

---

### 📖 Recommended Resources:

- **Scikit-learn Documentation**: https://scikit-learn.org/
- **Kaggle**: Practice on real datasets
- **Google's ML Crash Course**: Free online course
- **Papers with Code**: Latest ML research
- **Towards Data Science**: Practical ML tutorials

---

### 💪 Practice Exercise Ideas:

1. Try different feature engineering approaches
2. Experiment with hyperparameter tuning
3. Build a simple web app to use your model
4. Work with a different dataset
5. Implement custom evaluation metrics

---

## 🎉 Thank you for participating!

You now have a solid foundation in machine learning workflows. Keep practicing and building projects!