<a href="https://colab.research.google.com/github/cconsta1/ML_classification_YouTube/blob/main/ML_classification_YouTube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab Tutorial: Basic ML Classification with scikit-learn, PyTorch, and a Dash web app

This Colab notebook demonstrates a full workflow you can use to build ML classification models and deploy them as a web app:
- Generate synthetic data (make_classification) with 10000 points, 4 features and 2 classes (Yes / No).
- Train and evaluate XGBoost, Random Forest, Logistic Regression, and scikit-learn's MLPClassifier.
- Implement a comparable, simple deep neural network in PyTorch; train and evaluate it.
- Compare the models, choose the two best, and save them.
- Build an interactive web app (Dash) that runs inside Colab, accepts 4 feature inputs, and returns a prediction from any of the selected models.


## 1. Setup: Install packages and imports


- We'll install the one required package, jupyter-dash; everything else is pre-installed in Colab.
- Then we import common libraries and set random seeds for reproducibility.
- Running this cell prepares the Colab runtime with the libraries used later.
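
As a small illustration of why the seeds matter, the sketch below uses only Python's standard `random` module (the helper name `draw` is made up for this example). Re-seeding with the same value reproduces the same sequence of draws; NumPy and PyTorch generators behave analogously for their own seeds.

```python
import random

def draw(seed, n=5):
    """Seed Python's RNG and return n pseudo-random floats (illustrative helper)."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(42)
second = draw(42)
different = draw(43)

print(first == second)     # same seed, same sequence
print(first == different)  # different seed, different sequence
```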

In [None]:
# Install required packages (run once in Colab)
!pip install -q jupyter-dash

# Imports and seeds
import numpy as np
import pandas as pd
import random
import joblib
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

# For Dash app later
from jupyter_dash import JupyterDash
import dash
from dash import dcc, html, Input, Output, State
import plotly.express as px

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)


## 2. Data generation: make_classification

Explanation:
- We generate a binary classification dataset with 10000 samples, 4 features, 4 informative features, 0 redundant, 2 classes.
- We map labels 0/1 to "No"/"Yes" for display purposes, but models will train on numerical labels.
- We'll split into train and test sets (80/20, stratified) and scale features using StandardScaler, fitted on the training set only (save the scaler for deployment).
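
To make the scaling step concrete, here is a minimal sketch of what standardization does, written with the standard library only. The toy values and the helper name `standardize` are invented for illustration. The key point is that the mean and standard deviation are learned from the training data and then reused unchanged on the test data.

```python
from statistics import fmean, pstdev

# Toy single-feature data (made-up values, not from the notebook's dataset)
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [3.0, 6.0]

# "Fit": learn the statistics from the training data only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train, mu, sigma)
test_scaled = standardize(test, mu, sigma)  # reuse train statistics, never refit

print(mu, sigma)
print(test_scaled)
```

After this transformation the training feature has mean 0 and unit variance, while the test feature generally does not, which is expected.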

In [None]:

# Generate data
X, y = make_classification(
    n_samples=10000,
    n_features=4,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=RANDOM_STATE
)

# Map to DataFrame for easier display if needed
df = pd.DataFrame(X, columns=[f"f{i+1}" for i in range(X.shape[1])])
df['target'] = y

# Train / test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=RANDOM_STATE, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save scaler for deployment
joblib.dump(scaler, "scaler.joblib")

print("Shapes:", X_train.shape, X_test.shape)
print("Class distribution (train):", np.bincount(y_train))


In [None]:
df.describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix
corr_matrix = df[["f1","f2","f3","f4"]].corr()

# Visualize the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Features')
plt.show()

# You can also identify highly correlated pairs programmatically
# Let's find pairs with absolute correlation greater than a threshold (e.g., 0.8)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column].abs() > 0.8)]

print("\nFeatures to potentially drop due to high correlation (threshold > 0.8):")
print(to_drop)

In [None]:
pd.DataFrame(X_train_scaled).describe()

## 3. Evaluation utilities

Explanation:
- Define a helper function to compute and print consistent evaluation metrics (accuracy, precision, recall, f1, ROC AUC).
- We'll use this for every model so the comparison is fair and standardized.
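
For reference, the sketch below computes accuracy, precision, recall, and F1 by hand from the confusion counts, using plain Python. The function name `binary_metrics` and the toy labels are made up for this example; returning 0.0 when a denominator is zero mirrors the `zero_division=0` setting used in this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary-classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (invented): 3 TP, 1 FP, 1 FN, 3 TN
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
metrics_toy = binary_metrics(y_true_toy, y_pred_toy)
print(metrics_toy)
```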

In [None]:
def evaluate_model(model_name, y_true, y_pred, y_proba=None):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    roc = np.nan
    if y_proba is not None:
        try:
            # Expect y_proba shape (n_samples, 2) or (n_samples,) for positives
            if y_proba.ndim == 2 and y_proba.shape[1] == 2:
                roc = roc_auc_score(y_true, y_proba[:,1])
            else:
                roc = roc_auc_score(y_true, y_proba.reshape(-1))
        except Exception:
            roc = np.nan
    print(f"Model: {model_name}")
    print(f" Accuracy: {acc:.4f}")
    print(f" Precision: {prec:.4f}")
    print(f" Recall: {rec:.4f}")
    print(f" F1: {f1:.4f}")
    print(f" ROC AUC: {roc:.4f}")
    print("Classification report:")
    print(classification_report(y_true, y_pred, zero_division=0))
    return {"model": model_name, "accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "roc_auc": roc}

## 4. Train classical models: Logistic Regression, Random Forest, XGBoost (evaluate on test)

Explanation:
- Train each classical model on the training set and evaluate on the held-out test set (80/20 split).
- Save the trained models for the web app.


In [None]:
# Logistic Regression
lr = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
lr.fit(X_train_scaled, y_train)
y_test_pred_lr = lr.predict(X_test_scaled)
y_test_proba_lr = lr.predict_proba(X_test_scaled)
res_lr = evaluate_model("LogisticRegression", y_test, y_test_pred_lr, y_test_proba_lr)

# Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1)
rf.fit(X_train_scaled, y_train)
y_test_pred_rf = rf.predict(X_test_scaled)
y_test_proba_rf = rf.predict_proba(X_test_scaled)
res_rf = evaluate_model("RandomForest", y_test, y_test_pred_rf, y_test_proba_rf)

# XGBoost
xgb_clf = xgb.XGBClassifier(eval_metric='logloss', random_state=RANDOM_STATE, n_estimators=200)
xgb_clf.fit(X_train_scaled, y_train)
y_test_pred_xgb = xgb_clf.predict(X_test_scaled)
y_test_proba_xgb = xgb_clf.predict_proba(X_test_scaled)
res_xgb = evaluate_model("XGBoost", y_test, y_test_pred_xgb, y_test_proba_xgb)

# Save these models (the web app loads them later)
joblib.dump(lr, "logistic_regression.joblib")
joblib.dump(rf, "random_forest.joblib")
joblib.dump(xgb_clf, "xgboost.joblib")

## 5. scikit-learn MLPClassifier (train on train, evaluate on test)

Explanation:
- Train an MLPClassifier with hidden layers (128,64,32) and evaluate on the test set.
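
For a sense of model size, the sketch below counts the weights and biases of a fully connected network with these hidden-layer widths. It assumes 4 inputs and a single output unit for the binary target (the output width is an assumption for this illustration, and the function name `count_dense_params` is made up).

```python
def count_dense_params(layer_sizes):
    """Count weights plus biases for consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total

# 4 inputs -> 128 -> 64 -> 32 -> 1 output (assumed)
sizes = [4, 128, 64, 32, 1]
n_params = count_dense_params(sizes)
print(n_params)  # 11009
```

With roughly eleven thousand parameters for 8000 training rows, this is a small network by deep-learning standards.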

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(128,64,32), activation='relu', solver='adam', max_iter=200, random_state=RANDOM_STATE)
mlp.fit(X_train_scaled, y_train)
y_test_pred_mlp = mlp.predict(X_test_scaled)
y_test_proba_mlp = mlp.predict_proba(X_test_scaled)
res_mlp = evaluate_model("Sklearn-MLP", y_test, y_test_pred_mlp, y_test_proba_mlp)

joblib.dump(mlp, "sklearn_mlp.joblib")

## 6. PyTorch network: design, training loop, and evaluation (train on train, evaluate on test)

Explanation:
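- Implement a PyTorch network with the same hidden-layer sizes as the scikit-learn MLPClassifier (128, 64, 32).
- Train on the training set and evaluate on the test set (used for final comparison and app inclusion). Because the best checkpoint is chosen by test loss, the reported test metrics can be optimistic; a separate validation split would avoid this.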
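
The network in this section outputs a single logit and is trained with `nn.BCEWithLogitsLoss`, which combines the sigmoid and binary cross-entropy in one step. The plain-Python sketch below shows the per-sample quantity being computed and one commonly used numerically stable way to write it. The function names are made up, and this is an illustration of the math rather than PyTorch's actual implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(z, y):
    """Binary cross-entropy computed from the sigmoid probability."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Algebraically equivalent form that avoids log(0) for large |z|."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# (logit, label) pairs chosen arbitrarily for the comparison
pairs = [(0.0, 1.0), (2.0, 1.0), (-3.0, 0.0), (1.5, 0.0)]
for z, y in pairs:
    print(z, y, round(bce_naive(z, y), 6), round(bce_with_logits(z, y), 6))

# The logit form stays finite where the naive form would hit log(0)
extreme = bce_with_logits(800.0, 0.0)
print(extreme)
```

Working directly on logits is the reason the model's `forward` does not apply a sigmoid; the sigmoid is only applied afterwards to turn logits into probabilities for prediction.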


In [None]:
# Define PyTorch model
class SimpleNet(nn.Module):
    def __init__(self, input_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)  # single logit for binary classification
        )
    def forward(self, x):
        # Return logits shaped (batch,) to keep compatibility with BCEWithLogitsLoss
        return self.net(x).squeeze(1)

# Prepare datasets and loaders
batch_size = 128
train_dataset = TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32), torch.tensor(y_train, dtype=torch.float32))
test_dataset = TensorDataset(torch.tensor(X_test_scaled, dtype=torch.float32), torch.tensor(y_test, dtype=torch.float32))

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet(input_dim=X_train_scaled.shape[1]).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Training loop (evaluate on test each epoch, keep best by test loss)
n_epochs = 100
best_test_loss = float('inf')
for epoch in range(1, n_epochs+1):
    model.train()
    total_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * xb.size(0)
    avg_train_loss = total_loss / len(train_loader.dataset)

    # Test evaluation
    model.eval()
    test_loss = 0.0
    preds = []
    probs = []
    with torch.no_grad():
        for xb, yb in test_loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            loss = criterion(logits, yb)
            test_loss += loss.item() * xb.size(0)

            prob = torch.sigmoid(logits).cpu().numpy().reshape(-1, 1)
            probs.append(prob)
            preds.append((prob >= 0.5).astype(int))
    avg_test_loss = test_loss / len(test_loader.dataset)
    probs_np = np.vstack(probs)
    preds_np = np.vstack(preds).flatten()
    test_acc = accuracy_score(y_test, preds_np)
    print(f"Epoch {epoch}/{n_epochs} - train_loss: {avg_train_loss:.4f}, test_loss: {avg_test_loss:.4f}, test_acc: {test_acc:.4f}")
    if avg_test_loss < best_test_loss:
        best_test_loss = avg_test_loss
        torch.save(model.state_dict(), "pytorch_net_best.pth")

# Load best model and evaluate on test
best_model = SimpleNet(input_dim=X_train_scaled.shape[1]).to(device)
best_model.load_state_dict(torch.load("pytorch_net_best.pth", map_location=device))
best_model.eval()
with torch.no_grad():
    X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32).to(device)
    logits = best_model(X_test_tensor)
    probs_test = torch.sigmoid(logits).cpu().numpy().reshape(-1, 1)
    preds_test = (probs_test >= 0.5).astype(int).reshape(-1)
res_torch = evaluate_model("PyTorch-Net", y_test, preds_test, y_proba=np.hstack([1-probs_test, probs_test]))

## 7. Compare models on the test set and pick top two (by accuracy)

Explanation:
- Collect test-set metrics for all trained models and choose the two best by accuracy.
- We'll include those two plus the PyTorch model in the web app.
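
The selection logic amounts to a multi-key descending sort: accuracy first, then F1 and ROC AUC as tie-breakers. Here is a minimal plain-Python sketch of that idea. The model names and scores are hypothetical, invented only to show the ordering, and `rank_models` is a made-up helper.

```python
# Hypothetical scores, invented purely to show the selection logic
example_results = [
    {"model": "A", "accuracy": 0.91, "f1": 0.90, "roc_auc": 0.96},
    {"model": "B", "accuracy": 0.93, "f1": 0.92, "roc_auc": 0.97},
    {"model": "C", "accuracy": 0.93, "f1": 0.93, "roc_auc": 0.95},
    {"model": "D", "accuracy": 0.89, "f1": 0.88, "roc_auc": 0.94},
]

def rank_models(results, keys=("accuracy", "f1", "roc_auc")):
    """Sort descending by the first key, breaking ties with the later keys."""
    return sorted(results, key=lambda r: tuple(r[k] for k in keys), reverse=True)

ranked = rank_models(example_results)
top_two = [r["model"] for r in ranked[:2]]
print(top_two)  # ['C', 'B']
```

Models B and C tie on accuracy, so the higher F1 puts C first.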

In [None]:
# Collect results (we computed res_* earlier on test set)
results = [res_lr, res_rf, res_xgb, res_mlp, res_torch]
results_df = pd.DataFrame(results).sort_values(by=['accuracy', 'f1', 'roc_auc'], ascending=False)
results_df.reset_index(drop=True, inplace=True)
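print(results_df)

# Choose the top 2 by accuracy among the non-PyTorch models
# (the PyTorch model is always added to the app separately)
top2 = results_df[results_df['model'] != 'PyTorch-Net'].iloc[:2]['model'].tolist()
print("Top 2 selected models (by accuracy):", top2)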


## 8. Save models and artifacts (confirm)

Explanation:
- The joblib models and the PyTorch state_dict were already saved above; this cell re-saves the joblib models and confirms that every file exists.

In [None]:
import os
# Re-save the joblib models (overwrites the earlier files), then check every artifact
joblib.dump(lr, "logistic_regression.joblib")
joblib.dump(rf, "random_forest.joblib")
joblib.dump(xgb_clf, "xgboost.joblib")
joblib.dump(mlp, "sklearn_mlp.joblib")
# pytorch saved above as pytorch_net_best.pth

for fn in ["scaler.joblib", "logistic_regression.joblib", "random_forest.joblib", "xgboost.joblib", "sklearn_mlp.joblib", "pytorch_net_best.pth"]:
    print(fn, "exists?", os.path.exists(fn))

## 9. Build a Dash web app in Colab with a vintage-modern theme

Explanation:
- The app will contain:
  - The two best scikit models (selected by accuracy) and the PyTorch model.
  - Four numeric inputs for the features.
  - A dropdown to choose among these three models.
  - A vintage color palette with a modern font ('Montserrat').

Notes:
- The app is launched inside Colab as an iframe.
- The UI styling is lightweight and inline so it works in Colab without extra files.
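
The prediction step behind the button can also be kept as a plain function, which makes it easy to check without starting a server. The sketch below is one way to do that. `predict_label`, the stand-in `toy_model`, and the fixed 0.5 threshold are assumptions made for illustration and are not part of the app code.

```python
import math

def predict_label(model_fn, features, threshold=0.5):
    """Return ("Yes"/"No", probability) for one row of four features.

    model_fn is any callable mapping a feature list to P(class == 1).
    """
    if len(features) != 4 or any(v is None for v in features):
        raise ValueError("expected four numeric feature values")
    proba = float(model_fn(features))
    label = "Yes" if proba >= threshold else "No"
    return label, proba

# Stand-in "model": a logistic function of the feature sum (hypothetical)
def toy_model(features):
    return 1.0 / (1.0 + math.exp(-sum(features)))

result_yes = predict_label(toy_model, [0.5, 1.0, -0.2, 0.3])
result_no = predict_label(toy_model, [-1.0, -1.0, 0.0, 0.0])
print(result_yes)
print(result_no)
```

Rejecting missing values up front matters because an empty numeric input box reaches the callback as `None`.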


In [None]:
# Load artifacts needed for the app and create the subset of models to include: top2 + PyTorch
scaler = joblib.load("scaler.joblib")

# Load saved scikit models
models_all = {}
if os.path.exists("logistic_regression.joblib"):
    models_all["LogisticRegression"] = joblib.load("logistic_regression.joblib")
if os.path.exists("random_forest.joblib"):
    models_all["RandomForest"] = joblib.load("random_forest.joblib")
if os.path.exists("xgboost.joblib"):
    models_all["XGBoost"] = joblib.load("xgboost.joblib")
if os.path.exists("sklearn_mlp.joblib"):
    models_all["Sklearn-MLP"] = joblib.load("sklearn_mlp.joblib")
# Keep only the top-2 models selected earlier
models_loaded = {}
for name in top2:
    if name in models_all:
        models_loaded[name] = models_all[name]

# Load the PyTorch model
if os.path.exists("pytorch_net_best.pth"):
    pyt_model = SimpleNet(input_dim=X_train_scaled.shape[1])
    pyt_model.load_state_dict(torch.load("pytorch_net_best.pth", map_location='cpu'))
    pyt_model.eval()
    models_loaded["PyTorch-Net"] = pyt_model

print("Models included in app:", list(models_loaded.keys()))

# Build and run the Dash app with vintage-modern styling
import threading
from google.colab import output as colab_output

# Minimal external stylesheet for font
external_stylesheets = ["https://fonts.googleapis.com/css2?family=Montserrat:wght@300;400;600;700&display=swap"]

app = dash.Dash(__name__, external_stylesheets=external_stylesheets, suppress_callback_exceptions=True)

# Vintage-modern color scheme
BG = "#f4efe6"        # soft cream
CARD = "#ffffff"      # card white
ACCENT = "#6b4f4f"    # muted brown
ACCENT2 = "#b26500"   # warm amber
TEXT = "#2b2b2b"

app.layout = html.Div([
    html.Link(href="https://fonts.googleapis.com/css2?family=Montserrat:wght@300;400;600;700&display=swap", rel="stylesheet"),
    html.Div([
        html.H2("ML Classifier Demo", style={"marginBottom":"5px"}),
        html.Div("Enter 4 features, choose a model, and get a prediction.", style={"marginBottom":"15px", "color":"#444"}),
        html.Div([
            html.Label("Model", style={"fontWeight":"600"}),
            dcc.Dropdown(
                id="model-dropdown",
                options=[{"label": k, "value": k} for k in models_loaded.keys()],
                value=list(models_loaded.keys())[0],
                clearable=False,
                style={"marginBottom":"12px"}
            ),
            html.Div([
                html.Div([
                    html.Label(f"Feature {i+1}", style={"fontSize":"14px", "fontWeight":"500"}),
                    dcc.Input(id=f"f{i+1}", type="number", value=0.0, step=0.01, style={"width":"100%", "padding":"6px", "border":"1px solid #ccc", "borderRadius":"4px"})
                ], style={"padding":"6px", "width":"48%"}) for i in range(4)
            ], style={"display":"flex", "flexWrap":"wrap", "gap":"8px", "justifyContent":"space-between"}),
            html.Button("Predict", id="predict-btn", n_clicks=0,
                        style={"marginTop":"12px", "backgroundColor":ACCENT2, "color":"white", "border":"none",
                               "padding":"10px 18px", "borderRadius":"6px", "fontWeight":"600", "cursor":"pointer"})
        ], style={"padding":"16px", "borderRadius":"8px", "backgroundColor":CARD, "boxShadow":"0 2px 6px rgba(0,0,0,0.05)"}),
        html.Div(id="prediction-output", style={"marginTop":20, "fontSize":"18px", "fontWeight":"600", "color":ACCENT}),
        html.Div(id="probability-output", style={"marginTop":8, "fontSize":"16px", "color":TEXT})
    ], style={"maxWidth":"720px", "margin":"20px auto", "fontFamily":"Montserrat, sans-serif", "color":TEXT})
], style={"minHeight":"600px", "backgroundColor":BG, "padding":"20px"})


@app.callback(
    [Output("prediction-output", "children"),
     Output("probability-output", "children")],
    [Input("predict-btn", "n_clicks")],
    [State("model-dropdown", "value"),
     State("f1", "value"),
     State("f2", "value"),
     State("f3", "value"),
     State("f4", "value")]
)
def predict(n_clicks, model_name, f1, f2, f3, f4):
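    if not n_clicks:
        return "No prediction yet.", ""
    # Empty input boxes arrive as None; ask the user to fill them in
    if any(v is None for v in (f1, f2, f3, f4)):
        return "Please enter all four feature values.", ""
    x = np.array([[f1, f2, f3, f4]], dtype=float)
    x_scaled = scaler.transform(x)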

    if model_name in ["LogisticRegression", "RandomForest", "XGBoost", "Sklearn-MLP"]:
        mdl = models_loaded.get(model_name)
        proba = float(mdl.predict_proba(x_scaled)[0][1])
        pred = int(mdl.predict(x_scaled)[0])
    elif model_name == "PyTorch-Net":
        mdl = models_loaded.get("PyTorch-Net")
        with torch.no_grad():
            logits = mdl(torch.tensor(x_scaled, dtype=torch.float32))
            prob = torch.sigmoid(logits).cpu().numpy()
            proba = float(np.array(prob).reshape(-1)[0])
            pred = int(proba >= 0.5)
    else:
        return "Model not available.", ""
    label = "Yes" if pred == 1 else "No"
    return f"Predicted class: {label}", f"Probability(Yes): {proba:.4f}"


# Run the Dash app in a background thread and embed as an iframe inside Colab
def _run():
    # Recent Dash versions use app.run(...); older ones used app.run_server(...)
    app.run(host="0.0.0.0", port=8050, debug=False)

thread = threading.Thread(target=_run, daemon=True)
thread.start()

# Embed the UI inside the notebook output area (colab_output was imported above)
colab_output.serve_kernel_port_as_iframe(8050, height=800)
