# Section 5: Model Deployment - Training & Saving a Churn Prediction Model

## Overview

In the previous session we trained a model for predicting churn and evaluated it. Now let's deploy it.

**What is Model Deployment?**

Deployment means taking a trained machine learning model and putting it into production where it can:
1. **Accept new data** - Receive customer information in real-time
2. **Make predictions** - Generate churn probability for new customers
3. **Serve predictions** - Return results to business applications/users
4. **Perform at scale** - Handle many requests efficiently

**Why Save Models?**

- **Reusability**: Train once, use many times without retraining
- **Consistency**: Same model behavior across different environments
- **Efficiency**: Don't retrain the entire pipeline each time
- **Version control**: Keep track of which model is in production

**This Notebook Covers:**

1. **Data preparation** - Load and preprocess churn data
2. **Model training** - Train logistic regression with cross-validation
3. **Model persistence** - Save trained model to disk using pickle
4. **Model loading** - Load saved model back into memory
5. **Making predictions** - Use loaded model on new customer data
6. **API integration** - Make HTTP requests to deployed model service

In [None]:
# Import required libraries for data processing, modeling, and evaluation

import pandas as pd  # Data manipulation and loading
import numpy as np   # Numerical computations

# Model training and validation
from sklearn.model_selection import train_test_split  # Split data into train/test
from sklearn.model_selection import KFold             # K-fold cross-validation

# Feature engineering and model training
from sklearn.feature_extraction import DictVectorizer  # Convert dicts to feature vectors
from sklearn.linear_model import LogisticRegression   # Classification model

# Model evaluation
from sklearn.metrics import roc_auc_score  # Area under ROC curve metric

In [None]:
# Step 1: Load and preprocess the churn data
df = pd.read_csv('data-week-3.csv')

# Standardize column names: convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Identify categorical (object dtype) columns
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

# Standardize categorical values: lowercase and replace spaces with underscores
# This ensures consistency in feature names when vectorizing
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

# Convert totalcharges to numeric (some values might be empty/invalid)
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
# Fill missing values with 0 (customers with no previous charges)
df.totalcharges = df.totalcharges.fillna(0)

# Convert churn target to binary: 'yes' -> 1, anything else -> 0
df.churn = (df.churn == 'yes').astype(int)

In [None]:
# Step 2: Split data into training and test sets
# We keep 20% of data for final testing (untouched during model development)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [None]:
# Step 3: Define feature lists for modeling

# Numerical features: continuous values that don't need encoding
numerical = ['tenure', 'monthlycharges', 'totalcharges']

# Categorical features: mostly text values that DictVectorizer one-hot encodes
# (seniorcitizen is already 0/1, so it passes through as a number)
categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]

In [None]:
# Step 4: Define train() function - trains logistic regression model
def train(df_train, y_train, C=1.0):
    """
    Train a logistic regression model on the given data
    
    Parameters:
    -----------
    df_train : DataFrame
        Training data with features (categorical + numerical columns)
    y_train : array-like
        Binary target variable (0 or 1)
    C : float
        Regularization parameter (inverse of regularization strength)
        Smaller C = stronger regularization (simpler model)
    
    Returns:
    --------
    dv : DictVectorizer
        Fitted vectorizer that transforms customer dicts to feature vectors
    model : LogisticRegression
        Trained logistic regression model
    """
    # Convert DataFrame rows to list of dictionaries
    # Each dict represents one customer's features
    dicts = df_train[categorical + numerical].to_dict(orient='records')

    # Initialize DictVectorizer to convert categorical/numerical dicts to numeric arrays
    dv = DictVectorizer(sparse=False)
    # Fit vectorizer on training data and transform to feature matrix
    X_train = dv.fit_transform(dicts)

    # Train logistic regression model with regularization parameter C
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    
    return dv, model
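
To make the vectorization step less of a black box, here is a dependency-free sketch of roughly what `DictVectorizer` does with a list of dicts: string values become one indicator column per `field=value` pair, and numeric values keep a single column. The helper names `fit_feature_names` and `transform` are hypothetical and exist only for this illustration; the real class also handles sparse output and other details.

```python
def fit_feature_names(records):
    """Collect column names from the training records.

    String values get one column per (field, value) pair;
    numeric values keep a single column named after the field.
    """
    names = set()
    for record in records:
        for field, value in record.items():
            if isinstance(value, str):
                names.add(f"{field}={value}")
            else:
                names.add(field)
    return sorted(names)


def transform(records, feature_names):
    """Turn each record into a list of floats ordered like feature_names."""
    index = {name: i for i, name in enumerate(feature_names)}
    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for field, value in record.items():
            is_text = isinstance(value, str)
            key = f"{field}={value}" if is_text else field
            if key in index:  # values never seen during fitting are ignored
                row[index[key]] = 1.0 if is_text else float(value)
        rows.append(row)
    return rows


train_records = [
    {"contract": "month-to-month", "tenure": 1},
    {"contract": "two_year", "tenure": 24},
]
feature_names = fit_feature_names(train_records)
X_train = transform(train_records, feature_names)

# A contract type that was absent during fitting yields all-zero indicator columns
X_new = transform([{"contract": "one_year", "tenure": 5}], feature_names)

print(feature_names)
print(X_train)
print(X_new)
```

This is why the same fitted vectorizer must be reused at prediction time: the column order and the set of known categories are fixed when it is fitted.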

In [None]:
# Step 5: Define predict() function - makes predictions on new data
def predict(df, dv, model):
    """
    Make churn probability predictions for new customers
    
    Parameters:
    -----------
    df : DataFrame
        Data to make predictions on (new customers)
    dv : DictVectorizer
        Fitted vectorizer (must be the same one used in training)
    model : LogisticRegression
        Trained model (must be trained with same dv)
    
    Returns:
    --------
    y_pred : array
        Predicted churn probability for each customer (0 to 1)
    """
    # Convert DataFrame to list of dictionaries (same format as training)
    dicts = df[categorical + numerical].to_dict(orient='records')

    # Transform using the fitted vectorizer (no fitting, just transformation)
    X = dv.transform(dicts)
    
    # Get probability predictions: model.predict_proba returns [prob_no_churn, prob_churn]
    # We take [:, 1] to get probability of churn (second column)
    y_pred = model.predict_proba(X)[:, 1]

    return y_pred
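
For a binary logistic regression, the churn probability in the second column of `predict_proba` is the sigmoid of a weighted sum of the features plus a bias. The sketch below shows that calculation in plain Python. The weights, bias, and feature vector are made-up values for illustration, not the parameters of the model trained in this notebook.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def churn_probability(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


# Hypothetical parameters: one indicator feature and one numeric feature
weights = [0.8, -0.05]
bias = -0.2
x = [1.0, 4.0]  # linear score: -0.2 + 0.8 * 1.0 - 0.05 * 4.0 = 0.4

p_churn = churn_probability(x, weights, bias)
p_stay = 1.0 - p_churn  # the first column of predict_proba

print(p_stay, p_churn)
```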

In [None]:
# Step 6: Set hyperparameters for model training

# C: Inverse of regularization strength (smaller C = stronger regularization)
# C=1.0 is the scikit-learn default; adjust based on cross-validation results
C = 1.0

# n_splits: Number of folds for k-fold cross-validation
# 5 folds is a common choice; more folds train each model on more data but take longer
n_splits = 5

In [None]:
# Step 7: Perform k-fold cross-validation to estimate model performance

# Initialize KFold splitter with shuffling for better estimation
kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)

# List to store AUC scores from each fold
scores = []

# Iterate through each fold (5 times in this case)
for train_idx, val_idx in kfold.split(df_full_train):
    # Get training and validation data for this fold
    df_train = df_full_train.iloc[train_idx]
    df_val = df_full_train.iloc[val_idx]

    # Extract target variable for this fold
    y_train = df_train.churn.values
    y_val = df_val.churn.values

    # Train model on this fold's training data
    dv, model = train(df_train, y_train, C=C)
    
    # Make predictions on this fold's validation data
    y_pred = predict(df_val, dv, model)

    # Calculate ROC AUC score for this fold
    auc = roc_auc_score(y_val, y_pred)
    scores.append(auc)

# Print cross-validation results: mean AUC ± standard deviation
print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))

C=1.0 0.841 +- 0.008
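
The loop above relies on `KFold` to partition the row positions. As a rough illustration of the idea, the sketch below shuffles the positions once and hands out each slice as the validation set exactly once. The function `kfold_indices` is a hypothetical helper, and its shuffle will not reproduce the exact folds scikit-learn generates.

```python
import random


def kfold_indices(n_samples, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs covering every sample once as validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so fold sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=12, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), "train rows,", len(val_idx), "validation rows")
```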


In [None]:
# Display the individual AUC scores from each fold
scores

[0.8423083263338855,
 0.8450681201165409,
 0.8324061810154525,
 0.8319390707936304,
 0.8522598914373568]
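
As a worked check, the summary line can be recomputed from the five fold scores using only the standard library. One detail worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`, not `statistics.stdev`. The variable names below are chosen for this sketch.

```python
import statistics

# Fold scores copied from the output above
scores = [
    0.8423083263338855,
    0.8450681201165409,
    0.8324061810154525,
    0.8319390707936304,
    0.8522598914373568,
]

mean_auc = statistics.mean(scores)
std_population = statistics.pstdev(scores)  # same convention as np.std (ddof=0)
std_sample = statistics.stdev(scores)       # ddof=1, slightly larger

summary = 'C=%s %.3f +- %.3f' % (1.0, mean_auc, std_population)
print(summary)
print('sample std: %.4f' % std_sample)
```

With only five folds, the sample standard deviation is noticeably larger than the population one, so it helps to state which convention a reported spread uses.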

In [None]:
# Step 8: Final model training and test evaluation

# Train final model on ALL training data (best practice after CV)
# Reuse the C chosen in Step 6 so the final model matches what was cross-validated
dv, model = train(df_full_train, df_full_train.churn.values, C=C)

# Make predictions on held-out test set (never seen by model before)
y_pred = predict(df_test, dv, model)

# Get test target values
y_test = df_test.churn.values

# Calculate ROC AUC on test set - this is our final performance metric
auc = roc_auc_score(y_test, y_pred)
auc  # Display the final AUC score

0.8572386167896259

## Section 5.1: Model Persistence - Saving Models to Disk

**Why Save Models?**

Instead of retraining the model every time we need predictions, we:
1. **Train once** - Expensive operation (computationally)
2. **Save to disk** - Serialization with pickle (saved here with a .bin extension)
3. **Load when needed** - Fast deserialization for serving predictions

**Pickle Format:**

- **What**: Python's native serialization format
- **Pros**: Preserves all Python object structure (DictVectorizer + LogisticRegression)
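- **Cons**: Python-specific (not easily used in other languages); loading a pickle can execute arbitrary code, so only load files you trust; a model pickled with one scikit-learn version may not load cleanly in another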
- **Use case**: Great for Python-based ML applications

**Alternative Formats** (for production):
- ONNX: Cross-language model format
- Protocol Buffers: Google's serialization format
- JSON: Human-readable but limited type support
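
Before pickling the real objects, it may help to see the round trip in isolation. The sketch below serializes a tuple of two plain dictionaries to bytes and restores it. The dictionaries `vectorizer_stub` and `model_stub` are hypothetical stand-ins for `dv` and `model`, so the example runs without scikit-learn.

```python
import pickle

# Hypothetical stand-ins for the fitted DictVectorizer and the trained model
vectorizer_stub = {"feature_names": ["contract=two_year", "tenure"]}
model_stub = {"weights": [-1.2, -0.03], "bias": 0.4}

# Serialize both objects together so they cannot get out of sync
payload = pickle.dumps((vectorizer_stub, model_stub))

# Deserialize: the result is an equal copy, not the same object
restored_vectorizer, restored_model = pickle.loads(payload)

print(type(payload), len(payload))
print(restored_model == model_stub, restored_model is model_stub)
```

Saving the vectorizer and the model as one tuple mirrors what the notebook does: a model is only meaningful together with the vectorizer it was trained with.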

In [None]:
# Import pickle module for serializing Python objects
import pickle

In [None]:
# Define output filename for saved model
# Format: model_C=<C_value>.bin
# The C value is included to track which hyperparameter was used
output_file = f'model_C={C}.bin'

In [None]:
# Display the output filename
output_file

'model_C=1.0.bin'

In [None]:
# Old way: Manual file handling (less safe)
# If pickle.dump raises an exception, f_out.close() is never reached and the file stays open
f_out = open(output_file, 'wb') 
pickle.dump((dv, model), f_out)
f_out.close()

In [None]:
# List binary files in current directory to verify model was saved
# Shows filename and file size (-h for human-readable format)
!ls -lh *.bin

-rwxrwxrwx 1 alexey alexey 2.5K Sep 30 14:10 'model_C=1.0.bin'


In [None]:
# Better way: Use context manager (automatically closes file)
# This is the recommended approach: safer and cleaner code
with open(output_file, 'wb') as f_out: 
    pickle.dump((dv, model), f_out)
# File is automatically closed when exiting the 'with' block
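
The difference between the two approaches shows up when something goes wrong mid-write. In the sketch below, an error is raised deliberately inside the `with` block, and the file still ends up closed and flushed. The temporary path and the `RuntimeError` are assumptions made for this demonstration only.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.bin")

try:
    with open(path, "wb") as f_out:
        pickle.dump({"ok": True}, f_out)
        # Simulate a failure that happens before the block finishes
        raise RuntimeError("simulated failure after writing")
except RuntimeError:
    pass

# The context manager closed the file even though an exception was raised
closed_after_error = f_out.closed

# The data written before the error was flushed to disk on close
with open(path, "rb") as f_in:
    restored = pickle.load(f_in)

print(closed_after_error, restored)
```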

## Section 5.2: Model Deserialization - Loading Models from Disk

**Loading Saved Models:**

Instead of retraining, we can load the previously saved model in a fraction of a second. This is how production systems work:
1. Train model in development environment (notebook)
2. Save to disk
3. Deploy saved model to web server
4. Load model at startup
5. Serve predictions in real-time
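
Step 4 matters because reading the file on every request would waste time. One way to load the model once and reuse it, sketched under assumptions: the cached loader `get_model`, the `load_events` list, and the stub objects are hypothetical, and a real service might simply load the model in module-level code at startup.

```python
import functools
import os
import pickle
import tempfile

# Write a small stand-in model file so the sketch is self-contained
model_path = os.path.join(tempfile.mkdtemp(), "model_stub.bin")
with open(model_path, "wb") as f_out:
    pickle.dump(({"feature_names": ["tenure"]}, {"bias": 0.1}), f_out)

load_events = []  # records each time the file is actually read


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Read the pickle on the first call; later calls return the cached objects."""
    load_events.append(path)
    with open(path, "rb") as f_in:
        return pickle.load(f_in)


first = get_model(model_path)   # reads from disk
second = get_model(model_path)  # served from the cache

print(len(load_events), first is second)
```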

In [None]:
# Import pickle for deserializing saved models
import pickle

In [None]:
# Define the path to the saved model file
# This is the file we saved earlier
input_file = 'model_C=1.0.bin'

In [None]:
# Load the saved model from disk
# Use context manager to safely handle file operations
with open(input_file, 'rb') as f_in: 
    # pickle.load deserializes the tuple: (dv, model)
    dv, model = pickle.load(f_in)
# Now we have dv and model ready to use for predictions

In [None]:
# Display the loaded model object
# The repr lists only non-default parameters, so max_iter=1000 appears
# but C=1.0 (the scikit-learn default) does not
model

LogisticRegression(max_iter=1000)

In [None]:
# Test the loaded model with a single customer record
# This is a sample customer with all required features

customer = {
    'gender': 'female',                    # Customer demographic
    'seniorcitizen': 0,                    # Age group (0 = not senior)
    'partner': 'yes',                      # Has a partner
    'dependents': 'no',                    # No dependents
    'phoneservice': 'no',                  # Phone service status
    'multiplelines': 'no_phone_service',   # Multiple lines
    'internetservice': 'dsl',              # Internet type
    'onlinesecurity': 'no',                # Online security add-on
    'onlinebackup': 'yes',                 # Online backup add-on
    'deviceprotection': 'no',              # Device protection
    'techsupport': 'no',                   # Tech support
    'streamingtv': 'no',                   # TV streaming service
    'streamingmovies': 'no',               # Movie streaming service
    'contract': 'month-to-month',          # Contract type (high churn risk)
    'paperlessbilling': 'yes',             # Paperless billing enabled
    'paymentmethod': 'electronic_check',   # Payment method
    'tenure': 1,                           # Months as customer (NEW customer!)
    'monthlycharges': 29.85,               # Monthly bill
    'totalcharges': 29.85                  # Total charges (just first month)
}

In [None]:
# Transform customer dictionary to feature vector using the loaded DictVectorizer
# This converts categorical variables to numeric one-hot encoded features
X = dv.transform([customer])

In [None]:
# Make prediction using the loaded model
# predict_proba returns probabilities for both classes: [no_churn, churn]
# [0, 1] gets the churn probability for first (and only) customer
y_pred = model.predict_proba(X)[0, 1]

In [None]:
# Display input customer data and output prediction
# y_pred is the predicted churn probability, between 0 and 1
# For example, 0.8 would mean an estimated 80% chance that the customer churns
print('input:', customer)
print('output:', y_pred)

input: {'gender': 'female', 'seniorcitizen': 0, 'partner': 'yes', 'dependents': 'no', 'phoneservice': 'no', 'multiplelines': 'no_phone_service', 'internetservice': 'dsl', 'onlinesecurity': 'no', 'onlinebackup': 'yes', 'deviceprotection': 'no', 'techsupport': 'no', 'streamingtv': 'no', 'streamingmovies': 'no', 'contract': 'month-to-month', 'paperlessbilling': 'yes', 'paymentmethod': 'electronic_check', 'tenure': 1, 'monthlycharges': 29.85, 'totalcharges': 29.85}
output: 0.5912433520805763
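
A probability usually has to be turned into a yes/no decision before a business process can act on it. The sketch below applies a cutoff to the probability printed above. The function `churn_decision` and the 0.5 threshold are assumptions for illustration; the appropriate threshold depends on the cost of a missed churner versus an unnecessary retention offer.

```python
def churn_decision(probability, threshold=0.5):
    """Package a probability and the resulting yes/no decision."""
    return {
        "churn": bool(probability >= threshold),
        "churn_probability": float(probability),
    }


# Probability taken from the prediction output above
result = churn_decision(0.5912433520805763)
print(result)

# A stricter cutoff flags fewer customers
strict_result = churn_decision(0.5912433520805763, threshold=0.7)
print(strict_result)
```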


## Section 5.3: Production Deployment - API Integration

**From Notebook to Web Service:**

So far we've tested predictions locally. In production:
1. **Web Server** - Receives HTTP requests from client applications
2. **Model Service** - Endpoint that accepts customer data (JSON)
3. **Prediction** - Model processes data and returns churn probability
4. **Response** - Web service returns prediction back to client (JSON)

**Why Use HTTP/API?**
- **Language-agnostic**: Clients can use any language (Python, JavaScript, Java, etc.)
- **Scalable**: Can handle multiple requests concurrently
- **Stateless**: Each request is independent
- **Deployable**: Can run on any cloud platform (AWS, GCP, Azure, Heroku)

**Example Flow:**
```
Client App (e.g., CRM System)
    ↓
HTTP POST /predict
    ↓
Web Server (Flask/FastAPI/Django)
    ↓
Use model loaded from disk at startup
    ↓
Transform customer data
    ↓
Get prediction
    ↓
HTTP Response (JSON)
    ↓
Client receives churn probability
    ↓
Send automated email to at-risk customers
```
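
The flow above can be sketched end to end with only the standard library. This is not how the service behind this notebook is implemented; it is a minimal illustration in which `score` is a hypothetical stand-in for `dv.transform` plus `model.predict_proba`, the 0.5 threshold is an assumption, and the server runs on a random free local port instead of 9696.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(customer):
    """Hypothetical scoring rule standing in for the real vectorizer and model."""
    return 0.1 if customer.get("contract") == "two_year" else 0.6


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Read and parse the JSON body sent by the client
        length = int(self.headers.get("Content-Length", 0))
        customer = json.loads(self.rfile.read(length))
        probability = score(customer)
        body = json.dumps({
            "churn": probability >= 0.5,  # assumed decision threshold
            "churn_probability": probability,
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/predict"
request = urllib.request.Request(
    url,
    data=json.dumps({"contract": "two_year", "tenure": 1}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Bypass any system proxy settings for this local call
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(request, timeout=5) as reply:
    response = json.loads(reply.read())

server.shutdown()
server.server_close()
print(response)
```

A framework such as Flask or FastAPI would replace the handler class with a few lines of routing code, but the request and response shapes stay the same.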

In [None]:
# Import requests library to make HTTP calls to web server
# This is how client applications would interact with deployed model
import requests

In [None]:
# Define the URL of the deployed prediction service
# In production, this would point to your web server (local or cloud-based)
url = 'http://localhost:9696/predict'

In [None]:
# Create another test customer record to send to the API
# This customer is different from the first: has 2-year contract (lower churn risk)
customer = {
    'gender': 'female',
    'seniorcitizen': 0,
    'partner': 'yes',
    'dependents': 'no',
    'phoneservice': 'no',
    'multiplelines': 'no_phone_service',
    'internetservice': 'dsl',
    'onlinesecurity': 'no',
    'onlinebackup': 'yes',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'no',
    'streamingmovies': 'no',
    'contract': 'two_year',  # 2-year contract (more committed customer)
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 29.85,
    'totalcharges': 29.85
}

In [None]:
# Send customer data to the web service as JSON and get prediction
# requests.post sends HTTP POST request with customer data
# .json() parses the JSON response from server
response = requests.post(url, json=customer).json()

In [None]:
# Display the response from web service
# Expected format: {'churn': True/False, 'churn_probability': 0.XXXX}
response

{'churn': True, 'churn_probability': 0.5133820686195286}

In [None]:
# Business logic: If the service flags the customer as likely to churn, take action
# Example: Send retention email or special offer
customer_id = 'asdx-123d'  # placeholder ID; a real system would look this up
if response['churn']:
    # In production, this would integrate with CRM/email system
    print('sending email to', customer_id)

sending email to asdx-123d
