# 📈 End-to-End Classification Pipeline

A complete production-ready classification workflow from raw data to deployment.

## Workflow
1. Data loading and validation
2. Exploratory data analysis
3. Feature engineering
4. Model training with hyperparameter tuning
5. Model evaluation and interpretation
6. Model deployment preparation

**Level**: Intermediate  
**Time Required**: ~45 minutes

In [None]:
import sys
sys.path.insert(0, '../../')

from data_science_master_system import *
from sklearn.model_selection import train_test_split, GridSearchCV
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("✅ All imports ready!")

## 1. Data Loading and Validation

In [None]:
# Load data
loader = DataLoader()
df = loader.read('../data/csv/customer_churn.csv')

# Validate data quality
from data_science_master_system.utils.validators import validate_dataframe

validate_dataframe(df, required_columns=['customer_id', 'churn'], min_rows=100)
print(f"✅ Data validated: {df.shape}")

# Check for data quality issues
print("\n🔍 Data Quality:")
print(f"  Missing values: {df.isnull().sum().sum()}")
print(f"  Duplicates: {df.duplicated().sum()}")
print(f"  Class balance: {df['churn'].value_counts().to_dict()}")
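
The validation step above relies on `validate_dataframe` from the project package, whose internals are not shown in this notebook. As a rough illustration of what such a check might do, here is a sketch in plain pandas. `check_dataframe` and the toy data are hypothetical stand-ins, not the library's implementation.

```python
import pandas as pd


def check_dataframe(df, required_columns, min_rows):
    """Hypothetical stand-in for a validator: raise ValueError on problems."""
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if len(df) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(df)}")
    return True


# Toy frame standing in for the churn data (values are made up)
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 3],
    "churn": [0, 1, 0, 0],
    "monthly_charges": [29.9, None, 55.0, 55.0],
})

check_dataframe(toy, required_columns=["customer_id", "churn"], min_rows=3)

# Same three quality checks as in the cell above
quality = {
    "missing_values": int(toy.isnull().sum().sum()),
    "duplicates": int(toy.duplicated().sum()),
    "class_balance": toy["churn"].value_counts().to_dict(),
}
print(quality)
```

Raising on failure, rather than returning a flag, stops the notebook early so later cells never run on malformed data.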

## 2. Feature Engineering with Pipeline

In [None]:
# Remove ID column
df_features = df.drop(columns=['customer_id'])

# FeatureFactory supports automatic feature generation;
# it is instantiated here, but the features below are built by hand
factory = FeatureFactory()

# Derive tenure-based features
df_features['tenure_years'] = df_features['tenure_months'] / 12
df_features['is_new_customer'] = (df_features['tenure_months'] < 6).astype(int)
df_features['is_long_term'] = (df_features['tenure_months'] > 36).astype(int)

# Create ratio features (the +1 avoids division by zero when tenure is 0)
df_features['charges_per_month'] = df_features['total_charges'] / (df_features['tenure_months'] + 1)
df_features['support_ratio'] = df_features['num_support_tickets'] / (df_features['tenure_months'] + 1)

print(f"Features after engineering: {df_features.shape[1]}")
df_features.head()
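
The `+ 1` in the ratio denominators matters for customers with zero months of tenure. The sketch below reproduces the engineered features on a toy frame. The column names come from the notebook, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tenure_months": [0, 5, 48],
    "total_charges": [30.0, 180.0, 2450.0],
    "num_support_tickets": [1, 0, 7],
})

# Naive ratio: dividing by a tenure of 0 yields inf
naive = toy["total_charges"] / toy["tenure_months"]

# Smoothed ratio and tenure flags, as in the notebook cell
toy["charges_per_month"] = toy["total_charges"] / (toy["tenure_months"] + 1)
toy["is_new_customer"] = (toy["tenure_months"] < 6).astype(int)
toy["is_long_term"] = (toy["tenure_months"] > 36).astype(int)

print(np.isinf(naive).tolist())
print(toy[["charges_per_month", "is_new_customer", "is_long_term"]])
```

The smoothed ratio slightly understates the true monthly charge, which is usually an acceptable trade for keeping every value finite.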

## 3. Prepare for Modeling

In [None]:
# Split features and target
X = df_features.drop(columns=['churn'])
y = df_features['churn']

# Handle categorical variables
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training: {X_train.shape}, Test: {X_test.shape}")
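
To see what `drop_first=True` and `stratify=y` do, the following sketch runs the same two steps on synthetic data. The `contract` column and the 80/20 class split are assumptions made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 non-churners, 20 churners (made-up values)
toy = pd.DataFrame({
    "contract": ["monthly", "yearly"] * 50,
    "tenure_months": range(100),
    "churn": [0] * 80 + [1] * 20,
})

# drop_first=True removes the first category ("monthly") to avoid a redundant column
X = pd.get_dummies(toy.drop(columns=["churn"]), columns=["contract"], drop_first=True)
y = toy["churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(list(X.columns))
# With stratify=y, both splits keep the 20% churn rate
print(y_train.mean(), y_test.mean())
```

Without stratification, a small or imbalanced dataset can end up with a test set whose churn rate differs noticeably from the training set.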

## 4. Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
}

# Grid search
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
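
Grid search fits one model per parameter combination per fold, so it helps to estimate the cost before running it. A small sketch using scikit-learn's `ParameterGrid` on the same grid:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
}
cv_folds = 5

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate on the full training set, and the fitted model is available as `grid_search.best_estimator_`.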

In [None]:
# Train final model
best_model = ClassificationModel('random_forest', **grid_search.best_params_)
best_model.fit(X_train, y_train)

# Evaluate
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)

metrics = calculate_metrics(y_test, y_pred, 'classification', y_proba)

print("\n📊 Final Model Performance:")
for k, v in metrics.items():
    print(f"  {k}: {v:.4f}")
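
The exact metrics returned by `calculate_metrics` are not listed in this notebook. As an assumption-laden sketch, these are the usual binary classification metrics computed directly with scikit-learn on made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up ground truth and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
p_churn = np.array([0.1, 0.4, 0.2, 0.7, 0.8, 0.3, 0.9, 0.05])
y_hat = (p_churn >= 0.5).astype(int)  # hard labels at a 0.5 threshold

metrics = {
    "accuracy": accuracy_score(y_true, y_hat),
    "precision": precision_score(y_true, y_hat),
    "recall": recall_score(y_true, y_hat),
    "f1": f1_score(y_true, y_hat),
    "roc_auc": roc_auc_score(y_true, p_churn),  # uses probabilities, not labels
}

for name, value in metrics.items():
    print(f"  {name}: {value:.4f}")
```

On imbalanced churn data, precision, recall, and F1 tend to be more informative than accuracy alone.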

## 5. Model Interpretation

In [None]:
# Feature importance
importance_df = best_model.feature_importance(top_n=15)

plotter = Plotter()
fig = plotter.feature_importance(importance_df, title='Top 15 Features')
plt.show()
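
How `best_model.feature_importance` computes its ranking is not shown here. One common source is the impurity-based `feature_importances_` attribute of a fitted random forest, sketched below on synthetic data where only the hypothetical `signal` column carries information.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = (X["signal"] > 0).astype(int)  # target depends only on "signal"

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rank features by impurity-based importance (values sum to 1)
importance_df = (
    pd.DataFrame({"feature": X.columns, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Impurity-based importances are computed from the training data and can favor high-cardinality features, so treat the ranking as a guide rather than a causal statement.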

In [None]:
# ROC Curve
fig = plotter.roc_curve(y_test, y_proba[:, 1], title='ROC Curve')
plt.show()

In [None]:
# Confusion Matrix
from data_science_master_system.evaluation.metrics import ClassificationMetrics

cm = ClassificationMetrics.confusion_matrix(y_test, y_pred)
fig = plotter.confusion_matrix(cm, labels=['No Churn', 'Churn'], normalize=True)
plt.show()
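
The plot above uses `normalize=True`, presumably scaling each row by the number of true samples in that class. That behavior is an assumption about the project's `Plotter`. The scikit-learn equivalent, on made-up labels, looks like this:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# normalize="true" divides each row by its total, so rows sum to 1
cm_norm = confusion_matrix(y_true, y_hat, normalize="true")

print(cm)
print(np.round(cm_norm, 3))
```

In the row-normalized matrix, the bottom-right cell is the recall for the churn class.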

## 6. Save and Deploy

In [None]:
# Save the model
best_model.save('churn_model_production.joblib')
print("✅ Model saved!")

# Example prediction function for deployment
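def predict_churn(customer_data: dict) -> dict:
    """Predict churn probability for a single customer."""
    model = ClassificationModel.load('churn_model_production.joblib')
    df = pd.DataFrame([customer_data])
    # Apply the same feature engineering used in training here...
    df_encoded = pd.get_dummies(df)
    # Align columns with the training data: add missing dummy columns as 0
    # and drop columns the model never saw. X_train comes from the cells above;
    # for a real deployment, persist this column list alongside the model.
    df_encoded = df_encoded.reindex(columns=X_train.columns, fill_value=0)
    proba = float(model.predict_proba(df_encoded)[0, 1])
    return {'churn_probability': proba, 'will_churn': proba > 0.5}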

print("\n📦 Deployment function ready!")
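
The column-alignment step in `predict_churn` matters because one-hot encoding a single record only creates a dummy column for the category that record contains. A minimal sketch with `DataFrame.reindex`, using a hypothetical `contract` column and made-up values:

```python
import pandas as pd

# Training-time encoding on toy data
train = pd.DataFrame({
    "contract": ["monthly", "yearly", "two_year"],
    "tenure_months": [3, 14, 40],
})
train_encoded = pd.get_dummies(train, columns=["contract"], drop_first=True)
train_columns = list(train_encoded.columns)

# A single incoming record yields only its own category's dummy column
record = pd.DataFrame([{"contract": "yearly", "tenure_months": 8}])
record_encoded = pd.get_dummies(record, columns=["contract"])

# Reorder to the training layout; absent dummy columns are filled with 0
aligned = record_encoded.reindex(columns=train_columns, fill_value=0)

print(train_columns)
print(aligned)
```

Saving `train_columns` next to the model file lets the prediction function rebuild the same layout without access to the training data.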

## 🎯 Summary

Complete production workflow:
1. ✅ Data loading and validation
2. ✅ Feature engineering
3. ✅ Hyperparameter tuning
4. ✅ Model evaluation
5. ✅ Model interpretation
6. ✅ Deployment preparation