# 📊 Classification Models Complete Guide

**Author**: Data Science Master System  
**Difficulty**: ⭐⭐ Intermediate  
**Time**: 60 minutes  
**Prerequisites**: 00_getting_started completed

## Learning Objectives
- Understand classification problem types
- Implement multiple classification algorithms
- Compare model performance
- Handle imbalanced data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

np.random.seed(42)

## 1. Load Real Dataset

In [None]:
# Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

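# Stratified 80/20 split; an explicit random_state keeps the split stable even if cells run out of order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)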

print(f"Dataset: {data.DESCR[:200]}...")
print(f"\nFeatures: {X.shape[1]}, Samples: {X.shape[0]}")
print(f"Classes: {np.bincount(y)}")
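
The class counts printed above come out to 212 malignant and 357 benign samples, a mild imbalance. One common way to handle imbalance in scikit-learn is the `class_weight='balanced'` option, which reweights the loss inversely to class frequency. The sketch below is only an illustration under assumptions: the variable names (`X_demo`, `recalls`, and so on) are made up here so they do not clash with the notebook's own, and logistic regression is an arbitrary choice of model. It compares recall on the malignant class (label 0) with and without reweighting; whether reweighting helps depends on the data and the split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical names, kept separate from the notebook's X / y / X_train.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Index 0 = malignant, index 1 = benign.
class_counts = np.bincount(y_demo)
minority_share = class_counts.min() / class_counts.sum()
print(f"Class counts: {class_counts}, minority share: {minority_share:.2f}")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the same model twice: once unweighted, once with balanced class weights.
recalls = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight=weight),
    )
    clf.fit(X_tr, y_tr)
    # Recall on the minority (malignant) class, which is label 0 here.
    recalls[label] = recall_score(y_te, clf.predict(X_te), pos_label=0)

for label, value in recalls.items():
    print(f"Malignant recall ({label}): {value:.3f}")
```

With an imbalance this mild the two recalls may be close; resampling or threshold tuning are alternatives when the gap between classes is larger.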

## 2. Model Comparison

In [None]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

results = []
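for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    train_acc = model.score(X_train_scaled, y_train)  # accuracy on data the model has seen (optimistic)
    test_acc = model.score(X_test_scaled, y_test)     # accuracy on the held-out split
    # 5-fold CV on the training split. cross_val_score fits clones, so the model fitted above is untouched.
    # The scaler was fit on all of X_train, so the folds share its statistics (slightly optimistic).
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    results.append({'Model': name, 'Train': train_acc, 'Test': test_acc, 'CV Mean': cv_scores.mean()})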

results_df = pd.DataFrame(results).sort_values('Test', ascending=False)
print("📊 Model Comparison:")
display(results_df)
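
The comparison above cross-validates on data that was scaled once, before the folds were made. A stricter variant puts the scaler inside a `Pipeline`, so each fold fits its own scaler on that fold's training portion only. The following is a minimal sketch of that idea, assuming logistic regression as the example model; the names `X_raw`, `pipe`, and `pipe_scores` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative names; the raw (unscaled) features go into the pipeline.
X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Scaling is a pipeline step, so it is refit inside every CV fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds with a fixed seed for repeatable scores.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_raw, y_raw, cv=cv)

print(f"CV accuracy: {pipe_scores.mean():.3f} +/- {pipe_scores.std():.3f}")
```

On a dataset this size the difference from pre-scaling is usually small, but the pipeline form avoids the leak by construction and can be passed to any other model-selection tool unchanged.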

## 3. Best Model Analysis

In [None]:
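# Pick the model with the highest cross-validation mean instead of hard-coding one;
# choosing by test score would bias the final evaluation.
best_name = results_df.sort_values('CV Mean', ascending=False).iloc[0]['Model']
best_model = models[best_name]
y_pred = best_model.predict(X_test_scaled)

print(f"Best model by CV mean: {best_name}")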

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
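
The imports at the top of this notebook include `confusion_matrix` and `roc_auc_score`, which complement the classification report. The sketch below shows one way to use them. It is self-contained and trains its own random forest on a fresh split, so its numbers will not exactly match the cells above; the names `X_eval`, `rf`, `cm`, and `auc` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative names, independent of the notebook's variables.
X_eval, y_eval = load_breast_cancer(return_X_y=True)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_eval, y_eval, test_size=0.2, stratify=y_eval, random_state=42
)

# Tree ensembles do not need feature scaling, so raw features are used here.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_fit, y_fit)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_hold, rf.predict(X_hold))
print("Confusion matrix:")
print(cm)

# ROC AUC needs scores, not hard labels: use the probability of class 1 (benign).
auc = roc_auc_score(y_hold, rf.predict_proba(X_hold)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```

In a screening setting the top-right cell (malignant cases predicted as benign) is typically the costliest error, which is why the matrix is worth reading alongside overall accuracy.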

## 4. Next Steps

- **06_regression_models**: Continuous target prediction
- **08_feature_engineering**: Improve features
- **09_hyperparameter_tuning**: Optimize models

---
**Tags**: classification, supervised-learning, model-comparison