# Day 7: Classification Mastery

Today we tackle **Classification** - predicting categories instead of numbers!

### Topics Covered:
1. Classification vs Regression
2. Logistic Regression
3. Decision Trees
4. Random Forests
5. Model Evaluation (Accuracy, Precision, Recall, F1)
6. **Mini Project: Titanic Survival Prediction**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)
print("Libraries loaded!")

## 1. Classification vs Regression

| Aspect | Regression | Classification |
|--------|------------|----------------|
| **Output** | Continuous (price, temp) | Discrete (yes/no, A/B/C) |
| **Example** | Predict house price | Spam or not spam |
| **Algorithms** | Linear Regression | Logistic Regression, Trees |

In [None]:
# Binary vs Multiclass Classification
print("CLASSIFICATION TYPES")
print("="*50)
print("\nBINARY: 2 classes")
print("  • Spam / Not Spam")
print("  • Survived / Died")
print("  • Fraud / Legitimate")
print("\nMULTICLASS: 3+ classes")
print("  • Cat / Dog / Bird")
print("  • Low / Medium / High")
print("  • Digits 0-9")

## 2. Logistic Regression

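Despite the name, it's a **classification** algorithm! It passes a linear score `z` through the sigmoid function to output a probability between 0 and 1, then predicts class 1 when that probability is at least 0.5.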

```
P(y=1) = 1 / (1 + e^(-z))  where z = wx + b
```
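
To make the formula concrete, here is a minimal sketch that evaluates the sigmoid for a few inputs. The weight `w` and bias `b` are made-up numbers chosen for illustration, not values from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and bias, chosen only for illustration
w, b = 1.5, -6.0

for x in [2.0, 4.0, 6.0]:
    z = w * x + b
    print(f"x={x:.1f}  z={z:+.1f}  P(y=1)={sigmoid(z):.3f}")
```

With these numbers, `x = 4` gives `z = 0`, which is the point where the probability is exactly 0.5.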

In [None]:
# Visualize the Sigmoid Function
z = np.linspace(-10, 10, 100)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(10, 5))
plt.plot(z, sigmoid, 'b-', linewidth=3)
plt.axhline(y=0.5, color='r', linestyle='--', label='Decision Boundary (0.5)')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.fill_between(z, sigmoid, 0.5, where=(sigmoid > 0.5), alpha=0.3, color='green', label='Predict Class 1')
plt.fill_between(z, sigmoid, 0.5, where=(sigmoid < 0.5), alpha=0.3, color='red', label='Predict Class 0')
plt.title('Sigmoid Function - Heart of Logistic Regression', fontsize=14, fontweight='bold')
plt.xlabel('z = wx + b')
plt.ylabel('Probability P(y=1)')
plt.legend()
plt.ylim(-0.1, 1.1)
plt.show()

In [None]:
# Simple Logistic Regression Example
np.random.seed(42)
# Study hours -> Pass/Fail
hours = np.concatenate([np.random.normal(3, 1, 50), np.random.normal(7, 1, 50)])
passed = np.array([0]*50 + [1]*50)

X = hours.reshape(-1, 1)
y = passed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy*100:.1f}%")
print(f"\nCoefficient: {log_reg.coef_[0][0]:.3f}")
print(f"Intercept: {log_reg.intercept_[0]:.3f}")
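
The coefficient and intercept can be turned into a decision boundary: P(y=1) is 0.5 exactly when `z = wx + b = 0`, so the boundary sits at `x = -b / w`. The sketch below rebuilds similar study-hours data and, for brevity, fits on all 100 points rather than the training split, so the numbers may differ slightly from the cell above. The names `model`, `boundary`, and `x_new` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic study-hours data, generated the same way as in the cell above
rng = np.random.RandomState(42)
hours = np.concatenate([rng.normal(3, 1, 50), rng.normal(7, 1, 50)])
passed = np.array([0] * 50 + [1] * 50)

# Fit on all points for brevity (no train/test split in this sketch)
model = LogisticRegression().fit(hours.reshape(-1, 1), passed)
w = model.coef_[0][0]
b = model.intercept_[0]

# P(y=1) = 0.5 when w*x + b = 0, so the boundary is x = -b / w
boundary = -b / w
print(f"Decision boundary: {boundary:.2f} hours")

# Applying the sigmoid by hand should agree with predict_proba
x_new = np.array([[4.0], [6.0]])
manual = 1 / (1 + np.exp(-(w * x_new[:, 0] + b)))
print("Manual sigmoid:", manual.round(3))
print("predict_proba: ", model.predict_proba(x_new)[:, 1].round(3))
```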

## 3. Decision Trees

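A decision tree classifies a sample by asking a series of yes/no questions about its features (for example, "Is Income > 50000?"). Each answer leads down a branch until a leaf assigns the class, which gives the model its tree-like structure.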

In [None]:
# Create sample dataset
np.random.seed(42)
n = 200
age = np.random.randint(18, 70, n)
income = np.random.randint(20000, 150000, n)
# Buy if: (age > 30 AND income > 50000) OR income > 100000
will_buy = ((age > 30) & (income > 50000)) | (income > 100000)
will_buy = will_buy.astype(int)

df = pd.DataFrame({'Age': age, 'Income': income, 'Will_Buy': will_buy})

X = df[['Age', 'Income']]
y = df['Will_Buy']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Decision Tree Accuracy: {tree.score(X_test, y_test)*100:.1f}%")

In [None]:
# Visualize the Decision Tree
plt.figure(figsize=(15, 8))
plot_tree(tree, feature_names=['Age', 'Income'], class_names=['No', 'Yes'],
          filled=True, rounded=True, fontsize=10)
plt.title('Decision Tree Visualization', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
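
If the plot is hard to read, the same splits can be printed as text rules. This is a small sketch using `export_text` from `sklearn.tree`; it rebuilds a dataset with the same purchase rule, and the names `X_demo` and `tree_demo` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild a dataset with the same purchase rule as above
rng = np.random.RandomState(42)
n = 200
age = rng.randint(18, 70, n)
income = rng.randint(20000, 150000, n)
will_buy = (((age > 30) & (income > 50000)) | (income > 100000)).astype(int)

X_demo = pd.DataFrame({'Age': age, 'Income': income})
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, will_buy)

# export_text prints one line per split or leaf, indented by depth
rules = export_text(tree_demo, feature_names=['Age', 'Income'])
print(rules)
```

Reading the rules top to bottom follows the same path a sample takes through the tree.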

## 4. Random Forests

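An **ensemble** of many decision trees, each trained on a random bootstrap sample of the data. Each tree votes, and the majority wins! (In scikit-learn, `RandomForestClassifier` averages the trees' predicted probabilities, which acts as a "soft" vote.)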

**Advantages:**
- Reduces overfitting
- More robust predictions
- Provides feature importance
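
To see the voting in action, the sketch below looks inside a small forest. It uses a synthetic dataset from `make_classification` (an assumption for illustration, not the notebook's data) and compares the averaged per-tree probabilities with what the forest itself reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, used only for this illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=42)

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=42)
forest.fit(X_demo, y_demo)

sample = X_demo[:5]

# Each fitted tree is available in forest.estimators_
tree_probs = np.array([t.predict_proba(sample) for t in forest.estimators_])
avg_probs = tree_probs.mean(axis=0)

# Count how many trees lean towards class 1 for each sample
hard_votes = (tree_probs[:, :, 1] > 0.5).sum(axis=0)

print("Trees voting for class 1:", hard_votes, "out of", len(forest.estimators_))
print("Averaged P(class 1):     ", avg_probs[:, 1].round(3))
print("Forest P(class 1):       ", forest.predict_proba(sample)[:, 1].round(3))
```

The last two lines should match, since the forest's probability is the mean over its trees.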

In [None]:
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

print(f"Random Forest Accuracy: {rf.score(X_test, y_test)*100:.1f}%")
print(f"\nFeature Importances:")
for feat, imp in zip(X.columns, rf.feature_importances_):
    print(f"  {feat}: {imp:.3f}")

In [None]:
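# Compare all 3 models
from sklearn.pipeline import make_pipeline

models = {
    # Logistic regression is scaled first: raw Income values are large,
    # which can keep the default solver from converging
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}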

print("MODEL COMPARISON")
print("="*40)
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"{name:25}: {acc*100:.1f}%")

## 5. Evaluation Metrics

| Metric | Formula | When to Use |
|--------|---------|-------------|
| **Accuracy** | Correct / Total | Balanced classes |
| **Precision** | TP / (TP + FP) | Cost of false positives high |
| **Recall** | TP / (TP + FN) | Cost of false negatives high |
| **F1 Score** | 2 × (P×R)/(P+R) | Balance precision & recall |
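
The formulas in the table can be checked by hand on a tiny example. The labels below are hypothetical and chosen so the counts are easy to verify; the scikit-learn functions are called at the end as a cross-check.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"By hand: acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} f1={f1:.2f}")
print(f"sklearn: acc={accuracy_score(y_true, y_hat):.2f} "
      f"prec={precision_score(y_true, y_hat):.2f} "
      f"rec={recall_score(y_true, y_hat):.2f} "
      f"f1={f1_score(y_true, y_hat):.2f}")
```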

In [None]:
# Confusion Matrix Visualization
y_pred_rf = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred_rf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Buy', 'Buy'], yticklabels=['No Buy', 'Buy'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['No Buy', 'Buy']))

---
## Mini Project: Titanic Survival Prediction

**Goal:** Predict if a passenger survived the Titanic disaster.

In [None]:
# Create Titanic-like dataset
np.random.seed(42)
n = 800

pclass = np.random.choice([1, 2, 3], n, p=[0.25, 0.25, 0.50])
sex = np.random.choice(['male', 'female'], n, p=[0.65, 0.35])
age = np.random.normal(30, 15, n).clip(1, 80)
fare = np.where(pclass == 1, np.random.normal(80, 30, n),
                np.where(pclass == 2, np.random.normal(30, 15, n),
                         np.random.normal(15, 10, n))).clip(5, 200)
sibsp = np.random.choice([0, 1, 2, 3], n, p=[0.6, 0.25, 0.1, 0.05])

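# Survival logic: start from a base rate, then adjust by sex, class, and age
survival_prob = 0.3
survival_prob += np.where(sex == 'female', 0.35, -0.1)
survival_prob += np.where(pclass == 1, 0.25, np.where(pclass == 2, 0.1, -0.15))
survival_prob += np.where(age < 18, 0.15, np.where(age > 60, -0.1, 0))
# The adjustments can push values slightly outside [0, 1], so clip to a valid probability
survival_prob = np.clip(survival_prob, 0, 1)
survived = (np.random.random(n) < survival_prob).astype(int)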

titanic = pd.DataFrame({
    'Pclass': pclass, 'Sex': sex, 'Age': age.round(0).astype(int),
    'Fare': fare.round(2), 'SibSp': sibsp, 'Survived': survived
})

print("TITANIC DATASET")
print(f"Total Passengers: {len(titanic)}")
print(f"Survived: {titanic['Survived'].sum()} ({titanic['Survived'].mean()*100:.1f}%)")
print(titanic.head(10))

In [None]:
# EDA: Survival Rates
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# By Sex
titanic.groupby('Sex')['Survived'].mean().plot(kind='bar', ax=axes[0], color=['#3498db', '#e74c3c'])
axes[0].set_title('Survival by Sex')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xticklabels(['Female', 'Male'], rotation=0)

# By Class
titanic.groupby('Pclass')['Survived'].mean().plot(kind='bar', ax=axes[1], color='#2ecc71')
axes[1].set_title('Survival by Class')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xticklabels(['1st', '2nd', '3rd'], rotation=0)

# Age Distribution
titanic[titanic['Survived']==1]['Age'].hist(alpha=0.7, label='Survived', ax=axes[2], color='#2ecc71')
titanic[titanic['Survived']==0]['Age'].hist(alpha=0.7, label='Died', ax=axes[2], color='#e74c3c')
axes[2].set_title('Age Distribution by Survival')
axes[2].legend()

plt.tight_layout()
plt.show()

In [None]:
# Prepare Data
titanic_ml = titanic.copy()
titanic_ml['Sex'] = LabelEncoder().fit_transform(titanic_ml['Sex'])  # male=1, female=0

X = titanic_ml[['Pclass', 'Sex', 'Age', 'Fare', 'SibSp']]
y = titanic_ml['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training: {len(X_train)} | Test: {len(X_test)}")
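
The fit-on-train, transform-on-test pattern above can also be bundled into a single object with `make_pipeline`, which makes it harder to accidentally fit the scaler on test data. This is a sketch on synthetic data from `make_classification` (an assumption for illustration); `pipe`, `X_tr`, and `X_te` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline scales, then classifies; fit() learns scaler statistics
# from the training split only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test split before predicting
test_acc = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.3f}")
```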

In [None]:
# Train Multiple Models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
}

results = []
print("MODEL COMPARISON")
print("="*60)

for name, model in models.items():
    if name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    results.append({'Model': name, 'Accuracy': acc, 'Precision': prec, 'Recall': rec, 'F1': f1})
    print(f"\n{name}:")
    print(f"  Accuracy:  {acc*100:.1f}%")
    print(f"  Precision: {prec*100:.1f}%")
    print(f"  Recall:    {rec*100:.1f}%")
    print(f"  F1 Score:  {f1*100:.1f}%")

results_df = pd.DataFrame(results)

In [None]:
# Best Model Analysis
# Random Forest is picked by hand here; check results_df to confirm it scored best on your run
best_model = models['Random Forest']
y_pred_best = best_model.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Died', 'Survived'], yticklabels=['Died', 'Survived'])
axes[0].set_title('Confusion Matrix - Random Forest', fontweight='bold')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# Feature Importance
importance = pd.DataFrame({'Feature': X.columns, 'Importance': best_model.feature_importances_})
importance = importance.sort_values('Importance', ascending=True)
axes[1].barh(importance['Feature'], importance['Importance'],
             color=plt.cm.viridis(np.linspace(0.3, 0.9, len(importance))))  # one color per feature
axes[1].set_title('Feature Importance', fontweight='bold')
axes[1].set_xlabel('Importance')

plt.tight_layout()
plt.show()

In [None]:
# Predict New Passengers
new_passengers = pd.DataFrame({
    'Pclass': [1, 3, 2, 1, 3],
    'Sex': [0, 1, 0, 1, 0],  # 0=female, 1=male
    'Age': [25, 35, 8, 60, 22],
    'Fare': [100, 10, 30, 150, 8],
    'SibSp': [1, 0, 2, 0, 1]
})

predictions = best_model.predict(new_passengers)
probabilities = best_model.predict_proba(new_passengers)[:, 1]

print("NEW PASSENGER PREDICTIONS")
print("="*60)
for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
    p = new_passengers.iloc[i]
    sex = 'Female' if p['Sex'] == 0 else 'Male'
    status = 'SURVIVED' if pred == 1 else 'DIED'
    print(f"\nPassenger {i+1}: {sex}, Age {p['Age']}, Class {p['Pclass']}")
    print(f"  Prediction: {status} (Probability: {prob*100:.1f}%)")

In [None]:
# Final Summary
print("\n" + "="*60)
print("DAY 7 COMPLETE!")
print("="*60)
print("""
 KEY TAKEAWAYS:

 1. LOGISTIC REGRESSION
    - Uses sigmoid function for probabilities
    - Good baseline, interpretable

 2. DECISION TREES
    - Easy to visualize and understand
    - Prone to overfitting

 3. RANDOM FORESTS
    - Ensemble of trees, reduces overfitting
    - Often the strongest of the three, but compare on your own data

 4. METRICS
    - Accuracy: Overall correctness
    - Precision: Quality of positive predictions
    - Recall: Coverage of actual positives
    - F1: Balance of precision and recall
""")
print("="*60)
print("Next: Day 8 - Unsupervised Learning!")

---
## Practice Exercises

1. Try different `max_depth` values for Decision Tree
2. Experiment with `n_estimators` in Random Forest
3. Add new features like `Parch` (parents/children aboard)
4. Use `cross_val_score` for robust evaluation
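
As a possible starting point for exercises 1, 2, and 4, here is a sketch that scores a few `max_depth` settings with 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption for illustration); swap in the Titanic `X` and `y` from the mini project to try it on that data. The name `cv_results` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=42)

cv_results = {}
for depth in [2, 5, None]:  # None lets the trees grow until leaves are pure
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    # cv=5 splits the data into 5 folds and returns one accuracy per fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_results[depth] = scores
    print(f"max_depth={depth}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The mean gives a steadier estimate than a single train/test split, and the standard deviation hints at how much the score depends on the particular split.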

---
**Next Up:** Day 8 - Unsupervised Learning (K-Means, PCA)!