# Unit 4 - Example 11: Classification Basics

## 📚 Learning Objectives

By completing this notebook, you will:
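- Understand how classification differs from regression: predicting categories rather than continuous values
- Train logistic regression and decision tree classifiers with scikit-learn
- Evaluate classifiers with accuracy, confusion matrices, and ROC curves (AUC) on a small synthetic dataset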

## 🔗 Prerequisites

- ✅ Basic Python
- ✅ Basic NumPy/Pandas (when applicable)

---

## Official Structure Reference

This notebook supports **Course 05, Unit 4** requirements from `DETAILED_UNIT_DESCRIPTIONS.md`.

---


## 🔗 Solving the Problem from Example 10

**Remember the dead end from Example 10?**
- We learned linear regression for predicting continuous values
- But we discovered we need to predict categories/classes, not continuous values
- Linear regression doesn't work well for classification problems
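
A minimal sketch of why a regression line is a poor fit for class labels. The one-feature toy data and all variable names here are assumptions made for illustration; they are not taken from Example 10. Linear regression returns unbounded numbers, while logistic regression returns probabilities between 0 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, label is 1 whenever x > 5
x = np.arange(0, 11, dtype=float).reshape(-1, 1)
y = (x.ravel() > 5).astype(int)

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Score two points far outside the training range
x_new = np.array([[-5.0], [15.0]])
linear_out = linear.predict(x_new)                  # unbounded real numbers
logistic_out = logistic.predict_proba(x_new)[:, 1]  # probabilities of class 1

print("Linear regression outputs:", linear_out)
print("Logistic regression P(class 1):", logistic_out)
```

On this toy data the linear outputs fall below 0 and above 1, so they cannot be read as class probabilities.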

**This notebook solves that problem!**
- We'll learn **classification algorithms** (Logistic Regression, Decision Trees, etc.)
- We'll learn how to **predict categories** instead of continuous values
- We'll learn **classification metrics**: accuracy, confusion matrices, and ROC/AUC in this notebook; precision, recall, and F1-score follow in Example 12

**This solves the classification problem from Example 10!**

In [1]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, roc_auc_score)

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("=" * 70)
print("Example 11: Classification Basics")
print("=" * 70)
print("\n📚 Prerequisites: Example 10 completed, linear regression knowledge")
print("🔗 This is the SECOND example in Unit 4 - classification algorithms")
print("🎯 Goal: Master classification with logistic regression and decision trees")


Example 11: Classification Basics

📚 Prerequisites: Example 10 completed, linear regression knowledge
🔗 This is the SECOND example in Unit 4 - classification algorithms
🎯 Goal: Master classification with logistic regression and decision trees


## 1. Create Classification Data


In [3]:
print("\n1. Creating Classification Data")
print("-" * 70)
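np.random.seed(42)  # fixed seed so the run is reproducible
n_samples = 500
# Two Gaussian features centered at (2, 3)
X1 = np.random.normal(2, 1.5, n_samples)
X2 = np.random.normal(3, 1.5, n_samples)
X = np.column_stack([X1, X2])
# Label 1 inside a circle of radius 2 around (2, 3), then add label noise:
# about 10% of the class-0 points are flipped to 1 (clip keeps labels in {0, 1})
y = ((X1 - 2)**2 + (X2 - 3)**2 < 4).astype(int) + np.random.binomial(1, 0.1, n_samples)
y = np.clip(y, 0, 1)
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])
df['target'] = y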
print(f"Data shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts()}")


1. Creating Classification Data
----------------------------------------------------------------------
Data shape: (500, 3)
Target distribution:
target
1    323
0    177
Name: count, dtype: int64


## 2. Logistic Regression


In [5]:
print("\n\n2. Logistic Regression")
print("-" * 70)
X_data = df[['feature_1', 'feature_2']]
y_data = df['target']
# Stratified 80/20 split keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.2, random_state=42, stratify=y_data)
# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
logistic_model = LogisticRegression(random_state=42, max_iter=1000)
logistic_model.fit(X_train_scaled, y_train)
y_test_pred_lr = logistic_model.predict(X_test_scaled)
# Probability of class 1, used later for the ROC curve
y_test_proba_lr = logistic_model.predict_proba(X_test_scaled)[:, 1]
accuracy_lr = accuracy_score(y_test, y_test_pred_lr)
print(f"\nLogistic Regression Accuracy: {accuracy_lr:.4f}")



2. Logistic Regression
----------------------------------------------------------------------

Logistic Regression Accuracy: 0.6500
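
An accuracy of 0.65 is close to the share of class 1 in the data (323 of 500, about 0.646), which suggests the linear model may be doing little better than always predicting the majority class. That would be consistent with the label rule, which is a circle that a straight-line decision boundary cannot trace. A hedged way to check is to compare against a majority-class baseline. The sketch below recreates the notebook's data with the same seed and uses scikit-learn's `DummyClassifier`, which is an addition not used in the original cells.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the notebook's synthetic data (same seed and label rule)
np.random.seed(42)
n_samples = 500
X1 = np.random.normal(2, 1.5, n_samples)
X2 = np.random.normal(3, 1.5, n_samples)
X = np.column_stack([X1, X2])
y = ((X1 - 2)**2 + (X2 - 3)**2 < 4).astype(int) + np.random.binomial(1, 0.1, n_samples)
y = np.clip(y, 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Logistic regression on standardized features, as in the cell above
scaler = StandardScaler().fit(X_train)
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(scaler.transform(X_train), y_train)
logreg_acc = logreg.score(scaler.transform(X_test), y_test)

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"Logistic regression accuracy:     {logreg_acc:.4f}")
```

If the two numbers are nearly equal, the model has learned little beyond the class ratio.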


## 3. Decision Tree


In [7]:
print("\n\n3. Decision Tree")
print("-" * 70)
# Trees split on feature thresholds, so the unscaled features are used directly
tree_model = DecisionTreeClassifier(random_state=42, max_depth=5)
tree_model.fit(X_train, y_train)
y_test_pred_dt = tree_model.predict(X_test)
y_test_proba_dt = tree_model.predict_proba(X_test)[:, 1]
accuracy_dt = accuracy_score(y_test, y_test_pred_dt)
print(f"\nDecision Tree Accuracy: {accuracy_dt:.4f}")



3. Decision Tree
----------------------------------------------------------------------

Decision Tree Accuracy: 0.8900
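
To see what the tree learned, its splits can be printed as text rules. This is a sketch under the assumption that the data and split match the cells above (it recreates them with the same seed); `export_text` is a scikit-learn helper that the original notebook does not call.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Recreate the notebook's synthetic data (same seed and label rule)
np.random.seed(42)
n_samples = 500
X1 = np.random.normal(2, 1.5, n_samples)
X2 = np.random.normal(3, 1.5, n_samples)
X = np.column_stack([X1, X2])
y = np.clip(((X1 - 2)**2 + (X2 - 3)**2 < 4).astype(int)
            + np.random.binomial(1, 0.1, n_samples), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

tree = DecisionTreeClassifier(random_state=42, max_depth=5).fit(X_train, y_train)

# Human-readable threshold rules, one line per node
rules = export_text(tree, feature_names=["feature_1", "feature_2"])
print(rules)

# A large gap between these two scores would hint at overfitting
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.4f} | Test accuracy: {test_acc:.4f}")
```

Each rule is an axis-aligned threshold on one feature; stacking several of them lets the tree approximate the circular region with rectangles.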


## 4. Confusion Matrices


In [9]:
print("\n\n4. Confusion Matrices")
print("-" * 70)
cm_lr = confusion_matrix(y_test, y_test_pred_lr)
cm_dt = confusion_matrix(y_test, y_test_pred_dt)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Confusion Matrices')
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
axes[0].set_title('Logistic Regression')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
axes[1].set_title('Decision Tree')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
plt.tight_layout()
plt.savefig('11_confusion_matrices.png', dpi=300, bbox_inches='tight')
print("✓ Confusion matrices saved")
plt.close()



4. Confusion Matrices
----------------------------------------------------------------------


✓ Confusion matrices saved
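
The four cells of a binary confusion matrix are enough to compute precision, recall, and F1 by hand. A minimal sketch on ten made-up labels (the arrays are hypothetical and are not the notebook's test set), cross-checked against scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels: 4 positives and 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# → TN=5 FP=1 FN=1 TP=3
# → precision=0.75 recall=0.75 f1=0.75
```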


## 5. ROC Curves


In [11]:
print("\n\n5. ROC Curves")
print("-" * 70)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_test_proba_lr)
auc_lr = roc_auc_score(y_test, y_test_proba_lr)
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_test_proba_dt)
auc_dt = roc_auc_score(y_test, y_test_proba_dt)
plt.figure(figsize=(10, 6))
plt.plot(fpr_lr, tpr_lr, linewidth=2, label=f'Logistic Regression (AUC = {auc_lr:.4f})')
plt.plot(fpr_dt, tpr_dt, linewidth=2, label=f'Decision Tree (AUC = {auc_dt:.4f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('11_roc_curves.png', dpi=300, bbox_inches='tight')
print("✓ ROC curves saved")
plt.close()



5. ROC Curves
----------------------------------------------------------------------
✓ ROC curves saved
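
Each point on a ROC curve is the (false positive rate, true positive rate) pair obtained at one probability threshold. The sketch below works this out by hand on four made-up scores; the data and the helper `rates_at` are hypothetical and serve only as illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted as class 1."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(fpr), float(tpr)

# Lowering the threshold moves the point up and to the right
for t in (0.8, 0.4, 0.35, 0.1):
    print(f"threshold={t}: (FPR, TPR) = {rates_at(t)}")

# roc_curve sweeps the thresholds for us; AUC summarizes the whole curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.2f}")
# → AUC = 0.75
```

AUC can also be read as the fraction of (positive, negative) pairs that the scores rank correctly: here 3 of the 4 pairs.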


## 6. Summary


In [13]:
print("\n" + "=" * 70)
print("Summary")
print("=" * 70)
print("\nKey Concepts Covered:")
print("1. Logistic Regression for classification")
print("2. Decision Tree classifier")
print("3. Confusion matrix analysis")
print("4. ROC curves and AUC")
print("\nNext Steps: Continue to Example 12 for Model Evaluation")



Summary

Key Concepts Covered:
1. Logistic Regression for classification
2. Decision Tree classifier
3. Confusion matrix analysis
4. ROC curves and AUC

Next Steps: Continue to Example 12 for Model Evaluation


## 🚫 When Classification Hits a Dead End

**BEFORE**: We've learned to build classification models.

**AFTER**: We discover we need proper evaluation beyond just accuracy!

**Why this matters**: Accuracy alone can be misleading - we need comprehensive evaluation metrics!

---

### The Problem We've Discovered

We've learned:
- ✅ How to build classification models (Logistic Regression, Decision Trees)
- ✅ How to calculate accuracy
- ✅ How to create confusion matrices and ROC curves

**But we have a problem:**
- ❓ **What if accuracy is misleading (imbalanced classes)?**
- ❓ **What if we need to understand model performance in detail?**
- ❓ **What if we need to compare multiple models properly?**

**The Dead End:**
- We can build models and calculate accuracy
- But accuracy alone doesn't tell the full story
- We need comprehensive evaluation metrics and techniques

---

### Demonstrating the Problem

Let's see why accuracy alone can be misleading:


In [14]:
print("\n" + "=" * 70)
print("🚫 DEMONSTRATING THE DEAD END: Accuracy Can Be Misleading")
print("=" * 70)

# Create imbalanced dataset to show the problem
np.random.seed(42)
n_samples = 1000
# Imbalanced: 90% class 0, 10% class 1
y_imbalanced = np.random.choice([0, 1], size=n_samples, p=[0.9, 0.1])
X_imbalanced = np.random.randn(n_samples, 5)

# Dummy classifier that always predicts class 0 (majority class)
y_pred_dummy = np.zeros(n_samples)

accuracy_dummy = accuracy_score(y_imbalanced, y_pred_dummy)
print(f"\n📊 Imbalanced Dataset Example:")
print(f"   - Total samples: {n_samples}")
print(f"   - Class 0 (majority): {(y_imbalanced == 0).sum()} ({(y_imbalanced == 0).sum()/n_samples*100:.1f}%)")
print(f"   - Class 1 (minority): {(y_imbalanced == 1).sum()} ({(y_imbalanced == 1).sum()/n_samples*100:.1f}%)")

print(f"\n⚠️  Dummy Classifier (Always Predicts Class 0):")
print(f"   - Accuracy: {accuracy_dummy:.2%}")
print(f"   - This looks good! But the model is useless!")
print(f"   - It never predicts class 1 (the important class)")

print(f"\n💡 The Problem:")
print(f"   - Accuracy alone can be misleading with imbalanced data")
print(f"   - We need precision, recall, F1-score to understand true performance")
print(f"   - We need to understand trade-offs (precision vs recall)")
print(f"   - We need proper evaluation techniques (cross-validation, learning curves)")

print(f"\n📋 What We Need for Proper Evaluation:")
print(f"   1. Multiple metrics (precision, recall, F1, AUC)")
print(f"   2. Cross-validation (robust performance estimation)")
print(f"   3. Learning curves (understand model behavior)")
print(f"   4. Model comparison (which model is actually better?)")

print(f"\n➡️  Solution Needed:")
print(f"   - We need comprehensive model evaluation techniques")
print(f"   - We need to understand metrics beyond accuracy")
print(f"   - We need proper validation methods")
print(f"   - This leads us to Example 12: Model Evaluation")

print("\n" + "=" * 70)



🚫 DEMONSTRATING THE DEAD END: Accuracy Can Be Misleading

📊 Imbalanced Dataset Example:
   - Total samples: 1000
   - Class 0 (majority): 900 (90.0%)
   - Class 1 (minority): 100 (10.0%)

⚠️  Dummy Classifier (Always Predicts Class 0):
   - Accuracy: 90.00%
   - This looks good! But the model is useless!
   - It never predicts class 1 (the important class)

💡 The Problem:
   - Accuracy alone can be misleading with imbalanced data
   - We need precision, recall, F1-score to understand true performance
   - We need to understand trade-offs (precision vs recall)
   - We need proper evaluation techniques (cross-validation, learning curves)

📋 What We Need for Proper Evaluation:
   1. Multiple metrics (precision, recall, F1, AUC)
   2. Cross-validation (robust performance estimation)
   3. Learning curves (understand model behavior)
   4. Model comparison (which model is actually better?)

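➡️  Solution Needed:
   - We need comprehensive model evaluation techniques
   - We need to understand metrics beyond accuracy
   - We need proper validation methods
   - This leads us to Example 12: Model Evaluation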
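
The other metrics expose the always-predict-0 classifier immediately. A minimal sketch, assuming an exact 900/100 split built by hand rather than the random draw above; `zero_division=0` tells scikit-learn to report 0 instead of warning when no positives are predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hand-built imbalanced labels: 900 of class 0, 100 of class 1
y_true = np.array([0] * 900 + [1] * 100)
# "Classifier" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# → accuracy=0.90 precision=0.00 recall=0.00 f1=0.00
```

Recall of 0 states the failure directly: none of the 100 class-1 samples is found.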

### What We Need Next

**The Solution**: We need comprehensive model evaluation:
- **Multiple metrics**: Precision, recall, F1-score, AUC (not just accuracy)
- **Cross-validation**: Robust performance estimation
- **Learning curves**: Understand model behavior and overfitting
- **Model comparison**: Proper techniques to compare models
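
As a preview of what cross-validation looks like in practice, here is a hedged sketch. It uses scikit-learn's `make_circles` as a stand-in dataset (an assumption; it is not the notebook's data) and scores two models with 5-fold stratified cross-validation on F1 instead of a single train/test split.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: two noisy concentric circles, one class each
X, y = make_circles(n_samples=500, noise=0.2, factor=0.5, random_state=42)

# Stratified folds keep the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    # The pipeline refits the scaler inside each fold, avoiding leakage
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    results[name] = scores
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread over folds gives a steadier basis for comparing models than one split does.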

**This dead end leads us to Example 12: Model Evaluation**
- Example 12 will teach us comprehensive evaluation techniques
- We'll learn metrics beyond accuracy
- We'll learn validation methods to properly assess models!
