# Module 19: Classification Algorithms

## Topics Covered
1. K-Nearest Neighbors
2. Decision Trees
3. Random Forests
4. Support Vector Machines
5. Naive Bayes
6. Algorithm Comparison
7. When to Use Which
8. Hyperparameter Tuning

## Learning Objectives

By the end of this module, you will be able to:
- Understand and apply k-nearest neighbors
- Work with decision trees effectively
- Implement random forests in practice
- Evaluate models using appropriate metrics and techniques
- Apply these concepts to real-world data science problems

---
# Section 1: K-Nearest Neighbors (KNN)
---

## What is K-Nearest Neighbors (KNN)?

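KNN classifies a data point by the majority class among its k nearest neighbors in feature space. It is simple and intuitive, and it has no explicit training phase: fitting only stores the training data. The cost moves to prediction time, which can be slow on large datasets because each query is compared against the stored points. Because the method relies on distances, features should be on comparable scales.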
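
To make the majority-vote idea concrete, here is a minimal from-scratch sketch using NumPy. The function name `knn_predict`, the Euclidean distance, and the toy arrays are assumptions chosen for illustration; this is not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3))   # prints 1
```

Note that all the work happens inside `knn_predict`: there is no fitting step, which is why prediction cost grows with the size of the training set.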

### Why This Matters in Data Science

Classification models underpin many applied problems, such as fraud detection and recommendation systems. KNN makes a useful first baseline for them: it is easy to explain, and comparing it against more complex models shows whether the added complexity pays off.

In [None]:
# Example 1: K-Nearest Neighbors (KNN)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate synthetic data
np.random.seed(42)
n_samples = 200

# Create sample dataset
X = np.random.randn(n_samples, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], label='Class 0', alpha=0.6, s=50)
plt.scatter(X[y==1, 0], X[y==1, 1], label='Class 1', alpha=0.6, s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Nearest Neighbors (KNN) - Sample Data')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

In [None]:
# Example 2: Splitting and Scaling Data Before Modeling
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features (fit the scaler on the training set only, so no test statistics leak in)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print("\nTraining features are now scaled to mean=0 and std=1")
print(f"Training data mean: {X_train_scaled.mean(axis=0)}")
print(f"Training data std: {X_train_scaled.std(axis=0)}")
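
With the data split and scaled, a KNN classifier can be fitted and evaluated. The sketch below is self-contained and rebuilds the same kind of synthetic data; the `_demo` variable names and the choice of `n_neighbors=5` are assumptions for illustration, not values prescribed by this module.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic two-feature data with a linear class boundary
rng = np.random.RandomState(42)
X_demo = rng.randn(200, 2)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split, then scale using statistics from the training set only
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
demo_scaler = StandardScaler()
X_tr_demo_scaled = demo_scaler.fit_transform(X_tr_demo)
X_te_demo_scaled = demo_scaler.transform(X_te_demo)

# Fit KNN (k=5 is an arbitrary illustrative choice) and predict on the test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr_demo_scaled, y_tr_demo)
y_pred_demo = knn.predict(X_te_demo_scaled)

knn_accuracy = accuracy_score(y_te_demo, y_pred_demo)
print(f"Test accuracy: {knn_accuracy:.2f}")
print(classification_report(y_te_demo, y_pred_demo))
```

Try changing `n_neighbors` to see how the test accuracy responds; very small values tend to follow noise, while very large values smooth over the class boundary.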

## Practice Exercise 1.1

**Task:** Prepare a dataset for K-Nearest Neighbors:

Create a synthetic dataset with 150 samples and 3 features. Split it into training and test sets (80/20). Scale the features and visualize the first two features colored by class.

**Expected Output:**
```
Training samples: 120
Test samples: 30
Features scaled successfully
```

In [None]:
# Your code here


In [None]:
# Solution 1.1

# Create dataset
np.random.seed(42)
X_practice = np.random.randn(150, 3)
y_practice = (X_practice.sum(axis=1) > 0).astype(int)

# Split data
X_tr, X_te, y_tr, y_te = train_test_split(X_practice, y_practice, test_size=0.2, random_state=42)

# Scale
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(f"Training samples: {len(X_tr)}")
print(f"Test samples: {len(X_te)}")
print("Features scaled successfully")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_tr_scaled[y_tr==0, 0], X_tr_scaled[y_tr==0, 1], label='Class 0', alpha=0.6)
plt.scatter(X_tr_scaled[y_tr==1, 0], X_tr_scaled[y_tr==1, 1], label='Class 1', alpha=0.6)
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title('Training Data Visualization')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

---
# Section 2: Decision Trees and Random Forests
---

## Understanding Decision Trees and Random Forests

Decision trees split data recursively based on feature values. Random Forests combine multiple trees to reduce overfitting and improve performance.
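
The contrast between a single tree and a forest can be sketched with scikit-learn. The noisy synthetic data, the `_demo` names, and the hyperparameters (`n_estimators=100`, default tree depth) are assumptions for illustration; results on other data may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data with label noise, so a fully grown tree has something to overfit
rng = np.random.RandomState(42)
X_demo = rng.randn(300, 2)
y_demo = (X_demo[:, 0] + X_demo[:, 1] + 0.5 * rng.randn(300) > 0).astype(int)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# One unpruned tree versus an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

results = {}
for name, model in [("Decision tree", tree), ("Random forest", forest)]:
    model.fit(X_tr_demo, y_tr_demo)
    results[name] = {
        "train": accuracy_score(y_tr_demo, model.predict(X_tr_demo)),
        "test": accuracy_score(y_te_demo, model.predict(X_te_demo)),
    }
    print(f"{name}: train accuracy={results[name]['train']:.2f}, "
          f"test accuracy={results[name]['test']:.2f}")
```

A large gap between training and test accuracy is one sign of overfitting. The exact numbers depend on the data and the random seed, so compare the two models on your own dataset before drawing conclusions.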

### Why This Matters in Data Science

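A tree or forest is only as useful as its measured performance, so proper evaluation is critical for understanding how a model will behave in production. Different metrics highlight different strengths and weaknesses: accuracy summarizes overall correctness, while precision and recall expose different kinds of errors. Reading them together helps you choose the right model for your specific business needs.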
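
A small worked case shows why a single metric can mislead. The labels below are made up for illustration: 95 negatives, 5 positives, and a hypothetical model that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred_imb = np.zeros(100, dtype=int)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb)

print(f"Accuracy: {acc_imb:.2f}")  # prints Accuracy: 0.95
print(f"Recall:   {rec_imb:.2f}")  # prints Recall:   0.00
```

Accuracy looks strong because most samples are negative, yet recall shows that not a single positive case was found.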

In [None]:
# Example 1: Evaluating Classifier Predictions (simulated)
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Simulate predictions (in practice, use the output of a fitted tree or forest)
np.random.seed(42)
y_true = np.random.randint(0, 2, 100)
y_pred = y_true.copy()
# Add some errors
error_indices = np.random.choice(100, 15, replace=False)
y_pred[error_indices] = 1 - y_pred[error_indices]

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# Classification report
print("\nClassification Report:")
print(classification_report(y_true, y_pred))

## Practice Exercise 2.1

**Task:** Evaluate model performance:

Given true labels and predictions:
```python
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```

1. Create a confusion matrix
2. Calculate accuracy
3. Calculate precision and recall
4. Interpret the results

**Expected Output:**
```
Accuracy: 0.80
Precision: 0.80
Recall: 0.80
```

In [None]:
# Your code here


In [None]:
# Solution 2.1
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"True Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"\nAccuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

print("\nInterpretation:")
print(f"- The model correctly classifies {accuracy*100:.0f}% of all samples")
print(f"- Of all positive predictions, {precision*100:.0f}% were correct")
print(f"- Of all actual positives, {recall*100:.0f}% were identified")

---
# Module Summary

## Key Takeaways

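- **K-Nearest Neighbors** classifies a point by the majority class of its k nearest neighbors; it needs no explicit training but can be slow on large datasets
- **Decision Trees** split the data recursively based on feature values
- **Random Forests** combine multiple trees to reduce overfitting and improve performance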
- Understanding when and how to apply these techniques is crucial for success
- Model evaluation must use appropriate metrics for the problem type
- Real-world applications require careful consideration of trade-offs
- Practice with diverse datasets strengthens your understanding

## Next Module

In the next module, we'll continue building your machine learning toolkit with more advanced techniques and algorithms. You'll learn how to tackle increasingly complex problems with confidence.

## Additional Practice

For extra practice, try these challenges:

1. Apply K-Nearest Neighbors to the Iris dataset from sklearn
2. Compare Decision Trees with Random Forests on a real dataset
3. Experiment with different hyperparameters and observe the impact
4. Create a complete pipeline from data loading to model evaluation
5. Visualize decision boundaries for different models
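
As a starting point for challenges 3 and 4, the sketch below chains scaling and KNN in a scikit-learn `Pipeline` and searches over `n_neighbors` with cross-validation. The candidate values for k, the 5-fold setting, and the variable names are assumptions for illustration; treat it as one possible layout rather than a reference solution.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load Iris and hold out a stratified test set
X_iris, y_iris = load_iris(return_X_y=True)
X_tr_iris, X_te_iris, y_tr_iris, y_te_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Scaling lives inside the pipeline, so it is refit on each cross-validation fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for k (an illustrative grid, not a recommendation)
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr_iris, y_tr_iris)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te_iris, y_te_iris)
print(f"Best k: {best_k}")
print(f"Cross-validated accuracy: {search.best_score_:.2f}")
print(f"Held-out test accuracy: {test_accuracy:.2f}")
```

Keeping the scaler inside the pipeline means the test set and each validation fold are scaled with statistics learned from training data only.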