# Module 13: Introduction to Machine Learning

## Topics Covered
1. What is Machine Learning?
2. Types of ML (Supervised, Unsupervised, Reinforcement)
3. The ML Workflow
4. Train-Test Split
5. Overview of Common Algorithms (Linear Regression, Logistic Regression, Decision Trees)
6. Model Evaluation Concepts
7. Cross-Validation
8. Introduction to Scikit-Learn

## Learning Objectives

By the end of this module, you will be able to:
- Explain what machine learning is and identify different types of ML problems
- Understand the complete machine learning workflow from problem definition to deployment
- Split data into training and testing sets properly
- Recognize common machine learning algorithms and when to use them
- Understand fundamental evaluation concepts for regression and classification
- Apply cross-validation techniques to assess model performance
- Use scikit-learn's consistent API for machine learning tasks
- Build complete ML pipelines with preprocessing and modeling

---

In [None]:
# Import libraries we'll use throughout this module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report)
from sklearn.preprocessing import StandardScaler
# Note: load_boston was removed in scikit-learn 1.2 and is not used in this module
from sklearn.datasets import load_iris, make_classification

# Set random seed for reproducibility
np.random.seed(42)

# Configure display options
pd.set_option('display.max_columns', None)
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

---
# Section 1: What is Machine Learning?
---

## Definition

Machine Learning (ML) is a subset of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed for each specific task.

### Traditional Programming vs Machine Learning

**Traditional Programming:**
```
Data + Rules → Program → Output
```

**Machine Learning:**
```
Data + Output → ML Algorithm → Rules (Model)
```

### Why Machine Learning?

Machine learning is useful when:
- Rules are too complex to code manually (e.g., image recognition)
- Rules change over time (e.g., spam detection)
- Patterns are not obvious to humans (e.g., customer behavior)
- Personalization is required (e.g., recommendations)

### Real-World Applications

| Domain | Application |
|--------|-------------|
| Healthcare | Disease diagnosis, drug discovery |
| Finance | Fraud detection, credit scoring |
| Retail | Product recommendations, demand forecasting |
| Transportation | Self-driving cars, route optimization |
| Marketing | Customer segmentation, churn prediction |

In [None]:
# Simple example: Traditional programming vs ML approach
# Task: Predict house prices

# Traditional approach - explicit rules (oversimplified)
def predict_price_traditional(sqft, bedrooms, location_score):
    """Manual rules for house pricing - hard to get right!"""
    base_price = 50000
    price = base_price + (sqft * 100) + (bedrooms * 10000) + (location_score * 20000)
    return price

# ML approach - learn from data
# Generate synthetic housing data
np.random.seed(42)
n_houses = 100

sqft = np.random.uniform(1000, 3000, n_houses)
bedrooms = np.random.randint(1, 6, n_houses)
location_score = np.random.uniform(1, 10, n_houses)

# True relationship (unknown to us in real scenarios)
actual_prices = (50000 + sqft * 150 + bedrooms * 15000 + 
                 location_score * 25000 + np.random.normal(0, 20000, n_houses))

# Create DataFrame
housing_data = pd.DataFrame({
    'sqft': sqft,
    'bedrooms': bedrooms,
    'location_score': location_score,
    'price': actual_prices
})

print("Sample Housing Data:")
print(housing_data.head(10))
print(f"\nDataset shape: {housing_data.shape}")

In [None]:
# Compare traditional vs ML predictions
# Traditional prediction
housing_data['traditional_pred'] = housing_data.apply(
    lambda row: predict_price_traditional(row['sqft'], row['bedrooms'], row['location_score']),
    axis=1
)

# ML prediction (Linear Regression - we'll learn this soon)
X = housing_data[['sqft', 'bedrooms', 'location_score']]
y = housing_data['price']

model = LinearRegression()
model.fit(X, y)
housing_data['ml_pred'] = model.predict(X)

# Compare errors (note: both are measured on the same data the ML model was trained on)
traditional_error = np.sqrt(mean_squared_error(housing_data['price'], housing_data['traditional_pred']))
ml_error = np.sqrt(mean_squared_error(housing_data['price'], housing_data['ml_pred']))

print("Prediction Error Comparison (RMSE):")
print(f"  Traditional rules: ${traditional_error:,.2f}")
print(f"  Machine Learning:  ${ml_error:,.2f}")
print(f"\nML model reduces error by {((traditional_error - ml_error) / traditional_error * 100):.1f}%")

---
# Section 2: Types of Machine Learning
---

## Three Main Categories

### 1. Supervised Learning
- Learning from labeled data (input-output pairs)
- Goal: Learn a mapping from inputs to outputs
- Examples: Spam detection, price prediction, image classification

**Two types:**
- **Regression**: Predict continuous values (e.g., house prices)
- **Classification**: Predict categories (e.g., spam vs. not spam)

### 2. Unsupervised Learning
- Learning from unlabeled data
- Goal: Find patterns or structure in data
- Examples: Customer segmentation, anomaly detection

**Common techniques:**
- Clustering (K-means, hierarchical)
- Dimensionality reduction (PCA)
- Association rules

### 3. Reinforcement Learning
- Learning through interaction with an environment
- Goal: Maximize cumulative reward
- Examples: Game playing, robotics, autonomous vehicles
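
Reinforcement learning is not explored further in this module, but the idea of maximizing cumulative reward through interaction can be illustrated with a minimal sketch. The example below is an epsilon-greedy agent on a hypothetical three-armed bandit; the win probabilities, `epsilon`, and the number of steps are illustrative assumptions rather than values from this module.

```python
import random

random.seed(42)

# Hypothetical environment: three slot machines with hidden win probabilities
true_win_prob = [0.2, 0.5, 0.8]
n_arms = len(true_win_prob)

estimates = [0.0] * n_arms   # agent's running estimate of each arm's reward
pulls = [0] * n_arms         # how many times each arm has been tried
epsilon = 0.1                # fraction of steps spent exploring (assumed value)
total_reward = 0

for step in range(5000):
    # Explore with probability epsilon, otherwise exploit the best current estimate
    if random.random() < epsilon:
        arm = random.randrange(n_arms)
    else:
        arm = max(range(n_arms), key=lambda a: estimates[a])

    # The environment returns a reward of 1 (win) or 0 (loss)
    reward = 1 if random.random() < true_win_prob[arm] else 0
    total_reward += reward

    # Update the running average reward for the chosen arm
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

best_arm = max(range(n_arms), key=lambda a: estimates[a])
print(f"Estimated rewards: {[round(e, 2) for e in estimates]}")
print(f"Pulls per arm: {pulls}")
print(f"Arm the agent prefers: {best_arm}")
```

Unlike supervised learning, the agent is never told which arm is correct; it only observes the reward of the action it chose and adjusts its estimates accordingly.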

In [None]:
# Visual comparison of Supervised Learning types
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Regression example
np.random.seed(42)
x_reg = np.linspace(0, 10, 50)
y_reg = 2 * x_reg + 3 + np.random.normal(0, 2, 50)

axes[0].scatter(x_reg, y_reg, alpha=0.6)
axes[0].plot(x_reg, 2 * x_reg + 3, color='red', linewidth=2, label='True relationship (y = 2x + 3)')
axes[0].set_xlabel('Feature (X)')
axes[0].set_ylabel('Target (Y) - Continuous')
axes[0].set_title('Regression: Predict Continuous Values')
axes[0].legend()

# Classification example
from sklearn.datasets import make_blobs
X_class, y_class = make_blobs(n_samples=100, centers=2, random_state=42, cluster_std=1.5)

colors = ['blue' if label == 0 else 'orange' for label in y_class]
axes[1].scatter(X_class[:, 0], X_class[:, 1], c=colors, alpha=0.6)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Classification: Predict Categories')

# Add legend manually
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='blue', label='Class 0'),
                   Patch(facecolor='orange', label='Class 1')]
axes[1].legend(handles=legend_elements)

plt.tight_layout()
plt.show()

In [None]:
# Unsupervised Learning example: Clustering
from sklearn.cluster import KMeans

# Generate unlabeled data
np.random.seed(42)
X_unlabeled, _ = make_blobs(n_samples=150, centers=3, random_state=42, cluster_std=1.0)

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_unlabeled)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before clustering
axes[0].scatter(X_unlabeled[:, 0], X_unlabeled[:, 1], alpha=0.6)
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('Before Clustering (No Labels)')

# After clustering
scatter = axes[1].scatter(X_unlabeled[:, 0], X_unlabeled[:, 1], 
                          c=cluster_labels, cmap='viridis', alpha=0.6)
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                marker='X', s=200, c='red', edgecolor='black', linewidth=2,
                label='Cluster Centers')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('After K-Means Clustering')
axes[1].legend()

plt.tight_layout()
plt.show()

---
# Section 3: The Machine Learning Workflow
---

## Standard ML Pipeline

```
1. Define Problem → 2. Collect Data → 3. Prepare Data → 4. Explore Data
                                                              ↓
8. Deploy Model ← 7. Fine-tune ← 6. Evaluate Model ← 5. Build Model
```

### Step-by-Step Breakdown

1. **Define the Problem**
   - What are you trying to predict?
   - Is it regression or classification?
   - What metrics define success?

2. **Collect Data**
   - Gather relevant data
   - Ensure data quality
   - Consider data privacy

3. **Prepare Data**
   - Handle missing values
   - Encode categorical variables
   - Scale/normalize features

4. **Explore Data (EDA)**
   - Understand distributions
   - Identify patterns and correlations
   - Detect outliers

5. **Build Model**
   - Select algorithm(s)
   - Split data (train/test)
   - Train the model

6. **Evaluate Model**
   - Test on unseen data
   - Calculate performance metrics
   - Compare with baseline

7. **Fine-tune**
   - Adjust hyperparameters
   - Try different algorithms
   - Feature engineering

8. **Deploy**
   - Put model into production
   - Monitor performance
   - Update as needed

In [None]:
# Let's walk through a complete ML workflow example
# Problem: Predict if a customer will make a purchase based on their behavior

# Step 1: Define Problem
print("STEP 1: Define Problem")
print("="*50)
print("Task: Predict customer purchase (Yes/No)")
print("Type: Binary Classification")
print("Success Metric: Accuracy, with focus on Recall")
print()

In [None]:
# Step 2 & 3: Collect and Prepare Data
print("STEP 2 & 3: Collect and Prepare Data")
print("="*50)

# Generate synthetic customer data
np.random.seed(42)
n_customers = 500

customer_data = pd.DataFrame({
    'age': np.random.randint(18, 70, n_customers),
    'income': np.random.normal(50000, 20000, n_customers).clip(20000, 150000),
    'time_on_site': np.random.exponential(5, n_customers),  # minutes
    'pages_viewed': np.random.poisson(5, n_customers),
    'previous_purchases': np.random.poisson(2, n_customers),
    'email_subscribed': np.random.choice([0, 1], n_customers, p=[0.4, 0.6])
})

# Create target variable (purchase) based on features
purchase_probability = (
    0.1 + 
    0.01 * (customer_data['income'] / 10000) +
    0.05 * customer_data['time_on_site'] +
    0.03 * customer_data['pages_viewed'] +
    0.1 * customer_data['previous_purchases'] +
    0.15 * customer_data['email_subscribed']
).clip(0, 1)

customer_data['purchased'] = np.random.binomial(1, purchase_probability)

print(f"Dataset shape: {customer_data.shape}")
print(f"\nFeatures: {list(customer_data.columns[:-1])}")
print(f"Target: purchased")
print(f"\nClass distribution:")
print(customer_data['purchased'].value_counts())
print(f"\nFirst few rows:")
print(customer_data.head())

In [None]:
# Step 4: Explore Data (Quick EDA)
print("STEP 4: Explore Data")
print("="*50)

# Summary statistics
print("Summary Statistics:")
print(customer_data.describe().round(2))

# Correlation with target
print("\nCorrelation with 'purchased':")
correlations = customer_data.corr()['purchased'].drop('purchased').sort_values(ascending=False)
print(correlations)

In [None]:
# Visualize feature distributions by purchase status
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

features = ['age', 'income', 'time_on_site', 'pages_viewed', 'previous_purchases', 'email_subscribed']

for ax, feature in zip(axes.flat, features):
    if feature == 'email_subscribed':
        # Bar plot for binary feature
        customer_data.groupby(['email_subscribed', 'purchased']).size().unstack().plot(
            kind='bar', ax=ax, alpha=0.7)
        ax.set_xlabel(feature)
        ax.legend(['No Purchase', 'Purchase'])
    else:
        # Histogram for continuous features
        customer_data[customer_data['purchased']==0][feature].hist(
            ax=ax, alpha=0.5, label='No Purchase', bins=20)
        customer_data[customer_data['purchased']==1][feature].hist(
            ax=ax, alpha=0.5, label='Purchase', bins=20)
        ax.set_xlabel(feature)
        ax.legend()
    ax.set_title(f'Distribution of {feature}')

plt.tight_layout()
plt.show()

---
# Section 4: Train-Test Split
---

## Why Split the Data?

We need to evaluate our model on data it hasn't seen during training. This helps us:

1. **Estimate real-world performance**: How well will the model work on new data?
2. **Detect overfitting**: Is the model memorizing training data instead of learning patterns?
3. **Compare models fairly**: Consistent evaluation across different models

### Common Split Ratios

| Split | Training | Testing | Use Case |
|-------|----------|---------|----------|
| 80/20 | 80% | 20% | Most common, general purpose |
| 70/30 | 70% | 30% | When you need more test data |
| 90/10 | 90% | 10% | Large datasets |

### Important Considerations

- **Random sampling**: Ensure representative splits
- **Stratification**: Maintain class proportions in imbalanced datasets
- **No data leakage**: Test data should never influence training

## Syntax

```python
from sklearn.model_selection import train_test_split

# Basic split
X_train, X_test, y_train, y_test = train_test_split(
    X,              # Features
    y,              # Target
    test_size=0.2,  # 20% for testing
    random_state=42 # For reproducibility
)

# Stratified split (for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,     # Maintain class proportions
    random_state=42
)
```

In [None]:
# Step 5 (Part 1): Split the data
print("STEP 5 (Part 1): Train-Test Split")
print("="*50)

# Separate features and target
X = customer_data.drop('purchased', axis=1)
y = customer_data['purchased']

# Split with stratification (important for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify=y,
    random_state=42
)

print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

print(f"\nClass distribution in original data:")
print(y.value_counts(normalize=True))

print(f"\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))

print(f"\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))

In [None]:
# Visualize the split
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Dataset sizes
sizes = [len(X_train), len(X_test)]
labels = [f'Training\n({len(X_train)} samples)', f'Testing\n({len(X_test)} samples)']
colors = ['steelblue', 'coral']
axes[0].pie(sizes, labels=labels, colors=colors, autopct='%1.0f%%', startangle=90)
axes[0].set_title('Train-Test Split')

# Class distribution
x_pos = np.arange(2)
width = 0.35

train_dist = y_train.value_counts().sort_index().values
test_dist = y_test.value_counts().sort_index().values

bars1 = axes[1].bar(x_pos - width/2, train_dist, width, label='Training', color='steelblue')
bars2 = axes[1].bar(x_pos + width/2, test_dist, width, label='Testing', color='coral')

axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].set_title('Class Distribution After Split')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(['No Purchase (0)', 'Purchase (1)'])
axes[1].legend()

plt.tight_layout()
plt.show()

---
# Section 5: Overview of Common Machine Learning Algorithms
---

In this section, we'll briefly introduce three fundamental machine learning algorithms. Linear and Logistic Regression will be previewed here and covered in depth in dedicated modules. Decision Trees will be fully explored in this module.

## Algorithm Types

| Algorithm | Type | Use Case | Covered In |
|-----------|------|----------|------------|
| Linear Regression | Regression | Predict continuous values | Module 14 |
| Logistic Regression | Classification | Predict binary outcomes | Module 15 |
| Decision Trees | Both | Interpretable models | This Module (Full Coverage) |

---

## Linear Regression (Preview)

Linear regression is one of the simplest supervised learning algorithms for regression tasks. It models the relationship between features and a continuous target by fitting a straight line (or, with several features, a hyperplane).

**Equation:** `y = b0 + b1*x1 + b2*x2 + ... + bn*xn`

**When to use:**
- Predicting continuous values (prices, temperatures, sales)
- Understanding linear relationships between variables
- When interpretability is important

**Key strengths:**
- Fast to train
- Highly interpretable coefficients
- Works well when the relationship between features and target is approximately linear

**Note:** Module 14 covers Linear Regression in detail, including polynomial regression, regularization (Ridge, Lasso), and advanced diagnostics.

In [None]:
# Quick example: Linear Regression syntax (full coverage in Module 14)
from sklearn.linear_model import LinearRegression

# Use housing data from Section 1
X_house = housing_data[['sqft', 'bedrooms', 'location_score']]
y_house = housing_data['price']

# Simple sklearn API pattern
model_lr = LinearRegression()
model_lr.fit(X_house, y_house)

print("Linear Regression Model Created!")
print(f"Coefficients: {model_lr.coef_}")
print(f"Intercept: ${model_lr.intercept_:,.2f}")
print(f"\nThis demonstrates the basic sklearn API pattern.")
print(f"We'll explore this algorithm in depth in Module 14.")

---

## Logistic Regression (Preview)

Despite its name, logistic regression is a **classification** algorithm. It predicts the probability that an observation belongs to a particular class using the sigmoid function.

**How it works:**
1. Calculates a linear combination of features
2. Applies the sigmoid function: `P(y=1) = 1 / (1 + e^(-z))`
3. Converts the probability to a class label using a threshold (usually 0.5)
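
The three steps above can be traced by hand for a single observation. This is a minimal sketch using only the standard library; the intercept, coefficients, and feature values are hypothetical numbers chosen for illustration, not values learned from the module's data.

```python
import math

def sigmoid(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, coefficients, and feature values for one observation
intercept = -1.0
coefficients = [0.8, -0.5]
features = [2.0, 1.0]

# Step 1: linear combination of features
z = intercept + sum(c * x for c, x in zip(coefficients, features))

# Step 2: the sigmoid turns z into P(y=1)
probability = sigmoid(z)

# Step 3: apply the 0.5 threshold to get a class label
predicted_class = 1 if probability >= 0.5 else 0

print(f"z = {z:.2f}, P(y=1) = {probability:.4f}, class = {predicted_class}")
```

Here `z` is 0.10, which the sigmoid maps to a probability of about 0.525, so the observation is assigned to class 1. A `z` of exactly 0 corresponds to a probability of 0.5, the default decision boundary.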

**When to use:**
- Binary classification (spam/not spam, fraud/legit)
- When you need probability estimates
- When interpretability is important

**Key strengths:**
- Provides probability estimates
- Works well with linearly separable classes
- Fast and efficient

**Note:** Module 15 covers Logistic Regression in detail, including multiclass classification, decision boundaries, ROC curves, and class imbalance handling.

In [None]:
# Quick example: Logistic Regression syntax (full coverage in Module 15)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Use customer data from Section 3
X_customer = customer_data.drop('purchased', axis=1)
y_customer = customer_data['purchased']

# Scale features (good practice for logistic regression)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_customer)

# Same sklearn API pattern!
model_log = LogisticRegression(random_state=42)
model_log.fit(X_scaled, y_customer)

# Make predictions
sample_prediction = model_log.predict(X_scaled[:5])
sample_probabilities = model_log.predict_proba(X_scaled[:5])

print("Logistic Regression Model Created!")
print(f"First 5 predictions: {sample_prediction}")
print(f"First 5 probabilities (class 0, class 1):")
print(sample_probabilities)
print(f"\nNotice the consistent sklearn API: fit(), predict(), predict_proba()")
print(f"We'll explore this algorithm in depth in Module 15.")

---

## Decision Trees (Full Coverage)

Decision trees are versatile algorithms that can perform both classification and regression. Unlike Linear and Logistic Regression which we'll cover in dedicated modules, we'll fully explore Decision Trees here as a foundational algorithm.

### How They Work

1. Start with all data at the root node
2. Find the best feature and value to split the data
3. Create child nodes for each split
4. Repeat until stopping criteria are met
5. Assign predictions at leaf nodes
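
Step 2 is the heart of the algorithm. The sketch below shows one way the "best" split on a single numeric feature could be found, using Gini impurity as the quality measure. It is a simplified illustration, not scikit-learn's actual implementation; the helper functions `gini` and `best_split` and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try a threshold midway between each pair of adjacent distinct values
    and keep the one with the lowest weighted Gini impurity."""
    best_threshold, best_impurity = None, float("inf")
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity

# Hypothetical toy data: minutes on site and whether the visitor purchased
time_on_site = [1, 2, 3, 4, 6, 7, 8, 9]
purchased = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(time_on_site, purchased)
print(f"Root impurity: {gini(purchased):.3f}")
print(f"Best split: time_on_site <= {threshold} (weighted Gini = {impurity:.3f})")
```

On this toy data the chosen split is `time_on_site <= 5.0`, which lowers the impurity from about 0.469 at the root to 0.1875. A full tree repeats this search over every feature and then recurses into each child node.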

### Advantages

- Easy to understand and interpret
- Handle both numerical and categorical data (in scikit-learn, categorical features must first be encoded as numbers)
- Require little data preprocessing
- Can capture non-linear relationships

### Disadvantages

- Prone to overfitting
- Can be unstable (small changes in data = different tree)
- May create biased trees with imbalanced data

## Syntax

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# For Classification
clf = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth to prevent overfitting
    min_samples_split=10,  # Minimum samples to split a node
    min_samples_leaf=5,    # Minimum samples in a leaf
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For Regression
reg = DecisionTreeRegressor(
    max_depth=5,
    random_state=42
)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Feature importance
importances = clf.feature_importances_
```

In [None]:
# Decision Tree for customer purchase prediction
print("Decision Tree: Customer Purchase Prediction")
print("="*50)

# Create and train model (use unscaled data - trees don't need scaling)
dt_model = DecisionTreeClassifier(
    max_depth=4,          # Limit depth to prevent overfitting and aid visualization
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42
)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test)

# Evaluate
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"\nDecision Tree Accuracy: {accuracy_dt:.4f} ({accuracy_dt*100:.1f}%)")

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['No Purchase', 'Purchase']))

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': dt_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance:")
print(feature_importance.to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Decision Tree Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt_model, 
          feature_names=list(X_train.columns),  # pass a plain list; some scikit-learn versions reject a pandas Index
          class_names=['No Purchase', 'Purchase'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree Visualization')
plt.tight_layout()
plt.show()

In [None]:
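# Compare models: Logistic Regression vs Decision Tree
print("Model Comparison")
print("="*50)

# y_pred_class: logistic regression predictions on the test set.
# The scaler is fit on the training split only so test data does not leak into training.
scaler_cmp = StandardScaler()
log_cmp = LogisticRegression(random_state=42, max_iter=1000)
log_cmp.fit(scaler_cmp.fit_transform(X_train), y_train)
y_pred_class = log_cmp.predict(scaler_cmp.transform(X_test))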

comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score'],
    'Logistic Regression': [
        accuracy_score(y_test, y_pred_class),
        precision_score(y_test, y_pred_class),
        recall_score(y_test, y_pred_class),
        f1_score(y_test, y_pred_class)
    ],
    'Decision Tree': [
        accuracy_score(y_test, y_pred_dt),
        precision_score(y_test, y_pred_dt),
        recall_score(y_test, y_pred_dt),
        f1_score(y_test, y_pred_dt)
    ]
})

comparison_df['Logistic Regression'] = comparison_df['Logistic Regression'].round(4)
comparison_df['Decision Tree'] = comparison_df['Decision Tree'].round(4)
print(comparison_df.to_string(index=False))

---
# Section 6: Model Evaluation Concepts
---

## Why Evaluation Matters

After training a model, we need to assess how well it performs. The metrics we use depend on the problem type: regression or classification.

## Regression Metrics

For predicting continuous values (like prices, temperatures):

| Metric | What It Measures | Interpretation |
|--------|-----------------|----------------|
| MSE | Mean Squared Error | Lower is better; penalizes large errors heavily |
| RMSE | Root Mean Squared Error | Same units as target; easier to interpret than MSE |
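| MAE | Mean Absolute Error | Average absolute difference; less sensitive to outliers than MSE/RMSE |
| R² | R-squared | Proportion of variance explained (at most 1, higher is better; can be negative on test data when the model does worse than predicting the mean) |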

**When to use which:**
- Use **R²** for overall model quality
- Use **RMSE** when large errors are particularly bad
- Use **MAE** when each unit of error should count the same, regardless of how large the individual error is
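
To make the differences between these metrics concrete, the sketch below computes all four by hand with NumPy. The true and predicted values are hypothetical numbers chosen so that one prediction misses badly.

```python
import numpy as np

# Hypothetical true and predicted house prices (in thousands of dollars)
y_true = np.array([200.0, 250.0, 300.0, 350.0, 400.0])
y_hat = np.array([210.0, 240.0, 310.0, 330.0, 450.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)                       # squaring amplifies the one large miss
rmse = np.sqrt(mse)                              # back in the target's units
mae = np.mean(np.abs(errors))                    # every unit of error counts the same
ss_res = np.sum(errors ** 2)                     # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.1f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.1f}")
print(f"R²:   {r2:.3f}")
```

RMSE (about 25.3) comes out higher than MAE (20.0) because the single error of 50 dominates once errors are squared. The same quantities are available as `mean_squared_error`, `mean_absolute_error`, and `r2_score` in `sklearn.metrics`.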

## Classification Metrics

For predicting categories (like spam/not spam, fraud/legit):

| Metric | What It Measures | When to Prioritize |
|--------|-----------------|-------------------|
| Accuracy | Overall correctness | Balanced datasets |
| Precision | Correctness of positive predictions | When false positives are costly |
| Recall | Coverage of actual positives | When false negatives are costly |
| F1-Score | Balance between precision and recall | When you need both |

### The Confusion Matrix

```
                  Predicted
                  Negative    Positive
Actual  Negative     TN         FP
        Positive     FN         TP
```

- **True Positive (TP)**: Correctly predicted positive
- **True Negative (TN)**: Correctly predicted negative
- **False Positive (FP)**: Incorrectly predicted positive (Type I error)
- **False Negative (FN)**: Incorrectly predicted negative (Type II error)

**Formulas:**
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
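
As a worked example of these formulas, the sketch below plugs in hypothetical confusion-matrix counts for an imbalanced problem. The counts are invented for illustration only.

```python
# Hypothetical confusion-matrix counts for a fraud detector (1,000 transactions)
tp, tn, fp, fn = 40, 900, 10, 50

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Accuracy is 0.940 even though recall is only about 0.444, meaning fewer than half of the actual fraud cases are caught. This is why accuracy alone can be misleading when classes are imbalanced.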

**Note:** We'll practice calculating and interpreting these metrics extensively in Modules 14-16.

---
# Section 7: Cross-Validation
---

## What is Cross-Validation?

Cross-validation is a technique to evaluate model performance more reliably by training and testing on different subsets of data multiple times.

### Why Use Cross-Validation?

- **More reliable estimates**: Single train-test split can be misleading
- **Better use of data**: Every data point is used for testing exactly once and for training in the remaining rounds
- **Detect overfitting**: Large gap between training and CV scores indicates overfitting

### K-Fold Cross-Validation

1. Split data into K folds of (roughly) equal size
2. For each fold:
   - Use that fold as test set
   - Use remaining K-1 folds as training set
   - Calculate performance metric
3. Average the K results
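
The three steps above can be written out as an explicit loop. The sketch below does this on a hypothetical dataset from `make_classification` and then checks that `cross_val_score` gives the same fold scores; the names `X_cv`, `y_cv`, and `kf_demo` and the choice of a depth-3 tree are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset generated only for this illustration
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=42)

# Step 1: define the K folds
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

# Step 2: each fold takes one turn as the test set
for train_idx, test_idx in kf_demo.split(X_cv):
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X_cv[train_idx], y_cv[train_idx])       # train on the other K-1 folds
    preds = tree.predict(X_cv[test_idx])             # predict the held-out fold
    fold_scores.append(accuracy_score(y_cv[test_idx], preds))

# Step 3: average the K results
manual_mean = np.mean(fold_scores)
print(f"Manual K-fold mean accuracy:   {manual_mean:.4f}")

# cross_val_score runs the same loop in a single call
auto_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=42), X_cv, y_cv, cv=kf_demo
)
print(f"cross_val_score mean accuracy: {auto_scores.mean():.4f}")
```

Note that a fresh model is created inside the loop for every fold, so nothing learned from one fold's test data carries over to the next.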

## Syntax

```python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold

# Basic cross-validation
scores = cross_val_score(
    model,           # The model to evaluate
    X,               # Features
    y,               # Target
    cv=5,            # Number of folds
    scoring='accuracy'  # Metric to use
)

print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")

# Custom K-Fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

# Stratified K-Fold (for classification)
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skfold)
```

In [None]:
# Visualize K-Fold Cross-Validation
from sklearn.model_selection import KFold

# Create sample data indices
n_samples = 25
indices = np.arange(n_samples)

# Set up 5-fold CV
kf = KFold(n_splits=5, shuffle=False)

fig, ax = plt.subplots(figsize=(12, 4))

for fold_idx, (train_idx, test_idx) in enumerate(kf.split(indices)):
    # Plot training indices
    ax.scatter(train_idx, [fold_idx] * len(train_idx), c='steelblue', s=100, 
               label='Training' if fold_idx == 0 else '')
    # Plot test indices
    ax.scatter(test_idx, [fold_idx] * len(test_idx), c='coral', s=100,
               label='Testing' if fold_idx == 0 else '')

ax.set_xlabel('Sample Index')
ax.set_ylabel('Fold')
ax.set_yticks(range(5))
ax.set_yticklabels([f'Fold {i+1}' for i in range(5)])
ax.set_title('5-Fold Cross-Validation Visualization')
ax.legend(loc='upper right')
plt.tight_layout()
plt.show()

In [None]:
# Apply cross-validation to our models
from sklearn.model_selection import cross_val_score, StratifiedKFold

print("Cross-Validation Results (5-Fold)")
print("="*50)

# Use stratified K-fold for classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

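# Logistic Regression (the scaler lives inside a pipeline so it is refit on each training fold,
# keeping the held-out fold out of the scaling statistics)
from sklearn.pipeline import make_pipeline
lr_cv = make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=1000))
lr_scores = cross_val_score(lr_cv, X, y, cv=cv, scoring='accuracy')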

print(f"\nLogistic Regression:")
print(f"  Scores per fold: {[f'{s:.4f}' for s in lr_scores]}")
print(f"  Mean accuracy: {lr_scores.mean():.4f} (+/- {lr_scores.std()*2:.4f})")

# Decision Tree
dt_cv = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_scores = cross_val_score(dt_cv, X, y, cv=cv, scoring='accuracy')

print(f"\nDecision Tree:")
print(f"  Scores per fold: {[f'{s:.4f}' for s in dt_scores]}")
print(f"  Mean accuracy: {dt_scores.mean():.4f} (+/- {dt_scores.std()*2:.4f})")

In [None]:
# Compare models with multiple metrics using cross-validation
from sklearn.model_selection import cross_validate

# Define scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

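# Evaluate Logistic Regression (scaling inside the pipeline avoids leaking held-out folds)
from sklearn.pipeline import make_pipeline
lr_cv_results = cross_validate(
    make_pipeline(StandardScaler(), LogisticRegression(random_state=42, max_iter=1000)),
    X, y,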
    cv=cv,
    scoring=scoring,
    return_train_score=True
)

# Evaluate Decision Tree
dt_cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=4, random_state=42),
    X, y,
    cv=cv,
    scoring=scoring,
    return_train_score=True
)

# Create comparison table
print("Comprehensive Cross-Validation Comparison")
print("="*60)

for metric in scoring:
    lr_test = lr_cv_results[f'test_{metric}']
    dt_test = dt_cv_results[f'test_{metric}']
    
    print(f"\n{metric.upper()}:")
    print(f"  Logistic Regression: {lr_test.mean():.4f} (+/- {lr_test.std()*2:.4f})")
    print(f"  Decision Tree:       {dt_test.mean():.4f} (+/- {dt_test.std()*2:.4f})")

In [None]:
# Visualize cross-validation results
fig, ax = plt.subplots(figsize=(10, 6))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1']
x = np.arange(len(metrics))
width = 0.35

lr_means = [lr_cv_results[f'test_{m.lower()}'].mean() for m in metrics]
lr_stds = [lr_cv_results[f'test_{m.lower()}'].std() for m in metrics]
dt_means = [dt_cv_results[f'test_{m.lower()}'].mean() for m in metrics]
dt_stds = [dt_cv_results[f'test_{m.lower()}'].std() for m in metrics]

bars1 = ax.bar(x - width/2, lr_means, width, yerr=lr_stds, label='Logistic Regression', capsize=5)
bars2 = ax.bar(x + width/2, dt_means, width, yerr=dt_stds, label='Decision Tree', capsize=5)

ax.set_ylabel('Score')
ax.set_title('Cross-Validation Results Comparison')
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.legend()
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

---
# Section 8: Introduction to Scikit-Learn
---

## What is Scikit-Learn?

Scikit-learn is the most popular machine learning library in Python. It provides:

- **Simple and consistent API**: All models follow the same pattern
- **Comprehensive algorithms**: Classification, regression, clustering, etc.
- **Data preprocessing tools**: Scaling, encoding, imputation
- **Model selection utilities**: Cross-validation, hyperparameter tuning
- **Excellent documentation**: Well-documented with examples

## The Scikit-Learn API Pattern

Supervised scikit-learn estimators follow the same pattern (unsupervised estimators such as `KMeans` omit `y`):

```python
# 1. Import
from sklearn.module import EstimatorClass

# 2. Instantiate
model = EstimatorClass(hyperparameters)

# 3. Fit (train)
model.fit(X_train, y_train)

# 4. Predict
predictions = model.predict(X_test)

# 5. Evaluate
score = model.score(X_test, y_test)
```

In [None]:
# Overview of common scikit-learn modules
print("Scikit-Learn Module Overview")
print("="*60)

modules = [
    ("sklearn.linear_model", "Linear models", 
     ["LinearRegression", "LogisticRegression", "Ridge", "Lasso"]),
    ("sklearn.tree", "Tree-based models",
     ["DecisionTreeClassifier", "DecisionTreeRegressor"]),
    ("sklearn.ensemble", "Ensemble methods",
     ["RandomForestClassifier", "GradientBoostingClassifier"]),
    ("sklearn.svm", "Support Vector Machines",
     ["SVC", "SVR"]),
    ("sklearn.neighbors", "Nearest Neighbors",
     ["KNeighborsClassifier", "KNeighborsRegressor"]),
    ("sklearn.cluster", "Clustering",
     ["KMeans", "DBSCAN", "AgglomerativeClustering"]),
    ("sklearn.preprocessing", "Data preprocessing",
     ["StandardScaler", "MinMaxScaler", "LabelEncoder"]),
    ("sklearn.model_selection", "Model selection",
     ["train_test_split", "cross_val_score", "GridSearchCV"]),
    ("sklearn.metrics", "Evaluation metrics",
     ["accuracy_score", "mean_squared_error", "confusion_matrix"])
]

for module, description, classes in modules:
    print(f"\n{module}")
    print(f"  {description}")
    print(f"  Key classes/functions: {', '.join(classes)}")

In [None]:
# Complete ML pipeline with scikit-learn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

print("Building a Complete ML Pipeline")
print("="*50)

# Create a pipeline that combines preprocessing and modeling
pipeline = Pipeline([
    ('scaler', StandardScaler()),           # Step 1: Scale features
    ('classifier', LogisticRegression())    # Step 2: Train classifier
])

# Fit the entire pipeline
pipeline.fit(X_train, y_train)

# Predict using the pipeline
y_pred_pipeline = pipeline.predict(X_test)

# Evaluate
accuracy = pipeline.score(X_test, y_test)
print(f"\nPipeline Accuracy: {accuracy:.4f}")

# Cross-validate the entire pipeline
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"\nCross-Validation Scores: {[f'{s:.4f}' for s in cv_scores]}")
print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
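
Each pipeline step can be reached by the name it was given, and its hyperparameters are exposed under the `<step>__<parameter>` naming convention. This is the naming a parameter grid uses to target a step inside a pipeline. A minimal sketch, with `demo_pipeline` as an illustrative name:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Individual steps are accessible by the names defined above
print(type(demo_pipeline.named_steps['classifier']).__name__)

# Nested hyperparameters use "<step>__<parameter>"
demo_pipeline.set_params(classifier__C=0.1)
print(demo_pipeline.get_params()['classifier__C'])
```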

In [None]:
# Hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV

print("Hyperparameter Tuning with GridSearchCV")
print("="*50)

# Define parameter grid for Decision Tree
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 5, 10]
}

# Create GridSearchCV
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    return_train_score=True
)

# Fit grid search
grid_search.fit(X_train, y_train)

print(f"\nBest Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test Set Accuracy: {test_accuracy:.4f}")
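
The cost of a grid search grows multiplicatively with the number of values tried per parameter. A small sketch of that arithmetic for the grid above, using `ParameterGrid` to enumerate the combinations (the names `demo_grid`, `n_candidates`, and `n_folds` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Decision Tree grid above
demo_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 5, 10],
}

n_candidates = len(ParameterGrid(demo_grid))  # 5 * 3 * 3 combinations
n_folds = 5                                   # matches cv=5
n_cv_fits = n_candidates * n_folds            # one fit per candidate per fold

print(f"Candidates: {n_candidates}, cross-validation fits: {n_cv_fits}")
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best candidate on the full training data, which is what `best_estimator_` returns.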

In [None]:
# Visualize grid search results
results_df = pd.DataFrame(grid_search.cv_results_)

# Plot heatmap of results for different max_depth and min_samples_split
# (fixing min_samples_leaf at the best value)
best_leaf = grid_search.best_params_['min_samples_leaf']
subset = results_df[results_df['param_min_samples_leaf'] == best_leaf]

pivot_table = subset.pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_min_samples_split'
)

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt='.4f', cmap='YlGnBu')
plt.title(f'Grid Search Results (min_samples_leaf={best_leaf})')
plt.xlabel('min_samples_split')
plt.ylabel('max_depth')
plt.tight_layout()
plt.show()

## Practice Exercise 10.1

**Task:** Use scikit-learn to build a complete machine learning solution for the Iris dataset.

Steps:
1. Load the Iris dataset using `load_iris()`
2. Split into training and testing sets
3. Create a pipeline with StandardScaler and LogisticRegression
4. Perform 5-fold cross-validation
5. Use GridSearchCV to tune the `C` parameter (try [0.01, 0.1, 1, 10, 100])
6. Report the best parameters and test set accuracy

In [None]:
# Your code here
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print(f"Dataset shape: {X_iris.shape}")
print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")


In [None]:
# Solution 10.1
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Step 1: Load dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("Step 1: Load Dataset")
print(f"  Shape: {X_iris.shape}")
print(f"  Classes: {list(iris.target_names)}")

# Step 2: Split data
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_iris, y_iris, test_size=0.2, stratify=y_iris, random_state=42
)
print(f"\nStep 2: Split Data")
print(f"  Training samples: {len(X_train_i)}")
print(f"  Testing samples: {len(X_test_i)}")

# Step 3: Create pipeline
pipeline_iris = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
print(f"\nStep 3: Pipeline Created")
print(f"  Steps: {[step[0] for step in pipeline_iris.steps]}")

# Step 4: Cross-validation
cv_scores_iris = cross_val_score(pipeline_iris, X_train_i, y_train_i, cv=5)
print(f"\nStep 4: Cross-Validation")
print(f"  Scores: {[f'{s:.4f}' for s in cv_scores_iris]}")
print(f"  Mean: {cv_scores_iris.mean():.4f} (+/- {cv_scores_iris.std()*2:.4f})")

# Step 5: GridSearchCV
param_grid_iris = {
    'classifier__C': [0.01, 0.1, 1, 10, 100]
}

grid_iris = GridSearchCV(pipeline_iris, param_grid_iris, cv=5, scoring='accuracy')
grid_iris.fit(X_train_i, y_train_i)

print(f"\nStep 5: GridSearchCV")
print(f"  Best C: {grid_iris.best_params_['classifier__C']}")
print(f"  Best CV Score: {grid_iris.best_score_:.4f}")

# Step 6: Final evaluation
test_accuracy_iris = grid_iris.score(X_test_i, y_test_i)
print(f"\nStep 6: Final Evaluation")
print(f"  Test Set Accuracy: {test_accuracy_iris:.4f}")

# Detailed results
print(f"\nClassification Report:")
y_pred_iris = grid_iris.predict(X_test_i)
print(classification_report(y_test_i, y_pred_iris, target_names=iris.target_names))

---
# Module Summary

## Key Takeaways

### What is Machine Learning?
- ML enables computers to learn patterns from data without being explicitly programmed for each task
- Three main types: Supervised, Unsupervised, and Reinforcement Learning
- Supervised learning includes regression (continuous) and classification (categorical)

### The ML Workflow
1. Define the problem
2. Collect and prepare data
3. Explore data (EDA)
4. Build and train models
5. Evaluate performance
6. Fine-tune and deploy

### Train-Test Split
- Always split data before training to evaluate on unseen data
- Common ratio: 80% training, 20% testing
- Use stratification for imbalanced classification problems
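
To illustrate the stratification point, here is a small sketch on synthetic labels with a 90/10 class imbalance. The data and the `_s` variable names are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 samples of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# stratify=y_imb keeps the class proportions the same in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

print(f"Minority share in train: {y_train_s.mean():.2f}")
print(f"Minority share in test:  {y_test_s.mean():.2f}")
```

Without `stratify`, a random 20-sample test set could contain very few minority samples, or none at all, which makes metrics such as recall unreliable or undefined.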

### Common Algorithms
- **Linear Regression**: Predicts continuous values (covered in Module 14)
- **Logistic Regression**: Classification, binary or multiclass (covered in Module 15)
- **Decision Trees**: Interpretable, handles non-linear relationships

### Evaluation Metrics
- **Regression**: MSE, RMSE, MAE, R-squared
- **Classification**: Accuracy, Precision, Recall, F1-score
- Choose metrics based on business context (e.g., prioritize recall for fraud detection, where a missed case is costly)
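
A short sketch of why the choice of metric matters, using made-up labels in which 5 of 100 transactions are fraudulent and a hypothetical model that always predicts "legitimate":

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 95 legitimate (0) and 5 fraudulent (1) transactions
y_true_demo = np.array([0] * 95 + [1] * 5)

# A trivial "model" that never flags fraud
y_pred_demo = np.zeros(100, dtype=int)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
rec_demo = recall_score(y_true_demo, y_pred_demo)

print(f"Accuracy: {acc_demo:.2f}")  # high, because most samples are legitimate
print(f"Recall:   {rec_demo:.2f}")  # zero, because no fraud case is caught
```

Accuracy looks strong here only because the majority class dominates; recall exposes that the model misses every fraudulent case.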

### Cross-Validation
- More reliable than a single train-test split
- K-Fold CV uses every sample for validation exactly once and for training in the other K-1 folds
- Report mean and standard deviation of scores
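
The way K-Fold rotates the validation fold can be sketched on a toy array of 10 samples. The array and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X_small = np.arange(10).reshape(-1, 1)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample lands in a validation fold
test_counts = np.zeros(len(X_small), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_small), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold}: train size={len(train_idx)}, validation size={len(test_idx)}")

print(f"Times each sample was used for validation: {test_counts}")
```

Every sample is validated on exactly once, so the mean score across folds reflects the whole dataset rather than one particular split.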

### Scikit-Learn
- Consistent API across estimators: `fit()`, `predict()`, `score()` for models, and `fit()`, `transform()` for preprocessing steps
- Pipelines combine preprocessing and modeling
- GridSearchCV for hyperparameter tuning

## Next Modules

Now that you understand the machine learning fundamentals and workflow, we'll dive deep into specific algorithms:

- **Module 14: Linear Regression** - Full coverage including polynomial regression, regularization, and diagnostics
- **Module 15: Logistic Regression** - Complete treatment of binary and multiclass classification
- **Module 16: Classification Algorithms** - K-NN, Naive Bayes, and SVM

## Additional Practice

For extra practice, try these challenges:
1. Load the wine or breast cancer dataset from sklearn and apply the complete ML workflow
2. Compare Decision Tree, Linear Regression, and Logistic Regression on appropriate datasets
3. Create an end-to-end pipeline with preprocessing, modeling, and cross-validation
4. Experiment with different train-test split ratios and observe the impact on model performance