# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 7 - Notebook 01: ML Fundamentals & The ML Lifecycle
**Instructor:** Amir Charkhi |  **Goal:** Understanding Machine Learning Foundations

> Format: theory → implementation → best practices → real-world application.

## Welcome to Machine Learning! 🎯

**Learning Objectives:**
- Understand what machine learning is and when to use it
- Learn the complete ML lifecycle that you'll use for every project
- Distinguish between classification and regression problems
- Master train/test split - the foundation of all ML
- Build your first ML model end-to-end
- Establish a framework you'll use for Weeks 7-12

**Prerequisites:** Python, pandas, data visualization basics



## 🤔 What Exactly is Machine Learning?

Let's start with the big picture. You've spent Phase 1 learning to:
- Clean data (Pandas)
- Visualize patterns (Plotly, dashboards)
- Query databases (SQL)
- Understand causality (A/B testing)

**Machine Learning is the next step**: Teaching computers to find patterns and make predictions.

### Traditional Programming vs Machine Learning

**Traditional Programming:**
```
Rules + Data → Output
```
Example: "If temperature > 30°C, send 'It's hot' alert"

**Machine Learning:**
```
Data + Output → Rules (learned automatically)
```
Example: "Here are 10,000 images of cats and dogs. Learn to distinguish them yourself."
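
To make the contrast concrete, here is a minimal sketch using the temperature example. It assumes a made-up dataset of integer temperatures from 0 to 49°C labelled with the hand-written rule, and uses a one-split decision tree (`max_depth=1`) as the learner. The names `is_hot_rule`, `temps`, and `stump` are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: a person writes the rule.
def is_hot_rule(temp_c):
    return temp_c > 30

# Machine learning: we supply data plus the desired outputs,
# and the algorithm recovers a rule on its own.
temps = np.arange(0, 50).reshape(-1, 1)      # 0..49 °C, one feature column
labels = (temps.ravel() > 30).astype(int)    # 1 = "hot", 0 = "not hot"

stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(temps, labels)

# The root node of the fitted tree stores the split point it learned.
learned_threshold = stump.tree_.threshold[0]
print("Hand-written threshold: 30")
print(f"Learned threshold: {learned_threshold}")
```

The learned split should land between 30 and 31, because that is the only place where the labels change. Nobody told the model about "30"; it inferred the boundary from the examples.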

### When to Use ML vs Traditional Analysis

**Use ML when:**
- ✅ The rules are too complex to code manually
- ✅ You need predictions ("What will happen?")
- ✅ Patterns exist in data but aren't obvious
- ✅ You have enough historical data

**Don't use ML when:**
- ❌ Simple rules work fine (don't use ML to check if a number is even)
- ❌ You mainly need to explain WHY something happens (many ML models are hard to interpret)
- ❌ You have too little data (as a rough rule of thumb, fewer than ~100 samples)
- ❌ The stakes are too high to act on predictions you can't explain (e.g., medical diagnosis)

In [None]:
# Essential imports for this week
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris, load_diabetes
from sklearn.metrics import accuracy_score, mean_squared_error

import warnings
# Silence library warnings to keep notebook output tidy (remove this line while debugging)
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("📚 Library Guide for Week 7:")
print("")
print("1. scikit-learn (sklearn): The #1 ML library in Python")
print("   - Consistent API for all algorithms")
print("   - Built-in datasets for learning")
print("   - Comprehensive evaluation tools")
print("")
print("2. Key sklearn modules we'll use:")
print("   - model_selection: Split data, cross-validation")
print("   - linear_model: Linear/Logistic regression")
print("   - tree: Decision trees")
print("   - metrics: Evaluate model performance")
print("")
print("✅ All libraries loaded and ready!")

---

## 🎯 The Two Main Types of ML Problems

Before we dive into the lifecycle, you need to know what type of problem you're solving.

### 1. Classification: Predicting Categories

**Question format:** "Which group does this belong to?"

**Examples:**
- Is this email spam or not spam? (Binary classification)
- Will this customer churn? (Binary: Yes/No)
- What species is this flower? (Multi-class: setosa/versicolor/virginica)
- Which product category? (Multi-class: electronics/clothing/food)

**Output:** Discrete categories (labels)

### 2. Regression: Predicting Numbers

**Question format:** "What's the value?"

**Examples:**
- What will the house price be? (Continuous number: $450,000)
- How much revenue next month? (Continuous: $1.2M)
- What's the temperature tomorrow? (Continuous: 23.5°C)
- Customer lifetime value? (Continuous: $2,340)

**Output:** Continuous numbers
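
The difference shows up directly in what `predict` returns. The sketch below fits a classifier on the Iris data and a regressor on simulated house prices (size in square feet against price, with made-up noise). The simulated numbers are assumptions for illustration, not real market data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a category (iris species coded 0, 1, 2).
iris = load_iris()
clf = LogisticRegression(max_iter=1000)
clf.fit(iris.data, iris.target)
# Predicting on a few training rows here, only to show the output type.
class_preds = clf.predict(iris.data[:5])
print("Classifier output:", class_preds)          # integer class labels

# Regression: the target is a number (simulated house prices).
rng = np.random.default_rng(42)
size_sqft = rng.uniform(800, 3000, size=100)
price = 200 * size_sqft + rng.normal(0, 50_000, size=100)

reg = LinearRegression()
reg.fit(size_sqft.reshape(-1, 1), price)
price_preds = reg.predict(np.array([[1500.0], [2500.0]]))
print("Regressor output:", price_preds.round(0))  # continuous values
```

The classifier can only ever answer with one of the labels it saw during training, while the regressor can return any number on a continuous scale.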

In [None]:
# Let's visualize the difference
print("🎨 VISUALIZING CLASSIFICATION vs REGRESSION\n")

# Create sample data
np.random.seed(42)

# Classification example: Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species_name'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Regression example: Simple house price simulation
size_sqft = np.random.uniform(800, 3000, 100)
price = 200 * size_sqft + np.random.normal(0, 50000, 100)
house_df = pd.DataFrame({'size_sqft': size_sqft, 'price': price})

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Classification plot
for species in iris_df['species_name'].unique():
    subset = iris_df[iris_df['species_name'] == species]
    axes[0].scatter(subset['petal length (cm)'], subset['petal width (cm)'], 
                    label=species, s=100, alpha=0.6)
axes[0].set_xlabel('Petal Length (cm)', fontsize=12)
axes[0].set_ylabel('Petal Width (cm)', fontsize=12)
axes[0].set_title('CLASSIFICATION: Predict Discrete Categories\n(Iris Species)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Regression plot
axes[1].scatter(house_df['size_sqft'], house_df['price'], alpha=0.6, s=100, color='coral')
z = np.polyfit(house_df['size_sqft'], house_df['price'], 1)
p = np.poly1d(z)
axes[1].plot(house_df['size_sqft'], p(house_df['size_sqft']), "r--", linewidth=2, label='Trend line')
axes[1].set_xlabel('House Size (sqft)', fontsize=12)
axes[1].set_ylabel('Price ($)', fontsize=12)
axes[1].set_title('REGRESSION: Predict Continuous Values\n(House Prices)', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("  LEFT: Classification → Predicting which colored group a point belongs to")
print("  RIGHT: Regression → Predicting a numeric y-value (the trend line shows the model's estimate)")

---

## 🔄 The Complete ML Lifecycle

This is your framework for EVERY ML project. Memorize this!

```
┌─────────────────────────────────────────────────────────┐
│              THE ML PROJECT LIFECYCLE                   │
└─────────────────────────────────────────────────────────┘

1. 📊 UNDERSTAND THE PROBLEM
   ↓ What are we trying to predict? Why?
   
2. 📁 COLLECT & PREPARE DATA
   ↓ Get data, clean it, handle missing values
   
3. 🔍 EXPLORATORY DATA ANALYSIS (EDA)
   ↓ Visualize, find patterns, understand distributions
   
4. 🛠️ FEATURE ENGINEERING
   ↓ Create new features, encode categories, scale data
   
5. ✂️ SPLIT DATA (Train/Validation/Test)
   ↓ CRITICAL: Never touch test set until the end!
   
6. 🤖 BUILD MODELS
   ↓ Try different algorithms (linear, tree-based, etc.)
   
7. ⚙️ TUNE HYPERPARAMETERS
   ↓ Optimize model settings for best performance
   
8. 📊 EVALUATE & COMPARE
   ↓ Use metrics to find the best model
   
9. 🎯 FINAL EVALUATION (Test Set)
   ↓ Get unbiased performance estimate
   
10. 🚀 DEPLOY & MONITOR
    ↓ Put model in production, track performance over time
```

### Why This Order Matters

**Common mistakes:**
- ❌ Jumping straight to modeling without EDA
- ❌ Not splitting data properly (data leakage!)
- ❌ Testing on training data (overly optimistic results)
- ❌ Touching test set multiple times

**We'll focus on steps 5-9 this week**, assuming you already know steps 1-4 from Phase 1.

---

## ✂️ The Foundation: Train/Test Split

This is THE MOST IMPORTANT concept in ML. Get this wrong, and everything else fails.

### Why We Need Train/Test Split

**The Problem:**
If I give you the answers before a test, you'll ace it. But did you really learn?

**In ML:**
- If we train AND test on the same data, the model's score will look amazing
- But the model may fail badly on new, unseen data
- This is called **OVERFITTING** - memorizing the training data instead of learning patterns that generalize
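
A small experiment makes the memorization effect visible. This sketch assumes synthetic data from `make_classification` with 20% of the labels reassigned at random (`flip_y=0.2`), and a decision tree with no depth limit, which is free to memorize every training point. The dataset and settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 reassigns 20% of the labels at random,
# so part of the training signal is pure noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# No depth limit: the tree keeps splitting until it fits the training set.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"Accuracy on data the model has seen: {train_acc:.2%}")
print(f"Accuracy on held-out data:           {test_acc:.2%}")
```

If you only looked at the first number you would conclude the model is perfect. The held-out score is the one that reflects how it behaves on new data.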

### The Solution: Data Splitting

```
Your Complete Dataset (100%)
        |
        ├─── Training Set (70-80%)
        |    → Model learns patterns from this
        |    → Adjust model based on this data
        |
        ├─── Validation Set (10-15%) [Optional but recommended]
        |    → Tune hyperparameters
        |    → Compare different models
        |    → Can use multiple times
        |
        └─── Test Set (10-15%)
             → NEVER TOUCH until the very end!
             → Final, unbiased performance check
             → Use ONLY ONCE
```
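
`train_test_split` only makes one cut at a time, so a three-way split takes two calls. The sketch below is one way to do it, assuming a roughly 70/15/15 split of the Iris data and two candidate models chosen for illustration. The validation set picks the winner, and the test set is used exactly once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# First cut: set aside 15% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
# Second cut: carve a validation set out of the remainder.
# 0.15 / 0.85 of the remainder is roughly 15% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42, stratify=y_rest
)
print(f"train={len(X_train)}, val={len(X_val)}, test={len(X_test)}")

# Compare candidates on the validation set only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}
val_scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, candidate.predict(X_val))
    print(f"{name}: validation accuracy = {val_scores[name]:.2%}")

# Touch the test set once, with the winning model.
best_name = max(val_scores, key=val_scores.get)
test_acc = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(f"Selected {best_name}; test accuracy = {test_acc:.2%}")
```

Because the test set played no part in choosing the model, the final number is an honest estimate. The validation score is slightly optimistic, since it was used to make the choice.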

### Critical Rules:

1. **Split BEFORE fitting anything** (scalers, encoders, and models should learn from the training set only, to avoid data leakage)
2. **Test set = vault** (locked until final evaluation)
3. **Random splitting** (to avoid bias)
4. **Same seed** (for reproducibility)
5. **Stratified split** (for classification, to keep class balance)
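
Data leakage often sneaks in through preprocessing, for example when a scaler computes its mean and standard deviation from the full dataset. One common way to guard against that is an sklearn `Pipeline`, sketched here with a `StandardScaler` in front of a logistic regression. The choice of scaler and model is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then reuses
# those training statistics when it transforms the test data.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

scaler = pipeline.named_steps["standardscaler"]
print("Scaler means (learned):", scaler.mean_.round(3))
print("Training set means:    ", X_train.mean(axis=0).round(3))
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2%}")
```

The two rows of means match, which confirms that no test-set statistics leaked into the preprocessing step.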

In [None]:
print("✂️ TRAIN/TEST SPLIT - PRACTICAL IMPLEMENTATION\n")

# Load a simple dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

print(f"📊 Dataset Info:")
print(f"   Total samples: {len(X)}")
print(f"   Features: {X.shape[1]} ({list(X.columns)})")
print(f"   Target variable: {y.name}")
print(f"   Classes: {np.unique(y)}")
print()

# Method 1: Simple train/test split (80/20)
print("Method 1: Simple 80/20 Split")
print("="*50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% for testing
    random_state=42       # For reproducibility
)

print(f"✅ Training set: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"✅ Test set: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
print()

# Method 2: Stratified split (recommended for classification)
print("Method 2: Stratified Split (BETTER for classification)")
print("="*50)
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y            # Maintains class proportions!
)

# Compare class distributions
print("\n📊 Class Distribution Comparison:")
comparison_df = pd.DataFrame({
    'Original': y.value_counts(normalize=True).sort_index(),
    'Simple Split (Train)': y_train.value_counts(normalize=True).sort_index(),
    'Stratified Split (Train)': y_train_strat.value_counts(normalize=True).sort_index()
})
print(comparison_df)

print("\n💡 Key Insight:")
print("   Stratified split preserves class proportions better!")
print("   This is especially important for imbalanced datasets.")

# Visualize the split
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Simple split
# sort_index() orders the bars by class id so they line up with the tick labels below
y_train.value_counts().sort_index().plot(kind='bar', ax=axes[0], color='skyblue', alpha=0.7)
axes[0].set_title('Simple Train/Test Split\n(Class Counts in Training Set)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Setosa', 'Versicolor', 'Virginica'], rotation=0)

# Plot 2: Stratified split
# sort_index() orders the bars by class id so they line up with the tick labels below
y_train_strat.value_counts().sort_index().plot(kind='bar', ax=axes[1], color='coral', alpha=0.7)
axes[1].set_title('Stratified Train/Test Split\n(Class Counts in Training Set)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['Setosa', 'Versicolor', 'Virginica'], rotation=0)

plt.tight_layout()
plt.show()

print("\n🎯 Remember: Use stratified split for classification problems!")
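
Iris has perfectly balanced classes, so the benefit of stratifying is easy to miss. The sketch below uses a hypothetical imbalanced label vector (950 negatives, 50 positives) with a placeholder feature column, and compares the positive rate in the test set with and without `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 950 negatives, 50 positives (5% positive).
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; only the labels matter here

_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(f"Positive rate overall:          {y.mean():.1%}")
print(f"Positive rate, plain test set:  {y_test_plain.mean():.1%}")
print(f"Positive rate, stratified test: {y_test_strat.mean():.1%}")
```

The stratified test set keeps the 5% positive rate, while the plain split can drift above or below it depending on the seed. With rare classes, that drift can make evaluation scores noisy.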

---

## 🚀 Your First End-to-End ML Project

Let's put everything together and build a complete ML pipeline!

### Step 1: Understanding the Problem

In [None]:
print("🎯 PROJECT: Iris Species Classification\n")
print("Problem: Given flower measurements, predict the species")
print("Type: Multi-class classification (3 species)")
print("Why it matters: Foundation for plant identification systems")
print("\nLet's follow the ML lifecycle!\n")
print("="*60)

### Step 2: Load & Explore Data

In [None]:
print("📊 STEP 2: DATA EXPLORATION\n")

# Load data
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

print("Dataset shape:", X.shape)
print(f"Features: {list(X.columns)}")
print(f"Target distribution:\n{y.value_counts()}\n")

# Quick EDA
print("Basic statistics:")
print(X.describe())

# Visualize feature distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for i, col in enumerate(X.columns):
    for species in [0, 1, 2]:
        data = X[y == species][col]
        axes[i].hist(data, alpha=0.5, bins=15, 
                    label=iris.target_names[species])
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ EDA complete: Features look reasonable, no obvious outliers")

### Step 3: Split the Data

In [None]:
print("✂️ STEP 3: TRAIN/TEST SPLIT\n")

# Split the data (stratified for classification)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print(f"\nClass distribution in training set:")
print(y_train.value_counts())
print("\n✅ Data split complete. Test set locked away!")

### Step 4: Build a Model

In [None]:
print("🤖 STEP 4: BUILD MODEL\n")

# Create a model - we'll use Logistic Regression
# (Don't worry about the algorithm details yet - we'll cover this in Week 8)
model = LogisticRegression(max_iter=200, random_state=42)

# Train the model
print("Training the model...")
model.fit(X_train, y_train)

print("✅ Model trained!")
print(f"Model type: {type(model).__name__}")
print(f"\nThe model has learned patterns from {len(X_train)} training examples.")

### Step 5: Make Predictions

In [None]:
print("🔮 STEP 5: MAKE PREDICTIONS\n")

# Make predictions on training set
y_train_pred = model.predict(X_train)

# Make predictions on test set
y_test_pred = model.predict(X_test)

print("Sample predictions on test set:")
results_df = pd.DataFrame({
    'Actual': y_test[:10],
    'Predicted': y_test_pred[:10]
})
results_df['Correct'] = results_df['Actual'] == results_df['Predicted']
print(results_df)

print("\n✅ Predictions made!")

### Step 6: Evaluate Performance

In [None]:
print("📊 STEP 6: EVALUATE MODEL\n")

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.2%}")
print(f"Test Accuracy: {test_accuracy:.2%}")

# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
accuracies = [train_accuracy, test_accuracy]
labels = ['Training\nAccuracy', 'Test\nAccuracy']
colors = ['skyblue', 'coral']
axes[0].bar(labels, accuracies, color=colors, alpha=0.7, width=0.5)
axes[0].set_ylim([0, 1])
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Performance', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
for i, v in enumerate(accuracies):
    axes[0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')

# Prediction distribution
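# reindex() keeps all three classes in the chart even if one is never predicted
class_ids = np.arange(len(iris.target_names))
prediction_counts = pd.Series(y_test_pred).value_counts().reindex(class_ids, fill_value=0)
actual_counts = y_test.value_counts().reindex(class_ids, fill_value=0)
x = class_ids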
width = 0.35
axes[1].bar(x - width/2, actual_counts, width, label='Actual', alpha=0.7, color='skyblue')
axes[1].bar(x + width/2, prediction_counts, width, label='Predicted', alpha=0.7, color='coral')
axes[1].set_xlabel('Species', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Actual vs Predicted Counts', fontsize=13, fontweight='bold')
axes[1].set_xticks(x)
axes[1].set_xticklabels(['Setosa', 'Versicolor', 'Virginica'])
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✅ Evaluation complete!")
print("\n💡 Key Insights:")
gap = train_accuracy - test_accuracy
if gap < 0.05:
    print("   ✅ Train and test accuracy are similar - good generalization!")
elif gap > 0.15:
    print("   ⚠️ Train accuracy much higher than test - possible overfitting!")
else:
    print("   ℹ️ Moderate train/test gap - keep an eye on it as the model grows more complex")
    
if test_accuracy > 0.9:
    print(f"   ✅ {test_accuracy:.1%} test accuracy is excellent!")
elif test_accuracy > 0.7:
    print(f"   ✅ {test_accuracy:.1%} test accuracy is good!")
else:
    print(f"   ⚠️ {test_accuracy:.1%} test accuracy - room for improvement!")
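
Matching totals in an "Actual vs Predicted Counts" chart can hide mistakes: one versicolor predicted as virginica and one virginica predicted as versicolor cancel out in the counts. A confusion matrix shows each actual/predicted pair separately. This is a brief sketch that rebuilds the same split and model settings used in this project so it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = LogisticRegression(max_iter=200, random_state=42).fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_test_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions sit on the diagonal, so accuracy is its share of the total.
diag_accuracy = cm.trace() / cm.sum()
print(f"Accuracy from diagonal: {diag_accuracy:.2%}")
print(f"accuracy_score:         {accuracy_score(y_test, y_test_pred):.2%}")
```

Any non-zero number off the diagonal is a specific error you can inspect, for example "actual versicolor, predicted virginica".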

### Step 7: Making Predictions on New Data

In [None]:
print("🌟 STEP 7: INFERENCE (Making Predictions on New Data)\n")

# Simulate new flower measurements
new_flowers = pd.DataFrame({
    'sepal length (cm)': [5.1, 6.5, 7.0],
    'sepal width (cm)': [3.5, 3.0, 3.2],
    'petal length (cm)': [1.4, 5.5, 4.7],
    'petal width (cm)': [0.2, 1.8, 1.4]
})

print("New flower measurements:")
print(new_flowers)
print()

# Make predictions
predictions = model.predict(new_flowers)
prediction_proba = model.predict_proba(new_flowers)

# Show results
species_names = ['setosa', 'versicolor', 'virginica']
print("Predictions:")
for i, pred in enumerate(predictions):
    print(f"\nFlower {i+1}:")
    print(f"  Predicted species: {species_names[pred]}")
    print(f"  Confidence:")
    for j, prob in enumerate(prediction_proba[i]):
        print(f"    {species_names[j]}: {prob:.1%}")

print("\n✅ The trained model can now score new, unseen measurements!")
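
Scoring new data usually happens in a different session from training, so the fitted model has to be saved and loaded again. One common option is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal round trip under that assumption; the file name `iris_model.joblib` is made up, and a temporary directory stands in for wherever you would really store the file.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)
model = LogisticRegression(max_iter=200, random_state=42).fit(X_train, y_train)

# Save the fitted model, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original.
new_flowers = np.array([[5.1, 3.5, 1.4, 0.2], [6.5, 3.0, 5.5, 1.8]])
predictions_match = np.array_equal(
    model.predict(new_flowers), restored.predict(new_flowers)
)
print("Restored model matches original:", predictions_match)
```

Only load model files from sources you trust, and keep the scikit-learn version the same between saving and loading where possible.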

---

## 🎓 Key Takeaways

Congratulations! You've completed your first ML project! Here's what you learned:

### Core Concepts:
1. **ML Definition**: Teaching computers to find patterns and make predictions
2. **Two Problem Types**: Classification (categories) vs Regression (numbers)
3. **ML Lifecycle**: The 10-step framework you'll use for every project
4. **Train/Test Split**: THE foundational concept for detecting overfitting
5. **Stratification**: Preserving class balance in classification problems

### The sklearn API Pattern:
You'll use this pattern for EVERY model in Weeks 7-12:
```python
# 1. Create model
model = SomeAlgorithm()

# 2. Train (fit)
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate
score = some_metric(y_test, predictions)
```
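
The project above applied this pattern to classification. Here is the same four-step pattern on a regression problem, sketched with sklearn's built-in diabetes dataset, `LinearRegression`, and mean squared error as the metric. These are one reasonable set of choices, not the only one.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
# No stratify here: the target is continuous, not a class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()                      # 1. Create model
model.fit(X_train, y_train)                     # 2. Train (fit)
predictions = model.predict(X_test)             # 3. Predict
mse = mean_squared_error(y_test, predictions)   # 4. Evaluate
print(f"Test MSE: {mse:.1f}")
```

Only the model class and the metric changed. The `fit` and `predict` calls are identical to the classification case.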

### Critical Rules:
- ✅ Always split data BEFORE fitting any preprocessing or model
- ✅ Use stratified split for classification
- ✅ Test set = sacred vault (use only once!)
- ✅ Train accuracy slightly above test accuracy is normal
- ✅ If the gap is huge → overfitting!

---

## 🚀 Next Steps

You now have the foundation! Next up:

1. **Notebook 02**: Deep dive into evaluation metrics (accuracy, precision, recall, F1, ROC-AUC, MSE, R²)
2. **Notebook 03**: Cross-validation and proper model selection

Then in Weeks 8-12, you'll learn different algorithms:
- Week 8: Linear Models, Tree-Based Methods
- Week 9: Advanced ML & MLOps
- Week 10: Time Series
- Week 11: Unsupervised Learning
- Week 12: Network Analysis

**The framework you learned today applies to ALL of them!** 🎯

In [None]:
print("🎉 Congratulations! You've completed Notebook 01!")
print("")
print("📚 You learned:")
print("   ✅ What ML is and when to use it")
print("   ✅ Classification vs Regression")
print("   ✅ The complete ML lifecycle")
print("   ✅ Train/test split mastery")
print("   ✅ Built your first ML model!")
print("")
print("🎯 Next: Notebook 02 - Model Evaluation Metrics")
print("   Learn how to properly evaluate and compare models!")