# 🚀 Scikit-Learn Phase 1: ML Basics & First Models

**Your next step after NumPy, Pandas & Matplotlib**

In this notebook you will learn:
1. Data Preprocessing (missing values, encoding)
2. Feature Scaling & Normalization
3. Train-Test Splitting
4. Linear Regression (Regression)
5. Logistic Regression (Classification)
6. Decision Tree Classifier
7. Model Evaluation Metrics
8. Cross-Validation
9. Hyperparameter Tuning (GridSearchCV)
10. ML Pipelines
11. Saving & Loading Models

---

### The ML Workflow
```
Load Data → Preprocess → Split → Train → Evaluate → Predict → Deploy
```



Some StatQuest and Krish Naik videos are **10-20 min conceptual explainers**. They are great for intuition but **not enough on their own** for hands-on learning. The plan below fixes that:

## Updated Strategy: Use **Long-Form Tutorials** as Primary + **Short Videos** as Supplements

### Best Long-Form Resources (Hindi + English)

| Resource | Duration | Why it works |
|----------|----------|-------------|
| **CampusX ML Playlist** | 1.5-2 hrs/video | Deep theory + code. Best for your level. [Full Playlist](https://www.youtube.com/playlist?list=PLKnIA16OIAFMLe-8kQm4SavhGln4TgaEd) |
| **Krish Naik Complete ML Playlist** | 30-60 min/video | Practical, project-oriented. [Full Playlist](https://www.youtube.com/playlist?list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe) |
| **Codebasics ML Playlist** | 15-30 min/video | Short but code-heavy, good for coding along. [Full Playlist](https://www.youtube.com/playlist?list=PLeo1K3hjS3uvCeTRe6-LeyiOtwM9PUZO0) |
| **StatQuest** | 8-20 min | **Use ONLY as a supplement** to understand the math/intuition visually |

### Revised 30-Day Plan (Longer, More Comprehensive Videos)

#### Week 1: Regression
| Day | Topic | Primary Video (Long) | Supplement (Short) |
|-----|-------|---------------------|-------------------|
| 1 | Finish Preprocessing | [CampusX - Preprocessing (2hr)](https://www.youtube.com/watch?v=RiEpSd4j0vE) | - |
| 2 | Linear Regression Theory + Code | [CampusX - Linear Regression (1.5hr)](https://www.youtube.com/watch?v=UZPfbG0jNec) | [StatQuest - Linear Regression (12min)](https://www.youtube.com/watch?v=PaFPbb66DxQ) |
| 3 | Gradient Descent Deep Dive | [CampusX - Gradient Descent (1.5hr)](https://www.youtube.com/watch?v=ORyfPJypKuU) | [3Blue1Brown - Gradient Descent (21min)](https://www.youtube.com/watch?v=IHZwWFHWa-w) |
| 4 | Multiple + Polynomial Regression | [CampusX - Multiple LR (1.5hr)](https://www.youtube.com/watch?v=2eYWJIBRSGc) | - |
| 5 | Ridge, Lasso, ElasticNet | [CampusX - Regularization (1.5hr)](https://www.youtube.com/watch?v=aEow2V8jEyE) | [StatQuest - Ridge (21min)](https://www.youtube.com/watch?v=Q81RR3yKn30) |
| 6-7 | **Project: House Price Prediction** | [Krish Naik - End to End ML (1hr)](https://www.youtube.com/watch?v=p1hGz0w_OCo) | Code along in your notebook |

#### Week 2: Classification
| Day | Topic | Primary Video (Long) | Supplement |
|-----|-------|---------------------|-----------|
| 8 | Logistic Regression | [CampusX - Logistic Regression (2hr)](https://www.youtube.com/watch?v=ABhFGsEGIKA) | [StatQuest (9min)](https://www.youtube.com/watch?v=yIYKR4sgzI8) |
| 9 | Decision Trees | [CampusX - Decision Tree (1.5hr)](https://www.youtube.com/watch?v=PHxYNGo8NcI) | [StatQuest (18min)](https://www.youtube.com/watch?v=_L39rN6gz7Y) |
| 10 | Random Forest + Bagging | [CampusX - Random Forest (1.5hr)](https://www.youtube.com/watch?v=bHK1nDNRHO0) | [StatQuest (10min)](https://www.youtube.com/watch?v=J4Wdy0Wc_xQ) |
| 11 | SVM | [CampusX - SVM (2hr)](https://www.youtube.com/watch?v=ugTxMLjLS8M) | - |
| 12 | KNN + Naive Bayes | [CampusX - KNN (1.5hr)](https://www.youtube.com/watch?v=BYaOETcsMaA) | - |
| 13 | Model Evaluation Metrics | [CampusX - Metrics (1.5hr)](https://www.youtube.com/watch?v=EUiIydNBIbE) | [StatQuest - Confusion Matrix (7min)](https://www.youtube.com/watch?v=Kdsp6soqA7o) |
| 14 | **Project: Classification Project** | [Krish Naik - Heart Disease (45min)](https://www.youtube.com/watch?v=fHFOANOPMng) | Code along |

#### Week 3: Unsupervised + Boosting
| Day | Topic | Primary Video (Long) |
|-----|-------|---------------------|
| 15 | K-Means Clustering | [CampusX - KMeans (1.5hr)](https://www.youtube.com/watch?v=5shTLzwAdEc) |
| 16 | Hierarchical + DBSCAN | [CampusX - DBSCAN (1.5hr)](https://www.youtube.com/watch?v=DQabDWPqWiE) |
| 17 | PCA | [CampusX - PCA (2hr)](https://www.youtube.com/watch?v=SJpE2_YEb-8) |
| 18 | Feature Engineering | [CampusX - Feature Eng (2hr)](https://www.youtube.com/watch?v=6WDFfaYtN6s) |
| 19 | XGBoost / AdaBoost | [CampusX - Boosting (1.5hr)](https://www.youtube.com/watch?v=Tw05ywNJMaQ) |
| 20-21 | **Project: Full ML Pipeline** | [Krish Naik - Full Project (1hr)](https://www.youtube.com/watch?v=fHFOANOPMng) |

#### Week 4: Deep Learning Intro
| Day | Topic | Primary Video (Long) |
|-----|-------|---------------------|
| 22-23 | Neural Network Intuition + Math | [3Blue1Brown - NN Series (4 videos, ~1hr total)](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) |
| 24-25 | TensorFlow/Keras Hands-on | [Codebasics - DL Playlist (start)](https://www.youtube.com/playlist?list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDR1) |
| 26-27 | CNN Basics | [CampusX - CNN](https://www.youtube.com/watch?v=zfiSAzpy9NM) or Codebasics CNN video |
| 28-29 | NLP Basics | [CampusX - NLP Intro (2hr)](https://www.youtube.com/watch?v=6ZHlThRqO2g) |
| 30 | Portfolio Review + Deployment | [Krish Naik - ML Deployment (1hr)](https://www.youtube.com/watch?v=bjsJOl8gz5k) |

---

### Bottom Line

- **CampusX = your main teacher** (long, thorough, code-along, Hindi)
- **StatQuest = watch BEFORE CampusX** for 10-min visual intuition on the topic
- **Krish Naik / Codebasics = project days** for building end-to-end
- **3Blue1Brown = only for neural networks** math intuition

Daily time: **~2-3 hours** (one long video + coding along in your notebooks). The short 6-min videos are just appetizers; the CampusX sessions are the real meal.

## Section 1: Data Preprocessing with Scikit-Learn

Before feeding data to any ML model, we need to **clean and prepare** it:
- Handle **missing values** → `SimpleImputer`
- Encode **categorical variables** → `LabelEncoder`, `OneHotEncoder`
- These are found in `sklearn.preprocessing` and `sklearn.impute`
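
As a small complement to the list above, the sketch below shows that an imputer learns its fill value during `fit` and reuses it on new rows. The toy arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training column with two missing ages (illustrative values only)
train_age = np.array([[25.0], [np.nan], [35.0], [40.0], [np.nan], [30.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train_age)

# The learned fill value is the mean of the observed (non-missing) entries
learned_mean = imputer.statistics_[0]
print("Learned mean:", learned_mean)

# New data is filled with the training mean, not with its own mean
new_age = np.array([[np.nan], [50.0]])
filled = imputer.transform(new_age)
print("Filled new data:", filled.ravel())
```

Keeping `fit` and `transform` separate like this is what later allows the same imputer to be applied consistently to a test set.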

In [None]:
# --- Import all libraries we'll use throughout this notebook ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn modules
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             mean_squared_error, r2_score)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

print("✅ All libraries imported successfully!")

In [None]:
# --- Handling Missing Values with SimpleImputer ---

# Create a sample DataFrame with missing values
data = pd.DataFrame({
    'Age':    [25, np.nan, 35, 40, np.nan, 30],
    'Salary': [50000, 60000, np.nan, 80000, 70000, np.nan],
    'City':   ['Delhi', 'Mumbai', 'Delhi', np.nan, 'Mumbai', 'Pune']
})

print("📌 Original Data:")
print(data)
print(f"\nMissing values:\n{data.isnull().sum()}")

# Impute numerical columns with MEAN
num_imputer = SimpleImputer(strategy='mean')
data[['Age', 'Salary']] = num_imputer.fit_transform(data[['Age', 'Salary']])

# Impute categorical column with MOST FREQUENT value
cat_imputer = SimpleImputer(strategy='most_frequent')
data[['City']] = cat_imputer.fit_transform(data[['City']])

print("\n✅ After Imputation:")
print(data)

In [None]:
# --- Encoding Categorical Variables ---

# LabelEncoder: converts categories to integers (0, 1, 2...)
# Note: scikit-learn intends it for target labels; the integer order is just alphabetical
le = LabelEncoder()
data['City_encoded'] = le.fit_transform(data['City'])
print("LabelEncoder mapping:", dict(zip(le.classes_, le.transform(le.classes_))))
print(data[['City', 'City_encoded']])

# OneHotEncoder: creates binary columns for each category (better for ML)
print("\n--- OneHotEncoding ---")
ohe = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids dummy variable trap
city_ohe = ohe.fit_transform(data[['City']])
ohe_df = pd.DataFrame(city_ohe, columns=ohe.get_feature_names_out(['City']))
print(ohe_df)

## Section 2: Feature Scaling & Normalization

ML algorithms such as KNN and SVM are **distance-based**, and models such as Logistic Regression are also sensitive to feature scale. They perform badly when features have very different scales.

| Technique | Formula | Range | When to use |
|-----------|---------|-------|-------------|
| **StandardScaler** | (x - mean) / std | Unbounded (mostly -3 to +3) | Most algorithms |
| **MinMaxScaler** | (x - min) / (max - min) | 0 to 1 | Neural networks, images |

⚠️ **Always fit on training data, then transform both train & test!**
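
The rule above can be sketched as follows: a minimal example, assuming a made-up single-feature salary array. It fits each scaler on the training rows only and checks the `StandardScaler` result against the formula from the table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-feature data (salaries), not from a real dataset
X = np.array([[30000.0], [45000.0], [50000.0], [60000.0],
              [70000.0], [80000.0], [90000.0], [100000.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training rows only, then transform both sets
std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_test_std = std.transform(X_test)

# Same numbers as applying (x - mean) / std with the TRAINING mean and std
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_std, manual))

# MinMaxScaler follows the same fit/transform pattern
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)
X_test_mm = mm.transform(X_test)
print(X_train_mm.min(), X_train_mm.max())
```

Because the test rows are scaled with training statistics, their scaled values can fall slightly outside the training range. That is expected and is not a bug.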

In [None]:
# --- Feature Scaling Demo ---

# Sample data with different scales
sample = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60],
    'Salary': [30000, 45000, 50000, 60000, 70000, 80000, 90000, 100000]
})

# StandardScaler
scaler_std = StandardScaler()
scaled_std = scaler_std.fit_transform(sample)

# MinMaxScaler
scaler_mm = MinMaxScaler()
scaled_mm = scaler_mm.fit_transform(sample)

# Visualize before and after
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].set_title("Original")
axes[0].hist(sample['Age'], alpha=0.7, label='Age', color='steelblue')
axes[0].hist(sample['Salary'], alpha=0.7, label='Salary', color='coral')
axes[0].legend()

axes[1].set_title("StandardScaler (mean=0, std=1)")
axes[1].hist(scaled_std[:, 0], alpha=0.7, label='Age', color='steelblue')
axes[1].hist(scaled_std[:, 1], alpha=0.7, label='Salary', color='coral')
axes[1].legend()

axes[2].set_title("MinMaxScaler (0 to 1)")
axes[2].hist(scaled_mm[:, 0], alpha=0.7, label='Age', color='steelblue')
axes[2].hist(scaled_mm[:, 1], alpha=0.7, label='Salary', color='coral')
axes[2].legend()

plt.tight_layout()
plt.show()

print(f"Before scaling - Age mean: {sample['Age'].mean():.1f}, Salary mean: {sample['Salary'].mean():.1f}")
print(f"After StandardScaler - Age mean: {scaled_std[:,0].mean():.4f}, Salary mean: {scaled_std[:,1].mean():.4f}")

## Section 3: Splitting Data into Training & Testing Sets

We **never** train and test on the same data: that's like memorizing the answers before an exam.

- **Training set** (~80%): model learns from this
- **Testing set** (~20%): model is evaluated on this (unseen data)

Key parameters of `train_test_split`:
- `test_size=0.2` → 20% for testing
- `random_state=42` → reproducible results every time
- `stratify=y` → keeps class proportions the same in both splits
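
To make the effect of these parameters concrete, here is a small sketch on made-up imbalanced labels (90 samples of class 0, 10 of class 1). The arrays are assumptions for illustration, not part of the notebook's datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 of class 0 and 10 of class 1 (illustrative)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# test_size=0.2 gives an 80/20 split
print(len(y_tr), len(y_te))

# stratify=y keeps the share of class 1 at 10% in both parts
print(y_tr.mean(), y_te.mean())

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.array_equal(X_te, X_te_again))
```

Without `stratify`, a small test set could by chance contain very few (or no) samples of the rare class, which makes the evaluation unreliable.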

In [None]:
# --- Train-Test Split ---

# Load the Iris dataset (built-in, no download needed)
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

print(f"Dataset shape: {X.shape}")
print(f"Classes: {iris.target_names}")
print(f"Class distribution:\n{y.value_counts()}\n")

# Split: 80% train, 20% test, stratified
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Testing set:  {X_test.shape[0]} samples")
print(f"\nTrain class distribution:\n{y_train.value_counts()}")
print(f"\nTest class distribution:\n{y_test.value_counts()}")

## Section 4: Linear Regression (Your First Regression Model)

**Regression** = predicting a continuous number (price, temperature, salary)

Linear Regression finds the best straight line: **y = mx + b**
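
A minimal sketch of the line-fitting idea, assuming noise-free synthetic data generated from y = 3x + 2 (the numbers are made up). With one feature, `coef_` plays the role of m and `intercept_` the role of b. With several features, `coef_` simply holds one weight per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that lies exactly on the line y = 3x + 2
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0

model = LinearRegression().fit(x, y)

slope = model.coef_[0]        # m
intercept = model.intercept_  # b
print(f"m = {slope:.2f}, b = {intercept:.2f}")

# Predict for an unseen x value using the learned line
pred = model.predict([[20.0]])
print(f"Prediction for x=20: {pred[0]:.2f}")
```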

We'll use the **California Housing** dataset (predicting house prices).

In [None]:
# --- Linear Regression on California Housing ---

housing = datasets.fetch_california_housing()
X_h = pd.DataFrame(housing.data, columns=housing.feature_names)
y_h = pd.Series(housing.target, name='MedHouseVal')

print("Features:", list(X_h.columns))
print(f"Shape: {X_h.shape}")
print(X_h.head())

# Split
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

# Train Linear Regression
lr = LinearRegression()
lr.fit(X_train_h, y_train_h)

# Predict
y_pred_h = lr.predict(X_test_h)

# Evaluate
mse = mean_squared_error(y_test_h, y_pred_h)
r2 = r2_score(y_test_h, y_pred_h)
print(f"\n📊 Linear Regression Results:")
print(f"   Mean Squared Error: {mse:.4f}")
print(f"   R² Score:           {r2:.4f}  (1.0 = perfect)")

# Plot: Actual vs Predicted
plt.figure(figsize=(8, 5))
plt.scatter(y_test_h, y_pred_h, alpha=0.3, s=10, color='steelblue')
plt.plot([0, 5], [0, 5], 'r--', label='Perfect prediction')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Linear Regression: Actual vs Predicted House Prices")
plt.legend()
plt.tight_layout()
plt.show()

## Section 5: Logistic Regression for Classification

**Classification** = predicting a category (spam/not spam, cat/dog, disease/healthy)

Despite the name, **Logistic Regression** is a **classification** algorithm:
- Outputs probabilities (0 to 1) using the sigmoid function
- Works great as a baseline classifier
- We'll classify Iris flowers into 3 species
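
The sigmoid claim above can be checked directly in the two-class case. This sketch keeps only the first two Iris species (an assumption made to get a binary problem) and compares `predict_proba` with a hand-written sigmoid applied to `decision_function`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Keep two classes only, so a single sigmoid applies
mask = y < 2
X_bin, y_bin = X[mask], y[mask]

clf = LogisticRegression(max_iter=200).fit(X_bin, y_bin)

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

scores = clf.decision_function(X_bin)    # raw linear scores w.x + b
probs = clf.predict_proba(X_bin)[:, 1]   # probability of class 1

print(np.allclose(sigmoid(scores), probs))
```

With three classes, as in the full Iris task, scikit-learn generalizes this idea (recent versions use a softmax by default), so each row of `predict_proba` still sums to 1.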

In [None]:
# --- Logistic Regression on Iris Dataset ---

# Scale features first (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Predict
y_pred_lr = log_reg.predict(X_test_scaled)

# Accuracy
acc_lr = accuracy_score(y_test, y_pred_lr)
print(f"✅ Logistic Regression Accuracy: {acc_lr:.4f} ({acc_lr*100:.1f}%)")
print(f"\n📋 Classification Report:\n")
print(classification_report(y_test, y_pred_lr, target_names=iris.target_names))

# Decision Boundary Visualization (using 2 features only)
X_2d = X_train[['sepal length (cm)', 'sepal width (cm)']].values
X_2d_scaled = StandardScaler().fit_transform(X_2d)
lr_2d = LogisticRegression(max_iter=200, random_state=42)
lr_2d.fit(X_2d_scaled, y_train)

# Create mesh grid
x_min, x_max = X_2d_scaled[:, 0].min() - 1, X_2d_scaled[:, 0].max() + 1
y_min, y_max = X_2d_scaled[:, 1].min() - 1, X_2d_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = lr_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 5))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
scatter = plt.scatter(X_2d_scaled[:, 0], X_2d_scaled[:, 1], c=y_train, cmap='viridis', edgecolors='k', s=40)
plt.xlabel("Sepal Length (scaled)")
plt.ylabel("Sepal Width (scaled)")
plt.title("Logistic Regression: Decision Boundary (2 features)")
plt.colorbar(scatter, label='Species')
plt.tight_layout()
plt.show()

## Section 6: Decision Tree Classifier

Decision Trees split data by asking **yes/no questions** on features:
- Easy to understand and visualize
- No need for feature scaling
- Can overfit if tree grows too deep
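
To see the yes/no questions as text, the sketch below prints a shallow tree's rules with `export_text` and then compares it with an unrestricted tree. It trains on the full Iris data purely for illustration; the depth values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree: each printed line is one threshold question on a feature
shallow = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow.fit(iris.data, iris.target)
rules = export_text(shallow, feature_names=list(iris.feature_names))
print(rules)

# An unrestricted tree keeps splitting until it fits the training data
deep = DecisionTreeClassifier(random_state=42)
deep.fit(iris.data, iris.target)

print("Depths:", shallow.get_depth(), deep.get_depth())
print("Training accuracy (shallow):", shallow.score(iris.data, iris.target))
print("Training accuracy (deep):   ", deep.score(iris.data, iris.target))
```

A higher training accuracy for the deep tree does not mean it generalizes better. That gap is exactly the overfitting risk mentioned above, which is why `max_depth` is usually limited.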

Let's train one on the same Iris dataset and compare with Logistic Regression.

In [None]:
# --- Decision Tree Classifier ---

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)  # No scaling needed for trees!

y_pred_dt = dt.predict(X_test)
acc_dt = accuracy_score(y_test, y_pred_dt)

print(f"✅ Decision Tree Accuracy: {acc_dt:.4f} ({acc_dt*100:.1f}%)")
print(f"\n📋 Classification Report:\n")
print(classification_report(y_test, y_pred_dt, target_names=iris.target_names))

# Compare models
print(f"\n🔍 Comparison:")
print(f"   Logistic Regression: {acc_lr*100:.1f}%")
print(f"   Decision Tree:       {acc_dt*100:.1f}%")

# Visualize the tree
plt.figure(figsize=(16, 8))
plot_tree(dt, feature_names=iris.feature_names, class_names=iris.target_names,
          filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree Visualization")
plt.tight_layout()
plt.show()

## Section 7: Model Evaluation Metrics

How do we know if a model is **actually good**?

**Classification Metrics:**
| Metric | What it measures |
|--------|-----------------|
| **Accuracy** | % of correct predictions |
| **Precision** | Of all predicted positives, how many are actually positive? |
| **Recall** | Of all actual positives, how many did we catch? |
| **F1-Score** | Harmonic mean of Precision & Recall |
| **Confusion Matrix** | Shows where the model is confused |

**Regression Metrics:**
| Metric | What it measures |
|--------|-----------------|
| **MSE** | Average squared error (lower = better) |
| **RMSE** | Square root of MSE (same unit as target) |
| **R² Score** | How much variance is explained (1.0 = perfect) |
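
The definitions in the two tables can be verified by hand. This sketch uses small made-up label and value arrays, computes the metrics from the confusion-matrix counts, and compares them with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, mean_squared_error,
                             precision_score, recall_score)

# Illustrative binary labels: 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many are right
recall = tp / (tp + fn)      # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))

# Regression: MSE is the mean of squared errors, RMSE its square root
actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])
mse_toy = mean_squared_error(actual, predicted)  # (1 + 0 + 4) / 3
rmse_toy = np.sqrt(mse_toy)
print(mse_toy, rmse_toy)
```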

In [None]:
# --- Confusion Matrix Heatmap ---

cm = confusion_matrix(y_test, y_pred_lr)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression confusion matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=iris.target_names, yticklabels=iris.target_names)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Logistic Regression: Confusion Matrix')

# Decision Tree confusion matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Oranges', ax=axes[1],
            xticklabels=iris.target_names, yticklabels=iris.target_names)
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Decision Tree: Confusion Matrix')

plt.tight_layout()
plt.show()

# Regression metrics recap
print("📊 Regression Metrics (California Housing, from Section 4):")
print(f"   MSE:  {mse:.4f}")
print(f"   RMSE: {np.sqrt(mse):.4f}")
print(f"   R²:   {r2:.4f}")

## Section 8: Cross-Validation

A **single** train/test split can be lucky or unlucky. **Cross-validation** tests the model on multiple splits:

**K-Fold CV (k=5):**
```
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]
```
Each fold gets a chance to be the test set → more reliable score.
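
The fold diagram can be reproduced with `KFold` itself. This is a minimal sketch on ten dummy samples (an assumption chosen so that each fold holds exactly two test rows), without shuffling so the folds are contiguous blocks.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples; only their indices matter here
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling, so test folds are contiguous

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {fold}: test={test_idx}, train={train_idx}")
    test_folds.append(test_idx)

# Every sample lands in the test set exactly once across the 5 folds
all_test = np.sort(np.concatenate(test_folds))
print(np.array_equal(all_test, np.arange(10)))
```

Note that for classifiers, passing an integer such as `cv=5` to `cross_val_score` uses a stratified variant of this splitting, which keeps class proportions similar in each fold.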

In [None]:
# --- Cross-Validation ---

# Using a pipeline (scaler + model) to avoid data leakage
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200, random_state=42))
])

pipe_dt = Pipeline([
    ('classifier', DecisionTreeClassifier(max_depth=3, random_state=42))
])

# 5-Fold Cross-Validation
cv_scores_lr = cross_val_score(pipe_lr, X, y, cv=5, scoring='accuracy')
cv_scores_dt = cross_val_score(pipe_dt, X, y, cv=5, scoring='accuracy')

print("📊 5-Fold Cross-Validation Results:")
print(f"\nLogistic Regression:")
print(f"   Fold scores: {cv_scores_lr}")
print(f"   Mean: {cv_scores_lr.mean():.4f} ± {cv_scores_lr.std():.4f}")

print(f"\nDecision Tree:")
print(f"   Fold scores: {cv_scores_dt}")
print(f"   Mean: {cv_scores_dt.mean():.4f} ± {cv_scores_dt.std():.4f}")

# Plot fold scores
fig, ax = plt.subplots(figsize=(8, 4))
folds = range(1, 6)
ax.plot(folds, cv_scores_lr, 'o-', label=f'Logistic Reg (mean={cv_scores_lr.mean():.3f})', color='steelblue')
ax.plot(folds, cv_scores_dt, 's-', label=f'Decision Tree (mean={cv_scores_dt.mean():.3f})', color='coral')
ax.set_xlabel('Fold')
ax.set_ylabel('Accuracy')
ax.set_title('Cross-Validation Scores per Fold')
ax.set_xticks(folds)
ax.set_ylim(0.8, 1.05)
ax.legend()
plt.tight_layout()
plt.show()

## Section 9: Hyperparameter Tuning with GridSearchCV

Models have **hyperparameters** (settings YOU choose before training). How to pick the best ones?

**GridSearchCV** tries every combination and picks the best via cross-validation:
```python
param_grid = {'max_depth': [2, 3, 5, 10], 'min_samples_split': [2, 5, 10]}
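# Any estimator works here; a decision tree matches these parameter names
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)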
grid.fit(X, y)
print(grid.best_params_)
```
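
"Every combination" is easy to count. The sketch below, assuming a decision tree on the Iris data as the model, uses `ParameterGrid` to list the candidates for the grid shown above and confirms that `GridSearchCV` evaluates each of them.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 5, 10], "min_samples_split": [2, 5, 10]}

# 4 values x 3 values = 12 candidate settings
n_candidates = len(ParameterGrid(param_grid))
print("Candidates:", n_candidates)

# With cv=5, each candidate is trained and scored 5 times
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print("Evaluated:", len(grid.cv_results_["params"]))
print("Best params:", grid.best_params_)
print("Best mean CV accuracy:", round(grid.best_score_, 3))
```

The cost grows multiplicatively with each added hyperparameter: here 12 candidates times 5 folds means 60 fits, plus one final refit on all the data with the best setting.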