# 📒 Notebook: 03_scaling_features.ipynb
📌 Sections:

1. Title & Introduction
2. Why Scaling is Important
3. Visualizing Raw (Unscaled) Features
4. Standard Scaling (StandardScaler)
5. Min-Max Scaling (MinMaxScaler)
6. Compare Transformed Values
7. When to Use Which Scaler
8. Summary / What’s Next

## 1. Title & Introduction (Markdown Cell)
### 📏 03: Scaling and Normalizing Features

In this notebook, we'll learn:

- Why feature scaling is important
- How to apply **StandardScaler** and **MinMaxScaler**
- The difference between standardization and normalization
- How to compare original vs. scaled feature distributions

## 2. Why Scaling is Important (Markdown Cell)
### 🤔 Why Do We Need Scaling?

Many machine learning models (such as k-NN, SVM, and neural networks) are sensitive to the **scale** of input features.

If one column has values ranging from 1–1000 and another from 0–1, distance calculations and gradient updates are dominated by the large-scale column, so the model may give it too much importance regardless of how informative it actually is.

👉 We fix this using **scaling**.
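
To make the point above concrete, here is a minimal sketch using a made-up three-row array (the values are hypothetical and not taken from `sample_data.csv`). It compares Euclidean distances, the quantity k-NN relies on, before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 spans hundreds, column 1 stays within 0-1
X = np.array([
    [100.0, 0.1],
    [110.0, 0.9],  # differs from row 0 mainly in the small-scale feature
    [900.0, 0.1],  # differs from row 0 only in the large-scale feature
])


def euclidean(a, b):
    """Euclidean distance between two 1-D points."""
    return float(np.linalg.norm(a - b))


# On raw values the large-scale column decides almost everything
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# After standardization both columns contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_scaled_01:.2f}  d(0,2)={d_scaled_02:.2f}")
```

On the raw values, row 2 looks vastly farther from row 0 than row 1 does, even though row 1 sits at the opposite end of the small-scale feature. After scaling, the two distances are of similar size.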


## 3. Load and Visualize Raw Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("../data/sample_data.csv")

# Select all numeric columns
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Quick look (printed explicitly: only a cell's last expression is auto-displayed)
print(df[numeric_cols].head())

# Histogram of raw data
df[numeric_cols].hist(figsize=(10, 6), bins=20)
plt.suptitle("Original Feature Distributions")
plt.show()

## 4. Apply StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_std = StandardScaler()
scaled_std = scaler_std.fit_transform(df[numeric_cols])

# Convert to DataFrame for inspection (keep the original row index)
df_scaled_std = pd.DataFrame(
    scaled_std,
    columns=[col + "_std" for col in numeric_cols],
    index=df.index,
)
df_scaled_std.head()
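
As a sanity check on what `StandardScaler` computes, the sketch below reproduces it by hand with `z = (x - mean) / std`. The `toy` DataFrame and its column names are hypothetical stand-ins for `df[numeric_cols]`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df[numeric_cols]
toy = pd.DataFrame({
    "age": [22.0, 35.0, 47.0, 58.0],
    "income": [28000.0, 52000.0, 61000.0, 90000.0],
})

# Manual standardization: subtract the column mean, divide by the
# population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# The same transformation via scikit-learn
scaler = StandardScaler()
sk = scaler.fit_transform(toy)

print("manual matches sklearn:", np.allclose(manual.to_numpy(), sk))
print("learned means:", scaler.mean_)
print("learned stds: ", scaler.scale_)
```

The fitted scaler stores the per-column statistics in `mean_` and `scale_`, so the same shift and scale can later be applied to new rows with `scaler.transform(...)`.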

## 5. Apply MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler_mm = MinMaxScaler()
scaled_mm = scaler_mm.fit_transform(df[numeric_cols])

# Convert to DataFrame for inspection (keep the original row index)
df_scaled_mm = pd.DataFrame(
    scaled_mm,
    columns=[col + "_mm" for col in numeric_cols],
    index=df.index,
)
df_scaled_mm.head()
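
The min-max transformation can likewise be checked by hand: `x_scaled = (x - min) / (max - min)`. This sketch again uses a hypothetical `toy` DataFrame rather than the notebook's dataset, and also shows that `inverse_transform` recovers the original values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for df[numeric_cols]
toy = pd.DataFrame({
    "age": [22.0, 35.0, 47.0, 58.0],
    "income": [28000.0, 52000.0, 61000.0, 90000.0],
})

# Manual min-max scaling, column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation via scikit-learn
scaler = MinMaxScaler()
sk = scaler.fit_transform(toy)

# Undo the scaling to get back the original units
restored = scaler.inverse_transform(sk)

print("manual matches sklearn:", np.allclose(manual.to_numpy(), sk))
print("round trip recovers data:", np.allclose(restored, toy.to_numpy()))
```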

## 6. Compare Transformed Values

In [None]:
# Visualize distributions after scaling
df_scaled_std.hist(figsize=(10, 6), bins=20)
plt.suptitle("Standard Scaled Features (mean=0, std=1)")
plt.show()

df_scaled_mm.hist(figsize=(10, 6), bins=20)
plt.suptitle("Min-Max Scaled Features (range 0 to 1)")
plt.show()
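
Histograms are easier to interpret next to a few numbers. The sketch below, built on a hypothetical skewed `income` column, tabulates summary statistics for the raw, standardized, and min-max versions. Both scalers are linear transformations, so the location and spread change but the shape of the distribution (measured here by skewness) does not.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical right-skewed feature
toy = pd.DataFrame({"income": [28000.0, 31000.0, 35000.0, 52000.0, 150000.0]})

# Raw values next to both scaled versions
frame = pd.DataFrame({
    "raw": toy["income"],
    "standard": StandardScaler().fit_transform(toy).ravel(),
    "minmax": MinMaxScaler().fit_transform(toy).ravel(),
})

# Summary statistics per version (population std, to match StandardScaler)
stats = pd.DataFrame({
    "mean": frame.mean(),
    "std": frame.std(ddof=0),
    "min": frame.min(),
    "max": frame.max(),
    "skew": frame.skew(),
})

print(stats.round(3))
```

The `skew` column should be identical across all three rows, which is why the scaled histograms look like the originals with relabelled x-axes.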

## 7. When to Use Which Scaler (Markdown Cell)
### 🤓 StandardScaler vs. MinMaxScaler: When to Use

**StandardScaler**  
- Centers each feature at mean 0 and rescales it to unit standard deviation  
- Works best when features are roughly normally distributed; the output is not bounded  
- Good for: Linear regression, logistic regression, SVM

**MinMaxScaler**  
- Scales each feature to a 0–1 range, based on the minimum and maximum seen during fitting  
- Use when you want bounded inputs  
- Good for: Neural networks, clustering, distance-based algorithms
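
One practical caveat when choosing between the two is how each reacts to extreme values. The following sketch uses a hypothetical single-column array with one outlier. It illustrates two things, under the assumption that the scaler is fitted once and then reused: an outlier compresses the remaining min-max values into a narrow band, and values outside the fitted range fall outside 0–1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature with a single large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

mm = MinMaxScaler().fit(x)
std = StandardScaler().fit(x)

x_mm = mm.transform(x)
x_std = std.transform(x)

# Width of the band occupied by the four non-outlier values
inlier_span_mm = float(x_mm[:4].max() - x_mm[:4].min())
inlier_span_std = float(x_std[:4].max() - x_std[:4].min())

# A new value above the fitted maximum is mapped above 1
unseen = float(mm.transform(np.array([[150.0]]))[0, 0])

print(f"min-max span of inliers:  {inlier_span_mm:.3f}")
print(f"standard span of inliers: {inlier_span_std:.3f}")
print(f"min-max value for unseen 150.0: {unseen:.3f}")
```

If the data contains strong outliers, it may be worth inspecting them before relying on either scaler.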


## 8. Summary / What’s Next (Markdown Cell)
### ✅ Summary

In this notebook, we:

- Explored why scaling is necessary
- Applied **StandardScaler** (standardization)
- Applied **MinMaxScaler** (normalization)
- Compared distributions before and after scaling

➡️ **Next Up**: `04_encoding_categorical_variables.ipynb`  
We'll learn how to encode categorical columns using **OneHotEncoder** and **OrdinalEncoder**.
