# 🏡 Min-Max Normalization Workshop
## Team Name: Group-4
## Team Members: Prajesh Bhatt, Kevinkumar Patel
---

## ❗ Why We Normalize: The Problem with Raw Feature Scales

In housing data, a feature like `Price` can reach the hundreds of thousands and `Lot_Size` the thousands, while others like `Num_Bedrooms` range from 1 to 5. This creates problems for algorithms whose behavior depends on numeric magnitudes.

---

### ⚠️ What Goes Wrong Without Normalization

---

### 1. 🧭 K-Nearest Neighbors (KNN)

KNN uses the **Euclidean distance** formula:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}
$$

**Example:**

- $ \text{Price}_1 = 650{,}000, \quad \text{Price}_2 = 250{,}000 $
- $ \text{Bedrooms}_1 = 3, \quad \text{Bedrooms}_2 = 2 $

Now compute squared differences:

$$
(\text{Price}_1 - \text{Price}_2)^2 = (650{,}000 - 250{,}000)^2 = (400{,}000)^2 = 1.6 \times 10^{11}
$$
$$
(\text{Bedrooms}_1 - \text{Bedrooms}_2)^2 = (3 - 2)^2 = 1
$$

➡️ **Price dominates the distance calculation**, making smaller features like `Bedrooms` effectively irrelevant.
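
The arithmetic above can be checked with a short sketch. The helper name `euclidean` and the `price_share` summary are illustrative additions, not part of the original workshop code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (Price, Bedrooms) for the two houses in the example above
house_1 = (650_000, 3)
house_2 = (250_000, 2)

price_sq = (house_1[0] - house_2[0]) ** 2   # squared Price difference
bed_sq = (house_1[1] - house_2[1]) ** 2     # squared Bedrooms difference
raw_distance = euclidean(house_1, house_2)

# Fraction of the squared distance that comes from Price alone
price_share = price_sq / (price_sq + bed_sq)

print(f"squared differences: price={price_sq:.3g}, bedrooms={bed_sq}")
print(f"raw distance: {raw_distance:.2f}")
print(f"share of squared distance due to Price: {price_share:.12f}")
```

The one-bedroom difference changes the distance only far past the decimal point, so two houses with very different bedroom counts would look equally "near" as long as their prices match.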

---

### 2. 📉 Linear Regression

Linear regression estimates:

$$
y = \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Lot\_Size} + \epsilon
$$

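If `Price` has very large values and the coefficients are fitted with gradient descent:
- The gradient for $ \beta_1 $ is proportional to `Price`, so its updates will be **much larger**
- The gradient for $ \beta_2 $ (Bedrooms) will be **very small** by comparison

➡️ No single learning rate suits both coefficients: a step small enough to keep $ \beta_1 $ stable barely moves $ \beta_2 $, so training converges slowly or diverges. (A closed-form least-squares fit gives the same predictions at any scale, but the raw coefficients are still hard to compare across features.)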
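
To see the size gap in the gradient updates, here is a minimal sketch for a single training row under squared-error loss. The target value `y` and the zero starting coefficients are made up for illustration.

```python
# One hypothetical training row; the target value y is invented for this sketch
x = {"Price": 650_000.0, "Bedrooms": 3.0}
y = 1.0
beta = {"Price": 0.0, "Bedrooms": 0.0}

prediction = sum(beta[name] * x[name] for name in x)
residual = prediction - y

# d/d(beta_j) of (prediction - y)^2 = 2 * residual * x_j
grad = {name: 2 * residual * x[name] for name in x}

# The ratio of the two gradients equals the ratio of the raw feature values
ratio = grad["Price"] / grad["Bedrooms"]
print(grad, ratio)
```

Because each gradient is the residual times the feature value, the gap between the two updates is exactly the gap between the feature scales.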

---

### 3. 🧠 Neural Networks

A single neuron computes:

$$
z = w_1 \cdot \text{Price} + w_2 \cdot \text{Bedrooms} + w_3 \cdot \text{Lot\_Size}
$$

If:

- $ \text{Price} = 650{,}000 $
- $ \text{Bedrooms} = 3 $
- $ \text{Lot\_Size} = 8{,}000 $

Then:

$$
z \approx w_1 \cdot 650{,}000 + w_2 \cdot 3 + w_3 \cdot 8{,}000
$$

➡️ Even with equal weights, `Price` contributes **almost all of $z$**, making it difficult for the network to learn from other features.
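
A quick sketch of each feature's share of $z$ under equal weights. The weight value `0.01` is an arbitrary choice for illustration; any equal positive weight gives the same shares.

```python
# Feature values from the example above
features = {"Price": 650_000, "Bedrooms": 3, "Lot_Size": 8_000}

# Equal weights; 0.01 is an arbitrary illustrative value
weights = {name: 0.01 for name in features}

contributions = {name: weights[name] * value for name, value in features.items()}
z = sum(contributions.values())
shares = {name: c / z for name, c in contributions.items()}

for name, share in shares.items():
    print(f"{name}: {share:.4%} of z")
```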

---

### ✅ Solution: Min-Max Normalization

We apply the transformation:

$$
x_{\text{normalized}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

This scales all features to a common range (typically $[0, 1]$).

| Feature      | Raw Value | Min     | Max     | Normalized Value |
|--------------|-----------|---------|---------|------------------|
| Price        | 650,000   | 250,000 | 800,000 | 0.727            |
| Bedrooms     | 3         | 1       | 5       | 0.500            |
| Lot_Size     | 8,000     | 3,000   | 10,000  | 0.714            |

➡️ Now, **each feature contributes on a comparable scale** to model training or distance comparisons.
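
The normalized values in the table can be recomputed with a few lines of plain Python. The function name `min_max` and the zero-range guard are assumptions added for this sketch.

```python
def min_max(x, x_min, x_max):
    """Scale x into [0, 1] given the feature's min and max."""
    if x_max == x_min:
        # Guard added for this sketch: a constant feature has no range to scale by
        raise ValueError("x_max and x_min must differ")
    return (x - x_min) / (x_max - x_min)

# (raw value, min, max) taken from the table above
rows = {
    "Price": (650_000, 250_000, 800_000),
    "Bedrooms": (3, 1, 5),
    "Lot_Size": (8_000, 3_000, 10_000),
}

normalized = {name: min_max(*values) for name, values in rows.items()}
for name, value in normalized.items():
    print(f"{name}: {value:.3f}")
```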

---

## 📌 Use Case: Housing Data
We are normalizing features from a real estate dataset to prepare it for machine learning analysis.

```python
# 🔢 Load and display dataset
import pandas as pd

df = pd.read_csv('./data/housing_data.csv')
df.head()
```

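|   | House_ID | Price  | Area_sqft | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size |
|---|----------|--------|-----------|--------------|---------------|------------|----------|
| 0 | H100000  | 574507 | 1462      | 3            | 3             | 2002       | 4878     |
| 1 | H100001  | 479260 | 1727      | 2            | 2             | 1979       | 4943     |
| 2 | H100002  | 597153 | 1403      | 5            | 2             | 1952       | 5595     |
| 3 | H100003  | 728454 | 1646      | 5            | 2             | 1992       | 9305     |
| 4 | H100004  | 464876 | 853       | 1            | 1             | 1956       | 7407     |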


### 🔎 Step 1: Implement Min-Max Normalization on the Housing Dataset

```python
# ✍️ Implement Min-Max Normalization manually (no sklearn/numpy)
# Normalize: Price, Area_sqft, Num_Bedrooms, Num_Bathrooms, Lot_Size

# Define the features to normalize
features_to_normalize = ['Price', 'Area_sqft', 'Num_Bedrooms', 'Num_Bathrooms', 'Lot_Size']

# Work on a copy so we preserve the original values
df_norm = df.copy()

# Apply Min-Max formula: x_norm = (x - x_min) / (x_max - x_min)
# Pure pandas, no sklearn or numpy
for col in features_to_normalize:
    col_min = df[col].min()
    col_max = df[col].max()
    df_norm[col + '_norm'] = (df[col] - col_min) / (col_max - col_min)

# Display original vs normalized columns side by side
norm_cols = [c + '_norm' for c in features_to_normalize]
print('=== Original vs Normalized (first 5 rows) ===')
display(df_norm[features_to_normalize + norm_cols].head())

# Sanity check: every normalized column should have min=0 and max=1
print('\n=== Sanity Check: min and max of each normalized column ===')
print(df_norm[norm_cols].agg(['min', 'max']))
```

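```
=== Original vs Normalized (first 5 rows) ===
```

|   | Price  | Area_sqft | Num_Bedrooms | Num_Bathrooms | Lot_Size | Price_norm | Area_sqft_norm | Num_Bedrooms_norm | Num_Bathrooms_norm | Lot_Size_norm |
|---|--------|-----------|--------------|---------------|----------|------------|----------------|-------------------|--------------------|---------------|
| 0 | 574507 | 1462      | 3            | 3             | 4878     | 0.485226   | 0.315789       | 0.5               | 1.0                | 0.320814      |
| 1 | 479260 | 1727      | 2            | 2             | 4943     | 0.387827   | 0.394588       | 0.25              | 0.5                | 0.326191      |
| 2 | 597153 | 1403      | 5            | 2             | 5595     | 0.508384   | 0.298246       | 1.0               | 0.5                | 0.380129      |
| 3 | 728454 | 1646      | 5            | 2             | 9305     | 0.642651   | 0.370503       | 1.0               | 0.5                | 0.687045      |
| 4 | 464876 | 853       | 1            | 1             | 7407     | 0.373119   | 0.134701       | 0.0               | 0.0                | 0.53003       |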



```
=== Sanity Check: min and max of each normalized column ===
     Price_norm  Area_sqft_norm  Num_Bedrooms_norm  Num_Bathrooms_norm  Lot_Size_norm
min         0.0             0.0                0.0                 0.0            0.0
max         1.0             1.0                1.0                 1.0            1.0
```


### 🔎 Talking Points #1: Min-Max Normalization on Housing Features

- **The formula compresses every feature into [0, 1] without distorting relative spacing.** Because Min-Max divides by the range (max − min), a value at the minimum maps to 0, one at the maximum maps to 1, and everything in between keeps its proportional position. The sanity check confirms all five normalized columns have min = 0.0 and max = 1.0, exactly as expected.

- **Raw scale differences of multiple orders of magnitude vanish after normalization.** Before normalization, `Price` spans ~\$977,000 while `Num_Bedrooms` spans only 4, a ratio of ~244,000:1. A distance-based algorithm like KNN would effectively ignore bedroom count entirely. After normalization both features span [0, 1], so neither dominates distance calculations purely because of its units, which is precisely the goal.

- **Min-Max depends on the observed extremes, so the training-set min/max must be stored and reused.** The normalization anchors to the *observed* min and max in the dataset, which also makes it sensitive to outliers: a single extreme value stretches the range and squeezes every other value together. If a future house is priced above the training maximum (> \$1,077,909), its normalized price will exceed 1.0, violating the [0, 1] guarantee. We must therefore save the training-set min/max values and apply them at inference time, never recomputing them on new data.
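
The "store and reuse" point can be sketched as a fit step and a transform step. The function names, the training prices, and the optional `clip` flag are all hypothetical additions for illustration; clipping is one possible way to handle out-of-range values, not something this workshop prescribes.

```python
def fit_min_max(values):
    """Learn the scaling parameters from training data only."""
    return min(values), max(values)

def transform_min_max(x, x_min, x_max, clip=False):
    """Apply stored training parameters to a value, optionally clipping to [0, 1]."""
    scaled = (x - x_min) / (x_max - x_min)
    if clip:
        scaled = max(0.0, min(1.0, scaled))
    return scaled

# Hypothetical training prices, not taken from the housing dataset
train_prices = [200_000, 450_000, 1_000_000]
price_min, price_max = fit_min_max(train_prices)

# A new house priced above the training maximum
new_price = 1_200_000
scaled = transform_min_max(new_price, price_min, price_max)
clipped = transform_min_max(new_price, price_min, price_max, clip=True)
print(scaled, clipped)
```

Without clipping, the new house lands above 1.0, which is the behavior the last talking point warns about. Whether to clip, or to leave the value as is, depends on the downstream model.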


## 🧩 Challenge Extension: After Normalization, Which Features Matter Most?

You've normalized the housing features so they share a common scale.  
Now comes a common next step in ML workflows:

> **How do we identify the most important directions (principal components) in the data, and how might these relate to a target variable like `Price`?**

This introduces **Principal Component Analysis (PCA)**.

---

## 📚 PCA Theory (Conceptual)

### What PCA *is*
PCA is an **unsupervised** dimensionality reduction technique that:
- Finds **new axes** (principal components) that are **linear combinations** of your original features.
- Orders these axes so that:
  - **PC1** captures the **most variance** in the feature space,
  - **PC2** captures the next most variance, and so on,
  - Each PC is **orthogonal** (uncorrelated) with the previous ones.

### What PCA is *not*
PCA does **not** directly find features that "impact the target variable" because it does not use the target in its optimization.

However, you *can*:
- Compute PCs from the feature matrix **X** (after normalization),
- Then measure how PCs relate to the target **y** (e.g., correlation with `Price`, or a simple regression on PCs),
- Interpret which original features contribute most to PCs that are most related to **y**.

---

## 🧠 The Math (high level)
Given a centered feature matrix $X$ (often standardized/normalized first):

1. Compute covariance matrix:
$$
\Sigma = \frac{1}{n-1}X^\top X
$$

2. Find eigenvectors (principal directions) and eigenvalues (variance captured):
$$
\Sigma v_i = \lambda_i v_i
$$

- $ v_i $ are **principal component directions** (loadings)
- $ \lambda_i $ are the **variance explained** by each component
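
The two steps above can be sketched directly with NumPy. The small matrix `X` is made up for illustration, and `np.linalg.eigh` is used because the covariance matrix is symmetric; this mirrors the math rather than reproducing how any particular library implements PCA.

```python
import numpy as np

# Small made-up feature matrix (rows = houses, columns = features)
X = np.array([
    [0.2, 0.1],
    [0.4, 0.3],
    [0.6, 0.2],
    [0.8, 0.6],
    [1.0, 0.5],
])

# Center each feature, then form the covariance matrix (1 / (n - 1)) * X^T X
X_centered = X - X.mean(axis=0)
n = X_centered.shape[0]
cov = X_centered.T @ X_centered / (n - 1)

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the direction v_i

explained_ratio = eigenvalues / eigenvalues.sum()
print("eigenvalues:", eigenvalues)
print("explained variance ratio:", explained_ratio)
```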

---

## ✅ Why Normalize Before PCA?
PCA is sensitive to scale. Without normalization/standardization:
- A large-scale feature (e.g., `Price`) can dominate variance
- PCs will reflect units rather than structure
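
A small sketch of this effect, using made-up rows of (`Price`, `Bedrooms`) and a hypothetical helper `first_pc`. On the raw values PC1 points almost exactly along the `Price` axis; after Min-Max scaling both features carry weight.

```python
import numpy as np

def first_pc(X):
    """Return the unit-length direction of maximum variance (PC1) for X."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / (X.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvectors[:, np.argmax(eigenvalues)]

# Made-up rows of (Price, Bedrooms), for illustration only
X_raw = np.array([
    [250_000, 3],
    [400_000, 1],
    [550_000, 5],
    [650_000, 2],
    [800_000, 4],
], dtype=float)

# Min-Max scale each column to [0, 1]
col_min = X_raw.min(axis=0)
col_max = X_raw.max(axis=0)
X_scaled = (X_raw - col_min) / (col_max - col_min)

pc1_raw = first_pc(X_raw)
pc1_scaled = first_pc(X_scaled)
print("PC1 on raw data:   ", np.round(pc1_raw, 6))
print("PC1 on scaled data:", np.round(pc1_scaled, 6))
```

The sign of an eigenvector is arbitrary, so only the magnitudes of the loadings are meaningful when comparing the two results.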

---

## 🎯 Student Challenge
Using the **housing dataset**:

1. Apply PCA to the normalized feature matrix $X$ (exclude ID columns and the target).
2. Determine how many components are needed to explain **≥ 90%** of the variance.
3. Identify which original features contribute most to:
   - **PC1** and **PC2**, and
   - the **PC most correlated with the target** (`Price`).
4. Write a short interpretation:
   - "What does PC1 represent in housing terms?"
   - "Do the PCs that explain the most variance also relate most strongly to `Price`?"



### 🔗 How to Integrate This With Your Step 1 Normalization

- If you created normalized columns (e.g., `Area_sqft_norm`), use those in `candidate_features`.
- If you normalized in-place (overwriting original columns), you can use the original names.
- PCA should **not** include:
  - `House_ID` (identifier)
  - non-numeric categorical columns (unless encoded appropriately)
- Decide intentionally whether to include `Year_Built`:
  - It's numeric, but it may behave differently than size/price-related features.

---

### ✅ Deliverable for the Challenge
Add a Markdown cell answering:

1. How many PCs explain at least **90%** variance?
2. Which features contribute most to **PC1** and **PC2**?
3. Which PC is most correlated with `Price`?
4. In plain language: what do you think PC1 represents?


```python
# --- PCA on the normalized feature matrix ---
# Uses the normalized columns produced in Step 1 (df_norm).
# Price_norm is the target; it is excluded from X and used for correlation in Step F.
# Year_Built is excluded: it is a year label, not a magnitude/size feature.

import numpy as np
from sklearn.decomposition import PCA

# ---- Step A: Choose target and feature columns ----
target = df_norm['Price_norm']
candidate_features = ['Area_sqft_norm', 'Num_Bedrooms_norm', 'Num_Bathrooms_norm', 'Lot_Size_norm']
X = df_norm[candidate_features].values

# ---- Step B: Center the data (mean = 0 per feature) ----
# sklearn PCA centers automatically, but we do it explicitly to show the step.
X_centered = X - X.mean(axis=0)

# ---- Step C: Fit PCA ----
pca = PCA()          # keep all components for full inspection
pca.fit(X_centered)

# ---- Step D: Variance explained ----
evr = pca.explained_variance_ratio_
cumulative_evr = np.cumsum(evr)

print('=== Variance Explained by Each PC ===')
for i, (v, cv) in enumerate(zip(evr, cumulative_evr), 1):
    print(f'  PC{i}: {v:.4f}  (cumulative: {cv:.4f})')

n_components_90 = int(np.argmax(cumulative_evr >= 0.90)) + 1
print(f'\n➡️  {n_components_90} components needed to explain >= 90% of variance')

# ---- Step E: Loadings (feature contributions to each PC) ----
loadings = pd.DataFrame(
    pca.components_.T,
    index=candidate_features,
    columns=[f'PC{i}' for i in range(1, len(candidate_features) + 1)]
)
print('\n=== PC Loadings (feature weights per component) ===')
display(loadings.round(4))

print('\nDominant feature per PC:')
for pc in loadings.columns:
    dominant = loadings[pc].abs().idxmax()
    print(f'  {pc}: {dominant}  (loading = {loadings.loc[dominant, pc]:.4f})')

# ---- Step F: Relate PCs to the target (Pearson correlation) ----
pc_scores = pca.transform(X_centered)
print('\n=== Pearson Correlation of Each PC with Price_norm ===')
correlations = {}
for i in range(len(candidate_features)):
    r = np.corrcoef(pc_scores[:, i], target.values)[0, 1]
    correlations[f'PC{i+1}'] = r
    print(f'  PC{i+1}: r = {r:.4f}')

best_pc = max(correlations, key=lambda k: abs(correlations[k]))
print(f'\n➡️  {best_pc} is most correlated with Price (r = {correlations[best_pc]:.4f})')
```

```
=== Variance Explained by Each PC ===
  PC1: 0.4953  (cumulative: 0.4953)
  PC2: 0.3600  (cumulative: 0.8554)
  PC3: 0.0803  (cumulative: 0.9356)
  PC4: 0.0644  (cumulative: 1.0000)

➡️  3 components needed to explain >= 90% of variance

=== PC Loadings (feature weights per component) ===
```

|                    | PC1    | PC2     | PC3     | PC4     |
|--------------------|--------|---------|---------|---------|
| Area_sqft_norm     | 0.0015 | 0.0174  | -0.1236 | 0.9922  |
| Num_Bedrooms_norm  | 0.0534 | 0.9983  | 0.0203  | -0.0151 |
| Num_Bathrooms_norm | 0.9986 | -0.0533 | -0.0052 | -0.0013 |
| Lot_Size_norm      | 0.0043 | -0.0185 | 0.9921  | 0.1239  |

```
Dominant feature per PC:
  PC1: Num_Bathrooms_norm  (loading = 0.9986)
  PC2: Num_Bedrooms_norm  (loading = 0.9983)
  PC3: Lot_Size_norm  (loading = 0.9921)
  PC4: Area_sqft_norm  (loading = 0.9922)

=== Pearson Correlation of Each PC with Price_norm ===
  PC1: r = -0.0114
  PC2: r = -0.0054
  PC3: r = 0.0172
  PC4: r = -0.0152

➡️  PC3 is most correlated with Price (r = 0.0172)
```


### 📋 PCA Challenge: Answers

| Question | Answer |
|----------|--------|
| How many PCs explain ≥ 90% variance? | **3 PCs**: PC1 = 49.5%, PC1+PC2 = 85.5%, PC1+PC2+PC3 = 93.6% |
| Dominant feature in PC1? | **Num_Bathrooms_norm** (loading ≈ +0.999) |
| Dominant feature in PC2? | **Num_Bedrooms_norm** (loading ≈ +0.998) |
| PC most correlated with Price? | **PC3** (r ≈ 0.017), but all PCs correlate near-zero with Price |
| What does PC1 represent? | PC1 is almost entirely the bathroom-count axis. Bathroom count varies nearly independently of the other three features and has the largest variance of the four after Min-Max scaling, so it alone captures ~50% of the total variance in the feature matrix. |
| Do high-variance PCs predict Price? | **No.** PC1 and PC2 together explain ~86% of feature-space variance yet correlate near-zero with Price (\|r\| < 0.02). PCA is unsupervised: it maximises spread in X, not predictive power over y. |

### 🔎 Talking Points #2: PCA on Normalized Housing Features

- **Three principal components are sufficient to capture ≥ 90% of the variance, and each PC maps cleanly to one original feature.** PC1 (49.5%) is dominated by `Num_Bathrooms_norm`, PC2 (36.0%) by `Num_Bedrooms_norm`, and PC3 (8.0%) by `Lot_Size_norm`. The fact that each component loads almost entirely on a single feature tells us these four housing attributes are nearly uncorrelated: there is very little shared variance to compress, so PCA doesn't produce much dimensionality reduction here. The ordering of the components then simply reflects each feature's variance after Min-Max scaling; count features with only a few distinct levels, such as bathrooms and bedrooms, tend to end up with more variance on the [0, 1] scale than the continuous features.

- **High variance explained by a PC does not imply that PC predicts the target.** All four PCs have near-zero Pearson correlation with `Price_norm` (|r| < 0.02). The PC explaining the most feature-space variance (PC1 at ~50%) has essentially no linear relationship with house price. This is a textbook illustration of why PCA is called *unsupervised*: it finds directions of maximum spread in the feature matrix, which are not necessarily the directions most useful for predicting a label.

- **The normalization done in Step 1 was a prerequisite for valid PCA results.** Without it, the features measured in thousands (`Area_sqft`, `Lot_Size`) would have dominated the covariance matrix over the single-digit counts (`Num_Bedrooms`, `Num_Bathrooms`), because variance grows with the square of a feature's scale. PC1 would simply have tracked the largest-scale feature, telling us little about the structure of the others. (`Price`, with its range of ~\$977k, is the target here and is not part of $X$.) Scaling all features to [0, 1] first puts them on a comparable footing, so PCA is not driven by differences in measurement units.