# Session 33: Data Transformation (Normalization)

**Unit 3: Data Collection and Cleaning**
**Hour: 33**
**Mode: Practical Lab**

---

### 1. Objective

This lab introduces a key data transformation technique: **Normalization**. We will learn why it's important and how to apply it using the Scikit-learn library.

**What is Normalization?** In this lab, normalization means Min-Max scaling: rescaling numerical data to a fixed range, typically **0 to 1**.

**Why do we do it?** Many machine learning algorithms, especially those based on distances or gradient descent (for example k-nearest neighbors or k-means), perform better when numerical features are on a similar scale. Without scaling, such an algorithm can treat `TotalCharges` (values up to ~8000) as more "important" than `tenure` (values up to 72) simply because its numbers are bigger. Normalization removes this bias.
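To make the scale problem concrete, here is a minimal sketch using two made-up customers. The values and the assumed feature ranges (tenure 0 to 72, TotalCharges 0 to 8000) are hypothetical and chosen only for illustration. It measures how much of the squared Euclidean distance comes from `tenure` before and after rescaling.

```python
import math

# Two hypothetical customers: (tenure in months, TotalCharges).
# These values are invented for illustration, not taken from the dataset.
customer_a = (2, 3000.0)
customer_b = (70, 4000.0)

# Raw Euclidean distance between the two customers
raw_distance = math.dist(customer_a, customer_b)

def scale(value, low, high):
    """Rescale a value to 0-1 given an assumed feature range."""
    return (value - low) / (high - low)

# Rescale both features using the assumed ranges
scaled_a = (scale(customer_a[0], 0, 72), scale(customer_a[1], 0, 8000))
scaled_b = (scale(customer_b[0], 0, 72), scale(customer_b[1], 0, 8000))
scaled_distance = math.dist(scaled_a, scaled_b)

# Share of the squared distance that comes from tenure
raw_tenure_share = (customer_b[0] - customer_a[0]) ** 2 / raw_distance ** 2
scaled_tenure_share = (scaled_b[0] - scaled_a[0]) ** 2 / scaled_distance ** 2

print(f"tenure share before scaling: {raw_tenure_share:.3f}")
print(f"tenure share after scaling:  {scaled_tenure_share:.3f}")
```

Before scaling, the large tenure gap barely affects the distance. After scaling, it accounts for most of it.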

### 2. Setup

We need Pandas and the `MinMaxScaler` from the Scikit-learn library.

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Create a simple DataFrame to clearly see the effect
data = {'Value': [10, 20, 50, 100, 5]}
df_simple = pd.DataFrame(data)
df_simple

### 3. The Normalization Process

The formula for Min-Max scaling is: `(x - min(x)) / (max(x) - min(x))`

This formula guarantees that the new minimum value of the column will be 0 and the new maximum value will be 1, provided the column contains at least two distinct values (otherwise the denominator is zero).
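As a quick check of the formula, the sketch below applies it by hand with Pandas to the same five values used in this lab. The names `values` and `manual_scaled` are illustrative only.

```python
import pandas as pd

# The same values as in df_simple
values = pd.Series([10, 20, 50, 100, 5], name='Value')

# Apply (x - min(x)) / (max(x) - min(x)) directly
manual_scaled = (values - values.min()) / (values.max() - values.min())

print(manual_scaled.round(4).tolist())
# → [0.0526, 0.1579, 0.4737, 1.0, 0.0]
```

For example, 10 becomes (10 - 5) / (100 - 5) = 5 / 95 ≈ 0.0526.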

#### Step 1: Initialize the Scaler

We create an instance of the `MinMaxScaler` object.

In [None]:
scaler = MinMaxScaler()

#### Step 2: Fit and Transform the Data

The `fit_transform` method does two things:
1.  **`fit`**: It looks at the data to learn the parameters it needs (in this case, the minimum and maximum values).
2.  **`transform`**: It applies the scaling formula to the data.
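The two steps can also be run separately. The sketch below, which uses a hypothetical `demo_scaler` and a small `new_values` frame invented for illustration, calls `fit` and `transform` one at a time and inspects the learned parameters through the `data_min_` and `data_max_` attributes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_demo = pd.DataFrame({'Value': [10, 20, 50, 100, 5]})
demo_scaler = MinMaxScaler()

# fit: learn the minimum and maximum of each column
demo_scaler.fit(df_demo[['Value']])
print(demo_scaler.data_min_, demo_scaler.data_max_)

# transform: apply the learned parameters, here to values not used in fit
new_values = pd.DataFrame({'Value': [5, 52.5, 100]})
scaled_new = demo_scaler.transform(new_values)
print(scaled_new)
```

Keeping the steps separate is useful when the same scaling has to be applied to data that was not available during `fit`, such as a test set.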

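**Important:** The scaler expects 2D input (rows and columns). Selecting with double square brackets, `df_simple[['Value']]`, returns a one-column DataFrame, which is 2D. Single brackets, `df_simple['Value']`, would return a 1D Series, which the scaler rejects.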
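The difference between the two selections is easy to see from their shapes. This is a small sketch; the variable names are illustrative, and `reshape(-1, 1)` is shown as one possible alternative when starting from a 1D NumPy array.

```python
import pandas as pd

df_demo = pd.DataFrame({'Value': [10, 20, 50, 100, 5]})

single = df_demo['Value']     # Series: 1D
double = df_demo[['Value']]   # DataFrame: 2D with one column

print(single.shape)  # (5,)
print(double.shape)  # (5, 1)

# Alternative: turn a 1D array into a single 2D column
as_column = single.to_numpy().reshape(-1, 1)
print(as_column.shape)  # (5, 1)
```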

In [None]:
scaled_values = scaler.fit_transform(df_simple[['Value']])
print(scaled_values)

The result is a 2D NumPy array with one column. Let's add it back to our DataFrame to see it clearly.

In [None]:
# Flatten the (5, 1) array to 1D before storing it as a column
df_simple['Value_Normalized'] = scaled_values.ravel()
df_simple

**Interpretation:**
*   The smallest original value (5) is now 0.
*   The largest original value (100) is now 1.
*   All other values are scaled proportionally between 0 and 1.
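One way to confirm that the scaling is proportional and loses no information is to reverse it. The sketch below assumes a fresh `demo_scaler` and uses the scaler's `inverse_transform` method to map the 0 to 1 values back to the original scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_demo = pd.DataFrame({'Value': [10, 20, 50, 100, 5]})
demo_scaler = MinMaxScaler()
scaled = demo_scaler.fit_transform(df_demo[['Value']])

# inverse_transform maps the scaled values back to the original scale
restored = demo_scaler.inverse_transform(scaled)

# Compare with the original column, allowing for floating-point rounding
print(np.allclose(restored.ravel(), df_demo['Value']))
```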

### 4. Application to the Telco Dataset

Let's apply this to the numerical columns in our Telco dataset.

In [None]:
# Load the data
url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

# Select only the numerical columns we want to scale
numerical_cols = ['tenure', 'MonthlyCharges'] # We'll ignore TotalCharges for simplicity
df_numerical = df[numerical_cols]

# Initialize a new scaler
telco_scaler = MinMaxScaler()

# Fit and transform (the result is a NumPy array)
scaled_array = telco_scaler.fit_transform(df_numerical)

# Convert the array to a DataFrame, keeping the original row index
scaled_col_names = [f'{col}_scaled' for col in numerical_cols]
df_scaled = pd.DataFrame(scaled_array, columns=scaled_col_names, index=df.index)

df_scaled.head()

In [None]:
# Let's check the min and max of the new scaled columns
df_scaled.describe()

As expected, the `min` for both scaled columns is 0 and the `max` is 1.
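If the scaled columns need to sit next to the original ones, one option is to join them on the row index. The sketch below uses a small stand-in frame with invented values rather than the real Telco data, so the names `df_demo` and `df_combined` and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the Telco data (values invented for illustration)
df_demo = pd.DataFrame({
    'tenure': [1, 34, 2, 45, 72],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30, 118.75],
})
numerical_cols = ['tenure', 'MonthlyCharges']

scaled_array = MinMaxScaler().fit_transform(df_demo[numerical_cols])

# Build the scaled frame on the same index, then join it to the original
df_scaled_demo = pd.DataFrame(
    scaled_array,
    columns=[f'{col}_scaled' for col in numerical_cols],
    index=df_demo.index,
)
df_combined = df_demo.join(df_scaled_demo)

# Compact check of the new range for each scaled column
print(df_combined[df_scaled_demo.columns].agg(['min', 'max']))
```

Reusing the original index keeps each scaled row aligned with the row it came from, which matters if rows were filtered or reordered earlier.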

### 5. Conclusion

In this session, you learned about Normalization (Min-Max Scaling). You can now:
1.  Explain its purpose: to scale data to a fixed range (0-1) for machine learning algorithms.
2.  Use the `MinMaxScaler` from Scikit-learn.
3.  Follow the `initialize -> fit_transform` process.
4.  Apply the technique to a real-world dataset.

**Next Session:** We will learn about a closely related data scaling technique called Standardization.