# Session 34: Data Transformation (Standardization)

**Unit 3: Data Collection and Cleaning**
**Hour: 34**
**Mode: Practical Lab**

---

### 1. Objective

This lab introduces another crucial data transformation technique: **Standardization** (or Z-score scaling).

**What is Standardization?** It is the process of rescaling data to have a **mean of 0** and a **standard deviation of 1**.

**Normalization vs. Standardization:**
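*   **Normalization (0 to 1):** Useful when your data has a known minimum and maximum or when you need values in a bounded range. It is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses every other value.
*   **Standardization (Mean=0, Std=1):** Works well when your data is roughly normally distributed (bell-shaped), and it does not force values into a fixed range. Outliers still shift the mean and inflate the standard deviation, but they do not pin the rest of the data to one end of an interval. Many models that rely on distances or gradient-based optimization work best with features on this kind of scale, so standardization is a common default in practice.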
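To make the outlier comparison concrete, here is a small sketch on made-up numbers. The array `values` and its contents are purely illustrative and are not part of the lab data; `MinMaxScaler` stands in for normalization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: four ordinary values and one outlier (1000)
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Normalization: rescale to the range [0, 1]
normalized = MinMaxScaler().fit_transform(values).ravel()

# Standardization: rescale to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(values).ravel()

print("Normalized:  ", np.round(normalized, 3))
print("Standardized:", np.round(standardized, 3))
```

In the normalized output the outlier sits at 1 and the four ordinary values are squeezed into a narrow band just above 0. The standardized output is also pulled by the outlier, but it is not confined to a fixed interval.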

### 2. Setup

We need Pandas and the `StandardScaler` from Scikit-learn.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create a simple DataFrame
data = {'Value': [10, 20, 50, 100, 5]}
df_simple = pd.DataFrame(data)

### 3. The Standardization Process

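The formula for Standardization is: `z = (x - mean(x)) / std(x)`, where `std(x)` is the population standard deviation (the sum of squared deviations is divided by `n`, not `n - 1`). This is the version `StandardScaler` uses.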
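The formula can be checked by hand before using the scaler. The sketch below applies it with NumPy to the same five values as the setup cell and compares the result with `StandardScaler`; the names `x`, `z_manual`, and `z_sklearn` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same five values as the 'Value' column in the setup cell
x = np.array([10, 20, 50, 100, 5], dtype=float)

# Apply the formula directly; np.std defaults to the population std (ddof=0)
z_manual = (x - x.mean()) / x.std()

# StandardScaler expects 2-D input, so reshape to a single column
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(np.round(z_manual, 3))
print("Matches StandardScaler:", np.allclose(z_manual, z_sklearn))
```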

#### Step 1: Initialize the Scaler

We create an instance of the `StandardScaler` object.

In [None]:
scaler = StandardScaler()

#### Step 2: Fit and Transform the Data

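The workflow is the same as with `MinMaxScaler`. The `fit_transform` method first learns the mean and standard deviation of the column (fit) and then applies the transformation (transform) in a single call. The double brackets in `df_simple[['Value']]` select a one-column DataFrame, because the scaler expects 2-D input.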
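To see what the scaler actually learns, the two steps can also be run separately. This is a minimal sketch that rebuilds the five-value column under the hypothetical names `df_demo` and `demo_scaler`; `mean_` and `scale_` are the fitted attributes that hold the learned mean and standard deviation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rebuild the toy column so this cell runs on its own (names are illustrative)
df_demo = pd.DataFrame({'Value': [10, 20, 50, 100, 5]})

demo_scaler = StandardScaler()

# fit() only learns the statistics; it does not change the data
demo_scaler.fit(df_demo[['Value']])
print("Learned mean:", demo_scaler.mean_)
print("Learned std: ", demo_scaler.scale_)

# transform() applies the learned statistics, here to a new value equal to the mean
new_value = pd.DataFrame({'Value': [37]})
print("Scaled new value:", demo_scaler.transform(new_value))
```

Because the statistics are stored on the scaler, the same fitted object can be reused on data it has not seen, which is why `fit` and `transform` exist as separate methods.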

In [None]:
scaled_values = scaler.fit_transform(df_simple[['Value']])

df_simple['Value_Standardized'] = scaled_values
df_simple

**Interpretation:**
*   Values below the original mean are now negative.
*   Values above the original mean are now positive.
*   The magnitude of the number represents how many standard deviations it is from the mean.
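As a worked example on the five values from the setup cell (10, 20, 50, 100, 5), the arithmetic behind the largest and smallest entries is shown below. The symbols μ (mean), σ (population standard deviation), and z (standardized value) are shorthand introduced here.

$$
\mu = \frac{10 + 20 + 50 + 100 + 5}{5} = 37, \qquad
\sigma = \sqrt{\frac{(-27)^2 + (-17)^2 + 13^2 + 63^2 + (-32)^2}{5}} = \sqrt{1236} \approx 35.16
$$

$$
z_{100} = \frac{100 - 37}{35.16} \approx 1.79, \qquad
z_{5} = \frac{5 - 37}{35.16} \approx -0.91
$$

So 100 lies about 1.79 standard deviations above the mean, and 5 lies about 0.91 standard deviations below it.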

#### Step 3: Verify the Result

Let's check the mean and standard deviation of our new standardized column. The mean should be very close to 0, and the standard deviation should be very close to 1.

In [None]:
mean = df_simple['Value_Standardized'].mean()
# ddof=0 gives the population standard deviation, which is what StandardScaler uses;
# the pandas default (ddof=1) would report about 1.12 for these five rows
std_dev = df_simple['Value_Standardized'].std(ddof=0)

print(f"New Mean: {mean:.2f}") # A tiny floating-point value, effectively zero
print(f"New Standard Deviation: {std_dev:.2f}")
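One detail affects this check: the pandas `std()` method defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` scales by the population standard deviation (`ddof=0`). The sketch below rebuilds the five-value column under the hypothetical name `df_check` and shows how far apart the two are on such a small sample.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rebuild and standardize the toy column (names are illustrative)
df_check = pd.DataFrame({'Value': [10, 20, 50, 100, 5]})
df_check['Value_Standardized'] = StandardScaler().fit_transform(df_check[['Value']]).ravel()

# Sample standard deviation (pandas default, divides by n - 1)
std_sample = df_check['Value_Standardized'].std()

# Population standard deviation (divides by n), matching StandardScaler
std_population = df_check['Value_Standardized'].std(ddof=0)

print(f"ddof=1: {std_sample:.4f}")
print(f"ddof=0: {std_population:.4f}")
```

With only five rows the gap is noticeable (a factor of `sqrt(5/4)`); on datasets with thousands of rows it becomes negligible.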

### 4. Application to the Telco Dataset

Let's apply standardization to the numerical columns in our Telco dataset.

In [None]:
# Load the data
url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)

# Select only the numerical columns we want to scale
numerical_cols = ['tenure', 'MonthlyCharges']
df_numerical = df[numerical_cols]

# Initialize a new scaler
telco_scaler = StandardScaler()

# Fit and transform
df_scaled = telco_scaler.fit_transform(df_numerical)

# Convert back to a DataFrame to view it, keeping the original row index
scaled_col_names = [f'{col}_scaled' for col in numerical_cols]
df_scaled = pd.DataFrame(df_scaled, columns=scaled_col_names, index=df.index)

df_scaled.head()

In [None]:
# Let's check the mean and std dev of the new scaled columns
df_scaled.describe()

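**Interpretation:** Looking at the `describe()` output, we can see the `mean` for both scaled columns is effectively zero and the `std` (standard deviation) is approximately 1.0, confirming the transformation was successful. The `std` shown is very slightly above 1 because `describe()` reports the sample standard deviation, whereas `StandardScaler` uses the population standard deviation; with several thousand rows the difference is negligible.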

### 5. Conclusion

In this session on Standardization (Z-score Scaling), you learned how to:
1.  Explain its purpose: to rescale data to have a mean of 0 and a standard deviation of 1.
2.  Recognize when to use it versus Normalization.
3.  Use the `StandardScaler` from Scikit-learn.
4.  Verify the transformation by checking the new mean and standard deviation.

Scaling your data is a vital preprocessing step for many machine learning models, particularly those that rely on distances between points or on gradient-based optimization.

**Next Session:** We will learn about another important data preparation technique: Feature Engineering.