# Scaling Data: Normalization

## Introduction to Scaling and Normalization
In machine learning, features often come with very different ranges of values, from very small to very large numbers. Some algorithms, such as distance-based methods (e.g., K-Nearest Neighbors) and margin-based methods (e.g., Support Vector Machines), are sensitive to the scale of the data. When features (variables) are on different scales, the model's performance and accuracy can suffer.

### What is Normalization?
Normalization is the process of adjusting the data so that all the features have similar scales. This is often done by rescaling each feature to a fixed range, typically 0 to 1, so that no single feature dominates the model simply because of its large numerical range.

### Why is Normalization Important?
- **Improves Performance:** Scaled features help many machine learning algorithms, especially those trained with gradient descent, converge faster.
- **Prevents Bias:** Some algorithms give more weight to features with larger values, which skews results. Normalization puts all features on a comparable footing.

Normalization is a way to make sure that all the data we use in a machine learning model is on the **same scale**. This means that we change the data so that every feature (or column of data) has similar values, making it easier for the model to understand and process.

### Why do we need Normalization?

Imagine you have two types of data: 
1. **Height of people** in **cm** (ranging from 150 cm to 200 cm).
2. **Age of people** in **years** (ranging from 18 to 60 years).

In this case, **height** and **age** are on very different scales. One is between 150 and 200, and the other is between 18 and 60. If you put them both in a model without normalizing, the **height** might end up influencing the results more because its range is bigger. 

Normalization **puts both features (height and age) on the same scale**, so they are treated equally by the machine learning model.
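To make the effect of scale concrete, here is a minimal sketch using plain NumPy. The three people and their values are made up for illustration, and the helper `nearest` is hypothetical. It checks which of two people is the nearest neighbor of a query person when height is measured in centimetres, when it is measured in metres, and after Min-Max scaling with the ranges mentioned above (150 to 200 cm, 18 to 60 years).

```python
import numpy as np

# Hypothetical people as (height, age); the numbers are made up for illustration.
query = np.array([170.0, 30.0])
p1 = np.array([190.0, 31.0])   # close in age, far in height
p2 = np.array([171.0, 45.0])   # close in height, far in age


def nearest(q, a, b):
    """Return 'p1' or 'p2', whichever is closer to q in Euclidean distance."""
    return "p1" if np.linalg.norm(q - a) < np.linalg.norm(q - b) else "p2"


# Height in centimetres
nearest_cm = nearest(query, p1, p2)

# Same people, height converted to metres: only the unit changed
to_m = np.array([0.01, 1.0])
nearest_m = nearest(query * to_m, p1 * to_m, p2 * to_m)

# Min-Max scaling with the ranges from the text (150-200 cm, 18-60 years)
lo = np.array([150.0, 18.0])
hi = np.array([200.0, 60.0])


def scale(x):
    """Rescale (height_cm, age) to the [0, 1] range of each feature."""
    return (x - lo) / (hi - lo)


nearest_scaled = nearest(scale(query), scale(p1), scale(p2))

print("cm:", nearest_cm, "| m:", nearest_m, "| scaled:", nearest_scaled)
```

With the raw values, the answer depends on the unit chosen for height. After scaling, the unit no longer matters, because both features are expressed relative to their own range.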

## A Simple Example

Let's say we have the following data:

| Person | Height (cm) | Age (years) |
|--------|-------------|-------------|
| Alice  | 160         | 25          |
| Bob    | 170         | 30          |
| Charlie| 180         | 35          |

- **Height** is between 160 and 180 cm.
- **Age** is between 25 and 35 years.

Both features (Height and Age) are on different scales. To **normalize** the data, we can scale each feature so that it fits within a range between **0 and 1**. This makes sure both features are treated equally when used in a machine learning model.


## Types of Normalization

### 1. Min-Max Normalization:
This is the most commonly used method where each feature is scaled to a specific range, usually between 0 and 1. The formula is:

![My Image](./images/Min-Max-Normalization.jpg)



### 2. Z-Score Normalization (Standardization):
In this approach, the data is transformed so that it has a mean of 0 and a standard deviation of 1. The formula is:

\[
Z = \frac{X - \mu}{\sigma}
\]

Where:
- \(X\) is the value,
- \(\mu\) is the mean of the feature,
- \(\sigma\) is the standard deviation of the feature.

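This method is often preferred over Min-Max normalization when the data contains outliers, because the result is not pinned to the minimum and maximum values. Note that outliers still shift the mean and inflate the standard deviation, so standardized values are not fully robust to them.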

## When to Use Normalization:
- **Use Min-Max normalization** when your data has a known range, and you want to scale it to a specific range like [0, 1].
- **Use Z-score normalization** when your data has varying scales or outliers, and you need to center the data.
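To see how a single extreme value affects each method, here is a small sketch with NumPy. The five values are made up for illustration, and the Z-score uses the population standard deviation (NumPy's default).

```python
import numpy as np

# Illustrative values with one extreme outlier (made up for this sketch)
x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

# Min-Max normalization: the outlier defines the maximum
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: the outlier shifts the mean and inflates the std
z = (x - x.mean()) / x.std()

print("Min-Max:", min_max.round(3))
print("Z-score:", z.round(3))
```

In both results the four ordinary values end up packed closely together, so neither method removes the influence of the outlier. It is usually worth inspecting the data for extreme values before choosing a scaling method.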





## What Happens During Normalization?

Min-Max normalization changes the data so that all the values in a feature are between **0 and 1**. This is done by using a simple formula:

Normalized Value = (Current Value - Min Value) / (Max Value - Min Value)

Where:
- **Current Value** is the original value you want to normalize.
- **Min Value** is the smallest value in the feature.
- **Max Value** is the largest value in the feature.

### Example: Normalizing the Height and Age

For **Height** (between 160 and 180 cm):

- **For Alice** (Height = 160 cm):

  Normalized Height = (160 - 160) / (180 - 160) = 0

- **For Bob** (Height = 170 cm):

  Normalized Height = (170 - 160) / (180 - 160) = 0.5

- **For Charlie** (Height = 180 cm):

  Normalized Height = (180 - 160) / (180 - 160) = 1

For **Age** (between 25 and 35 years):

- **For Alice** (Age = 25):

  Normalized Age = (25 - 25) / (35 - 25) = 0

- **For Bob** (Age = 30):

  Normalized Age = (30 - 25) / (35 - 25) = 0.5

- **For Charlie** (Age = 35):

  Normalized Age = (35 - 25) / (35 - 25) = 1

### Result:

| Person | Normalized Height | Normalized Age |
|--------|-------------------|----------------|
| Alice  | 0.0               | 0.0            |
| Bob    | 0.5               | 0.5            |
| Charlie| 1.0               | 1.0            |

Now both **Height** and **Age** are between 0 and 1, and the machine learning model will treat them equally.
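The same arithmetic can be checked by hand with pandas before reaching for a library scaler. This is a minimal sketch that applies the Min-Max formula column by column to the three people from the table; the variable names are illustrative.

```python
import pandas as pd

# The three people from the example table
df = pd.DataFrame(
    {"Height": [160, 170, 180], "Age": [25, 30, 35]},
    index=["Alice", "Bob", "Charlie"],
)

# (value - column minimum) / (column maximum - column minimum), per column
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized)
```

Each column is scaled using its own minimum and maximum, which is why Height and Age end up with identical normalized values here even though their original units differ.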

## Code Example: Normalizing Data in Python

You can use Python and the scikit-learn library (imported as `sklearn`) to normalize your data automatically. Here's a simple example:




```python
# Importing necessary libraries
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Data: Heights and Ages
data = {'Height': [160, 170, 180],
        'Age': [25, 30, 35]}

# Creating a DataFrame
df = pd.DataFrame(data)

# Display the original data
print("Original Data:")
print(df)

# Creating a Normalizer (MinMaxScaler)
scaler = MinMaxScaler()

# Normalizing the data
normalized_data = scaler.fit_transform(df)

# Creating a DataFrame with normalized values
normalized_df = pd.DataFrame(normalized_data, columns=[
                             'Normalized Height', 'Normalized Age'])

# Display the normalized data
print("\nNormalized Data:")
print(normalized_df)
```

```
Original Data:
   Height  Age
0     160   25
1     170   30
2     180   35

Normalized Data:
   Normalized Height  Normalized Age
0                0.0             0.0
1                0.5             0.5
2                1.0             1.0
```


# Z-Score Normalization in Simple Language

## What is Z-Score Normalization?

**Z-score normalization**, also known as **standardization**, is a way to adjust the data so that it has a mean (average) of **0** and a standard deviation of **1**. This makes the data easier for machine learning models to understand and process, especially when the data varies greatly in scale.

## Why Use Z-Score Normalization?

Imagine you have two features in your data:

1. **Height**: People’s heights in centimeters (e.g., 150 cm, 170 cm, 180 cm)
2. **Age**: People’s ages in years (e.g., 20 years, 40 years, 60 years)

If you don’t normalize these values, the **Age** feature might seem more important in the model because it has a much wider range (20 to 60 years) compared to **Height** (150 cm to 180 cm). Z-score normalization helps by giving both features equal importance.

## How Does It Work?

Z-score normalization turns the data into a **standardized score** by considering how far each value is from the **average** (mean) and how **spread out** the values are (standard deviation).

### Formula:
Z = (X - μ) / σ


Where:
- **X** is the original value (e.g., someone’s height or age),
- **μ (mu)** is the **average** (mean) of all the values,
- **σ (sigma)** is the **standard deviation**, which tells us how spread out the values are.

The result, **Z**, tells us how far a specific value is from the average, in terms of how many "standard deviations" it is away.
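The formula can be applied directly with NumPy. This sketch reuses the three heights from the earlier Min-Max table (160, 170, 180 cm) and uses the population standard deviation, which is NumPy's default (`ddof=0`) and also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Heights from the earlier Min-Max example table, in cm
heights = np.array([160.0, 170.0, 180.0])

mu = heights.mean()      # average of the values
sigma = heights.std()    # population standard deviation (ddof=0)

# Z = (X - mu) / sigma, applied to every value at once
z = (heights - mu) / sigma

print("mean:", mu)
print("std:", round(sigma, 4))
print("z-scores:", z.round(4))
```

The middle value sits exactly at the mean, so its Z-score is 0, and the other two are the same distance above and below it.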

## Simple Example:

Let’s say we have these **Heights**: 150 cm, 160 cm, 170 cm, 180 cm.

- The **mean** (average) height is **165 cm**.
- The **standard deviation** is about **11.18 cm** (population standard deviation; this shows how spread out the values are).

Now, let’s normalize a height of **170 cm** using the formula:

Z = (170 - 165) / 11.18 = 5 / 11.18 ≈ 0.447


This means **170 cm** is about **0.45 standard deviations** above the average.

If someone’s height is **150 cm**, the Z-score would be:

Z = (150 - 165) / 11.18 = -15 / 11.18 ≈ -1.342


This means **150 cm** is about **1.34 standard deviations** below the average.

## What Does This Mean?

- **Z-scores** help us understand how far or close a value is from the average in a standardized way.
- After Z-score normalization, all features (like height, age, weight) will have values centered around **0** and can be compared more easily.
- This technique is useful when the data has different units or scales (e.g., height in centimeters and weight in kilograms), making sure that no single feature dominates the model because of its scale.

## Conclusion:

Z-score normalization is a way to make data comparable by adjusting it to have a mean of **0** and a standard deviation of **1**. It helps in giving all features the same importance when building machine learning models.




```python
# Import necessary libraries
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Data: Heights
data = {'Height': [150, 160, 170, 180]}

# Creating a DataFrame
df = pd.DataFrame(data)

# Display the original data
print("Original Data:")
print(df)

# Creating a Z-score Normalizer (StandardScaler)
scaler = StandardScaler()

# Normalizing the data using Z-score normalization
normalized_data = scaler.fit_transform(df)

# Creating a DataFrame with normalized values
normalized_df = pd.DataFrame(normalized_data, columns=['Normalized Height'])

# Display the normalized data
print("\nNormalized Data (Z-score):")
print(normalized_df)
```

```
Original Data:
   Height
0     150
1     160
2     170
3     180

Normalized Data (Z-score):
   Normalized Height
0          -1.341641
1          -0.447214
2           0.447214
3           1.341641
```


# L1 Regularization and L1 Normalization

**L1 regularization** is a technique used in machine learning to help a model make better predictions by **ignoring irrelevant information**. It is often confused with **L1 normalization**, which is a different operation: L1 normalization rescales each row of data so that the absolute values in that row add up to 1.

## What is L1 Regularization?

Imagine you're making a decision based on several factors. For example, when deciding what car to buy, you might consider the **color**, **price**, **fuel efficiency**, and **safety features**. Some of these factors might matter more than others.

In machine learning, we often have many factors (called **features**). L1 regularization helps the model decide which features are **important** and which ones it can **ignore**. It does this by pushing the weights (coefficients) of the **less important features** to **zero**, so the model focuses on just the important ones.

## Why is L1 Regularization Useful?

1. **Simplifies the Model**: By ignoring irrelevant features, the model becomes **simpler** and often **faster**.
2. **Can Improve Accuracy**: With fewer features to worry about, the model can focus on what really matters, which can lead to **better predictions** on new data.
3. **Helps Prevent Overfitting**: Overfitting happens when the model gets too complicated and starts "memorizing" data instead of learning from it. L1 helps by keeping things simple.

## Simple Example

Imagine you're building a model to predict whether someone will buy a product based on **age**, **income**, and **number of social media followers**.

- **Without L1 regularization**: The model might assign a weight to all three features, even though the number of social media followers might not really matter.
- **With L1 regularization**: The model might decide that **"number of social media followers"** isn’t important and completely **ignore it** by setting its weight to zero. This makes the model **simpler** and **faster**.

## How Does L1 Regularization Work?

L1 regularization adds a penalty to the training objective that grows with the sum of the absolute values of the model's weights. This **shrinks** the weights of features that don’t help with predictions, and it can give **exactly zero weight** to unimportant features, so they don’t influence the model’s decision-making.

## Key Takeaways

- **L1 regularization** helps the model focus on the important features and **ignore the unimportant ones**.
- It often makes the model **simpler** and **faster**, and it can make it **more accurate** on new data.
- It’s especially useful when you have a lot of features and want to find out which ones really matter.
- **L1 normalization** only rescales the data. The code cell that follows uses `Normalizer(norm='l1')`, which divides every value in a row by the sum of the absolute values in that row. It does not remove any feature or set any weight to zero.

L1 regularization is like cleaning up your decision-making process by throwing out irrelevant details and focusing on what truly matters.
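The feature-dropping behavior described above comes from an L1 penalty on the model's weights. Here is a minimal sketch using scikit-learn's `Lasso` (linear regression with an L1 penalty). The data is synthetic and built so that the target depends on `age` and `income` but not on `followers`. The penalty strength `alpha=0.1` is an arbitrary choice for illustration, and a regression target stands in for the buy/no-buy prediction in the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200

# Synthetic features on a comparable scale (illustrative, not real data)
age = rng.normal(size=n)
income = rng.normal(size=n)
followers = rng.normal(size=n)
X = np.column_stack([age, income, followers])

# By construction, the target ignores the followers feature
y = 3.0 * age + 2.0 * income + rng.normal(scale=0.1, size=n)

# Linear regression with an L1 penalty on the coefficients
model = Lasso(alpha=0.1)
model.fit(X, y)
coef = model.coef_

for name, weight in zip(["age", "income", "followers"], coef):
    print(f"{name:>10}: {weight:.3f}")
```

The weight for `followers` comes out as zero, while the weights for `age` and `income` are shrunk slightly toward zero. A larger `alpha` would shrink them further and could zero out more features.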


```python
import pandas as pd
from sklearn.preprocessing import Normalizer

# Load the CSV file
# Replace with your actual CSV file path
path = r'diabetes.csv'
data = pd.read_csv(path)

# Select the features to normalize (excluding the target variable 'Outcome')
features = data.drop(columns=['Outcome'])
print(features.head())

# Create a Normalizer object for L1 normalization
scaler = Normalizer(norm='l1')

# Apply L1 normalization
normalized_data = scaler.fit_transform(features)

# Convert the normalized data back to a DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=features.columns)

# Add the 'Outcome' column back to the normalized data
normalized_df['Outcome'] = data['Outcome']

# Print the normalized data
print(normalized_df.head())
```

```
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  
   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \
0     0.017380  0.428703       0.208558       0.101383  0.000000  0.097327   
1     0.004185  0.355721       0.276207       0.121364  0.000000  0.111320   
2     0.025726  0.588477       0.205806       0.000000  0.000000  0.074926   
3     0.003103  0.276169       0.204799      
```
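To see what `Normalizer(norm='l1')` does to a single row, here is a short sketch that takes the first row of features printed above and compares the scaler's result with a manual calculation: each value divided by the sum of the absolute values in that row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# First row of the features printed above (Pregnancies ... Age)
row = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])

# L1 normalization with scikit-learn (works row by row)
l1 = Normalizer(norm="l1").fit_transform(row)

# The same thing by hand: divide by the sum of absolute values in the row
manual = row / np.abs(row).sum(axis=1, keepdims=True)

print(l1.round(6))
print("row sum after normalization:", l1.sum())
```

Because the scaling is done per row rather than per column, a feature with large raw values (such as Glucose) still takes up most of each normalized row.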

# L2 Normalization

```python
import pandas as pd
from sklearn.preprocessing import Normalizer

# Sample data (defined inline instead of being read from a CSV file)
data = {
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, None, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, None, 58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# Fill missing values (NaN) with the mean of the respective column.
# Assigning the result back avoids the pandas warning about inplace=True
# on a column selection.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Extract the features we want to normalize
features = df[['Age', 'Salary']]

# Initialize the Normalizer for L2 normalization
scaler = Normalizer(norm='l2')

# Apply L2 normalization
normalized_data = scaler.fit_transform(features)

# Create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(normalized_data, columns=['Age', 'Salary'])

# Add the 'Country' and 'Purchased' columns back to the normalized data
normalized_df['Country'] = df['Country']
normalized_df['Purchased'] = df['Purchased']

# Display the normalized data
print(normalized_df)
```

```
        Age  Salary  Country Purchased
0  0.000611     1.0   France        No
1  0.000562     1.0    Spain       Yes
2  0.000556     1.0  Germany        No
3  0.000623     1.0    Spain        No
4  0.000627     1.0  Germany       Yes
5  0.000603     1.0   France       Yes
6  0.000746     1.0    Spain        No
7  0.000608     1.0   France       Yes
8  0.000602     1.0  Germany        No
9  0.000552     1.0   France       Yes
```
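The L2-normalized output above shows Salary as 1.0 and Age as a very small number in every row. A short manual calculation for the first row (Age 44, Salary 72000) shows why: L2 normalization divides each value by the length of the row vector, and that length is almost entirely determined by the salary.

```python
import numpy as np

# Age and Salary of the first row in the sample data
row = np.array([44.0, 72000.0])

# Euclidean (L2) length of the row: sqrt(44^2 + 72000^2)
length = np.sqrt((row ** 2).sum())

# L2 normalization: divide every value by the row length
unit = row / length

print("row length:", length)
print("normalized row:", unit)
print("length after normalization:", np.linalg.norm(unit))
```

If the goal is to balance Age and Salary against each other, a column-wise method such as Min-Max or Z-score normalization is usually the better fit. Row-wise L2 normalization is mainly useful when only the direction of each row vector matters.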


# Standardization

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv("Data.csv")

# Fill missing values with the mean (optional step for simplicity).
# Assigning the result back avoids the pandas warning about inplace=True
# on a column selection.
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())

# Select the columns to standardize
features = data[['Age', 'Salary']]

# Apply Standardization
scaler = StandardScaler()
standardized_data = scaler.fit_transform(features)

# Convert back to a DataFrame for better readability
standardized_df = pd.DataFrame(standardized_data, columns=[
                               'Age_Standardized', 'Salary_Standardized'])

# Merge with the original DataFrame
result = pd.concat([data, standardized_df], axis=1)

# Print the resulting DataFrame
print(result)
```

```
   Country        Age        Salary Purchased  Age_Standardized  \
0   France  44.000000  72000.000000        No          0.758874   
1    Spain  27.000000  48000.000000       Yes         -1.711504   
2  Germany  30.000000  54000.000000        No         -1.275555   
3    Spain  38.000000  61000.000000        No         -0.113024   
4  Germany  40.000000  63777.777778       Yes          0.177609   
5   France  35.000000  58000.000000       Yes         -0.548973   
6    Spain  38.777778  52000.000000        No          0.000000   
7   France  48.000000  79000.000000       Yes          1.340140   
8  Germany  50.000000  83000.000000        No          1.630773   
9   France  37.000000  67000.000000       Yes         -0.258340   

   Salary_Standardized  
0         7.494733e-01  
1        -1.438178e+00  
2        -8.912655e-01  
3        -2.532004e-01  
4         6.632192e-16  
5        -5.266569e-01  
6        -1.073570e+00  
7         1.387538e+00  
8         1.752147e+00  
9         2.
```


# Data Splitting

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv("Data.csv")

# Fill missing values (optional, for simplicity).
# Assigning the result back avoids the pandas warning about inplace=True
# on a column selection.
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Salary'] = data['Salary'].fillna(data['Salary'].mean())

# Separate features (X) and target (y)
X = data[['Age', 'Salary']]  # Features (independent variables)
y = data['Purchased']        # Target (dependent variable)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Print the sizes of the datasets
print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
print("Training Target Shape:", y_train.shape)
print("Testing Target Shape:", y_test.shape)

# Display sample data
print("\nTraining Features:\n", X_train.head())
print("\nTesting Features:\n", X_test.head())
```

```
Training Features Shape: (7, 2)
Testing Features Shape: (3, 2)
Training Target Shape: (7,)
Testing Target Shape: (3,)

Training Features:
     Age        Salary
0  44.0  72000.000000
7  48.0  79000.000000
2  30.0  54000.000000
9  37.0  67000.000000
4  40.0  63777.777778

Testing Features:
     Age   Salary
8  50.0  83000.0
1  27.0  48000.0
5  35.0  58000.0
```
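When scaling and splitting are combined, a common practice is to fit the scaler on the training set only and then apply the same fitted scaler to the test set, so that no information from the test rows influences the scaling. The sketch below illustrates this under a few assumptions: it reuses the Age, Salary, and Purchased values from the inline sample data, leaves out the two rows with missing values to keep it short, and uses an arbitrary `test_size=0.25`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Values from the inline sample data; the two rows with missing
# Age or Salary are left out here to keep the sketch short.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 35, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 58000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"],
})

X = df[["Age", "Salary"]]
y = df["Purchased"]

# Split first, so the test rows stay unseen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same training statistics to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training mean per column:", X_train_scaled.mean(axis=0).round(6))
print("Test mean per column:", X_test_scaled.mean(axis=0).round(6))
```

The scaled training columns have a mean of 0 by construction. The test columns generally do not, because they were scaled with the training statistics rather than their own, which is the intended behavior.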
