<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_7/Section_6_Python_Example__Normalizing_and_Scaling_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 6 - Python example: Normalizing and scaling data
In data preprocessing, normalization and scaling are techniques for bringing the independent variables (features) of a dataset onto a common range. They matter most in machine learning, where algorithms that rely on distances or gradient-based optimization can perform very differently depending on how the features are scaled. This section provides a Python example demonstrating how to normalize and scale data using Scikit-learn, a machine learning library with easy-to-use tools for these purposes.

1. Setting Up the Environment:

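Before implementing normalization and scaling, ensure your Python environment includes Scikit-learn, along with Pandas and NumPy, which the later cells import. Anything missing can be installed with pip; inside a notebook, the `%pip` magic installs into the environment of the running kernel:

In [None]:
%pip install scikit-learn pandas numpy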

2. Importing Required Libraries:

Scikit-learn provides specific modules for preprocessing data. We'll also use Pandas for data manipulation and NumPy for any additional numerical operations:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

3. Creating Sample Data:

Let's create a DataFrame with some sample data representing ages and incomes, which often require scaling due to their different ranges:

In [None]:
# Create a DataFrame with two features on very different scales
data = pd.DataFrame({
    'Age': [25, 35, 45, 55, 20, 30, 40, 50, 60],
    'Income': [50000, 60000, 70000, 80000, 35000, 45000, 55000, 65000, 75000],
})
print("Original Data:\n", data)

4. Applying Min-Max Scaling:

MinMaxScaler rescales each feature linearly to a given range, [0, 1] by default: the smallest value of a feature maps to 0 and the largest to 1. This is useful when features need to lie in a bounded, non-negative interval, which is a common requirement for neural networks and other algorithms that expect bounded inputs.

In [None]:
# Initialize the MinMaxScaler
scaler = MinMaxScaler()
# Fit and transform the data
data_scaled = scaler.fit_transform(data)
# Convert the array back to a DataFrame
data_scaled = pd.DataFrame(data_scaled, columns=['Age', 'Income'])
print("Data after Min-Max Scaling:\n", data_scaled)
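
To make the transformation concrete, min-max scaling computes $x' = (x - x_{min}) / (x_{max} - x_{min})$ for each feature. The following is a minimal sketch that reproduces this by hand on the same sample data and checks it against MinMaxScaler. The variable names `manual_minmax`, `sklearn_minmax`, and `scaled_pm1` are illustrative, and the `(-1, 1)` range is just an example of the `feature_range` parameter.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample data as in the section above
data = pd.DataFrame({
    'Age': [25, 35, 45, 55, 20, 30, 40, 50, 60],
    'Income': [50000, 60000, 70000, 80000, 35000, 45000, 55000, 65000, 75000],
})

# Manual min-max scaling, applied column by column: (x - min) / (max - min)
manual_minmax = (data - data.min()) / (data.max() - data.min())

# The scikit-learn result should match the manual calculation
sklearn_minmax = MinMaxScaler().fit_transform(data)
print(np.allclose(manual_minmax.to_numpy(), sklearn_minmax))  # True

# A different target interval can be requested with feature_range
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)
print(scaled_pm1.min(axis=0), scaled_pm1.max(axis=0))
```

For example, the first age (25) becomes (25 - 20) / (60 - 20) = 0.125, since the smallest age in the sample is 20 and the largest is 60.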

5. Applying Standardization (Z-score Normalization):

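StandardScaler subtracts each feature's mean and divides by its standard deviation, so every feature ends up with zero mean and unit variance. This technique is typically less distorted by outliers than min-max scaling, whose range is set entirely by the most extreme values, but it is not robust to them: the mean and standard deviation both shift when outliers are present. Standardization is often used before clustering analyses and principal component analysis (PCA).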

In [None]:
# Initialize the StandardScaler
standard_scaler = StandardScaler()
# Fit and transform the data
data_standardized = standard_scaler.fit_transform(data)
# Convert the array back to a DataFrame
data_standardized = pd.DataFrame(data_standardized, columns=['Age', 'Income'])
print("Data after Standardization:\n", data_standardized)
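
As a check on what standardization computes, the z-score is $z = (x - \mu) / \sigma$ for each feature. The sketch below reproduces it by hand and compares it with StandardScaler. One detail worth noting is that StandardScaler uses the population standard deviation (`ddof=0`), whereas pandas defaults to the sample standard deviation (`ddof=1`), so the manual version sets `ddof=0` explicitly. The names `manual_z` and `sklearn_z` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as in the section above
data = pd.DataFrame({
    'Age': [25, 35, 45, 55, 20, 30, 40, 50, 60],
    'Income': [50000, 60000, 70000, 80000, 35000, 45000, 55000, 65000, 75000],
})

# Manual z-score using the population standard deviation (ddof=0)
manual_z = (data - data.mean()) / data.std(ddof=0)

# The scikit-learn result should match the manual calculation
sklearn_z = StandardScaler().fit_transform(data)
print(np.allclose(manual_z.to_numpy(), sklearn_z))  # True

# Each standardized column has mean close to 0 and standard deviation close to 1
print(sklearn_z.mean(axis=0).round(6), sklearn_z.std(axis=0).round(6))
```

Using the pandas default (`ddof=1`) instead would give slightly smaller values, a difference that is noticeable on a nine-row sample like this one.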

6. Conclusion:

Normalization and scaling are key preprocessing techniques that put the features of a dataset on comparable scales, so that a model is not dominated by variables that simply have larger numeric ranges. Scikit-learn's preprocessing tools make these techniques straightforward to apply, and for scale-sensitive algorithms they can improve accuracy and overall predictive performance. They matter most in datasets whose variables differ significantly in range or distribution, as is often the case in real-world data.
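
The point about larger-scale variables dominating can be illustrated with a small sketch. It measures how much of the squared Euclidean distance between two rows comes from Income, before and after standardization. Comparing the first two rows and using the "share of squared distance" as a measure are illustrative choices made here, not part of any Scikit-learn API.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as in the section above
data = pd.DataFrame({
    'Age': [25, 35, 45, 55, 20, 30, 40, 50, 60],
    'Income': [50000, 60000, 70000, 80000, 35000, 45000, 55000, 65000, 75000],
})

# Difference between the first two rows on the raw scale: 10 years vs 10000 in income
raw = data.to_numpy(dtype=float)
raw_diff = raw[0] - raw[1]
raw_share_income = raw_diff[1] ** 2 / np.sum(raw_diff ** 2)

# The same comparison after standardization
scaled = StandardScaler().fit_transform(data)
scaled_diff = scaled[0] - scaled[1]
scaled_share_income = scaled_diff[1] ** 2 / np.sum(scaled_diff ** 2)

print(f"Income share of squared distance (raw):    {raw_share_income:.6f}")
print(f"Income share of squared distance (scaled): {scaled_share_income:.6f}")
```

On the raw data, Income accounts for almost all of the distance, so a distance-based method would effectively ignore Age. After standardization the two features contribute in roughly similar proportions.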