# Transforming Dataset Distributions

In this notebook, we will learn how to rescale the values of a dataset column using **normalization** (min-max scaling) and **standardization** (z-score scaling). Both techniques put numeric features measured on different scales onto a comparable scale, which many machine learning algorithms need in order to work well.

## Import Libraries

We will use `numpy`, `pandas`, `matplotlib`, and `sklearn.preprocessing` modules for data transformation and visualization.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, scale

# For inline plotting in Jupyter Notebook
%matplotlib inline

## Load Dataset

We will use the `mtcars` dataset and focus on the `mpg` (miles per gallon) column to demonstrate data transformations.

In [None]:
# Load mtcars dataset
url = 'https://raw.githubusercontent.com/selva86/datasets/master/mtcars.csv'
dataset = pd.read_csv(url)
dataset.head()

## Visualize Original Data

Let's plot the `mpg` values against the row index to see the original values and their range before applying any transformations.

In [None]:
plt.figure(figsize=(10,5))
plt.plot(dataset['mpg'], marker='o')
plt.title('Original MPG Distribution')
plt.xlabel('Index')
plt.ylabel('MPG')
plt.show()

## Normalization (Min-Max Scaling)

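Normalization rescales each value linearly to the range 0 to 1 using `x' = (x - min) / (max - min)`. Because the mapping is linear, the shape of the distribution is unchanged; only the scale and offset differ. This is especially useful when features have different ranges. Since the result depends on the minimum and maximum, a single extreme value can compress the remaining points into a narrow band.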

In [None]:
# Initialize MinMaxScaler (default feature_range is (0, 1))
minmax_scaler = MinMaxScaler()

# Fit and transform the mpg column; double brackets keep it 2D, as the scaler expects
scaled_data = minmax_scaler.fit_transform(dataset[['mpg']])

# Plot normalized data
plt.figure(figsize=(10,5))
plt.plot(scaled_data, marker='o')
plt.title('Normalized MPG Distribution')
plt.xlabel('Index')
plt.ylabel('Normalized MPG')
plt.show()
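
To make the arithmetic behind min-max scaling explicit, here is a minimal sketch that applies the formula by hand in plain Python. The sample values and the helper name `min_max_scale` are hypothetical and chosen for illustration; they are not taken from the `mtcars` file or from scikit-learn.

```python
# Hypothetical sample values, not taken from the mtcars file
values = [10.0, 15.0, 20.0, 30.0]

def min_max_scale(xs):
    """Rescale xs linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        # A constant column has no spread; return zeros to avoid dividing by zero
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

normalized = min_max_scale(values)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```

The spacing between points is preserved up to a constant factor: 15 sits a quarter of the way between 10 and 30, and it maps to 0.25.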

## Standardization (Z-score Scaling)

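Standardization rescales data to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation. Like normalization, it is a linear transformation, so it does not change the shape of the distribution and does not make skewed data normal. It is useful for algorithms that are sensitive to feature scale, such as k-nearest neighbors, support vector machines, and models trained with gradient descent.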

In [None]:
# Standardize the mpg column
standardized_data = scale(dataset['mpg'])

# Plot standardized data
plt.figure(figsize=(10,5))
plt.plot(standardized_data, marker='o')
plt.title('Standardized MPG Distribution')
plt.xlabel('Index')
plt.ylabel('Standardized MPG')
plt.show()
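
The z-score computation can also be sketched by hand. The example below uses Python's standard `statistics` module with the population standard deviation (dividing by n, equivalent to `ddof=0`), which is the estimator that scikit-learn's `scale` documents. The sample values and the helper name `z_score` are hypothetical.

```python
import statistics

# Hypothetical sample values, not taken from the mtcars file
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def z_score(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population std, i.e. ddof=0
    if std == 0:
        # A constant column has no spread; return zeros to avoid dividing by zero
        return [0.0 for _ in xs]
    return [(x - mean) / std for x in xs]

standardized = z_score(values)
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Here the mean is 5 and the population standard deviation is 2, so each result reads directly as "how many standard deviations from the mean". Unlike min-max scaling, the output is not confined to a fixed range.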

## Summary

- **Normalization**: Scales values linearly to [0, 1], preserving the shape of the original distribution.
- **Standardization**: Rescales values to mean = 0 and std = 1, useful for algorithms sensitive to feature scale.
- Evaluate machine learning models on raw, normalized, and standardized data to choose the best preprocessing technique for your task.
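
When comparing these three versions on held-out data, a common practice is to learn the scaling statistics from the training rows only and reuse them on the test rows. The sketch below shows one way to prepare the variants with `MinMaxScaler` and `StandardScaler`; the tiny arrays and the hand-made split are hypothetical, and no model is trained here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data, split by hand into train and held-out rows
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])

candidates = {
    "normalized": MinMaxScaler(),
    "standardized": StandardScaler(),
}

# Keep the raw data alongside the scaled variants for comparison
scaled = {"raw": (X_train, X_test)}
for name, scaler in candidates.items():
    # Learn min/max or mean/std from the training rows only
    train_scaled = scaler.fit_transform(X_train)
    # Reuse the training statistics on the held-out rows
    test_scaled = scaler.transform(X_test)
    scaled[name] = (train_scaled, test_scaled)

for name, (train_part, test_part) in scaled.items():
    print(name, train_part.ravel(), test_part.ravel())
```

With the default settings, a held-out value above the training maximum (40 here) maps above 1 under min-max scaling, because the range was fixed by the training rows.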