# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python04 - Data Standardization</span>

**Prof. Robin Van Oirbeek**  

<br/>

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)

---

## **What is Data Standardization?**

**Data standardization** is a crucial preprocessing step in data mining and machine learning. It transforms your data so that all features share a common scale, typically a mean of 0 and a standard deviation of 1. This prevents a variable from dominating the analysis or model training merely because it is measured in larger units.

---

## **Why Standardize Data?**

### 1. **Improves Model Performance**
Some machine learning algorithms, such as **k-nearest neighbors**, **support vector machines**, and **gradient descent-based models**, are sensitive to the scale of input data. Standardization ensures that features with large magnitudes do not dominate features with smaller magnitudes.
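
As a rough illustration of this dominance effect, the sketch below computes the Euclidean distance between the first two rows of the breast cancer dataset used later in this notebook, restricted to `mean area` and `mean smoothness`. The feature values and standard deviations are copied from the `head()` and `describe()` outputs shown below; the variable names are illustrative only.

```python
import math

# First two samples, restricted to (mean area, mean smoothness).
sample_a = (1001.0, 0.1184)
sample_b = (1326.0, 0.08474)

# Per-feature standard deviations as reported by dataset_df.describe().
stds = (351.914129, 0.014064)

raw_distance = math.dist(sample_a, sample_b)

# Dividing by the standard deviation puts both features on a comparable scale.
# Subtracting the mean is omitted here because it cancels out in a difference.
scaled_a = [value / std for value, std in zip(sample_a, stds)]
scaled_b = [value / std for value, std in zip(sample_b, stds)]
scaled_distance = math.dist(scaled_a, scaled_b)

# Share of the squared distance that comes from "mean area".
raw_area_share = (sample_a[0] - sample_b[0]) ** 2 / raw_distance ** 2
scaled_area_share = (scaled_a[0] - scaled_b[0]) ** 2 / scaled_distance ** 2

print(f"raw distance:    {raw_distance:.4f} (area share {raw_area_share:.6f})")
print(f"scaled distance: {scaled_distance:.4f} (area share {scaled_area_share:.6f})")
```

On the raw values, nearly all of the distance comes from `mean area`; after scaling, `mean smoothness` also has a visible influence on the result.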

### 2. **Enhances Interpretability**
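When features share a common scale, model coefficients become directly comparable: in a linear model fitted on standardized data, a larger absolute coefficient indicates a stronger association with the target rather than a mere difference in units.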

### 3. **Speeds Up Convergence**
In optimization-based models (e.g., logistic regression, neural networks), standardizing the data can lead to faster convergence during training.

---


## Analysing data

In [3]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()


In [4]:
print(dataset["DESCR"])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [5]:
# Create the DataFrame, using the dataset's feature names as column labels
dataset_df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

dataset_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [6]:
dataset_df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Creating and Customizing a Boxplot**

#### **Objective**
Learn how to create and customize a **boxplot** using Matplotlib to visualize the distribution of data across multiple features.

---

#### **Instructions**
1. Import the necessary libraries:  
   - `numpy` for generating random data.
   - `pandas` for creating a dataset.
   - `matplotlib.pyplot` for plotting.

2. Create a Pandas DataFrame `dataset_df` with at least **four columns** and **random numeric data**.  
   *(Hint: Use `np.random.rand()` to generate random data.)*

3. Generate a **basic boxplot** of the DataFrame using `dataset_df.boxplot()` and **basic histograms** using `dataset_df.hist()`.


4. Display the final customized boxplot.

</div>
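
One possible way to approach these steps is sketched below. It is not the reference solution: the DataFrame name `demo_df`, the column names, the per-column scale factors, and the use of `np.random.default_rng` instead of `np.random.rand()` are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: four random columns, deliberately put on different scales.
rng = np.random.default_rng(seed=0)
demo_df = pd.DataFrame(
    rng.random((100, 4)) * [1, 10, 100, 1000],
    columns=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# Boxplot: one box per column, drawn on a shared y-axis.
fig, ax = plt.subplots(figsize=(8, 4))
demo_df.boxplot(ax=ax, rot=45)
ax.set_title("Boxplot of the features")
ax.set_ylabel("Value")
fig.tight_layout()

# Histograms: one subplot per column, each with its own axes.
hist_axes = demo_df.hist(bins=20, figsize=(8, 6))
plt.tight_layout()

# In a notebook the figures render automatically; in a script, call plt.show().
```

Because the boxplot shares a single y-axis, the column with the largest values compresses the others, which is the visual motivation for rescaling.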

**Expected output**
<img src="boxplot_features.png" />

**Expected output**
<img src="histogram.png" />

## **Standardization with StandardScaler**

---

### **What is Standardization?**
Standardization is a preprocessing technique that transforms features to have a **mean of 0** and a **standard deviation of 1**. This ensures that all features contribute equally to the analysis, avoiding dominance by features with larger magnitudes.

Mathematically, for each feature $X_j$, the standardized value $Z_j$ is calculated as:

$$
Z_j = \frac{X_j - \mu_j}{\sigma_j}
$$

Where:
- $Z_j$: The standardized value of the feature $X_j$.
- $X_j$: The original feature value.
- $\mu_j$: The **mean** of the feature $X_j$ (calculated on the training data).
- $\sigma_j$: The **standard deviation** of the feature $X_j$ (calculated on the training data).
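
The formula can be checked by hand with NumPy. The sketch below uses a small made-up matrix `X` (an assumption for illustration, not part of the course dataset) and applies the formula column by column.

```python
import numpy as np

# Toy feature matrix: 5 samples, 2 features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
    [5.0, 900.0],
])

mu = X.mean(axis=0)     # per-feature mean
sigma = X.std(axis=0)   # per-feature standard deviation (population, ddof=0)

Z = (X - mu) / sigma    # broadcasting applies the formula to each column

print(Z.mean(axis=0))   # close to 0 for every feature
print(Z.std(axis=0))    # close to 1 for every feature
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses, whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The two differ slightly on small datasets.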

---

### **Why Standardize Data?**
1. **Improves Model Performance**:
   - Models like **SVM**, **Logistic Regression**, and **KNN** are sensitive to the scale of features.
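2. **Supports Variance-Based Techniques**:
   - Techniques like **PCA** look for directions of maximal variance, so features measured on larger scales dominate the components unless the data is standardized. Standardization only shifts and rescales each feature; it does not make the data normally distributed, so it does not by itself satisfy the Gaussian assumption of methods such as **LDA**.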
3. **Balances Feature Contribution**:
   - Prevents features with large magnitudes from dominating those with smaller ones.

---

### **StandardScaler in Practice**

The `StandardScaler` class from `sklearn.preprocessing` is used to perform standardization in Python. It works by:
1. **Fitting** on the training data to compute the mean ($\mu_j$) and standard deviation ($\sigma_j$) for each feature.
2. **Transforming** the dataset by centering (subtracting the mean) and scaling (dividing by the standard deviation).
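
The two steps above can be sketched as follows. The data is randomly generated and the 80/20 split is an arbitrary assumption, chosen only to show that the statistics learned on the training rows are reused for the held-out rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 3 features with different locations and scales.
rng = np.random.default_rng(seed=42)
X = rng.normal(loc=[10.0, 200.0, 0.05], scale=[2.0, 50.0, 0.01], size=(100, 3))
X_train, X_test = X[:80], X[80:]

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean_ and scale_ from the training rows only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)     # reuses the training statistics

print(scaler.mean_)    # per-feature means learned during fit
print(scaler.scale_)   # per-feature standard deviations learned during fit
```

The standardized training data has mean 0 and standard deviation 1 per feature; the test data generally does not hit these values exactly, because it is transformed with the training statistics.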



<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Standardize and Visualize Data with `StandardScaler`**

#### **Objective**
Learn how to use **`StandardScaler`** from `sklearn.preprocessing` to standardize your dataset and visualize the standardized data using a boxplot.

---

#### **Instructions**
1. **Fit the Scaler**:
   - Use `StandardScaler` to compute the mean and standard deviation for the features in the dataset.
   
2. **Transform the Data**:
   - Apply the transformation to standardize the dataset.  

3. **Convert to DataFrame**:
   - Convert the standardized data back into a Pandas DataFrame with the same column names as the original dataset.

4. **Plot the Boxplot**:
   - Create a **boxplot** for the standardized data to compare the feature distributions after standardization.

---

#### **Hints**
- Use `scaler.fit()` to compute the required statistics (mean and standard deviation).
- Use `scaler.transform()` to standardize the dataset.
- Use `pd.DataFrame()` to create a new DataFrame from the transformed data.
- Use `plt.boxplot()` or Pandas' `boxplot()` method for visualization.


</div>

**Expected output**
<img src="standardizer.png"/>

## **Using MinMaxScaler**

---
`MinMaxScaler` is a preprocessing class from `sklearn.preprocessing` that **normalizes features** by scaling them to a specific range, typically [0, 1]. Unlike `StandardScaler`, which centers each feature at a mean of 0 and scales it to unit variance, `MinMaxScaler` linearly maps each feature's minimum and maximum onto the bounds of that range.

---

### **Formula**
The transformation performed by `MinMaxScaler` is given by:

$$
X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

Where:
- $X$: Original feature value.
- $X_{\text{min}}$: Minimum value of the feature in the dataset.
- $X_{\text{max}}$: Maximum value of the feature in the dataset.
- $X'$: Scaled value between 0 and 1.

This transformation scales each feature independently.
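
As a quick check of the formula, the sketch below applies it by hand to a small made-up matrix (an assumption for illustration) and compares the result with `MinMaxScaler`. It also shows the `feature_range` parameter, which maps the [0, 1] result linearly onto another interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 3 samples, 2 features.
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [4.0, 1000.0],
])

# Manual min-max scaling, one minimum and maximum per column.
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_manual = (X - x_min) / (x_max - x_min)

# Same transformation with scikit-learn (default feature_range=(0, 1)).
X_sklearn = MinMaxScaler().fit_transform(X)

# Scaling to [-1, 1] is the [0, 1] result stretched and shifted: 2 * X' - 1.
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
print(np.allclose(X_pm1, 2 * X_manual - 1))
```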

---

### **Why Use MinMaxScaler?**

1. **Normalization for Specific Ranges**:
   - MinMaxScaler is particularly useful when you want all features to lie within a specific range (e.g., [0, 1] or [-1, 1]).

2. **Avoiding Feature Dominance**:
   - Rescales features to ensure no single feature dominates others during computation (e.g., in distance-based algorithms like KNN or clustering).

3. **Maintains Distribution Shape**:
   - Like `StandardScaler`, `MinMaxScaler` applies a linear transformation to each feature, so the shape of the original feature distribution is preserved; only its location and spread change.

4. **Best for Bounded Models**:
   - Particularly beneficial for machine learning algorithms sensitive to absolute scales, such as **neural networks** or when features need to fit into a specific activation function range.


<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise - Normalize Data Using MinMaxScaler**

#### **Objective**
Learn how to use Scikit-Learn's `MinMaxScaler` to normalize data to a specified range, such as **(-1, 1)**. This exercise will teach you how to scale feature values in a DataFrame.

---

#### **Instructions**

1. **Create or Load a Dataset**:
   - Create a Pandas DataFrame named `dataset_df` with at least **three columns** and **random numerical data**. *(Hint: Use `np.random.rand()` to generate random numbers.)*
   - Alternatively, use a real dataset if available.

2. **Initialize MinMaxScaler**:
   - Initialize the `MinMaxScaler` with the following parameters:
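     - `copy=False`: Requests that the scaling be done **in place** to avoid creating a copy. The original values may be overwritten, so keep a separate copy of the data if you need it afterward.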
     - `feature_range=(-1, 1)`: Scales all features into the range [-1, 1].

3. **Fit and Transform the Dataset**:
   - Apply `fit_transform()` to normalize the data.

4. **Verify Results**:
   - Print the **original data** and the **normalized data**.
   - Verify that all feature values now lie within the range [-1, 1].

