---
University Paris 1 Panthéon-Sorbonne 

Introduction to Machine Learning

Dr. Nourhène BEN RABAH

---

# Lecture 2: Data scaling, data normalization and data transformation 


---

In the last session, we covered the first stage of data preparation: data cleaning. In this session, you will learn about data normalization, data scaling and data transformation. In data preparation, these methods are important for putting the data into a format suitable for ML algorithms.

---

<div style="background-color:lightblue; padding:1px">
<strong>Note that "normalization" and "scaling" are often used interchangeably, although they have slightly different goals.</strong>
</div>

---

Let's explore each of these methods: 

#### 1) Data Scaling 

It is very likely that our dataset contains attributes measured on very different scales (i.e. the ranges of their values differ greatly). Feeding such data directly to an ML algorithm can cause problems, because attributes with large values dominate when attributes are compared or combined.

Scaling the data makes sure that the attributes are on the <span style="color:blue"> same scale (usually between 0 and 1 or -1 and 1).</span>

![Example Image](scaling.png)


Many ML algorithms are sensitive to the scale of the data, such as **KNN**, **K-means**, **linear regression**, **logistic regression** and **neural networks**.
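
To see why scale matters for a distance-based method such as KNN, here is a minimal sketch in plain Python. The four samples (age, income) are hypothetical values chosen for illustration, and `min_max_columns` and `nearest_to` are helper names introduced here, not library functions.

```python
import math

# Hypothetical samples: (age in years, income in euros)
samples = {
    "A": (25, 50_000),
    "B": (60, 50_100),
    "C": (26, 51_000),
    "D": (40, 90_000),
}

def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_columns(points):
    """Rescale every column (feature) of the points to the range [0, 1]."""
    columns = list(zip(*points.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        name: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for name, p in points.items()
    }

def nearest_to(name, points):
    """Name of the point closest to `name` (excluding itself)."""
    others = (k for k in points if k != name)
    return min(others, key=lambda k: euclidean(points[name], points[k]))

nearest_raw = nearest_to("A", samples)
nearest_scaled = nearest_to("A", min_max_columns(samples))
print(nearest_raw)     # B: income dominates, the 35-year age gap is ignored
print(nearest_scaled)  # C: both features now contribute
```

Before scaling, the income column decides the result almost alone because its values are much larger than the ages. After scaling, both features weigh in.
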

There are many scaling methods like [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [`MaxAbsScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html), [`RobustScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) and [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).


The most used is [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
**Min-max** is a scaling technique where values are rescaled between **0 and 1** or between **-1 and 1**. 


Given an original value $x_i$ in the dataset, the corresponding scaled value $x_{\text{scaled}_i}$  is calculated as follows:

<div style="background-color:lightblue; padding:10px">

$$
x_{\text{scaled}_i} = \frac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} \times (\text{max}_{\text{new}} - \text{min}_{\text{new}}) + \text{min}_{\text{new}}
$$
</div>
Where:

- $x_i$ is the original value.
- $\text{min}(x)$ and $\text{max}(x)$ are the minimum and maximum values of the original dataset, respectively.
- $\text{min}_{\text{new}}$ and $\text{max}_{\text{new}}$ are the minimum and maximum values of the desired range (e.g., 0 and 1 or -1 and 1).

Let's illustrate this with an **example**:

Suppose you have a dataset with one feature and the original values are [10, 20, 30, 50]. 

###### 1) Use the above formula to scale the values between 0 and 1. 

**a. Compute the minimum and maximum values for the feature**

- $\text{min}(x) = 10$
- $\text{max}(x) = 50$
- $\text{min}_{\text{new}} = 0$
- $\text{max}_{\text{new}} = 1$

**b. Scale each value using the Min-Max Scaling formula**
- For 10: $x_{\text{scaled}} = \frac{10 - 10}{50 - 10} \times (1 - 0) + 0 = \frac{0}{40} + 0 = 0$
- For 20: $x_{\text{scaled}} = \frac{20 - 10}{50 - 10} \times (1 - 0) + 0 = \frac{10}{40} + 0 = 0.25$
- For 30: $x_{\text{scaled}} = \frac{30 - 10}{50 - 10} \times (1 - 0) + 0 = \frac{20}{40} + 0 = 0.5$
- For 50: $x_{\text{scaled}} = \frac{50 - 10}{50 - 10} \times (1 - 0) + 0 = \frac{40}{40} + 0 = 1$

The scaled values are: 0, 0.25, 0.5 and 1

###### 2)  Let's scale the values between -1 and 1 
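
A minimal sketch of this step in plain Python, applying the Min-Max formula with $\text{min}_{\text{new}} = -1$ and $\text{max}_{\text{new}} = 1$. The function name `min_max_scale` and its parameters are introduced here for illustration only.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers to [new_min, new_max] with the Min-Max formula."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Assumption of this sketch: a constant feature cannot be rescaled
        raise ValueError("all values are identical")
    return [(x - old_min) / span * (new_max - new_min) + new_min for x in values]

values = [10, 20, 30, 50]
scaled = min_max_scale(values, new_min=-1.0, new_max=1.0)
print(scaled)  # [-1.0, -0.5, 0.0, 1.0]
```

With scikit-learn, the same range can be requested through the `feature_range` parameter, i.e. `MinMaxScaler(feature_range=(-1, 1))`.
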

###### 3) Now, let's use `MinMaxScaler` from the scikit-learn library

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Original dataset: a two-dimensional list (4 samples, 1 feature); scikit-learn converts it to a
# NumPy array (https://numpy.org/doc/stable/reference/generated/numpy.array.html)
data = [[10], [20], [30],  [50]]


# Initialize MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))  # Set the desired range

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)

##### Exercise
Now you can download the diabetes dataset (from the dataset folder) and apply Min-Max scaling if necessary. For more information on this dataset, please visit (https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

Now, we will use another scaling method: [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which removes the mean and scales each feature to unit variance.

This method consists of two steps:

a) **Centering the Data**:
   - For each feature (column) in the dataset, the mean of that feature is calculated.
   - Then, the mean is subtracted from each value in the feature column. This process centers the feature distribution around zero.

b) **Scaling to Unit Variance**:
   - After centering the data, the next step is to scale each feature so that it has a unit variance.
   - This is achieved by dividing each value in the feature column by the standard deviation of that feature.

Mathematically, the transformation applied to each value $x_i$ of a feature $x$ can be represented as follows:
<div style="background-color:lightblue; padding:1px">

$$
x_{\text{scaled}_i} = \frac{x_i - \text{mean}(x)}{\text{std}(x)}
$$
</div>

Where:
- $x_{\text{scaled}_i}$ is the scaled value of $x_i$.
- $\text{mean}(x)$ is the mean of the feature $x$.
- $\text{std}(x)$ is the standard deviation of the feature $x$.


Let's illustrate with the same example:

a. **Centering the Data**:
   - Mean of the dataset: $\text{mean}(x) = \frac{10 + 20 + 30 + 50}{4} = \frac{110}{4} = 27.5$
   - Centered data: subtract the mean from each value in the dataset.
   - Centered data: $[10 - 27.5, 20 - 27.5, 30 - 27.5, 50 - 27.5] = [-17.5, -7.5, 2.5, 22.5]$

b. **Scaling to Unit Variance**:
   - Standard deviation of the dataset: $\text{std}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \text{mean}(x))^2} = \sqrt{\frac{1}{4} \left( (-17.5)^2 + (-7.5)^2 + (2.5)^2 + (22.5)^2 \right)}$
   - Standard deviation of the dataset: $\text{std}(x) = \sqrt{\frac{1}{4} (306.25 + 56.25 + 6.25 + 506.25)} = \sqrt{\frac{875}{4}} = \sqrt{218.75} \approx 14.79$
   - Scaled data: divide each value in the centered data by the standard deviation.
   - Scaled data: $[-17.5 / 14.79, -7.5 / 14.79, 2.5 / 14.79, 22.5 / 14.79]$

Let's calculate the scaled values:

- For 10: $\frac{-17.5}{14.79} \approx -1.18$
- For 20: $\frac{-7.5}{14.79} \approx -0.51$
- For 30: $\frac{2.5}{14.79} \approx 0.17$
- For 50: $\frac{22.5}{14.79} \approx 1.52$

So, the scaled values using StandardScaler for this dataset are $[-1.18, -0.51, 0.17, 1.52]$.
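
The two steps can also be checked with a short sketch that uses only the Python standard library. It uses the population standard deviation (division by $N$), as in the formula above; the variable names are chosen here for illustration.

```python
import statistics

values = [10, 20, 30, 50]

# Step a: centering (subtract the mean)
mean = statistics.mean(values)
centered = [x - mean for x in values]

# Step b: scaling to unit variance (divide by the population standard deviation)
std = statistics.pstdev(values)
standardized = [c / std for c in centered]

print(mean)                                  # 27.5
print(round(std, 2))                         # 14.79
print([round(z, 2) for z in standardized])   # [-1.18, -0.51, 0.17, 1.52]
```

Note that `statistics.stdev` (division by $N - 1$) would give a different result. The formula used here divides by $N$.
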

###### Now, let's try it with the scikit-learn library

In [None]:
from sklearn.preprocessing import StandardScaler

# Original dataset
data = [[10], [20], [30],  [50]]

# Initialize StandardScaler
scaler = StandardScaler() 

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)

Let's rescale the diabetes dataset using `StandardScaler`.

In [6]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('diabetes.csv')



In [7]:
# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)


print(scaled_data[:10])

[[ 0.63994726  0.84832379  0.14964075  0.90726993 -0.69289057  0.20401277
   0.46849198  1.4259954   1.36589591]
 [-0.84488505 -1.12339636 -0.16054575  0.53090156 -0.69289057 -0.68442195
  -0.36506078 -0.19067191 -0.73212021]
 [ 1.23388019  1.94372388 -0.26394125 -1.28821221 -0.69289057 -1.10325546
   0.60439732 -0.10558415  1.36589591]
 [-0.84488505 -0.99820778 -0.16054575  0.15453319  0.12330164 -0.49404308
  -0.92076261 -1.04154944 -0.73212021]
 [-1.14185152  0.5040552  -1.50468724  0.90726993  0.76583594  1.4097456
   5.4849091  -0.0204964   1.36589591]
 [ 0.3429808  -0.15318486  0.25303625 -1.28821221 -0.69289057 -0.81134119
  -0.81807858 -0.27575966 -0.73212021]
 [-0.25095213 -1.34247638 -0.98770975  0.71908574  0.07120427 -0.12597727
  -0.676133   -0.61611067  1.36589591]
 [ 1.82781311 -0.184482   -3.57259724 -1.28821221 -0.69289057  0.41977549
  -1.02042653 -0.36084741 -0.73212021]
 [-0.54791859  2.38188392  0.04624525  1.53455054  4.02192191 -0.18943689
  -0.94794368  1.681258

In [4]:
# Return a NumPy representation of the DataFrame
array = data.values

# Initialize MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))  # Set the desired range

# Fit and transform the data
scaled_data = scaler.fit_transform(array)


print(scaled_data[:10])

# Convert the NumPy array back to a DataFrame, keeping the original column names
df = pd.DataFrame(data=scaled_data, columns=data.columns)
print(df.head(5))

[[0.35294118 0.74371859 0.59016393 0.35353535 0.         0.50074516
  0.23441503 0.48333333 1.        ]
 [0.05882353 0.42713568 0.54098361 0.29292929 0.         0.39642325
  0.11656704 0.16666667 0.        ]
 [0.47058824 0.91959799 0.52459016 0.         0.         0.34724292
  0.25362938 0.18333333 1.        ]
 [0.05882353 0.44723618 0.54098361 0.23232323 0.11111111 0.41877794
  0.03800171 0.         0.        ]
 [0.         0.68844221 0.32786885 0.35353535 0.19858156 0.64232489
  0.94363792 0.2        1.        ]
 [0.29411765 0.58291457 0.60655738 0.         0.         0.38152012
  0.05251921 0.15       0.        ]
 [0.17647059 0.3919598  0.40983607 0.32323232 0.10401891 0.46199702
  0.07258753 0.08333333 1.        ]
 [0.58823529 0.57788945 0.         0.         0.         0.52608048
  0.02391119 0.13333333 0.        ]
 [0.11764706 0.98994975 0.57377049 0.45454545 0.64184397 0.45454545
  0.03415884 0.53333333 1.        ]
 [0.47058824 0.6281407  0.78688525 0.         0.         0.
  0.

#### 2) Data Normalization 

In scaling, you change the range of each feature (column). In normalization, as implemented by scikit-learn's `Normalizer`, you rescale each sample (row) independently so that its norm equals 1.
![Example Image](normalization.png)

It is mainly useful for sparse datasets that contain many zeros.
In machine learning, there are two common normalization techniques: L1 normalization (also known as Manhattan normalization) and L2 normalization (also known as Euclidean normalization).

###### L1 Normalization (Least Absolute Deviations):
For L1 normalization, each component of the vector is divided by the L1 norm of the vector, which is the sum of the absolute values of its components:

- L1_norm = |v1| + |v2| + ... + |vn|

Then, we divide each component of the vector by the L1 norm:

<div style= "background-color:lightblue; padding:10px"> 
                                        
Normalized_value_i = value_i / L1_norm
</div>

Let's have an example: [10, 20, 30, 50]

a. We calculate the L1 norm of the vector, which is the sum of the absolute values of its components:
- L1_norm = |10| + |20| + |30| + |50| = 10 + 20 + 30 + 50 = 110

b. We divide each component of the vector by the L1 norm:
- Normalized_values_L1 = [10/110, 20/110, 30/110, 50/110] = [1/11, 2/11, 3/11, 5/11] ≈ [0.0909, 0.1818, 0.2727, 0.4545]
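
As a quick check of the L1 computation, here is a minimal sketch in plain Python. The function name `l1_normalize` is introduced here for illustration, and leaving an all-zero vector unchanged is an assumption of this sketch.

```python
def l1_normalize(vector):
    """Divide each component by the sum of the absolute values of the vector."""
    norm = sum(abs(v) for v in vector)
    if norm == 0:
        # Assumption of this sketch: an all-zero vector is returned unchanged
        return list(vector)
    return [v / norm for v in vector]

normalized_l1 = l1_normalize([10, 20, 30, 50])
print([round(v, 4) for v in normalized_l1])  # [0.0909, 0.1818, 0.2727, 0.4545]
```

After L1 normalization, the absolute values of the components add up to 1.
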
     
###### L2 Normalization (Least Squares):

For L2 normalization, each component of the vector is divided by the L2 norm of the vector, which is the square root of the sum of the squares of its components:

- L2_norm = sqrt(v1^2 + v2^2 + ... + vn^2)

Then, we divide each component of the vector by the L2 norm:
<div style= "background-color:lightblue; padding:10px"> 
Normalized_value_i = value_i / L2_norm
</div>

Let's have an example: [10, 20, 30, 50]

We calculate the L2 norm of the vector, which is the square root of the sum of the squares of its components, then divide each component by it:
- L2_norm = sqrt(10^2 + 20^2 + 30^2 + 50^2) = sqrt(100 + 400 + 900 + 2500) = sqrt(3900) ≈ 62.45
- Normalized_values_L2 = [10/62.45, 20/62.45, 30/62.45, 50/62.45] ≈ [0.1601, 0.3203, 0.4804, 0.8006]
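
The L2 computation can be checked in the same way with a minimal sketch in plain Python. The function name `l2_normalize` is introduced here for illustration, and leaving an all-zero vector unchanged is an assumption of this sketch.

```python
import math

def l2_normalize(vector):
    """Divide each component by the Euclidean length of the vector."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0:
        # Assumption of this sketch: an all-zero vector is returned unchanged
        return list(vector)
    return [v / norm for v in vector]

normalized_l2 = l2_normalize([10, 20, 30, 50])
print([round(v, 4) for v in normalized_l2])  # [0.1601, 0.3203, 0.4804, 0.8006]
```

After L2 normalization, the vector has length 1: the sum of the squares of its components equals 1.
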

Let's use the scikit-learn library

In [8]:
from sklearn.preprocessing import Normalizer
import numpy as np

# Sample data
data = np.array([[10, 20, 30, 50]])

# Initialize Normalizer for L1 normalization
normalizer_L1 = Normalizer(norm='l1') 
# Initialize Normalizer for L2 normalization
normalizer_L2 = Normalizer(norm='l2')  

# Apply L1 normalization
normalized_data_L1 = normalizer_L1.transform(data)
# Apply L2 normalization
normalized_data_L2 = normalizer_L2.transform(data)

print("Data:")
print(data)
print("\nL1 Normalized Data:")
print(normalized_data_L1)
print("\nL2 Normalized Data:")
print(normalized_data_L2)

Data:
[[10 20 30 50]]

L1 Normalized Data:
[[0.09090909 0.18181818 0.27272727 0.45454545]]

L2 Normalized Data:
[[0.16012815 0.32025631 0.48038446 0.80064077]]


##### Let's normalize the diabetes dataset using the L1 and the L2 normalization

In [10]:
from sklearn.preprocessing import Normalizer
import numpy as np
import pandas as pd

data = pd.read_csv('diabetes.csv')

# Initialize Normalizer for L1 normalization
normalizer_L1 = Normalizer(norm='l1') 
# Initialize Normalizer for L2 normalization
normalizer_L2 = Normalizer(norm='l2')  

# Apply L1 normalization
normalized_data_L1 = normalizer_L1.transform(data)
# Apply L2 normalization
normalized_data_L2 = normalizer_L2.transform(data)

print("Data:")
print(data)
print("\nL1 Normalized Data:")
print(normalized_data_L1)
print("\nL2 Normalized Data:")
print(normalized_data_L2)

Data:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1            



#### 3) Data transformation 

##### 3.1 Label Encoding 

**Label encoding** is a technique used to transform categories into numerical values, which can be useful for some machine learning algorithms that require numerical inputs.
For example, suppose you have a categorical variable "Color" with the following values:

| color   |
|------------|
| Red        |
| Blue       |
| Green      |
| Red        |
| Yellow     |

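Label encoding maps each category to an integer. scikit-learn's `LabelEncoder` assigns the integers in alphabetical order of the categories, so these values are mapped to:

| Color      | Encoded |
|------------|---------|
| Red        | 2       |
| Blue       | 0       |
| Green      | 1       |
| Red        | 2       |
| Yellow     | 3       |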


In [None]:
from sklearn.preprocessing import LabelEncoder

# Example data
colors = ['Red', 'Blue', 'Green','Red', 'Yellow']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform data
encoded_colors = label_encoder.fit_transform(colors)

# Print encoded data
print("Encoded Colors:", encoded_colors)

However, it's **important** to note that Label Encoding can introduce an artificial order among the categories, which may be inappropriate, especially if the categories are not ordered. Therefore, it's often preferred to use Label Encoding only for **ordered categorical variables** or to encode the target variable.
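
For an ordered categorical variable, the integers should follow the natural order of the categories, which is not necessarily the alphabetical order. A minimal sketch with an explicit mapping, where the "Size" variable and its values are hypothetical and introduced only for illustration:

```python
# Hypothetical ordered variable: clothing sizes, from smallest to largest
size_order = ["S", "M", "L", "XL"]
sizes = ["M", "S", "L", "M", "XL"]

# Build the mapping from the declared order: S -> 0, M -> 1, L -> 2, XL -> 3
size_to_code = {label: code for code, label in enumerate(size_order)}

encoded_sizes = [size_to_code[s] for s in sizes]
print(encoded_sizes)  # [1, 0, 2, 1, 3]

# An alphabetical ordering would not respect the size order
print(sorted(size_order))  # ['L', 'M', 'S', 'XL']
```

Declaring the order explicitly keeps the encoded values meaningful: a larger integer always means a larger size.
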

##### 3.2 One-Hot Encoding:

**One-Hot encoding** is a technique used to encode each category as a **binary vector**, so that no order relationship is introduced between the categories.
For example, suppose you have the same categorical variable "Color" with the following values:


| color   |
|------------|
| Red        |
| Blue       |
| Green      |
| Red        |
| Yellow     |

After one-hot encoding, we will have: 

| Color_Blue | Color_Green | Color_Red | Color_Yellow |
|------------|-------------|-----------|--------------|
| 0.0        | 0.0         | 1.0       | 0.0          |
| 1.0        | 0.0         | 0.0       | 0.0          |
| 0.0        | 1.0         | 0.0       | 0.0          |
| 0.0        | 0.0         | 1.0       | 0.0          |
| 0.0        | 0.0         | 0.0       | 1.0          |


In [None]:
import pandas as pd
# Example data
colors = ['Red', 'Blue', 'Green', 'Red', 'Yellow']

# Convert data to DataFrame
df = pd.DataFrame({'Color': colors})

# Perform One-Hot Encoding (prefix and dtype are set to match the table above)
one_hot_encoded = pd.get_dummies(df['Color'], prefix='Color', dtype=float)

# Print One-Hot Encoded data
print("One-Hot Encoded Data:")
print(one_hot_encoded)
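
To make the mechanism explicit, the same encoding can be sketched by hand in plain Python. This is an illustrative sketch, not how pandas implements it; the category order (alphabetical) is an assumption chosen to match the table above.

```python
colors = ["Red", "Blue", "Green", "Red", "Yellow"]

# One column per distinct category, in alphabetical order
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red', 'Yellow']

# Each row gets a 1 in the column of its category and 0 elsewhere
one_hot = [[1 if color == category else 0 for category in categories]
           for color in colors]

for color, row in zip(colors, one_hot):
    print(color, row)
```

Each row contains exactly one 1, so no category is "larger" than another.
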

##### Exercise: Let's encode the class column of the iris dataset

In [12]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')


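# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the class column ('species')
encoded_class = label_encoder.fit_transform(iris['species'])

# Show the classes and the first encoded values
print(label_encoder.classes_)
print(encoded_class[:5])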

<div style="background-color:lightblue; padding:1px">

### A summary 




When to use scaling and when to use normalization depends on the specific requirements of your data and the machine learning algorithm you're using.

- **Scaling**: Scaling is typically used when the features in your dataset have different ranges and you want to bring them to a similar scale. This is often important for algorithms that are sensitive to the scale of the features, such as support vector machines (SVM), k-nearest neighbors (KNN), logistic regression and K-means.

- **Normalization**: Normalization rescales each sample (row) so that its L1 or L2 norm equals 1. It is useful when the relative proportions of the values within a sample matter more than their magnitude, for example in sparse datasets with many zeros. This can be important for algorithms that use distance or similarity measures, such as KNN or K-means clustering.

- **Label Encoding** is used to transform ordered categorical variables into numerical values, while **One-Hot Encoding** is used to transform unordered categorical variables into binary variables for use in machine learning models.
    
</div> 