**Normalization rescales the numerical features in the dataset to a common range, so that features can be compared fairly and those with larger scales do not dominate the analysis.**

Without it, a feature can have undue influence on a machine learning model simply because of the units it is measured in.

Common normalization techniques include **Min-Max scaling, Z-score normalization, and Robust scaling**. Each has its advantages, and the choice depends on the characteristics of the data and the requirements of the analysis.
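
To make the difference between the three techniques concrete, here is a minimal sketch using only the Python standard library. The function names and the toy `ages` list are illustrative assumptions, not part of this project's code, and the handling of zero-spread input (returning zeros) is just one possible convention.

```python
from statistics import mean, median, pstdev, quantiles

def min_max_scale(values):
    """Map values linearly onto [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def z_score_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med, iqr = median(values), q3 - q1
    return [(v - med) / iqr if iqr else 0.0 for v in values]

ages = [20, 30, 40, 50, 200]  # toy data with one extreme value (assumption)
print("min-max:", min_max_scale(ages))
print("z-score:", z_score_scale(ages))
print("robust: ", robust_scale(ages))
```

In this toy example the single extreme value squeezes the other Min-Max results close to 0, while Robust scaling keeps them evenly spread around the median.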

> In this project I chose Min-Max scaling.

## 1- Import and read the data (with and without outliers)

In [2]:
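import sys

# Make the project's utility module importable from this notebook
sys.path.append('../../../scripts/utilities')
from helper_functions import *

# Make the project's preprocessing module importable as well
sys.path.append('../../../scripts/data_preprocessing')
from data_transformation import *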

__With outliers__

In [3]:
base_path = '../../../data/processed_data/'
df = read_files('df_filling_missing_values_with_median_encoded_handle_noisy.csv', base_path=base_path)[0]

## 2- Normalize features, excluding the labels and ID

In [4]:
# Feature columns: everything except the two labels and the SEQN identifier
X = [column for column in df.columns if column not in ['MCQ160L', 'MCQ220', 'SEQN']]
df_normalized = normalize_data(df, columns=X, method='minmax')
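
The implementation of `normalize_data` is not shown in this notebook. As a rough illustration of what a Min-Max helper of this kind might do, here is a sketch in pandas. The name `minmax_normalize`, the handling of constant columns, and the toy data are all assumptions made for the example and do not describe the project's actual function.

```python
import pandas as pd

def minmax_normalize(df, columns):
    """Return a copy of df with the given columns rescaled to [0, 1]."""
    out = df.copy()
    col_min = out[columns].min()
    col_range = out[columns].max() - col_min
    # A constant column has zero range; dividing by 1 instead maps it to 0
    col_range = col_range.replace(0, 1)
    out[columns] = (out[columns] - col_min) / col_range
    return out

toy = pd.DataFrame({
    "SEQN": [1, 2, 3],          # identifier, left untouched
    "RIDAGEYR": [0, 40, 80],    # toy ages, not the real data
    "MCQ220": [2.0, 1.0, 2.0],  # label, left untouched
})
features = [c for c in toy.columns if c not in ["MCQ160L", "MCQ220", "SEQN"]]
scaled = minmax_normalize(toy, features)
print(scaled)
```

Working on a copy keeps the input frame unchanged, which makes it easy to compare the raw and scaled values side by side.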

In [5]:
df_normalized[:5]

| Unnamed: 0 | SEQN | RIDSTATR | RIAGENDR | RIDAGEYR | RIDRETH1 | RIDRETH3 | RIDEXMON | DMQMILIZ | DMDBORN4 | DMDCITZN | ... | LBXBSE | LBDBSESI | LBXBMN | LBDBMNSI | URXVOL1 | URDFLOW1 | LBDB12 | LBDB12SI | MCQ160L | MCQ220 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 1.0 | 0.0 | 0.8625 | 0.75 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.154784 | 0.15534 | 0.153094 | 0.153097 | 0.157609 | 0.031577 | 0.018893 | 0.018891 | 2.0 | 2.0 |
| 1 | 73558 | 1.0 | 0.0 | 0.675 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.182886 | 0.183252 | 0.120369 | 0.120369 | 0.163043 | 0.062923 | 0.018258 | 0.018259 | 2.0 | 2.0 |
| 2 | 73559 | 1.0 | 0.0 | 0.9 | 0.5 | 0.4 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.190455 | 0.190534 | 0.147075 | 0.147072 | 0.119565 | 0.024885 | 0.026659 | 0.026657 | 2.0 | 1.0 |
| 3 | 73560 | 1.0 | 0.0 | 0.325 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.129072 | 0.129854 | 0.212902 | 0.212899 | 0.110507 | 0.022115 | 0.018519 | 0.018517 | 0.0 | 0.0 |
| 4 | 73561 | 1.0 | 1.0 | 0.9125 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.154784 | 0.15534 | 0.153094 | 0.153097 | 0.009058 | 0.004192 | 0.007729 | 0.007731 | 2.0 | 2.0 |
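
A preview of five rows cannot confirm that every scaled value lies in [0, 1]. A quick check along the following lines could do that for the whole frame. The helper name `columns_outside_unit_range`, its tolerance parameter, and the toy frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def columns_outside_unit_range(df, columns, tol=1e-9):
    """List the columns with any value below 0 or above 1 (beyond tol)."""
    lo = df[columns].min()
    hi = df[columns].max()
    bad = (lo < -tol) | (hi > 1 + tol)
    return bad[bad].index.tolist()

toy = pd.DataFrame({
    "a": [0.0, 0.5, 1.0],      # properly scaled
    "b": [0.2, 1.3, 0.9],      # contains a value above 1
    "label": [2.0, 1.0, 2.0],  # not checked, labels are not scaled
})
print(columns_outside_unit_range(toy, ["a", "b"]))
```

In this notebook the equivalent call would be `columns_outside_unit_range(df_normalized, X)`; an empty list would indicate that all feature columns are within the expected range.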


## 3- Save the normalized data

In [5]:
save_files([df_normalized], 'df_filling_missing_values_with_median_encoded_handle_noisy_normalized.csv', base_path=base_path)

__Without outliers__

In [14]:
df2 = read_files('df_filling_missing_values_with_median_encoded_handle_noisy_handle_outlier.csv', base_path=base_path)[0]

In [16]:
# Same feature selection for the outlier-handled data
X2 = [column for column in df2.columns if column not in ['MCQ160L', 'MCQ220', 'SEQN']]
df_normalized2 = normalize_data(df2, columns=X2, method='minmax')

In [17]:
df_normalized2[:5]

| Unnamed: 0 | SEQN | RIDSTATR | RIAGENDR | RIDAGEYR | RIDRETH1 | RIDRETH3 | RIDEXMON | DMQMILIZ | DMDBORN4 | DMDCITZN | ... | LBXTC | LBDTCSI | LBXTTG | WTSH2YR.y | LBDBCDLC | LBDTHGLC | URXVOL1 | URDFLOW1 | MCQ160L | MCQ220 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73557 | 1.0 | 0.0 | 0.8625 | 0.75 | 0.6 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.13172 | 0.132017 | 0.5 | 0.10582 | 0.0 | 0.0 | 0.157609 | 0.031577 | 2.0 | 2.0 |
| 1 | 73558 | 1.0 | 0.0 | 0.675 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.135753 | 0.136175 | 0.5 | 0.152503 | 0.0 | 0.0 | 0.163043 | 0.062923 | 2.0 | 2.0 |
| 2 | 73559 | 1.0 | 0.0 | 0.9 | 0.5 | 0.4 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.076613 | 0.076923 | 0.5 | 0.359484 | 0.0 | 0.0 | 0.119565 | 0.024885 | 2.0 | 1.0 |
| 3 | 73560 | 1.0 | 0.0 | 0.325 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.133065 | 0.133056 | 0.5 | 0.173127 | 1.0 | 0.0 | 0.110507 | 0.022115 | 0.0 | 0.0 |
| 4 | 73561 | 1.0 | 1.0 | 0.9125 | 0.5 | 0.4 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.177419 | 0.177755 | 0.5 | 0.10582 | 0.0 | 0.0 | 0.009058 | 0.004192 | 2.0 | 2.0 |


In [18]:
save_files([df_normalized2], 'df_filling_missing_values_with_median_encoded_handle_noisy_handle_outlier_normalized.csv', base_path=base_path)