# Data Preprocessing

## What is data preprocessing?

Data preprocessing refers to the set of operations and techniques performed on raw data to transform it into a suitable format for further analysis or machine learning tasks. It involves cleaning, transforming, and organizing the data to improve its quality, relevance, and compatibility with the chosen algorithms or models.

The specific steps and techniques involved in data preprocessing may vary depending on the characteristics of the dataset and the goals of the analysis. However, common data preprocessing tasks include:

1. **Data Cleaning**: This step involves handling missing data by either imputing or removing the missing values. It also includes identifying and dealing with outliers or noisy data points that can adversely affect the analysis or model training.

2. **Data Transformation**: Data transformation techniques are used to normalize or scale numerical features, ensuring they have a consistent range or distribution. Common transformation methods include standardization, min-max scaling, or logarithmic transformations.

3. **Data Encoding**: Categorical variables need to be encoded into numerical representations for most machine learning algorithms to process them. Techniques like one-hot encoding, label encoding, or ordinal encoding are used to convert categorical data into a format compatible with the algorithms.

4. **Feature Selection and Extraction**: Feature selection involves identifying and selecting the most relevant features from the dataset. It helps reduce dimensionality, improve model performance, and enhance interpretability. Feature extraction techniques, such as principal component analysis (PCA) or factor analysis, aim to create new features by combining or transforming existing ones.

5. **Handling Imbalanced Data**: In some datasets, the distribution of classes or target variables may be imbalanced, meaning one class is significantly more prevalent than others. Data preprocessing techniques such as undersampling, oversampling, or generating synthetic samples can address this issue and improve the model's ability to learn from the minority class.

6. **Data Integration**: When working with data from multiple sources, data integration is performed to combine and merge datasets into a unified representation. This ensures that all relevant information is incorporated into the analysis or model training.

7. **Data Splitting**: The dataset is typically divided into training, validation, and testing subsets. This splitting allows for model training on the training set, hyperparameter tuning and model selection using the validation set, and unbiased evaluation of the final model's performance using the testing set.

By performing data preprocessing tasks, the quality and suitability of the data for analysis or machine learning tasks are improved. It helps to handle missing or inconsistent data, normalize or scale features, encode categorical variables, reduce dimensionality, and enhance the overall effectiveness and accuracy of the subsequent analysis or modeling processes.
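As a rough illustration of three of the tasks listed above (cleaning, encoding, and splitting), here is a minimal sketch using pandas and scikit-learn. The toy dataset, its column names, and the choice of median and mode imputation are assumptions made for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0, 52.0, 29.0, np.nan, 41.0],
    "city": ["Paris", "Rome", "Paris", None, "Rome", "Oslo", "Oslo", "Paris"],
    "bought": [0, 1, 0, 1, 1, 0, 0, 1],
})

# Data cleaning: fill numeric gaps with the median and categorical gaps with the mode.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Data encoding: one-hot encode the categorical column.
X = pd.get_dummies(df[["age", "city"]], columns=["city"])
y = df["bought"]

# Data splitting: hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

In a real project the imputation statistics would normally be computed from the training split only, so that no information from the test rows leaks into training. The order used here simply follows the list above.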

## Why should we do data preprocessing in machine learning model development?

Preprocessing data before using it for machine learning is essential for several reasons:

1. **Handling Missing Data**: Real-world datasets often have missing values, which can cause issues during model training. Data preprocessing techniques help in handling missing data by imputing or removing the missing values appropriately. This ensures that the machine learning model can utilize the available data effectively.

2. **Noise Removal**: Datasets may contain outliers or noisy data points due to various factors such as measurement errors or data collection issues. Preprocessing techniques can identify and handle these outliers by either removing them or applying techniques like smoothing or interpolation. By removing or minimizing noise, data preprocessing improves the accuracy and reliability of the machine learning models.

3. **Handling Inconsistent Data**: Inconsistent data can arise due to human errors or different data sources. For example, categorical variables may have multiple representations for the same category (e.g., "Male," "M," or "1" for gender). Data preprocessing techniques standardize and resolve these inconsistencies, ensuring that the data is in a consistent format for machine learning algorithms.

4. **Normalization and Scaling**: Machine learning models often benefit from having input features with similar scales. Data preprocessing includes normalization and scaling techniques that transform numerical features to a common range (e.g., between 0 and 1) or standardize them (e.g., zero mean and unit variance). This process prevents features with larger scales from dominating the model's learning process and improves convergence.

5. **Encoding Categorical Variables**: Machine learning models typically require numerical inputs, but datasets often contain categorical variables. Data preprocessing includes techniques like one-hot encoding or label encoding to convert categorical variables into numerical representations that the models can understand.

6. **Feature Selection and Extraction**: Datasets may have a large number of features, some of which may be irrelevant, redundant, or noisy. Data preprocessing involves feature selection and extraction techniques that identify and retain the most informative and relevant features. This step reduces the dimensionality of the data, improves model interpretability, and can reduce the risk of overfitting.

7. **Data Splitting**: Data preprocessing includes splitting the dataset into training, validation, and testing sets. This ensures that the model's performance is evaluated on unseen data, which makes overfitting visible instead of hidden. The training set is used to train the model, the validation set is used for hyperparameter tuning, and the testing set provides an unbiased evaluation of the final model.

By performing data preprocessing before applying machine learning algorithms, we can enhance the quality and relevance of the data, reduce noise and inconsistencies, and optimize the data for the learning algorithms. This, in turn, improves the performance, accuracy, and generalization capabilities of the machine learning models.
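Two of the reasons above, inconsistent categories and noisy values, can be sketched with pandas. The records, the label mapping, and the 1.5 × IQR rule used to flag outliers are assumptions chosen for this example; other rules may suit a given dataset better.

```python
import pandas as pd

# Hypothetical records with inconsistent gender labels and one suspicious height.
df = pd.DataFrame({
    "gender": ["Male", "M", "1", "Female", "F", "male"],
    "height_cm": [178.0, 182.0, 175.0, 163.0, 168.0, 1750.0],
})

# Handling inconsistent data: map every known spelling to one canonical label.
# The mapping itself is an assumption about how the sources encoded gender.
canonical = {"male": "male", "m": "male", "1": "male",
             "female": "female", "f": "female"}
df["gender"] = df["gender"].str.strip().str.lower().map(canonical)

# Noise removal: flag values outside 1.5 * IQR from the quartiles.
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)

# Keep only the rows that were not flagged.
clean = df[~is_outlier]
print(clean)
```

Labels that are missing from the mapping become `NaN` after `.map`, which makes unexpected spellings easy to spot before they reach a model.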

### What is the difference between data normalization and standardization?

Normalization and standardization are two commonly used techniques in data preprocessing to transform numerical data into a specific range or distribution. Although they both aim to bring data into a comparable format, there are differences in how they achieve this goal:

Normalization:
Normalization, also known as min-max scaling, rescales the data to a specific range, typically between 0 and 1. The formula for normalization is as follows:

x_normalized = (x - min(x)) / (max(x) - min(x))

where x is an individual data point, min(x) is the minimum value of the feature, and max(x) is the maximum value of the feature. Because this is a linear rescaling, normalization preserves the shape of the original distribution while fitting it into the specified range. It is sensitive to outliers because it depends on the minimum and maximum values: a single extreme value stretches the range and compresses the remaining points into a narrow band.
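The min-max formula can be sketched in a few lines of NumPy. The function name `min_max_normalize` and the handling of constant features are assumptions made for this illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a 1-D array to the range [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant feature: the formula would divide by zero, so return zeros.
        return np.zeros_like(x)
    return (x - x.min()) / span

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(min_max_normalize(values))  # values: 0, 0.25, 0.5, 0.75, 1

# Replacing the largest value with an extreme one squeezes the rest toward 0.
with_outlier = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])
print(min_max_normalize(with_outlier))
```

In the second call the first four points all land below 0.05, which shows how strongly the result depends on the minimum and maximum.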

Standardization:
Standardization transforms the data to have a mean of 0 and a standard deviation of 1. It follows the formula:

x_standardized = (x - mean(x)) / std(x)

where x is an individual data point, mean(x) is the mean of the feature, and std(x) is the standard deviation of the feature. Standardization centers the data around zero and scales it by the standard deviation, without bounding the values to a fixed range. Like normalization, standardization is affected by outliers, because neither the mean nor the standard deviation is a robust statistic. The effect is often milder, since every data point contributes to both statistics rather than only the two extremes.
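A minimal NumPy sketch of the standardization formula follows. The function name `standardize` is hypothetical, and the population standard deviation (`ddof=0`, NumPy's default) is an assumed choice; pandas' `Series.std` uses `ddof=1` by default and gives slightly different values.

```python
import numpy as np

def standardize(x):
    """Transform a 1-D array to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std()  # population standard deviation (ddof=0)
    if std == 0:
        # Constant feature: avoid division by zero and return zeros.
        return np.zeros_like(x)
    return (x - x.mean()) / std

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = standardize(values)
print(z)
print(z.mean(), z.std())  # approximately 0 and 1
```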

Differences:

1. **Range**: Normalization scales the data to a specific range, often between 0 and 1, while standardization centers the data around 0 with a standard deviation of 1.

2. **Preservation of Distribution**: Both techniques are linear transformations, so neither changes the shape of the distribution. Normalization maps the values onto a fixed interval, while standardization shifts the mean to 0 and rescales the variance to 1. Standardization does not make the data normally distributed.

3. **Sensitivity to Outliers**: Normalization is sensitive to outliers because it depends on the minimum and maximum values. If outliers are present, they can significantly affect the scaling. Standardization is also affected, because the mean and standard deviation are not robust to extreme values, but it is often less affected since it does not depend solely on the two extreme points.

4. **Interpretability**: Normalized values can be read as a position within the observed range (0 is the minimum, 1 is the maximum). Standardized values are expressed as the number of standard deviations from the mean, which can make interpretation more challenging when the original units matter.

The choice between normalization and standardization depends on the specific requirements of the problem at hand and the characteristics of the dataset. Normalization is often suitable when the features must lie in a bounded range, such as 0 to 1. Standardization is often useful when features should be centered and have comparable variance, or when the data has no natural minimum and maximum.
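One way to compare the two techniques on the same data is to apply scikit-learn's `MinMaxScaler` and `StandardScaler` to a single feature. The feature values below, including the extreme value of 1000, are invented for this sketch.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (one column) with one extreme value.
X = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# Min-max scaling: the result is bounded to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: the result has zero mean and unit variance, but is unbounded.
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_std.ravel())
```

With this input, both scalers push the four ordinary points close together, which suggests checking for outliers before choosing either method.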

![normalization%20vs%20standardization.png](attachment:normalization%20vs%20standardization.png)

### Importing Required Libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from jcopml.pipeline import num_pipe, cat_pipe
from jcopml.utils import save_model, load_model
from jcopml.plot import plot_missing_value
from jcopml.feature_importance import mean_score_decrease
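The `train_test_split`, `Pipeline`, and `ColumnTransformer` imports above can be combined into a single preprocessing object. The sketch below is one possible arrangement under stated assumptions: the toy data and column names are invented, and plain scikit-learn classes are used in place of the `jcopml` helpers `num_pipe` and `cat_pipe`, whose signatures are not shown in this section.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0, 52.0, 29.0, 61.0, 41.0],
    "income": [30.0, 45.0, np.nan, 52.0, 80.0, 38.0, 75.0, 60.0],
    "city": ["Paris", "Rome", "Paris", np.nan, "Rome", "Oslo", "Oslo", "Paris"],
    "bought": [0, 1, 0, 1, 1, 0, 1, 1],
})
X = df.drop(columns="bought")
y = df["bought"]

# Split first so that preprocessing statistics come from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Numeric columns: impute with the mean, then scale to [0, 1].
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])

# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("numeric", numeric, ["age", "income"]),
    ("categoric", categorical, ["city"]),
])

# Fit on the training split, then reuse the fitted transformer on the test split.
X_train_prep = preprocessor.fit_transform(X_train)
X_test_prep = preprocessor.transform(X_test)
print(X_train_prep.shape, X_test_prep.shape)
```

Setting `handle_unknown="ignore"` means a city that appears only in the test split is encoded as all zeros instead of raising an error.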

