## Data Preprocessing
Prepared by: Ejaz-ur-Rehman\
Date: 25-07-2025\
Email ID: ijazfinance@gmail.com

- Data preprocessing is an essential step in data analysis and machine learning. It involves cleaning, transforming, and organizing raw data into a format that can be used for modeling or analysis.
- Preprocessing is to data what prepping ingredients is to cooking. If the data is not cleaned, chopped, and measured right, the final result (our model or analysis) will suffer.

## Key Objectives of Data Preprocessing:
- Improve data quality
- Ensure consistency
- Handle missing or incorrect values
- Transform data into a format suitable for algorithms
- Increase model accuracy and performance

## Common Steps in Data Preprocessing:

| Step                       | Description                                                                 |
| -------------------------- | --------------------------------------------------------------------------- |
| **1. Data Cleaning**       | Remove or correct missing, incorrect, or duplicate data.                    |
| **2. Data Integration**    | Combine data from multiple sources into a single dataset.                   |
| **3. Data Transformation** | Convert data types, scale values, normalize or standardize features.        |
| **4. Feature Encoding**    | Convert categorical variables into numerical values (e.g., label encoding). |
| **5. Feature Selection**   | Select only relevant columns/features needed for modeling.                  |
| **6. Imputation**          | Fill in missing values using statistical or predictive techniques.          |
| **7. Splitting the Data**  | Divide into training and testing sets for model development.                |
| **8. Scaling**             | Rescale numerical features to bring them to the same range.                 |
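
The steps in the table can be chained together. The sketch below is a minimal illustration on a tiny made-up DataFrame, using pandas and scikit-learn. The column names `age`, `city`, and `bought` and all values are hypothetical, and the ordering shown is one reasonable choice rather than a prescribed pipeline.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one exact duplicate row and one missing age.
raw = pd.DataFrame({
    "age": [25, 32, None, 47, 32, 51],
    "city": ["Lahore", "Karachi", "Lahore", "Multan", "Karachi", "Multan"],
    "bought": [0, 1, 0, 1, 1, 1],
})

# Step 1 (cleaning): drop exact duplicate rows.
clean = raw.drop_duplicates().reset_index(drop=True)

# Step 4 (encoding): one-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["city"], dtype=int)

# Step 7 (splitting): separate features and target, then hold out a test set.
X = encoded.drop(columns="bought")
y = encoded["bought"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Step 6 (imputation): learn the median on the training set only.
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)

# Step 8 (scaling): fit on the training set, reuse the parameters on the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_imp)
X_test_scaled = scaler.transform(X_test_imp)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the imputer and the scaler on the training split only, and then applying them to the test split, keeps information from the test data out of the preprocessing parameters.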



## Why Preprocessing is Important:

- Raw data is rarely clean or ready for modeling.
- Many machine learning algorithms cannot handle missing or categorical data directly.
- Unprocessed data can lead to biased, inaccurate, or failed models.
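
As a small illustration of the second point, the sketch below passes a feature column containing a missing value to scikit-learn's `LinearRegression`, which rejects it, and then fits again after a simple mean imputation. The data values are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: a single feature with one missing value.
X_missing = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Fitting on data that contains NaN raises a ValueError.
try:
    LinearRegression().fit(X_missing, y)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print("Fit rejected:", err)

# Replace the missing entry with the column mean, then fit again.
col_mean = np.nanmean(X_missing, axis=0)
X_filled = np.where(np.isnan(X_missing), col_mean, X_missing)
model = LinearRegression().fit(X_filled, y)
print("Fitted coefficient:", model.coef_)
```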

## Data Transformation:

- Data transformation is the process of converting data from one format or structure into another, usually to make it more suitable for analysis or machine learning.
- It helps:
  - Improve model performance
  - Ensure consistency in the dataset
  - Enable algorithms to work properly on numerical inputs

## Why is Data Transformation Important:

- Many machine learning algorithms require numerical input, and many also perform better when features are on comparable scales. If the data is not transformed appropriately, it can:
  - Mislead the model
  - Introduce bias
  - Reduce accuracy

## Common Types of Data Transformation:

| Type                     | Description                                          | Example                           |
| ------------------------ | ---------------------------------------------------- | --------------------------------- |
| **Scaling**              | Resizes data to a specific range (like 0–1)          | `MinMaxScaler`                    |
| **Standardization**      | Centers and scales each feature (mean = 0, std = 1) | `StandardScaler`                  |
| **Normalization**        | Scales each row to unit norm (common in NLP)         | L2 normalization                  |
| **Encoding Categorical** | Converts categories to numbers                       | Label Encoding, One-Hot Encoding  |
| **Log Transformation**   | Reduces skewness in distributions                    | `log(x + 1)`                      |
| **Binning**              | Groups continuous values into intervals              | Age groups: 0–10, 10–20, etc.     |
| **Imputation**           | Fills in missing values                              | Mean, median, regression, etc.    |
| **Parsing Dates**        | Extracts parts like year, month from datetime fields | `df['year'] = df['date'].dt.year` |
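
Several of the transformations in the table take only a line or two of pandas. The sketch below applies a log transformation, binning, categorical encoding, and date parsing to a small made-up DataFrame. The column names, bin edges, and category order are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "income": [0, 1_000, 10_000, 100_000],
    "age": [5, 17, 34, 62],
    "grade": ["low", "high", "medium", "low"],
    "date": ["2025-01-15", "2025-03-02", "2024-11-30", "2023-07-25"],
})

# Log transformation: log1p computes log(x + 1), so zero incomes are safe.
df["income_log"] = np.log1p(df["income"])

# Binning: left-closed intervals [0, 18), [18, 40), [40, 100).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 18, 40, 100], labels=["0-17", "18-39", "40+"], right=False
)

# Label encoding with an explicit order (suitable for ordinal categories).
grade_order = {"low": 0, "medium": 1, "high": 2}
df["grade_code"] = df["grade"].map(grade_order)

# One-hot encoding (suitable when the categories have no natural order).
grade_dummies = pd.get_dummies(df["grade"], prefix="grade")

# Parsing dates: convert strings to datetime, then extract parts.
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month

print(df)
```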



## What is Scaling in Data Preprocessing:

- Scaling transforms the numerical features of a dataset to a common range or distribution. This keeps features with large magnitudes from dominating the others, which matters most for distance-based algorithms such as K-Nearest Neighbors and SVM, and for models trained with gradient descent, such as neural networks and iteratively fitted linear regression.
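
To see why scale matters for distance-based methods, the sketch below compares Euclidean distances between three made-up samples before and after min-max scaling. The samples and the feature ranges (`mins`, `maxs`) are assumptions picked for the illustration, not values from a real dataset.

```python
import numpy as np

# Hypothetical samples: [age in years, income in currency units].
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])  # very different age, similar income
c = np.array([26.0, 56_000.0])  # similar age, somewhat different income


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On raw values, income dominates: b looks closer to a than c does.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max scaling with assumed feature ranges.
mins = np.array([18.0, 20_000.0])
maxs = np.array([70.0, 120_000.0])


def min_max(x):
    return (x - mins) / (maxs - mins)


# After scaling, the large age gap between a and b is no longer hidden.
scaled_ab = euclidean(min_max(a), min_max(b))
scaled_ac = euclidean(min_max(a), min_max(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values the nearest neighbor of `a` is `b`, purely because income is measured in much larger units; after scaling, `c` becomes the nearest neighbor.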

## Scaling Techniques Comparison Table

| **Technique**          | **Library Function (sklearn)** | **Formula**                                                                                   | **Output Range**                      | **Best Used When**                                                                                           |
| ---------------------- | ------------------------------ | --------------------------------------------------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **Min-Max Scaling**    | `MinMaxScaler()`               | $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$                                               | 0 to 1 (or custom)                    | When features have **known bounds** and **no significant outliers**. Often used in neural networks.          |
| **Standard Scaling**   | `StandardScaler()`             | $x' = \frac{x - \mu}{\sigma}$                                                                 | Unbounded (typically about -3 to +3)  | When features are roughly **normally distributed**. Good for regression, logistic regression, SVM, etc.      |
| **Robust Scaling**     | `RobustScaler()`               | $x' = \frac{x - \text{median}}{\text{IQR}}$                                                   | Varies                                | When the dataset contains **outliers**. Uses median & IQR instead of mean & std.                             |
| **MaxAbs Scaling**     | `MaxAbsScaler()`               | $x' = \frac{x}{\max \lvert x \rvert}$                                                         | -1 to 1                               | For **sparse** data (e.g. text data). Keeps 0 entries unchanged.                                             |
| **Normalizer (L2/L1)** | `Normalizer(norm='l2')`        | $x' = \frac{x}{\lVert x \rVert_2}$                                                            | Unit norm (row-wise)                  | When scaling **individual samples**, not features. Used in **text classification**.                          |
| **Decimal Scaling**    | *Manual implementation*        | $x' = \frac{x}{10^j}$, where $j$ is the smallest integer such that $\max \lvert x' \rvert < 1$ | -1 to 1 (exclusive)                   | Basic scaling; rarely used in modern ML. Simple but not robust.                                              |
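
The techniques in the table can be compared side by side on the same data. The sketch below runs the scikit-learn scalers on a single made-up feature that contains one outlier, applies `Normalizer` to a separate two-row array, and includes a hand-written `decimal_scale` helper. That helper is a hypothetical illustration, since scikit-learn does not provide decimal scaling.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Hypothetical single feature with one large outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "min-max": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
    "max-abs": MaxAbsScaler(),
}

# Each scaler learns its parameters column-wise, then transforms the data.
scaled = {name: s.fit_transform(X).ravel() for name, s in scalers.items()}
for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")

# Normalizer works row-wise: each sample is divided by its own L2 norm.
rows = np.array([[3.0, 4.0], [1.0, 0.0]])
unit_rows = Normalizer(norm="l2").fit_transform(rows)
print("unit rows:", unit_rows)


def decimal_scale(x):
    """Divide by the smallest power of 10 that makes max(|x'|) < 1."""
    x = np.asarray(x, dtype=float)
    max_abs = np.max(np.abs(x))
    j = int(np.floor(np.log10(max_abs))) + 1 if max_abs > 0 else 0
    return x / 10.0 ** j


decimal = decimal_scale(X.ravel())
print("decimal:", decimal)
```

With an outlier present, min-max and max-abs scaling squeeze the four ordinary values into a narrow band near 0, while robust scaling keeps them spread out around the median.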



## Impact of different methods on the distributions

![sphx_glr_plot_map_data_to_normal_001.png](attachment:sphx_glr_plot_map_data_to_normal_001.png)