# **Machine Learning (Data Preprocessing Tools)**

| **Aspect**                | **Description**                                                                                                                                                       |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Definition**            | Data preprocessing is the process of transforming raw data into a clean and usable format to enhance the performance of machine learning models.                       |
| **Importance**            | Ensures data quality, improves model accuracy, reduces computational complexity, and helps in handling missing or inconsistent data.                                    |
| **Steps**                 | Key steps involved in data preprocessing include:                                                                                                                      |
| **Data Cleaning**         | Removing or fixing missing, duplicate, or incorrect data.                                                                                                              |
| **Data Integration**      | Combining data from multiple sources into a coherent dataset.                                                                                                           |
| **Data Transformation**   | Scaling, normalizing, or converting data into appropriate formats for analysis.                                                                                         |
| **Data Reduction**        | Reducing data volume while maintaining its integrity, often through techniques like dimensionality reduction or feature selection.                                       |
| **Data Encoding**         | Converting categorical data into numerical format using methods like one-hot encoding or label encoding.                                                                |
| **Data Normalization**    | Scaling data to a standard range, such as 0-1, to ensure uniformity in analysis.                                                                                        |
| **Feature Extraction**    | Creating new features from existing data to enhance model performance.                                                                                                 |
| **Outlier Detection**     | Identifying and handling data points that deviate significantly from the rest of the dataset.                                                                           |
| **Imputation**            | Filling in missing data with appropriate values, such as mean, median, mode, or using algorithms.                                                                       |
| **Splitting Data**        | Dividing the dataset into training, validation, and test sets to evaluate model performance accurately.                                                                 |

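Two of the steps in the table, data cleaning and outlier detection, can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the `df` records below are hypothetical (not the tutorial's `sale.csv`), and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical records: one exact duplicate row and one implausible salary.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 38, 35],
    "Salary": [72000, 48000, 54000, 61000, 61000, 900000],
})

# Data cleaning: drop exact duplicate rows.
deduplicated = df.drop_duplicates().reset_index(drop=True)

# Outlier detection: flag salaries more than 1.5 * IQR outside the quartiles.
q1 = deduplicated["Salary"].quantile(0.25)
q3 = deduplicated["Salary"].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduplicated["Salary"] < q1 - 1.5 * iqr) | (
    deduplicated["Salary"] > q3 + 1.5 * iqr
)

print(deduplicated[is_outlier])
```

Whether a flagged row should be removed, capped, or kept depends on the dataset; the sketch only identifies it.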

## Importing the libraries

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [11]:
dataset = pd.read_csv('./data/sale.csv')
dataset


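|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |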


In [12]:
X = dataset.iloc[:, :-1].values  # select every column except the last one (the features)
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [13]:
y = dataset.iloc[:, -1]  # select only the last column (the target)
y

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

## Taking care of missing data

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tool**                 | SimpleImputer                                                                                                                                                           |
| **Library**              | Scikit-learn                                                                                                                                                            |
| **Definition**           | A transformer from Scikit-learn used for imputing missing values in a dataset by applying a specified strategy for each feature (column).                                |
| **Primary Use**          | Handling missing data by replacing it with a specified value or a statistical measure such as mean, median, or most frequent value.                                      |
| **Imputation Strategies**| - **Mean**: Replaces missing values using the mean along each column.                                                                                                   |
|                          | - **Median**: Replaces missing values using the median along each column.                                                                                               |
|                          | - **Most Frequent**: Replaces missing values using the most frequent value along each column.                                                                            |
|                          | - **Constant**: Replaces missing values with a constant value defined by the user.                                                                                       |
| **Parameters**           | - `missing_values`: The placeholder for missing values (e.g., `np.nan`).                                                                                                |
|                          | - `strategy`: The imputation strategy to use (`mean`, `median`, `most_frequent`, `constant`).                                                                            |
|                          | - `fill_value`: When strategy is `constant`, this is the value used to replace missing values.                                                                            |
| **Attributes**           | - `statistics_`: The statistics computed during fitting that are used for imputation.                                                                                    |
| **Methods**              | - `fit(X)`: Computes the statistics for imputation from the dataset `X`.                                                                                                 |
|                          | - `transform(X)`: Imputes the missing values in the dataset `X` using the computed statistics.                                                                           |
|                          | - `fit_transform(X)`: Fits to the data `X` and then transforms it.                                                                                                       |

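The strategies and the `statistics_` attribute described in the table can be compared side by side. This is a minimal sketch, assuming a small hypothetical array `data` with one gap per column; it is not the tutorial's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing value.
data = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [np.nan, 54000.0],
    [38.0, np.nan],
])

# Fit one imputer per strategy and record the per-column statistic it learned.
results = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(data)
    results[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)

# The 'constant' strategy ignores the data and writes `fill_value` into every gap.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled = constant_imputer.fit_transform(data)
print(filled)
```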
In [18]:
from sklearn.impute import SimpleImputer
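imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # fill gaps with the column mean
imputer.fit(X[:, 1:3])  # learn the means of the Age and Salary columns
X[:, 1:3] = imputer.transform(X[:, 1:3])  # replace each NaN with the learned mean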
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

## Encoding categorical data

### Encoding the Independent Variable

## ColumnTransformer

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tool**                 | ColumnTransformer                                                                                                                                                       |
| **Library**              | Scikit-learn                                                                                                                                                            |
| **Definition**           | A transformer that applies different preprocessing steps to different subsets of features in a dataset.                                                                 |
| **Primary Use**          | Facilitates the application of distinct data transformations to different columns within a single pipeline.                                                             |
| **Parameters**           | - `transformers`: List of (name, transformer, columns) tuples specifying the transformers to be applied and the columns they apply to.                                    |
|                          | - `remainder`: Specifies what to do with remaining columns that are not explicitly specified in the `transformers` list. Options include 'drop', 'passthrough', or a transformer. |
|                          | - `n_jobs`: Number of jobs to run in parallel.                                                                                                                           |
| **Attributes**           | - `transformers_`: The collection of fitted transformers as tuples of (name, fitted_transformer, column).                                                                |
|                          | - `named_transformers_`: Access fitted transformer by name.                                                                                                              |
|                          | - `remainder_`: Transformer used for remaining columns.                                                                                                                  |
| **Methods**              | - `fit(X)`: Fits all transformers using the data `X`.                                                                                                                    |
|                          | - `transform(X)`: Applies transformations to the data `X` and concatenates the results.                                                                                   |
|                          | - `fit_transform(X)`: Fits all transformers and applies them to the data `X`.                                                                                            |
|                          | - `get_feature_names_out()`: Returns the names of the output features after transformation. |

**Example Usage**

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

data = pd.DataFrame({
    'numerical_feature': [0.5, 0.3, 0.9],
    'categorical_feature': ['A', 'B', 'A']
})

# Standardize the numeric column and one-hot encode the categorical column
transformer = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['numerical_feature']),
        ('cat', OneHotEncoder(), ['categorical_feature'])
    ])

transformed_data = transformer.fit_transform(data)
```


## OneHotEncoder

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tool**                 | OneHotEncoder                                                                                                                                                           |
| **Library**              | Scikit-learn                                                                                                                                                            |
| **Definition**           | A transformer that converts categorical features into a one-hot numeric array, where each unique category is represented by a binary column.                             |
| **Primary Use**          | Encoding categorical variables as a one-hot numeric array for use in machine learning algorithms.                                                                        |
| **Parameters**           | - `categories`: Specifies the categories for each feature. `'auto'` means the categories are determined from the data. |
|                          | - `drop`: Specifies a methodology to drop one of the categories per feature to avoid collinearity. |
|                          | - `sparse_output`: If True (the default), returns a sparse matrix. If False, returns a dense array. This parameter was named `sparse` before scikit-learn 1.2. |
|                          | - `dtype`: The desired data-type for the output array. |
|                          | - `handle_unknown`: Specifies the behavior when an unknown category is encountered during `transform`. Options include `'error'` (default) and `'ignore'`. |
| **Attributes**           | - `categories_`: The categories identified for each feature.                                                                                                            |
|                          | - `drop_idx_`: The indices of the dropped categories if `drop` is specified.                                                                                             |
| **Methods**              | - `fit(X)`: Fits the encoder to the data `X`.                                                                                                                            |
|                          | - `transform(X)`: Transforms the data `X` using the fitted encoder.                                                                                                      |
|                          | - `fit_transform(X)`: Fits the encoder and transforms the data `X`.                                                                                                       |
|                          | - `inverse_transform(X)`: Converts the encoded data back to the original categories.                                                                                      |
|                          | - `get_feature_names_out()`: Returns feature names for the output array. |

**Example Usage**

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

data = np.array([['A'], ['B'], ['A'], ['C']])
# sparse_output=False returns a dense array (the argument was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)
print(encoded_data)
```

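The `drop` and `handle_unknown` parameters are easier to see with a concrete case. The following is a sketch under stated assumptions: the `train` array is a hypothetical stand-in for a country column, and `.toarray()` is used so that the code does not depend on the name of the sparse-output parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column categorical data.
train = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# drop="first" removes the first category in sorted order ("France"),
# leaving one column each for "Germany" and "Spain".
drop_encoder = OneHotEncoder(drop="first")
dropped = drop_encoder.fit_transform(train).toarray()
print(drop_encoder.categories_)
print(dropped)

# handle_unknown="ignore" encodes a category unseen during fit as all zeros
# instead of raising an error.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train)
unseen = safe_encoder.transform(np.array([["Italy"]])).toarray()
print(unseen)
```

Dropping one category avoids perfectly collinear dummy columns, which mainly matters for unregularized linear models.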

In [19]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

### Encoding the Dependent Variable

## LabelEncoder

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tool**                 | LabelEncoder                                                                                                                                                            |
| **Library**              | Scikit-learn                                                                                                                                                            |
| **Definition**           | A transformer that encodes target labels with values between 0 and n_classes-1.                                                                                          |
| **Primary Use**          | Converting categorical labels into numeric format for use in machine learning algorithms.                                                                                |
| **Parameters**           | None.                                                                                                                                                                   |
| **Attributes**           | - `classes_`: The unique classes identified during fitting.                                                                                                             |
| **Methods**              | - `fit(y)`: Fits the encoder to the labels `y`.                                                                                                                          |
|                          | - `transform(y)`: Transforms the labels `y` to numeric values.                                                                                                           |
|                          | - `fit_transform(y)`: Fits the encoder and transforms the labels `y`.                                                                                                     |
|                          | - `inverse_transform(y)`: Converts numeric values back to the original labels. |

**Example Usage**

```python
from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'fish', 'dog', 'cat']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [0 1 2 1 0]
print(encoder.classes_)  # Output: ['cat' 'dog' 'fish']
decoded_data = encoder.inverse_transform(encoded_data)
print(decoded_data)  # Output: ['cat' 'dog' 'fish' 'dog' 'cat']
```


In [21]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)

## Splitting the dataset into the Training set and Test set

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Tool**                 | train_test_split                                                                                                                                                        |
| **Library**              | Scikit-learn                                                                                                                                                            |
| **Definition**           | A utility function that splits arrays or matrices into random train and test subsets.                                                                                     |
| **Primary Use**          | Dividing a dataset into training and testing sets for model evaluation.                                                                                                |
| **Parameters**           | - `*arrays`: The input data to be split, such as features and labels.                                                                                                   |
|                          | - `test_size`: The proportion of the dataset to include in the test split. Can be a float between 0.0 and 1.0, or an integer for the absolute number of test samples.   |
|                          | - `train_size`: The proportion of the dataset to include in the train split. Can be a float between 0.0 and 1.0, or an integer for the absolute number of train samples. |
|                          | - `random_state`: Controls the shuffling applied to the data before the split for reproducibility.                                                                      |
|                          | - `shuffle`: Whether or not to shuffle the data before splitting. Default is `True`.                                                                                     |
|                          | - `stratify`: If not `None`, data is split in a stratified fashion, using this as the class labels.                                                                      |
| **Returns**              | - `splits`: A list of length `2 * len(arrays)` containing the train-test split of each input. With features and labels as inputs, the four outputs are `X_train`, `X_test`, `y_train`, `y_test`. |

**Example Usage**

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.array([0, 1, 2, 3, 4])
# Hold out 20% of the rows (one sample here); random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)
```

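The `stratify` parameter is worth a short illustration, since it matters most for imbalanced labels. This is a sketch assuming hypothetical data with a 75% / 25% class balance; the variable names mirror the example above but the values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 15 samples of class 0 and 5 of class 1.
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 15 + [1] * 5)

# stratify=y keeps the 75% / 25% class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Without `stratify`, a small test set drawn at random could end up containing no samples of the minority class at all.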

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [23]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [24]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [25]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1], dtype=int64)

In [26]:
y_test

array([0, 1], dtype=int64)

## Feature Scaling

| **Aspect**               | **Description**                                                                                                                                                         |
|--------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Definition**           | Feature scaling is the process of normalizing or standardizing the range of independent variables or features in a dataset.                                              |
| **Primary Use**          | Putting features on comparable scales so that no feature dominates merely because of its units or range. This speeds up the convergence of gradient-based optimizers and improves algorithms that rely on distance calculations. |
| **Common Methods**       | - **Normalization (Min-Max Scaling)**: Scales the data to a fixed range, usually 0 to 1. Formula: \(X' = \frac{X - X_{min}}{X_{max} - X_{min}}\).                                                            |
|                          | - **Standardization (Z-score Scaling)**: Scales the data so that it has a mean of 0 and a standard deviation of 1. Formula: \(X' = \frac{X - \mu}{\sigma}\).                                               |
|                          | - **Robust Scaling**: Uses median and interquartile range to scale data, making it robust to outliers. Formula: \(X' = \frac{X - \text{median}}{IQR}\).                                                    |
|                          | - **MaxAbs Scaling**: Scales each feature by its maximum absolute value, so the result lies between -1 and 1. Formula: \(X' = \frac{X}{\max(\lvert X \rvert)}\).                                         |
| **Importance**           | - Prevents features with larger ranges from dominating those with smaller ranges.                                                                                       |
|                          | - Speeds up the convergence of gradient-based optimization algorithms.                                                                                                  |
|                          | - Improves the accuracy and efficiency of scale-sensitive algorithms such as K-Nearest Neighbors (KNN) and Support Vector Machines (SVM), whose distance or kernel computations depend on feature magnitudes. |
| **Scikit-learn Tools**   | - `StandardScaler`: Standardizes features by removing the mean and scaling to unit variance.                                                                             |
|                          | - `MinMaxScaler`: Scales each feature to a given range (`feature_range`, default `(0, 1)`).                                                                              |
|                          | - `RobustScaler`: Scales features using statistics that are robust to outliers.                                                                                         |
|                          | - `MaxAbsScaler`: Scales each feature by its maximum absolute value.                                                                                                    |
| **Example Usage**        | ```python                                                                                                                                                               |
|                          | from sklearn.preprocessing import StandardScaler, MinMaxScaler                                                                                                          |
|                          | import numpy as np                                                                                                                                                      |
|                          |                                                                                                                                                                         |
|                          | data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])                                                                                                                      |
|                          |                                                                                                                                                                         |
|                          | # Standardization                                                                                                                                                       |
|                          | scaler = StandardScaler()                                                                                                                                               |
|                          | standardized_data = scaler.fit_transform(data)                                                                                                                           |
|                          | print("Standardized Data:\n", standardized_data)                                                                                                                         |
|                          |                                                                                                                                                                         |
|                          | # Normalization                                                                                                                                                         |
|                          | min_max_scaler = MinMaxScaler()                                                                                                                                         |
|                          | normalized_data = min_max_scaler.fit_transform(data)                                                                                                                     |
|                          | print("Normalized Data:\n", normalized_data)                                                                                                                             |
|                          | ```                                                                                                                                                                     |
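
To connect the four formulas in the table to the scikit-learn classes, here is a small sketch that computes each scaling by hand with NumPy and compares it with the corresponding scaler. The toy array is made up for illustration: it contains one large outlier and one column whose largest magnitude is negative. The comparison assumes each scaler's default settings (for `RobustScaler`, the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
)

# Illustrative data: column 0 has an outlier (100), column 1 has its
# largest magnitude on the negative side (-10)
data = np.array([[1.0, -10.0],
                 [2.0, 5.0],
                 [3.0, 0.0],
                 [4.0, 2.0],
                 [100.0, 8.0]])

# Standardization: (X - mean) / std, using the population std (ddof=0)
z_manual = (data - data.mean(axis=0)) / data.std(axis=0)

# Min-max normalization: (X - min) / (max - min)
mm_manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# Robust scaling: (X - median) / IQR
q25, q75 = np.percentile(data, [25, 75], axis=0)
robust_manual = (data - np.median(data, axis=0)) / (q75 - q25)

# MaxAbs scaling: X / max(|X|), taken per column
maxabs_manual = data / np.abs(data).max(axis=0)

print("standard:", np.allclose(z_manual, StandardScaler().fit_transform(data)))
print("min-max :", np.allclose(mm_manual, MinMaxScaler().fit_transform(data)))
print("robust  :", np.allclose(robust_manual, RobustScaler().fit_transform(data)))
print("max-abs :", np.allclose(maxabs_manual, MaxAbsScaler().fit_transform(data)))
```

In the second column the divisor for MaxAbs scaling is 10 (from -10), not 8, which is why the formula takes the maximum of the absolute values rather than the absolute value of the maximum.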


In [27]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [28]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [31]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
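
The cells above call `fit_transform` on the training rows and only `transform` on the test rows, and they leave the first three one-hot columns untouched. As a rough sketch of what that means, the snippet below re-enters the two numeric columns from the `X_train` and `X_test` arrays printed earlier (the names `train_num`, `test_num`, and `sc_demo` are illustrative) and shows that the test rows are scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The two numeric columns (indices 3 and 4), copied from the printed arrays above
train_num = np.array([[38.77777777777778, 52000.0],
                      [40.0, 63777.77777777778],
                      [44.0, 72000.0],
                      [38.0, 61000.0],
                      [27.0, 48000.0],
                      [48.0, 79000.0],
                      [50.0, 83000.0],
                      [35.0, 58000.0]])
test_num = np.array([[30.0, 54000.0],
                     [37.0, 67000.0]])

sc_demo = StandardScaler()
train_scaled = sc_demo.fit_transform(train_num)  # learns mean_ and scale_ from the training rows
test_scaled = sc_demo.transform(test_num)        # reuses those training statistics

print("training mean:", sc_demo.mean_)
print("training std :", sc_demo.scale_)
print(test_scaled)

# The same test-set result computed by hand from the training statistics
manual = (test_num - sc_demo.mean_) / sc_demo.scale_
print("matches manual formula:", np.allclose(manual, test_scaled))
```

Fitting a second scaler on the test rows would use different statistics and leak information about the held-out data into preprocessing, so the single fitted scaler is reused instead.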
