# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Brief explanation of each library:

1. **`numpy` (`np`)**:
   - Used for numerical computing in Python.
   - Provides support for large multi-dimensional arrays and matrices.
   - Includes functions for mathematical operations like linear algebra, statistical operations, and more.

2. **`matplotlib.pyplot` (`plt`)**:
   - A plotting library for creating static, animated, and interactive visualizations in Python.
   - Commonly used to generate graphs, charts, and plots.

3. **`pandas` (`pd`)**:
   - A library used for data manipulation and analysis.
   - Provides data structures like DataFrames, which are useful for handling and analyzing structured data, especially tabular data.



## Importing the dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the dataset from a CSV file into a pandas DataFrame
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Machine Learning A-Z/Part 1 - Data Preprocessing/Data.csv')

# Extract all rows and all columns except the last one (features) into X
X = dataset.iloc[:, :-1].values # matrix of features

# Extract all rows of the last column (target variable) into y
y = dataset.iloc[:, -1].values # dependent variable vector

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Handle missing values by replacing them with the mean of the respective columns:

In [None]:
# Import the SimpleImputer class from sklearn
from sklearn.impute import SimpleImputer


# Create an imputer object that replaces missing values (np.nan) with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer to the specified columns (2nd and 3rd columns) of the dataset X
imputer.fit(X[:, 1:3])

# Apply the transformation to replace missing values with the mean in those columns
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


Fitting the `imputer` first and then applying it follows the standard scikit-learn pattern of **fitting** and **transforming**. The two steps do different jobs:

1. **Fitting (`fit`)**: This step computes and stores the information needed to fill the gaps. Here, the imputer calculates the mean of each column it is given, ignoring the missing entries, and keeps those means as the replacement values.

2. **Transforming (`transform`)**: This step applies the imputation. It replaces every missing value with the mean stored for its column during fitting. Because the statistics live on the fitted object, the same replacement values can later be applied to other data, such as a test set.
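
The split between the two steps can be seen on a small example. The sketch below uses a made-up four-row array (not the course dataset) and inspects the fitted imputer's `statistics_` attribute, which holds the learned column means:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy (age, salary) columns with one missing value in each; illustrative values only
data = np.array([[44.0, 72000.0],
                 [27.0, np.nan],
                 [np.nan, 54000.0],
                 [38.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(data)  # learn one mean per column, ignoring NaNs

# The learned means are stored on the fitted object
learned_means = imputer.statistics_
manual_means = np.nanmean(data, axis=0)  # same computation done by hand

filled = imputer.transform(data)  # replace NaNs with the learned means

print(learned_means)
print(filled)
```

Nothing in `data` changes until `transform` is called; `fit` only records the means.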


## Encoding categorical data

In machine learning, **categorical data** refers to features (or columns) that represent categories or labels, rather than numerical values. Examples include:

- **Gender**: `Male`, `Female`
- **Country**: `USA`, `France`, `Germany`
- **Color**: `Red`, `Blue`, `Green`

**Why Categorical Data Needs to Be Encoded:**

Most machine learning algorithms require input data to be in numerical form because they rely on mathematical operations (e.g., distance calculations, dot products) that don't work on non-numeric (categorical) data. Therefore, we need to **encode** or convert these categorical variables into numbers.

**Common Encoding Techniques:**

1. **Label Encoding**:
   - Each category is assigned a unique integer. For example:
     - `Male` → `0`
     - `Female` → `1`
   - This method is simple but can introduce unintended ordinal relationships (i.e., algorithms might assume that `Female` > `Male` because of the numerical difference).

2. **One-Hot Encoding**:
   - Each category is represented by a binary vector, with each unique category having its own column. For example, for the `Country` feature:
     - `USA` → `[1, 0, 0]`
     - `France` → `[0, 1, 0]`
     - `Germany` → `[0, 0, 1]`
   - This approach avoids any ordinal relationships and is commonly used in machine learning pipelines.
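
As a rough illustration of the two techniques, the sketch below encodes a small made-up list of countries with scikit-learn's `LabelEncoder` and `OneHotEncoder`. Note that scikit-learn orders categories alphabetically, so the integers and column positions differ from the hand-written mapping above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative values only
countries = np.array(['USA', 'France', 'Germany', 'France'])

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(countries)

# One-hot encoding: one binary column per category (expects 2D input)
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries.reshape(-1, 1)).toarray()

print(label_codes)             # integers follow the order France, Germany, USA
print(encoder.categories_[0])  # column order of the one-hot matrix
print(one_hot)
```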
   
**Example of Why Encoding is Necessary:**

Imagine trying to use a machine learning model with non-numeric values like this:

| Age | Gender  | Country  |
|-----|---------|----------|
| 25  | Male    | USA      |
| 30  | Female  | France   |
| 22  | Female  | Germany  |

If we pass the `Gender` and `Country` columns directly to the model, it won't be able to process strings like "Male" or "France". However, after encoding, the data would look something like this:

| Age | Gender_Male | Gender_Female | Country_USA | Country_France | Country_Germany |
|-----|-------------|---------------|-------------|----------------|-----------------|
| 25  | 1           | 0             | 1           | 0              | 0               |
| 30  | 0           | 1             | 0           | 1              | 0               |
| 22  | 0           | 1             | 0           | 0              | 1               |

Now the machine learning model can process the data since everything is in numerical form!
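
One way to produce a table like this is pandas' `get_dummies`. This is a minimal sketch on the three example rows, not the approach used in the rest of this notebook (which relies on scikit-learn's encoders); the dummy columns come out in alphabetical order within each feature:

```python
import pandas as pd

# The three example rows from the table above
df = pd.DataFrame({
    'Age': [25, 30, 22],
    'Gender': ['Male', 'Female', 'Female'],
    'Country': ['USA', 'France', 'Germany'],
})

# Replace each listed column with one 0/1 column per category
encoded = pd.get_dummies(df, columns=['Gender', 'Country'], dtype=int)

print(encoded)
```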

### Encoding the Independent Variable

We will use one-hot encoding.

In [None]:
# Import the ColumnTransformer class to apply transformations to specific columns
from sklearn.compose import ColumnTransformer

# Import the OneHotEncoder class to encode categorical variables as one-hot vectors
from sklearn.preprocessing import OneHotEncoder

# Create a ColumnTransformer object that applies OneHotEncoder to the first column (index 0)
# remainder='passthrough' keeps the remaining columns unchanged and appends them after the encoded ones
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# Fit the ColumnTransformer to the data and transform the first column using one-hot encoding
# Convert the result to a numpy array
X = np.array(ct.fit_transform(X))

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

We will use label encoding.

In [None]:
# Import the LabelEncoder class to convert categorical labels into numerical form
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
le = LabelEncoder()

# Fit the LabelEncoder to the target variable (y) and transform it into numerical labels
# This converts categorical classes in y (e.g., 'Yes', 'No') into integers (e.g., 1, 0)
y = le.fit_transform(y)

In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
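
To check which integer stands for which class, the fitted encoder can be inspected. The sketch below re-creates a small encoder on a shortened label array (the names `y_raw`, `y_encoded`, and `y_decoded` are introduced here for illustration only) and maps the integers back to the original labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Shortened, illustrative label vector
y_raw = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)

# The position of each class in classes_ is the integer assigned to it
print(le.classes_)
print(y_encoded)

# inverse_transform recovers the original labels from the integers
y_decoded = le.inverse_transform(y_encoded)
print(y_decoded)
```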


## Splitting the dataset into the Training set and Test set

In [None]:
# Import the train_test_split function to split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
# X_train and y_train are the training sets; X_test and y_test are the testing sets
# test_size=0.2 means 20% of the data will be used for testing, and 80% for training
# random_state=1 ensures that the split is reproducible (same result every time you run it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(y_test)

[0 1]
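
The effect of `test_size` and `random_state` can be checked on synthetic data. In this sketch, `X_demo` and `y_demo` are made-up arrays built so that each feature row can be matched to its label, which makes it easy to confirm that rows and labels stay aligned after shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)  # 8 training rows, 2 test rows

# Same seed, same split
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```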


## Feature Scaling

Feature scaling is a technique used to standardize the range of independent variables or features of data. The primary goal is to ensure that each feature contributes equally to the model's learning process, especially in algorithms that are sensitive to the scale of the features.

**Why Feature Scaling is Important:**

1. **Improves convergence**: Many machine learning algorithms, such as gradient descent-based methods (e.g., linear regression, logistic regression, neural networks), converge faster when features are on a similar scale.
2. **Prevents bias**: Algorithms that compute distances (e.g., K-nearest neighbors, clustering algorithms) can be biased towards features with larger ranges or magnitudes if features are not scaled.
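3. **Enhances performance**: Models that penalize or compare features by magnitude (e.g., regularized regression, support vector machines) can give distorted results when one feature's values are much larger than another's; scaling puts the features on an equal footing.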
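
The distance bias in point 2 can be made concrete. The sketch below takes the age and salary values of the eight complete rows of the dataset above and measures how much of the squared Euclidean distance between two of them comes from salary, before and after standardizing by hand. The helper `salary_share` is a name introduced here for illustration:

```python
import numpy as np

# Age and salary for the rows of the dataset above that have no missing values
data = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
                 [35.0, 58000.0], [48.0, 79000.0], [50.0, 83000.0], [37.0, 67000.0]])

def salary_share(p, q):
    """Fraction of the squared Euclidean distance between p and q due to salary."""
    sq = (p - q) ** 2
    return sq[1] / sq.sum()

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

share_raw = salary_share(data[1], data[2])
share_scaled = salary_share(scaled[1], scaled[2])
print(f"raw: {share_raw:.4f}, scaled: {share_scaled:.4f}")
```

On the raw values, the distance is almost entirely determined by salary; after standardization, age contributes a visible share as well.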

**Common Methods of Feature Scaling:**

1. **Standardization (Z-score Normalization)**:
   - **Formula**: \( z = \frac{x - \mu}{\sigma} \)
   - **Explanation**: Transforms features to have a mean of 0 and a standard deviation of 1.
   - **When to use**: Useful when the features have different units or scales. It does not require the data to be normally distributed.
   - It is a reasonable default that works well in most situations.

2. **Min-Max Normalization (Rescaling)**:
   - **Formula**: \( x' = \frac{x - \text{min}}{\text{max} - \text{min}} \)
   - **Explanation**: Transforms features to a fixed range, usually [0, 1].
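   - **When to use**: Useful when features need to be scaled to a bounded range, such as for neural networks or algorithms that require a specific range.
   - As a rule of thumb, it is often recommended when the feature is roughly normally distributed. Because it depends on the minimum and maximum, it is sensitive to outliers.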
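
Both formulas can be applied by hand and compared with scikit-learn's scalers. This is a small sketch on five made-up age values; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative single-feature column
ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# Standardization: z = (x - mean) / std
z_manual = (ages - ages.mean()) / ages.std()

# Min-max normalization: x' = (x - min) / (max - min)
minmax_manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformations via scikit-learn
z_sklearn = StandardScaler().fit_transform(ages)
minmax_sklearn = MinMaxScaler().fit_transform(ages)

print(z_manual.ravel())
print(minmax_manual.ravel())
```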


**Do we have to do feature scaling before splitting the dataset into the training set and test set or after?**

Feature scaling is generally performed **after** splitting the dataset into training and test sets. Here's why:

1. **Avoid data leakage**: When you split your data, the test set is meant to represent unseen data. If you perform feature scaling (e.g., normalizing or standardizing) before splitting, information from the test set could "leak" into the training set because the scaling would have been influenced by data from both sets. This defeats the purpose of evaluating model performance on truly unseen data.

2. **Scaling based on training data only**: You should fit the scaler (e.g., calculating the mean and standard deviation for standardization) on the training data only. Then, use the same transformation (based on the training data) on the test set. This simulates the real-world scenario where future data is scaled based on past data without recalculating scaling parameters.
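
A minimal sketch of this rule on synthetic data: the scaler is fitted on the training rows only, and the test rows are transformed with the stored training statistics. The arrays `X_demo`, `X_train_demo`, and `X_test_demo` are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 1))

X_train_demo, X_test_demo = train_test_split(X_demo, test_size=0.2, random_state=1)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train_demo)  # learns mean/std from training rows only
X_test_scaled = sc.transform(X_test_demo)        # reuses the training mean/std

# The stored statistics come from the training set, not the full dataset
print(sc.mean_, X_train_demo.mean(axis=0), X_demo.mean(axis=0))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected: it was scaled with the training statistics.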


In [None]:
from sklearn.preprocessing import StandardScaler

# Create an instance of the StandardScaler
sc = StandardScaler()

# Fit the scaler on the training data and transform the training features
# This calculates the mean and standard deviation from the training data
# and then scales the training data to have a mean of 0 and a standard deviation of 1
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # scale only age and salary; the one-hot dummy columns (0/1) already lie in a comparable range

# Transform the test features using the scaler fitted on the training data
# This ensures that the test data is scaled using the same mean and standard deviation as the training data
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # only transform, not fit

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
