
# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing the dataset

In [2]:
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data.csv')
# Notes: the last column (purchased, yes or no) is the dependent variable;
# the other columns, such as age and salary, are the independent variables.
x = dataset.iloc[:, :-1].values  # all rows, every column except the last
# Notes: the dependent variable is usually stored in the last column, as it is
# here (purchased).
y = dataset.iloc[:, -1].values
z = dataset.drop(dataset.columns[1], axis=1).values  # every column except the second
# Notes: .iloc selects rows and columns by their integer positions.
# The colon (:) before the comma selects all rows.
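
The slicing used above can be tried on a small in-memory frame. This is a minimal sketch: the frame `df` and its column names are assumptions standing in for `data.csv`, which is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for data.csv; the column names are assumed.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values  # all rows, every column except the last
target = df.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
print(features.dtype)  # object, because strings and floats are mixed
```

Because the feature columns mix strings and numbers, `.values` falls back to an `object` array, which is why the printed rows show quoted country names next to floats.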

In [35]:
print(z)

[['France' 72000.0 'No']
 ['Spain' 48000.0 'Yes']
 ['Germany' 54000.0 'No']
 ['Spain' 61000.0 'No']
 ['Germany' nan 'Yes']
 ['France' 58000.0 'Yes']
 ['Spain' 52000.0 'No']
 ['France' 79000.0 'Yes']
 ['Germany' 83000.0 'No']
 ['France' 67000.0 'Yes']]


In [None]:
print(x)  # .values returns the data as a 2-D NumPy array

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

The code below uses `SimpleImputer` to replace missing values (`np.nan`) in columns 1 and 2 of the array `x` (age and salary) with the mean of each column.

In [3]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

In [None]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
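
To see what the imputer learns, the same steps can be run on a tiny array. This is a sketch with made-up numbers (`ages_salaries` is a hypothetical array, not the notebook's data); `statistics_` holds the per-column means computed during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary pairs with one missing value per column.
ages_salaries = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(ages_salaries)  # fit and transform in one call

print(imputer.statistics_)  # per-column means, ignoring NaN: [35.5, 63000.0]
print(filled)               # NaN entries replaced by those means
```

The mean is computed only from the values that are present, so in the notebook's data the missing salary becomes 574000 / 9 ≈ 63777.78.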


## Encoding categorical data

Notes: Most machine learning algorithms expect numerical input, so categorical values such as country names must be encoded as numbers before a model can use them.
The later steps also need the data as a NumPy array, which is why the encoded result is wrapped in `np.array`.


**One-Hot Encoding:** Converts categorical values into multiple binary columns, where each unique category becomes a column with 1s and 0s. It avoids ordinal relationships but can increase the feature space significantly.

**Example:**

Before:
Color => Red, Green, Blue, Red

After:
Color_Red: 1 0 0 1
Color_Green: 0 1 0 0
Color_Blue: 0 0 1 0
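
The color example can be reproduced with `pandas.get_dummies`, which is swapped in here only because it is compact; it is not the encoder used later in this notebook. Note that the output columns come out in sorted order.

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Red"], name="Color")

# One binary column per distinct color; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(colors, prefix="Color", dtype=int)

print(one_hot)
print(one_hot["Color_Red"].tolist())  # [1, 0, 0, 1]
```

Each row contains exactly one 1, marking the color of that row.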


**Label Encoding:** Assigns each unique category an integer value. It’s memory efficient but implies an ordinal relationship, which may not be suitable for all algorithms.

**Example:**

Before:
Color => Red, Green, Blue, Red

After:
Color
0  # Red
1  # Green
2  # Blue
0  # Red
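
Which integer each category receives depends on the tool. As a sketch, `pandas.factorize` numbers categories in order of first appearance and so reproduces the example above, while scikit-learn's `LabelEncoder` sorts the classes first and yields different integers.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["Red", "Green", "Blue", "Red"])

# factorize: codes follow the order of first appearance.
codes, uniques = pd.factorize(colors)
print(codes)          # [0 1 2 0]
print(list(uniques))  # ['Red', 'Green', 'Blue']

# LabelEncoder: classes are sorted, so Blue=0, Green=1, Red=2.
le = LabelEncoder()
sorted_codes = le.fit_transform(colors)
print(sorted_codes)   # [2 1 0 2]
print(le.classes_)    # ['Blue' 'Green' 'Red']
```

Either way the integers are arbitrary labels; their numeric order carries no meaning for colors.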

Use **OneHotEncoder** for categorical features that have no inherent order or hierarchy, since it creates one binary column per category. Integer encoding suits ordinal data where the categories have a clear ranking. Note, however, that scikit-learn's **LabelEncoder** is intended for the target variable and numbers the classes in sorted order rather than by rank; for ordinal features, `OrdinalEncoder` with an explicit category order is the usual choice.
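
For a feature with a genuine ranking, one option is scikit-learn's `OrdinalEncoder` with the category order listed explicitly. This is a minimal sketch; the `sizes` column and its ranking are hypothetical and do not come from this notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: one column, four rows.
sizes = np.array([["medium"], ["small"], ["large"], ["small"]])

# The list fixes the rank: small=0, medium=1, large=2.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())  # [1. 0. 2. 0.]
```

Without the `categories` argument the encoder would sort the strings alphabetically, which would put "large" before "medium".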



### Encoding the Independent Variable

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
encoded_matrix = np.array(ct.fit_transform(x))

In [10]:
print(encoded_matrix)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # no arguments needed: y is a single 1-D vector of labels
encoded_dependent = le.fit_transform(y)

In [8]:
print(encoded_dependent)

[0 1 0 0 1 1 0 1 0 1]
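
The mapping between integers and the original labels stays available on the fitted encoder. A short sketch on a hypothetical four-label vector `y_small`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y_small = np.array(["No", "Yes", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(y_small)

print(encoded)      # [0 1 0 1]
print(le.classes_)  # ['No' 'Yes'], so index 0 means 'No' and 1 means 'Yes'

# inverse_transform maps the integers back to the original strings.
decoded = le.inverse_transform(encoded)
print(decoded)      # ['No' 'Yes' 'No' 'Yes']
```

This is useful later for turning model predictions back into readable labels.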


## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(encoded_matrix, encoded_dependent, test_size=0.2, random_state=1)

In [8]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [9]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [11]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [12]:
print(y_test)

[0 1]
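
Two properties of the split are worth verifying: `test_size=0.2` on ten rows leaves two rows for testing, and a fixed `random_state` makes the split repeatable. The sketch below uses placeholder data (`features` and `labels` are assumptions) and also shows the optional `stratify` argument, which this notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 feature columns, balanced binary labels.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

split_a = train_test_split(features, labels, test_size=0.2, random_state=1)
split_b = train_test_split(features, labels, test_size=0.2, random_state=1)
x_train, x_test, y_train, y_test = split_a

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Same random_state, same split.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True

# stratify keeps the class proportions equal in both parts.
_, _, _, y_test_s = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)
print(sorted(y_test_s.tolist()))  # [0, 1]
```

With very small datasets, stratifying is one way to avoid a test set that contains only one class.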


## Feature Scaling