# Data Preprocessing Tools

Data preprocessing is the process of converting raw data into a format suitable for training and building ML models.

The processes involved are:
  1. Data Cleaning (Handling Missing Values)
  2. Data Transformation (Feature Scaling - Normalization, Standardization)
  3. Data Splitting (Splitting data into training and test sets)

## Importing the libraries

In [16]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [17]:
# When reading the CSV, check which delimiter it uses; if it is anything other than a comma, pass it via the sep (or delimiter) parameter.
dataset = pd.read_csv('Data.csv')
# 1. Separate the features (X) from the dependent variable (Y)

# 2. X is a NumPy array holding every column except the last (the first 3 columns of this dataset)
X = dataset.iloc[:, :-1].values

# 3. Y is a NumPy array holding the last column of the dataset
Y = dataset.iloc[:, -1].values
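
As a small illustration of the delimiter note above, the sketch below reads semicolon-separated text by passing `sep`. The inline text and its column names are made up for this example; they are not the contents of `Data.csv`.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data, used only for illustration
raw = "Country;Age;Salary;Purchased\nFrance;44;72000;No\nSpain;27;48000;Yes\n"

# sep tells read_csv which delimiter to split on (the default is a comma)
demo = pd.read_csv(io.StringIO(raw), sep=";")

# Same split as in the cell above: every column except the last is a feature
X_demo = demo.iloc[:, :-1].values
Y_demo = demo.iloc[:, -1].values

print(demo.shape)
print(X_demo.shape, Y_demo.shape)
```

Without `sep=";"`, each line would be read as a single column, so the feature/target split by position would no longer work.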

In [None]:
# Number of null entries in each column
print(dataset.isnull().sum())

In [None]:
print(X)
print(Y)

## Taking care of missing data

In [18]:
# Two ways to handle missing data
# 1. Delete the rows that have missing data
# 2. Replace each missing cell with the mean/median/most frequent value of its column

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit computes the mean of each selected column (make sure to select ONLY numerical columns)
# the whole dataset can be passed as-is only if every column is numerical
imputer.fit(X[:, 1:3])

# transform returns a NumPy array with the missing cells filled in
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [None]:
print(X)
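
The cell above uses the second approach with the mean. For comparison, here is a minimal sketch of the first approach (deleting rows) and of the second approach with the median instead, run on a small made-up DataFrame rather than on `Data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing Age and one missing Salary
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Way 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Way 2: fill each missing cell with the median of its column
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled = median_imputer.fit_transform(toy)

print(dropped)
print(filled)
```

Dropping rows is simplest but discards data (here half of the rows), which is why imputation is usually preferred on small datasets.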

## Encoding categorical data

Most ML models expect numerical input, so columns that hold anything other than numbers have to be converted first. Each distinct value in such a column is treated as a category and mapped to a numerical representation (categorical encoding).

### Encoding the Independent Variable

In [19]:
# One-hot encoding is used when there is no ordering or other relationship between the values in a column
# E.g. say we have a column Fruits with the values apple, banana and kiwi; there is no relationship between these values
# NOTE: this WILL increase the number of columns, since each distinct value gets its own column

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# specify which transformations to apply and on which column index(es)

# each entry in ColumnTransformer's transformers list is a tuple of 3 items: a name, the transformer object, and the column index(es) to transform
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       # passthrough = keep the columns that are not being transformed as they are
                       remainder='passthrough')

# X (features) is expected to be a NumPy array, hence the np.array(...) wrapper
X = np.array(ct.fit_transform(X))


print(X)
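
To make the fruit example from the comments concrete, the sketch below one-hot encodes a hypothetical single-column `fruits` array (not part of `Data.csv`) and shows how the number of columns grows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column feature matrix (one row per sample)
fruits = np.array([["apple"], ["banana"], ["kiwi"], ["apple"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() turns it into a dense NumPy array
encoded = encoder.fit_transform(fruits).toarray()

# One new column per distinct value, in sorted order
print(encoder.categories_)
print(encoded)
print(fruits.shape, "->", encoded.shape)
```

One column with three distinct values becomes three 0/1 columns, which is the dimension increase the note in the cell above refers to.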

In [None]:
# If the encoding is done directly on the dataset (i.e. without first separating it into features and the dependent variable)

# ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[pass the column name(s) here])],remainder='passthrough')
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# le.fit_transform(dataset[columnName])
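
A runnable version of the commented-out idea might look like the sketch below. The DataFrame and its column names (`Country`, `Age`, `Purchased`) are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up DataFrame; the column names are assumptions for this example
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Purchased": ["No", "Yes", "No", "No"],
})

# When the input is a DataFrame, columns can be selected by name instead of by index
ct_df = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
)
features = ct_df.fit_transform(df[["Country", "Age"]])

# LabelEncoder works on a single column (a 1-D sequence of labels)
labels = LabelEncoder().fit_transform(df["Purchased"])

print(features)
print(labels)
```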

### Encoding the Dependent Variable

In [20]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Y = le.fit_transform(Y)
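
To see what `LabelEncoder` actually produces, the sketch below encodes a hypothetical Yes/No target and then maps the integer codes back to the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column with two classes
y_raw = np.array(["No", "Yes", "No", "Yes", "Yes"])

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_raw)

# classes_ lists the original labels; a label's position is its integer code
print(le_demo.classes_)
print(y_encoded)

# inverse_transform maps the integer codes back to the original labels
print(le_demo.inverse_transform(y_encoded))
```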

## Splitting the dataset into the Training set and Test set

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

print(X_train)
print(X_test)
print(Y_train)
print(Y_test)
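
The effect of `test_size=0.2` is easiest to see on the shapes. The sketch below uses made-up arrays of 10 samples, so 8 rows go to training and 2 to testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Rows are shuffled before splitting, but each feature row stays paired with its own label.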

## Feature Scaling
Feature scaling prevents features/columns with large values from dominating features/columns with small values to the point that the dominated features are effectively ignored by the ML model.

Standardisation: $x_{\text{stand}} = \dfrac{x - \text{mean}(x)}{\text{standard deviation}(x)}$, which puts most values roughly in the range $[-3, +3]$.
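
As a quick check of the formula, the sketch below standardises a made-up column by hand with NumPy and compares the result with scikit-learn's `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature column
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

# By hand: subtract the column mean, then divide by the column standard deviation
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# StandardScaler applies the same formula (population standard deviation, ddof=0)
scaled = StandardScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

After standardisation the column has mean 0 and standard deviation 1, whatever its original units were.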


In [26]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Feature scaling need NOT be applied to all features/columns, i.e. it is not necessary for the one-hot encoded columns because their values are already 0 or 1.

# fit - calculates the mean and standard deviation of the values in each column
# transform - applies the standardisation formula using those values
X_train[:,3:] = scaler.fit_transform(X_train[:,3:])

# the test data is only transformed (not fitted), using the mean and standard deviation learned from the training data,
# so that it is processed in the same way as the training data and no test-set information leaks into the scaler
X_test[:,3:] = scaler.transform(X_test[:,3:])

print(X_train)
print(X_test)
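
The overview at the top also lists normalization, which this notebook does not demonstrate. A minimal sketch of it (min-max scaling to the range 0 to 1) with scikit-learn's `MinMaxScaler` is shown below, on made-up numbers and following the same fit-on-train, transform-on-test pattern.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for a single feature
train_col = np.array([[20.0], [30.0], [40.0], [60.0]])
test_col = np.array([[25.0], [70.0]])

minmax = MinMaxScaler()
# Normalization: (x - min) / (max - min), with min and max taken from the training data
train_norm = minmax.fit_transform(train_col)
# The test data reuses the training min and max, so its values can fall outside [0, 1]
test_norm = minmax.transform(test_col)

print(train_norm.ravel())
print(test_norm.ravel())
```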