# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

The _**independent variables (features/inputs)**_ and the _**dependent variable (output)**_ need to be separated from the dataset before they are fed to a model for training.

In [15]:
dataset = pd.read_csv('Data.csv')

FEATURE_COLS = np.s_[:, :-1] # all rows, all columns except the last
DEPENDENT_COL = np.s_[:, -1] # all rows, last column

X = dataset.iloc[FEATURE_COLS].values
y = dataset.iloc[DEPENDENT_COL].values
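
The slicing above can be tried without `Data.csv`. This is a minimal sketch on a made-up three-row frame; the column names (`Country`, `Age`, `Salary`, `Purchased`) and the values are assumptions for illustration only, not the contents of the real file. It shows that `np.s_` simply builds the tuple of slices that `iloc` would otherwise receive inline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv (column names and values are assumed)
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

feature_cols = np.s_[:, :-1]   # same as writing toy.iloc[:, :-1]
dependent_col = np.s_[:, -1]   # same as writing toy.iloc[:, -1]

X_toy = toy.iloc[feature_cols].values   # 2-D array: every column but the last
y_toy = toy.iloc[dependent_col].values  # 1-D array: the last column only

print(X_toy.shape)
print(y_toy.shape)
```

Naming the slices keeps the intent ("features" vs. "dependent column") visible wherever they are reused.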

In [13]:
print("X", X)
print("\ny", y)

X [[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

y [0 1 0 0 1 1 0 1 0 1]


## Handle missing data

There are several ways to fill in missing values in a dataset, such as replacing them with the _**mean (average)**_ or the _**median**_ of the whole column. Of these, the _**mean**_ strategy is commonly used.

In [4]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

Only the numeric columns are imputed:

In [14]:
NUMERIC_FEATURES = np.index_exp[:, 1:]

imputer.fit(X[NUMERIC_FEATURES])
X[NUMERIC_FEATURES] = imputer.transform(X[NUMERIC_FEATURES])

In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
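
To see how the choice of strategy changes the filled value, here is a small sketch on a made-up single column (the numbers are assumptions chosen so that one outlier pulls the mean away from the median). It only uses `SimpleImputer` with the `mean` and `median` strategies mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up column with one missing value and one outlier (100.0)
ages = np.array([[20.0], [30.0], [np.nan], [100.0]])

mean_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(ages)
median_filled = SimpleImputer(missing_values=np.nan, strategy="median").fit_transform(ages)

# The missing entry is at row index 2
print("mean strategy fills with:", mean_filled[2, 0])
print("median strategy fills with:", median_filled[2, 0])
```

With an outlier in the column, the median may be the more representative fill value; without outliers the two strategies tend to give similar results.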


## Encoding categorical data

For a model to compute the correlation between the independent and dependent variables, categorical (non-continuous) data such as _strings_ needs to be encoded as numbers.

### Encoding the Independent Variable

The categorical independent variable here has no inherent order, so it should not be encoded as ordinal numbers (1, 2, 3...) by ```LabelEncoder```, but in an unordered form such as a matrix of binary columns by ```OneHotEncoder```.

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

STRING_FEATURES = [0]

ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), STRING_FEATURES)], remainder = 'passthrough') # 'passthrough' keeps the non-encoded columns in X
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
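
The mapping from country to binary columns can be inspected on its own. The sketch below applies `OneHotEncoder` to a made-up column of country names (an assumption for illustration, not the full dataset) and reads the learned categories back from the fitted encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single categorical column
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(one_hot)                 # one binary column per category
```

Each row contains exactly one `1`, so no category is numerically "larger" than another, which is the point of avoiding ordinal codes for unordered categories.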


### Encoding the Dependent Variable

The dependent variable can be encoded as integer labels (0, 1, ...).

In [9]:
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

The training set should be considerably larger than the test set so that the model has enough inputs to learn from. Here we use 80% of the dataset for training and 20% for testing.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1) # fix the random seed of the split => get the same training & test sets on every run

In [12]:
print("X_train", X_train)
print("\nX_test", X_test)
print("\ny_train", y_train)
print("\ny_test", y_test)

X_train [[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

X_test [[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

y_train [0 1 0 0 1 1 0 1]

y_test [0 1]
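
The effect of fixing `random_state` can be checked directly. This is a minimal sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical names, not the notebook's data): it splits the same ten rows twice with the same seed and compares the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Every returned array should match between the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

X_tr, X_te, y_tr, y_te = first
print(same_split)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows
```

Leaving `random_state` unset would shuffle differently on each run, which makes results harder to compare between executions.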


## Feature Scaling

For certain (not all) models, all the _non-categorical features_ should be put on the same scale so that no feature dominates the others. The scaler should be fitted on the training set only (_after dataset splitting_), since we treat the test set as future data that we do not have yet; the test set is then transformed with the parameters learned from the training set, never used to compute them.
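
One possible way to do this is sketched below, assuming standardisation with scikit-learn's `StandardScaler` (the notebook does not name a scaler, so this choice is an assumption). The demo arrays reuse a few rows from the encoded output above, and `NUMERIC` is a hypothetical name for the slice that skips the three one-hot columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few encoded rows: 3 one-hot columns, then age and salary
X_train_demo = np.array([[1.0, 0.0, 0.0, 44.0, 72000.0],
                         [0.0, 0.0, 1.0, 27.0, 48000.0],
                         [0.0, 1.0, 0.0, 30.0, 54000.0],
                         [0.0, 0.0, 1.0, 38.0, 61000.0]])
X_test_demo = np.array([[1.0, 0.0, 0.0, 35.0, 58000.0]])

NUMERIC = np.s_[:, 3:]  # hypothetical name: all rows, non-categorical columns only

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_demo[NUMERIC] = scaler.fit_transform(X_train_demo[NUMERIC])
# Reuse the training statistics for the test rows (transform, not fit_transform)
X_test_demo[NUMERIC] = scaler.transform(X_test_demo[NUMERIC])

print(X_train_demo)
print(X_test_demo)
```

The one-hot columns are left untouched because they are already on a 0/1 scale, and the scaler never sees the test rows while computing its statistics.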