# Data Preprocessing Tools

## Importing the libraries

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the dataset

In [9]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column (Purchased)

In [10]:
print(X)
print()
print(y)

print("Missing data in each column:")
missing_data = dataset.isnull().sum()
print(missing_data)

print()
print(dataset.isnull())

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
Missing data in each column:
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

   Country    Age  Salary  Purchased
0    False  False   False      False
1    False  False   False      False
2    False  False   False      False
3    False  False   False      False
4    False  False    True      False
5    False  False   False      False
6    False   True   False      False
7    False  False   False      False
8    False  False   False      False
9    False  False   False      False


## Taking care of missing data

Note: Our goal is to handle the missing values in the dataset. One approach is to replace each missing value with the mean of its column. Here we use the SimpleImputer class from the impute module of the sklearn package. We create a SimpleImputer object and call it imputer. The first constructor parameter, 'missing_values', is set to np.nan, NumPy's floating-point "not a number" value, which is how the missing entries appear in the array. The second parameter, 'strategy', is set to 'mean', so each missing value is replaced by the mean of the non-missing values in its column. We fit and transform only columns 1 and 2 (Age and Salary), because a mean is defined only for numeric data.

In [11]:
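from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the column means from the numeric columns (Age, Salary)
imputer.fit(X[:, 1:3])
# Fill the missing entries in place
X[:, 1:3] = imputer.transform(X[:, 1:3])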
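
As a sanity check on what the 'mean' strategy computes, the same fill can be reproduced with plain NumPy. This is only a sketch: the array below is copied by hand from the Age and Salary columns printed earlier, and the names `numeric`, `col_means`, and `filled` are illustrative, not part of the notebook.

```python
import numpy as np

# Age and Salary columns copied from the printed dataset (nan = missing).
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Column means computed over the non-missing entries only.
col_means = np.nanmean(numeric, axis=0)

# Keep existing values; substitute the column mean where a value is missing.
filled = np.where(np.isnan(numeric), col_means, numeric)

print(col_means)
print(filled[4], filled[6])
```

The two means (about 38.78 for Age and 63777.78 for Salary) should match the values that appear in the imputed rows of X.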

## Encoding categorical data

To encode categorical data (columns holding string values), we use one-hot encoding rather than assigning each category an integer, because integer codes imply an ordering and magnitude (for example, Spain > Germany > France) that the model may treat as meaningful. One-hot encoding replaces the column with one new column per category; each row gets a 1 in the column for its category and a 0 in the others.
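
To make the difference concrete, here is a minimal NumPy-only sketch that contrasts integer codes with one-hot vectors. The four-country sample and the names `countries`, `codes`, and `one_hot` are illustrative assumptions; it mimics the idea, not the internals of OneHotEncoder.

```python
import numpy as np

# A small hypothetical sample of the Country column.
countries = np.array(['France', 'Spain', 'Germany', 'Spain'])

# Integer codes: categories are sorted alphabetically, so
# France -> 0, Germany -> 1, Spain -> 2. The codes suggest an order
# (Spain > Germany > France) that has no real meaning.
categories, codes = np.unique(countries, return_inverse=True)
print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 1 2]

# One-hot vectors: pick row `code` of the identity matrix for each sample,
# giving one column per category and exactly one 1 per row.
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

With this column order, France maps to [1, 0, 0], Germany to [0, 1, 0], and Spain to [0, 0, 1].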

### Encoding the Independent Variable

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
print('Before encoding:')
print(X)
print()
X = np.array(ct.fit_transform(X))
print('After encoding:')
print(X)

Before encoding:
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

After encoding:
[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print('Before encoding:')
print(y)
y = le.fit_transform(y)

print('After encoding:')
print(y)

Before encoding:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
After encoding:
[0 1 0 0 1 1 0 1 0 1]
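
To check which integer stands for which label, and to map predictions back to the original strings, the fitted encoder can be inspected. A minimal sketch, assuming a separate encoder `le_demo` and a short hand-typed `labels` list (both hypothetical) so that it runs on its own:

```python
from sklearn.preprocessing import LabelEncoder

# First five Purchased values from the dataset, typed in by hand.
labels = ['No', 'Yes', 'No', 'No', 'Yes']

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ lists the labels in sorted order; the index is the integer code.
print(le_demo.classes_)
print(encoded)

# inverse_transform turns integer codes back into the original labels.
print(le_demo.inverse_transform(encoded))
```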


## Splitting the dataset into the Training set and Test set
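
A minimal sketch of one common way to do this step, assuming scikit-learn's train_test_split. The demo arrays, the 80/20 ratio (test_size=0.2), and random_state=1 are illustrative assumptions, not values taken from this notebook; in the notebook the call would presumably be applied to X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the preprocessed features and labels (10 rows).
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state fixes the shuffle
# so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 10 rows and test_size=0.2, 8 rows go to the training set and 2 to the test set.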

## Feature Scaling
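
A minimal sketch of standardization, assuming scikit-learn's StandardScaler. The small arrays below are hypothetical Age and Salary rows used only for illustration. The scaler is fit on the training rows alone and then reused on the test rows, so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (Age, Salary) for training and test rows.
X_train_demo = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_test_demo = np.array([[35.0, 58000.0]])

sc = StandardScaler()

# Learn each column's mean and standard deviation from the training rows,
# then rescale them to zero mean and unit variance.
X_train_scaled = sc.fit_transform(X_train_demo)

# Apply the same training-set statistics to the test rows.
X_test_scaled = sc.transform(X_test_demo)

print(X_train_scaled)
print(X_test_scaled)
```

In a dataset like this one, scaling is often applied only to the numeric columns and not to the one-hot columns, which are already 0 or 1; whether to do so is a modeling choice.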