
# Data preprocessing

>### Importing the libraries


In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import Imputer # Missing-value imputation (only in scikit-learn < 0.22; later versions use sklearn.impute.SimpleImputer)
from sklearn.preprocessing import LabelEncoder # Encoding categorical data
from sklearn.preprocessing import OneHotEncoder # Dummy (one-hot) encoding
from sklearn.model_selection import train_test_split # Split the data into training and test sets
from sklearn.preprocessing import StandardScaler # Feature scaling

>### Importing the dataset

In [4]:
dataset = pd.read_csv("Data.csv")
pd.DataFrame(dataset)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Matrix of features (Independent variables)

In [5]:
X = dataset.iloc[:, :-1].values
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


### Matrix of dependent variable

In [6]:
Y = dataset.iloc[:, 3].values
pd.DataFrame(Y)

Unnamed: 0,0
0,No
1,Yes
2,No
3,No
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes


> ### Missing Data

How to treat missing data:
- Delete the row. This is usually not a good option because the row can contain crucial information.
- Replace the missing value with the mean of the column that contains it.
- Replace it with the median of the column.
- Replace it with the most frequent value of the column.
- Replace it with the mean or median of a group of similar rows, rather than the mean of the entire population.
- If a column has too many missing values (say > 80%), you can simply drop that column.
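
The last two options can be sketched with pandas. This is a minimal sketch on a hypothetical toy frame: the `Notes` column, the values, and the choice of `Country` as the grouping key are assumptions made for illustration, not part of `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Notes" is missing in 5 of 6 rows (about 83%).
df = pd.DataFrame({
    "Country": ["France", "Spain", "France", "Spain", "France", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0, 35.0, 31.0],
    "Salary": [72000.0, 48000.0, 58000.0, np.nan, 61000.0, 52000.0],
    "Notes": [np.nan, np.nan, np.nan, np.nan, np.nan, "vip"],
})

# Drop every column whose fraction of missing values exceeds 80%.
missing_fraction = df.isna().mean()
too_sparse = missing_fraction[missing_fraction > 0.8].index
df = df.drop(columns=too_sparse)

# Fill each remaining gap with the mean of its own Country group.
numeric_cols = ["Age", "Salary"]
group_means = df.groupby("Country")[numeric_cols].transform("mean")
df[numeric_cols] = df[numeric_cols].fillna(group_means)

print(df)
```

Here the missing Age of a French row is filled with the French mean age, not with the mean over all countries.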

>#### Mean method

In [7]:
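# Replace each NaN with the mean of its column (axis = 0: columns, axis = 1: rows)
# Note: Imputer and its axis argument exist only in scikit-learn < 0.22
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
imputer = imputer.fit(X[:, 1:3]) # Learn the means of the Age and Salary columns
X[:, 1:3] = imputer.transform(X[:, 1:3]) # Fill in the missing values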



In [8]:
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.8
5,France,35.0,58000.0
6,Spain,38.7778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
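
On scikit-learn 0.22 and later, `Imputer` is no longer available. A possible equivalent, assuming a recent scikit-learn, uses `SimpleImputer`, which always works column by column and therefore has no `axis` argument. The array below repeats the Age and Salary values from the dataset so that the sketch runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns of the dataset; np.nan marks the missing values.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

# Learn the column means, then replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(filled[4, 1])  # imputed Salary of row 4
print(filled[6, 0])  # imputed Age of row 6
```

The two imputed values should match the 63777.8 and 38.7778 shown in the table above.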


>#### Median

The median is the value separating the higher half from the lower half of a data sample (a population or a probability distribution). For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it.
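
A small NumPy sketch of the example above, followed by median imputation of the Age column. The variable names are illustrative. `SimpleImputer` in recent scikit-learn versions also accepts `strategy="median"` and `strategy="most_frequent"`.

```python
import numpy as np

# The data set from the text: the middle value of the sorted sample.
sample = np.array([1, 3, 3, 6, 7, 8, 9])
median_value = np.median(sample)

# Age column of the dataset with its missing entry.
ages = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])

# nanmedian ignores the NaN when computing the median.
age_median = np.nanmedian(ages)

# Replace the NaN with the median and keep every other value.
ages_filled = np.where(np.isnan(ages), age_median, ages)

print(median_value, age_median)
```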

>### Categorical data

The dataset contains categorical data: the Country column (France, Spain, Germany) and the Purchased column (Yes, No). Machine learning models are based on mathematical equations, so keeping these text values in the data would cause problems. The categorical variables have to be encoded as numbers.

In [9]:
# Encode the Country column of X as integers
labelEncoder_X = LabelEncoder()
X[:,0] = labelEncoder_X.fit_transform(X[:,0])
pd.DataFrame(X)

Unnamed: 0,0,1,2
0,0,44.0,72000.0
1,2,27.0,48000.0
2,1,30.0,54000.0
3,2,38.0,61000.0
4,1,40.0,63777.8
5,0,35.0,58000.0
6,2,38.7778,52000.0
7,0,48.0,79000.0
8,1,50.0,83000.0
9,0,37.0,67000.0


Note: With this encoding, the machine learning model will assume that Spain (2) has a greater value than Germany (1), which in turn is greater than France (0). That is not the case: these are three categories with no relational order between them, so comparing them makes no sense. We can solve this with dummy encoding.
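
The artificial order can be made visible with a short sketch. The list of countries is a shortened, illustrative sample of the Country column.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# LabelEncoder sorts the labels, so the integer codes follow alphabetical order.
mapping = {str(label): index for index, label in enumerate(encoder.classes_)}
print(mapping)

# Arithmetic on the codes is possible but meaningless:
# the "average" of France and Spain lands exactly on Germany's code.
average_code = (mapping["France"] + mapping["Spain"]) / 2
print(average_code == mapping["Germany"])
```

A model that treats the codes as numbers can pick up exactly this kind of accidental relationship.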

#### Dummy Encoding

Each category becomes its own 0/1 column, so the number of new columns equals the number of categories.

In [10]:
# One-hot encode column 0 (Country); categorical_features exists only in scikit-learn < 0.22
oneHotEncoder = OneHotEncoder(categorical_features = [0])
X = oneHotEncoder.fit_transform(X).toarray()
pd.DataFrame(X)

Note: Since scikit-learn 0.20, OneHotEncoder accepts string categories directly, so the LabelEncoder step is no longer required before one-hot encoding.


Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0
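
On scikit-learn 0.22 and later, the `categorical_features` argument is gone. One possible replacement, assuming a recent scikit-learn, selects the column with a `ColumnTransformer`. The four-row frame is an illustrative excerpt of the dataset, and the transformer name `"country"` is arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First four rows of the feature matrix, kept small for illustration.
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

encoder = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",  # keep Age and Salary unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = encoder.fit_transform(features)

# Columns: France, Germany, Spain, Age, Salary
print(pd.DataFrame(encoded))
```

The strings are encoded directly, so no `LabelEncoder` step is needed in this variant.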


In [11]:
# Encode the dependent variable Y (No -> 0, Yes -> 1)
print("Encode the variable y: ")
labelEncoder_Y = LabelEncoder()
Y = labelEncoder_Y.fit_transform(Y)
pd.DataFrame(Y)

Encode the variable y: 


Unnamed: 0,0
0,0
1,1
2,0
3,0
4,1
5,1
6,0
7,1
8,0
9,1


>### Splitting the Dataset into the Training set and Test set

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [13]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,40.0,63777.777778
1,1.0,0.0,0.0,37.0,67000.0
2,0.0,0.0,1.0,27.0,48000.0
3,0.0,0.0,1.0,38.777778,52000.0
4,1.0,0.0,0.0,48.0,79000.0
5,0.0,0.0,1.0,38.0,61000.0
6,1.0,0.0,0.0,44.0,72000.0
7,1.0,0.0,0.0,35.0,58000.0


In [14]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,30.0,54000.0
1,0.0,1.0,0.0,50.0,83000.0


In [15]:
pd.DataFrame(y_train)

Unnamed: 0,0
0,1
1,1
2,1
3,0
4,1
5,0
6,0
7,1


In [16]:
pd.DataFrame(y_test)

Unnamed: 0,0
0,0
1,0


Note: The machine learning model has to establish correlations between the independent variables and the dependent variable on the training set. Once the model has learned these correlations, the test set is used to check whether it can apply them to observations it has not seen.
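
A minimal sketch of how the split behaves, using hypothetical arrays (`X_demo`, `y_demo`) instead of the dataset. It shows the 80/20 sizes, that no observation ends up in both sets, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical observations with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# The first feature is unique per row, so it identifies each observation.
train_ids = set(X_tr[:, 0].tolist())
test_ids = set(X_te[:, 0].tolist())
print(train_ids & test_ids)  # empty: the two sets do not overlap

# The same random_state gives the same split again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```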

> ### Feature Scaling

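- Many machine learning models are based on the Euclidean distance between observations.
- Without scaling, a feature with a large range (such as Salary) can dominate a feature with a small range (such as Age).
- Scaling puts all features on the same scale.
- Many algorithms converge much faster on scaled features.
- Scaled features can also improve the accuracy of the model.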
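
To see how one feature can dominate, the sketch below computes the Euclidean distance between two rows of the dataset, using only their Age and Salary values.

```python
import numpy as np

# (Age, Salary) of two observations from the dataset.
a = np.array([27.0, 48000.0])
b = np.array([48.0, 79000.0])

diff = b - a
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance that comes from Salary alone.
salary_share = diff[1] ** 2 / np.sum(diff ** 2)

print(distance)
print(salary_share)
```

The age difference of 21 years barely changes the distance, because the salary difference of 31000 is several orders of magnitude larger.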

> #### Standardisation

\begin{equation*}
x_{stand} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}
\end{equation*}
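
The formula can be applied directly with NumPy. The sketch below uses the Age values of the training set from the split above. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Age column of X_train, copied from the training set shown earlier.
ages = np.array([40.0, 37.0, 27.0, 38.777778, 48.0, 38.0, 44.0, 35.0])

# x_stand = (x - mean(x)) / std(x)
standardised = (ages - ages.mean()) / ages.std()

print(standardised)
```

After standardisation the column has mean 0 and standard deviation 1.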

> #### Normalisation

\begin{equation*}
x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}
\end{equation*}
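
A corresponding NumPy sketch for normalisation, using a few Salary values from the dataset as an illustrative sample. scikit-learn provides the same transformation as `MinMaxScaler` in `sklearn.preprocessing`.

```python
import numpy as np

salaries = np.array([48000.0, 54000.0, 61000.0, 72000.0, 83000.0])

# x_norm = (x - min(x)) / (max(x) - min(x))
value_range = salaries.max() - salaries.min()
normalised = (salaries - salaries.min()) / value_range

print(normalised)
```

The smallest value maps to 0, the largest to 1, and all others fall in between.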

In [0]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

In [18]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,-1.0,2.645751,-0.774597,0.263068,0.123815
1,1.0,-0.377964,-0.774597,-0.253501,0.461756
2,-1.0,-0.377964,1.290994,-1.975398,-1.530933
3,-1.0,-0.377964,1.290994,0.052614,-1.11142
4,1.0,-0.377964,-0.774597,1.640585,1.720297
5,-1.0,-0.377964,1.290994,-0.081312,-0.167514
6,1.0,-0.377964,-0.774597,0.951826,0.986148
7,1.0,-0.377964,-0.774597,-0.597881,-0.482149


In [19]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,-1.0,2.645751,-0.774597,-1.458829,-0.901663
1,-1.0,2.645751,-0.774597,1.984964,2.139811
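
The scaler is fitted on the training set only, and the test set is transformed with the mean and standard deviation learned there. A minimal sketch of this for the Age column alone, with the values copied from the split above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of the training and test sets, as single-column 2-D arrays.
train_age = np.array([[40.0], [37.0], [27.0], [38.777778],
                      [48.0], [38.0], [44.0], [35.0]])
test_age = np.array([[30.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learns mean and std from train
test_scaled = scaler.transform(test_age)        # reuses the training statistics

print(scaler.mean_)
print(test_scaled.ravel())
```

The test values should match the Age column (column 3) of the scaled `X_test` table above.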


> ### Data preprocessing template

In [0]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import Imputer # Missing-value imputation (only in scikit-learn < 0.22; later versions use sklearn.impute.SimpleImputer)
from sklearn.model_selection import train_test_split # Split the data into training and test sets

# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

# Splitting the dataset into the Training and Test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
