<a href="https://colab.research.google.com/github/abm4github/Machine-Learning-Beginners/blob/main/ML_DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 **MACHINE LEARNING**

**Data Preprocessing**   
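Cleaning data before using it for modeling is essential. This notebook walks through the usual preprocessing steps: handling missing values, encoding categorical data, splitting the data into training and test sets, and feature scaling.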

In [50]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [51]:
# loading data
dataset = pd.read_csv('dataprep.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

**Missing Data**  
Rows with missing values can simply be dropped when the dataset is very large and only a small share of it (around one percent) is affected. Otherwise, missing numerical values can be replaced with the column mean or median, and missing categorical values with the most frequently occurring value (the `'most_frequent'` strategy).
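
Before choosing between dropping and filling, it helps to measure how much is actually missing. The following is a minimal sketch on a made-up frame; the name `toy` and its values are illustrative assumptions, not the contents of `dataprep.csv`.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "Germany"],
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# fraction of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

# option 1: drop every row that contains at least one missing value
dropped = toy.dropna()
print(len(toy), "rows before,", len(dropped), "rows after dropping")

# option 2: keep all rows and fill the gaps instead
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Country"] = filled["Country"].fillna(filled["Country"].mode()[0])
print(filled)
```

On a frame this small, dropping loses half of the rows, which is why filling is usually preferred unless the data is plentiful.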

In [52]:
# handling missing values (numerical)
# imputing numerical type 
from sklearn.impute import SimpleImputer
imputer_numerical = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer_numerical.fit(X[:, 1:3])
X[:, 1:3] = imputer_numerical.transform(X[:, 1:3])

In [53]:
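# handling missing values (categorical)
# the imputer is fit on the whole array: the numerical columns were already
# imputed above, so only the Country column still has missing values to fill
imputer_categorical = SimpleImputer(strategy='most_frequent')
imputer_categorical.fit(X)
X = imputer_categorical.transform(X)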

In [54]:
# the changes are visible on X
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['France' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [55]:
# note: the original dataset is unchanged; only the array X was modified
print(dataset)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8      NaN  50.0  83000.0        No
9   France  37.0  67000.0       Yes
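
The frame still shows `NaN` because `X` is a separate NumPy array. For a frame with mixed column types, `.values` builds a new object array, so later writes to that array do not reach the frame. A small sketch of this behaviour, using a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, np.nan],
})

# mixed dtypes: .values materialises a new object array
X_toy = toy.values
X_toy[1, 1] = 99.0  # overwrite the missing age in the array only

print(X_toy[1, 1])        # value in the array
print(toy.loc[1, "Age"])  # value in the frame, still missing

# to carry the result back into a frame, build one explicitly
toy_filled = pd.DataFrame(X_toy, columns=toy.columns)
print(toy_filled)
```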


**Encoding categorical data in independent variables (features)**  
Most machine learning models only accept numerical input, therefore text and categorical data must be replaced with numerical values.  
If the data is ordinal (the categories have a natural order), the categories can be replaced with the integers 0, 1, 2, ...  
If the data is nominal (no natural order), each category can be replaced with a one-hot vector, as done with `OneHotEncoder` in the next cell.
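
The dataset in this notebook has no ordinal column, so here is a minimal sketch of the ordinal case on a made-up `sizes` feature (an assumption for illustration). It uses scikit-learn's `OrdinalEncoder` with the category order passed explicitly.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordinal feature: sizes have a natural order
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# pass the order explicitly; otherwise categories are sorted alphabetically
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())  # [0. 2. 1. 0.]
```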

In [56]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
col_trans = ColumnTransformer([('encoder', OneHotEncoder(), [0] )], remainder='passthrough')
X = np.array(col_trans.fit_transform(X))

In [57]:
# France is replaced with 1,0,0 
# Spain with 0,0,1 and
# Germany with 0,1,0
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
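
The order of the one-hot columns follows the sorted category names, which can be read from the encoder's `categories_` attribute. A short sketch on a made-up `countries` array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical single-column input, one country per row
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# the encoder returns a sparse matrix by default; convert it to a dense array
one_hot = encoder.fit_transform(countries).toarray()

# categories_ holds one array per encoded feature, in column order
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(one_hot[0])              # France -> [1. 0. 0.]
print(one_hot[1])              # Spain  -> [0. 0. 1.]
```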


**Encoding categorical data in the dependent variable / target variable (labels)**

In [58]:
from sklearn.preprocessing import LabelEncoder
label_encod = LabelEncoder()
y = label_encod.fit_transform(y)

In [59]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
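
To check which integer stands for which label, and to map predictions back to the original strings, `LabelEncoder` exposes `classes_` and `inverse_transform`. A small sketch on a made-up label array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# hypothetical labels, same two classes as the Purchased column
labels = np.array(["No", "Yes", "No", "Yes", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # ['No' 'Yes'] -> index 0 is No, index 1 is Yes
print(encoded)           # [0 1 0 1 1]

# map the integers back to the original strings
decoded = encoder.inverse_transform(encoded)
print(decoded)
```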


In [60]:
# y (the Purchased column of our dataset) is now encoded as 1 for Yes and 0 for No by LabelEncoder;
# the original dataset, shown below for comparison, still holds the strings
dataset

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8      NaN  50.0  83000.0        No
9   France  37.0  67000.0       Yes


**Splitting Data**  
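Data should be split before feature scaling, so that the scaling parameters (mean and standard deviation) are computed from the training set only and no information from the test set leaks into training.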

In [61]:
# splitting the data: 80% for training, 20% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
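
With only ten rows, a random split can leave the test set with a single class. One option, shown here as a sketch and not part of the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The arrays `X_toy` and `y_toy` are stand-ins made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical stand-in features; labels are five 0s and five 1s
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(np.bincount(y_te))       # [1 1] -> one test sample per class
```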

**Feature Scaling**

In [62]:
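# feature scaling is not applied to the one-hot columns (indices 0 to 2), only to the numerical columns (index 3 onward)
# the scaler is fit on the training set and the same parameters are reused on the test set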
from sklearn.preprocessing import StandardScaler
Std_Scaler = StandardScaler()
X_train[:, 3:] = Std_Scaler.fit_transform(X_train[:, 3:])
X_test[:, 3:] = Std_Scaler.transform(X_test[:, 3:])

In [63]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [1.0, 0.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

In [64]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)
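
`StandardScaler` replaces each value x with z = (x - mean) / std, where the mean and standard deviation come from the data the scaler was fit on. The sketch below checks this by hand on a few made-up ages (the arrays are illustrative assumptions) and shows that the test value is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical single-feature training and test data
train = np.array([[27.0], [38.0], [44.0], [35.0]])
test = np.array([[50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

# the same computation by hand
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

print(np.allclose(train_scaled, manual_train))  # True
print(np.allclose(test_scaled, manual_test))    # True
```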

The data is now ready to be fed into the machine learning model of your choice.
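
The same steps can also be bundled into one object with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a sketch of one possible arrangement, not the method used in the cells above: the frame is rebuilt by hand from the rows printed earlier, the variable names are assumptions, and here the imputers are fit on the training rows only, so the filled-in values can differ slightly from the earlier output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# the ten rows printed earlier, rebuilt by hand for a self-contained example
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", np.nan, "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})
features = data[["Country", "Age", "Salary"]]
target = data["Purchased"]

# numerical columns: fill with the mean, then standardise
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# categorical column: fill with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("cat", categorical_steps, ["Country"]),
     ("num", numeric_steps, ["Age", "Salary"])],
    sparse_threshold=0,  # always return a dense array
)

# split first, then fit the preprocessing on the training part only
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=1
)
train_ready = preprocess.fit_transform(f_train)
test_ready = preprocess.transform(f_test)
print(train_ready.shape, test_ready.shape)
```

Fitting everything on the training rows keeps the test rows out of every learned statistic, including the imputed means.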