# Machine Learning

### Textbook is available at: [https://www.github.com/a-mhamdi/isetbz](https://www.github.com/a-mhamdi/isetbz)

---


### Data Preprocessing Template

**Importing the libraries**


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Importing the dataset**


In [2]:
df = pd.read_csv("Datasets/Data.csv")
df.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


In [3]:
df.describe()

| | Age | Salary |
|---|---|---|
| count | 9.0 | 9.0 |
| mean | 38.777778 | 63777.777778 |
| std | 7.693793 | 12265.579662 |
| min | 27.0 | 48000.0 |
| 25% | 35.0 | 54000.0 |
| 50% | 38.0 | 61000.0 |
| 75% | 44.0 | 72000.0 |
| max | 50.0 | 83000.0 |


In [4]:
df["Purchased"].value_counts()

No     5
Yes    5
Name: Purchased, dtype: int64

**Extracting independent and dependent variables**


In [5]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [6]:
print("***** Features *****", X, sep="\n") # Features
print("***** Target *****", y, sep="\n") # Target

***** Features *****
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
***** Target *****
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
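
The slicing used above can be checked on a small frame. This is a minimal sketch that assumes a three-row excerpt of the dataset built inline (the real file is not read here): `iloc[:, :-1]` keeps every column except the last, and `iloc[:, -1]` keeps only the last. Because the feature columns mix strings and floats, `.values` yields an array of dtype `object`, which is why the printed features show strings next to numbers.

```python
import pandas as pd

# Three-row excerpt of the dataset, typed in by hand for illustration
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = df.iloc[:, :-1].values  # all rows, every column except the last
y = df.iloc[:, -1].values   # all rows, last column only

print(X.shape, X.dtype)     # mixed column types give dtype object
print(y.shape)
```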


**Imputation transformer for completing missing values**


[https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)


In [7]:
from sklearn.impute import SimpleImputer

In [8]:
si = SimpleImputer(missing_values=np.nan, strategy="mean")
X[:, 1:3] = si.fit_transform(X[:, 1:3])

In [9]:
print("***** Features *****", X, sep="\n") # Features
print("***** Target *****", y, sep="\n") # Target

***** Features *****
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
***** Target *****
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
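
To see what the `mean` strategy does, the following minimal sketch rebuilds the two numeric columns by hand (values copied from the features printed before imputation) and compares the fill values learned by `SimpleImputer` with the column means computed over the non-missing entries. The names `num`, `filled` and `col_means` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns, copied from the features printed before imputation
num = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

si = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = si.fit_transform(num)

# Per-column means that ignore NaN entries
col_means = np.nanmean(num, axis=0)

print(si.statistics_)  # fill values learned during fit
print(col_means)
```

The two printed arrays should agree, and the formerly missing cells of `filled` hold those values.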


**How to encode categorical data?**


_Case of independent variable_


In [10]:
from sklearn.compose import ColumnTransformer

[https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

In [11]:
from sklearn.preprocessing import OneHotEncoder

[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

In [12]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Display `X` after being encoded

In [13]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
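
The order of the three new columns follows the sorted category names, which can be read from the fitted encoder's `categories_` attribute. A minimal sketch on a standalone country column (the array `countries` is illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) as the encoder expects
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

ohe = OneHotEncoder()
# The default output is a sparse matrix; convert it for printing
encoded = ohe.fit_transform(countries).toarray()

print(ohe.categories_[0])  # categories in the order of the output columns
print(encoded)
```

Here `France` maps to the first column, `Germany` to the second and `Spain` to the third, which matches the encoded `X` shown above.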


_Case of dependent variable_


[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Fit `le` and return the encoded labels. `fit_transform` returns a new array and leaves `y` unchanged, so the result is stored in `y_encoded`

In [15]:
y_encoded = le.fit_transform(y)
print(y_encoded)

[0 1 0 0 1 1 0 1 0 1]
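
The mapping learned by a `LabelEncoder` can be inspected through `classes_` and reversed with `inverse_transform`. A minimal sketch, with the labels copied from the target printed earlier; the name `labels` and the other variable names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Target labels copied from the printed target variable
labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le = LabelEncoder()
codes = le.fit_transform(labels)         # integer code per label
restored = le.inverse_transform(codes)   # back to the original strings

print(le.classes_)  # position in this array is the integer code
print(codes)
print(restored)
```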


**Splitting the dataset into training set and test set**


[https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)


In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=123)

Print `y_train`, which holds $80\%$ of the target variable `y`

In [18]:
print(y_train)

['Yes' 'Yes' 'No' 'No' 'Yes' 'No' 'Yes' 'No']


Print `y_test`, which holds the remaining $20\%$ of the target variable `y`

In [19]:
print(y_test)

['Yes' 'No']
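
The split sizes and the role of `random_state` can be verified on placeholder data. This sketch assumes ten dummy samples instead of the real features: with `train_size=0.8`, eight samples go to the training set and two to the test set, and repeating the call with the same seed reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, targets 0..9
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=123
)
print(X_train.shape, X_test.shape)

# Same seed, same partition
_, _, _, y_test_again = train_test_split(X, y, train_size=0.8, random_state=123)
print(np.array_equal(y_test, y_test_again))
```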


**Scaling of features**


[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [20]:
from sklearn.preprocessing import StandardScaler

In [21]:
sc = StandardScaler()

`X_train` after scaling *(fit and transform: the mean $\mu$ and standard deviation $\sigma$ are learned from the training set and stored so that they can later be used to transform the test set)*

In [22]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
print(X_train)

[[1.0 0.0 0.0 1.352762210122472 1.3688002742136494]
 [1.0 0.0 0.0 -0.4009572201748046 -0.40011084938552827]
 [0.0 1.0 0.0 1.622565199398976 1.7057357263277784]
 [0.0 0.0 1.0 0.003747263739951552 -0.14740926029993145]
 [0.0 0.0 1.0 -1.480169177280821 -1.242449479670851]
 [0.0 0.0 1.0 0.10867064845859215 -0.9055140275567218]
 [1.0 0.0 0.0 -0.1311542308983005 0.3579939178712621]
 [0.0 1.0 0.0 -1.0754646933660648 -0.7370463014996573]]


`X_test` after scaling *(transform only, reusing the $\mu$ and $\sigma$ learned from the training set)*

In [23]:
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_test)

[[0.0 1.0 0.0 0.27355025301645564 0.08657369255710287]
 [1.0 0.0 0.0 0.8131562315694638 0.7791632330139234]]
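
Standardization applies $z = (x - \mu) / \sigma$ column by column, with $\mu$ and $\sigma$ taken from the training set only. The sketch below is a hand check under stated assumptions: the small `train` and `test` arrays are illustrative excerpts rather than the actual split, and the manual result is compared against `StandardScaler`, which exposes the learned statistics as `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Age/Salary rows, not the actual split from the notebook
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0]])
test = np.array([[35.0, 58000.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mu and sigma, then scales
test_scaled = sc.transform(test)        # reuses the training mu and sigma

# Manual computation with the training statistics
mu = train.mean(axis=0)
sigma = train.std(axis=0)               # population std (ddof=0)
manual_test = (test - mu) / sigma

print(sc.mean_, sc.scale_)
print(test_scaled)
print(manual_test)
```

Fitting the scaler on the test set instead would leak information about the test data into the preprocessing step, which is why only `transform` is called on `X_test`.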
