# Machine Learning

**Textbook is available @ [https://www.github.com/a-mhamdi/mlpy](https://www.github.com/a-mhamdi/mlpy)**

---

## Data Preprocessing Template

Choose preprocessing steps to suit the specific dataset and machine learning task. Preprocessing puts the data in a format the model can use, and it can also improve the model's ability to generalize.

Data preprocessing serves several purposes in machine learning:

1. Cleaning and formatting the data;
1. Normalizing the data;
1. Reducing the dimensionality of the data; and
1. Enhancing the interpretability of the model.


### Introduction to Data Scaling

In [1]:
import numpy as np

In [2]:
X = np.array([[1, -1], [0, 2], [4.5, -3], [0, 9], [1.3, -2], [5, 4]])
X

array([[ 1. , -1. ],
       [ 0. ,  2. ],
       [ 4.5, -3. ],
       [ 0. ,  9. ],
       [ 1.3, -2. ],
       [ 5. ,  4. ]])

In [3]:
import pandas as pd

In [4]:
df = pd.DataFrame(X, columns=['Col #1', 'Col #2'])
df

| | Col #1 | Col #2 |
|---|---|---|
| 0 | 1.0 | -1.0 |
| 1 | 0.0 | 2.0 |
| 2 | 4.5 | -3.0 |
| 3 | 0.0 | 9.0 |
| 4 | 1.3 | -2.0 |
| 5 | 5.0 | 4.0 |


In [5]:
df.describe()

| | Col #1 | Col #2 |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 1.966667 | 1.5 |
| std | 2.22411 | 4.505552 |
| min | 0.0 | -3.0 |
| 25% | 0.25 | -1.75 |
| 50% | 1.15 | 0.5 |
| 75% | 3.7 | 3.5 |
| max | 5.0 | 9.0 |


#### MinMaxScaler

In [6]:
X_pg = (X-X.min(axis=0))/(X.max(axis=0)-X.min(axis=0))
X_pg

array([[0.2       , 0.16666667],
       [0.        , 0.41666667],
       [0.9       , 0.        ],
       [0.        , 1.        ],
       [0.26      , 0.08333333],
       [1.        , 0.58333333]])

In [7]:
from sklearn.preprocessing import MinMaxScaler

In [8]:
X_mms = MinMaxScaler().fit_transform(X)
X_mms

array([[0.2       , 0.16666667],
       [0.        , 0.41666667],
       [0.9       , 0.        ],
       [0.        , 1.        ],
       [0.26      , 0.08333333],
       [1.        , 0.58333333]])
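
Both cells above apply the same column-wise rule, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. The following is a minimal sketch that rebuilds the array from the earlier cell and checks that the manual and `MinMaxScaler` results agree; the variable names `X_manual`, `X_sklearn` and `same` are chosen here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same toy array as in the cells above
X = np.array([[1, -1], [0, 2], [4.5, -3], [0, 9], [1.3, -2], [5, 4]])

# Manual min-max scaling, applied column by column (axis=0)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Same transform through scikit-learn
scaler = MinMaxScaler()
X_sklearn = scaler.fit_transform(X)

# The fitted scaler keeps the per-column minimum and maximum
print(scaler.data_min_, scaler.data_max_)

same = np.allclose(X_manual, X_sklearn)
print(same)
```

After fitting, every column spans exactly the interval from 0 to 1, which is why a single extreme value (such as 9 in the second column) compresses the remaining entries.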

#### StandardScaler

In [9]:
X_ms = (X-X.mean(axis=0))/(X.std(axis=0))
X_ms

array([[-0.4761141 , -0.60783067],
       [-0.96864593,  0.12156613],
       [ 1.2477473 , -1.09409521],
       [-0.96864593,  1.82349202],
       [-0.32835455, -0.85096294],
       [ 1.49401321,  0.60783067]])

In [10]:
from sklearn.preprocessing import StandardScaler

In [11]:
X_sc = StandardScaler().fit_transform(X)
X_sc

array([[-0.4761141 , -0.60783067],
       [-0.96864593,  0.12156613],
       [ 1.2477473 , -1.09409521],
       [-0.96864593,  1.82349202],
       [-0.32835455, -0.85096294],
       [ 1.49401321,  0.60783067]])
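
Note that the `std` row printed by `df.describe()` earlier (2.22411 and 4.505552) differs from the divisor used here. As a sketch of why: NumPy's `std` and `StandardScaler` divide by $n$ (`ddof=0`), while pandas defaults to $n-1$ (`ddof=1`). The names `std_population` and `std_sample` below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same toy array and DataFrame as in the cells above
X = np.array([[1, -1], [0, 2], [4.5, -3], [0, 9], [1.3, -2], [5, 4]])
df = pd.DataFrame(X, columns=['Col #1', 'Col #2'])

std_population = X.std(axis=0)      # NumPy default: ddof=0 (divide by n)
std_sample = df.std().to_numpy()    # pandas default: ddof=1 (divide by n - 1)

sc = StandardScaler().fit(X)

print(std_population)
print(std_sample)
print(sc.scale_)                    # the divisor StandardScaler actually uses

matches_population = np.allclose(sc.scale_, std_population)
print(matches_population)
```

On a dataset this small the two conventions give visibly different numbers; the gap shrinks as the number of rows grows.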

### Data Preprocessing Template

#### Importing the libraries


In [12]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [13]:
np.set_printoptions(precision=3)

#### Importing the dataset


In [14]:
df = pd.read_csv('./datasets/Data.csv')
df.head()

| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |


In [15]:
df.describe()

| | Age | Salary |
|---|---|---|
| count | 9.0 | 9.0 |
| mean | 38.777778 | 63777.777778 |
| std | 7.693793 | 12265.579662 |
| min | 27.0 | 48000.0 |
| 25% | 35.0 | 54000.0 |
| 50% | 38.0 | 61000.0 |
| 75% | 44.0 | 72000.0 |
| max | 50.0 | 83000.0 |


In [16]:
df['Purchased'].value_counts()

Purchased
No     5
Yes    5
Name: count, dtype: int64

#### Extracting independent and dependent variables


In [17]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [18]:
print('***** Features *****', X, sep='\n') # Features
print('***** Target *****', y, sep='\n') # Target

***** Features *****
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
***** Target *****
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


#### Imputation transformer for completing missing values


[https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)


In [19]:
from sklearn.impute import SimpleImputer

In [20]:
si = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:] = si.fit_transform(X[:, 1:])

In [21]:
print('***** Features *****', X, sep='\n') # Features
print('***** Target *****', y, sep='\n') # Target

***** Features *****
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
***** Target *****
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
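
The imputed values above (38.78 for `Age`, 63777.78 for `Salary`) are the column means computed over the non-missing entries. The sketch below reproduces this on the two numeric columns, retyped from the printed features because `Data.csv` is not assumed to be available; the array name `num` is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the printed features above
num = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

si = SimpleImputer(missing_values=np.nan, strategy='mean')
num_filled = si.fit_transform(num)

# Per-column fill values learned during fit
print(si.statistics_)

# Rows that originally contained a missing entry
print(num_filled[[4, 6]])
```

`SimpleImputer` also accepts `strategy='median'` or `strategy='most_frequent'`, which may be preferable when a column contains outliers or is categorical.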


#### How to encode categorical data?


##### Case of two categories

[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

In [22]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Fit `le` and return encoded labels

In [23]:
y = le.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]


**MARGINAL NOTE**

Try to fit and transform a new `LabelEncoder` instance on the `Country` column.

>```python
>ce = LabelEncoder()
>country = ce.fit_transform(X[:, 0]) # You can use `df.Country` instead
>```

The original values can be recovered from the integer codes with `inverse_transform`:

>```python
>X[:, 0] = ce.inverse_transform(country) # X[:, 0].astype(int)
>```
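
As a self-contained sketch of the note above, the country column is retyped here from the printed features. `LabelEncoder` sorts the distinct labels and assigns integers in that order; the mapping is stored in `classes_`. The names `codes` and `restored` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Country column copied from the printed features above
country = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                    'France', 'Spain', 'France', 'Germany', 'France'])

ce = LabelEncoder()
codes = ce.fit_transform(country)

print(ce.classes_)   # sorted labels; position in this array is the integer code
print(codes)

# Map the integer codes back to the original strings
restored = ce.inverse_transform(codes)
print(restored)
```

The integer codes imply an ordering (France < Germany < Spain) that has no real meaning for countries, which motivates one-hot encoding for features with more than two categories.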

##### Case of multiple categories

In [24]:
from sklearn.compose import ColumnTransformer

[https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

In [25]:
from sklearn.preprocessing import OneHotEncoder

[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

In [26]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
ct

In [27]:
X = np.array(ct.fit_transform(X))

Display `X` after being encoded

In [28]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
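
In the output above, the first three columns are the one-hot encoding of `Country`, followed by the untouched `Age` and `Salary`. A minimal sketch on a four-row excerpt shows how to read which column corresponds to which category, using the encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First four countries from the dataset, as a single-column 2D array
country = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display
onehot = ohe.fit_transform(country).toarray()

print(ohe.categories_)   # column order of the encoded output
print(onehot)
```

The categories are sorted alphabetically, so the columns read France, Germany, Spain, which matches the `[1.0 0.0 0.0]` pattern for the French rows in the full output.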


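_REMARK_

A single `ColumnTransformer` can apply different encoders to different columns. In the toy example below, column 2 is ordinal-encoded, column 0 is one-hot encoded, and column 1 is passed through unchanged.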

In [29]:
z = [['Python'], ['Julia'], ['Rust'], ['JavaScript']]
Z = [3 * _ for _ in z]
Z

[['Python', 'Python', 'Python'],
 ['Julia', 'Julia', 'Julia'],
 ['Rust', 'Rust', 'Rust'],
 ['JavaScript', 'JavaScript', 'JavaScript']]

In [30]:
from sklearn.preprocessing import OrdinalEncoder

In [31]:
ctz = ColumnTransformer(transformers=[('oe', OrdinalEncoder(), [2]), ('ohe', OneHotEncoder(), [0])], remainder='passthrough')
ctz

In [32]:
ctz.fit_transform(Z)

array([[2.0, 0.0, 0.0, 1.0, 0.0, 'Python'],
       [1.0, 0.0, 1.0, 0.0, 0.0, 'Julia'],
       [3.0, 0.0, 0.0, 0.0, 1.0, 'Rust'],
       [0.0, 1.0, 0.0, 0.0, 0.0, 'JavaScript']], dtype=object)

#### Splitting the dataset into training set and test set


[https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)


In [33]:
from sklearn.model_selection import train_test_split

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=123)

Print `y_train`, which holds $80\%$ of the target variable `y`

In [35]:
print(y_train)

[1 1 0 0 1 0 1 0]


Print `y_test`, which holds the remaining $20\%$ of the target variable `y`

In [36]:
print(y_test)

[1 0]
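
A purely random split does not guarantee that both classes appear in a test set this small. One option, which the cell above does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions in both parts. The sketch below reuses the encoded target from earlier; the feature matrix `X_demo` is a hypothetical placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Encoded target from the earlier cell; X_demo is a placeholder feature matrix
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
X_demo = np.arange(10).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y, train_size=0.8, random_state=123, stratify=y
)

print(X_tr.shape, X_te.shape)
# Class counts in each part (index 0 is 'No', index 1 is 'Yes')
print(np.bincount(y_tr), np.bincount(y_te))
```

With five samples per class and an 80/20 split, stratification places four of each class in the training set and one of each in the test set.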


#### Scaling of features


[https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [37]:
from sklearn.preprocessing import StandardScaler

In [38]:
sc = StandardScaler()

`X_train` after scaling *(fit and transform: the mean $\mu$ and standard deviation $\sigma$ are stored and later used to transform the test set)*

In [39]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
print(X_train)

[[1.0 0.0 0.0 1.352762210122472 1.3688002742136494]
 [1.0 0.0 0.0 -0.4009572201748046 -0.40011084938552827]
 [0.0 1.0 0.0 1.622565199398976 1.7057357263277784]
 [0.0 0.0 1.0 0.003747263739951552 -0.14740926029993145]
 [0.0 0.0 1.0 -1.480169177280821 -1.242449479670851]
 [0.0 0.0 1.0 0.10867064845859215 -0.9055140275567218]
 [1.0 0.0 0.0 -0.1311542308983005 0.3579939178712621]
 [0.0 1.0 0.0 -1.0754646933660648 -0.7370463014996573]]


`X_test` after scaling *(transform only, reusing the $\mu$ and $\sigma$ learned from the training set)*

In [40]:
X_test[:, 3:] = sc.transform(X_test[:, 3:])
print(X_test)

[[0.0 1.0 0.0 0.27355025301645564 0.08657369255710287]
 [1.0 0.0 0.0 0.8131562315694638 0.7791632330139234]]
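
The test set is transformed with the statistics learned from the training set so that no information from the test data leaks into preprocessing. The sketch below illustrates this on a few `Age`/`Salary` rows taken from the dataset; the particular train/test assignment is hypothetical and does not reproduce the split above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few Age/Salary rows from the dataset; this train/test assignment is illustrative
train = np.array([[44., 72000.], [27., 48000.], [30., 54000.], [38., 61000.]])
test = np.array([[35., 58000.], [48., 79000.]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean_ and scale_ from train only
test_scaled = sc.transform(test)         # reuses the stored statistics

# Equivalent manual computation using the stored training statistics
manual = (test - sc.mean_) / sc.scale_

print(sc.mean_, sc.scale_)
print(np.allclose(test_scaled, manual))
print(test_scaled.mean(axis=0))          # generally not zero, and that is expected
```

Only the training columns end up with exactly zero mean and unit variance; the scaled test columns are merely expressed in the same units.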
