# Introduction to Supervised Machine Learning (SML)

Welcome to this introduction to machine learning (ML). In this session we cover the following topics:
1. Generalizing from and validating ML models
2. The bias-variance trade-off
3. Out-of-sample testing and cross-validation workflows
4. Implementing ML workflows in the Python (scikit-learn) ecosystem

In [None]:
# loading essential libraries

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

sns.set(style="darkgrid", color_codes=True)

# ML Case 2 (Classification, tabular data): Penguins

## Data Description

In [None]:
penguins = pd.read_csv("https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv")

In [None]:
penguins.head()

## Preprocessing

In [None]:
penguins.dropna(inplace=True)

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
# Import the confusion matrix plotter module
from mlxtend.plotting import plot_confusion_matrix

In [None]:
# features: the four numeric measurement columns at positions 2 to 5
X = penguins.iloc[:, 2:6]

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
# This is new: we encode the categorical target (species) as integer codes.
y = penguins.iloc[:, 0]
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
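
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does. The toy label list is made up for illustration and is not read from the penguin data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the species column (illustrative values only)
species = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the distinct labels in sorted order; position i is integer code i
print(encoder.classes_.tolist())  # ['Adelie', 'Chinstrap', 'Gentoo']
print(codes.tolist())             # [2, 0, 1, 0]

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())           # ['Gentoo', 'Adelie', 'Chinstrap', 'Adelie']
```

The integer codes are arbitrary identifiers, which is fine for a classification target but would impose a fake ordering if used on an input feature.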


Let's split the data into training and test sets and fit a simple logistic regression model.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)
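
With unevenly sized classes, a purely random split can leave a rare class under-represented in the test set. One option, sketched here on synthetic labels (the `_demo` names and class sizes are assumptions for illustration), is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only): 70 / 20 / 10 across three classes
rng = np.random.default_rng(21)
y_demo = np.repeat([0, 1, 2], [70, 20, 10])
X_demo = rng.normal(size=(100, 4))

# stratify=y_demo asks for the same class shares in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training partition
print(np.bincount(y_te))  # class counts in the test partition
```

Whether stratification matters for the penguin data depends on how balanced the species are after dropping missing rows, so treat this as an option rather than a required step.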

## Fitting SML models

In [None]:
from sklearn.linear_model import LogisticRegression
# one-vs-rest, since we have 3 classes
# (note: recent scikit-learn releases deprecate the `multi_class` argument)
model = LogisticRegression(multi_class="ovr")
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)
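
Accuracy on the training data tends to be optimistic, because the model has already seen these rows. A common out-of-sample alternative is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` (an assumption, so that it runs without the penguin file) and wraps the scaler and the model in one pipeline, so that the scaler is re-fit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem with 4 features (stand-in for the penguin measurements)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=21,
)

# Scaling lives inside the pipeline, so each fold scales using its own training rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: one accuracy value per held-out fold
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(cv_scores.mean(), cv_scores.std())
```

The mean of the fold scores is an estimate of out-of-sample accuracy, and the spread across folds gives a rough sense of how stable that estimate is.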

We can use the fitted LabelEncoder to map the integer predictions back to the original species labels and see how well the model performed on the training data.

In [None]:
true_penguins = labelencoder_y.inverse_transform(y_train)

predicted_penguins = labelencoder_y.inverse_transform(model.predict(X_train))

In [None]:
df = pd.DataFrame({'true_penguins': true_penguins, 'predicted_penguins': predicted_penguins})

pd.crosstab(df.true_penguins, df.predicted_penguins)

In [None]:
print(classification_report(true_penguins,predicted_penguins, labels=labelencoder_y.classes_))
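
The cross-tabulation above is a confusion matrix. As a small sketch with made-up labels (not model output), `confusion_matrix` from `sklearn.metrics` produces the same table as an array, with rows for true classes and columns for predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
labels = ["Adelie", "Chinstrap", "Gentoo"]
true_demo = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Chinstrap"]
pred_demo = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo", "Adelie"]

# Rows follow the true class, columns the predicted class, both in the order of `labels`
cm = confusion_matrix(true_demo, pred_demo, labels=labels)
print(cm)
# [[1 1 0]
#  [1 1 0]
#  [0 0 2]]

# Accuracy is the share of observations on the diagonal
accuracy = cm.trace() / cm.sum()
print(accuracy)
```

Off-diagonal cells show which species get confused with which, something a single accuracy number hides.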

For a dataset this small it is probably overkill, but this is how easily we can switch to a **non-sklearn algorithm** that follows the same interface.

In [None]:
# Let's use an advanced classifier algorithm
from xgboost import XGBClassifier

In [None]:
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
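
The swap works because the classifier exposes the same `fit` / `predict` / `score` methods as scikit-learn estimators. The sketch below illustrates that idea with a hypothetical `MajorityClassifier` (a toy class written for this example, not part of any library) and synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class MajorityClassifier:
    """Hypothetical stand-in for a third-party model: always predicts the most frequent class."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # Accuracy, mirroring what scikit-learn classifiers report
        return float(np.mean(self.predict(X) == y))


# Synthetic binary problem, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=21)

# The loop body never needs to know which kind of model it is handling
results = {}
for candidate in (LogisticRegression(max_iter=1000), MajorityClassifier()):
    candidate.fit(X_demo, y_demo)
    results[type(candidate).__name__] = candidate.score(X_demo, y_demo)
print(results)
```

A majority-class model like this also serves as a simple baseline: a real classifier should beat it comfortably.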

In [None]:
true_penguins = labelencoder_y.inverse_transform(y_train)

predicted_penguins = labelencoder_y.inverse_transform(model.predict(X_train))

print(classification_report(true_penguins,predicted_penguins, labels=labelencoder_y.classes_))

In [None]:
df = pd.DataFrame({'true_penguins': true_penguins, 'predicted_penguins': predicted_penguins})

pd.crosstab(df.true_penguins, df.predicted_penguins)

In [None]:
model.score(X_train, y_train)

In [None]:
# Final evaluation on the held-out test set
model.score(X_test, y_test)

In [None]:
true_penguins = labelencoder_y.inverse_transform(y_test)
predicted_penguins = labelencoder_y.inverse_transform(model.predict(X_test))
print(classification_report(true_penguins,predicted_penguins, labels=labelencoder_y.classes_))

In [None]:
df = pd.DataFrame({'true_penguins': true_penguins, 'predicted_penguins': predicted_penguins})
pd.crosstab(df.true_penguins, df.predicted_penguins)

## Bonus: Regression problem

Now, let's return to a quick regression example using the penguin data: we predict culmen length from two scaled numeric measurements and species dummy variables.

In [None]:
penguins.head()

In [None]:
y = penguins['culmen_length_mm']

In [None]:
X_dum = penguins.species_short.str.get_dummies()

In [None]:
X = pd.concat([penguins.iloc[:,4:6], X_dum], axis=1)

In [None]:
X.iloc[:,0:2] = StandardScaler().fit_transform(X.iloc[:,0:2])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21)
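
In the cells above, the scaler is fit on all rows before the split, so the test rows influence the scaling. An alternative is to bundle preprocessing and model into one pipeline and fit it on the training rows only. This is a sketch under assumptions: the data frame is synthetic, and the column names simply mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the penguin table (values are made up)
rng = np.random.default_rng(21)
n = 150
demo = pd.DataFrame({
    "flipper_length_mm": rng.normal(200, 14, n),
    "body_mass_g": rng.normal(4200, 800, n),
    "species_short": rng.choice(["Adelie", "Chinstrap", "Gentoo"], n),
})
target = 0.1 * demo["flipper_length_mm"] + 0.002 * demo["body_mass_g"] + rng.normal(0, 1, n)

# Scale the numeric columns and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["flipper_length_mm", "body_mass_g"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["species_short"]),
])
reg_pipe = Pipeline([("prep", preprocess), ("ols", LinearRegression())])

# Split first, then fit: the scaler and encoder only ever see the training rows
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=21)
reg_pipe.fit(X_tr, y_tr)
r2_test = reg_pipe.score(X_te, y_te)
print(r2_test)
```

On a dataset of this size the difference is likely small, but the pipeline form keeps the test set untouched by construction.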

In [None]:
from xgboost import XGBRegressor

In [None]:
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor

In [None]:
model_ols = LinearRegression()
model_rf = RandomForestRegressor()
model_xgb = XGBRegressor()

In [None]:
model_ols.fit(X_train, y_train)

In [None]:
model_rf.fit(X_train, y_train)

In [None]:
model_xgb.fit(X_train, y_train)

In [None]:
model_ols.score(X_test, y_test)

In [None]:
model_rf.score(X_test, y_test)

In [None]:
model_xgb.score(X_test, y_test)
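
For regressors, `.score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks this by hand on four made-up values and adds two error metrics that are expressed in the units of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([40.0, 45.0, 50.0, 38.0])
y_hat = np.array([41.0, 44.0, 48.0, 39.0])

# R^2 from scikit-learn and from its definition
r2 = r2_score(y_true, y_hat)
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# Error metrics in the same units as the target
mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print(r2, r2_manual, mae, rmse)
```

$R^2$ is unit-free and handy for comparing models, while MAE and RMSE say how far off the predictions are in millimetres.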

In [None]:
model_ols.score(X_train, y_train)

In [None]:
model_rf.score(X_train, y_train)

In [None]:
model_xgb.score(X_train, y_train)
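
Comparing training and test scores is a quick way to see the bias-variance trade-off at work. As a sketch on synthetic data (a noisy sine curve, chosen only for illustration), decision trees of increasing depth fit the training set better and better, while the test score does not necessarily follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus noise (illustrative only)
rng = np.random.default_rng(21)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

# max_depth=None lets the tree grow until it memorizes the training rows
gaps = {}
for depth in (1, 3, None):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=21).fit(X_tr, y_tr)
    gaps[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, gaps[depth])  # (train R^2, test R^2)
```

A very shallow tree scores poorly on both sets (high bias). An unrestricted tree reaches a near-perfect training score, and the gap to its test score reflects variance.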

In [None]:
y_pred_rf = model_rf.predict(X_test)

In [None]:
y_pred_ols = model_ols.predict(X_test)

In [None]:
y_pred_xgb = model_xgb.predict(X_test)

In [None]:
sns.scatterplot(x=y_test, y=y_pred_ols)

In [None]:
sns.scatterplot(x=y_test, y=y_pred_rf)

In [None]:
sns.scatterplot(x=y_test, y=y_pred_xgb)