# Introduction to Supervised Machine Learning (SML)

Welcome to this introduction to machine learning (ML). In this session we cover the following topics:
1. Generalizing from and validating ML models
2. The bias-variance trade-off
3. Out-of-sample testing and cross-validation workflows
4. Implementing ML workflows in the Python (scikit-learn) ecosystem

In [None]:
# loading essential libraries

import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

sns.set(style="darkgrid", color_codes=True)

# The very basics:

## Regression problems

Let's do a brief example with a simple linear model. We generate some data where $y$ is a linear function of $x$ plus some random error.

Here the idea is to get the model to recover the parameters that we intentionally set when generating the data. It is also a useful way of thinking about the relationship between inferential statistics / econometrics and machine learning.

In [None]:
np.random.seed(21)

beta0 = 15
beta1 = 0.3

In [None]:
x = np.random.uniform(0,100, 500)
y = beta0 + (beta1*x) + np.random.normal(0, 5, 500)

In [None]:
sns.scatterplot(x = x, y = y)

Now that we have generated some data, we can use `statsmodels` to run a simple linear regression.
PS: Later we will not be using `statsmodels`, as it isn't really a machine learning package. However, it is useful for this demo.

In [None]:
X = sm.add_constant(x)

mod = sm.OLS(y, X)
res = mod.fit()

print(res.summary2())

In [None]:
sns.regplot(x = x, y = y, line_kws={"color": "orange"})


The fit recovers the underlying relationship reasonably well. Keep in mind that its ability to do so is limited by the finite sample: random noise makes the estimated coefficients deviate somewhat from the true values.

We can now use `predict()` to obtain predicted $y$ values from the fitted model.
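To make the comparison concrete, one can put the estimated coefficients side by side with the values used to generate the data. The following is a minimal sketch under stated assumptions: it redraws data with the same recipe as above from a local random generator (so the numbers will differ from the notebook's), and it fits the line with NumPy's `np.linalg.lstsq` instead of `statsmodels`, purely to stay self-contained. The names `true_beta`, `design`, and `beta_hat` are illustrative.

```python
import numpy as np

# Assumption: same data-generating recipe as above, but with a local random
# generator so the notebook's global random state and variables are untouched.
rng = np.random.default_rng(21)
true_beta = np.array([15.0, 0.3])  # intercept and slope used to generate the data

x_demo = rng.uniform(0, 100, 500)
y_demo = true_beta[0] + true_beta[1] * x_demo + rng.normal(0, 5, 500)

# Design matrix with a column of ones for the intercept (the role of sm.add_constant)
design = np.column_stack([np.ones_like(x_demo), x_demo])

# Ordinary least squares via NumPy
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print("true parameters:     ", true_beta)
print("estimated parameters:", beta_hat)
print("estimation error:    ", beta_hat - true_beta)
```

The estimation error should be small but not zero; rerunning with a different seed gives slightly different estimates, which is the sampling variability mentioned above.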



In [None]:
y_pred = res.predict(X)

In [None]:
sns.scatterplot(x = x, y = y, alpha=0.5)
sns.scatterplot(x = x, y = y_pred)

As expected, the predictions lie along the fitted straight line. Because of the random noise we introduced, they are usually off by a bit. Let's calculate the error term.



In [None]:
error_reg = y - y_pred

In [None]:
np.mean(error_reg)


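On average the error is very low. However, keep in mind that positive and negative errors cancel each other out (for OLS with an intercept, the in-sample residuals average to zero by construction). Let's look at the root mean squared error (RMSE) instead.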



In [None]:
np.sqrt(np.mean(error_reg**2))
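Why the mean error is a poor performance measure can be seen on a toy vector. The sketch below uses made-up error values (`toy_errors` is hypothetical, not taken from the model above) and also computes the mean absolute error (MAE) for comparison.

```python
import numpy as np

# Hypothetical prediction errors: two under-predictions and two over-predictions
toy_errors = np.array([-2.0, 2.0, -1.0, 1.0])

mean_error = np.mean(toy_errors)            # signs cancel
mae = np.mean(np.abs(toy_errors))           # mean absolute error
rmse = np.sqrt(np.mean(toy_errors ** 2))    # root mean squared error

print(f"mean error: {mean_error:.3f}")
print(f"MAE:        {mae:.3f}")
print(f"RMSE:       {rmse:.3f}")
```

The mean error is exactly 0 even though every single prediction is off. MAE and RMSE both stay positive, and RMSE comes out larger than MAE because squaring weighs the larger errors more heavily.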


However, we predicted on the same data the model was fitted on. How would it fare on new data?



In [None]:
np.random.seed(42)

In [None]:
x_new = np.random.uniform(0,100, 500)
y_new = beta0 + (beta1*x_new) + np.random.normal(0, 5, 500)

In [None]:
X_new = sm.add_constant(x_new)

In [None]:
y_pred_new = res.predict(X_new)

In [None]:
error_reg_new = y_new - y_pred_new

In [None]:
np.sqrt(np.mean(error_reg_new**2))
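The in-sample versus out-of-sample comparison above can be packaged into a few lines. This is a minimal sketch, not the notebook's exact computation: it assumes local random generators in place of the two global seeds, fits the line with `np.polyfit` rather than `statsmodels`, and uses the hypothetical helpers `rmse` and `simulate`.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error between observed and predicted values."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def simulate(rng, n=500, beta0=15.0, beta1=0.3, noise_sd=5.0):
    """Draw one sample from the linear data-generating process used above."""
    x = rng.uniform(0, 100, n)
    y = beta0 + beta1 * x + rng.normal(0, noise_sd, n)
    return x, y

# Assumption: separate local generators stand in for the two seeds used above
x_train, y_train = simulate(np.random.default_rng(21))
x_test, y_test = simulate(np.random.default_rng(42))

# Fit on the training sample only (np.polyfit returns the slope first, then the intercept)
slope, intercept = np.polyfit(x_train, y_train, deg=1)

rmse_in = rmse(y_train, intercept + slope * x_train)
rmse_out = rmse(y_test, intercept + slope * x_test)

print(f"in-sample RMSE:     {rmse_in:.3f}")
print(f"out-of-sample RMSE: {rmse_out:.3f}")
```

Both values should land near the noise standard deviation of 5. A model this simple barely overfits, so the two numbers stay close; with more flexible models the gap between them is what out-of-sample testing is meant to reveal.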


## Classification problems

OK, let's try the same with a binary class prediction. Let's create a random $x$ and an associated binary $y$.



In [None]:
np.random.seed(21)

In [None]:
beta1 = 5

x = np.random.normal(0, 1, 500)
y = np.random.binomial(1, 1/(1+np.exp(-(beta1*x))))

In [None]:
sns.scatterplot(x = x, y = y, alpha=0.5)
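The expression passed to `np.random.binomial` above is the logistic (sigmoid) function, which maps the linear predictor `beta1*x` to a probability between 0 and 1. The small sketch below, with `sigmoid`, `slope`, and `x_grid` as illustrative names, shows how quickly that probability changes with $x$ when the coefficient is 5.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

slope = 5  # same value as beta1 above
x_grid = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
p_grid = sigmoid(slope * x_grid)

for x_val, p_val in zip(x_grid, p_grid):
    print(f"x = {x_val:+.1f}  ->  P(y = 1) = {p_val:.3f}")
```

At $x = 0$ the probability is exactly 0.5, while one unit away it is already close to 0 or 1. This is why the two classes in the scatter plot overlap mainly around zero.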


Let's fit a logistic regression on that.


In [None]:
X = sm.add_constant(x)

mod = sm.GLM(y, X, sm.families.Binomial())
res = mod.fit()

In [None]:
print(res.summary2())



We can again visualize it:



In [None]:
sns.regplot(x= x, y = y, logistic=True)


We can use this fitted model to predict the class of each data point. Here, we have the choice to report either the **predicted class** or the **predicted probability**. We do both.



In [None]:
y_pred = res.predict(X)
y_pred_class = np.round(y_pred)
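Using `np.round` amounts to an implicit decision threshold of 0.5. One detail worth knowing is that NumPy rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes class 0. An explicit comparison makes the threshold visible and easy to change. A minimal sketch, where `toy_probs` and `threshold` are illustrative and not taken from the fitted model:

```python
import numpy as np

# Hypothetical predicted probabilities
toy_probs = np.array([0.10, 0.49, 0.50, 0.51, 0.90])
threshold = 0.5

class_rounded = np.round(toy_probs).astype(int)         # 0.5 rounds to 0 (half to even)
class_explicit = (toy_probs >= threshold).astype(int)   # 0.5 is assigned to class 1

print(class_rounded)
print(class_explicit)
```

The two versions differ only for the probability of exactly 0.5. The explicit form also allows other cut-offs, for example a lower threshold when missing a positive case is more costly than a false alarm.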

In [None]:
data_class = pd.DataFrame({'x':x,
                           'y':y,
                           'predicted' : y_pred,
                           'predicted_class' : y_pred_class})

In [None]:
data_class.head()

From here we can look into different ways to measure the performance of our trained model.

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
cm_log = confusion_matrix(y, y_pred_class)

In [None]:
cm_log

In [None]:
# Import the confusion matrix plotter module
from mlxtend.plotting import plot_confusion_matrix

In [None]:
plot_confusion_matrix(conf_mat=cm_log,
                      show_absolute=True,
                      show_normed=True)

In [None]:
print(classification_report(y,y_pred_class))
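To see where the numbers in the confusion matrix and the classification report come from, the same quantities can be computed by hand. The sketch below uses a small made-up example (`y_true_toy` and `y_pred_toy` are hypothetical) and reports the metrics for the positive class only, whereas `classification_report` lists them for each class.

```python
import numpy as np

# Hypothetical true labels and predicted classes
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))  # true positives
tn = np.sum((y_true_toy == 0) & (y_pred_toy == 0))  # true negatives
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))  # false positives
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))  # false negatives

# Same layout as sklearn's confusion_matrix: rows = true class, columns = predicted class
cm_toy = np.array([[tn, fp],
                   [fn, tp]])

accuracy = (tp + tn) / cm_toy.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(cm_toy)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so accuracy, precision, recall, and F1 all equal 0.75. On real data these metrics usually diverge, which is why the report shows them separately.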