# Supervised learning (Misleading Accuracies)

In this notebook we focus on how accuracy measures in **supervised learning** can be misleading. **Supervised learning** is a set of algorithms that take **labeled** data and try to **predict** the label from the other **features** in the data. Supervised learning has so far dominated applications of machine learning, although **reinforcement learning** is catching up. In **unsupervised learning** the data is not labeled, so evaluating the results involves a lot of subjectivity. In contrast, **supervised learning** algorithms, once trained on data, can be evaluated by comparing their **predictions** to the **labels** (in this context, we sometimes refer to the labels as **ground truth**). Sometimes these comparisons can be misleading.
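
To make the idea of a misleading comparison concrete, here is a minimal sketch using made-up labels (90 "no" and 10 "yes", an assumption for illustration only and not taken from the bank data). A "model" that ignores the features and always predicts the majority class still scores a high accuracy, even though it never finds a single "yes".

```python
import numpy as np

# Hypothetical labels: 90 "no" and 10 "yes" (made up for illustration).
labels = np.array(["no"] * 90 + ["yes"] * 10)

# A "model" that ignores all features and always predicts the majority class.
predictions = np.full(labels.shape, "no")

# Accuracy: fraction of predictions that match the labels.
accuracy = (predictions == labels).mean()

# Fraction of the actual "yes" cases that the model identifies.
recall_yes = (predictions[labels == "yes"] == "yes").mean()

print(f"Accuracy: {accuracy:.2f}")
print(f"Share of 'yes' cases found: {recall_yes:.2f}")
```

Comparing any model against a trivial baseline like this one is a quick way to check whether a metric value means what it appears to mean.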

As usual, let's begin by reading some data. We use a bank marketing data set, which has demographic and activity data about bank customers, as well as information about previous attempts to contact them for a marketing campaign. The target `y` is binary and indicates whether the client signed up for a term deposit or not.

You can read more about the data [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.simplefilter("ignore")

In [None]:
bank = pd.read_csv("./data/bank-full.csv", sep = ";").sample(5000)
print(bank.shape)
bank.head()

Since numeric and categorical features are often pre-processed differently, we will create variables that store the names of each to make it easier to refer to them later.

In [None]:
num_cols = bank.select_dtypes(['integer', 'float']).columns
cat_cols = bank.select_dtypes(['object']).drop(columns = "y").columns

print("Numeric columns are {}.".format(", ".join(num_cols)))
print("Categorical columns are {}.".format(", ".join(cat_cols)))

As usual, before we can proceed to machine learning, we need to get the data ready. Since we're doing supervised learning, we also need to set aside test data so that we can later evaluate the model on data it has not seen during training. So let's begin by splitting the data.

In [None]:
from sklearn.model_selection import train_test_split

X = bank.drop(columns = "y") # features
y = bank["y"] # label

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.15, random_state = 42)

In [None]:
X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

In [None]:
print(f"Training data has {X_train.shape[0]} rows.")
print(f"Test data has {X_test.shape[0]} rows.")

Before we begin our journey of trying out different algorithms in `sklearn` we do need to encode our categorical features.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
onehoter = OneHotEncoder(sparse_output = False, drop = "first")
onehoter.fit(X_train[cat_cols])
onehot_cols = onehoter.get_feature_names_out(cat_cols)
X_train_onehot = pd.DataFrame(onehoter.transform(X_train[cat_cols]), columns = onehot_cols)
X_test_onehot = pd.DataFrame(onehoter.transform(X_test[cat_cols]), columns = onehot_cols)
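
To see what one-hot encoding with the first level dropped produces, here is a small sketch on a made-up column. It uses `pandas.get_dummies` rather than the `OneHotEncoder` used above, purely because it is compact. The toy data frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical column (values made up for illustration).
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# One column per category, except the first one in sorted order ("divorced"),
# which is represented by zeros in all remaining columns.
encoded = pd.get_dummies(toy, columns=["marital"], drop_first=True, dtype=float)

print(encoded)
```

Dropping one level avoids a redundant column: a row with zeros in every remaining column can only belong to the dropped category.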

Some of the algorithms we're going to use (such as decision trees) don't require us to normalize the numeric features, but most do. Skipping this step won't break the algorithm, but just as we saw in the case of k-means, it can skew the results. So let's Z-normalize our numeric features now.

In [None]:
from sklearn.preprocessing import StandardScaler

znormalizer = StandardScaler()
znormalizer.fit(X_train[num_cols])
X_train_norm = pd.DataFrame(znormalizer.transform(X_train[num_cols]), columns = num_cols)
X_test_norm = pd.DataFrame(znormalizer.transform(X_test[num_cols]), columns = num_cols)
X_train_norm.head()
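
As a rough sketch of what Z-normalization does under the hood, the same calculation can be written in plain `numpy`. The arrays below are made up for illustration. The key point is that the mean and standard deviation come from the training data only and are then reused on the test data.

```python
import numpy as np

# Hypothetical numeric feature, split into training and test values.
train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test = np.array([3.0, 6.0])

# Statistics are learned from the training data only.
mean = train.mean()
std = train.std()  # population standard deviation (ddof=0), as StandardScaler uses

# The same training statistics are applied to both sets.
train_norm = (train - mean) / std
test_norm = (test - mean) / std

print("Normalized training mean and std:", train_norm.mean(), train_norm.std())
print("Normalized test values:", test_norm)
```

After normalization the training values have mean 0 and standard deviation 1. The test values generally do not, because they are scaled with the training statistics.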

We now join our numeric features and our one-hot-encoded categorical features into one data set, called `X_train_featurized`, that we can pass to the models we train.

In [None]:
# Join the columns
X_train_featurized = X_train_onehot # add one-hot-encoded columns
X_test_featurized = X_test_onehot   # add one-hot-encoded columns
X_train_featurized[num_cols] = X_train_norm # add numeric columns
X_test_featurized[num_cols] = X_test_norm   # add numeric columns

# Alternate method to join columns:
# X2_train_featurized = pd.concat([X_train_onehot, X_train_norm], axis = 1)
# X2_test_featurized = pd.concat([X_test_onehot, X_test_norm], axis = 1)

# We delete objects so that we do not accidentally use them
del X_train_norm, X_test_norm, X_train_onehot, X_test_onehot

print("Featurized training data has {} rows and {} columns.".format(*X_train_featurized.shape))
print("Featurized test data has {} rows and {} columns.".format(*X_test_featurized.shape))

<hr style="border:4px solid gray">
<hr style="border:4px solid yellow">
<hr style="border:4px solid gray">

## Linear regression

So far we've only seen classification algorithms. So it's time to change course and take a look at regression algorithms. A regression predicts a numeric target.  As a consequence, we evaluate regressions differently than classifications and we use different metrics. We used **accuracy** for the classification algorithms we saw above. We use **MSE** for the regressions we train below.

#### Mean Squared Error ($\text{MSE}$)
$\text{MSE}$ measures the error of a set of predictions. An $\text{MSE}$ of zero means the predictions are perfect, and $\text{MSE}$ increases as the predictions get worse. It is calculated by taking the mean of the squared differences between the actual values and the predicted values.  
$$\text{MSE} = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$$
where:
- $\text{MSE}$ is Mean Squared Error
- $y_i$ is the test target value of the $i$th sample
- $\hat{y}_i$ is the predicted target value from the inputs of the $i$th sample.  $\hat{y}_i$ is often written as $f(x_i)$ to remind us that the predicted values are a function of $x_i$ from the test values.
- $n$ is the number of samples

#### Variance ($\sigma^2$)
Variance ($\sigma^2$) is calculated by taking the mean of the squared differences between the actual target values and the mean target value ($\bar{y}$). This is equivalent to the $\text{MSE}$ of a regression model that predicts the average for every sample. We therefore treat the variance as the error of our baseline model. We expect $\text{MSE}$ to be less than $\sigma^2$; otherwise our predictions would not be worth the effort.
$$\sigma^2 = \frac{1}{n}\sum_{i}(y_i - \bar{y})^2$$
where:
- $\sigma^2$ is the variance
- $n$ is the number of samples
- $y_i$ is the target value of the $i$th sample 
- $\bar{y} = \frac{1}{n}\sum_{i} y_i$ is the mean target value

#### Compare Mean Squared Errors
We would like a metric that we can use to compare accuracies of regression models, even if the models are unrelated.  One way to make $\text{MSE}$ more comparable is if we scale $\text{MSE}$ by the variance, $\sigma^2$.  This variance-scaled $\text{MSE}$ is 0 for the best models and 1 for models that are no better than guessing the average.
$$\frac{\text{MSE}}{\sigma^2}$$  
where:
- $\text{MSE}$ is Mean Squared Error
- $\sigma^2$ is the variance

#### Coefficient of Determination
Instead of the variance-scaled $\text{MSE}$, we would like a metric that has a scale similar to that of accuracy measures for classifications. We would like this metric to be 1 for a perfect model and 0 for our baseline model that is no better than guessing the average. We can create a metric like this by subtracting the variance-scaled $\text{MSE}$ from 1. We call this useful metric the **Coefficient of Determination** ($R^2$).  
$$R^2 = 1 - \frac{\text{MSE}}{\sigma^2}$$
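
As a small worked example with made-up numbers (not taken from the bank data), suppose the actual values are $y = (1, 2, 3, 4)$ and the predictions are $\hat{y} = (1.5, 2, 2.5, 4.5)$, so that $\bar{y} = 2.5$. The four quantities defined above then work out as follows:

$$
\begin{aligned}
\text{MSE} &= \tfrac{1}{4}\left(0.25 + 0 + 0.25 + 0.25\right) = 0.1875 \\
\sigma^2 &= \tfrac{1}{4}\left(2.25 + 0.25 + 0.25 + 2.25\right) = 1.25 \\
\frac{\text{MSE}}{\sigma^2} &= \frac{0.1875}{1.25} = 0.15 \\
R^2 &= 1 - 0.15 = 0.85
\end{aligned}
$$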

Unfortunately, the symbol for **Coefficient of Determination** is $R^2$, even though it is not a squared entity and $R^2$ can be less than zero.  To make matters worse, $R^2$ is often called "r-squared" and it is often confused with the square of Pearson's correlation coefficient.  Because of these confusions many data scientists do not know that they can use the **Coefficient of Determination** to compare very different regressions.  
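
The sketch below illustrates both points with made-up numbers: predictions that are perfectly correlated with the actual values but on the wrong scale. The squared Pearson correlation is 1, yet the Coefficient of Determination is negative because the predictions are worse than simply guessing the average.

```python
import numpy as np

# Hypothetical actual values and predictions (made up for illustration).
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = 2 * y  # perfectly correlated with y, but twice as large

mse = np.mean((y - y_hat) ** 2)   # mean squared error of the predictions
variance = np.var(y)              # MSE of the "always predict the mean" baseline
r2 = 1 - mse / variance           # Coefficient of Determination

pearson_r = np.corrcoef(y, y_hat)[0, 1]

print("Coefficient of Determination:", r2)
print("Squared Pearson correlation:", pearson_r ** 2)
```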

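We cannot use our classification target (`y`) as our regression target. Instead, we can use the `duration` column in the data as our target because it is numeric. Note that `duration` was Z-normalized along with the other numeric features, so the target is on a standardized scale and its variance on the training data is 1.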

In [None]:
y_train = X_train_featurized['duration']
X_train_featurized = X_train_featurized.drop(columns = 'duration')

y_test = X_test_featurized['duration']
X_test_featurized = X_test_featurized.drop(columns = 'duration')

Other than changing the target from categorical to numeric, we don't have to do things very differently from before. The training and predicting parts of the code remain very similar. 

In [None]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train_featurized, y_train)

y_hat_train = linreg.predict(X_train_featurized)
y_hat_test = linreg.predict(X_test_featurized)

In [None]:
from sklearn.metrics import mean_squared_error

mse_train = mean_squared_error(y_train, y_hat_train)
mse_test = mean_squared_error(y_test, y_hat_test)

print("MSE on the training data: {:5.5f}.".format(mse_train))
print("MSE on the test data: {:5.5f}.".format(mse_test))

We don't know if this MSE is good or bad. We need to compare it to the baseline MSE. We could have used the coefficient of determination instead.

### Exercise (5 minutes)

To get MSE (mean squared error), here's what we need to do:

- find the **errors** (difference between predicted and actual value) and square them to get **squared errors**
- add up all the squared errors to get the **sum of squared errors**
- divide the sum of squared errors by the number of rows to get the **mean squared error**

Use the training data to calculate the MSE using `numpy` and compare it to what you get when you run `mean_squared_error`. If you use `numpy` correctly, you should not have to write any loops.

In [None]:
def MSE(y_true, y_pred = None):
    """Mean squared error; without predictions, use the mean-only baseline."""
    if y_pred is None:
        y_pred = y_true.mean() # baseline: always predict the average
    error = y_true - y_pred
    squared_error = error**2
    return squared_error.mean()

print("Training MSE:", MSE(y_train, y_hat_train))
print("Test MSE:", MSE(y_test, y_hat_test))
print("Baseline test MSE:", MSE(y_test), "; Same as variance of test:", np.var(y_test))

The MSE for the predicted test values is not much lower than the variance of the test values, so the model barely improves on the baseline of always predicting the average. It seems that this model isn't very good either.  

In [None]:
mse = MSE(y_test, y_hat_test)
variance = MSE(y_test)
CoD = 1 - mse / variance
print("Coefficient of Determination for the test data:", round(CoD,3))

The Coefficient of Determination formalizes the comparison of the MSE with the variance and creates a scale where 0 is baseline and 1 is a perfect model.  Just like our comparison of MSE with variance, the CoD indicates a very weak model.  
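
As a cross-check, scikit-learn provides `r2_score` in `sklearn.metrics`, which computes the same quantity. The minimal sketch below uses made-up arrays (an assumption for illustration) to show that the manual formula and the library function agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual values and predictions (made up for illustration).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.5])

# Manual calculation: 1 minus the variance-scaled MSE.
manual_r2 = 1 - mean_squared_error(y_true, y_pred) / np.var(y_true)

# Library calculation.
library_r2 = r2_score(y_true, y_pred)

print("Manual:", manual_r2)
print("r2_score:", library_r2)
```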

### Plot the Results (Sanity Check)
Another very important check is to plot the predictions for your test data against the known values of your test data. For a good model, the points fall close to the diagonal, where the predicted and actual values are equal.

In [None]:
plottingData = pd.DataFrame({'y_test':y_test, 'y_hat_test':y_hat_test}).sort_values(by='y_test')

# Red dotted diagonal: where the points would fall if the predictions were perfect
plt.plot([-1, 2], [-1, 2], c='red', lw=2, ls=':')
plt.scatter(plottingData['y_test'], plottingData['y_hat_test'], s=10, c='blue', alpha = 0.3)
plt.xlim(-1, 2)
plt.ylim(-1, 2)
plt.title('Example of a really bad model')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.grid()
plt.show()