# Introduction - Recap of Supervised Learning


In this recap, we are going to take a focus on doing as most boring work as we can via automated tools, so we can focus on the analysis.

Also, this dataset is fully anonymized, so we can't bring any real-world knowledge of even augment this dataset. Which actually makes it a good target for AutoML tools.

In this recap, we are going to go through of some of the cardinal rules of supervised learning:

- Check for Missing Data
- Check for duplicate rows
- Check for weird stuff on the data
- Normalize the input variables so our models don't choke.
- Do a train test split
- DO A TRAIN TEST SPLIT. SERIOUSLY. DO IT.
- Have I said anything about train-test-split? If so, let me say it again, train test split. It's that important.
- Try a bunch of different models. 
- See the performance of different models.
- (OPTIONAL) Do some automated ML



In [1]:
# Let's first install some much needed dependencies

# Pandas-profiling in google colab is outdated 
!pip install pandas-profiling -U

# Let's also install hvplot to give us some good interactive plots
!pip install hvplot -U

# Let's install imbalanced-learn, because I could to this manually, but long live the external libraries
!pip install imbalanced-learn -U

# Let's add some stuff for AutoML
!pip install tpot -U

Collecting pandas-profiling
  Downloading pandas_profiling-3.0.0-py2.py3-none-any.whl (248 kB)
Collecting pydantic>=1.8.1
  Downloading pydantic-1.8.2-cp38-cp38-win_amd64.whl (2.0 MB)
Collecting visions[type_image_path]==0.7.1
  Downloading visions-0.7.1-py3-none-any.whl (102 kB)
Collecting tangled-up-in-unicode==0.1.0
  Downloading tangled_up_in_unicode-0.1.0-py3-none-any.whl (3.1 MB)
Collecting missingno>=0.4.2
  Downloading missingno-0.5.0-py3-none-any.whl (8.8 kB)
Collecting htmlmin>=0.1.12
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
Collecting phik>=0.11.1
  Downloading phik-0.12.0-cp38-cp38-win_amd64.whl (659 kB)
Collecting multimethod==1.4
  Downloading multimethod-1.4-py2.py3-none-any.whl (7.3 kB)
Collecting imagehash
  Downloading ImageHash-4.2.1.tar.gz (812 kB)
Building wheels for collected packages: htmlmin, imagehash
  Building wheel for htmlmin (setup.py): started
  Building wheel for htmlmin (setup.py): finished with status 'done'
  Created wheel for htmlmin: filename=htm

# Preprocessing

In [2]:
import pandas as pd
from pandas_profiling import ProfileReport

import hvplot.pandas
import holoviews as hv

In [3]:
# Nice little known feature of Pandas: it allows you to literally read a CSV from a compressed file in the internet! 
loans = pd.read_csv("https://github.com/tiagofassoni/useful-datasets/raw/main/loan_dataset_iteration_1.zip")

In [4]:
ProfileReport(loans, minimal=True)

Summarize dataset:   0%|          | 0/48 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



## Interpreting the output of pandas-profiling

A great thing in pandas profiling is the `warnings` section. It allows us to realize:
- we must drop columns `f33`, `f34`, `f35`, `f37`, `f38` because they are constant
- Some columns clearly have zeros in place of `null`s
- the constant values in columns `f33`, `f34`, `f35`, `f37`, `f38` are zeros
- About 90% of the clients don't default

It's also not very clear which of our columns legitimately have zeros or have missing data disguised as zeros. In a normal setting, you can just ask people in the company and they most likely will have some answers. In this case, we are stuck with guesswork.

I'm going to use a imputer for the nulls, but *I'm not going to mess with the zeros*. It's as much a valid option as removing all rows with zeros or running a imputer in the nulls + zeros.

## Separating X and y

Just to make our lives easier, let's split the explanatory columns in `X`, and the response column in `y`.

In [5]:
# I ❤ tuple unpacking syntax
X, y = loans.drop(columns='loss'), loans['loss']

## Converting data to categorical

We need to convert the `loss` variable to a binary variable. To make our lives easier in the metrics department, let's say the positive outcome is when someone paid correctly, and the negative outcome is when someone defaulted. So, 0 becomes 1, anything else becomes 0.

There are some ways to do this in Pandas:

First, using just a condition in pandas, and some dirty trickery with the fact `true` gets converted to 1 and `false` is converted to 0 in Python. Yours truly doesn't like it, as it involves boolean conversion shenanigans.


```
y = (y == 0).astype(int)
```

Or using `np.where`, which the person doing the recap doesn't really like, as it uses a different package.
``` 
y = np.where(y == 0, 1, 0)
```

Or, the one yours truly appreciates, using  [pandas' apply-a-function](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html?highlight=apply#pandas.Series.apply), because it works *in exactly the same way* if you start working with big data. So the knowledge is transferrable.

```
def my_super_function(number):
    if number == 0:
        return 1
    else:
        return 0
        
y = y.apply(my_super_function)
```

However, I find it super cumbersome to define a function just to be used once and thrown away. And, because programmers don't like to type too much (and also don't like to polute the main namespace with lots of useless functions), there is a way to create a function that runs once and disappear. Behold [the lambda function (or lambda expression)](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions):

```
y = y.apply(lambda number: 1 if number == 0 else 0)
```


In [6]:
y = y.apply(lambda number: 1 if number == 0 else 0)



## Train test split

Not doing a train test split is a cardinal data science sin. And a great way to lose credibility fast.

For example, yours truly worked in political campaigns and was approached once by someone claiming to have a model that would predict my candidate's appeal in televised political debates, with 95% accuracy. My first question on such strong claims was how they were testing it. Turns out they didn't have any test data and they just tried to sell me an overfitted model.

### There is also a train-validation-test split
Some people argue that, if you are doing you are using the test dataset *in any way* to select models, you are actually leaking the test dataset. This isn't much of an issue in non-neural-network-land, because there aren't all that many hyperparameters to train, and there isn't usually much data to comfortably do a train-validation-test split. However, for neural networks, it's usually a very, very good idea to have a train-validation-test split.

Also, if you decide to work in a bank, you may find out people using train-validation-test splits in credit models because they are (rightfully) super afraid to lose any money.

In [7]:
from sklearn.model_selection import train_test_split

# random_state is a lifesaver to ensure reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

## Cleaning Missing Data

We have 3 main approaches to deal with missing data:
    - Remove the columns with missing data
    - Remove the lines with missing data
    - Impute (a fancy name for "try to guess") missing data 
    
As in most of mathematics, wherever there isn't a "best" or "optimal" way of doing something..... there are a bazillion ways. Same thing applies here. We can just impute using the mean, we can impute using the median, we can impute using the previous value, we can even impute using a decision tree trained at the rest of the data to look as similar as possible. [There are entire packages dedicated to fancy ways of imputing data](https://github.com/iskandr/fancyimpute)

Just to do a different thing now, let's use a new imputer in scikit-learn. It's "experimental" because its API might change between versions, but it's perfectly usable.

**Note we must always, always, train the imputer only on the training dataset**

In [8]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer1

In [9]:
imputer = IterativeImputer(random_state=42, max_iter=1000)

X_train = imputer.fit_transform(X_train)

X_test = imputer.transform(X_test)

## Feature Scaling

It's a common requirement to scale all columns so they are in the 0-1 range to avoid some big column overpowering the others. By the way, this is particularly important if you're using neural networks. [So says the creator of the Keras library](https://www.manning.com/books/deep-learning-with-python).

[Let's use the standard scaler from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

Again, you must train the scaler in the training set, otherwise you will be leaking information from the test set

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) 

# Let's finally train some models!

First, a panorama view:

In Mathematics, when we have an best way to do something (also called "optimal" way) we usually have only one way to do it. When there isn't a best way to do it, there will be miriads of different ways. As the question we are trying to solve is essentially "given some data of questionable reliability, get me a function that tells us the future", we are obviously going to have a million different ways and will need some way to try to assess model performance.





In [11]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

## Decision Trees

Decision Trees are workhorses of machine learning. They are extremely flexible, are fast and, although they are prone to overfitting, we can use combinations of weak decision trees (called ensembles) to get very good results in practice. 

For example, Cloudflare uses ensembles of decision trees for spam filters.


In [12]:
# Let's store all of our experiments data in a dict of results

# TODO: Turn this into a class

class AllOurResults():
  list_results = []

  def append_to_results(self, title, params, classifier, X_train, X_test, y_train, y_test):
    self.list_results.append({
      'type': title,
      'params': params,
      'train accuracy': classifier.score(X_train, y_train),
      'test accuracy': classifier.score(X_test, y_test),
      'train precision': precision_score(y_train, classifier.predict(X_train)),
      'test precision': precision_score(y_test, classifier.predict(X_test)),
      'train recall': recall_score(y_train, classifier.predict(X_train)),
      'test recall': recall_score(y_test, classifier.predict(X_test)),
      'train f1': f1_score(y_train, classifier.predict(X_train)),
      'test f1': f1_score(y_test, classifier.predict(X_test)),
      'train confusion matrix': confusion_matrix(y_train, classifier.predict(X_train)),
      'test confusion matrix': confusion_matrix(y_test, classifier.predict(X_test)), 
  })

  def to_dataframe(self):
    return pd.DataFrame(self.list_results)

  def show_some_metrics(self):
    return self.to_dataframe().drop(columns=[['train confusion matrix (to plot later)', 'test confusion matrix (to plot later)']]).plot()  

  def plot_confusion_matrices(self):
    pass

all_our_results = AllOurResults()

In [13]:
from sklearn.tree import DecisionTreeClassifier

In [14]:
dtc = DecisionTreeClassifier()

In [15]:
dtc.fit(X_train, y_train)

DecisionTreeClassifier()

In [16]:
all_our_results.append_to_results('Decision Tree', 'default params', dtc, X_train, X_test, y_train, y_test)

## K-Nearest Neighbors Classifier (and Logistic Regression)

K-Nearest Neighbors Classifier uses a very simple approach to distinguish between classes: a point probably has the same class as the points close to it. That's it. Zero sophistication. 

One advantage of this model is that it allows for very non-linear boundaries. It doesn't care at all about the structure of your data, as long as the points of the same class are close to each other. One disadvantage is it isn't very stable. If you change some data, you might mess your boundaries and get a very different model.

This model is usually taught alongside Logistic (and Linear) regressions to show the bias-variance tradeoff. Logistic (and Linear) Regression are very stable in the change of data. But they have zero flexibility, as they are always going to make a line and that's it.





In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

all_our_results.append_to_results('Logistic Regression', {'max_iter': 1000}, lr, X_train, X_test, y_train, y_test)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

all_our_results.append_to_results('K-Nearest Neighbors', 'default params', knn, X_train, X_test, y_train, y_test)


## Some common modelling problems you will see

### Overfitting

This is a super common problem. We are trying to make our model learn patterns that generalize. However, if we don't impose some limits to our model, we can end up having a model that just memorizes stuff and can't generalize. A common symptom of this is the training set having way higher scores than the test set. 

### Data Leakage

Leakage happens when you have some "shortcut" that tells our model the right answer. A huge signal of it are ununsually high test scores (or when our test scores are higher than our training set scores). 

Once, a student reached out to me because his model to predict wins or losses of american football matches given some player data was having a test accuracy of 100%. Turns out he forgot to remove the goals each team had made, and the decision tree just used that as a shortcut!



## Let's try to improve our models with hyper parameter tuning

All models can be tuned by changing their parameters. As you may have seen so far, this is a very tedious process. Scikit-learn has some parts to help us with model selection, like [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) and [HalvingGridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV). There are even some packages which implement super fancy ways to have model selection, like [scikit-optimize](https://scikit-optimize.github.io/stable/) and [sklearn-deap](https://github.com/rsteca/sklearn-deap)

### What is a Cross Validation anyway?

Remember using the test set for model validation is a cardinal sin? If we start testing dozens of different model parameters using the test set, we are effectively using the test dataset as training. Also, with no variation in the training and test data set, we are very prone to just memorizing the training and test datasets, with no model that can generalize.

A cross-validation is a way to try to get decent models out of just the training set. Is it perfect or guaranteed to give us good results? It isn't. And absolutely nothing is guaranteed to give us good results, as we are trying to predict the future.

The idea of cross-validation is: we slice our dataset into 3 (or 5, or as many as you wish, but usually 3) subsets of identical size. We use two of them to train the model, one to test the model. We store the result, change the train and test subsets, run again, store the result. We finish when we exhausted all possible combinations. The final model score will be the average of all our model scores.

### Resource Usage
It's very common to test at least 10 different model parameters. In a 3-fold Cross Validation, that would take 3 models for each of 10 parameters... About 30 different models to run. This can quickly get in the thousands of models, and there is [even a system to record the results of the experiments](https://mlflow.org/)


In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
params_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 10, 50, 100, None],
    "min_samples_split": [2, 3, 5, 10],
}

# n_jobs is important because it allows us to parallelize our work. In this particular case, 
grid_search = GridSearchCV(
    DecisionTreeClassifier(), param_grid=params_grid, n_jobs=-2, cv=3, refit=True
)

In [20]:
grid_search.fit(X_train, y_train)

all_our_results.append_to_results('GridSearch Tree', grid_search.best_params_, grid_search, X_train, X_test, y_train, y_test)


In [21]:
all_our_results.to_dataframe()

Unnamed: 0,type,params,train accuracy,test accuracy,train precision,test precision,train recall,test recall,train f1,test f1,train confusion matrix,test confusion matrix
0,Decision Tree,default params,1.0,0.82,1.0,0.907563,1.0,0.892562,1.0,0.9,"[[569, 0], [0, 5431]]","[[20, 165], [195, 1620]]"
1,Logistic Regression,{'max_iter': 1000},0.9055,0.9075,0.905468,0.9075,1.0,1.0,0.950389,0.951507,"[[2, 567], [0, 5431]]","[[0, 185], [0, 1815]]"
2,K-Nearest Neighbors,default params,0.906,0.899,0.910025,0.907529,0.994476,0.989532,0.950378,0.946758,"[[35, 534], [30, 5401]]","[[2, 183], [19, 1796]]"
3,GridSearch Tree,"{'criterion': 'entropy', 'max_depth': 3, 'min_...",0.905167,0.9075,0.905167,0.9075,1.0,1.0,0.950223,0.951507,"[[0, 569], [0, 5431]]","[[0, 185], [0, 1815]]"


## Oversampling

The results so far have decent accuracy and decent precision. Only problem is, our bosses want as much precision as possible, because we literally lose money on our false positives! Banks will try to tune this to get as much risk protection as they can. 

We can try to run more grid searches, to run other models, but one other way is to oversample the minority class. There are very fancy ways to do this in [imbalanced-learn](https://imbalanced-learn.org/stable/), but we are going to stick with the simplest one, just repeating samples from the minority class.

In [22]:
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_train_res, y_train_res = oversampler.fit_resample(X_train, y_train)

In [23]:
# Our oversampler made the two classes have equal size, let's see how it all works out with our classifiers!
print(X_train.shape)
print(X_train_res.shape)

print(pd.Series(y_train).value_counts())
print(pd.Series(y_train_res).value_counts())

(6000, 39)
(10862, 39)
1    5431
0     569
Name: loss, dtype: int64
0    5431
1    5431
Name: loss, dtype: int64


In [24]:
# Decision Tree

dtc_rs = DecisionTreeClassifier()

dtc_rs.fit(X_train_res, y_train_res)

all_our_results.append_to_results('Decision Tree with Resampling', 'default params', dtc_rs, X_train, X_test, y_train, y_test)

In [25]:
# Logistic Regression
lr_rs = LogisticRegression(max_iter=1000)
lr_rs.fit(X_train_res, y_train_res)

all_our_results.append_to_results('Logistic Regression with Resampling', {'max_iter': 1000}, lr_rs, X_train, X_test, y_train, y_test)

knn_rs = KNeighborsClassifier()
knn_rs.fit(X_train, y_train)

all_our_results.append_to_results('K-Nearest Neighbors with Resampling', 'default params', knn_rs, X_train, X_test, y_train, y_test)

In [26]:
# Grid Search for Decision Tree
params_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 10, 50, 100, None],
    "min_samples_split": [2, 3, 5, 10],
}

# n_jobs is important because it allows us to parallelize our work. In this particular case, 
grid_search_rs = GridSearchCV(
    DecisionTreeClassifier(), param_grid=params_grid, n_jobs=-2, cv=3, refit=True
)

grid_search_rs.fit(X_train_res, y_train_res)

all_our_results.append_to_results('GridSearch Tree with Resampling', grid_search_rs.best_params_, grid_search_rs, X_train, X_test, y_train, y_test)


In [27]:
all_our_results.to_dataframe()

Unnamed: 0,type,params,train accuracy,test accuracy,train precision,test precision,train recall,test recall,train f1,test f1,train confusion matrix,test confusion matrix
0,Decision Tree,default params,1.0,0.82,1.0,0.907563,1.0,0.892562,1.0,0.9,"[[569, 0], [0, 5431]]","[[20, 165], [195, 1620]]"
1,Logistic Regression,{'max_iter': 1000},0.9055,0.9075,0.905468,0.9075,1.0,1.0,0.950389,0.951507,"[[2, 567], [0, 5431]]","[[0, 185], [0, 1815]]"
2,K-Nearest Neighbors,default params,0.906,0.899,0.910025,0.907529,0.994476,0.989532,0.950378,0.946758,"[[35, 534], [30, 5401]]","[[2, 183], [19, 1796]]"
3,GridSearch Tree,"{'criterion': 'entropy', 'max_depth': 3, 'min_...",0.905167,0.9075,0.905167,0.9075,1.0,1.0,0.950223,0.951507,"[[0, 569], [0, 5431]]","[[0, 185], [0, 1815]]"
4,Decision Tree with Resampling,default params,1.0,0.835,1.0,0.909994,1.0,0.907989,1.0,0.908991,"[[569, 0], [0, 5431]]","[[22, 163], [167, 1648]]"
5,Logistic Regression with Resampling,{'max_iter': 1000},0.607667,0.5925,0.94633,0.940917,0.600626,0.587879,0.73485,0.723635,"[[384, 185], [2169, 3262]]","[[118, 67], [748, 1067]]"
6,K-Nearest Neighbors with Resampling,default params,0.906,0.899,0.910025,0.907529,0.994476,0.989532,0.950378,0.946758,"[[35, 534], [30, 5401]]","[[2, 183], [19, 1796]]"
7,GridSearch Tree with Resampling,"{'criterion': 'gini', 'max_depth': 50, 'min_sa...",1.0,0.8455,1.0,0.912377,1.0,0.917906,1.0,0.915133,"[[569, 0], [0, 5431]]","[[25, 160], [149, 1666]]"


## AutoML

Do you feel lazy? There are automated packages that try to find the best model for us (and those are usually tree ensembles). Let's take a look at one, [TPOT](https://github.com/EpistasisLab/tpot)

In [28]:
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=10, random_state=42, verbosity=2, max_time_mins=30)
tpot.fit(X_train, y_train)

all_our_results.append_to_results('TPOT', tpot.fitted_pipeline_, tpot, X_train, X_test, y_train, y_test)





Optimization Progress:   0%|          | 0/100 [00:00<?, ?pipeline/s]


31.31 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.001, max_depth=9, min_child_weight=7, n_estimators=100, n_jobs=1, subsample=0.45, verbosity=0)


In [29]:
all_our_results.to_dataframe()

Unnamed: 0,type,params,train accuracy,test accuracy,train precision,test precision,train recall,test recall,train f1,test f1,train confusion matrix,test confusion matrix
0,Decision Tree,default params,1.0,0.82,1.0,0.907563,1.0,0.892562,1.0,0.9,"[[569, 0], [0, 5431]]","[[20, 165], [195, 1620]]"
1,Logistic Regression,{'max_iter': 1000},0.9055,0.9075,0.905468,0.9075,1.0,1.0,0.950389,0.951507,"[[2, 567], [0, 5431]]","[[0, 185], [0, 1815]]"
2,K-Nearest Neighbors,default params,0.906,0.899,0.910025,0.907529,0.994476,0.989532,0.950378,0.946758,"[[35, 534], [30, 5401]]","[[2, 183], [19, 1796]]"
3,GridSearch Tree,"{'criterion': 'entropy', 'max_depth': 3, 'min_...",0.905167,0.9075,0.905167,0.9075,1.0,1.0,0.950223,0.951507,"[[0, 569], [0, 5431]]","[[0, 185], [0, 1815]]"
4,Decision Tree with Resampling,default params,1.0,0.835,1.0,0.909994,1.0,0.907989,1.0,0.908991,"[[569, 0], [0, 5431]]","[[22, 163], [167, 1648]]"
5,Logistic Regression with Resampling,{'max_iter': 1000},0.607667,0.5925,0.94633,0.940917,0.600626,0.587879,0.73485,0.723635,"[[384, 185], [2169, 3262]]","[[118, 67], [748, 1067]]"
6,K-Nearest Neighbors with Resampling,default params,0.906,0.899,0.910025,0.907529,0.994476,0.989532,0.950378,0.946758,"[[35, 534], [30, 5401]]","[[2, 183], [19, 1796]]"
7,GridSearch Tree with Resampling,"{'criterion': 'gini', 'max_depth': 50, 'min_sa...",1.0,0.8455,1.0,0.912377,1.0,0.917906,1.0,0.915133,"[[569, 0], [0, 5431]]","[[25, 160], [149, 1666]]"
8,TPOT,"(XGBClassifier(base_score=0.5, booster='gbtree...",0.905167,0.9075,0.905167,0.9075,1.0,1.0,0.950223,0.951507,"[[0, 569], [0, 5431]]","[[0, 185], [0, 1815]]"


TPOT wields great results, at the cost of time to train. By the way, if you decide to train it at the large dataset, it's going to take hours, even days!

# How to do EVERYTHING wrong

Well, see for yourself how the NSA handled all this stuff with a budget of billions.
https://arstechnica.com/information-technology/2016/02/the-nsas-skynet-program-may-be-killing-thousands-of-innocent-people/

A version more lightweight in details (and angrier) of the same errors made by the NSA

https://pluralistic.net/2021/08/02/autoquack/#gigo