### **Introduction**

Hi there! We are happy to see you opening this notebook and reading this - hopefully you are curious about what's gonna happen next🙂. With that in mind, we highly recommend that you go through the notebook in detail and make sure that you understand most of the material. We do believe that **the Intro to ML module of our course is very important for building a solid understanding of ML**, so it is crucial to study this seminar and the homework notebooks in order to master ML skills. And of course, we are here to help you with learning it!

In this homework we are going to play around with the [dataset](https://archive.ics.uci.edu/ml/datasets/student+performance) of Portuguese students. The data includes student grades as well as demographic, social and school-related features. What we are particularly interested in here is building a model which will **predict whether a student passes or fails the exam** based on information about her/him. We will take the Maths exam for this study, but the same data is also available for the Portuguese language exam in the "data" folder, so if you are interested, you can give it a go too. Moreover, personally, we are very curious to find out what exactly affects the exam result and how: would it be the amount of time spent with friends or workday alcohol consumption? So if you are too, then let's dive in🏄

### **Importing libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, learning_curve
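from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, classification_report, average_precision_score, f1_score
try:
    from sklearn.metrics import plot_roc_curve  # removed in scikit-learn 1.2
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator  # same (estimator, X, y) call in newer versions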

In [None]:
np.random.seed(10)

### **Preparing data**

#### **loading data**

In [None]:
# you can find the description of features in data/student.txt file 
data = pd.read_csv('data/student-mat.csv', sep=';')

In [None]:
data.head()

In [None]:
data.shape

We are definitely short of data here, but life is hard... Admittedly, it is not easy to collect much information of this kind - one has to ask many students, and they might be too lazy to answer or just not want to. So let's bear with it and try to do our best with the data at hand (and also acknowledge that in Particle Physics we really have lots of data and rarely face this problem).

#### **defining target**

In [None]:
# plotting final grade
data.G3.hist()
plt.show()

So the G3 column represents the final grade for the Maths exam. It takes discrete values from 0 to 20 and we have **several ways of predicting students' success**. Firstly, we can pose this as a **binary classification problem** by defining a threshold, above which a student passes the exam and below which she/he fails. Secondly, we can define several thresholds and make it a **multiclass problem** (e.g. "pass", "waiting list", "fail"). Then, we can go ahead with doing **regression** on the raw G3 score. However, this would not be entirely correct since our target takes only discrete values in a bounded range - but mathematically it will work, so why not. And lastly, we can actually realise that our target is [multinomially distributed](https://en.wikipedia.org/wiki/Multinomial_distribution), so we can work in the framework of [**Generalized Linear Models**](https://en.wikipedia.org/wiki/Generalized_linear_model) and [fit](https://www.statsmodels.org/stable/glm.html) such a model to the data. OK, this would be a bit too much to ask of you - but if you are a curious seeker, then we won't stop you from doing this😏 But for the homework we will **stick to the simpler option of binary classification.** 

In [None]:
# exercise: create a pass (1)/fail(0) binary target with a threshold of 14
# ~~~ your code goes here ~~~

data['passed'] = None

In [None]:
data.passed.hist()
plt.show()

Well, we clearly have **unbalanced classes** - there are roughly 6 times fewer students who passed the exam than students who failed it. We will ignore this for the first iteration of model building but will correct for it in the following steps. 

#### **categorical features**

In [None]:
data.dtypes

In [None]:
np.unique(data.Mjob, return_counts=True)

So, there are **categorical features** in our dataset, and they need to be converted into a numerical representation somehow. One way to [approach this problem](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) is to simply encode unique values by numbers. In the example above, "at_home" values will be converted into 0, "health" into 1, and so on. But one can notice that this is not quite appropriate, because it implies that there is some ordering ("health" is higher than "at_home"), whilst there is none. For such cases [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) is preferable. However, in our dataset the categorical features are mostly binary, so we will use the [OrdinalEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) class from sklearn to just assign integer codes to these features.
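To make the difference between the two encodings concrete, here is a minimal sketch on a made-up toy column (the values are borrowed from `Mjob`, but the array itself is an assumption for illustration and not the homework data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# toy column, not taken from the homework dataset
jobs = np.array([["teacher"], ["at_home"], ["health"], ["at_home"]])

ordinal = OrdinalEncoder().fit_transform(jobs)
one_hot = OneHotEncoder().fit_transform(jobs).toarray()  # sparse matrix -> dense array

print(ordinal.ravel())  # one integer code per category, assigned in sorted order
print(one_hot)          # one 0/1 column per category, no implied ordering
```

Note how the ordinal codes put "teacher" above "health" above "at_home", while the one-hot columns treat all three on an equal footing.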

In [None]:
# exercise: convert object features (and only them) with OrdinalEncoder() 
# ~~~ your code goes here ~~~

#### **exploring data**

In [None]:
data.sample(5)

In [None]:
data.describe()

Good, now we have preprocessed the categorical features and it's time to have a look at NaNs (well, actually we should've done this before, but OK)

In [None]:
# exercise: are there NaNs? If yes, fill them the way you deem preferable
# ~~~ your code goes here ~~~

Moving on, as the NaNs have been taken care of, we can explore the data a bit. Here we will let you **get your hands dirty** and do this your own way, without any constraints or guidance. Do some pairplots, histograms, scatterplots, study correlations, add/remove/transform features - the goal here is that you experiment with the data and prepare it for further propagation through the ML pipeline to get the best result. **Be curious and creative**, ask yourself questions to which you would like to know the answer and use Python as your tool at hand. Also, as a rule of thumb, many questions can be answered with $\leq5$ lines of Python code, so keep this in mind😉

In [None]:
# ~~~ room for your imagination ~~~

#### **choosing metric**

Before we continue, it is important to think about **how we are going to evaluate the model's performance**. In the lecture you heard a bit about accuracy, precision and recall, F-score and ROC AUC, but which one should we pick for our task? There is clearly a class imbalance, so we can't trust accuracy and ROC AUC. The latter might not be trivial to understand, so check out [this lecture](https://github.com/esokolov/ml-course-hse/blob/master/2019-fall/lecture-notes/lecture04-linclass.pdf) for some examples or contemplate a bit on the animation below (there are more cool animations [here](https://github.com/dariyasydykova/open_projects/tree/master/ROC_animation)).
![](images/imbalance.gif)

OK, but then there is an important question: **what do you really want from your model?** Let's imagine, for example, that you are the head of a selection committee forming a new class on advanced algebra. The competition among students is tight: there is a limited number of places available, there are lots of candidates and many of them are really excellent. So you decide to create a model which will help you select the best students based on whether they will pass the exam or not - and you have the data from their previous Maths exam at your disposal. In this situation you **don't really want to maximise recall** - that would correspond to selecting as many good students as possible, but amongst them there might be a lot of not-so-good ones, and since you have a limited number of places, you would have to filter them out once again. What you are really interested in is that the model is _precise_ in its predictions: that is, if it says "this student is good", then the student really is good (so will pass the exam). This is exactly what precision measures!

However, precision depends on the threshold: the model predicts a score and then you have to apply some cut on this score to define the "pass/fail" status of a student. Let's suppose that you are lazy, don't want to optimise this cut and want some threshold-averaged metric, like ROC AUC. And here it is: [average precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score). Basically, it can be interpreted as the area under the _Precision-Recall curve_ (do you understand how it differs from the ROC curve?). This metric is often used in ranking problems, where it can be interpreted as the ability, on average, to rank the positive class closer to the top. And well, implicitly this is what we are also interested in: ranking the students! So this seems like a good fit for us and we will go ahead with using this metric throughout the notebook. However, this choice was based on our expectations from the model, and **your choice could be different** simply because you might be interested in different things in this problem (maybe you want your model to be _not gender biased?_ this is a new and interesting perspective, right?🙂). So feel free to pick a different metric or derive your own and train the model accordingly. But remember, **switching to a different metric after evaluating the model's performance, just because the current one isn't "good enough", is cheating yourself!** Try to understand and settle on what you want from your model _a priori_ and stick to it throughout the analysis.
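The ranking interpretation of average precision can be checked by hand. Below is a small sketch on made-up labels and scores (toy values, assumed for illustration): we walk down the ranking from the highest score and average the precision at every position where a positive sample sits. When the scores have no ties, this should agree with `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# sort samples from the highest score to the lowest
order = np.argsort(-y_score)
hits = y_true[order]

# precision after each position of the ranking
precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)

# average the precision over the positions occupied by positives
ap_manual = precision_at_k[hits == 1].mean()

ap_sklearn = average_precision_score(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(ap_manual, ap_sklearn, auc)
```

The two metrics give different numbers for the same ranking, because ROC AUC also rewards ordering the negatives correctly, while average precision only looks at where the positives end up.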

#### **learning curves**

As your awesome dataset is now in proper shape and we have settled on the metric, let's check one thing first. We saw previously that there is clearly little data available, but maybe it's already enough to train a reasonable model? And if not, how much more data would we need? And here they are, [learning curves](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html) to help us out.

Essentially, the data is first split with **cross-validation** - that is, into N folds, where N-1 folds form the training part and the remaining fold is used for testing, repeated N times so that each fold serves as the test one once. Within each split the model is trained firstly on some small fraction of the training part, then on an increased one, and so on in increasing steps, with the last step being performed on the whole training part. Each time the model is scored both on the samples it was trained on and on the test fold.
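If the N-fold splitting itself feels abstract, here is a tiny sketch of it on toy indices (the 9-sample array is an assumption for illustration; plain `KFold` is used here for simplicity):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(9).reshape(-1, 1)  # 9 toy samples

# each split trains on 2 folds and tests on the remaining one
splits = list(KFold(n_splits=3).split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train on {train_idx}, test on {test_idx}")
```

Every sample ends up in the test fold exactly once, so all of the data contributes to the test score.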

Ah, we will take **logistic regression** as the baseline model - have a look at our part 2 Intro to ML lecture if you need to refresh your understanding of it. 

In [None]:
X = data.drop(columns=['passed', 'G1', 'G2', 'G3']) # would it be data leakage if we hadn't removed grades' features?
y = data['passed']

In [None]:
estimator = LogisticRegression(max_iter=300, class_weight='balanced', random_state=10)
scoring = 'average_precision' # metric to evaluate the performance at each split

In [None]:
# these are the fractions of the maximum training set size (i.e. of the training folds) for which the training will be performed
train_sizes = np.linspace(.1, 1.0, 5)
train_sizes

In [None]:
# here we set cv=3, meaning that the data is split into 3 folds: 2 of them provide the training subsets and the remaining one is used for testing
train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=3, scoring=scoring, train_sizes=train_sizes, random_state=10)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

In [None]:
# exercise: plot resulting learning curves for train and test samples with standard deviation bands
# ~~~ your code goes here ~~~

Good, so extrapolating by eye, $\sim500$ samples would be enough to bridge the gap between train and test scores. At the moment the metric value is around $0.3-0.4$, which ain't much, but shows that we can actually train something meaningful. 

In [None]:
# question: can you find where overfitting happens on this plot?
# ~~~ your answer goes here ~~~

#### **on importance of shuffling**

Let's get to the training then! For that purpose we will split the data into train/validation/test sets with the ratio 0.6/0.2/0.2. We will set the test set aside until the very final testing, while the validation set will be used to study and tune the model. 

There is an important aspect of **shuffling** data which you should be aware of. Well, data can be shuffled - and in some cases this is necessary, while in others it is harmful. For example, suppose that samples in your dataset **don't have any intrinsic order** (for example, images of cats and dogs), but for some reason they turned out to be ordered so that the first half of the dataset belongs only to class "0" (e.g. "dog"), while the second half belongs purely to class "1" (e.g. "cat"). If you don't shuffle the dataset and later on split it into train/test sets, you might end up training only on dogs and testing only on cats! This is clearly not good, because your algorithm learns only from dogs, whilst your goal is to identify cats - so in this case shuffling will help to even things out. And in general, after shuffling and splitting you still might want to cross-check that there is no bias.
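The dogs-and-cats trap is easy to reproduce. The sketch below uses made-up sorted labels (a toy array, not the homework data) and splits them once without and once with shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels sorted by class: 5 "dogs" (0) followed by 5 "cats" (1)
y_sorted = np.array([0] * 5 + [1] * 5)
X_sorted = np.arange(10).reshape(-1, 1)

# without shuffling the first half goes to train and the second half to test
_, _, y_tr_ordered, y_te_ordered = train_test_split(
    X_sorted, y_sorted, test_size=0.5, shuffle=False
)
# with shuffling the samples are permuted before being split
_, _, y_tr_shuffled, y_te_shuffled = train_test_split(
    X_sorted, y_sorted, test_size=0.5, shuffle=True, random_state=0
)

print("no shuffle:", y_tr_ordered, y_te_ordered)
print("shuffle:   ", y_tr_shuffled, y_te_shuffled)
```

Without shuffling the train part contains only dogs and the test part only cats; with shuffling the two classes will typically be mixed, though on such a tiny sample the mix can still be uneven.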

However, you might be dealing with data which **does have order**. For instance, you want to predict the weather based on some historical data. Then if you do cross-validation, you might want to carefully split the data into folds, so that at each step you don't train the model on the data from the future and test on the past🤔
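For ordered data sklearn offers dedicated splitters. As one possible illustration (the 8 toy "days" are an assumption, and `TimeSeriesSplit` is just one option among others), every training window below ends before its test window starts:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

days = np.arange(8).reshape(-1, 1)  # toy time-ordered samples

# each split trains on the past and tests on what comes right after it
ts_splits = list(TimeSeriesSplit(n_splits=3).split(days))
for train_idx, test_idx in ts_splits:
    print("train:", train_idx, "test:", test_idx)
```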

So, to summarise, remember to **check whether your data has some kind of order** and whether you need to accommodate it. By the way, `train_test_split()` performs shuffling by default (see its [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)). However, you might fall into the aforementioned trap if you split the data by yourself - so be careful!

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True, random_state=10) 
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, shuffle=True, random_state=10)

And now let's check the fraction of "passed" students in train, test and val samples. 

In [None]:
# exercise: find out these fractions
# ~~~ your code goes here ~~~

Try also to vary the `random_state` parameter - you will notice that, firstly, the **fraction fluctuates significantly**, and secondly, that **it is not even between different sets**. This is a consequence of the fact that we have few samples in the dataset. It ain't good and might result in a biased estimate of the model's performance and also in a model which doesn't correctly reflect the genuine data distribution. This is a significant problem and we are going to deal with it a bit later. For now, let's build a simple baseline model first.

### **Training baseline**

#### **building pipeline**

In [None]:
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=10)) # make_pipeline is almost the same as Pipeline(), see documentation for it

In [None]:
model.get_params()

#### **hyperparameter optimisation**

Above you can see that there are several parameters of the model which can be optimised - these are the so-called **hyperparameters**. This can't be made a part of the training procedure itself (why?), so what we will do is define a **grid** of hyperparameter values and for each of these values train a model with CV. After each single step we will have N_folds test scores which correspond to one particular value on the hyperparameter grid. Scanning the grid this way, we aggregate all the test scores and can then pick the point on the grid which gives us the best performance.

Also note that you can define the grid arbitrarily, and it is often advised that you first pick a coarse range of values (e.g. using `numpy.logspace`), find roughly the range of optimal values and then fine-tune within it (e.g. with `numpy.linspace`).
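The coarse-then-fine idea can be sketched without any model at all. Here `toy_score` is a hypothetical stand-in for a validation score with a single peak (it is not a real metric from this notebook), and the zoom factor of 3 is an arbitrary choice:

```python
import numpy as np

def toy_score(c):
    # hypothetical validation score with a single peak at c = 0.02
    return -(np.log10(c) - np.log10(0.02)) ** 2

# step 1: coarse scan over several orders of magnitude
coarse = np.logspace(-4, 0, num=5)
best_coarse = coarse[np.argmax(toy_score(coarse))]

# step 2: fine linear scan around the best coarse value
fine = np.linspace(best_coarse / 3, best_coarse * 3, num=50)
best_fine = fine[np.argmax(toy_score(fine))]

print(best_coarse, best_fine)
```

The coarse scan only locates the right order of magnitude, and the fine scan then gets close to the actual peak with the same small budget of evaluations.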

Below we will optimise only the `C` parameter, which corresponds to the inverse strength of regularisation, so the hyperparameter grid in this case is one-dimensional. But if you fancy, you may want to tune more of them (can you find what would make sense to tune?). To scan the grid we will wrap the model into a `GridSearchCV()` class, which will automatically take care of splitting, training, testing and aggregating the scores. And by the way, for that purpose we need to use only the train set: the val/test sets stay untouched!

In [None]:
tuning_range = np.logspace(-4, 0, num=40)
tuning_range

In [None]:
param_grid = {'logisticregression__C': tuning_range}
optimizer = GridSearchCV(model, param_grid, scoring='average_precision', cv=3) 

In [None]:
# as simple as that
optimizer.fit(X_train, y_train)

In [None]:
optimizer.cv_results_

In [None]:
# exercise: plot the results and find the best parameter
# ~~~ your code goes here ~~~

In [None]:
# question: how much do you gain with tuning this hyperparameter?
# ~~~ your answer goes here ~~~

As a final remark, for this small study we had only one hyperparameter to optimise, but as this number grows, we need to test more and more combinations of values. For example, for 1 parameter with 10 values we need to run 10 CV trainings, for 2 parameters we need to test $10\times10=100$ combinations, for 3 parameters $10^3$ and so on. At some point one will have to wait for a _really_ long time (or use more CPU power) until all the values are checked. In that case one may opt for a [Randomised Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), where not the whole grid is scanned, but just several random points of it, which might actually be a [more efficient approach](https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) in general. In particular, this can be useful when one has "noisy" hyperparameters, optimising which doesn't really bring any improvement in performance. In the illustration below one can notice that random search might help to get closer to the global maximum.

<img src="images/random_search.png" alt="drawing" width="600"/>
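To get a feeling for the numbers, here is a small sketch that counts the points of a full grid and draws a handful of random ones instead. The three hyperparameters `a`, `b`, `c` are hypothetical placeholders, and the budget of 20 points is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# three hypothetical hyperparameters with 10 candidate values each
values = np.logspace(-4, 0, num=10).tolist()
grid = {"a": values, "b": values, "c": values}

# full grid search would have to try every combination
n_grid_points = len(ParameterGrid(grid))

# random search only tries a fixed budget of random combinations
random_points = list(ParameterSampler(grid, n_iter=20, random_state=10))

print(n_grid_points, len(random_points))
```

Each of these points still costs one full CV training, so the budget of the random search is something you set according to how long you are willing to wait.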

#### **evaluating results** 

Good, so now that we have trained the model and optimised the hyperparameters, it's time to check its performance on the validation set. Remember that we used logistic regression for this problem, so we can predict not just labels, but a continuous score for belonging to the "passed" class! We will use the nice `classification_report` tool from sklearn, which shows a summary of several metrics in one table. We also encourage you to look at other metrics which we haven't mentioned here to study their behaviour.
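A quick note on where this continuous score lives: `predict_proba` returns one column per class, ordered as in the `classes_` attribute of the fitted classifier. A minimal sketch on made-up one-feature data (toy values, assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy one-feature data, made up for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

print(clf.classes_)  # the column order of predict_proba follows this array
positive_column = list(clf.classes_).index(1)
scores = proba[:, positive_column]  # score of belonging to class 1
print(scores)
```

Looking the column up through `classes_` rather than hard-coding it is a cheap way to make sure you are not reading the score of the wrong class.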

In [None]:
y_train_proba = optimizer.predict_proba(X_train)[:,1] # be careful which column in the prediction matrix to take
y_val_proba = optimizer.predict_proba(X_val)[:,1] # be careful which column in the prediction matrix to take

In [None]:
plt.hist(y_train_proba, density=True, histtype='step', label='train')
plt.hist(y_val_proba, density=True, histtype='step', label='val')
plt.grid()
plt.legend()
plt.show()

In [None]:
# question: what do you think, can the model's output above be interpreted as a probability of the "passed" class?
# ~~~ your answer goes here ~~~

In [None]:
classification_report(y_val, y_val_proba, target_names=['failed', 'passed'])

Oops, it doesn't work with raw scores and needs the output to be thresholded, as expected. Here we are gonna be lazy once again and pick the median of the train distribution as a threshold. 

In [None]:
# exercise: obtain labels by cutting on the model's output at median value
# ~~~ your code goes here ~~~

y_pred = None

In [None]:
print(classification_report(y_val, y_pred, target_names=['failed', 'passed']))

In [None]:
# question: are these numbers good?
# ~~~ your answer goes here ~~~

OK, let's get some more metrics

In [None]:
plot_roc_curve(optimizer, X_val, y_val) 
plt.show()

In [None]:
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_train, roc_auc_val

In [None]:
f1_train = f1_score(y_train, y_train_proba > np.median(y_train_proba))
f1_val = f1_score(y_val, y_val_proba > np.median(y_train_proba))
f1_train, f1_val

In [None]:
average_precision_train = average_precision_score(y_train, y_train_proba)
average_precision_val = average_precision_score(y_val, y_val_proba)
average_precision_train, average_precision_val

But wait, isn't it weird that we have noticeably **better performance on the val set?** This is very suspicious, because normally the model learns the training data very well and thus shows better performance on it compared to unseen data. 

If you remember, in the beginning we checked the fraction of "passed" students in the train/test/val sets and there was a sizeable difference.

In [None]:
print(np.bincount(y)/y.shape[0])

In [None]:
print(np.bincount(y_train)/y_train.shape[0])

In [None]:
print(np.bincount(y_val)/y_val.shape[0])

In [None]:
print(np.bincount(y_test)/y_test.shape[0])

And there is a $\sim50\%$ difference between them! This is quite a lot and leads to a **bias in evaluating the performance**. Ideally, we need to have train and test data sampled from the same distribution - in other words, they should have the same properties. In our case this principle is violated by a mismatch in class ratios: the model sees one population during training (with a class ratio of 0.17, close to the 0.18 of the initial data), but then we test it on a set with a different population (with a class ratio of 0.27). So in this case the training set correctly reflects the general population of students, but the validation one doesn't. Therefore, due to this mismatch we've got test results which can't be trusted - and if we hadn't checked for that, we could have largely overestimated our model's performance! So let's try to fix this problem.

### **Class balance**

In [None]:
plt.hist(y_train)
plt.grid()
plt.show()

As was mentioned earlier, here we are dealing with a classification task where the classes are unbalanced, that is, there are significantly more samples of one class ("failed") compared to the other ("passed"). As was noted during the lecture, in the extreme case this can result in a trained model which always predicts that a sample belongs to the majority class, making the model too biased. Note that you should distinguish this problem from the one we outlined above: here we have an imbalance between classes, while there we had a mismatch of the class ratio between the train and val sets. So here we are talking about an imbalance producing a **bias of the model** (in training), whilst before it was a mismatch producing a **bias in the estimate of the model's performance** (in testing).

We already checked that we have a bias in evaluating the performance, but didn't really check whether imbalanced classes introduce biases in the model, so let's do this. There are two main strategies to train the model fairly in the imbalanced classes setting: **resampling and weighting**.

#### **resampling**

Resampling means that we add or remove samples from the dataset until the balance between classes is reached. It comprises two approaches: **oversampling and undersampling**. In the former we add more samples to the minority class until it is even with the majority one; in the latter, on the contrary, we remove some samples from the majority class to balance it with the minority one. For obvious reasons we can't do undersampling here because we would end up with a tiny number of samples to train on (like 80 students) - we can't afford to throw away data like this. 

So we will oversample, and there are several options to consider. Firstly, one can try to find more data - which is not an option for us. Secondly, we can simply add the samples that we already have in our dataset once again - yes, we will end up having copies in the training data, but this is fine. And lastly, we can generate new samples which are similar to the ones in the training data. There are several ways to do this and you can check out [this library](https://github.com/scikit-learn-contrib/imbalanced-learn) for a nice overview and implementation of them. In this exercise we will just oversample "passed" students until we have the same number as "failed" students. 

It is important to note that we are going to do this **only for the training set** - we shouldn't touch the test set and distort it! **The whole idea of the test data is that it should represent the data as we expect it to occur in the real world**, so that we test the model on what we anticipate seeing. Balancing the test set ourselves would introduce a significant change in this data and would therefore bias our understanding of the model. 

In [None]:
# exercise: oversample the training data by randomly selecting and adding samples of positive class (passed students) 
# so that the number of passed and failed students is equal

# ~~~ your code goes here ~~~

In [None]:
X_train_upsampled = None
y_train_upsampled = None

Once we've balanced the training set, let's go ahead with training and testing. Note, that the testing data, as we outlined above, isn't changed and is still different from the training one in terms of class distribution. This problem will be approached in the next section, while here we are trying to investigate whether the class imbalance itself can cause problems during the training. Also we should mention, that for the sake of fair comparison, we will train on oversampled data, but test on the nominal one.

In [None]:
optimizer.fit(X_train_upsampled, y_train_upsampled)

In [None]:
y_train_proba = optimizer.predict_proba(X_train)[:,1] 
y_val_proba = optimizer.predict_proba(X_val)[:,1]

In [None]:
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_train, roc_auc_val

In [None]:
average_precision_train = average_precision_score(y_train, y_train_proba)
average_precision_val = average_precision_score(y_val, y_val_proba)
average_precision_train, average_precision_val

OK, average precision decreased a bit for train and val and there is still a large difference (no surprise, right?) between train and val scores.

#### **weighting**

The next approach to avoid poor training in the context of class imbalance is to do **reweighting**. Basically, the loss for an optimisation task is an average of each sample's individual contribution to it - and by default they all come with an equal weight of 1. We can modify this by assigning **weights** $w_i$ to every sample:

$\mathcal{L} \propto -\frac{1}{N} \sum_{i=0}^{N-1} w_i\cdot[y_i \log (p_i) + (1 - y_i) \log (1 - p_i)]$
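As a sanity check of how the weights enter, here is a small numpy sketch of this weighted loss, written with the conventional minus sign so that lower is better. The labels, probabilities and the weight of 3 are made-up toy values, and `weighted_log_loss` is a helper defined only for this illustration:

```python
import numpy as np

def weighted_log_loss(y, p, w):
    # average of per-sample cross-entropy terms, each scaled by its weight
    return -np.mean(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))

# toy labels and predicted probabilities, made up for illustration
y_toy = np.array([0, 0, 0, 1])
p_toy = np.array([0.1, 0.2, 0.3, 0.4])

uniform = np.ones_like(p_toy)                # every sample weighs 1
upweighted = np.where(y_toy == 1, 3.0, 1.0)  # the single positive weighs 3

loss_uniform = weighted_log_loss(y_toy, p_toy, uniform)
loss_upweighted = weighted_log_loss(y_toy, p_toy, upweighted)
print(loss_uniform, loss_upweighted)
```

The poorly predicted positive sample contributes three times more to the second loss, so an optimiser minimising it is pushed harder to get that sample right.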

We can compensate for the class imbalance by setting a _per-class weight_ as (nice chance to check your understanding of numpy😉):

$w = \frac{\text{n_samples}}{\text{n_classes * np.bincount(y)}}$ 


Essentially, samples of the minority class will receive higher weights (inversely proportional to the class frequency) and the majority class will be downweighted. This means that **misclassification of the minority class is penalised more heavily by the loss function** than that of the majority class. This trick should cause the model to be trained in such a way that it distinguishes the minority class better. 
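The per-class weight formula can be evaluated directly with numpy and compared against sklearn's own helper. The labels below are a made-up toy vector (8 "failed", 2 "passed"), not the homework data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 8 + [1] * 2)  # toy labels: 8 "failed", 2 "passed"

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# np.bincount counts the samples of each class, so we get one weight per class
manual_weights = n_samples / (n_classes * np.bincount(y_toy))

sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_toy), y=y_toy
)
print(manual_weights, sklearn_weights)
```

With these weights the two classes contribute equally to the loss in total, even though one of them has four times fewer samples.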

In [None]:
# question: do you expect resampling and reweighting to yield equivalent results?
# ~~~ your answer goes here ~~~

In [None]:
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=10, class_weight='balanced')) # reweighting is done with a "balanced" option 
optimizer = GridSearchCV(model, param_grid, scoring='average_precision', cv=3) 

In [None]:
optimizer.fit(X_train, y_train)

In [None]:
y_train_proba = optimizer.predict_proba(X_train)[:,1] 
y_val_proba = optimizer.predict_proba(X_val)[:,1]

In [None]:
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_train, roc_auc_val

In [None]:
average_precision_train = average_precision_score(y_train, y_train_proba)
average_precision_val = average_precision_score(y_val, y_val_proba)
average_precision_train, average_precision_val

Good, so we've just found out that the model isn't really sensitive to a class imbalance in the training set - using oversampling and weighting hasn't changed the results significantly. The model seems to be robust in the training to imbalanced classes. With this in mind let's finally balance train and val sets. 

#### **stratifying**

And [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) will help us with that. Instead of just randomly splitting the data into train and test sets, it does so in a cleverer way, **preserving the ratio of classes** in each split. 

<img src="images/Stratified_sampling.png" alt="drawing" width="400"/>

In sklearn you can do this by passing `stratify=targets` to the `train_test_split()` function, where `targets` is the vector of class labels on which you want to stratify.
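Here is a minimal sketch of what `stratify` changes, on made-up imbalanced labels (80 zeros and 20 ones, assumed for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy imbalanced labels: 80 zeros and 20 ones
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# plain random split: the fraction of ones may differ between the parts
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)
# stratified split: the fraction of ones is preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=10
)

print("plain split:     ", y_tr_plain.mean(), y_te_plain.mean())
print("stratified split:", y_tr_strat.mean(), y_te_strat.mean())
```

Since the labels are 0/1, the mean of each part is simply its fraction of positive samples, which makes the comparison a one-liner.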

OK, now let's split the data in a stratified fashion with the 0.6/0.2/0.2 proportion, check the class ratio once again as we did before and retrain+retest the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=True, stratify=y, random_state=10) 
X_val, X_test, y_val, y_test  = train_test_split(X_test, y_test, test_size=0.5, shuffle=True, stratify=y_test, random_state=10)

In [None]:
print(np.bincount(y_train)/y_train.shape[0])

In [None]:
print(np.bincount(y_val)/y_val.shape[0])

In [None]:
print(np.bincount(y_test)/y_test.shape[0])

In [None]:
# question: is it consistent now?
# ~~~ your answer goes here ~~~

In [None]:
model = make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced', random_state=10))
optimizer = GridSearchCV(model, param_grid, scoring='average_precision', cv=3) 

In [None]:
optimizer.fit(X_train, y_train)

In [None]:
y_train_proba = optimizer.predict_proba(X_train)[:,1] 
y_val_proba = optimizer.predict_proba(X_val)[:,1]

In [None]:
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_train, roc_auc_val

In [None]:
average_precision_train = average_precision_score(y_train, y_train_proba)
average_precision_val = average_precision_score(y_val, y_val_proba)
average_precision_train, average_precision_val

In [None]:
# question: how have metrics changed?
# ~~~ your answer goes here ~~~

Finally, for the sake of illustration, let's plot the **ROC and Precision-Recall curves** - since an approximation of the area under the latter was used as the main metric throughout this notebook. For that purpose you can use the `roc_curve` and `precision_recall_curve` functions in sklearn. There is one comment though: for the ROC curve you already know that the diagonal line represents a random classifier, the very basic and useless sort of classifier which we don't want to have. So here goes:

In [None]:
# exercise: can you find and plot the same line on a Precision-Recall curve?
# ~~~ your code goes here ~~~

baseline_value = None

In [None]:
fpr, tpr, _ = roc_curve(y_val, y_val_proba)
precisions, recalls, _ = precision_recall_curve(y_val, y_val_proba)

In [None]:
plt.title('ROC')
plt.plot([0, 1], [0, 1], linestyle='--', label='random')
plt.plot(fpr, tpr, marker='.', label='model')
plt.grid()
plt.legend()
plt.show()

In [None]:
plt.title('PR curve')
plt.plot([0, 1], [baseline_value, baseline_value], linestyle='--', label='random')
plt.plot(recalls, precisions, marker='.', label='model')
plt.xlabel('recall')
plt.ylabel('precision')
plt.grid()
plt.legend()
plt.show()

In [None]:
# question: do they show good performance?
# ~~~ your answer goes here ~~~

### ***Bagging 'em all**

If you've reached this far in the notebook, then you must've learned and practised quite a lot, so well done! Moreover, you've built a solid analysis pipeline which you can now try to improve, and that is quite an achievement. So in this section we will just show you some cool stuff: let's do a bit of **bagging** here.

This topic will be covered extensively in the next lecture on _Trees_, so here we'll just give you some teasers. Basically, you previously saw that if we vary the random seed of the train/test splitter, the data fluctuates to a certain extent and thus the model and its predictions fluctuate too. Also remember that we don't have much data, so the model predictions are more sensitive to these statistical fluctuations. What if we try to make the model more robust by **combining several "fluctuating" models together**? That is, we take N _bootstrapped_ datasets from the original one, train N models and then combine their predictions. It feels like this should average out and smooth the fluctuations and therefore improve the predictions' stability!

Ah, and back to **bootstrapping** - this is an extremely simple and yet powerful statistical technique which you should definitely know. Basically, you take your dataset with, say, M objects and _pick M objects with replacement_ - and here you have one more dataset😏. Do this as many times as you want and you get many datasets from the same population, from which you can obtain not just a point estimate of whatever statistic you are interested in, but a whole distribution of it.
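
To make this concrete, here is a minimal sketch of bootstrapping with plain numpy. The data is synthetic and the names (`data`, `boot_means`, `n_bootstraps`) are ours, introduced only for illustration; the point is just to show how resampling with replacement turns a single point estimate into a distribution.

```python
import numpy as np

rng = np.random.default_rng(10)

# toy "dataset" with M observations (synthetic, not the students data)
data = rng.normal(loc=10.0, scale=3.0, size=200)
M = len(data)

n_bootstraps = 1000
boot_means = np.empty(n_bootstraps)
for i in range(n_bootstraps):
    # pick M indices with replacement: some objects repeat, some are left out
    idx = rng.integers(0, M, size=M)
    boot_means[i] = data[idx].mean()

# the point estimate and the spread we get "for free" from bootstrapping
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"point estimate of the mean: {data.mean():.3f}")
print(f"bootstrap std of the mean:  {boot_means.std():.3f}")
print(f"95% percentile interval:    [{low:.3f}, {high:.3f}]")
```

The same loop works for any statistic: replace `.mean()` with a median, a metric, or even a fitted model weight.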

Sounds like a plan, so let's pick as a model logistic regression with Lasso regularisation, bootstrap more datasets, train several models and see whether this will bring an improvement upon the baseline.
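
Before we hand things over to `BaggingClassifier`, here is a rough sketch of what bagging does by hand. It runs on a synthetic dataset from `make_classification` and uses a plain `LogisticRegression` as the base model, both of which are our assumptions for illustration (the cells below use the students data and an `SGDClassifier` instead). The names `X_toy`, `models` and `bagged_proba` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)

# synthetic binary classification data (not the students dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10, stratify=y_toy
)

n_models = 25
models = []
for _ in range(n_models):
    # one bootstrapped dataset: len(X_tr) rows drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    models.append(LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]))

# each row holds one model's predicted probability of class 1
all_proba = np.stack([m.predict_proba(X_te)[:, 1] for m in models])

# combine the ensemble by averaging the predicted probabilities
bagged_proba = all_proba.mean(axis=0)

print("spread between single models:", all_proba.std(axis=0).mean())
print("first bagged probabilities:  ", bagged_proba[:5])
```

Averaging probabilities is only one way of combining models, but it is the simplest one and it is enough to see how individual fluctuations get smoothed out.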

In [None]:
# number of bootstrapped datasets - try increasing it as much as you want
n_bootstrapping = 200

In [None]:
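# logistic regression with Lasso (L1) regularisation, trained with SGD
# note: scikit-learn >= 1.1 names this loss 'log_loss' ('log' was removed in 1.3)
log_lasso = SGDClassifier(loss='log', penalty='l1', class_weight='balanced', random_state=10, max_iter=2000)
# model = make_pipeline(StandardScaler(), log_lasso)

# bagging them all
# note: scikit-learn >= 1.2 renames `base_estimator` to `estimator` (also in the parameter name below)
model = make_pipeline(StandardScaler(), BaggingClassifier(base_estimator=log_lasso, n_estimators=n_bootstrapping))

# regularisation strength of each base model, scanned on a log grid
param = 'baggingclassifier__base_estimator__alpha'
param_grid = {param: np.logspace(-4, 0, num=10)}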
optimizer = GridSearchCV(model, param_grid, scoring='average_precision', cv=3) 

In [None]:
# this might take a while
# fit on the training split only, so that the validation and test sets stay unseen
optimizer.fit(X_train, y_train)

In [None]:
param_values = optimizer.cv_results_[f'param_{param}'].data.astype('float32')
test_scores = optimizer.cv_results_['mean_test_score'] # "test" meaning the test fold of X_train
test_scores_up = test_scores + optimizer.cv_results_['std_test_score']
test_scores_down = test_scores - optimizer.cv_results_['std_test_score']

In [None]:
plt.semilogx(param_values, test_scores, color='salmon')
plt.fill_between(param_values, test_scores_down, test_scores_up, alpha=0.1)
plt.xlabel('alpha')
plt.ylabel('average precision (CV mean ± std)')
plt.grid()
plt.show()

In [None]:
y_train_proba = optimizer.predict_proba(X_train)[:,1] 
y_val_proba = optimizer.predict_proba(X_val)[:,1]

In [None]:
roc_auc_train = roc_auc_score(y_train, y_train_proba)
roc_auc_val = roc_auc_score(y_val, y_val_proba)
roc_auc_train, roc_auc_val

In [None]:
average_precision_train = average_precision_score(y_train, y_train_proba)
average_precision_val = average_precision_score(y_val, y_val_proba)
average_precision_train, average_precision_val

In [None]:
fpr, tpr, _ = roc_curve(y_val, y_val_proba)
precisions, recalls, _ = precision_recall_curve(y_val, y_val_proba)

In [None]:
plt.title('PR curve')
plt.plot([0, 1], [baseline_value, baseline_value], linestyle='--', label='random')
plt.plot(recalls, precisions, marker='.', label='model')
plt.xlabel('recall')
plt.ylabel('precision')
plt.grid()
plt.legend()
plt.show()

Ah, we almost forgot, do you remember about the **test set**? We've never touched it thus far exactly for the sake of the very final testing, and the time has come.

In [None]:
y_test_proba = optimizer.predict_proba(X_test)[:,1]

In [None]:
roc_auc_test = roc_auc_score(y_test, y_test_proba)
average_precision_test = average_precision_score(y_test, y_test_proba)

In [None]:
print(f'ROC AUC: {roc_auc_test}')
print(f'average precision: {average_precision_test}')

In [None]:
plt.hist(y_train_proba, density=True, histtype='step', label='train')
plt.hist(y_val_proba, density=True, histtype='step', label='val')
plt.hist(y_test_proba, density=True, histtype='step', label='test')
plt.grid()
plt.legend()
plt.show()

In [None]:
# question: so, what are your thoughts about the outcomes of bagging and a check of the test set?

### ***Interpreting results**

And finally, let's have a look at the non-zero weights of the bagged logistic regression with Lasso regularisation (and now you can even understand what it means😎). Since we have many models in the ensemble, we can aggregate all the weights per feature and then calculate their mean and, what is cool, their variance (thanks to bootstrapping!).

By the way, remember that the weights are not as directly interpretable as they were for the linear models we looked at in the seminar. It is not the model's output that is approximated by a linear combination of features, but the [log odds](https://en.wikipedia.org/wiki/Logit) - so the weights in this case show how particular features affect the log odds, not the model output directly.
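
Here is a tiny numerical sketch of that statement. The weights, bias and feature vector are made up (they are not taken from the fitted model); it only illustrates that the linear part equals the log odds, and that increasing a feature by 1 multiplies the odds by `exp(weight)` rather than adding `weight` to the probability.

```python
import numpy as np

def sigmoid(z):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights, bias and one object (illustration only)
w = np.array([0.8, -1.2, 0.0])
b = 0.3
x = np.array([1.0, 0.5, 2.0])

z = w @ x + b                    # linear combination of features
p = sigmoid(z)                   # model output: probability of class 1
log_odds = np.log(p / (1 - p))   # recovers the linear combination

# increase the first feature by one unit and compare the odds
x_shifted = x.copy()
x_shifted[0] += 1.0
p_shifted = sigmoid(w @ x_shifted + b)
odds_ratio = (p_shifted / (1 - p_shifted)) / (p / (1 - p))

print(f"linear part: {z:.3f}, log odds: {log_odds:.3f}")
print(f"odds ratio: {odds_ratio:.3f}, exp(w[0]): {np.exp(w[0]):.3f}")
print(f"change in probability: {p_shifted - p:.3f} (not equal to w[0] = {w[0]})")
```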

In [None]:
fdict = {}

In [None]:
# collecting non-zero features with their weights for each model in ensemble
for estimator in optimizer.best_estimator_[1].estimators_:
    weights = estimator.coef_[0]
    i_nonzero_ws = weights != 0
    for weight, feature in zip(weights[i_nonzero_ws], X_train.columns[i_nonzero_ws]):
        fdict.setdefault(feature, []).append(weight)

In [None]:
# deriving mean and standard deviation (square root of the variance) per feature
fdict_mean = {key: (np.mean(values), np.std(values)) for key, values in fdict.items()}

In [None]:
sorted(fdict_mean.items(), key=lambda item: item[1], reverse=True)

Well, the spread is still too large for most of the features to conclude even whether the weight is positive or negative. To be on the safe side we would need to perform some statistical tests for that, but this goes well beyond the scope of the homework. So let's just pick the features whose mean weight is at least $1\sigma$ away from 0.
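
As a hint, the selection rule could look roughly like the sketch below. The numbers are made up and the names `toy_weights` and `selected` are hypothetical; with the real ensemble you would apply the same condition to the means and standard deviations you derived above.

```python
import numpy as np

# hypothetical weights of three features across four models (illustration only)
toy_weights = {
    'failures': [-0.9, -1.1, -0.8, -1.0],
    'studytime': [0.2, -0.1, 0.3, -0.2],
    'goout': [-0.4, -0.5, -0.3, -0.6],
}

selected = {}
for feature, values in toy_weights.items():
    mean, std = np.mean(values), np.std(values)
    # keep the feature only if its mean weight is more than 1 sigma away from 0
    if abs(mean) > std:
        selected[feature] = (mean, std)

for feature, (mean, std) in selected.items():
    print(f"{feature}: {mean:.2f} ± {std:.2f}")
```

This is a rule of thumb rather than a proper statistical test, so treat the selected features as candidates to look at, not as a final verdict.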

In [None]:
# question: what are they? does it make sense that they contribute to the model prediction? can you find more interesting insights into the model and data?
# ~~~ room for your thoughts ~~~

And yes, feel free to explore and interpret the model yourself now. For example, you could find the student with the highest predicted score and try to figure out why the model picked them as the best. If you were indeed the head of a selection committee, you would definitely ask yourself this question: **does this model really make sense so that I can use it?** And, generally speaking, this is exactly the question which you, as a researcher, should always ask yourself throughout your analysis.

### **Closing remarks**

Great, we've guided you through a lot of important things in this homework and hopefully you enjoyed the journey! Let's summarise what we've covered in this notebook:

* New preprocessing methods: categorical features, shuffling
* Learning curves
* Choice of metric
* Cross-validation and hyperparameter optimisation
* Class imbalance and how to treat it 
* Bagging of models

We should say that in our opinion this **"Introduction to ML" module is the most important in our course since it lays the very foundation of Machine Learning** and opens up a new perspective on looking at data. In the following classes we will just be expanding it to more complex types of models and data. Therefore we wanted to communicate this new way of approaching problems to you as clearly and simply as possible through the lecture, seminar and this homework, and hopefully you've grasped some of it. Don't worry if something is not yet clear! Take your time and go through the notebooks and lecture once again (and also ping us in the chat with your questions!). Google this and that, be curious and explore what we didn't ask you to do, so that in the end you feel more confident with the topics we've covered so far.

In the following classes we will take off into a fascinating world of new Machine Learning models, and the **next topic** we are going to introduce you to is **Decision Trees and ensembles** thereof. This is definitely going to be exciting and we look forward to seeing you in the following classes!

~~~