<img src = "https://datasciencebocconi.github.io/Images/Other/logoBocconi.png">
$\newcommand{\bb}{\boldsymbol{\beta}}$
$\DeclareMathOperator{\Gau}{\mathcal{N}}$
$\newcommand{\bphi}{\boldsymbol \phi}$
$\newcommand{\bpi}{\boldsymbol \pi}$
$\newcommand{\bx}{\boldsymbol{x}}$
$\newcommand{\by}{\boldsymbol{y}}$
$\newcommand{\bmu}{\boldsymbol{\mu}}$
$\newcommand{\bS}{\boldsymbol{\Sigma}}$
$\newcommand{\whbb}{\widehat{\bb}}$
$\newcommand{\hf}{\hat{f}}$
$\newcommand{\hy}{\hat{y}}$
$\newcommand{\tf}{\tilde{f}}$
$\newcommand{\ybar}{\overline{y}}$
$\newcommand{\E}{\mathbb{E}}$
$\newcommand{\Var}{Var}$
$\newcommand{\Cov}{Cov}$
$\newcommand{\Cor}{Cor}$

# Classification

In some ways there is very little to say about classification: it is like regression but with a categorical response. As before there will be a learning function $f(\bx)$, but the model that relates it to the response has to change to reflect the different nature of the response. We can use the same set of tools, although the loss functions change and, as a result, so do our tools for measuring performance. 

On the other hand, this is a very common prediction problem and it is worth understanding some of its intricacies in more depth.

We will start with binary classification - the response is one of two categories. The labelling of the categories is arbitrary and any sensible methodology should not rely on how these categories are coded numerically. We will stick to 0/1 coding for the two categories. This is mathematically more convenient for the methods we will use here. For other approaches to classification -1/1 might be more convenient. 


## Summary

In this module we build predictive models for categorical outputs, following the same paradigm as in regression but changing the distribution that relates the output $y$ to the learning function $f(\bx)$. We also adapt appropriately the model performance criteria and introduce concepts such as the misclassification probability, the ROC curve and AUC score. We discuss the problem of class imbalance and some solutions. We also contrast regression with Bayes classifiers. We show how to predict multicategorical and ordinal output in a simple framework.

## The spam dataset

This is a classic dataset for binary classification. It can be found in the UCI repository

http://www.ics.uci.edu/~mlearn/MLRepository.html

and specifically here:

https://archive.ics.uci.edu/ml/datasets/spambase

and it is analyzed in a few different ways in the book by Hastie et al. 

I have created a version with a subset of the variables.

In [None]:
import pandas as pd
%matplotlib inline
# This is a Python module that contains plotting commands
import matplotlib.pyplot as plt
# the following provides further tools for plotting with dfs
import seaborn as sns 
from sklearn.metrics import confusion_matrix
import numpy as np

In [None]:
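# Import auxiliary functions (plot_confusion_matrix, get_auc)

import requests
url = "https://datasciencebocconi.github.io/Code/helper_functions.py"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
# Save the helper script locally so that it can be imported as a module
with open('downloaded_script.py', 'wb') as fh:
    fh.write(r.content)
from downloaded_script import *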

In [None]:
spam = pd.read_csv("https://datasciencebocconi.github.io/Data/spam_small_train.csv")
spam.head()

In [None]:
spam.info()

Recall: always check the documentation to understand what the data is about. From the UCI:

"48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string."

In [None]:
# Lets explore some basic aspects of this dataset

spam["class"].value_counts().plot(kind="bar")

In [None]:
spam.groupby("class").boxplot(rot=90)

The previous exploratory analysis is analogous to screening by correlations in regression. We could, for example, test nonparametrically whether each feature has the same distribution in the two classes, although such tests only assess marginal dependence and not interactions. 

## Probabilistic classification using logistic regression

We will follow a route very analogous to regression. There will be a learning function and we will take it to be linear in parameters and features: 

$$ f(\bx_i,\bb) = \bb^T \bphi_i$$

We will also model the distribution of the response and relate this distribution to the linear predictor.

There are no assumptions to make in this case about the distribution of the response: there is only one distribution to describe binary variables, the *Bernoulli*. Our probabilistic binary classification models take: 

$$ y_i \sim Bernoulli(\pi_i) \iff p(y_i=1) = \pi_i \quad p(y_i=0) = 1-\pi_i$$

The remaining assumption we will make is how to relate $\pi_i$ to the linear predictor. We basically need a transformation that maps real values to $[0,1]$. The *logistic transformation* is one option and leads to the so-called **logistic regression**:

\begin{equation}
y_i \sim Bernoulli\left({1 \over 1 + e^{-f(\bx_i,\bb)}}\right)
\end{equation}

The negative sign is just for interpretation: large values of $f(\bx_i,\bb)$ make $y_i=1$ more likely.

In [None]:
x = np.arange(-6,6,0.01)
y = 1/(1+np.exp(-x))
plt.plot(x,y)
plt.axvline(x=0.0,color='r', linestyle='--')
plt.axhline(y=0.5,color='r', linestyle='--')


Basic math shows that in this model 

$$- \log p(y_i | \bx_i) =  \log (1+e^{f(\bx_i,\bb)}) - y_i f(\bx_i,\bb) $$ 

hence the loss function becomes 

$$L(\bb) = \sum_i \left[ \log (1+e^{f(\bx_i,\bb)}) - y_i f(\bx_i,\bb) \right]$$
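As a quick numerical sanity check, the sketch below compares the compact expression $\log(1+e^{f}) - y f$ with the Bernoulli negative log-probability computed directly from $\pi = 1/(1+e^{-f})$. The function name `bernoulli_nll` and the numbers are made up for illustration.

```python
import numpy as np

def bernoulli_nll(y, f):
    """Negative log-probability of y in {0,1} when p(y=1) = 1/(1+exp(-f))."""
    # log(1 + e^f) computed stably as logaddexp(0, f)
    return np.logaddexp(0.0, f) - y * f

f = np.array([-3.0, -0.5, 0.0, 2.0])   # illustrative values of the linear predictor
y = np.array([0, 1, 0, 1])
pi = 1.0 / (1.0 + np.exp(-f))

# Direct evaluation of -log p(y) for a Bernoulli(pi) variable
direct = -(y * np.log(pi) + (1 - y) * np.log(1 - pi))
print(np.allclose(bernoulli_nll(y, f), direct))
```

Using `np.logaddexp` avoids overflow when $f$ is large, which is why it is preferred here over writing `np.log(1 + np.exp(f))`.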

This is also **convex** and can be optimized efficiently (this is true for other *link* functions too, e.g. probit). A standard way to do this is Fisher scoring, a variation of Newton-Raphson: these are **iterative optimization algorithms** that use the gradient and the curvature of the loss. They work in the same way for the whole family of **generalized linear models**.
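To make the iteration concrete, here is a minimal Newton-Raphson sketch for the logistic loss on simulated data (for the logistic link, Fisher scoring and Newton-Raphson coincide). The function name `fit_logistic_newton`, the simulated data and the tolerance are assumptions for illustration, not what `sklearn` does internally.

```python
import numpy as np

def fit_logistic_newton(Phi, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for logistic regression; Phi is the (n, p) feature matrix."""
    beta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-Phi @ beta))
        grad = Phi.T @ (pi - y)                # gradient of the loss
        W = pi * (1.0 - pi)                    # Bernoulli variances
        hess = Phi.T @ (Phi * W[:, None])      # Hessian of the loss
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data: intercept plus one Gaussian feature
rng = np.random.default_rng(0)
n = 500
Phi = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-Phi @ true_beta)))

beta_hat = fit_logistic_newton(Phi, y)
# At the optimum the gradient of the loss should be (numerically) zero
grad_at_opt = Phi.T @ (1.0 / (1.0 + np.exp(-Phi @ beta_hat)) - y)
print(beta_hat, np.max(np.abs(grad_at_opt)))
```

The sketch has no safeguards (step halving, regularization), so it can fail when the classes are perfectly separable.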

We now fit the model to the spam dataset. For the moment we will use the original variables as features

Here we focus on predictive modelling, and for this `LogisticRegression` is a reasonable choice. For inference it does not provide enough detail. This function actually optimizes a penalized likelihood loss function, the default being a ridge penalty. In `sklearn` the argument `C` is the inverse of the regularization strength, so we set it to a large value to keep the effect of the penalty minimal for the time being ($n$ here is very large relative to $p$).

Note that we can feed the `fit` method directly with dataframes and series. 

In [None]:
F = spam.drop('class', axis = 1)
y = spam["class"]

# to print stats
feature_names = F.columns
class_labels = ["email","spam"] # meant to represent 0 and 1

In [None]:
# importing the relevant sklearn tools
from sklearn.linear_model import LogisticRegression

lregr = LogisticRegression(penalty='l2', C=100.0, fit_intercept=True, 
                           intercept_scaling=1, solver='liblinear', max_iter=500)

# Fitting logistic regression

lregr.fit(F,y)      

# Compute the predicted probabilities in-sample

insample_pred = lregr.predict_proba(F)

insample_pred_res = spam.copy()

insample_pred_res["pi_i"] = insample_pred[:,1]

print(insample_pred_res.iloc[1:15,-1])

#### Exercise 1

This is a visualization exercise but very helpful for understanding the output of the model. Create a figure that contains the boxplots of predictive probabilities for each of the two classes

In [None]:
#Your code here 



## Turning probabilities into class prediction

The *probabilistic classifier* we use explicitly accounts for misclassification errors. The algorithm returns probabilities and these also reflect the classification uncertainty.

Recall that if $y \sim Bernoulli(\pi)$ then $\Var(y) = \pi (1- \pi)$, which is large for $\pi \approx 1/2$.

This is often the most useful type of classification output. 

It might also be desirable to turn class probabilities into class predictions, e.g., to report medical tests (pregnant/not pregnant). The spam example is also a good one: a decision has to be taken for each email and it is convenient to have the algorithm produce class predictions. We will denote those by $\hy_i$.

With class prediction there will be **misclassification errors**: false positives and false negatives.

Before we go into details, let's see what the default operations in `sklearn` do for us. We can compute the following quantities in sample, out of sample or by cross-validation: basically all the discussion about the use of samples for evaluation in regression applies here too. For the moment we experiment with in-sample calculations.

In [None]:
# this is the predict method in the LogisticRegression object
y_hat = lregr.predict(F)
y_hat[0:10]

In [None]:
#confusion matrix
cm =  confusion_matrix(y_pred=y_hat, y_true=y, labels=[0,1])
print (cm)
# Plotting confusion matrix (custom help function)

plot_confusion_matrix(cm, class_labels) 

What `.predict` does is apply a threshold $c$: if $\pi_i \geq c$ it sets $\hy_i = 1$, otherwise it sets $\hy_i = 0$. 

#### Exercise 2

Write code that takes as input the class probabilities returned by the method `predict_proba` and a threshold $c$, and returns class prediction 1 where the probability is larger than $c$ and 0 otherwise. Use a mix of experimentation and intuition to identify the value of $c$ that `.predict` uses. 

In [None]:
#Your code here 


Depending on the application the losses can be very different (remember the medical example). Let 

$$L(true,predict)$$ 

be a loss function that computes the cost of misclassification - this is something that requires context information not in the data. We take 

$$L(0,0) = L(1,1) = 0 \quad L(1,0) > 0 \quad L(0,1) > 0$$

A little math shows that if $L(1,0)/L(0,1) = C$ the optimal decision is: 

$$\hy_i = 1 \iff \pi_i > {1 \over 1+C}$$
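One way to see this is to compare the expected loss of the two possible decisions for a case with $p(y_i = 1) = \pi_i$. The following is a short derivation under the loss function stated above:

$$
\begin{aligned}
\E[L \mid \hy_i = 1] &= (1-\pi_i)\, L(0,1), \qquad \E[L \mid \hy_i = 0] = \pi_i\, L(1,0), \\
\hy_i = 1 \text{ is optimal} &\iff (1-\pi_i)\, L(0,1) < \pi_i\, L(1,0) \iff \pi_i > {L(0,1) \over L(0,1) + L(1,0)} = {1 \over 1+C}
\end{aligned}
$$

For example, if a false negative is taken to be 9 times as costly as a false positive, then $C = 9$ and we predict $\hy_i = 1$ whenever $\pi_i > 0.1$; equal losses ($C=1$) give the threshold $0.5$.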

### Misclassification rate

$$p[ y \neq \hy]$$ 

is known as the misclassification probability. This can be estimated from data very much the same way as $R^2$: in sample (which can be negatively biased), out of sample (which is data intensive), by cross-validation (which is computationally intensive) etc; also by using more advanced math, such as *concentration inequalities*

This number in isolation means pretty much nothing. Consider for example the (all too common) situation where we wish to predict $y$ in a population such that $p[y=1] = 0.001$. Then classifying everyone as $\hy = 0$ yields a misclassification probability of 0.001, but the algorithm is never able to identify the class of interest. The misclassification rate can still be useful in comparisons. 
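The point can be checked on a toy population. The sketch below builds a hypothetical label vector with exactly one positive per 1000 cases and evaluates the classifier that always predicts 0.

```python
import numpy as np

# Hypothetical population: exactly 1 positive per 1000 cases
y = np.zeros(100_000, dtype=int)
y[::1000] = 1
y_hat = np.zeros_like(y)                   # classify everyone as 0

misclassification = np.mean(y != y_hat)    # estimate of p[y != y_hat]
sensitivity = np.mean(y_hat[y == 1] == 1)  # estimate of p[y_hat = 1 | y = 1]
print(misclassification, sensitivity)
```

The misclassification rate is tiny, yet none of the positive cases is detected, which is what the conditional probability $p[\hy = 1 | y = 1]$ reveals.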

There are other performance metrics - related to conditional probabilities - e.g. $p[\hy =1 | y=1]$ etc - such as specificity/sensitivity for qualifying the performance of a classifier

### ROC curve and AUC

A common tool to assess the performance of a probabilistic classifier is the ROC curve. Each point on this curve is (an estimate of) the pair

$$
(p[\hy = 1 | y = 0] , p[\hy = 1 | y = 1]) 
$$

For a given threshold $c$ we can estimate these probabilities from the confusion matrix - obtained in the best way we can - simply by computing the associated frequencies

As we vary the threshold the confusion matrices change and the frequencies too: varying $c$ from 0 to 1 we obtain the ROC curve - read the figure from right to left
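A minimal sketch of this construction, assuming predictions are made as $\hy = 1$ if and only if the probability is at least $c$. The helper name `roc_points` and the toy numbers are illustrative; this is not the course's `get_auc` function.

```python
import numpy as np

def roc_points(y_true, prob, thresholds):
    """Return arrays (fpr, tpr), one point per threshold c, using y_hat = 1 iff prob >= c."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    fpr, tpr = [], []
    for c in thresholds:
        y_hat = (prob >= c).astype(int)
        fpr.append(np.mean(y_hat[y_true == 0]))  # frequency of y_hat = 1 among y = 0
        tpr.append(np.mean(y_hat[y_true == 1]))  # frequency of y_hat = 1 among y = 1
    return np.array(fpr), np.array(tpr)

# Toy labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
prob = np.array([0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9])
fpr, tpr = roc_points(y_true, prob, thresholds=[0.0, 0.5, 1.01])
print(fpr, tpr)
```

With $c = 0$ every case is predicted as 1, giving the top-right corner of the plot, and with $c$ above 1 no case is, giving the bottom-left corner; intermediate thresholds trace the curve in between.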

We do this now using a customized function, applied to a test set of the spam data.


In [None]:
# read the test data, extract the info and create predictions

spam_test = pd.read_csv("https://datasciencebocconi.github.io/Data/spam_small_test.csv")
Ftest = spam_test.drop("class",axis=1)
ytest = spam_test["class"]

test_pred = lregr.predict_proba(Ftest)

# Custom plot function
get_auc(ytest, test_pred, class_labels, column=1, plot=True) # Helper function

AUC is the Area Under (the ROC) Curve.

It has, though, an interesting and solid statistical interpretation. It can be directly related to a non-parametric test, the **Mann-Whitney** test of the hypothesis that the following two samples come from the same distribution: 

$$
\textrm{class 1 probs }: \{\pi_i: y_i = 1\} \quad \textrm{class 0 probs }: \{\pi_i: y_i = 0\}
$$

You would expect that for a decent classifier the two samples come from different distributions and the distribution of the "class 1 probs" is stochastically greater. If the two distributions were identical we would obtain the green-dashed ROC curve.

Consider a contest between the two samples: we compare each element of the first with every element of the second and record how many times it was at least as big. AUC is the frequency of won contests!
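The contest interpretation can be sketched directly. The code below, on simulated probabilities, counts won contests (with the common convention, assumed here, that a tie counts as half a win) and compares the result with the value obtained from the Mann-Whitney $U$ statistic computed from ranks. Function names and simulated data are illustrative.

```python
import numpy as np

def auc_by_contests(prob, y):
    """Fraction of (class 1, class 0) pairs won by the class 1 probability; ties count 1/2."""
    p1, p0 = prob[y == 1], prob[y == 0]
    wins = (p1[:, None] > p0[None, :]).sum()
    ties = (p1[:, None] == p0[None, :]).sum()
    return (wins + 0.5 * ties) / (len(p1) * len(p0))

def auc_by_ranks(prob, y):
    """Same quantity from the Mann-Whitney U statistic (assumes no ties)."""
    ranks = np.empty(len(prob))
    ranks[np.argsort(prob)] = np.arange(1, len(prob) + 1)
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    u = ranks[y == 1].sum() - n1 * (n1 + 1) / 2
    return u / (n1 * n0)

# Simulated labels and probabilities; class 1 tends to receive larger values
rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=200)
prob = rng.beta(2 + 2 * y, 2)

print(auc_by_contests(prob, y), auc_by_ranks(prob, y))
```

The pairwise version costs $n_1 \times n_0$ comparisons, while the rank version only needs one sort, which matters for large samples.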

For some `sklearn` tools check 

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

#### Exercise 3

As a sanity check, draw the curve for estimates using in-sample predictions.

What do you expect the in-sample AUC to be, higher or lower than the out-of-sample AUC? 

In [None]:
# Your code here



## Class imbalance

### Context

Typically, we want to predict the incidence of a rare event:  rare disease, an accident, a default, exceptional performance, etc, on the basis of measured characteristics. 

If we collect data *prospectively* we will end up with a sample with very little representation of the class we are particularly interested in. Hence it will be hard to learn the function that separates "rare" from "common" on the basis of the measured characteristics; indeed classifiers will not mind overpredicting the "common". 

Let's look at an example first.

### Can we predict good wine ? 

This is another standard dataset available from the UCI repository. It is really about **ordinal regression** - or maybe **multiclass classification**. Here we will for illustration consider (at the expense of losing information) a transformation of the response 


In [None]:
wine_df = pd.read_csv("https://datasciencebocconi.github.io/Data/wine.csv")

In [None]:
wine_df.sample(5)

In [None]:
wine_df.quality.value_counts(sort=False).plot(kind="bar")

#### Exercise 4

Write code to create a new column in the dataframe, named "bin_quality", that takes the value 1 if quality is greater than or equal to 8 and 0 otherwise.

In [None]:
#Your code here



This is class imbalance for you!

In [None]:
# prepare the data
X = wine_df.drop(['quality',"bin_quality"], axis =1)
y = wine_df.bin_quality


In [None]:
from sklearn.model_selection import cross_val_predict

# I should have standardised features but here I am using tiny regularisation so it should not matter
model = LogisticRegression(C=100, solver='liblinear') 

In [None]:
# AUC
y_probabilities = cross_val_predict(model, X, y, method='predict_proba', cv = 5)
get_auc(y, y_probabilities, ["Bad/Average Wine", "Great Wine"], column=1, plot=True) # Help function

In [None]:
#### Accuracy
from sklearn.metrics import accuracy_score

y_pred = cross_val_predict(model, X, y, method='predict', cv = 5)
print ("Accuracy (cross-validated): ", accuracy_score(y, y_pred))

####  Classification report
from sklearn.metrics import classification_report

print (classification_report(y, y_pred))

**The operation was successful but the patient died!**

As the following exercise demonstrates, none of the good wines are classified as such.

#### Exercise 5

Plot the confusion matrix to verify the results. Are we predicting the 'good wines'?

In [None]:
# Your code here 


## Some approaches to class imbalance

Roughly speaking there are two ways to try and do better: 

+ Retrospective study & bias correction: this operates at the *design* stage, that is the way data are collected in the first place 

+ Resampling and use of synthetic data (& bias correction): this works with the data at hand

### Retrospective studies

Collect data not as a *representative* sample from the population of interest, but oversample the rare class; for example from your medical database choose a sample of $n/2$ patients with the rare disease and $n/2$ without

On the basis of this *biased (non-representative)* sample train a probabilistic classifier, e.g., logistic regression 

The estimated learning function is biased too: it will predict a much larger probability of class 1 than is appropriate for the population of interest. 

There are at least two ways to do proper inference with the non-representative sample. 

Let's say that $q(y)$ are the probabilities of the two classes in the population of interest, i.e., $q(1), q(0)$, and that $r(y)$ are the probabilities with which we have sampled the two classes. 

+ One is to change the loss function: instead of using the log-likelihood, which is an *arithmetic average* of individual log-densities, use a *weighted average*

  Some math shows that the following is a valid choice: 
  
  $$ L(\bb) = \sum_i {q(y_i) \over r(y_i)} \left [\log (1+e^{f(\bx_i,\bb)}) - y_i f(\bx_i,\bb)\right ] $$
  
+ Another is to run the analysis with the biased sample and the log-likelihood loss, but then rescale the estimated probabilities

  Some math shows that if $\pi(y_i)$ are the probabilities estimated by the model, they should be changed to 
  
  $$ { \pi(y_i) {q(y_i) \over r(y_i)} \over \pi(1) {q(1) \over r(1)} + \pi(0) {q(0) \over r(0)}}$$
  
  Let's understand the effect of this weighting: consider very small and very large predicted $\pi_i$.
  
Both approaches require that $q(y)$ is known - but this is often easy enough 
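The two corrections can be sketched in a few lines. The names `weighted_nll` and `rescale`, and the values $q(1) = 0.02$, $r(1) = 0.5$, are assumptions for illustration; the per-observation loss is written as $\log(1+e^{f}) - y f$, the negative Bernoulli log-probability under the logistic model.

```python
import numpy as np

def weighted_nll(beta, Phi, y, q1, r1):
    """Weighted logistic loss with weights q(y_i)/r(y_i)."""
    f = Phi @ beta
    w = np.where(y == 1, q1 / r1, (1 - q1) / (1 - r1))
    return np.sum(w * (np.logaddexp(0.0, f) - y * f))

def rescale(pi, q1, r1):
    """Map a class 1 probability estimated on the biased sample to the population scale."""
    num = pi * q1 / r1
    return num / (num + (1 - pi) * (1 - q1) / (1 - r1))

# Effect of the rescaling for small, middling and large predicted probabilities
pi = np.array([0.01, 0.5, 0.99])
corrected = rescale(pi, q1=0.02, r1=0.5)
print(corrected)
```

With a balanced sample ($r(1) = 0.5$) a predicted probability of $0.5$ is mapped back to the population frequency $q(1)$, and every prediction is pulled towards the common class.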

### Resampling approaches

Having no control of the data collection protocol, one can try and sharpen the distinction between the two classes by either under-representing (*undersampling*) the popular class, or over-representing (*oversampling*) the rare class, or both

Oversampling might just be randomly replicating rare cases or creating synthetic rare cases that "look like" the rare cases in the sample. In this respect doing some modelling on the $\bx_i$s can help, and there are links to the *Bayes classifiers* we mention later and to so-called *generative models*. SMOTE (Synthetic Minority Oversampling Technique), for example, creates such synthetic cases.

Any of these approaches needs to be combined with one of the bias-correction approaches developed above

Some tools for this are provided by the package `imbalanced-learn` (imported as `imblearn`), which is built to work with `sklearn`: its submodule `over_sampling` contains, e.g., `RandomOverSampler` and `SMOTE`. `imblearn` requires installing - do not do this now! A result analogous to oversampling is obtained with the `class_weight="balanced"` argument of `LogisticRegression`, which weights each observation inversely proportionally to the frequency of its class.
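To fix ideas, random oversampling can be sketched with `numpy` alone. This is a minimal illustration assuming the rare class is coded 1 and is the smaller one; the helper `random_oversample` is hypothetical and is not the `imblearn` implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Replicate rows of the rare class (label 1), drawn with replacement, until classes are balanced."""
    rng = np.random.default_rng(seed)
    rare = np.flatnonzero(y == 1)
    n_extra = np.sum(y == 0) - len(rare)       # how many replicates are needed
    extra = rng.choice(rare, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Toy data: 8 common cases and 2 rare ones
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))
```

The replicated rows carry no new information, so probabilities estimated from the balanced sample still need the bias correction discussed above.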

In our wine prediction example we have also done some arbitrary dichotomization of the response: no good reason why we should not pay for it!!

Let's try this one for the wine data:

In [None]:
model = LogisticRegression(C=100, solver='liblinear')
model.fit(X,y)
y_prob = model.predict_proba(X)
model = LogisticRegression(C=100, class_weight='balanced', solver='liblinear')
model.fit(X,y)
y_prob_imb = model.predict_proba(X)
wine_df["pred_prob"] = y_prob[:,1] 
wine_df["pred_prob_imb"] = y_prob_imb[:,1]

Note that these are not proper estimates of the class probability: we need to correct for the biased sample by rescaling the predicted probability 

In [None]:
## the correction factor: 

q1 = y.sum()/len(y)
r1 = 0.5

def reweight(pi,q1=0.5,r1=0.5):
    r0 = 1-r1
    q0 = 1-q1
    tot = pi*(q1/r1)+(1-pi)*(q0/r0)
    w = pi*(q1/r1)
    w /= tot
    return w

In [None]:
# correcting for biased sample

wine_df["pred_prob_imb_corr"] = wine_df["pred_prob_imb"].apply(reweight,args=(q1,r1))

In [None]:
 
wine_df.boxplot(["pred_prob","pred_prob_imb","pred_prob_imb_corr"],by="quality",layout=(1,3), figsize=(15,5))

In [None]:
y_pred_new = [1 if pi >= 0.5 else 0 for pi in wine_df["pred_prob_imb_corr"] ]

#confusion matrix
cm =  confusion_matrix(y_pred=y_pred_new, y_true=y, labels=[0,1])
# Plotting confusion matrix (custom help function)
plot_confusion_matrix(cm, ["Bad/Average Wine", "Great Wine"]) 



## Regularization

Everything that applies to regression applies to logistic regression: ridge, lasso etc penalties can be used for high-dimensional feature spaces and when they are convex the resultant loss function is too and the algorithms are efficient 

Indeed, the `LogisticRegression` function includes by default a ridge ($\ell_2$) penalty.

## Multiclass classification

Often the output variable is categorical with many options, not just two, e.g. the wine dataset. 

It is more convenient to encode such multiclass output using the 1-hot encoding, i.e., each output is a vector of 0s with a single 1 in the chosen class: 

$$\by_i = (y_{i1},\ldots,y_{iK})^T \quad y_{ij} \in \{0,1\}, \quad \sum_j y_{ij} = 1$$


where the labels $1,2,\ldots,K$ are arbitrary encodings for the different output categories and any sensible analysis should not depend on their values. 

Binary classification takes $K=2$. In modern applications it is common to have $K \sim 100$ or even $K \sim 1000$ (e.g. *recommendation*).

The most direct extension of the binary regression to multiclass is as follows. We take

$$\by_i \sim Categorical(\pi_{i1},\ldots,\pi_{iK})$$

with density 

$$p(\by_i) = \prod_{j} \pi_{ij}^{y_{ij}}$$

which is a clever way to simply say that the probability that the $j$th category is chosen is $\pi_{ij}$. 
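A small sketch of the encoding and of the density formula, using made-up labels (coded here $0,\ldots,K-1$, as is customary in Python) and made-up probabilities:

```python
import numpy as np

labels = np.array([2, 0, 1, 2])          # hypothetical categories coded 0..K-1
K = 3
Y = np.eye(K, dtype=int)[labels]         # one-hot rows: a single 1 in the chosen class

# Hypothetical class probabilities, one row per observation
pi = np.array([[0.2, 0.3, 0.5],
               [0.6, 0.3, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.3, 0.4]])

density = np.prod(pi ** Y, axis=1)       # prod_j pi_ij^{y_ij}
print(Y)
print(density)
```

Since all exponents but one are zero, the product simply picks out the probability of the observed category.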

### Multinomial-logistic regression

We need to map the probabilities $\pi_{ij}$ to the input $\bx_i$. One way that collapses to logistic regression when $K=2$ is to take:

$$\log{\pi_{ij} \over \pi_{i1}} = f(\bx_i,\bb_j)$$

Note that implicit in this definition is that the odds of choosing $j$ vs 1 do not depend on what other options exist: this is known as the *independence of irrelevant alternatives* assumption and is criticized in certain contexts. 

The model definition implies

$$\pi_{ij} = {e^{f(\bx_i,\bb_j)} \over 1+ \sum_{k>1} e^{f(\bx_i,\bb_k)}}$$

for $j>1$, with $\pi_{i1} = 1/(1+ \sum_{k>1} e^{f(\bx_i,\bb_k)})$ for the pivot category; you should check that for $K=2$ this is precisely logistic regression. The pivot category is taken above to be 1, but any other can be chosen - this only affects the interpretation of the results. Note also that we have different parameters $\bb_j$ for each category $j$. 
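The mapping from linear predictors to probabilities can be sketched as follows. The function `multinomial_logit_probs` is a hypothetical helper that takes the values $f(\bx_i,\bb_j)$ of the non-pivot categories as given; the check at the end is the $K=2$ case.

```python
import numpy as np

def multinomial_logit_probs(F):
    """F has shape (n, K-1): column j holds f(x_i, beta_j) for the non-pivot categories.
    Returns an (n, K) array of probabilities with the pivot category first."""
    expF = np.exp(F)
    denom = 1.0 + expF.sum(axis=1, keepdims=True)
    return np.hstack([1.0 / denom, expF / denom])

# K = 2: a single linear predictor per observation
f = np.array([[-1.0], [0.0], [2.0]])
probs = multinomial_logit_probs(f)
logistic = 1.0 / (1.0 + np.exp(-f[:, 0]))
print(np.allclose(probs[:, 1], logistic))
```

For large values of the linear predictors one would subtract the row maximum before exponentiating to avoid overflow; this is omitted to keep the sketch close to the formula.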

This model is known by a multitude of names...

https://en.wikipedia.org/wiki/Multinomial_logistic_regression

The negative log-likelihood is immediately obtained and is **convex** in the $\bb_j$s, hence we have a nice learning problem to solve. In fact, an old clever trick can be used to turn learning this model into a Poisson GLM; this is known as the *Poisson trick* in the Stats community. This is particularly important for large $K$.

### Reanalyzing the wine data

`LogisticRegression` in `sklearn` does in fact also fit the multinomial-logistic regression model. Let's try this out.

In [None]:
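# multinomial-logistic regression for wine
# (with a multiclass target newton-cg fits the multinomial model by default since sklearn 0.22;
#  the `multi_class` argument is deprecated in recent versions, so it is not passed here)
model = LogisticRegression(C=100, solver="newton-cg", max_iter=10000)
# drop the response columns and the predicted probabilities added to wine_df earlier
X = wine_df.drop(columns=["quality", "bin_quality", "pred_prob", "pred_prob_imb", "pred_prob_imb_corr"],
                 errors="ignore")
y = wine_df.quality
model.fit(X,y)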

In [None]:
y_probabilities = cross_val_predict(model, X, y, method='predict_proba', cv = 5)

In [None]:
y_pred = cross_val_predict(model, X, y, cv = 5)
cm =  confusion_matrix(y_pred=y_pred, y_true=y, labels=[3,4,5,6,7,8])
# Plotting confusion matrix (custom help function)
plot_confusion_matrix(cm, ["3","4","5","6","7","8"]) 

Notice the shrinkage towards the 5!

## Bayes classifiers

An entirely different approach to classification, which turns out to be related (hence included here), is to build a **joint model** for

$$
p(\bx,y)
$$

as opposed to the conditional 

$$
p(y | \bx)
$$

that the previous approach models. Because the joint model gives a recipe for generating data, this approach is referred to as **generative**. 

Bayes classifiers come up with a joint model by decomposing the joint probabilities: 

$$p(\bx,y) = p(y) p(\bx | y)$$

Focusing on binary classification, one learns 

1. $p(y=1)$ - this is trivial
2. $p(\bx | y=1)$ and $p(\bx | y=0)$; the two conditional distributions

With these, predictive probabilities for the class are obtained using the **Bayes theorem** (hence the name) 

$$p(y =1 | \bx) = { p(\bx | y=1) p(y =1) \over p(\bx | y=1) p(y =1)  + p(\bx | y=0) p(y =0)} $$ 

The challenge is to come up with tractable and useful models for $p(\bx |y)$, which is non-trivial since we typically have tens, hundreds or thousands of features.

Two off-the-shelf options are: 

1. $\bx | y = i \sim \Gau(\bmu_i, \bS)$, for $i=0,1$. The resultant classifier is known as **Fisher discriminant analysis**
2. $p(\bx | y = i)  = \prod_{j=1}^p p_{i,j}(x_j)$ for $\bx = (x_1,\ldots,x_p)^T$; the resultant classifier is known as **naive Bayes**

It is well known, e.g. since Efron (1975, JASA), that under this Gaussian model the class probabilities $p(y=1 | \bx)$ have exactly the logistic regression form, with coefficients determined by $\bmu_0$, $\bmu_1$, $\bS$ and $p(y=1)$; the article compares the statistical efficiency of the two approaches.
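The link between the Gaussian Bayes classifier and the logistic form can be checked numerically. The sketch below uses a hypothetical one-dimensional example with a common standard deviation; all numbers are made up, and the coefficients `a` and `b` follow from taking the log of the ratio of the two terms in Bayes theorem.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a N(mu, sigma^2) variable at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical class means, common standard deviation and p(y=1)
mu0, mu1, sigma, p1 = 0.0, 2.0, 1.5, 0.3
x = np.linspace(-4, 6, 11)

# Posterior p(y=1 | x) from Bayes theorem
num = normal_pdf(x, mu1, sigma) * p1
posterior = num / (num + normal_pdf(x, mu0, sigma) * (1 - p1))

# The same posterior written as a logistic function of a + b*x
b = (mu1 - mu0) / sigma**2
a = np.log(p1 / (1 - p1)) - (mu1**2 - mu0**2) / (2 * sigma**2)
logistic = 1.0 / (1.0 + np.exp(-(a + b * x)))
print(np.allclose(posterior, logistic))
```

The quadratic terms in $x$ cancel only because the two classes share the same variance; with class-specific variances the log-odds would be quadratic in $x$.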

Naive Bayes is not functionally related to logistic regression but theory exists about their relative performance. In a nutshell, naive Bayes classifiers reach near-optimal performance with smaller sample sizes but their optimal performance is worse than that of logistic regression

Still, subject matter knowledge and more clever modelling on $p(\bx|y)$ can improve the performance of Bayes classifiers

Let's revisit an analysis we did with the spam dataset and appreciate the implicit Bayes-classifier feel to it!

In [None]:
spam.boxplot(column=["word_freq_you","word_freq_hp","char_freq_!"], by = "class")

## Some hints for the practitioners

+ Predictive modelling with categorical output can be done using `LogisticRegression`; by default this class includes regularization, hence it is straightforward to include hundreds of features
+ When the output is multicategorical it is far more sensible to build a model directly for the original output than to first turn it (more or less arbitrarily) into a binary output. Even if a binary prediction is preferred for commercial/interpretability purposes, it is preferable to turn the multicategorical prediction into a binary one rather than to dichotomize the output and then build a model
+ A probabilistic classifier returns probabilities for the possible categories. It is not the data scientist's job to turn those into class predictions. This should be done in conjunction with the user of the analysis and the consideration of losses. Once the losses have been specified a simple formula gives the optimal conversion
+ Class imbalance is an issue relevant for many or even most classification applications. Using `class_weight="balanced"` within `LogisticRegression` gives a possible improvement, akin to oversampling - make sure to correct the probabilities it returns, since they are not estimates of the class probabilities in the population!

## References

Hastie, T., Tibshirani, R., Friedman, J., 2009. *Elements of Statistical Learning*. 2nd Edition. Section 4.4; More advanced 3.8,3.9.  https://web.stanford.edu/~hastie/ElemStatLearn/

Bishop, C.M. *Pattern recognition and machine learning*. Sections 4.2, 4.3