Goal:
- Read through all sections in hundredpage ml book
- Read through relevant sections in ISL
- Read through relevant sections in applied predictive modelling
- Read through relevant sections in python datascience handbook
- Incorporate additional relevant sources

# Introduction and definitions

**Why do we estimate f?**

- Purpose of ml is often to infer a function f that describes the relationship between target and features.

- Can estimate f for (1) prediction or (2) inference or both.

**How do we estimate f?**

- 3 basic approaches: parametric (assume shape of f and estimate coefficients), non-parametric (also estimate shape of f), semi-parametric.

- Accuracy depends on (1) irreducible error (variance of error term) and (2) reducible error (appropriateness of our model and its assumptions)

**Ingredients to statistical learning**

- Specify aim

- Gather and pre-process data

- Select a learning algorithm

- Apply learning algorithm to data to build (and train) a model

- Assess model performance (by testing) and tune model

- Make predictions

**Types of learning**

- Supervised (labelled examples)

- Unsupervised (unlabelled examples)

- Semi-supervised (labelled and unlabelled examples)

- Reinforcement

**The trade-off between prediction accuracy and model interpretability**

- Linear models vs non-linear models (hard to interpret models often predict more accurately).

**Supervised vs. unsupervised learning**

**Regression vs classification problems**

- Classification assigns categorical labels, regression real-valued labels to unlabelled examples.

**Hyperparameters vs parameters**

- Hyperparameters determine how the algorithm works and are set by the researcher.

- Parameters determine the shape of the model and are estimated. 

**Model-based vs instance-based learning**

- Model-based algorithms estimate parameters and then use them to make predictions (i.e. you can discard the training data once you have the estimates); instance-based algorithms (e.g. KNN) use the entire training dataset.

**Deep vs shallow learning**

- Shallow learning algorithms learn parameters directly from features, deep learning algorithms (deep neural networks) learn them from the output of preceding layers.

# Learning algorithms

## Linear regression

Estimation

Evaluating fit

- RSE 

- R2

Residual standard error

- The RSE is 3.25, which implies that our estimates deviate by about 3.25 from the actual values on average (this would be true even if we knew the population parameters, as the RSE is an estimate of the error standard deviation). Given the average value of sales, the percentage error is about 12 percent. Whether this is a lot or not depends on the application.

- Because the RSE is an absolute measure of lack of fit, expressed in units of y, it's not always easy to interpret whether a given RSE is small or large.
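A minimal sketch of how the RSE and the percentage error are computed for a simple linear regression. The toy data and the helper names `fit_simple_ols` and `rse` are made up for illustration; this is not the sales data discussed above.

```python
import math

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rse(x, y, n_predictors=1):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    b0, b1 = fit_simple_ols(x, y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(rss / (len(x) - n_predictors - 1))

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

error = rse(x, y)
mean_y = sum(y) / len(y)
print(f"RSE: {error:.3f}, relative to the mean of y: {error / mean_y:.1%}")
```

Dividing the RSE by the mean of y gives the percentage error mentioned above, which is easier to judge than the absolute value.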



$R^2$

- $R^2$ is a relative measure of fit: it measures the proportion of variance in y that the model can explain (and is thus always between 0 and 1). In the simple linear regression setting, $R^2 = Cor(X, Y)^2$.

- A low $R^2$ can mean that the true relationship is non-linear or that the error variance is very high or both. What constitutes "low" depends on the application.

- In the model above, more than 90 percent of the variation is explained by the set of explanatory variables.
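A short sketch, on made-up toy data, of how $R^2$ is computed as $1 - RSS/TSS$ and why it equals the squared correlation in the simple linear regression setting. All variable names are illustrative.

```python
import math

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares (TSS)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares fit of y = b0 + b1 * x.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual sum of squares of the fitted line.
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r2 = 1 - rss / syy                # share of the variance in y explained
cor = sxy / math.sqrt(sxx * syy)  # Pearson correlation of x and y

print(f"R^2 = {r2:.4f}, Cor(X, Y)^2 = {cor ** 2:.4f}")
```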

Questions of interest

Is at least one of the predictors useful in explaining y?

- To test whether at least one of the predictors is useful in predicting the response, we can look at the reported F statistic.
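For reference, the F statistic in the standard notation used in ISL. Here $n$ is the number of observations, $p$ the number of predictors in the full model, $RSS$ and $TSS$ the residual and total sums of squares of the full model, and $RSS_0$ the residual sum of squares of a restricted model that omits the $q$ predictors being tested:

$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}, \qquad F_q = \frac{(RSS_0 - RSS)/q}{RSS/(n - p - 1)}$$

The first form tests all $p$ predictors jointly, the second a subset of $q$ predictors. The numerator degrees of freedom are $p$ (or $q$), and the denominator degrees of freedom are $n - p - 1$.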

In [25]:
res.fvalue, res.f_pvalue

(570.2707036590942, 1.575227256092437e-96)

- To test whether a subset of parameters is useful, we can run our own F-test. To manually test for all parameters, we can use:

In [32]:
a = np.identity(len(res.params))[1:]
res.f_test(a)

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[570.27070366]]), p=1.5752272560925203e-96, df_denom=196, df_num=3>

- Which is equivalent to the statistic provided in the output. To test the (joint) usefulness of radio and newspaper, we can use:

In [40]:
a = np.identity(len(res.params))[[2, 3]]
res.f_test(a)

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[272.04067681]]), p=2.829486915701129e-57, df_denom=196, df_num=2>

- Remember: the F statistic is valuable because, if none of the predictors is related to the response, there is only a 5 percent chance that its p-value is below 0.05, irrespective of the number of predictors $p$. In contrast, each individual predictor's test has that probability, so for a large number of predictors, it's very likely that we observe significant ones solely due to chance.
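A quick back-of-the-envelope sketch of the multiple-testing point. It assumes, for simplicity, that the individual tests are independent and that no predictor is truly related to the response; the function name is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent true-null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests

# The chance of at least one spurious "significant" predictor grows quickly.
for n_predictors in (1, 10, 100):
    print(n_predictors, round(prob_any_false_positive(n_predictors), 3))
```

With a single predictor the chance is 5 percent, but with 100 unrelated predictors a spurious significant coefficient is almost certain, which is why the joint F-test is the safer first check.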

Are all of the predictors or only a subset useful in explaining y?

## Logistic regression

# Data preparation

# Model selection and assessment

- Basic and advanced practice in hundredpage ml book

## Assessing model accuracy

Measuring the quality of fit

- MSE for regressions.

- Error rate for classification.

- Beware of overfitting! Minimise MSE or error rate on the testing data, not on the training data.

- Overfitting definition: situation where a simpler model with a worse training score would have achieved a better testing score.
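A minimal sketch of overfitting, using a one-dimensional KNN regression on simulated data. The data-generating process (a sine curve plus Gaussian noise), the seed, and all function names are assumptions made for illustration.

```python
import math
import random

def knn_predict(x0, train, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, train, k):
    """Mean squared error of KNN predictions (fitted on train) over data."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Hypothetical data: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(0)
train = make_data(50, rng)
test = make_data(50, rng)

for k in (1, 5, 15):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(test, train, k):.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the training MSE is zero. The test MSE is what matters, and a smoother fit with a worse training score will typically do better there.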


The bias-variance trade-off

- MSE comprises (1) squared bias of estimate, (2) variance of estimate, and (3) variance of error. We want to minimise 1 and 2 (3 is fixed).

- Relationship between MSE and model complexity is U-shaped, because variance increases and bias decreases with complexity. We want to find optimal balance.
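Written out in the standard form given in ISL, the expected test MSE at a point $x_0$ decomposes as follows, where $\hat{f}$ is the estimated function and $\varepsilon$ the error term:

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \left[Bias(\hat{f}(x_0))\right]^2 + Var(\hat{f}(x_0)) + Var(\varepsilon)$$

The first two terms are the reducible error and the last term is the irreducible error, so the expected test MSE can never fall below $Var(\varepsilon)$.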

Classification setting

- Bayes classifier as unattainable benchmark (we don't know P(y|x)).

- KNN is one approach to estimate P(y|x); it then applies the Bayes classifier rule to the estimated probabilities.

- Intuition as for MSE: the testing error rate is U-shaped in K (lower K means a more flexible model).
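A minimal sketch of the KNN idea described above: estimate P(y|x) as the share of each class among the K nearest training points, then assign the most probable class. The one-dimensional toy data and the function names are hypothetical.

```python
from collections import Counter

def knn_class_probs(x0, train, k):
    """Estimate P(y | x0) as the class shares among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(x0, train, k):
    """Apply the Bayes classifier rule to the estimated probabilities."""
    probs = knn_class_probs(x0, train, k)
    return max(probs, key=probs.get)

# Hypothetical training set of (feature, label) pairs.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.2, 1), (3.0, 1), (3.5, 1), (4.0, 1)]

print(knn_class_probs(2.1, train, k=3))
print(knn_classify(2.1, train, k=3))
```

A small K follows the training data closely (flexible, high variance), while a large K averages over many neighbours (rigid, high bias).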

Confusion matrix

- Can be used to calculate precision, recall, specificity, accuracy, etc.

- True positive: there was an event and we predicted one.
- True negative: there was no event and we didn't predict one.
- False positive: there was no event but we predicted one (i.e. we predicted 1 instead of 0).
- False negative: there was an event but we didn't predict one (i.e. we predicted 0 instead of 1).
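The four cells and the metrics derived from them can be sketched as follows. The labels and predictions are made up, and `confusion_counts` is a hypothetical helper rather than a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = event, 0 = no event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)      # share of predicted events that were events
recall = tp / (tp + fn)         # share of events that were predicted
specificity = tn / (tn + fp)    # share of non-events correctly left alone
accuracy = (tp + tn) / len(y_true)

print(tp, tn, fp, fn)
print(precision, recall, specificity, accuracy)
```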

ROC curves and AUC

- Plots the trade-off between the false positive rate (x-axis) and the true positive rate (y-axis) - the trade-off between the false alarm rate and the hit rate.

- True positive rate $= \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{\text{True Positives}}{\text{All Events}}$.

- The true positive rate is also called sensitivity. We can think of it as the hit rate: the proportion of events that we correctly classified as such.

- False positive rate $= \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} = \frac{\text{False Positives}}{\text{All Non-Events}}$.

- The false positive rate is also called the false alarm rate: the proportion of cases incorrectly classified as an event among all cases that are not an event. It is also referred to as inverted specificity (i.e. $1 -$ specificity), where specificity $= \frac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}}$.

- The ROC curve is useful because it shows the false positive rate (x-axis) against the true positive rate (y-axis) for different thresholds and thus helps choose the best threshold, and because the AUC can be read as an overall model summary and thus allows us to compare different models.
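A minimal sketch of how the ROC curve and the AUC are built from predicted scores: lower the threshold step by step, record the false and true positive rates at each step, and integrate with the trapezoidal rule. The scores and labels are made up, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs as the threshold is lowered past each score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = event) and predicted scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve)
print(round(auc(curve), 4))
```

Each point on the curve corresponds to one threshold, so choosing a threshold amounts to picking the point with an acceptable mix of false alarms and hits.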



Precision-recall curves

- Precision and recall originate in the field of information retrieval (e.g. getting documents from a query) but are also useful in machine learning. 

- Precision $= \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} = \frac{\text{True Positives}}{\text{All Predicted Positives}}$

- Recall $= \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} = \frac{\text{True Positives}}{\text{All Events}}$

- In the context of document retrieval, precision is the useful documents as a proportion of all retrieved documents, recall the retrieved useful documents as a proportion of all available useful documents.

- We can think of precision as positive predictive power (how good is the model at predicting the positive class?), while recall is the same as sensitivity -- the proportion of all events that were successfully predicted.

- The precision-recall curve is particularly useful when we have much more no-event than event cases. In this case, we're often not interested much in correctly predicting no-events but focused on correctly predicting events. Because neither precision nor recall use true negatives in their calculations, they are well suited to this context ([paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432)).

- The precision-recall curve plots precision (y-axis) against recall (x-axis). An unskilled model is a horizontal line at a height equal to the proportion of events in the sample. We can use the area under the curve to compare models across different thresholds, or the F-score (the harmonic mean of precision and recall) at a specific threshold.

- Use precision-recall curves if classes are imbalanced, in which case ROC can be misleading (see example in last section [here](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/)).

- Shape of curve: recall never decreases as we lower the threshold (move from left to right in the graph) because we'll find more and more true positives ("putting ones from the false negative to the true positive bucket", which increases the numerator but leaves the denominator unchanged), but precision needn't fall monotonically, because we also increase false positives (both the numerator and the denominator can increase as we lower the threshold, and the movement of precision depends on which increases faster, which depends on the sequence of ordered predicted events). See [here](https://stats.stackexchange.com/a/183506) for a useful example.
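The shape described above can be checked on a small made-up example: as the threshold falls, recall never decreases, while precision moves up and down depending on whether the next case picked up is an event or not. The data and the function name are hypothetical.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) as the threshold is lowered."""
    n_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        # tp + fp >= 1 because the threshold is always one of the scores.
        points.append((threshold, tp / (tp + fp), tp / n_pos))
    return points

# Hypothetical labels (1 = event) and predicted scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for threshold, precision, recall in precision_recall_points(y_true, scores):
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

At the lowest threshold everything is predicted as an event, so precision equals the proportion of events in the sample, which is the height of the unskilled baseline.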






# Practical tips

- One way to get a sense of how non-linear the problem is, is to compare the MSE of a simple linear model and a more complex model. If the two are very close, then assuming linearity and using a simple model is preferable.

# Sources

- [Wikipedia](https://en.wikipedia.org/wiki/Logistic_function)

- [Roc](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/)