# Supervised learning

= the training data you feed to the algorithm includes the desired solutions (called labels).

## Linear Regression

Many fancy statistical learning approaches can be seen as generalizations or extensions of linear
regression. 

### Simple linear regression

A method for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y. A **regression of Y on X** is expressed as: 

$$ \large Y \approx \beta_{0} + \beta_{1}X $$

Where $\beta_{0}$ and $\beta_{1}$ are known as the model coefficients or parameters. 

To find $\hat{y}$ the prediction of $Y$ on the basis of $X = x$, the equation below is used: 

$$ \large \hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}}x $$

The hat symbol denotes the estimated value for an unknown parameter, or the predicted value of the response. The equation is known as the **least squares line** since $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ are usually found by **minimizing the least squares criterion**, i.e. by minimizing the **Residual Sum of Squares (RSS)**:

$$ RSS = (y_{1} - \hat{y_{1}})^2 + (y_{2} - \hat{y_{2}})^2 + \dots + (y_{n} - \hat{y_{n}})^2 = (y_{1} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{1})^2 + (y_{2} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{2})^2 + \dots + (y_{n} - \hat{\beta_{0}} - \hat{\beta_{1}}x_{n})^2 $$
 
The RSS is used as a measure of the discrepancy between the data and an estimation model. A small RSS indicates a tight fit of the model to the data.
    
The equation that is used to express the best linear approximation to the true relationship between X and Y is known as the **population regression line**:

$$ \large Y = \beta_{0} + \beta_{1}X + \epsilon$$

In real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.
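
As a minimal sketch of how the least squares line is obtained from a set of observations: for simple linear regression the RSS is minimized by the closed-form estimates $\hat{\beta_{1}} = \sum (x_{i} - \bar{x})(y_{i} - \bar{y}) / \sum (x_{i} - \bar{x})^2$ and $\hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x}$. The function names and the toy data below are assumptions made for illustration only.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0_hat, beta1_hat), the estimates that minimize the RSS."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def rss(xs, ys, beta0, beta1):
    """Residual sum of squares of the line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Toy observations (made up for illustration), roughly y = 1 + 2x plus noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

beta0_hat, beta1_hat = fit_simple_linear_regression(xs, ys)
print("least squares line:", beta0_hat, beta1_hat)
print("RSS of least squares line:", rss(xs, ys, beta0_hat, beta1_hat))
print("RSS of the line y = 1 + 2x:", rss(xs, ys, 1.0, 2.0))
```

Any other line, including the one used to generate the toy data, has an RSS at least as large as that of the least squares line.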

### Central limit theorem

Analogy between linear regression and estimation of the mean of a random variable (the analogy is based on bias): 

When we use the sample mean $\hat{\mu}$ to estimate the true population mean $\mu$, the estimate is unbiased in the sense that, on average, we expect $\hat{\mu}$ to equal $\mu$. For some samples, $\hat{\mu}$ might overestimate $\mu$ and for others it might underestimate $\mu$. *An unbiased estimator does not systematically over- or under-estimate the true parameter.*

The same is true for the least squares estimates $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$. The average of many least squares lines (each with its own $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$), each estimated from a separate data set, is pretty close to the true population regression line.
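
The unbiasedness of the least squares estimates can be checked with a small simulation. This is only a sketch: the "true" coefficients, the noise level, the sample size and the number of simulated data sets are all assumptions chosen for illustration.

```python
import random


def fit_line(xs, ys):
    """Least squares estimates (beta0_hat, beta1_hat) for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - beta1 * x_bar, beta1


random.seed(0)
TRUE_BETA0, TRUE_BETA1 = 2.0, 3.0  # assumed population regression line

intercepts, slopes = [], []
for _ in range(1000):  # 1000 separate data sets of 30 observations each
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [TRUE_BETA0 + TRUE_BETA1 * x + random.gauss(0, 1) for x in xs]
    b0, b1 = fit_line(xs, ys)
    intercepts.append(b0)
    slopes.append(b1)

mean_intercept = sum(intercepts) / len(intercepts)
mean_slope = sum(slopes) / len(slopes)
print(f"individual slopes range from {min(slopes):.2f} to {max(slopes):.2f}")
print(f"average line: y = {mean_intercept:.2f} + {mean_slope:.2f} x")
```

Individual fits land on both sides of the true slope, while their average sits very close to it.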

The **standard error** of $\hat{\mu}$ tells us the average amount that the estimate $\hat{\mu}$ differs from the actual value of $\mu$. The standard error is computed as follows:

$$ SE(\hat{\mu}) = \sqrt{VAR(\hat{\mu})} = \sqrt{\frac{\sigma^2}{n}} $$

Where $\sigma$ is the **standard deviation** of each observation $y_{i}$ (a representation of the spread of the data points) and $n$ is the sample size.

*The standard deviation measures the amount of variability, or dispersion, from the individual data values to the mean, while the standard error of the mean measures how far the sample mean (average) of the data is likely to be from the true population mean.* 
**The standard error of a statistic is the (estimated) standard deviation of its sampling distribution**, i.e. of the values the statistic would take across repeated samples.

The larger the sample size, the smaller the standard error because the statistic will approach the actual value. 

Similarly, the standard errors associated with $\hat{\beta_{0}}$ and $\hat{\beta_{1}}$ can easily be computed. Note that, in general, $\sigma^2$ is not known but can be estimated from the data. The estimate of $\sigma$ is known as the **residual standard error** and is given by $RSE = \sqrt{\frac{RSS}{(n-2)}}$

Standard errors can be used to:

- Compute **confidence intervals**. A 95% confidence interval is a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. *If we take repeated samples and construct the confidence interval for each sample, 95% of the intervals will contain the true unknown value of the parameter.* 

    For linear regression, an approximate 95% confidence interval for $\beta_{1}$ and $\beta_{0}$ is computed as follows:

    $$ \hat{\beta_{1}} \pm 2 \cdot SE(\hat{\beta_{1}})$$
    $$ \hat{\beta_{0}} \pm 2 \cdot SE(\hat{\beta_{0}})$$

    e.g. if the confidence interval for $\beta_{0}$ is $[6.130, 7.935]$ and for $\beta_{1}$ is $[4.2, 5.3]$ for a given feature, we can conclude that when the feature equals zero, the average response will fall between 6.130 and 7.935 units. In addition, for each one-unit increase in the feature, the response will increase by between 4.2 and 5.3 units on average.
    
    
- Perform **hypothesis testing** on the coefficients (null hyp: no relationship between X and Y (or $\beta_{1} = 0$), alt hyp: there is a relationship (or $\beta_{1} \neq 0$)). If $\hat{\beta_{1}}$ is sufficiently far from 0 we can be confident that $\beta_{1}$ is non-zero. 

    To assess what "sufficiently" means we look at the standard error of $\hat{\beta_{1}}$: when $SE(\hat{\beta_{1}})$ is small, even relatively small values of $\hat{\beta_{1}}$ can provide strong evidence that $\beta_{1} \neq 0$. On the other hand, when $SE(\hat{\beta_{1}})$ is large, $|\hat{\beta_{1}}|$ must be large in order for us to reject the null hypothesis.
    
    The **t-statistic**, $t = \hat{\beta_{1}} / SE(\hat{\beta_{1}})$, measures the number of standard deviations that $\hat{\beta_{1}}$ is away from 0. The **p-value** is the probability of observing a value equal to $|t|$ or larger (in absolute value), assuming that the null hypothesis is true.

    Large t-value --> $\hat{\beta_{1}}$ far away from 0 --> low p-value (since the probability of observing a value that far from 0, assuming the true value is 0, is low) --> reject the null hypothesis (there is an association between the predictor and the response)!
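
The quantities above can be computed by hand for simple linear regression. The sketch below assumes the standard formulas $SE(\hat{\beta_{1}})^2 = \sigma^2 / \sum (x_{i} - \bar{x})^2$ and $SE(\hat{\beta_{0}})^2 = \sigma^2 [1/n + \bar{x}^2 / \sum (x_{i} - \bar{x})^2]$, with $\sigma$ replaced by the RSE. The function name and the toy data are illustrative assumptions. The p-value is left out because it needs the CDF of the t-distribution, which Python's standard library does not provide.

```python
import math


def slope_inference(xs, ys):
    """Standard errors, approximate 95% CI and t-statistic for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    rse = math.sqrt(rss / (n - 2))  # estimate of sigma
    se_beta1 = rse / math.sqrt(sxx)
    se_beta0 = rse * math.sqrt(1 / n + x_bar ** 2 / sxx)
    return {
        "beta0": beta0,
        "beta1": beta1,
        "se_beta0": se_beta0,
        "se_beta1": se_beta1,
        # approximate 95% interval: estimate +/- 2 standard errors
        "ci_beta1": (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1),
        # number of standard errors between the estimate and 0
        "t_beta1": beta1 / se_beta1,
    }


# Toy observations (made up for illustration)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
result = slope_inference(xs, ys)
for name, value in result.items():
    print(name, value)
```

On this toy data the t-statistic is far from 0, so the null hypothesis $\beta_{1} = 0$ would be rejected.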

### Multiple linear regression

If you have p distinct predictors, the multiple linear regression model takes the form:

$$ \large Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p} +\epsilon$$

We interpret $\beta_{j}$ as the average effect on Y of a one unit increase in $X_{j}$, holding all other predictors fixed. All of the $\beta$ coefficients are estimated using the same least squares approach as for the simple linear regression. 

- **Obs.** The simple and multiple regression coefficients can be quite different:

    *Running a regression of shark attacks versus ice cream sales for data collected at a given beach community over a period of time would show a positive relationship. In reality, higher temperatures cause more people to visit the beach, which in turn results in more ice cream sales and more shark attacks. A multiple regression of shark attacks onto ice cream sales and temperature reveals that ice cream sales is no longer a significant predictor after adjusting for temperature.*


- **Hypothesis testing**: for multiple linear regression, the null hypothesis is $\beta_{1} = \beta_{2} = \dots = \beta_{p} = 0$ and the alternative hypothesis is that at least one $\beta_{j} \neq 0$. This hypothesis test is performed by computing the **F-statistic**. A large value of the F-statistic (well above 1) suggests that at least one of the features is related to the response. A p-value is associated with the F-stat. 

    Why look at the p-value associated with the F-stat and not the individual p-values associated with the t-stat? 
    
    *You would think that if any one of the p-values for the individual variables is very small, then at least one of the predictors is related to the response. However, this logic is flawed, especially when the number of predictors p is large! The F-statistic adjusts for the number of predictors but the individual p-values don't.* 
    
    When is the F-stat reliable?
    
    *The F-stat is only reliable when n>p (low dimensional setting where n = number obs and p = number features). In addition, it is more reliable when n is large than when n is small.*
    

- **Variable selection**: Once you have concluded, on the basis of the p-value associated with the F-stat, that at least one predictor is associated with the response, how do you find which predictor?

    Ideally, you try out different models, each containing a different subset of the predictors. However, there are $2^p$ models that contain a subset of the predictors, so trying out every possible subset is often infeasible. There are three classical approaches for variable selection:
    1. **Forward Selection**: We begin with a model with an intercept but no predictors (a null model) and we fit p simple linear regressions. We add to the model the variable that results in the lowest RSS. We then add to that model the variable that results in the lowest RSS for the new two-variable model. This approach is continued until some stopping rule is satisfied. *Unlike backward selection, forward selection can always be used. However, forward selection is a greedy approach, and might include variables early that later become redundant.*
    2. **Backward Selection**: We begin with all the variables in the model and remove the variable with the largest p-value. The new (p-1)-variable model is fit and again, the variable with the largest p-value is removed. This procedure continues until a stopping rule is reached. *Backward selection cannot be used if p>n.*
    3. **Mixed Selection**: We start with no variables in the model, and as with forward selection, we add the variable that provides the best fit. We continue to add variables one-by-one. If at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.
    
    
- **Terminology**: 
    The **least squares plane** (1) is only an estimate for the true **population regression plane** (2):
    $$ \large \hat{Y} = \hat{\beta_{0}} + \hat{\beta_{1}}X_{1} + \hat{\beta_{2}}X_{2} + ... + \hat{\beta_{p}}X_{p} \qquad\small (1) $$
    $$ \large f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p} \qquad\small (2) $$
    
    The inaccuracy in the coefficient estimates is related to the **reducible error**. Of course, in practice assuming a linear model for f (X) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call **model bias**.
    
    Even if we knew the true values of $\beta_{0}, \beta_{1}, \dots, \beta_{p}$, the response value could not be predicted perfectly because of the random error $\epsilon$ in the model, **the irreducible error.** 
    
    **Prediction intervals** must account for both the uncertainty in estimating the population mean and the random variation of the individual values. So a prediction interval is always wider than a confidence interval.
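
Returning to the variable selection step described in this list, forward selection can be sketched in plain Python. This is a rough illustration under stated assumptions, not a production implementation: the least squares fit solves the normal equations with a small hand-written Gaussian elimination, the stopping rule is simply a maximum number of variables (`max_vars`, a hypothetical parameter), and the data are simulated.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss_of_fit(rows, ys, cols):
    """RSS of the least squares fit of ys on an intercept plus the given columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, y in zip(design, ys))


def forward_selection(rows, ys, max_vars):
    """Greedily add the variable that gives the lowest RSS at each step."""
    selected, remaining, history = [], list(range(len(rows[0]))), []
    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda c: rss_of_fit(rows, ys, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        history.append((best, rss_of_fit(rows, ys, selected)))
    return history


# Simulated data (assumption): y depends strongly on X0, weakly on X2, not on X1
rng = random.Random(1)
rows = [[rng.random() for _ in range(3)] for _ in range(60)]
ys = [1 + 5 * r[0] + 2 * r[2] + rng.gauss(0, 0.1) for r in rows]

history = forward_selection(rows, ys, max_vars=3)
for column, rss in history:
    print(f"added X{column}, RSS = {rss:.3f}")
```

The RSS never increases as variables are added, which is why a stopping rule is needed rather than "keep going while the RSS improves".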
    

### Assessing the accuracy of the model

The quality of a linear regression fit is typically assessed using two related quantities: **the residual standard error ($RSE$) and the $R^2$ statistic.**

The $RSE$ is an estimate of the standard deviation of $\epsilon$. You can see it as the average amount that the response will deviate from the true regression line.

$$RSE = \sqrt{\frac{RSS}{(n-p-1)}}$$ 

The $RSE$ is considered a measure of the lack of fit of the model to the data. A large $RSE$ indicates that the model doesn't fit the data very well. For simple linear regressions, p = 1 and thus $RSS$ is divided by (n-2).

Since $RSE$ is measured in the units of Y, it is not always clear what constitutes a good $RSE$. The $R^2$ statistic provides an alternative measure of fit. It measures the proportion of variability in Y that can be explained using X. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression.

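$$R^2 = 1 - \frac{RSS}{TSS}$$

where $TSS = \sum (y_{i} - \bar{y})^2$ is the **total sum of squares**, the total variability in the response before the regression is performed.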

- **Obs 1**: in typical applications in biology, psychology, etc. the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an $R^2$ value well below 0.1 might be more realistic than a large one.
- **Obs 2**: In the simple linear regression setting, the squared correlation and the $R^2$ statistic are identical: $R^2 = Cor(X, Y)^2$, and, in the multiple linear regression setting, $R^2$ is equal to the square of the correlation between the response and the fitted linear model: $R^2 = Cor(Y, \hat{Y})^2$
- **Obs 3**: $R^2$ will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. Why? Adding another variable results in a decrease in the RSS on the training data (though not necessarily the testing data). Thus, the $R^2$ statistic, which is also computed on the training data, must increase. 
    On the other hand, the $RSE$ doesn't always decrease when more variables are added to the model. Why? Models with more variables can have higher $RSE$ if the decrease in $RSS$ is small relative to the increase in p.
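
The fit statistics of this section depend only on the RSS, the total sum of squares, n and p, so they can be computed together. The sketch below also returns the F-statistic from the hypothesis testing discussion, using the standard formula $F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$, which these notes do not spell out. The function name and the toy responses and fitted values are assumptions.

```python
import math


def fit_statistics(ys, fitted, p):
    """Return (RSE, R^2, F-statistic) for a linear fit with p predictors."""
    n = len(ys)
    y_bar = sum(ys) / n
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    tss = sum((y - y_bar) ** 2 for y in ys)  # total sum of squares
    rse = math.sqrt(rss / (n - p - 1))
    r2 = 1 - rss / tss
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return rse, r2, f_stat


# Toy responses and the fitted values of the line y = 1.09 + 1.97x at x = 1..5
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [3.06, 5.03, 7.00, 8.97, 10.94]

rse, r2, f_stat = fit_statistics(ys, fitted, p=1)
print(f"RSE = {rse:.4f}, R^2 = {r2:.4f}, F = {f_stat:.1f}")
```

Note how p enters the RSE and the F-statistic but not $R^2$, which is why $R^2$ alone cannot penalize additional variables.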
    
Finally, it can be useful to plot the data. Graphical summaries can reveal problems with a model that are not visible from numerical statistics.

## k-Nearest Neighbors

An algorithm that predicts the class of a new data point from the classes of its k closest training points (majority vote), based on the assumption that nearby data points are similar.

The k-nearest neighbors algorithm hinges on data points being close together. This becomes challenging as the number of dimensions increases, referred to as the “Curse of Dimensionality.” It’s especially hard for the k-nearest neighbors algorithm because it requires two points to be very close on every axis, and adding a new dimension creates another opportunity for points to be farther apart. As the number of dimensions increases, the closest distance between two points approaches the average distance between points, eradicating the ability of the k-nearest neighbors algorithm to provide valuable predictions.

To overcome this challenge, you can add more data to the data set. By doing so you add density to the data space, bringing the nearest points closer together and returning the ability of the k-nearest neighbors algorithm to provide valuable predictions.
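
A minimal sketch of both ideas, assuming Euclidean distance and a simple majority vote: `knn_predict` classifies a query point, and `nearest_to_mean_distance_ratio` (a hypothetical helper) shows how the nearest distance creeps toward the average distance as the dimension grows. The toy points, the number of random points and the seed are illustrative assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def nearest_to_mean_distance_ratio(dim, n_points=200, seed=0):
    """Nearest distance divided by mean distance from one random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(p, query) for p in points]
    return min(distances) / (sum(distances) / len(distances))


# Toy training set (made up): two well-separated clusters
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.0)))

# A ratio close to 1 means the nearest neighbor is barely closer than an average point
for dim in (2, 10, 100):
    print(dim, round(nearest_to_mean_distance_ratio(dim), 3))
```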

## Decision trees

Decision trees involve segmenting the predictor space into a number of distinct and non-overlapping regions (R1, R2, ... RJ). Each split of the domain is aligned with one of the feature axes (in theory they could have any shape but "rectangles" are chosen for simplicity's sake and ease of interpretation).

### Regression trees

- The predictor/feature space is segmented into distinct, non-overlapping regions (R1... RJ).
- Any new observation that falls into a particular partition $R_{j}$ has the estimated response given by the **MEAN** of all training observations in $R_{j}$ (the mean of the training observations in each partition is represented by a leaf).


- The goal is to find boxes that minimize **the Residual Sum of Squares (RSS)**. 
- The **RSS** is computed by summing, across all partitions of the feature space and for each training observation within a partition, the squared difference between the response $y_{i}$ of that training observation and the mean response $\hat{y}_{R_{j}}$ of the training observations within the $j$th region. 

$$ \large RSS = \sum\limits_{j=1} ^{J} \sum\limits_{i \in R_{j}} (y_{i} - \hat{y}_{R_{j}})^2 $$


- Since it is computationally expensive to consider all possible partitions of the feature space into J rectangles, minimizing the RSS is done with **the Recursive Binary Splitting approach (RBS)**.


- In one sentence: the RBS approach helps construct a tree by considering all features (X1, ... Xp) and all possible values of the cutpoint s for each of the features, and then choosing at each node the feature that best splits the data. It is said to be: 
    - *top down* because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space (each split --> new branch). 
    - *greedy* because at each step of the building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.


- The RBS approach in greater detail: 

  For any feature index j and cutpoint s, a pair of half-planes is defined (1). We then seek the values of j and s that minimize expression (2), where $\hat{y}_{R_{1}}$ and $\hat{y}_{R_{2}}$ are the mean training responses in $R_{1}(j,s)$ and $R_{2}(j,s)$.
  This process is repeated. However, each time, instead of splitting the entire predictor/feature space, only one of the two previously identified regions is split. The process continues until a stopping criterion is reached (e.g. until no region contains more than five observations).

    $$ \large R_{1}(j,s) = \{X | X_{j} < s\} \quad \text{and} \quad R_{2}(j,s) = \{X | X_{j} \geqslant s\} \qquad\small (1) $$

$$ \large\sum\limits_{i: x_{i} \in R_{1}(j,s)} (y_{i} - \hat{y}_{R_{1}})^2 + \sum\limits_{i: x_{i} \in R_{2}(j,s)} (y_{i} - \hat{y}_{R_{2}})^2 \qquad\small (2) $$
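
One step of recursive binary splitting can be sketched as an exhaustive search over features j and cutpoints s. The choice of candidate cutpoints (midpoints between consecutive distinct feature values) is an assumption of this sketch, as are the function names and the toy data.

```python
def region_rss(ys):
    """RSS of a region when every observation is predicted by the region mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (rss, j, s) for the single split X_j < s that minimizes the total RSS."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # candidate cutpoints: midpoints between consecutive distinct values
        for low, high in zip(values, values[1:]):
            s = (low + high) / 2
            left = [y for row, y in zip(rows, ys) if row[j] < s]
            right = [y for row, y in zip(rows, ys) if row[j] >= s]
            total = region_rss(left) + region_rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best


# Toy data (made up): the response jumps when feature 0 crosses ~6.5,
# while feature 1 carries no information about the response
rows = [[1, 10], [2, 20], [3, 10], [10, 20], [11, 10], [12, 20]]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

rss, j, s = best_split(rows, ys)
print(f"split on feature {j} at s = {s}, RSS = {rss:.2f}")
```

Growing a full tree would apply `best_split` again inside each resulting region until the stopping criterion is met.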

- In order to **NOT overfit** the data you could:

    - Alternative 1: Build the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold. However, this strategy is too short-sighted since a "worthless" split might be followed by a "good" split later on.
   
    - Alternative 2: Grow a very large tree and then prune it back in order to obtain a subtree.
    
    **Cost complexity pruning**: introduces an additional tuning parameter ($\alpha$) that balances the size of the tree against its goodness of fit to the training data. To each value of $\alpha$ there corresponds a subtree $T \subset T_{0}$ such that:

    $$ \large \sum\limits_{m=1} ^{|T|} \sum\limits_{i: x_{i} \in R_{m}} (y_{i} - \hat{y}_{R_{m}})^2 + \alpha|T| $$

    is as small as possible. $|T|$ is the number of terminal nodes of the tree $T$ (its complexity), $R_{m}$ is the region corresponding to the $m$th terminal node and $\hat{y}_{R_{m}}$ is the mean training response in $R_{m}$.
    - When $\alpha = 0$, $T = T_{0}$ (the full tree).
    - As $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, so the selected subtree gets smaller.
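
The effect of $\alpha$ can be illustrated with a toy calculation. The sketch below assumes that a handful of candidate subtrees of $T_{0}$ have already been obtained and summarizes each one only by its number of terminal nodes and its training RSS. The numbers are hypothetical, and the search over subtrees itself is not shown.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Penalized criterion: training RSS plus alpha times the number of terminal nodes."""
    return rss + alpha * n_leaves


def select_subtree(subtrees, alpha):
    """Pick the candidate subtree with the smallest cost-complexity criterion."""
    return min(subtrees, key=lambda t: cost_complexity(t["rss"], t["leaves"], alpha))


# Hypothetical candidates: pruning removes leaves but increases the training RSS
subtrees = [
    {"leaves": 8, "rss": 10.0},  # the full tree T0
    {"leaves": 5, "rss": 12.0},
    {"leaves": 3, "rss": 16.0},
    {"leaves": 1, "rss": 40.0},  # a single node, no splits
]

for alpha in (0, 1, 3, 20):
    chosen = select_subtree(subtrees, alpha)
    print(f"alpha = {alpha}: choose the subtree with {chosen['leaves']} leaves")
```

In practice $\alpha$ itself is a tuning parameter, typically chosen by cross-validation.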

### Classification trees

- The predictor/feature space is segmented into distinct, non-overlapping regions (R1... RJ).
- Any new observation that falls into a particular partition $R_{j}$ has the estimated response given by the **MODE** of all training observations in $R_{j}$ (each observation is assigned to the most commonly occurring class of training observations in the region to which it belongs).

- The goal is to find boxes that minimize either the:

    **Classification error rate** = the fraction of the training obs in that region that do not belong to the most common class.
    
    $$\large E = 1-\max_{k}(\hat{p}_{mk})$$ 
    
    where $\hat{p}_{mk}$ is the proportion of the training obs in the mth region that are from the kth class.
    
    Or the
    **Gini index** = a measure of the total variance across the K classes, also read as node purity (a small G value implies a "pure" node). Values range from 0 to $1 - 1/K$, i.e. from 0 to 0.5 for two classes.
    
    $$G = \large\sum\limits_{k=1}^{K} \hat{p}_{mk}(1-\hat{p}_{mk})$$ 
    
    Or the **Entropy** = a measure of uncertainty. With a base-2 logarithm, values range from 0 to $\log_{2}K$, i.e. from 0 to 1 for two classes.
    
    $$D = \large - \sum\limits_{k=1}^{K} \hat{p}_{mk}\log(\hat{p}_{mk})$$ 

- Once the Gini index or entropy has been calculated for both branches of a node, the quality of the split can be determined by weighting the impurity of each branch by the number of elements it contains.
 
     Information gain is based on the decrease in entropy after a dataset is split on an attribute. It helps to determine the order of attributes in the nodes of a decision tree (by finding the attribute that returns the highest information gain).
     
     **Information Gain = how much Entropy we removed** $= D_{before} - D_{after}$
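
As a small worked sketch of information gain, the code below takes $D_{after}$ to be the size-weighted average of the branch entropies (with base-2 logarithms). The function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Base-2 entropy of a list of class labels (0 for a pure or empty node)."""
    n = len(labels)
    # p * log2(1/p) summed over the classes present in the node
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    entropy_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - entropy_after


# Toy node (made up): 5 red and 5 blue observations
parent = ["red"] * 5 + ["blue"] * 5

# An imperfect split: one pure branch, one branch that is still mixed
left, right = ["blue"] * 4, ["red"] * 5 + ["blue"]
gain_imperfect = information_gain(parent, [left, right])

# A perfect split removes all of the parent's entropy
gain_perfect = information_gain(parent, [["red"] * 5, ["blue"] * 5])

print(f"imperfect split: {gain_imperfect:.3f}, perfect split: {gain_perfect:.3f}")
```

When building the tree, the attribute (and cutpoint) with the highest gain would be chosen at each node.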

#### Gini, entropy examples

e.g. in a two-class problem with 400 obs in each class, suppose one split creates nodes (300,100) and (100,300) while the other creates nodes (200,400) and (200,0). Both splits have a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable (it has a lower Gini index and entropy).
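
The comparison of the two splits can be checked numerically. In this sketch each node is represented by its pair of class counts and each impurity measure is weighted by node size. The helper names are assumptions.

```python
import math


def error_rate(counts):
    """Fraction of observations not in the most common class of the node."""
    return 1 - max(counts) / sum(counts)


def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Base-2 entropy of a node given its class counts (empty classes contribute 0)."""
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c > 0)


def weighted(measure, nodes):
    """Average of an impurity measure over the nodes, weighted by node size."""
    total = sum(sum(node) for node in nodes)
    return sum(sum(node) / total * measure(node) for node in nodes)


split_a = [(300, 100), (100, 300)]
split_b = [(200, 400), (200, 0)]

for name, split in (("first split", split_a), ("second split", split_b)):
    print(name,
          f"error = {weighted(error_rate, split):.3f}",
          f"gini = {weighted(gini, split):.3f}",
          f"entropy = {weighted(entropy, split):.3f}")
```

The error rate cannot tell the two splits apart, whereas the Gini index and the entropy both favor the split that produces a pure node.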

Example 1: 4 Red / 0 Blue

$Gini = 1 - (prob(red)^2 + prob(blue)^2) = 1 - (1^2 + 0^2) = 0$

$Entropy = -[prob(red) \cdot \log_{2}(prob(red))] - [prob(blue) \cdot \log_{2}(prob(blue))] = -[4/4 \cdot \log_{2}(4/4)] - 0 = 0$

Example 2: 2 Red / 2 Blue

$Gini = 1 - (0.5^2 + 0.5^2) = 0.5$ (the group is as impure as possible! Trick: divide the answer by the two-class maximum 0.5 --> 1)

$Entropy = -[2/4 \cdot \log_{2}(2/4)] - [2/4 \cdot \log_{2}(2/4)] = -(-1/2) - (-1/2) = 1$

Example 3: 3 Red / 1 Blue

$Gini = 1 - (0.75^2 + 0.25^2) = 0.375$ (normalized by the two-class maximum: 0.375/0.5 = 0.75)

$Entropy = -[3/4 \cdot \log_{2}(3/4)] - [1/4 \cdot \log_{2}(1/4)] = 0.811$ (a bit higher than the normalized Gini score)

### Good links

- https://www.quantstart.com/articles/Beginners-Guide-to-Decision-Trees-for-Supervised-Machine-Learning/
- https://victorzhou.com/blog/information-gain/

# Unsupervised learning

= the training data is unlabeled and the system tries to learn without a "teacher". 

# Semisupervised learning

= the training data is both labeled and unlabeled (usually a lot of unlabeled data and a little bit of labeled). e.g. Google Photos: once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5 and 11 etc. If you tell the system who each person is, the system is able to name everyone in every photo.

# Reinforcement learning

= the learning system observes the environment, selects and performs actions, and gets rewards in return (or penalties in the form of negative rewards); it must then learn by itself what is the best strategy to get the most reward over time. e.g. DeepMind's AlphaGo