In [13]:
import ISLP
import numpy as np
import pandas as pd
import plotly.express as px

# <font color='teal'>ISLP Chapter 2 - Statistical Learning 🧠📈

Suppose that we observe a quantitative response $Y$, and $p$ different predictors, $X_{1}, X_{2},...,X_{p}$. 

We assume that there is some relationship between $\mathbf{Y}$ and $\mathbf{X} = (X_{1}, X_{2},...,X_{p})$

which can be written in the very general form
<font size=5> 
$$Y = f(X) + \epsilon$$

**In essence, statistical learning refers to a set of approaches for estimating $f$.**

This chapter outlines some of the key theoretical concepts that arise in estimating $f$, as well as tools for evaluating the estimates obtained.
    
---

**<font color='teal'>Why Estimate $f$ ?**

There are two main reasons that we may wish to estimate $f$ : 
    
- Prediction
- Inference
    
---
    
<font color='teal'>**Prediction**<br>
    

In many situations, a set of inputs $\mathbf{X}$ are readily available, but the output $\mathbf{Y}$ cannot be easily obtained. In this setting, since the error term averages to zero, we can predict $\mathbf{Y}$ using:

<font size=5>
$$
\hat{Y} = \hat{f}(X)
$$
    
where $\hat{f}$ represents our estimate for $f$, and $\hat{Y}$ represents the resulting prediction for $Y$ . 
    
In this setting, $\hat{f}$ is often treated as a black box, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions for $Y$.
    
---


    
<font color='teal'>**Accuracy**
    
The accuracy of $\hat{Y}$ as a prediction for $Y$ depends on two quantities, which we will call the *reducible error* and the *irreducible error*. 
    
In general, $\hat{f}$ will not be a perfect estimate for $f$, and this inaccuracy will introduce some error. 
    
This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate $f$. 
    
However, even if it were possible to form a perfect estimate for $f$, so that our estimated response took the form $\hat{Y} = f(X)$, our prediction would still have some error in it! 
    
This is because $Y$ is also a function of $\epsilon$, which, by definition, cannot be predicted using $X$. 
    
Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions.
    
This is known as the irreducible error, because no matter how well we estimate $f$, we cannot reduce the error introduced by $\epsilon$.
    
**Why is the irreducible error larger than zero?**
    
The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting $Y$; since we don't measure them, $f$ cannot use them for its prediction. The quantity $\epsilon$ may also contain unmeasurable variation.

    
$$
\begin{aligned}
E[(Y - \hat{Y})^{2}] &= E[f(X) + \epsilon - \hat{f}(X)]^{2} \\
&= [f(X) - \hat{f}(X)]^2 + Var(\epsilon)
\end{aligned}
$$
    
- Reducible error $[f(X) - \hat{f}(X)]^2$
- Irreducible error $Var(\epsilon)$

---

Rules & Assumptions

$ E[cx] = c E[x] \quad \text{and} \quad E[a + b] = E[a] + E[b] \quad \text{and} \quad E[\epsilon] = 0 $

---

- $Y = f(X) + \epsilon \space\space\space\space$ Using true $f$ we still experience some error in a prediction
<br>

- $\hat{Y} = \hat{f}(X) \space\space\space\space$ No error term appears because $\hat{Y}$ is a prediction
<br>

- $E[(Y - \hat{Y})^{2}] \space\space\space\space$ represents the average, or expected value, of the squared difference between the predicted and actual value of Y
<br>

Substituting 1 & 2 into 3

$E[(Y - \hat{Y})^{2}] = E[ ( f(X) + \epsilon - \hat{f}(X) )^{2} ]$

Expanding the square

Let $z = f(x) - \hat{f}(x)$, so the expectation becomes

$E[z^{2} + 2z\epsilon + \epsilon^{2}]$



$ E[z^{2} + 2z\epsilon] +E[\epsilon^{2}]$


$Var(\epsilon) = E[\epsilon^{2}] - (E[\epsilon])^{2}$, and the mean of the irreducible error is assumed to be 0, so $E[\epsilon^{2}] = Var(\epsilon)$

$ E[z^{2} + 2z\epsilon] +Var(\epsilon)$

$ E[(f(x) - \hat{f}(x))^{2} + 2(f(x) - \hat{f}(x))\epsilon] +Var(\epsilon)$

$ E[(f(x) - \hat{f}(x))^{2}] + E[2(f(x) - \hat{f}(x))\epsilon] +Var(\epsilon)$

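Since $\epsilon$ is independent of $X$, the cross term factors, and $E[\epsilon] = 0$ makes it vanish

$E[(f(x) - \hat{f}(x))^{2}] + 2E[f(x) - \hat{f}(x)]E[\epsilon] +Var(\epsilon)$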

$E[(f(x) - \hat{f}(x))^{2}] + Var(\epsilon) = E[(Y - \hat{Y})^{2}]$
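
The decomposition can be checked numerically. The sketch below is only an illustration: the "true" function `f`, the imperfect estimate `f_hat`, the noise level `sigma`, and the uniform distribution of `x` are all assumptions made up for this example, not anything taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return 2.0 + 3.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f, held fixed throughout
    return 1.5 + 3.2 * x

n = 200_000
sigma = 1.0                        # standard deviation of epsilon (assumed)
x = rng.uniform(0, 1, n)
eps = rng.normal(0.0, sigma, n)    # error term with mean zero
y = f(x) + eps

total_error = np.mean((y - f_hat(x)) ** 2)     # estimates E[(Y - Y_hat)^2]
reducible = np.mean((f(x) - f_hat(x)) ** 2)    # estimates E[(f(X) - f_hat(X))^2]
irreducible = sigma ** 2                       # Var(epsilon)

print(total_error, reducible + irreducible)
```

The two printed numbers should agree up to sampling noise, and no choice of `f_hat` can push the total below `irreducible`.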



---
    
<font color='teal'>**Inference**<br>
    
We are often interested in understanding the association between $Y$ and $X_{1},...,X_{p}$. 
    
In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$ . 
    
Now $\hat{f}$ cannot be treated as a black box, because we need to know its exact form.
    
To do this we may ask the following questions:
    
- **Which predictors are associated with the response?**
    
It is often the case that only a small fraction of the available predictors are substantially associated with $Y$. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.

- **What is the relationship between the response and each predictor?**
    
Some predictors may have a positive relationship with $Y$ , in the sense that larger values of the predictor are associated with larger values of $Y$ . Other predictors may have the opposite relationship. Depending on the complexity of $f$, the relationship between the response and a given predictor may also depend on the values of the other predictors.

- **Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?**
    
Historically, most methods for estimating $f$ have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
    
Linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. 
    
In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for $Y$, but this comes at the expense of a less interpretable model for which inference is more challenging.
    
---
    
<font color='teal'>**How To Estimate $f$**
    
Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function $f$. In other words, we want to find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation $(X, Y)$.
    
Broadly speaking, most statistical learning methods for this task can be characterized as either *parametric* or *non-parametric*.
    
---
  
<font color='teal'>**Parametric Methods**
    
- 1. First, we make an assumption about the functional form, or shape, of $f$. For example, one very simple assumption is that $f$ is linear in $X$
   
$$
f(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p}
$$
    
- 2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model, we need to estimate the parameters $\beta_{0}, \beta_{1},..., \beta_{p}$. That is, we want to find values of these parameters such that
    
$$
Y \approx \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \cdots + \beta_{p}X_{p}
$$
    
The model-based approach just described is referred to as parametric; it reduces the problem of estimating $f$ down to one of estimating a set of parameters. 
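
As a rough illustration of the two steps, the sketch below assumes a linear form and then estimates the parameters by least squares, which is one common fitting procedure. The simulated data and the coefficient values in `beta_true` are hypothetical, chosen only so that the estimates can be compared with a known answer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data with p = 2 predictors (illustrative only)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Step 1: assume f is linear in X, so add a column of ones for the intercept
X_design = np.column_stack([np.ones(n), X])

# Step 2: estimate the p + 1 parameters by least squares
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)
```

Estimating $f$ has been reduced to estimating three numbers, which is what makes the approach parametric.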
    
The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of $f$.
    
If the chosen model is too far from the true $f$, then our estimate will be poor. 
    
We can try to address this problem by choosing *flexible models* that can fit many different possible functional forms for $f$. <br>But in general, fitting a more flexible model requires estimating a
greater number of parameters. These more complex models can lead to a phenomenon known as *overfitting* the data, which essentially means they follow the errors, or noise, too closely.
    
---
    
<font color='teal'>**Non-Parametric Methods**
    
Non-parametric methods do not make explicit assumptions about the functional form of $f$. Instead they seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. 
    
Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for $f$, they have the potential to accurately fit a wider range of possible shapes for $f$.
    
But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for $f$.
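
One simple non-parametric estimator is K-nearest neighbors averaging, a method these notes mention in the regression versus classification section. The sketch below is a minimal one-dimensional version written for illustration; the helper name `knn_regress`, the sine-shaped truth, and the choice `k=15` are assumptions of this example.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    """Predict at each query point by averaging the k nearest training responses."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    # Distances between every query point and every training point (1-D inputs)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 2 * np.pi, 400)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=400)  # hypothetical non-linear truth

x_query = np.array([np.pi / 2, np.pi, 3 * np.pi / 2])
preds = knn_regress(x_train, y_train, x_query, k=15)
print(preds)
```

No functional form for $f$ is assumed here, but the estimate is only as good as the number of nearby observations, which is why such methods tend to need more data.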
    
<font color='teal'>**The Trade-Off Between Prediction Accuracy and Model Interpretability**
    
Some methods are less flexible, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate $f$. For example, linear regression is a relatively inflexible approach, because it can only generate linear functions.
    
In general, as the flexibility of a method increases, its interpretability decreases.
    
If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between $Y$ and $X_{1},...,X_{p}$.
    
In contrast, very flexible approaches can lead to such complicated estimates of $f$ that it is difficult to understand how any individual predictor is associated with the response.
    
Even when prediction is the only aim, less flexible approaches sometimes provide more accurate results due to the potential for overfitting in highly flexible methods.
    
---
    
<font color='teal'>**Supervised Versus Unsupervised Learning**
    
Most statistical learning problems fall into one of two categories:

- **Supervised**
    - For each observation of the predictor measurement(s) $x_{i}, i = 1,...,n$ there is an associated response measurement $y_{i}$.
    
- **Unsupervised**
    - Describes the somewhat more challenging situation in which for every observation $i = 1,...,n$, we observe a vector of measurements $x_{i}$ but no associated response $y_{i}$. There is no response variable to predict. The situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is cluster analysis, or clustering.
    
---
    
<font color='teal'>**Regression Versus Classification Problems**

Variables can be characterized as either quantitative or qualitative (also known as categorical). 
    
Quantitative variables take on numerical values. Examples include a person's age, height, or income, the value of a house, and the price of a stock. 
    
In contrast, qualitative variables take on values in one of $K$ different classes, or categories. Examples of qualitative variables include a person's marital status (married or not), the brand of product purchased (brand A, B, or C), and whether a person defaults on a debt (yes or no).
    
We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp.
    
Least squares linear regression is used with a quantitative response, whereas logistic regression
is typically used with a qualitative (two-class, or binary) response. Thus, despite its name, logistic regression is a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well. Some statistical methods, such as K-nearest neighbors and boosting, can be used in the case of either quantitative or qualitative responses.
    
We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.
    
---
    
<font color='teal'>**Assessing Model Accuracy**
    
There is no free lunch in statistics: no one method dominates all others over all possible data sets.
    
<font color='teal'> **Measuring the Quality of Fit**
    
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by
    
<font size=5>$$
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{f}(x_{i}))^2
$$
    
In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. 
    
We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations we could compute the average squared prediction error for these test observations.
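
A minimal sketch of computing training and test MSE follows. The helper `mse`, the simulated linear data, and the 150/50 split are assumptions of this example; the fit uses a straight line only to keep things short.

```python
import numpy as np

def mse(y, y_pred):
    """Mean squared error between observed and predicted responses."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y - y_pred) ** 2))

# Simulated data from a hypothetical linear truth plus noise
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

# Hold out the last 50 observations as previously unseen test data
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Fit a straight line to the training data only (polyfit returns slope first)
slope, intercept = np.polyfit(x_train, y_train, 1)

train_mse = mse(y_train, intercept + slope * x_train)
test_mse = mse(y_test, intercept + slope * x_test)
print(f"training MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

Only `test_mse` speaks to how the fitted model behaves on observations it has not seen.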
    
The *degrees of freedom* is a quantity that summarizes the flexibility of a curve. With a single predictor, linear regression
is at the most restrictive end, with two degrees of freedom (an intercept and a slope).
    
When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance
rather than by true properties of the unknown function $f$.
    
Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or
indirectly seek to minimize the training MSE. 
    
Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller
test MSE.
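
The effect of flexibility on training and test MSE can be explored with a small simulation. This is a sketch under stated assumptions: the truth `true_f`, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for flexibility are all choices made for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def true_f(x):
    # Hypothetical non-linear truth, for illustration only
    return np.sin(2 * x)

def sample(n):
    x = rng.uniform(-1.5, 1.5, n)
    return x, true_f(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = sample(30)      # small training set
x_test, y_test = sample(5000)      # large test set

results = {}
for degree in (1, 3, 15):          # increasing flexibility
    poly = Polynomial.fit(x_train, y_train, degree)   # least squares polynomial fit
    train_mse = np.mean((y_train - poly(x_train)) ** 2)
    test_mse = np.mean((y_test - poly(x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training MSE can only go down as the degree grows, because each polynomial contains the lower-degree ones as special cases. Test MSE typically falls and then rises again, and a degree beyond that turning point is overfitting in the sense defined above.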
    
Though the mathematical proof is beyond the scope of this book, it is possible to show that the expected test MSE, for a given value $x_{0}$, can always be decomposed into the sum of three fundamental quantities: the variance of $\hat{f}(x_{0})$, the squared *bias* of $\hat{f}(x_{0})$ and the variance of the error terms $\epsilon$. That is,
    
$$E(y_{0} - \hat{f}(x_{0}))^2 = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))]^2 + Var(\epsilon) \tag{2.7}
$$
    
Here the notation $E(y_{0} - \hat{f}(x_{0}))^2$ defines the expected test MSE at $x_{0}$ and refers to the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets, and tested each at $x_{0}$.
    
The overall expected test MSE can be computed by averaging $E(y_{0} - \hat{f}(x_{0}))^2$ over all possible values of $x_{0}$ in the test set.
    
Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the
expected test MSE can never lie below $Var(\epsilon)$, the irreducible error.
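
The decomposition can be approximated by simulation: draw many training sets, refit each time, and record the prediction at a fixed $x_{0}$. Everything specific in the sketch below (the truth `true_f`, `sigma`, `x0`, the sample sizes, and the straight-line fit) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def true_f(x):
    # Hypothetical non-linear truth, for illustration only
    return np.sin(2 * x)

sigma = 0.3                  # standard deviation of epsilon (assumed)
x0 = 1.0                     # fixed test point
n_train, n_reps, degree = 30, 5000, 1

preds = np.empty(n_reps)     # f_hat(x0) from each training set
y0 = np.empty(n_reps)        # a fresh test response at x0 for each repetition
for r in range(n_reps):
    x = rng.uniform(-1.5, 1.5, n_train)
    y = true_f(x) + rng.normal(scale=sigma, size=n_train)
    coefs = np.polyfit(x, y, degree)          # refit on a new training set
    preds[r] = np.polyval(coefs, x0)
    y0[r] = true_f(x0) + rng.normal(scale=sigma)

expected_test_mse = np.mean((y0 - preds) ** 2)
variance = preds.var()                        # Var(f_hat(x0))
bias_sq = (preds.mean() - true_f(x0)) ** 2    # [Bias(f_hat(x0))]^2
decomposed = variance + bias_sq + sigma ** 2

print(expected_test_mse, decomposed)
```

The two printed values should agree up to simulation noise. Raising `degree` is one way to watch squared bias shrink while variance grows.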
    
<font color='teal'>**Variance & Bias**
    
What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets
will result in a different $\hat{f}$. But ideally the estimate for $f$ should not vary too much between training sets. 
    
However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$.
    
In general, more flexible statistical methods have higher variance because they are more closely fitting the training data.
    
*Bias* refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. 
    
For example, linear regression assumes that there is a linear relationship between $Y$ and $X_{1}, X_{2},...,X_{p}$. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of $f$.

In [14]:
# Load the Wage data set that ships with the ISLP package
wage_df = ISLP.load_data('Wage')

In [15]:
# Average wage for each survey year
avg_wages = wage_df[['year', 'wage']].groupby('year').mean()

In [16]:
# Average wage by year, with an ordinary least squares trend line
px.scatter(avg_wages, x=avg_wages.index, y='wage', trendline="ols")

In [17]:
# Wage against age, colored by marital status
px.scatter(wage_df, x='age', y='wage', color='maritl')

In [18]:
# Wage against age, colored by education level
px.scatter(wage_df, x='age', y='wage', color='education')