In [1]:
from IPython.core.display import HTML
HTML("""
<style>
div.text_cell_render { /* Customize text cells */
font-family: 'Times New Roman';
font-size:1.3em;
line-height:1.4em;
padding-left:1.5em;
padding-right:1.5em;
}
</style>
""")

<h1><center>Statistical Learning</center></h1>

### 2.1 What Is Statistical Learning?

The <b>Advertising</b> data set consists of the sales of a product in 200 different markets, along with the advertising budgets for that product in each market for three media: <b>TV, radio, and newspaper.</b> The data are plotted below. Our goal is to develop an accurate model that predicts sales from the three media budgets.

More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1$, $X_2$, ... , $X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, ... , X_p)$, which can be written in the general form:

$$Y = f(X) + \epsilon$$

where $\epsilon$ is a random error term that is independent of $X$ and has mean 0. In this formula, $f$ represents the systematic information that $X$ provides about $Y$. <b>In essence, statistical learning refers to a set of approaches for estimating $f$.</b>
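To make the roles of $f$ and $\epsilon$ concrete, here is a minimal simulation sketch. The function `f`, the noise level, and the sample size are assumptions chosen for illustration; they are not estimated from the Advertising data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical "true" systematic relationship (an assumption for illustration)
    return 2.0 + 3.0 * x

n = 10_000
X = rng.uniform(0, 10, size=n)       # predictor
eps = rng.normal(0.0, 1.0, size=n)   # error term: mean 0, drawn independently of X
Y = f(X) + eps                       # observed response

print("mean of eps:", eps.mean())
print("corr(X, eps):", np.corrcoef(X, eps)[0, 1])
```

In a simulation like this the sample mean of `eps` and its correlation with `X` should both be close to zero, which mirrors the two assumptions made about the error term. With real data we only observe `X` and `Y`, never `f` or `eps` separately.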

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

adv = pd.read_csv("data/Advertising.csv")

fig = plt.figure(figsize=(15,8))

ax = fig.add_subplot(131)
sns.regplot(x="TV", y="Sales", color='r', data=adv, order=1, scatter_kws={'alpha':0.4}, 
            line_kws={'color':'g', 'alpha':0.7})
ax.set_xlabel('TV')
ax.set_ylabel('Sales')

ax = fig.add_subplot(132)
sns.regplot(x="Radio", y="Sales", color='r', data=adv, order=1, scatter_kws={'alpha':0.4}, 
            line_kws={'color':'g', 'alpha':0.7})
ax.set_xlabel('Radio')
ax.set_ylabel('Sales')

ax = fig.add_subplot(133)
sns.regplot(x="Newspaper", y="Sales", color='r', data=adv, order=1, scatter_kws={'alpha':0.4}, 
            line_kws={'color':'g', 'alpha':0.7})
ax.set_xlabel('Newspaper')
ax.set_ylabel('Sales')

plt.show()


<Figure size 1500x800 with 3 Axes>

#### 2.1.1 Why Estimate f ?

There are two main reasons that we may wish to estimate $f$:
 - <b>Prediction:</b> We can predict $Y$ using $X$ as:
 
 $$\widehat{Y} = \widehat{f}(X)$$
 
 where $\widehat{f}$ represents our estimate for $f$ and $\widehat{Y}$ represents the resulting prediction for $Y$. The accuracy of $\widehat{Y}$ as a prediction of $Y$ depends on two quantities: <b>reducible error</b> and <b>irreducible error</b>. Reducible error can be lowered by using a more appropriate statistical learning technique to estimate $f$. Irreducible error arises because $Y$ also depends on $\epsilon$, which cannot be predicted from $X$, so it cannot be reduced no matter how well we estimate $f$. 
 
 
 - <b>Inference:</b> In this situation we wish to estimate $f$, but our goal is not necessarily to make predictions for $Y$. Instead, we want to understand the relationship between $X$ and $Y$, or more specifically, how $Y$ changes as a function of $X$. The three main questions of interest in inference are: <b>Which predictors are associated with the response?</b> <b>What is the relationship (positive or negative) between the response and each predictor?</b> and <b>Can the relationship between $Y$ and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?</b>
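
The split between reducible and irreducible error in the prediction setting can be checked numerically. The sketch below assumes a known true function `f` and a known noise standard deviation `sigma` (both hypothetical, chosen only for illustration) and compares a deliberately poor estimate with the ideal estimate $\widehat{f} = f$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                        # assumed standard deviation of the error term

def f(x):
    return np.sin(x)               # hypothetical true f

X = rng.uniform(0, 2 * np.pi, size=50_000)
Y = f(X) + rng.normal(0, sigma, size=X.size)

def crude_f_hat(x):
    return np.zeros_like(x)        # poor estimate: ignores X entirely

mse_crude = np.mean((Y - crude_f_hat(X)) ** 2)   # reducible + irreducible error
mse_ideal = np.mean((Y - f(X)) ** 2)             # irreducible error only

print("crude estimate:", mse_crude)
print("ideal estimate:", mse_ideal)
print("Var(eps):      ", sigma ** 2)
```

Even when the estimate equals the true $f$, the mean squared prediction error stays near $Var(\epsilon)$. The gap between the two printed errors is the reducible part.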

#### 2.1.2 How Do We Estimate f ?

$f$ can be estimated by many <b>linear</b> and <b>non-linear</b> approaches. The observed data used to fit the model are called the <b>training data</b>. Our goal is to apply a statistical learning method to the training data in order to estimate $f$. In other words, we want to find a function $\widehat{f}$ such that $Y \approx \widehat{f}(X)$ for any observation $(X,Y)$. Most statistical learning methods for this task can be characterized as either parametric or non-parametric:
 - <b>Parametric Methods:</b> Parametric methods involve a two-step, model-based approach:
    - We make an assumption about the functional form, or shape, of $f$. For a linear model, $f$ can be represented as:
    
    $$f(X) = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p$$
    
    and then we have to estimate $p+1$ coefficients $\beta_0, \beta_1,..., \beta_p$.
    
    - After the model has been selected, we need a procedure that uses the <b>training data</b> to estimate the parameters.
    
    The model-based approach described above is called <b>parametric</b> because it reduces the problem of estimating $f$ to the estimation of a set of parameters. A parametric form is easy to estimate, but it has the disadvantage that the chosen model will usually not match the true unknown form of $f$. We can choose more flexible models to address this, but they require estimating more parameters and can lead to <b>overfitting</b> the data, meaning that the model follows the <b>error</b> or <b>noise</b> too closely.
    
    
 - <b>Non-parametric Methods:</b> Non-parametric methods seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. They have the potential to accurately fit a wider range of possible shapes of $f$. Since they do not reduce the problem of estimating $f$ to a small number of parameters, a very large number of observations is required to obtain an accurate estimate of $f$.
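
The contrast between the two families can be sketched in a few lines. The example below is only an illustration under stated assumptions: the parametric fit uses ordinary least squares for the linear form above, and the non-parametric fit uses K-nearest-neighbors averaging as one possible non-parametric method. The helper names (`fit_linear`, `predict_linear`, `predict_knn`) and the simulated sine-shaped data are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Parametric: estimate beta_0, ..., beta_p by least squares."""
    A = np.column_stack([np.ones(len(X)), X])        # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    return beta[0] + X @ beta[1:]

def predict_knn(X_train, y_train, X_new, k=5):
    """Non-parametric: average the responses of the k nearest training points."""
    preds = []
    for x0 in X_new:
        dist = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distance to x0
        nearest = np.argsort(dist)[:k]
        preds.append(y_train[nearest].mean())
    return np.array(preds)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # assumed non-linear truth

beta = fit_linear(X, y)
X_new = np.array([[-1.5], [0.0], [1.5]])
print("true f:", np.sin(X_new[:, 0]))
print("linear:", predict_linear(beta, X_new))
print("knn:   ", predict_knn(X, y, X_new, k=10))
```

The linear fit is summarized by just two numbers, but it cannot follow the curvature of the assumed truth. The nearest-neighbor fit makes no assumption about shape and relies on having enough nearby observations.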

#### 2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

A common question related to model selection is: <b>why would we ever choose to use a more restrictive method instead of a very flexible approach?</b> When inference is the goal, a restrictive method such as the linear model is a good choice because it is easy to interpret. In other words, simple and relatively inflexible statistical learning methods have a clear advantage for inference. When only prediction matters, a more flexible and complex model can be used because interpretability is not a concern, although a more flexible model does not always predict better because of the potential for overfitting.

#### 2.1.4 Supervised Versus Unsupervised Learning

In supervised learning, for each observation of the predictor measurement, there is an associated response measurement which is used to build a model and predict the response of future observations.

In unsupervised learning, there is no response associated with the observations. Cluster analysis is an example of unsupervised learning.

#### 2.1.5 Regression Versus Classification Problems

Variables can be characterized as either <b>quantitative</b> or <b>qualitative (categorical)</b>. Quantitative variables take on numerical values, whereas qualitative variables take on values in one of $K$ different classes. We refer to problems with a quantitative response as <b>regression</b> problems, while those involving a qualitative response are called <b>classification</b> problems. The type of the <b>predictors</b> is of less concern when choosing the model, as categorical variables can be coded numerically before the model is applied.

### 2.2 Assessing Model Accuracy

No one statistical learning method dominates all others over all possible data sets. Hence an important task is to decide, for any given data set, which method produces the best results.

#### 2.2.1 Measuring the Quality of Fit

In the <b>regression</b> setting, the most commonly used measure for quality of fit is <b>mean squared error (MSE)</b>, which is given as:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \widehat{f}(x_i))^{2}$$

where $\widehat{f}(x_i)$ is the prediction for the $i$th observation. The MSE will be small if the predicted responses are very close to the true responses. When evaluating a model, the <b>training MSE</b> is of lesser importance. We should be more interested in the <b>test MSE</b>, which is the MSE computed on previously unseen test observations that were not used to train the model. When test data are available, we can simply compute the test MSE and select the model for which it is lowest. In the absence of test data, it is tempting to select the model with the lowest training MSE, but there is no guarantee that this model will also have the lowest test MSE. The figure below shows the training and test MSE for a set of models:

<img src="images/MSE.PNG">

In the right image, the grey curve shows training MSE and the red one test MSE. The <b>horizontal dashed line</b> indicates $Var(\epsilon)$, which is the lowest achievable test MSE among all methods. Note that as we increase <b>flexibility (degrees of freedom)</b>, training MSE decreases but test MSE tends to increase after a certain point. So the blue curve on the left, although it does not have the lowest training MSE, is the best fit for the data.

The right-hand panel illustrates a fundamental property of statistical learning that holds regardless of the data set or the statistical method being used: as flexibility increases, training MSE decreases, but test MSE may not. When a given method yields a small training MSE but a large test MSE, we are <b>overfitting</b> the data.
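
The behaviour of the two curves can be reproduced with simulated data. The following is a sketch under stated assumptions: the true function `f`, the noise level, and the use of polynomial degree as the measure of flexibility are illustrative choices, not the setup behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)      # hypothetical true f

def simulate(n, sigma=0.3):
    x = rng.uniform(0, 1, size=n)
    return x, f(x) + rng.normal(0, sigma, size=n)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)       # held-out data, never used for fitting

train_mse, test_mse = {}, {}
for degree in range(1, 9):
    coefs = np.polyfit(x_train, y_train, deg=degree)   # least-squares polynomial fit
    train_mse[degree] = mse(y_train, np.polyval(coefs, x_train))
    test_mse[degree] = mse(y_test, np.polyval(coefs, x_test))
    print(f"degree={degree}  train MSE={train_mse[degree]:.3f}  "
          f"test MSE={test_mse[degree]:.3f}")
```

Because each polynomial contains the lower-degree ones as special cases, the training MSE cannot increase with the degree. The test MSE carries no such guarantee and typically starts to rise once the fit begins to follow the noise.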

There are various approaches that can be used to find the best model (or find the minimum point) by analysing test MSE. One important method is <b>cross-validation</b>, which is a method for estimating test MSE using the training data.
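
As a rough sketch of the idea, the block below implements k-fold cross-validation by hand for polynomial fits. The helper name `k_fold_mse`, the number of folds, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k disjoint index sets
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]                                # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        resid = y[val_idx] - np.polyval(coefs, x[val_idx])
        fold_mse.append(np.mean(resid ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=100)  # assumed data-generating process

cv_scores = {d: k_fold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores)
print("degree with lowest CV estimate:", best_degree)
```

Every observation is used for validation exactly once, so the averaged fold errors estimate the test MSE without requiring a separate test set.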

#### 2.2.2 The Bias-Variance Trade-Off

The expected test MSE for a given value $x_0$ can always be decomposed into the sum of three fundamental quantities: <b>variance of $\widehat{f}(x_0)$</b>, <b>the squared bias of $\widehat{f}(x_0)$</b> and the variance of the <b>error term $\epsilon$</b>.

$$E(y_0 - \widehat{f}(x_0))^{2} = Var(\widehat{f}(x_0)) + [Bias(\widehat{f}(x_0))]^{2} + Var(\epsilon)$$

Here the notation $E(y_0 - \widehat{f}(x_0))^{2}$ defines the <b>expected test MSE</b> and refers to the average test MSE that would be obtained if we repeatedly estimated $f$ using a large number of training sets and tested each at $x_0$. The overall expected test MSE can be computed by averaging it over all possible values of $x_0$ in the test set.

In order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves <b>low variance</b> and <b>low bias</b>.

<b>Variance</b> refers to the amount by which $\widehat{f}$ would change if we estimated it using a different training data set. If a statistical method has high variance, small changes in the training data can result in large changes in $\widehat{f}$. <b>More flexible statistical methods have higher variance.</b>

<b>Bias</b> refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. Generally, <b>more flexible methods result in less bias.</b>

As we use more flexible methods, <b>variance will increase and bias will decrease.</b> As we increase the flexibility of a statistical method, <b>the bias tends to initially decrease faster than the variance increases,</b> and hence the test MSE declines. At some point, however, increasing flexibility has little further effect on the bias but starts to increase the variance significantly, and the test MSE rises. The figure below illustrates this (with increasing flexibility) for the example shown above. The blue curve represents the squared bias, the orange curve the variance and the red curve the test MSE. This relationship is referred to as the <b>bias-variance trade-off</b>: it is easy to obtain a method with extremely low bias but high variance, or a method with very low variance but high bias, and the challenge is to find one for which both are low.

<img src="images/Bias Variance Tradeoff.png"  style="width: 400px;">
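
The decomposition can be checked by simulation. The sketch below assumes a known true function `f`, a fixed set of training inputs, and polynomial degree as the flexibility knob. All of these are illustrative choices rather than the setup behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                        # assumed noise standard deviation
x_train = np.linspace(0, 1, 30)    # fixed design: only the noise differs between training sets
x0 = 0.25                          # test point

def f(x):
    return np.sin(2 * np.pi * x)   # hypothetical true f

def decompose(degree, n_sims=1000):
    preds = np.empty(n_sims)
    sq_err = np.empty(n_sims)
    for s in range(n_sims):
        y_train = f(x_train) + rng.normal(0, sigma, size=x_train.size)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[s] = np.polyval(coefs, x0)           # f_hat(x0) for this training set
        y0 = f(x0) + rng.normal(0, sigma)          # fresh test response at x0
        sq_err[s] = (y0 - preds[s]) ** 2
    variance = preds.var()
    bias_sq = (preds.mean() - f(x0)) ** 2
    return variance, bias_sq, sq_err.mean()

results = {d: decompose(d) for d in (1, 3, 5)}
for d, (var, bias_sq, test_mse) in results.items():
    total = var + bias_sq + sigma ** 2
    print(f"degree={d}: Var={var:.4f}  Bias^2={bias_sq:.4f}  "
          f"Var+Bias^2+Var(eps)={total:.4f}  simulated test MSE={test_mse:.4f}")
```

Up to simulation noise, the sum of the three terms should track the simulated test MSE at `x0`. Moving from degree 1 to degree 5 trades squared bias for variance.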

#### 2.2.3 The Classification Setting

The most common approach for quantifying the accuracy of the estimate $\widehat{f}$ in the classification setting is the <b>training error rate</b>, defined as the proportion of mistakes that are made if we apply our estimate $\widehat{f}$ to the training observations:

$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \widehat{y_i})$$

where $\widehat{y_i}$ is the predicted class label for the $i$th observation and $I(y_i \neq \widehat{y_i})$ is an <b>indicator variable</b> that equals 1 if $y_i \neq \widehat{y_i}$ and 0 otherwise. Hence, <b>the above equation computes the fraction of incorrect classifications.</b> Similarly, the <b>test error rate</b> associated with a set of test observations of the form $(x_0, y_0)$ can be calculated as:

$$Ave(I(y_0 \neq \widehat{y_0}))$$
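
Both error rates are the mean of an indicator, which takes one line to compute. The helper name `error_rate` and the labels below are made up for illustration.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))   # mean of the indicator I(y_i != y_hat_i)

y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "spam"]
print(error_rate(y_true, y_pred))   # 2 of the 5 labels disagree
```

Applied to training observations this gives the training error rate. Applied to held-out observations it gives the test error rate.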

#### The Bayes Classifier

In the classification setting, the test error rate is minimized, on average, by a simple classifier that <b>assigns each observation to the most likely class, given its predictor values.</b> That is, we assign a test observation with predictor value $x_0$ to the class $j$ for which

$$Pr(Y=j \ | \  X = x_0)$$

is largest. This classifier is called the <b>Bayes classifier.</b> In a two-class problem, the Bayes classifier predicts class <b>1</b> if $Pr(Y=1 \ | \  X = x_0) > 0.5$, and class <b>2</b> otherwise. The boundary that divides the predictor space into classes (the set of points where the classes are equally probable) is called the <b>Bayes decision boundary</b>.

The Bayes classifier produces the lowest possible test error rate, called the <b>Bayes error rate.</b> Since the Bayes classifier always chooses the class for which the probability is largest, the error rate at $X=x_0$ is $1 - \max_{j}Pr(Y=j \ | \ X=x_0)$. Hence, the overall Bayes error rate is given by:

$$1 - E\left(\max_{j}Pr(Y=j \ | \ X)\right),$$

where the expectation averages the probability over all possible values of $X$.
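
When the conditional distribution is known, both the classifier and its error rate can be computed directly. The toy problem below is an assumption made for illustration: two equally likely classes with one-dimensional Gaussian predictors, for which the posterior has a closed form.

```python
import math
import numpy as np

# Assumed toy problem: two equally likely classes,
# X | Y=1 ~ N(+1, 1) and X | Y=2 ~ N(-1, 1).
def posterior_class1(x):
    """Pr(Y = 1 | X = x); for this setup it reduces to a logistic function."""
    return 1.0 / (1.0 + np.exp(-2.0 * x))

def bayes_classify(x):
    return np.where(posterior_class1(x) > 0.5, 1, 2)

rng = np.random.default_rng(0)
n = 100_000
y = rng.choice([1, 2], size=n)
x = rng.normal(np.where(y == 1, 1.0, -1.0), 1.0)

p1 = posterior_class1(x)
bayes_error_mc = 1.0 - np.mean(np.maximum(p1, 1.0 - p1))          # 1 - E[max_j Pr(Y=j | X)]
bayes_error_exact = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2)))   # Phi(-1) for this setup
observed_error = np.mean(bayes_classify(x) != y)

print("Monte Carlo Bayes error rate:", bayes_error_mc)
print("Exact Bayes error rate:      ", bayes_error_exact)
print("Observed error of classifier:", observed_error)
```

Here the Bayes decision boundary is the single point $x = 0$, where both classes have posterior probability 0.5. The three printed numbers should agree closely, since the Bayes classifier attains the Bayes error rate.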

#### K-Nearest Neighbors

For real data, we do not know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible. Many approaches attempt to estimate the <b>conditional distribution of Y given X</b> and then classify a given observation to the class with the highest estimated probability. One such method is the <b>K-nearest neighbors (KNN) classifier</b>.

Given a positive integer <b>K</b> and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. It then estimates the conditional probability of class $j$ as the fraction of points in $N_0$ whose response values equal $j$:

$$Pr(Y = j \ | \ X = x_0) = \frac{1}{K} \sum_{i \in N_0}I(y_i = j)$$

Finally, KNN applies Bayes rule and classifies the test observation $x_0$ to the class with the largest probability.
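
These steps translate almost line by line into code. The sketch below assumes Euclidean distance as the notion of "closest", and the function names and tiny data set are hypothetical.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the k nearest neighbours in class j."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training point
    n0 = np.argsort(dist)[:k]                     # indices of the k nearest points (N_0)
    classes = np.unique(y_train)
    probs = np.array([np.mean(y_train[n0] == j) for j in classes])
    return classes, probs

def knn_classify(X_train, y_train, x0, k):
    classes, probs = knn_predict_proba(X_train, y_train, x0, k)
    return classes[np.argmax(probs)]              # class with the largest estimated probability

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))
print(knn_predict_proba(X_train, y_train, np.array([0.0, 0.0]), k=5))
```

With `k=5` at the origin, three of the five nearest points belong to class 0, so the estimated probabilities are 0.6 and 0.4. Ties in the estimated probabilities are broken here in favour of the first class, which is an arbitrary choice.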

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier. <b>The choice of K has a drastic effect on the KNN classifier.</b> The figure below shows the effect of K on the <b>KNN decision boundary</b>. When <b>K=1</b>, the KNN classifier is highly flexible and finds patterns in the data that don't correspond to the Bayes decision boundary. Hence, <b>a lower value of K corresponds to a classifier that has low bias but very high variance.</b> As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. <b>A higher value of K corresponds to a low-variance but high-bias classifier.</b> For KNN, <b>1/K</b> serves as a measure of flexibility. As K decreases, 1/K increases and hence flexibility increases.

<img src="images/KNN.png">
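
The effect of K on training and test error can be explored with a short simulation. This is a sketch under stated assumptions: the two overlapping Gaussian classes, the sample sizes, and the grid of K values are illustrative choices and do not reproduce the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    # Assumed toy data: two overlapping Gaussian classes in two dimensions
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
    return X, y

def knn_error(X_train, y_train, X_eval, y_eval, k):
    # Pairwise Euclidean distances between evaluation and training points
    dist = np.linalg.norm(X_eval[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)     # share of neighbours in class 1
    y_hat = (votes > 0.5).astype(int)         # odd k avoids ties
    return float(np.mean(y_hat != y_eval))

X_train, y_train = simulate(200)
X_test, y_test = simulate(1000)

errors = {}
for k in (1, 5, 15, 51, 101):
    errors[k] = (knn_error(X_train, y_train, X_train, y_train, k),
                 knn_error(X_train, y_train, X_test, y_test, k))
    print(f"K={k:3d}  1/K={1 / k:.3f}  "
          f"train error={errors[k][0]:.3f}  test error={errors[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbor, so the training error is zero even though the classes overlap. That is a sign of high flexibility rather than of a good classifier, and the test error column is the one to compare across values of K.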

<b>In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.</b>