##### Statistical Learning

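Statistical learning refers to a set of approaches for estimating $f$, the unknown function that relates the predictors $X$ to the response $Y$ through $Y = f(X) + \epsilon$.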

##### Why estimate $f$?
There are two main reasons to estimate $f$: prediction and inference.

##### Prediction

$\hat{Y} = \hat{f}(X)$, where $\hat{f}$ is our estimate of $f$. Here $\hat{f}$ is often treated as a black box: we don't care about its exact form, provided that it yields accurate predictions.

$X$ is the set of inputs and $\hat{Y}$ is the resulting prediction for $Y$.

##### Reducible error 
$\hat{f}$ will not be a perfect estimate of $f$, and this inaccuracy introduces some error. This is the *reducible* error, because we can potentially reduce it by choosing a more appropriate statistical learning technique to estimate $f$.

##### Irreducible error 
Even if we had a perfect estimate of $f$, our prediction would still have some error in it. This is because $Y$ is also a function of $\epsilon$, which cannot be predicted using $X$. The variability associated with $\epsilon$ therefore also affects the accuracy of our predictions.

This is called the *irreducible error*.

Why can't we reduce this error?

1. The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting $Y$. Since we don't measure them, $f$ cannot use them for its prediction.
2. The quantity $\epsilon$ may also contain unmeasurable variation.

e.g. the risk of an adverse reaction for a given patient might vary at a given time of day, depending on various factors.

$$ E(Y-\hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2$$
$$ =  [f(X) - \hat{f}(X)]^2 + \text{Var}(\epsilon)$$

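$E(Y-\hat{Y})^2$ is the average, or expected value, of the squared difference between the predicted and the actual value of $Y$, and $\text{Var}(\epsilon)$ is the variance of the error term $\epsilon$. The first term on the right-hand side is the reducible error and the second is the irreducible error.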
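The decomposition can be checked numerically. The sketch below is only an illustration: the true function `f`, the imperfect estimate `f_hat`, the point `x0` and the noise level `sigma` are all made-up assumptions, not taken from any data set.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true function (assumed for illustration)."""
    return 2.0 + 3.0 * x

def f_hat(x):
    """A deliberately imperfect, fixed estimate of f (also assumed)."""
    return 2.5 + 2.0 * x

x0 = 1.0         # fixed input point X = x0
sigma = 1.0      # standard deviation of epsilon, so Var(epsilon) = 1
n = 200_000      # number of simulated draws of epsilon

eps = rng.normal(0.0, sigma, size=n)
y = f(x0) + eps              # Y = f(X) + epsilon
y_hat = f_hat(x0)            # prediction is fixed because f_hat and x0 are fixed

expected_sq_error = np.mean((y - y_hat) ** 2)   # Monte Carlo estimate of E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2            # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                        # Var(epsilon)

print(expected_sq_error, reducible + irreducible)
```

The two printed numbers should agree up to simulation noise. Improving `f_hat` shrinks only the reducible part; the `sigma ** 2` term stays.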


##### Inference 
Inference means understanding the way $Y$ is affected as $X_1,...,X_p$ change. We need to understand the relationship between $X$ and $Y$, that is, how $Y$ changes as a function of $X_1,...,X_p$.
Here $\hat{f}$ cannot be treated as a black box, because we need to know its exact form.

Typical questions:

1. Which predictors are associated with the response?
2. What is the relationship between the response and each predictor?
3. Can the relationship between $Y$ and each predictor be adequately summarized by a linear equation, or is the relationship more complicated?

For example, with sales as the response:

1. Which predictors contributed to the response?
2. Which predictors increased the sales?
3. What is the percentage increase in the response variable for a given increase in a predictor?
    
    
##### Prediction
The goal is an accurate model that predicts the response from the predictors. The underlying functional relationship is not of importance here.

##### How to estimate $f$?
Apply a statistical learning method to the training data in order to estimate the unknown function $f$.

Most statistical learning methods can be characterised as either *parametric* or *non-parametric*.

##### Parametric 

>We make an assumption about the functional form, or shape, of $f$.
e.g. $f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p$.
Having assumed that $f$ is linear, we only need to estimate the $p + 1$ coefficients instead of an entirely arbitrary function.

>After the model is selected, we need a procedure that uses the training data to fit, or train, the model. In the case of the linear model, we need to estimate the parameters $\beta_0, \beta_1, \dots, \beta_p$.

The most common approach to fitting the linear model is (ordinary) least squares.
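A minimal sketch of the parametric recipe (assume a linear form, then estimate the coefficients by least squares) might look like this. The simulated data and the "true" coefficients in `beta_true` are hypothetical, chosen only so the fit can be compared against known values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated training data: n observations, p = 2 predictors (all values assumed)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that beta_0 (the intercept) is estimated too
X_design = np.column_stack([np.ones(n), X])

# Least squares: choose the p + 1 coefficients that minimize the
# sum of squared residuals ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)
```

Only three numbers are estimated here, which is what makes the parametric approach simple. The price is that the fit is only as good as the linearity assumption.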


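A disadvantage of the parametric approach:
>The model we choose will usually not match the true unknown form of $f$. If the chosen model is too far from the true $f$, the estimate will be poor.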

##### Non-Parametric

We do not make explicit assumptions about the functional form of $f$. Instead, non-parametric methods seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly.

>Advantage
They have the potential to accurately fit a wider range of possible shapes for $f$, because no particular functional form is assumed.
The estimate of $f$ can therefore be very close to the true $f$.

>Disadvantage
They need a very large number of observations in order to obtain an accurate estimate for $f$.

If we are mainly interested in inference, i.e. the relationship between the predictors and the response variable, a more restrictive approach (e.g. a linear model) is preferable because it is easier to interpret.

If we are mainly interested in prediction, i.e. getting an accurate response value from the input data, a more flexible approach may give more accurate results, although this is not guaranteed because flexible methods can overfit.


Side note on scatter plots: if there are $p$ variables in our data, we can make $p(p-1)/2$ pairwise scatter plots.


##### Measuring the Quality of Fit 

##### Regression Setting

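The training MSE is computed on the observations used to fit the model:

$$ MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{f}(x_i))^2 $$

The test MSE is the average squared prediction error over previously unseen test observations $(x_0, y_0)$:

$$ \text{Ave}(y_0 - \hat{f}(x_0))^2 $$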

In some settings test observations are available. We can then evaluate the average squared prediction error on them and select the learning method with the smallest test MSE.

When test observations are not available, it is tempting to select the statistical learning method that minimizes the training MSE. However, there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE, so it may not predict well on new data.

Regardless of whether overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE.
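The gap between training and test MSE can be seen in a small simulation. This is only a sketch under assumptions: the true function (a sine curve), the noise level, the sample sizes, and the use of polynomial degree as the measure of flexibility are all choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """Hypothetical true function (assumed for illustration)."""
    return np.sin(2 * np.pi * x)

def simulate(n):
    """Draw n observations from Y = f(X) + epsilon."""
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y, y_pred):
    """Mean squared error between observed and predicted values."""
    return np.mean((y - y_pred) ** 2)

x_train, y_train = simulate(30)      # small training set
x_test, y_test = simulate(1000)      # separate, previously unseen test set

results = {}
for degree in (1, 3, 9):             # increasing flexibility
    coefs = np.polyfit(x_train, y_train, deg=degree)   # least squares polynomial fit
    train_mse = mse(y_train, np.polyval(coefs, x_train))
    test_mse = mse(y_test, np.polyval(coefs, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Because each polynomial model contains the lower-degree ones, the training MSE can only go down as the degree increases. The test MSE has no such guarantee, which is why the training MSE is a poor guide for choosing between methods.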

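The expected test MSE at a given point $x_0$ can be decomposed into the sum of three quantities: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$.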

##### Trade off between Bias and Variance 

$$ E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon) $$

In order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. 

Variance refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set.

Bias refers to the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

As a general rule, more flexible methods have higher variance and lower bias, while more restrictive methods have higher bias and lower variance. As flexibility increases, the bias initially tends to fall faster than the variance rises, so the expected test MSE declines. However, at some point further flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test MSE increases.

Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every
single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
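The two extremes described above can be estimated by simulation: refit the same method on many training sets and look at how the prediction at one point $x_0$ behaves. The sketch below rests on several assumptions made for illustration: a sine curve as the true $f$, a fixed grid of training inputs (only the noise is redrawn, which is one simple way to generate "different training sets"), and polynomial degree as the flexibility knob, with degree 0 playing the role of the horizontal line.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Hypothetical true function (assumed for illustration)."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)   # fixed training inputs
x0 = 0.25                         # point at which predictions are compared
sigma = 0.3                       # standard deviation of epsilon
n_repeats = 500                   # number of simulated training sets

def predict_at_x0(degree):
    """Draw a fresh training response, fit a polynomial, predict at x0."""
    y_train = f(x_train) + rng.normal(scale=sigma, size=x_train.size)
    coefs = np.polyfit(x_train, y_train, deg=degree)
    return np.polyval(coefs, x0)

summary = {}
for degree in (0, 3, 9):          # degree 0 is a horizontal line
    preds = np.array([predict_at_x0(degree) for _ in range(n_repeats)])
    bias_sq = (preds.mean() - f(x0)) ** 2   # squared bias of f_hat(x0)
    variance = preds.var()                  # variance of f_hat(x0) across training sets
    summary[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the horizontal line should show a large squared bias with a small variance, and the variance should grow as the degree increases. Adding `sigma ** 2` to the two estimated terms gives an estimate of the expected test MSE at `x0`.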

##### Classification Setting

When the response variable is qualitative, we are in the classification setting.

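$$ \text{Training error rate} = \frac{1}{n} \sum_{i=1}^{n} I (y_i \neq \hat{y}_i) $$

Here $\hat{y}_i$ is the predicted class label for the $i$th observation, and $I(y_i \neq \hat{y}_i)$ is an indicator variable that equals 1 if $y_i \neq \hat{y}_i$ and 0 otherwise. The equation computes the fraction of incorrect classifications on the training data and is referred to as the training error rate.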

$\text{Ave}(I(y_0 \neq \hat{y}_0))$ is the test error rate, where $\hat{y}_0$ is the predicted class label that results from applying the classifier to a test observation with predictor $x_0$.
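Both error rates are the same computation applied to different data: the mean of the indicator $I(y \neq \hat{y})$. A minimal sketch, with a hypothetical helper name `error_rate` and made-up labels:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # (y_true != y_pred) is the indicator I(y_i != y_hat_i); its mean is the error rate
    return float(np.mean(y_true != y_pred))

# Made-up labels for illustration
y_true = ["blue", "orange", "blue", "blue", "orange"]
y_pred = ["blue", "blue", "blue", "orange", "orange"]

rate = error_rate(y_true, y_pred)
print(rate)   # 2 of the 5 labels differ
```

Passing training labels gives the training error rate; passing labels from held-out observations gives the test error rate.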

##### Bayes Classifier 

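The Bayes classifier assigns each observation to the most likely class, given its predictor values.

Assign a test observation with predictor vector $x_0$ to the class $j$ for which $\text{Pr}(Y = j|X = x_0)$ is largest.

In a two-class problem, the Bayes decision boundary is the set of points where the conditional probability of each class is exactly 50%.

The Bayes error rate is $1 - E\left(\max_j \text{Pr}(Y = j|X)\right)$, where the expectation averages the probability over all possible values of $X$. It is analogous to the irreducible error.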
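The Bayes classifier can only be computed when the conditional distribution of $Y$ given $X$ is known, which is the case for simulated data. The sketch below assumes a made-up setup: two equally likely classes, with $X \sim N(-1, 1)$ in class 0 and $X \sim N(+1, 1)$ in class 1. Under that assumption the posterior has a closed form and the decision boundary is $x = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)

def posterior_class1(x):
    """Pr(Y = 1 | X = x) for the assumed setup.

    With equal priors and densities N(-1, 1) and N(+1, 1), the likelihood
    ratio is exp(2x), so the posterior is 1 / (1 + exp(-2x)).
    """
    return 1.0 / (1.0 + np.exp(-2.0 * x))

# Simulate from the assumed model
n = 200_000
y = rng.integers(0, 2, size=n)                             # class labels 0 or 1
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

p1 = posterior_class1(x)
y_bayes = (p1 > 0.5).astype(int)      # assign to the most likely class

# Bayes error rate two ways: the formula 1 - E[max_j Pr(Y = j | X)],
# and the observed misclassification rate of the Bayes classifier
bayes_error_formula = 1.0 - np.mean(np.maximum(p1, 1.0 - p1))
bayes_error_empirical = np.mean(y_bayes != y)

print(bayes_error_formula, bayes_error_empirical)
```

The two estimates should agree up to simulation noise. For real data the posterior is unknown, so the Bayes classifier serves as an unattainable benchmark rather than a usable method.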

##### K-Nearest Neighbors (KNN)

A positive integer $K$ is chosen along with a test observation $x_0$.
KNN first identifies the $K$ points in the training data set that are closest to $x_0$, represented by the neighbourhood $\mathcal{N}_0$. It then estimates the conditional probability of class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$:
$$ \text{Pr}(Y = j|X = x_0) =  \frac{1}{K}\sum_{i\in{\mathcal{N}_0}}I(y_i = j)$$
Finally, KNN applies the Bayes rule and classifies the test observation $x_0$ to the class with the largest estimated probability.
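The steps above can be sketched directly in NumPy. The function names `knn_predict_proba` and `knn_classify`, the use of Euclidean distance, and the toy data are assumptions for illustration; the notes do not fix a distance measure.

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the k nearest
    training points (Euclidean distance, an assumption) with label j."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    neighbors = np.argsort(dists)[:k]            # indices of the neighbourhood N_0
    classes = np.unique(y_train).tolist()
    return {c: float(np.mean(y_train[neighbors] == c)) for c in classes}

def knn_classify(X_train, y_train, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_predict_proba(X_train, y_train, x0, k)
    return max(probs, key=probs.get)

# Toy training data: two well-separated groups (made up for illustration)
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
x0 = [1, 1]

probs_k3 = knn_predict_proba(X_train, y_train, x0, k=3)
probs_k5 = knn_predict_proba(X_train, y_train, x0, k=5)
label_k3 = knn_classify(X_train, y_train, x0, k=3)
print(probs_k3, probs_k5, label_k3)
```

With `k=3` the neighbourhood contains only class 0 points; with `k=5` two class 1 points are pulled in, which shows how a larger $K$ smooths the estimated probabilities.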