## 1. Resampling   
- Resampling involves repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
- Two of the most commonly used techniques:   
  1. Cross-validation
  2. Bootstrapping
- Resampling approaches can be computationally expensive, because the same model is refit many times on different subsets of the data

### 1.1 Cross-Validation   
- The test error is the average error a model makes on observations that were not used to fit it
- In the absence of a large test set, we estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations
- The resulting error provides an estimate of the test error rate
#### 1.1.1  Validation Set Approach   
- It involves randomly dividing the available set of observations into two parts, a `training set` and a `validation set` (or `hold-out set`)
- The model is fit on the training set, and the test error is estimated on the validation set
- The data can be split 50%-50%, but partitions of 70%-30%, 75%-25%, and 80%-20% are more common
  ![image.png](attachment:17661b16-bde5-4d0b-b881-d8596b0cae70.png)
- A schematic display of the validation set approach. A set of n observations are randomly split into a training set (blue) and a validation set (beige)
  ![image.png](attachment:17eaaacb-c272-4dd7-b666-666622c81ecd.png)
- Two potential issues:
  1. The validation estimate of the test error can be highly variable, since it depends on exactly which observations fall in the training set and which fall in the validation set
  2. Statistical methods tend to perform worse when trained on fewer observations, so the validation set error rate may overestimate the test error rate of the model fit on the entire data set
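
The variability of the validation estimate can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the Auto data set, a straight-line least-squares fit via NumPy, and a hypothetical helper named `validation_mse`. It repeats a 70%-30% split with ten different random seeds and prints the resulting validation MSEs.

```python
import numpy as np

def validation_mse(x, y, train_frac=0.7, seed=0):
    """Fit a straight line on a random training split; return the validation MSE."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))            # random ordering of the observations
    n_train = int(train_frac * len(x))
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    coef = np.polyfit(x[train_idx], y[train_idx], deg=1)   # least-squares line
    pred = np.polyval(coef, x[val_idx])
    return np.mean((y[val_idx] - pred) ** 2)

# Synthetic data (an assumption made for illustration, not the Auto data set)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Same data, ten different random splits: the estimate changes from split to split
val_errors = [validation_mse(x, y, seed=s) for s in range(10)]
print(np.round(val_errors, 3))
```

The spread of the printed values reflects the first issue listed above: the estimate depends on which observations happen to land in the validation set.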

In [1]:
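import pandas as pd
# Load the Auto data set (assumes Auto.csv is in the working directory)
df = pd.read_csv('Auto.csv')
print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)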

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino  
(397, 9)


#### 1.1.2  Leave-One-Out Cross-Validation   
- LOOCV involves splitting the set of observations into two parts
- Training set contains all but one observation (in blue below). Test set contains one observation (in beige)
  ![image.png](attachment:25538455-26b1-4ae8-a25b-4df8f1154a08.png)

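- Compute the MSE for each left-out observation, $MSE_i = (y_i - \hat{y}_i)^2$. The LOOCV estimate of the test error is <br>
   $CV_{(n)} = \frac{1}{n}\sum_{i = 1}^{n} MSE_i$
- LOOCV avoids both problems of the validation set approach: each fit uses n − 1 observations, so the test error is overestimated far less, and there is no randomness in the splits, so repeated runs give the same result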
- LOOCV has the potential to be expensive to implement, since the model has to be fit n times
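
A minimal sketch of the LOOCV procedure, assuming a polynomial least-squares fit with NumPy and synthetic data (both are illustrative choices, and `loocv_mse` is a hypothetical helper name). The model is refit n times, each time leaving out one observation and recording the squared error on it.

```python
import numpy as np

def loocv_mse(x, y, deg=1):
    """Leave-one-out CV estimate of the test MSE for a polynomial least-squares fit."""
    n = len(x)
    sq_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # all observations except the i-th
        coef = np.polyfit(x[keep], y[keep], deg=deg)   # refit on n - 1 observations
        sq_errors[i] = (y[i] - np.polyval(coef, x[i])) ** 2   # MSE_i
    return sq_errors.mean()                            # CV_(n)

# Synthetic data (an assumption made for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, size=50)

cv_n = loocv_mse(x, y)
print(f"LOOCV estimate of test MSE: {cv_n:.3f}")
```

Because no random split is involved, calling `loocv_mse` again on the same data returns the same value.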

#### 1.1.3  k-Fold Cross-Validation   
- This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size
- The first fold is treated as a validation set, and the model is fit on the remaining k − 1 folds
- The mean squared error, $MSE_1$, is then computed on the observations in the held-out fold
- This procedure is repeated k times; each time, a different group of observations is treated as a validation set. The k-fold CV estimate is the average of the k resulting values <br>
  $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i$

- Example (3 - fold)
  ![image.png](attachment:bb26d069-4bcb-4155-8902-eb4227bfc48d.png)
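
The k-fold procedure described above can be sketched as follows. This is an illustration only: it assumes a polynomial least-squares fit with NumPy and synthetic data, and `kfold_mse` is a hypothetical helper name.

```python
import numpy as np

def kfold_mse(x, y, k=5, deg=1, seed=0):
    """k-fold CV estimate of the test MSE for a polynomial least-squares fit."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))        # shuffle before forming folds
    folds = np.array_split(perm, k)       # k folds of approximately equal size
    fold_mse = []
    for val_idx in folds:
        train_idx = np.setdiff1d(perm, val_idx)        # the remaining k - 1 folds
        coef = np.polyfit(x[train_idx], y[train_idx], deg=deg)
        pred = np.polyval(coef, x[val_idx])
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))   # MSE_i on the held-out fold
    return np.mean(fold_mse)              # CV_(k)

# Synthetic data (an assumption made for illustration)
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=100)

print(f"5-fold CV estimate:  {kfold_mse(x, y, k=5):.3f}")
print(f"10-fold CV estimate: {kfold_mse(x, y, k=10):.3f}")
```

Setting `k=len(x)` puts a single observation in each fold, so the same function then performs LOOCV.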


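- What is the relationship between LOOCV and k-fold cross-validation? LOOCV is the special case of k-fold cross-validation with k = n
- k-fold cross-validation with k < n is cheaper than LOOCV, because the model is fit only k times instead of n times
- In practice, k-fold cross-validation is typically performed with k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance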
- The results above are stated for the regression case, but the same approach applies to classification: we replace the MSE with the error rate  <br>
$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}Err_i$  <br>
where $Err_i$ is the misclassification rate on fold $i$, i.e. the fraction of held-out observations in that fold that are misclassified. For LOOCV (k = n), each fold holds a single observation and this reduces to
$Err_i = I(y_i \neq \hat{y}_i)$

- We can use cross-validation error to estimate test error
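
For the classification case, the fold-wise error rate replaces the MSE. The sketch below is an illustration under assumptions that are not part of these notes: a simple nearest-class-mean classifier written in NumPy (`nearest_mean_predict`), two synthetic Gaussian classes, and a hypothetical helper `kfold_error_rate`.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_new):
    """Assign each new point to the class whose training mean is closest."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def kfold_error_rate(X, y, k=5, seed=0):
    """k-fold CV estimate of the misclassification rate."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    fold_err = []
    for val_idx in np.array_split(perm, k):
        train_idx = np.setdiff1d(perm, val_idx)
        y_hat = nearest_mean_predict(X[train_idx], y[train_idx], X[val_idx])
        fold_err.append(np.mean(y_hat != y[val_idx]))   # Err_i: error rate on fold i
    return np.mean(fold_err)                            # CV_(k)

# Two synthetic Gaussian classes (an assumption made for illustration)
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

cv_err = kfold_error_rate(X, y, k=5)
print(f"5-fold CV misclassification rate: {cv_err:.3f}")
```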

![image.png](attachment:00dc44f2-1a8f-4db7-9ee0-62290b374c67.png)




### 1.2 The Bootstrap   
- Bootstrap sampling is random sampling with replacement
- It is used to quantify the uncertainty of an estimate, e.g. the coefficients of a linear/logistic regression model
- In particular, it gives an estimate of the standard error of that estimate  

![image.png](attachment:d49d3d4e-52bf-44be-af65-390ab1102ae9.png)       

- We can compute the standard error of these bootstrap estimates using the formula

![image.png](attachment:abb1de01-cdbf-4844-bbdf-52a6cb12710a.png)
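
A minimal sketch of the bootstrap standard error, assuming the standard error is taken as the sample standard deviation of the B bootstrap estimates. The helper name `bootstrap_se`, the synthetic data, and the choice of the sample mean as the statistic are all illustrative assumptions.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of `statistic` on 1-D `data`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)      # n draws with replacement
        estimates[b] = statistic(data[idx])   # statistic on the b-th bootstrap sample
    return estimates.std(ddof=1)              # spread of the bootstrap estimates

# Synthetic data (an assumption made for illustration)
rng = np.random.default_rng(5)
data = rng.normal(loc=5.0, scale=2.0, size=100)

se_boot = bootstrap_se(data, np.mean)
se_formula = data.std(ddof=1) / np.sqrt(len(data))   # textbook SE of the mean
print(f"bootstrap SE of the mean: {se_boot:.3f}")
print(f"s / sqrt(n):              {se_formula:.3f}")
```

For the sample mean a closed-form standard error exists, which makes it a convenient check; the same function can be passed any other statistic, such as `np.median`, for which no simple formula is available.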

## Questions
1. What is the probability that the $j^{th}$ observation is not part of a bootstrap sample of size n drawn from n observations?
2. What is the validation set approach?
3. What is LOOCV?
4. What is k-fold cross-validation? What is an ideal value for k?
5. What is bootstrap sampling? What is it used for?
6. What is the difference between model assessment and model selection?
7. Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction