# Chapter 2 Statistical Learning

## 2.1 What Is Statistical Learning?

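In essence, statistical learning refers to a set of approaches for estimating $f$ in the model $Y=f(X)+\epsilon$, where $f$ is a fixed but unknown function of the predictors $X$ and $\epsilon$ is a random error term that is independent of $X$ and has mean zero.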




### 2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate $f$: prediction and inference.

Recommendation: Mostly Harmless Econometrics (Angrist and Pischke, 2009); Textbook 04: 基本有用的计量经济学 (Basically Useful Econometrics)

### 2.1.2 How Do We Estimate f?

**Parametric methods:** a two-step, model-based approach. First, we make an assumption about the functional form, or shape, of $f$. Second, after a model has been selected, we need a procedure that uses the training data to fit or train the model. For a linear model, the most common fitting approach is referred to as (ordinary) least squares.



![title](img/line.png)
![title](img/line2.png)
![title](img/img1.png)
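
The two-step parametric recipe can be sketched in a few lines. This is a minimal illustration, assuming a single predictor and a linear form for $f$; the function name `fit_simple_ols` and the toy data are hypothetical and not part of the original notes.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Closed-form least squares solution for one predictor
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```

Step one is the choice of the linear form; step two is the call to `fit_simple_ols`, which reduces estimating $f$ to estimating two coefficients.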


**Non-parametric methods:** Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. 

Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.

![title](img/img2.png)
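
As a rough illustration of a non-parametric estimate, the sketch below uses K-nearest-neighbors averaging for a single predictor. This technique is swapped in for simplicity and is not necessarily the method shown in the figure; `knn_regress` and the toy data are hypothetical.

```python
def knn_regress(x_train, y_train, x0, k):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(yi for _, yi in nearest) / k


# Hypothetical toy data from y = x^2; no functional form is assumed by the estimator
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
pred = knn_regress(x_train, y_train, 2.4, k=2)
print(pred)  # average of the responses at x = 2 and x = 3
```

A smaller `k` makes the estimate follow the data more closely (more flexible); a larger `k` smooths it.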

### 2.1.3 The Trade-off Between Prediction Accuracy and Model Interpretability

The figure in this section represents the trade-off between flexibility and interpretability for different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases.
 
**Recommendation:** Bao Y, Ke B, Li B, et al. Detecting accounting fraud in publicly traded US firms using a machine learning approach[J]. Journal of Accounting Research, 2020, 58(1): 199-235.

![title](img/img3.png)

### 2.1.4 Supervised versus Unsupervised Learning

Most statistical learning problems fall into one of two categories: supervised or unsupervised.

Supervised: for each observation of the predictor measurement(s) $x_i$, $i = 1, \ldots, n$, there is an associated response measurement $y_i$.

Unsupervised: the somewhat more challenging situation in which, for every observation $i = 1, \ldots, n$, we observe a vector of measurements $x_i$ but no associated response $y_i$.

Semi-supervised: for $m$ of the observations, where $m < n$, we have both predictor measurements and a response measurement. For the remaining $n - m$ observations, we have predictor measurements but no response measurement.

### 2.1.5 Regression versus Classification Problems

Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. Examples include a person's age, height, or income, the value of a house, and the price of a stock.   

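In contrast, qualitative variables take on values in one of $K$ different classes, or categories. Examples of qualitative variables include a person's gender (male or female) and the brand of product purchased (brand A, B, or C). Problems with a quantitative response are usually called regression problems, while those with a qualitative response are called classification problems.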

------
## 2.2 Assessing Model Accuracy

### 2.2.1 Measuring the Quality of Fit

Mean squared error (MSE), called the training MSE when it is computed on the training data:
![title](img/line3.png)


![title](img/img4.png)
Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). 

Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.
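
The gap between training and test MSE can be reproduced with a small simulation. This is a sketch under stated assumptions: the true function ($\sin x$), the noise level, the sample sizes, and the use of K-nearest-neighbors averaging (smaller `k` means more flexibility) are all hypothetical choices, not the setup behind the figure.

```python
import math
import random


def mse(y_true, y_pred):
    """Average squared difference between observed and predicted responses."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def knn_regress(x_train, y_train, x0, k):
    """Predict at x0 by averaging the responses of the k nearest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(yi for _, yi in nearest) / k


def simulate(n, rng):
    """Draw n observations from the assumed model Y = sin(X) + noise."""
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [math.sin(xi) + rng.gauss(0, 0.5) for xi in x]
    return x, y


rng = random.Random(0)
x_tr, y_tr = simulate(100, rng)  # training data
x_te, y_te = simulate(100, rng)  # independent test data

results = {}
for k in (1, 5, 25, 100):  # k = 1 is the most flexible fit, k = 100 the least
    train_pred = [knn_regress(x_tr, y_tr, x0, k) for x0 in x_tr]
    test_pred = [knn_regress(x_tr, y_tr, x0, k) for x0 in x_te]
    results[k] = (mse(y_tr, train_pred), mse(y_te, test_pred))
    print(f"k={k:3d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With `k=1` every training point is its own nearest neighbor, so the training MSE is zero even though the test MSE is not. A low training MSE alone therefore says little about how well a method generalizes.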

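**Hint:** Higher model complexity is not always better. The more complex the model, the more prone it is to overfitting, which weakens its ability to generalize.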

### 2.2.2 The Bias-Variance Trade-Off

#### Expected test MSE: ![title](img/line5.png)

**Variance** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training data can result in large changes in $\hat{f}$. In general, more flexible statistical methods have higher variance.

**Bias** refers to the error that is introduced by approximating
a real-life problem, which may be extremely complicated, by a much
simpler model. 

![title](img/img5.png)
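
Both quantities can be estimated by refitting a method on many simulated training sets and looking at its predictions at one point $x_0$. The sketch below assumes a true function $f(x)=\sin x$, Gaussian noise, and K-nearest-neighbors averaging as the method; these choices and the helper names are hypothetical.

```python
import math
import random
import statistics


def f(x):
    """Assumed true function, used only for this simulation."""
    return math.sin(x)


def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(yi for _, yi in nearest) / k


def bias_variance_at(x0, k, n=50, reps=300, sigma=0.5, seed=0):
    """Estimate squared bias and variance of the k-NN prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(reps):  # each repetition draws a fresh training set
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [f(xi) + rng.gauss(0, sigma) for xi in x]
        preds.append(knn_predict(x, y, x0, k))
    bias_sq = (statistics.mean(preds) - f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


x0 = math.pi / 2
flexible = bias_variance_at(x0, k=1)   # follows individual points closely
rigid = bias_variance_at(x0, k=50)     # with n = 50 this is just the mean of y
print("k=1  bias^2=%.4f  variance=%.4f" % flexible)
print("k=50 bias^2=%.4f  variance=%.4f" % rigid)
```

In this setup the flexible fit (`k=1`) has the larger variance and the rigid fit (`k=50`) the larger squared bias, which is the trade-off described above.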



### 2.2.3 The Classification Setting

Training error rate:
![title](img/line6.png)
Test error rate:
![title](img/line7.png)
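
Both error rates are the fraction of observations whose predicted class differs from the observed class; only the data they are computed on differs. A minimal sketch, with a hypothetical helper name and made-up labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations where the predicted class differs from the true class."""
    return sum(yt != yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: one of four predictions is wrong
y_true = ["a", "b", "a", "b"]
y_pred = ["a", "a", "a", "b"]
rate = error_rate(y_true, y_pred)
print(rate)
```

Applying `error_rate` to the training observations gives the training error rate; applying it to held-out observations gives an estimate of the test error rate.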

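**Bayes classifier:** assigns each observation to the most likely class given its predictor values, that is, to the class $j$ for which $\operatorname{Pr}\left(Y=j \mid X=x_{0}\right)$ is largest.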

Note that this is a conditional probability: it is the probability that $Y = j$, given the observed predictor vector $x_{0}$.

Recommendation: https://zhuanlan.zhihu.com/p/149774236 (Bayes classifier)

In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class 1 if $\operatorname{Pr}(Y = 1 \mid X = x_0) > 0.5$, and class 2 otherwise.
![title](img/line10.png)
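
When the conditional probabilities are known, the rule is a one-line argmax. The sketch below assumes the probabilities are handed in as a dictionary; `bayes_classify` and the example numbers are hypothetical. For real data these probabilities are unknown, so the Bayes classifier serves as a benchmark rather than a usable method.

```python
def bayes_classify(cond_probs):
    """cond_probs maps each class label j to Pr(Y = j | X = x0); return the most likely class."""
    return max(cond_probs, key=cond_probs.get)


# Hypothetical conditional probabilities at two different points x0
print(bayes_classify({1: 0.7, 2: 0.3}))
print(bayes_classify({1: 0.2, 2: 0.8}))
```

In the two-class case this reduces to the 0.5 threshold on $\operatorname{Pr}(Y = 1 \mid X = x_0)$.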

**K-nearest neighbors (KNN) classifier:**
Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $\mathcal{N}_0$. It then estimates the conditional probability for class $j$ as the fraction of points in $\mathcal{N}_0$ whose response values equal $j$: ![title](img/line9.png)
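
The two steps, finding the neighborhood and taking class fractions, can be sketched as follows. Euclidean distance is an assumption here (the notes only say "closest"), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_class_probs(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the class fractions among the k nearest points."""
    # Indices of training points sorted by Euclidean distance to x0 (assumed metric)
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x0))
    neighbors = [y_train[i] for i in order[:k]]
    return {label: count / k for label, count in Counter(neighbors).items()}


def knn_classify(X_train, y_train, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_class_probs(X_train, y_train, x0, k)
    return max(probs, key=probs.get)


# Hypothetical two-dimensional training data with two classes
X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
y_train = ["blue", "blue", "orange", "orange", "orange", "orange"]
x0 = (0.5, 0.5)

probs = knn_class_probs(X_train, y_train, x0, k=3)
label = knn_classify(X_train, y_train, x0, k=3)
print(probs, label)
```

The three nearest points to `x0` are two blue points and one orange point, so the estimated probabilities are 2/3 and 1/3 and the prediction is blue.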