# Chapter 2 Statistical Learning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Conceptual

### Q1

For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
- (a) The sample size $n$ is extremely large, and the number of predictors $p$ is small.
- (b) The number of predictors $p$ is extremely large, and the number of observations $n$ is small.
- (c) The relationship between the predictors and response is highly non-linear.
- (d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$, is extremely high.

### A1

In case (a), we would generally expect a flexible method to be better than an inflexible method.
With a large number of samples, the flexible method would be less prone to overfitting and would be better able to match the underlying function relating the inputs to the output.

In case (b), we would generally expect a flexible method to be worse than an inflexible method.
With only a small number of observations and many predictors, a flexible method would be especially prone to overfitting.

In case (c), we would generally expect a flexible method to be better than an inflexible method.
The inflexible method would have trouble capturing the curvature in the data, whereas the flexible method would be able to.

In case (d), we would generally expect a flexible method to be worse than an inflexible method.
The flexible method would be susceptible to following the noise too closely, whereas the inflexible method would still be able to get the gist of the underlying relationship. 
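
The effect of sample size on the comparison can be checked with a small simulation. This is only a sketch under stated assumptions: the target function `true_f`, the noise level, the sample sizes, and the polynomial degrees (1 as the inflexible method, 8 as the flexible one) are all hypothetical choices made for illustration, not part of the exercise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical non-linear target, chosen only for illustration
    return np.sin(2 * np.pi * x)

def average_test_mse(n_train, degree, noise_sd=0.3, n_reps=100, n_test=500):
    """Average test MSE of a polynomial fit over repeated training sets."""
    mses = []
    for _ in range(n_reps):
        # Draw a fresh training set and a fresh test set each repetition
        x_train = rng.uniform(0, 1, n_train)
        y_train = true_f(x_train) + rng.normal(0, noise_sd, n_train)
        x_test = rng.uniform(0, 1, n_test)
        y_test = true_f(x_test) + rng.normal(0, noise_sd, n_test)
        fit = Polynomial.fit(x_train, y_train, degree)
        mses.append(np.mean((y_test - fit(x_test)) ** 2))
    return float(np.mean(mses))

# Compare an inflexible (degree 1) and a flexible (degree 8) fit
# for a small and a large training set
results = {
    (n, degree): average_test_mse(n, degree)
    for n in (12, 2000)
    for degree in (1, 8)
}
for (n, degree), mse in results.items():
    print(f"n = {n:4d}, degree = {degree}: average test MSE = {mse:.3f}")
```

Under these assumptions, the flexible fit should have the lower test error when $n$ is large and the higher test error when $n$ is small, in line with the reasoning for cases (a) and (c) and the overfitting concern in case (b).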

### Q2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.
- (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
- (b) We are considering launching a new product and wish to know whether it will be a *success* or a *failure*. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
- (c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

### A2

Case (a) is a regression problem where we are most interested in inference.
$n = 500, p = 3$

Case (b) is a classification problem where we are most interested in prediction.
$n = 20, p = 13$ (price, marketing budget, competition price, and ten other variables)

Case (c) is a regression problem where we are most interested in prediction.
$n = 52, p = 3$

### Q3

We now revisit the bias-variance decomposition.
- (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.
- (b) Explain why each of the five curves has the shape displayed in part (a).

### A3

- Squared bias
    - Decreases as flexibility increases
    - A more flexible method can more closely match the underlying function
- Variance
    - Increases as flexibility increases
    - A more flexible method is more sensitive to small changes in the training data set
- Training error
    - Decreases as flexibility increases
    - A more flexible method can more closely match the training data
- Test error
    - Sum of squared bias, variance, and irreducible error
    - U-shaped
    - Decreases at first then increases as flexibility increases
- Irreducible error
    - Constant
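
The shapes listed above follow from the decomposition of the expected test MSE at a point $x_0$, $E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon)$. A rough way to see the trend numerically is to refit a model on many simulated training sets and estimate each term. The sketch below assumes a hypothetical target `true_f`, a noise standard deviation of 0.3, and polynomial degree as the measure of flexibility; none of these come from the exercise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_f(x):
    # Hypothetical underlying function, for illustration only
    return np.sin(2 * np.pi * x)

noise_sd = 0.3                        # assumed sd of the irreducible error
n_train, n_sims = 30, 300             # training set size, number of refits
x_grid = np.linspace(0.05, 0.95, 50)  # test points where predictions are compared
degrees = [1, 3, 7]                   # increasing flexibility

squared_bias, variance = [], []
for degree in degrees:
    preds = np.empty((n_sims, x_grid.size))
    for s in range(n_sims):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        preds[s] = Polynomial.fit(x, y, degree)(x_grid)
    # Squared bias: squared gap between the average prediction and the truth
    squared_bias.append(float(np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)))
    # Variance: spread of the predictions across training sets
    variance.append(float(np.mean(preds.var(axis=0))))

for d, b, v in zip(degrees, squared_bias, variance):
    total = b + v + noise_sd ** 2
    print(f"degree {d}: squared bias = {b:.4f}, variance = {v:.4f}, "
          f"expected test MSE = {total:.4f}")
```

In this setup the estimated squared bias should fall and the estimated variance should rise as the degree increases, while the irreducible term `noise_sd ** 2` stays fixed.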

### Q4

You will now think of some real-life applications for statistical learning.
- (a) Describe three real-life applications in which *classification* might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
- (b) Describe three real-life applications in which *regression* might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
- (c) Describe three real-life applications in which *cluster analysis* might be useful.

### A4

One application of classification is in analyzing the movement of the stock market, predicting whether a particular stock index will increase or decrease on a given day based on the percent changes in the index for the previous 5 or so days.
The goal of such an application would be prediction, as determining when the stock price will increase or decrease is more important for making money than the exact reasons for the change.
A second application of classification is in determining whether someone is likely to default on a loan.
The predictors might include their income, age, credit history, collateral, capital, and conditions of the loan.
The goal of such an application would be inference, as banks would likely be interested in knowing which predictors are most significant so that they can prioritize getting that data from clients.
A third application of classification is in identifying coins that are rare and valuable.
The predictors could include the mass of the coin, its diameter, its type, and image data of the front and back sides of the coin.
The goal of such an application would be prediction, as once the potential coins of value are set aside, a human can appraise them and confirm their worth and put them up for sale.

One application of regression is in determining what factors are most associated with a person's wage.
The predictors might include age, education, and the calendar year.
The goal of such an application would be inference, as many people would like to know what they can do to increase their wages.
For instance, some might wonder if investing in education will lead to higher wages in the future.
A second application of regression is in determining what means of advertising lead to the most sales.
The predictors could include the advertising budgets for different means such as TV, radio, newspaper, YouTube ads, website ads, and YouTube sponsorships.
The goal of such an application would be inference, as a company would want to know which advertising methods are associated with the most sales and which ones they should devote the most money towards.
A third application of regression is in determining the values of homes.
The predictors could include the crime rate, zoning, distance from a river, air quality, schools, income level of the community, size of the houses, and so on.
The goal of such an application could be prediction if a real estate agent only cared about setting a price, or if a client wanted to check whether the price of a house is reasonable.
The goal might also be inference if the agent wanted to explain to a client the reasons for the pricing.

One application of cluster analysis is in a market segmentation study.
Based on characteristics for potential customers such as zip code, family income, and shopping habits, a firm might want to identify distinct groups of customers.
This can possibly help the company find ways to target those groups.
A second application of cluster analysis is in analyzing gene expression data for cancer cell lines.
The goal would be to identify groups among the different cell lines, as that information could potentially help with treatment.
A third application of cluster analysis is in image analysis, helping identify groups of related pixels that represent an object.
For instance, it might help with identifying letters or faces in an image.

### Q5

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

### A5

A very flexible approach allows for greater variety in the functions generated, thus giving it the potential to describe more complex underlying relationships.
However, if there is not enough training data, a very flexible approach could be susceptible to overfitting, picking up on patterns that are caused by random variation rather than properties of the underlying function.
A very flexible approach would also be prone to overfitting if the irreducible error in the data is high, following the noise too closely.
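A more flexible approach might be preferred when there is plenty of training data, the underlying relationship is highly non-linear, and the irreducible error is low.
A less flexible approach might be preferred if there is not much training data or the irreducible error is high, since it would still be able to get the gist of the underlying relationship.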

### Q6

Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

### A6

Parametric approaches make an assumption about the functional form of the underlying relationship.
That function has coefficients/parameters that can be tuned to produce different shapes based on the training data.
Non-parametric approaches make no assumptions about the functional form, and instead try to match the data as closely as possible without being too rough or wiggly.
The advantage of a parametric approach is that it is generally much easier to estimate a set of parameters than an entirely arbitrary function.
The disadvantage is that there are no guarantees that the model chosen matches the form of the underlying function, and it could potentially be very different.
Non-parametric methods can fit a wider range of possible function shapes without making any assumptions about the underlying form, but they often require a very large number of observations in order to obtain an accurate estimate.
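
The contrast can be made concrete with a minimal sketch. It assumes hypothetical training data with a linear trend, and uses a straight-line fit as the parametric method and a nearest-neighbor average as the non-parametric one; the names `predict_linear` and `predict_knn` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical training data: a linear trend plus noise
x_train = rng.uniform(0, 10, 50)
y_train = 2.0 + 3.0 * x_train + rng.normal(0, 1.0, 50)

# Parametric: assume f(X) = beta_0 + beta_1 * X, so only two numbers are estimated.
# np.polyfit returns coefficients from the highest degree down.
beta_1, beta_0 = np.polyfit(x_train, y_train, deg=1)

def predict_linear(x0):
    return beta_0 + beta_1 * x0

# Non-parametric: no assumed form; average the responses of the K nearest points
def predict_knn(x0, K=5):
    nearest = np.argsort(np.abs(x_train - x0))[:K]
    return y_train[nearest].mean()

x0 = 4.0
print(f"linear fit:  beta_0 = {beta_0:.2f}, beta_1 = {beta_1:.2f}, "
      f"prediction = {predict_linear(x0):.2f}")
print(f"KNN (K = 5): prediction = {predict_knn(x0):.2f}")
```

The parametric fit reduces the problem to estimating `beta_0` and `beta_1`, which works well here only because the assumed form matches the simulated data. The nearest-neighbor average needs no such assumption but relies on having enough observations close to `x0`.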

### Q7

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for $Y$ when $X_1 = X_2 = X_3 = 0$ using $K$-nearest neighbors.
- (a) Compute the Euclidean distance between each observation and the test point, $X_1 = X_2 = X_3 = 0$.
- (b) What is our prediction with $K = 1$? Why?
- (c) What is our prediction with $K = 3$? Why?
- (d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the *best* value for $K$ to be large or small? Why?

| Obs. | $X_1$ | $X_2$ | $X_3$ | $Y$   |
| ---- | ----- | ----- | ----- | ----- |
| 1    |  0    |  3    |  0    | Red   |
| 2    |  2    |  0    |  0    | Red   |
| 3    |  0    |  1    |  3    | Red   |
| 4    |  0    |  1    |  2    | Green |
| 5    | -1    |  0    |  1    | Green |
| 6    |  1    |  1    |  1    | Red   |

### A7

Parts (a) through (c) are computed in the code cell at the end of this answer.
For part (d), if the Bayes decision boundary is highly non-linear, then we would expect the *best* value for $K$ to be small.
Large $K$ values tend to produce more linear boundaries, while low $K$ values tend to produce more wiggly boundaries.
With a small $K$ the model is more flexible, using only the most nearby observations for its decision.
With a large $K$ the model is less flexible, using a larger portion of the observations for its decision.

In [2]:
# Create some helper functions
def distance(a, b):
    """
    Calculate the Euclidean distance between two points, a and b.
    """
    return np.linalg.norm(a - b)

def K_nearest_neighbors(df, predictor_cols, response_col, test_point, K):
    """
    Use K-nearest neighbors on the data in the dataframe, 'df'.
    'predictor_cols' are the names of the predictor columns.
    'response_col' is the name of the response column.
    'test_point' is an array with the test point.
    'K' is the K value, how many of the closest points to consider.
    """
    return (df
        .assign(Distance=lambda df_: df_.loc[:,predictor_cols].apply(lambda x: distance(x, test_point), axis="columns", raw=True))
        .nsmallest(K, "Distance")
        .loc[:,response_col]
        .mode()
        .iloc[0]
    )

# Create a dataframe with the data
data = [
    [0, 3, 0, "Red"],
    [2, 0, 0, "Red"],
    [0, 1, 3, "Red"],
    [0, 1, 2, "Green"],
    [-1, 0, 1, "Green"],
    [1, 1, 1, "Red"],
]
df = pd.DataFrame(data,
    columns=["X_1", "X_2", "X_3", "Y"],
    index=range(1, len(data) + 1),
)

# Compute the Euclidean distance between each observation and the test point
test_point = np.array([0, 0, 0])
df["Distance"] = df.loc[:,"X_1":"X_3"].apply(lambda x: distance(x, test_point), axis="columns", raw=True)
print(df)

# Make a prediction with K = 1, then K = 3
for K in [1, 3]:
    prediction = K_nearest_neighbors(df, ["X_1", "X_2", "X_3"], "Y", test_point, K)
    print()
    print(f"K = {K}, {prediction}")

   X_1  X_2  X_3      Y  Distance
1    0    3    0    Red  3.000000
2    2    0    0    Red  2.000000
3    0    1    3    Red  3.162278
4    0    1    2  Green  2.236068
5   -1    0    1  Green  1.414214
6    1    1    1    Red  1.732051

K = 1, Green

K = 3, Red
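
To see why the predictions come out this way, the votes among the $K$ nearest neighbors can be tallied directly. The following is a vectorized cross-check written with plain NumPy, not the helper functions above; the variable names are introduced here for illustration, and only odd values of $K$ are used so that the two-class vote cannot tie.

```python
from collections import Counter
import numpy as np

# Training data from the table in Q7
X = np.array([
    [0, 3, 0],
    [2, 0, 0],
    [0, 1, 3],
    [0, 1, 2],
    [-1, 0, 1],
    [1, 1, 1],
])
y = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])
x0 = np.zeros(3)  # test point X_1 = X_2 = X_3 = 0

# Euclidean distance from every observation to the test point
distances = np.linalg.norm(X - x0, axis=1)
order = np.argsort(distances)  # row indices, nearest first

votes_by_K = {}
for K in (1, 3, 5):
    votes_by_K[K] = Counter(y[order[:K]].tolist())
    label, _ = votes_by_K[K].most_common(1)[0]
    neighbors = (order[:K] + 1).tolist()  # 1-based observation numbers
    print(f"K = {K}: neighbors = {neighbors}, "
          f"votes = {dict(votes_by_K[K])}, prediction = {label}")
```

With $K = 1$ the only neighbor is observation 5, which is Green. With $K = 3$ the neighbors are observations 5, 6, and 2, giving two Red votes against one Green, so the prediction is Red.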


## Applied