### **Conceptual Exercises for Chapter 2 from ISLP** 

#### I am only writing the theoretical answers. The mathematical answers will be done and written on my own.

### **For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer**

**(a) The sample size n is extremely large, and the number of predictors p is small.**

A flexible method would generally perform better. With a very large n and only a few predictors, there is enough data to estimate the extra parameters of a flexible model reliably, so its variance stays low while it can still capture structure that an inflexible method would miss. An inflexible method cannot capture the complexity of the data and hence performs poorly, however much data is available. A prime example would be running multiple linear regression on non-linear data.


**(b) The number of predictors p is extremely large, and the number of observations n is small.**

An inflexible method would generally outperform the flexible method. With few observations and many predictors, a flexible method has too much freedom relative to the data and would "overfit", fitting the noise in the training set and defeating the point of generalization. An inflexible method has higher bias but much lower variance, so it produces a model that generalizes better when n is small.

**(c) The relationship between the predictors and response is highly non-linear**

A flexible method would generally perform better in this scenario. An inflexible method such as linear regression cannot adapt to a highly non-linear relationship and would have high bias, whereas a flexible method has enough freedom to follow the shape of the data. An example would be using polynomial regression instead of a straight-line fit.

**(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.**

An inflexible method would generally perform better in this scenario. When the error variance is high, a flexible method tends to chase the noise in the training data and produce a fit that is too wiggly, so it fails to generalize. An inflexible method has higher bias, but its lower variance means it is likely to generalize better here.
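The trade-off in parts (a) through (d) can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the true function `true_f`, the noise level, the sample sizes and the polynomial degrees are all hypothetical choices, not part of the exercise. A degree-1 polynomial stands in for an inflexible method and a degree-8 polynomial for a flexible one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def true_f(x):
    # Hypothetical non-linear truth, chosen only for illustration.
    return np.sin(2 * x)

def simulate(n, sigma):
    # Draw n observations y = f(x) + noise with noise standard deviation sigma.
    x = rng.uniform(-2, 2, size=n)
    y = true_f(x) + rng.normal(0, sigma, size=n)
    return x, y

def train_test_mse(degree, x_tr, y_tr, x_te, y_te):
    # Fit a polynomial of the given degree by least squares,
    # then report the mean squared error on training and test data.
    coefs = np.polyfit(x_tr, y_tr, deg=degree)

    def mse(x, y):
        return float(np.mean((np.polyval(coefs, x) - y) ** 2))

    return mse(x_tr, y_tr), mse(x_te, y_te)

# One large test set, reused for every fit.
x_te, y_te = simulate(2000, sigma=0.3)

for n in (15, 1000):
    x_tr, y_tr = simulate(n, sigma=0.3)
    for degree in (1, 8):
        tr, te = train_test_mse(degree, x_tr, y_tr, x_te, y_te)
        print(f"n={n:5d} degree={degree}: train MSE={tr:.3f}, test MSE={te:.3f}")
```

The flexible fit always has the lower training error, because the degree-1 model is nested inside it. The test error is the quantity that matters: with a large n the flexible fit should win on this non-linear truth, while with a small n or a larger `sigma` its test error will typically deteriorate.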

### **Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.**

**(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary** 

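This will be a regression scenario, as the response, CEO salary, is quantitative. We are most interested in inference, because the aim is to understand which factors affect the salary rather than to predict it for a new firm. Here n = 500 (the firms) and p = 3 (profit, number of employees and industry).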

**(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables**

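This will be a classification task, as the response is qualitative: the product is either a success or a failure. We are most interested in prediction, because we want to know the outcome for the new product. Here n = 20 (the previously launched products) and p = 13 (price, marketing budget, competition price and the ten other variables).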

**(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.**

This will be a regression task, as the response, the % change in the USD/Euro exchange rate, is quantitative. We are most interested in prediction. Here n = 52 (one observation per week of 2012) and p = 3 (the % changes in the US, British and German markets).


### **You will now think of some real-life applications for statistical learning.**

**(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.**

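Classification can be useful in: 
1. Spam identification. The response is whether an email is spam or not; the predictors are features of the email, such as word frequencies and sender details. The goal is prediction.
2. Detecting handwritten characters. The response is the character (a digit or letter); the predictors are the pixel intensities of the image. The goal is prediction.
3. Object detection. The response is the class of object present in an image; the predictors are again the pixel values. The goal is prediction.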

**(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.**

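Regression can be useful in: 
1. Time-series analysis. The response is the next value of the series; the predictors are its past values. The goal is prediction.
2. Sales/weather forecasting. The response is a quantity such as sales volume or temperature; the predictors are past observations and related variables such as price or season. The goal is prediction.
3. Predicting any quantitative response in a real-life scenario, e.g. the number of accidents on a road. The predictors could be traffic volume, weather and road conditions. The goal can be prediction, or inference if we want to know which factors matter most.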


**(c) Describe three real-life applications in which cluster analysis might be useful.**

Cluster analysis can be useful in: 
1. Market segmentation based on what a customer likes.
2. Recommending specific items to specific customers.
3. Deciding which customers to send marketing emails to, based on their email-click data.


### **What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?**

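A very flexible approach has lower bias: it can adapt to almost any shape of relationship, including highly non-linear ones. Its disadvantages are higher variance, a greater chance of "overfitting" the noise in the training data, more parameters to estimate, and a model that is harder to interpret and more expensive to compute. A more flexible approach is preferred when n is large, the relationship between predictors and response is non-linear, and the goal is prediction. A less flexible approach is preferred when n is small relative to p (for example p > n), when the variance of the error terms is high, or when the goal is inference and interpretability matters.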
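The reasoning above rests on the bias-variance decomposition of the expected test error at a point $x_0$, written here in the usual ISLP notation ($\hat{f}$ is the fitted model and $\varepsilon$ the irreducible error):

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
$$

Increasing flexibility tends to lower the bias term and raise the variance term, while $\mathrm{Var}(\varepsilon)$ is a floor that no method can go below.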


### **Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?**

A parametric approach first assumes a functional form for f, for example that it is linear in the predictors, which reduces the problem to estimating a fixed set of parameters (p + 1 coefficients in the linear case). This is the case with linear regression. A non-parametric approach, like K-nearest neighbors, makes no explicit assumption about the form of f and instead lets the observations determine its shape.

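The advantages of a parametric approach are that the model is easier to interpret, it needs fewer observations to fit, and it is a good choice when there is a constraint on computational time. Its disadvantage is that the assumed form may be far from the true f; if the relationship between the variables is non-linear and does not match the assumed form, the fit will be poor. In that case a non-parametric approach is the better choice, provided there are enough observations, since it needs far more data to estimate f accurately and can overfit.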
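A minimal sketch of the difference, on simulated data: the parametric fit estimates two coefficients of an assumed straight line, while the non-parametric fit (K-nearest-neighbors regression) just averages nearby responses. The data-generating function, the noise level and `k=10` are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 3, size=200))
# Hypothetical non-linear truth plus noise.
y = np.exp(x) / 5 + rng.normal(0, 0.2, size=200)

# Parametric: assume f(x) = b0 + b1 * x, so only two numbers are estimated.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_linear(x0):
    # Prediction from the fitted straight line.
    return float(beta[0] + beta[1] * x0)

# Non-parametric: no assumed form, the prediction is a local average.
def predict_knn(x0, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]  # indices of the k closest x values
    return float(np.mean(y[nearest]))

for x0 in (0.5, 1.5, 2.8):
    print(f"x0={x0}: linear={predict_linear(x0):.2f}, "
          f"knn={predict_knn(x0):.2f}, truth={np.exp(x0) / 5:.2f}")
```

The linear model is summarized by `beta` alone, which is what makes it easy to interpret; the K-nearest-neighbors prediction needs the whole training set at prediction time.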


### **The table below provides a training data set containing six observations, three predictors, and one qualitative response variable**

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

In [2]:
# The training set is described here. 

data = pd.DataFrame({'Obs': [1, 2, 3, 4, 5, 6], 
                    'X1': [0, 2, 0, 0, -1, 1], 
                    'X2': [3, 0, 1, 1, 0, 1], 
                    'X3': [0, 0, 3, 2, 1, 1], 
                    'Y': ['Red', 'Red', 'Red', 'Green', 'Green', 'Red']})

data

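| | Obs | X1 | X2 | X3 | Y |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | Red |
| 1 | 2 | 2 | 0 | 0 | Red |
| 2 | 3 | 0 | 1 | 3 | Red |
| 3 | 4 | 0 | 1 | 2 | Green |
| 4 | 5 | -1 | 0 | 1 | Green |
| 5 | 6 | 1 | 1 | 1 | Red |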


In [3]:
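# (a) Euclidean distance between each observation and the test point X1 = X2 = X3 = 0

def distance(x):
    # The test point is the origin, so the distance is the norm of each row.
    result = np.sum(x**2, axis=1)
    return result**0.5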

euc_dist = pd.DataFrame({'distance': distance(data[['X1', 'X2', 'X3']])})
data_dist = pd.concat([data, euc_dist], axis=1)
data_dist

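| | Obs | X1 | X2 | X3 | Y | distance |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | Red | 3.0 |
| 1 | 2 | 2 | 0 | 0 | Red | 2.0 |
| 2 | 3 | 0 | 1 | 3 | Red | 3.162278 |
| 3 | 4 | 0 | 1 | 2 | Green | 2.236068 |
| 4 | 5 | -1 | 0 | 1 | Green | 1.414214 |
| 5 | 6 | 1 | 1 | 1 | Red | 1.732051 |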


In [4]:
# (b) For K = 1, the prediction is the class of the single nearest neighbour (Green).

K = 1
data_dist.nsmallest(K, "distance")

| | Obs | X1 | X2 | X3 | Y | distance |
|---|---|---|---|---|---|---|
| 4 | 5 | -1 | 0 | 1 | Green | 1.414214 |


In [5]:
# (c) For K = 3, the prediction is the majority class among the three nearest neighbours (Red, by 2 votes to 1).

K = 3 
data_dist.nsmallest(K, "distance")

| | Obs | X1 | X2 | X3 | Y | distance |
|---|---|---|---|---|---|---|
| 4 | 5 | -1 | 0 | 1 | Green | 1.414214 |
| 5 | 6 | 1 | 1 | 1 | Red | 1.732051 |
| 1 | 2 | 2 | 0 | 0 | Red | 2.0 |
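Parts (b) and (c) can also be answered in one step by turning the nearest neighbours into a majority vote. The helper `knn_predict` below is a hypothetical name introduced for this sketch; it repeats the training data from the table so that it runs on its own.

```python
from collections import Counter

import numpy as np

# Training data copied from the table above.
X_train = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3],
                    [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y_train = ["Red", "Red", "Red", "Green", "Green", "Red"]

def knn_predict(x0, k):
    # Euclidean distance from the test point x0 to every training observation.
    dists = np.linalg.norm(X_train - np.asarray(x0), axis=1)
    # Indices of the k closest observations.
    nearest = np.argsort(dists, kind="stable")[:k]
    # Majority vote over the labels of those neighbours.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict([0, 0, 0], k=1))  # Green
print(knn_predict([0, 0, 0], k=3))  # Red
```

With K = 1 the single nearest neighbour is observation 5 (Green); with K = 3 the neighbours are observations 5, 6 and 2, giving two Red votes against one Green.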


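(d) If the Bayes decision boundary is highly non-linear, I would expect the best value of K to be small. A smaller K gives a more flexible decision boundary that can follow a non-linear shape, whereas a larger K averages over more neighbours and produces a smoother, closer-to-linear boundary.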