### Final exam practice questions

#### Question 1
In this question, we will create a function to generate linear data step-by-step. 
1. Define a function called `gen_lin_data` that takes as its input an integer `n`, and returns a matrix `X` and a vector `Y`. 
2. First we will generate `X`. Within the function,
 - Create two random (numpy) vectors of size `n` named `X_1` and `X_2` where 
    - Each element of `X_1` is independent and distributed according to the Uniform(-2,2) distribution
    - Each element of `X_2` is independent and distributed according to the Normal(2,2) distribution. 
    
  - Finally, concatenate `X_1` and `X_2` to generate `X` so that `X` is a numpy object with `n` rows and 2 columns.


3. Now we will work towards generating `Y`. Within the function,
  - Define a numpy array called `beta` and set it equal to `[-1,2]`. 
  - Define another `n`-dimensional vector called `eps` where each element is independent and distributed according to the Normal(0,1/2) distribution. 
  - Finally, generate a vector of size `n` called `Y` according to the following equation:
$$
Y = X \cdot \beta + eps
$$
4. Now that your function is defined, test your function in the designated cell and call it with **`n` equal to 500**. Save the resulting arrays as `Y500` and `X500`. **Print** the average of `Y500`.

In [1]:
import numpy as np

In [2]:
# Your function here
def gen_lin_data(n):
    X_1 = np.random.uniform(-2,2,n)
    X_2 = np.random.normal(2,2, n)
    X = np.vstack([X_1,X_2]).T
    beta = np.array([-1,2])
    eps = np.random.normal(0,1/2,n)
   
    Y = X @ beta + eps
    return X, Y

In [3]:
np.random.seed(10) #ignore this line and do not delete it

# Define Y500 and X500 below this line:
X500, Y500 = gen_lin_data(500)
# Your code here
print(np.mean(Y500))

3.803125339392462


#### Question 2
In this question, we will create our own function to perform basic OLS estimation.
1. Define a function called `simple_OLS` that takes a vector `Y` and a matrix `X` as inputs and returns a vector of OLS (linear regression) coefficient estimates. **To get full marks, you must explicitly calculate the OLS coefficients using the analytical formula for linear regression.** You will get some partial credit for estimating the coefficients using another method.
2. **In the Markdown Cell below your function**, answer this question: If `Y` is an $n$-vector and `X` is an $n\times d$ matrix, what are the dimensions of the object that will be output by the `simple_OLS` function? Verify this with your code or justify it using linear algebra.
3. In the designated cell, define `beta_hat` to be the output of `simple_OLS` using `Y500` and `X500` from the previous problem.
4. In the second provided Markdown Cell, explain what values you would expect the entries of `beta_hat` to be close to, and why.


In [4]:
# Your function here
def simple_OLS(Y, X):
    return np.linalg.inv(X.T @ X)@(X.T @ Y)

### Response 1
The `simple_OLS` function returns a $d$-vector (a column vector with $d$ entries). You can see this by calling `beta_hat.shape` (here $d = 2$), or by noting that the result shares its row dimension with `X.T` (which is $d$) and its column dimension with `Y` (which is 1): $(X^\top X)^{-1}$ is $d \times d$ and $X^\top Y$ is $d \times 1$, so their product is $d \times 1$.
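
As a quick sanity check of the dimension argument, the sketch below applies the same formula to a made-up design with $d = 3$. The sizes, the seed, and the name `beta_hat_demo` are assumptions chosen for illustration and are not part of the exam.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n, d = 50, 3                    # hypothetical sizes, not from the exam
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Same analytical formula as in simple_OLS
beta_hat_demo = np.linalg.inv(X.T @ X) @ (X.T @ Y)

print((X.T @ X).shape)      # d x d
print((X.T @ Y).shape)      # d entries
print(beta_hat_demo.shape)  # d entries
```

Because `Y` is a one-dimensional numpy array, the result is also one-dimensional with `d` entries rather than an explicit `d x 1` matrix.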

In [5]:
# Define beta_hat here
beta_hat = simple_OLS(Y500, X500)
print(beta_hat)
print(beta_hat.shape)

[-0.98368016  1.99845442]
(2,)


### Response 2
We would expect `beta_hat` to be close to `[-1, 2]` because those are the true linear coefficients (`beta`) that we used to generate the data.
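
To illustrate this, the sketch below re-estimates the coefficients on samples of increasing size. It assumes a variant of the data generator, here called `gen_lin_data_rng` (a hypothetical name), that draws from an explicitly seeded generator; the seed and sample sizes are arbitrary choices.

```python
import numpy as np

def gen_lin_data_rng(n, rng):
    """Same data-generating process as gen_lin_data, drawing from a passed-in Generator."""
    X_1 = rng.uniform(-2, 2, n)
    X_2 = rng.normal(2, 2, n)
    X = np.column_stack([X_1, X_2])
    beta = np.array([-1, 2])
    eps = rng.normal(0, 1 / 2, n)
    return X, X @ beta + eps

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
estimates = {}
for n in [100, 10_000, 100_000]:
    X, Y = gen_lin_data_rng(n, rng)
    # Analytical OLS formula, as in simple_OLS
    estimates[n] = np.linalg.inv(X.T @ X) @ (X.T @ Y)
    print(n, estimates[n])
```

With larger samples the estimates should typically sit closer to `[-1, 2]`, although any single draw can deviate.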

#### Question 3
In this question, we will evaluate the performance of our OLS estimator. 
1. Define a function called `simple_eval` that takes `X`, `Y`, and `beta_hat` as inputs and returns `est_mse`, the mean-squared error of the predictions made with `beta_hat`. 
2. Within that function
    - Define `Yhat` to be the predicted values of `beta_hat` applied to `X`. 
    - Calculate the mean-squared error and set it to `est_mse`. To do this, you can either use the metrics module from scikit-learn or calculate the mean-squared error by hand. 
3. In the cell below your function, call your `simple_eval` function using `X500`, `Y500`, and `beta_hat` from Question 2. Save that value to a variable called `ols_mse`. 
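
The two options in step 2 should give the same number. The sketch below checks this on made-up vectors; `Y_demo` and `Yhat_demo` are hypothetical names and values used only for this illustration.

```python
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
Y_demo = rng.normal(size=20)                          # stand-in for observed outcomes
Yhat_demo = Y_demo + rng.normal(scale=0.1, size=20)   # stand-in for predictions

# By hand: average of the squared residuals
mse_by_hand = np.mean((Y_demo - Yhat_demo) ** 2)

# Using scikit-learn's metrics module
mse_sklearn = metrics.mean_squared_error(Y_demo, Yhat_demo)

print(mse_by_hand, mse_sklearn)
```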

In [6]:
# Your function here
def simple_eval(X, Y, beta_hat):
    Yhat = X @ beta_hat
    est_mse = sum(np.power(Y - Yhat,2))/len(Y)
    # Alternative (requires `from sklearn import metrics`):
    # est_mse = metrics.mean_squared_error(Y, Yhat)
    return est_mse

In [7]:
# Define ols_mse here
ols_mse = simple_eval(X500, Y500, beta_hat)
ols_mse

0.23351289190033098

#### Question 4
Now, fit a single regression tree on `Y500` and `X500`. Set the `random_state` and `max_depth` arguments equal to 123 and 10 respectively. Calculate the resulting training mean-squared error of this tree and set it equal to `tree_mse`.

Is the MSE from this regression tree lower or higher than the mean-squared error from linear regression (OLS)? Why would that be? Enter your answer in the markdown cell.

In [10]:
# Fit Tree Here
from sklearn import tree, metrics
reg_tree = tree.DecisionTreeRegressor(random_state=123, max_depth = 10).fit(y = Y500, X = X500)
tree_mse = metrics.mean_squared_error(Y500,reg_tree.predict(X500))
print(tree_mse)

0.018038859089902624


### Response
The training MSE for the regression tree is lower. With a `max_depth` of 10, the tree is heavily over-parameterized: it is flexible enough to fit the noise in the training data (it overfits), and we are evaluating it on that same data.
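
One rough way to see the over-parameterization is to count the leaves of a fitted tree, since each leaf carries its own fitted constant, and compare that with the two OLS coefficients. The sketch below regenerates data from the same model with an arbitrary seed, so its numbers will not match the cells above; `reg_tree_demo` is a hypothetical name.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)  # arbitrary seed; data differ from X500, Y500
n = 500
X = np.column_stack([rng.uniform(-2, 2, n), rng.normal(2, 2, n)])
Y = X @ np.array([-1, 2]) + rng.normal(0, 1 / 2, n)

reg_tree_demo = tree.DecisionTreeRegressor(random_state=123, max_depth=10).fit(X, Y)

# Each leaf predicts one constant, so the leaf count is a rough parameter count
n_leaves = reg_tree_demo.get_n_leaves()
print(n_leaves, "leaves vs", X.shape[1], "OLS coefficients")
```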

#### Question 5
Now we will generate test data by using `gen_lin_data` with `n` equal to 200. Save the resulting arrays as `X200` and `Y200`. Using `simple_eval` and this testing data, calculate the testing mean-squared error of the `beta_hat` estimate from Question 2. Do the same for the tree estimator from Question 4. In the Markdown cell below, answer the following question: Which estimator has a lower test mean-squared error? Why?

In [11]:
# Your code here
X200, Y200 = gen_lin_data(200)
mse_test_ols = simple_eval(X200, Y200, beta_hat)
print(mse_test_ols)
tree_mse_test = metrics.mean_squared_error(Y200,reg_tree.predict(X200))
print(tree_mse_test)

0.25686370013802284
0.5989805366324684


### Response
OLS has a lower (better) test MSE. The tree overfit the training data, so its error rises sharply when we predict out of sample, whereas OLS matches the true linear model and its test MSE stays close to its training MSE.
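
The gap between training and test error can be explored by varying `max_depth`. The sketch below is an illustration under assumptions: it redraws training and test sets from the same model with an arbitrary seed, and the helper `draw`, the depth grid, and the `results` dictionary are hypothetical choices, so the numbers will differ from those printed above.

```python
import numpy as np
from sklearn import tree, metrics

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

def draw(n):
    """Draw (X, Y) from the same linear model used in gen_lin_data."""
    X = np.column_stack([rng.uniform(-2, 2, n), rng.normal(2, 2, n)])
    Y = X @ np.array([-1, 2]) + rng.normal(0, 1 / 2, n)
    return X, Y

X_train, Y_train = draw(500)
X_test, Y_test = draw(200)

results = {}
for depth in [2, 4, 6, 10]:
    model = tree.DecisionTreeRegressor(random_state=123, max_depth=depth)
    model.fit(X_train, Y_train)
    train_mse = metrics.mean_squared_error(Y_train, model.predict(X_train))
    test_mse = metrics.mean_squared_error(Y_test, model.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(depth, train_mse, test_mse)
```

Training MSE should fall as the tree gets deeper, while test MSE need not follow; a widening gap between the two is the usual sign of overfitting.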