# Linear regression

## LIN.QA.2
Consider the dataframe `D` in the `data2.pkl` pickle file.

It contains an i.i.d. dataset for a regression task where the target is ${\mathbf y} \in \mathbb R$ and the input is $x \in \mathbb R^2$.

 

The student should consider the following list of models:

1. $h(x)=\beta_0$
2. $ h(x)=\beta_0+\beta_1 x_1+\beta_2 x_2$
3. $ h(x)=\beta_0+\beta_1 x_1 $
4. $ h(x)=\beta_0+\beta_1 x_1^2+\beta_2 x_2^2 $
5. $h(x)=\beta_0+\beta_1 x_1+\beta_2 x_2 +\beta_3 x_1^2+\beta_4 x_2^2+\beta_5 x_1 x_2 +\beta_6 x_1^3 +\beta_7 x_2^3$
6. $ h(x)=\beta_0+\beta_1 x_1^2 $
7. $ h(x)=\beta_0+\beta_1 x_2^2 $
8. $ h(x)=\beta_0+\beta_1 x_1^3 $
9. $ h(x)=\beta_0+\beta_1 x_1^2+\beta_2 x_2^2 +\beta_3 x_1 x_2 $

Using Python, the student should answer the following questions. Functions that implement linear models (such as those in sklearn) are NOT allowed.

1. Which model has the lowest empirical risk?
2. Which model has the lowest FPE (Final Prediction Error)?
3. Which model has the lowest generalisation error (estimated by leave-one-out)?
4. What is the least-squares estimate of $\beta_0$ in model 1?
5. What is the least-squares estimate of $\beta_0$ in model 9?





Use the instructions
```python
import pickle
with open("data2.pkl", 'rb') as f:
    data = pickle.load(f)
D=data["D"]
```
to load the data frame in Python.

In [1]:
import pickle
with open("data2.pkl", 'rb') as f:
    data = pickle.load(f)
D=data["D"]

## Dataset extraction

In [3]:
import numpy as np

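# Convert to a NumPy array so that positional slicing also works if D is a pandas DataFrame
D = np.asarray(D)

# Split into inputs X (first two columns) and target Y (third column)
X = D[:, :2]
Y = D[:, 2]
N = D.shape[0]  # number of observations
n = X.shape[1]  # number of input variables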




## Creation of the design matrices for linear regression

In [5]:

# Create a vector of ones
ones_n = np.ones(N)

# Create the design matrices M1 to M9, one per model (the first column is the intercept)
M1 = ones_n.reshape(N, 1)
M2 = np.column_stack((ones_n, X[:, 0], X[:, 1]))
M3 = np.column_stack((ones_n, X[:, 0]))
M4 = np.column_stack((ones_n, X[:, 0]**2, X[:, 1]**2))
M5 = np.column_stack((
    ones_n, 
    X[:, 0], 
    X[:, 1], 
    X[:, 0]**2, 
    X[:, 1]**2, 
    X[:, 0]*X[:, 1], 
    X[:, 0]**3, 
    X[:, 1]**3
))
M6 = np.column_stack((ones_n, X[:, 0]**2))
M7 = np.column_stack((ones_n, X[:, 1]**2))
M8 = np.column_stack((ones_n, X[:, 0]**3))
M9 = np.column_stack((ones_n, X[:, 0]**2, X[:, 1]**2, X[:, 0]*X[:, 1]))

# List of M matrices
M_list = [M1, M2, M3, M4, M5, M6, M7, M8, M9]
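
The nine matrices above all follow the same pattern: a column of ones followed by monomials $x_1^a x_2^b$. As a minimal sketch, the pattern can be captured by a small helper. The function `design_matrix` and the `*_demo` names are hypothetical (they are not part of the original notebook), and the inputs are synthetic.

```python
import numpy as np

def design_matrix(X, exponents):
    """Intercept column plus one monomial column per exponent pair.

    `exponents` is a list of (a, b) tuples; each one adds the column
    x1**a * x2**b. Hypothetical helper, for illustration only.
    """
    X = np.asarray(X, dtype=float)
    cols = [np.ones(X.shape[0])]
    for a, b in exponents:
        cols.append(X[:, 0] ** a * X[:, 1] ** b)
    return np.column_stack(cols)

# Synthetic inputs, used only to illustrate the resulting shapes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))

# Model 9: beta0 + beta1*x1^2 + beta2*x2^2 + beta3*x1*x2
M9_demo = design_matrix(X_demo, [(2, 0), (0, 2), (1, 1)])

# Model 1: intercept only
M1_demo = design_matrix(X_demo, [])
```

With this helper, each model is described by its list of exponent pairs, which makes it harder to mistype a column than when every matrix is written out by hand.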




## Linear regression function
For a given design matrix, it computes
* the least-squares parameter estimate
* empirical risk
* leave-one-out generalisation error
* FPE criterion
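
For reference, the four quantities can be written as follows. The notation is an assumption chosen to mirror the code in this section: $M$ is the $N \times p$ design matrix with rows $m_i^\top$, $Y$ is the vector of targets $y_i$, and $\hat\beta^{(-i)}$ is the least-squares estimate obtained after removing the $i$-th observation.

$$
\begin{aligned}
\hat\beta &= (M^\top M)^{-1} M^\top Y, \\
\hat R_{\text{emp}} &= \frac{1}{N} \sum_{i=1}^{N} \left(y_i - m_i^\top \hat\beta\right)^2, \\
\text{FPE} &= \frac{1 + p/N}{1 - p/N}\, \hat R_{\text{emp}}, \\
\hat R_{\text{loo}} &= \frac{1}{N} \sum_{i=1}^{N} \left(y_i - m_i^\top \hat\beta^{(-i)}\right)^2 .
\end{aligned}
$$

The FPE factor is larger than 1 and grows with $p$, so FPE penalises models with more parameters, whereas $\hat R_{\text{emp}}$ alone cannot increase when columns are added to $M$.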
  

In [4]:
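def fpe(M, Y):
    """
    Fit the linear model with design matrix M to the targets Y by least squares.

    Returns a dict with the FPE criterion ("Fpe"), the empirical risk ("Remp"),
    the intercept estimate ("beta0") and the leave-one-out mean squared error ("LOO").
    """
    # Number of observations N and number of parameters p, taken from M itself
    # so that the function does not depend on global variables
    N, p = M.shape

    # Least-squares estimate from the normal equations (M^T M) beta = M^T Y
    beta = np.linalg.inv(M.T @ M) @ M.T @ Y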
    
    # Predicted Y
    Y_hat = M @ beta
    
    # Residuals
    e = Y - Y_hat
    
    # Initialize LOO residuals
    loo = np.zeros(N)
    
    for i in range(N):
        # Remove the i-th observation
        Mi = np.delete(M, i, axis=0)
        Yi = np.delete(Y, i)
        
        # Calculate beta for the reduced dataset
        betai = np.linalg.inv(Mi.T @ Mi) @ Mi.T @ Yi
        
        # Predict the i-th observation
        Y_hat_i = M[i, :] @ betai
        
        # Calculate LOO residual
        loo[i] = Y[i] - Y_hat_i
    
    # Calculate FPE and Remp
    Fpe = (1 + p / N) / (1 - p / N) * np.mean(e**2)
    Remp = np.mean(e**2)
    beta0 = beta[0]
    LOO = np.mean(loo**2)
    
    return {
        "Fpe": Fpe,
        "Remp": Remp,
        "beta0": beta0,
        "LOO": LOO
    }
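
The explicit loop above refits the model $N$ times. For linear least squares the leave-one-out residuals can also be obtained from a single fit, since the $i$-th leave-one-out residual equals $e_i / (1 - H_{ii})$, where $H = M (M^\top M)^{-1} M^\top$ is the hat matrix. The sketch below compares the two computations on synthetic data. The function names and the `*_demo` variables are hypothetical, and `np.linalg.solve` is used here in place of the explicit inverse.

```python
import numpy as np

def loo_mse_explicit(M, Y):
    """Leave-one-out MSE obtained by refitting the model N times."""
    N = M.shape[0]
    errs = np.zeros(N)
    for i in range(N):
        Mi = np.delete(M, i, axis=0)
        Yi = np.delete(Y, i)
        beta_i = np.linalg.solve(Mi.T @ Mi, Mi.T @ Yi)
        errs[i] = Y[i] - M[i] @ beta_i
    return np.mean(errs**2)

def loo_mse_hat(M, Y):
    """Leave-one-out MSE from a single fit, using e_i / (1 - H_ii)."""
    # Hat matrix H = M (M^T M)^{-1} M^T
    H = M @ np.linalg.solve(M.T @ M, M.T)
    e = Y - H @ Y
    return np.mean((e / (1.0 - np.diag(H)))**2)

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
N_demo = 40
X_demo = rng.normal(size=(N_demo, 2))
M_demo = np.column_stack((np.ones(N_demo), X_demo[:, 0], X_demo[:, 1]))
Y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=N_demo)

loo_a = loo_mse_explicit(M_demo, Y_demo)
loo_b = loo_mse_hat(M_demo, Y_demo)
```

Assuming $M^\top M$ is invertible and no $H_{ii}$ equals 1, `loo_a` and `loo_b` agree up to rounding, and the second version needs one fit instead of $N$.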


In [None]:

# Initialize lists to store results
fp = []
remp = []
beta0 = []
loo = []

# Evaluate every candidate model (FPE, empirical risk, intercept, leave-one-out error)
for M in M_list:
    results = fpe(M, Y)
    fp.append(results["Fpe"])
    remp.append(results["Remp"])
    beta0.append(results["beta0"])
    loo.append(results["LOO"])

# Find the model with minimum Remp, FPE and LOO (adding 1 so that the indices match the 1-based model numbering)
min_remp_index = np.argmin(remp) + 1
min_fp_index = np.argmin(fp) + 1
min_loo_index = np.argmin(loo) + 1



## Print results

In [7]:
# Print the results
print(f"\n which.min(Remp)={min_remp_index}",
      f"\n which.min(FP)={min_fp_index}",
      f"\n which.min(LOO)={min_loo_index}",
      f"\n \n beta0_1={beta0[0]}",
      f"\n beta0_9={beta0[8]}\n")

print(f"Remp={remp}\n")
print(f"FP={fp}\n")
print(f"LOO={loo}\n -- \n")


 which.min(Remp)=5 
 which.min(FP)=9 
 which.min(LOO)=9 
 
 beta0_1=3.756133372362787 
 beta0_9=0.8389855702311156

Remp=[20.383312889362127, 15.667690765989743, 18.341727559983266, 9.725145612395208, 0.16970407382974798, 16.312997989394113, 12.268665327312657, 19.47084348173513, 0.1866383730239761]

FP=[21.21528484402997, 17.667821502073544, 19.87020485664854, 10.966653562913747, 0.2343532448125091, 17.67241448851029, 13.291054104588714, 21.093413771879728, 0.21909722050640673]

LOO=[21.22377435377148, 19.892220493625448, 20.870249756689518, 19.81884361463974, 0.2510710658316887, 24.40675862529736, 16.12036554767194, 27.521960487495026, 0.22703969639219956]
 -- 
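
As a sanity check on the intercept of model 1: its design matrix is a single column of ones, so the normal equations reduce to the sample mean of the targets, and the empirical risk reduces to their (biased) sample variance. The sketch below verifies this on synthetic targets. The `*_demo` names are hypothetical and the data are not those of `data2.pkl`.

```python
import numpy as np

rng = np.random.default_rng(1)
Y_demo = rng.normal(loc=3.0, scale=2.0, size=50)  # synthetic targets
M1_demo = np.ones((Y_demo.size, 1))               # intercept-only design matrix

# Least-squares estimate from the normal equations
beta_demo = np.linalg.solve(M1_demo.T @ M1_demo, M1_demo.T @ Y_demo)

# Empirical risk of the constant model
remp_demo = np.mean((Y_demo - M1_demo @ beta_demo)**2)
```

Here `beta_demo[0]` coincides with `Y_demo.mean()` and `remp_demo` with `Y_demo.var()`, so on the real data `beta0_1` can be cross-checked against `np.mean(Y)`.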

