# Linear Regression - 2D (i.e. 1 feature vs. 1 target)

### For each linear regression exercise below, do the following:
- solve it using the closed-form formula (Exercise 1 only)
- solve it using the scikit-learn `LinearRegression` class (e.g. `from sklearn.linear_model import LinearRegression`)
- compute the R-squared and MSE using scikit-learn's `r2_score` and `mean_squared_error` (e.g. `from sklearn.metrics import mean_squared_error, r2_score`)
- plot the data points in blue and the regression line in red using matplotlib
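
The four steps above can be sketched as follows. This is a minimal illustration on synthetic stand-in data (an assumption, not one of the exercise datasets); all variable names are illustrative. For a single feature, the closed-form solution reduces to slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (assumption): y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# 1) Closed form for one feature.
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()

# 2) scikit-learn expects a 2D feature matrix of shape (n_samples, n_features).
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# 3) Metrics (computed in-sample here, for simplicity).
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print(f"closed form: slope={slope_cf:.4f}, intercept={intercept_cf:.4f}")
print(f"sklearn:     slope={model.coef_[0]:.4f}, intercept={model.intercept_:.4f}")
print(f"MSE={mse:.4f}, R^2={r2:.4f}")

# 4) Data points in blue, regression line in red.
fig, ax = plt.subplots()
ax.scatter(x, y, color="blue", s=10, label="data")
order = np.argsort(x)  # sort so the line is drawn left to right
ax.plot(x[order], y_pred[order], color="red", label="regression line")
ax.set_xlabel("feature")
ax.set_ylabel("target")
ax.legend()
# Call plt.show() or fig.savefig("fit.png") to view the figure.
```

The closed-form and scikit-learn coefficients should agree up to floating-point error, which makes a useful sanity check before moving to real data.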

### Exercise 1 - Real estate price prediction

- Dataset: https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction
- Regression 1: house age (feature X2) vs. house price of unit area (target y)
- Regression 2: distance to the nearest MRT station (feature X3) vs. house price of unit area (target y)
- Regression 3: number of convenience stores (feature X4) vs. house price of unit area (target y)
- Question: Which feature has the most predictive power (e.g. the highest R-squared)?

### Exercise 2 - Find Beta of NVIDIA versus S&P500

- use the `yfinance_helper.py` script to download NVIDIA (ticker: `NVDA`) and S&P500 (ticker: `^GSPC`) data from 2018/01/01 to 2023/12/31
- Regression: S&P500 daily returns (feature X) vs. NVIDIA daily returns (target y)
- Question: What is the beta of NVIDIA stock versus the S&P500?
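
As a rough sketch of the idea: beta is the slope of the regression of stock returns on index returns, which equals cov(stock, market) / var(market). The example below uses synthetic prices as a stand-in (an assumption); the interface of `yfinance_helper.py` is not shown in this document, so the download step is omitted and the column names `market` and `stock` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for downloaded close prices (assumption).
rng = np.random.default_rng(42)
n = 500
true_beta = 1.7
market_ret = rng.normal(0.0005, 0.01, size=n)
stock_ret = true_beta * market_ret + rng.normal(0, 0.015, size=n)
prices = pd.DataFrame({
    "market": 100 * np.cumprod(1 + market_ret),
    "stock": 50 * np.cumprod(1 + stock_ret),
})

# Daily simple returns; the first row is NaN and is dropped.
returns = prices.pct_change().dropna()

# Beta as the regression slope (feature: market returns, target: stock returns).
X = returns[["market"]].to_numpy()
y = returns["stock"].to_numpy()
beta_reg = LinearRegression().fit(X, y).coef_[0]

# Cross-check: beta as covariance over variance.
beta_cov = returns["stock"].cov(returns["market"]) / returns["market"].var()

print(f"beta (regression) = {beta_reg:.4f}")
print(f"beta (cov / var)  = {beta_cov:.4f}")
```

With real data, the two series should be aligned on the same trading dates before computing returns.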

### Exercise 3: Autoregression of NVIDIA stock

- use the NVIDIA daily returns from Exercise 2
- Regression 1: NVIDIA daily return on the previous day (t - 1) vs. NVIDIA daily return on day t, i.e. the pairs {(S_{t-1}, S_t) : 1 <= t < N}
- Problem: Find the daily lag L, 1 <= L <= 10, that has the most predictive power, i.e. the L for which the pairs {(S_{t-L}, S_t) : L <= t < N} give the maximum R-squared
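
One possible way to organize the lag search is sketched below. It assumes a synthetic return series with dependence deliberately planted at lag 3 (so the expected answer is known); the helper `lag_r2` is a hypothetical name, not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic returns (assumption): S_t = 0.4 * S_{t-3} + noise.
rng = np.random.default_rng(7)
n = 1000
noise = rng.normal(0, 0.02, size=n)
values = np.zeros(n)
for t in range(n):
    values[t] = noise[t] + (0.4 * values[t - 3] if t >= 3 else 0.0)
returns = pd.Series(values)


def lag_r2(series: pd.Series, lag: int) -> float:
    """In-sample R-squared of regressing S_t on S_{t-lag}."""
    frame = pd.DataFrame({"x": series.shift(lag), "y": series}).dropna()
    X = frame[["x"]].to_numpy()
    y = frame["y"].to_numpy()
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))


scores = {lag: lag_r2(returns, lag) for lag in range(1, 11)}
best_lag = max(scores, key=scores.get)

for lag, score in scores.items():
    print(f"L={lag:2d}  R^2={score:.4f}")
print("best lag:", best_lag)
```

`shift(lag)` pairs each value with the one `lag` steps earlier and leaves the first `lag` rows empty, which is why they are dropped. On real returns the R-squared values are typically very small, so picking the maximum over ten lags should be read with caution.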

### Exercise 4: Exponential model

- load the exp_data.csv data into a pandas dataframe
- use linear regression to fit the following model: f(x) = a * exp(b * x), where exp is the exponential function and a and b are real numbers
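
A common way to bring this model within reach of linear regression is to take logarithms: ln f(x) = ln(a) + b * x, which is linear in x with slope b and intercept ln(a). The sketch below assumes a > 0 and strictly positive targets, and uses synthetic data instead of `exp_data.csv`, whose column names are not given here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): y = a * exp(b * x) with multiplicative
# noise, which keeps y strictly positive so that log(y) is defined.
rng = np.random.default_rng(1)
a_true, b_true = 2.0, 0.5
x = np.linspace(0, 5, 100)
y = a_true * np.exp(b_true * x) * np.exp(rng.normal(0, 0.05, size=x.size))

# Fit log(y) = log(a) + b * x with ordinary least squares.
model = LinearRegression().fit(x.reshape(-1, 1), np.log(y))
b_hat = model.coef_[0]
a_hat = np.exp(model.intercept_)

print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

Note that this minimizes the squared error in log space, not in the original units, so the fit weights relative errors rather than absolute ones.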

### Exercise 5

- data: polynomial_data.csv
- model: try fitting a * X^2 + b * Y^2 + c * X * Y + d * X + e * Y + f
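
The model is nonlinear in X and Y but linear in the coefficients a to f, so it can be fitted by ordinary linear regression on expanded features. The sketch below assumes that X and Y are two input columns and that the target is a separate column (called `z` here, a hypothetical name); it uses synthetic data since the layout of `polynomial_data.csv` is not described.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption) generated from a known quadratic surface.
rng = np.random.default_rng(3)
inputs = rng.uniform(-2, 2, size=(300, 2))
x, y = inputs[:, 0], inputs[:, 1]
z = (1.5 * x**2 - 0.5 * y**2 + 2.0 * x * y
     + 0.3 * x - 1.0 * y + 4.0 + rng.normal(0, 0.1, size=300))

# Degree-2 expansion without the constant column; LinearRegression adds the
# intercept itself. Column order: X, Y, X^2, X*Y, Y^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
features = poly.fit_transform(inputs)
model = LinearRegression().fit(features, z)

d, e, a, c, b = model.coef_
f = model.intercept_
print(f"a={a:.3f} b={b:.3f} c={c:.3f} d={d:.3f} e={e:.3f} f={f:.3f}")
```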

### Exercise 6

- data: sine_wave_data.csv
- model:
    - try fitting a polynomial model of degree > 10 in the variable X (e.g. a_10 * X^10 + ... + a_1 * X + a_0)
    - Optional: try adding regularization (i.e. minimize ||y - y_hat(theta)||^2 + lambda * ||theta||^2, where ||.|| is the l2 norm)
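
A minimal sketch of both variants, assuming a synthetic noisy sine wave in place of `sine_wave_data.csv`. scikit-learn's `Ridge` penalizes the squared l2 norm of the coefficients, with `alpha` playing the role of lambda; the degree and `alpha` values below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): one period of a noisy sine wave.
rng = np.random.default_rng(5)
x_train = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + rng.normal(0, 0.2, size=40)
x_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y_test = np.sin(x_test).ravel()  # noise-free reference curve

degree = 12

# High powers of X differ by many orders of magnitude, so the expanded
# features are standardized before fitting.
plain = make_pipeline(
    PolynomialFeatures(degree, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
ridge = make_pipeline(
    PolynomialFeatures(degree, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1e-3),
)

plain.fit(x_train, y_train)
ridge.fit(x_train, y_train)

mse_plain = mean_squared_error(y_test, plain.predict(x_test))
mse_ridge = mean_squared_error(y_test, ridge.predict(x_test))
print(f"test MSE without regularization: {mse_plain:.4f}")
print(f"test MSE with ridge penalty:     {mse_ridge:.4f}")
```

Plotting both fitted curves against the data is a good way to see whether the unregularized polynomial oscillates between points.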

### Exercise 7

- data: Choose a **multivariate regression** dataset from https://archive.ics.uci.edu/
- model 1:
    - fit a LinearRegression model using scikit-learn
    - measure the model by computing MSE, Bias^2, and Variance
- model 2:
    - fit a Ridge or Lasso model using scikit-learn
    - measure the model by computing MSE, Bias^2, and Variance
- model 3:
    - use a scikit-learn `Pipeline` to normalize the data
    - use scikit-learn `GridSearchCV` with a Ridge or Lasso model and 2-fold cross-validation (`cv=2`)
    - measure the model by computing MSE, Bias^2, and Variance
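
One way to estimate these three quantities is sketched below, under stated assumptions: the training set is resampled with replacement (bootstrap), the model is refitted each time, and the predictions on a fixed test set are decomposed. Because the true function is unknown on real data, the "bias squared" term here is measured against the noisy test targets and therefore includes irreducible noise. The data comes from `make_regression` as a stand-in for a UCI dataset, and `bias_variance` and `tuned_ridge` are hypothetical helper names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a multivariate regression dataset (assumption).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def bias_variance(make_model, n_rounds=50, seed=0):
    """Bootstrap estimate of (MSE, bias^2, variance) on the fixed test set."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_rounds, len(y_test)))
    for i in range(n_rounds):
        idx = rng.integers(0, len(y_train), size=len(y_train))
        preds[i] = make_model().fit(X_train[idx], y_train[idx]).predict(X_test)
    mean_pred = preds.mean(axis=0)
    mse = np.mean((preds - y_test) ** 2)
    bias_sq = np.mean((mean_pred - y_test) ** 2)  # includes noise in y_test
    variance = np.mean(preds.var(axis=0))
    return mse, bias_sq, variance


def tuned_ridge():
    """Model 3: scaling pipeline + grid search over alpha with 2-fold CV."""
    pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
    grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}
    return GridSearchCV(pipe, grid, cv=2, scoring="neg_mean_squared_error")


results = {
    "model 1 (linear)": bias_variance(LinearRegression),
    "model 2 (ridge)": bias_variance(lambda: Ridge(alpha=1.0)),
    "model 3 (grid search)": bias_variance(tuned_ridge, n_rounds=20),
}
for name, (mse, bias_sq, variance) in results.items():
    print(f"{name}: MSE={mse:.2f}  Bias^2={bias_sq:.2f}  Variance={variance:.2f}")
```

With this definition, MSE equals Bias^2 plus Variance exactly, which serves as a check on the computation.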

### Exercise 8: Autoregression of NVIDIA stock (follow-up)

- use the NVIDIA daily returns from Exercise 2 and try finding a better predictive model
- can you use your predictions to backtest a 1-day buy-and-hold strategy?
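
A very simplified backtest could look like the sketch below. It assumes synthetic returns, a linear model on the previous 5 daily returns, and a rule of holding the stock for one day whenever the predicted return is positive; none of these choices come from the exercise itself, and transaction costs are ignored.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for NVIDIA daily returns (assumption).
rng = np.random.default_rng(11)
returns = pd.Series(rng.normal(0.001, 0.02, size=750))

# Features: the previous `lags` daily returns. shift(k) only uses information
# available before day t, which avoids look-ahead bias.
lags = 5
frame = pd.concat(
    {f"lag_{k}": returns.shift(k) for k in range(1, lags + 1)}, axis=1
)
frame["target"] = returns
frame = frame.dropna()

# Chronological split: fit on the first 70%, backtest on the rest.
split = int(len(frame) * 0.7)
train, test = frame.iloc[:split], frame.iloc[split:]
feature_cols = [c for c in frame.columns if c != "target"]

model = LinearRegression().fit(train[feature_cols], train["target"])
predicted = model.predict(test[feature_cols])

# Hold for one day when the predicted return is positive, otherwise stay flat.
position = (predicted > 0).astype(float)
realized = test["target"].to_numpy()
strategy_ret = position * realized

strategy_total = np.prod(1 + strategy_ret) - 1
buy_hold_total = np.prod(1 + realized) - 1
print(f"strategy return:     {strategy_total:.2%}")
print(f"buy-and-hold return: {buy_hold_total:.2%}")
```

Comparing against always holding the stock over the same test window gives a baseline for judging whether the predictions add anything.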