**In this lab on linear regression, we'll work with the [statsmodels](https://www.statsmodels.org/stable/index.html) library, which provides numerous classes and functions for estimating statistical models.**

**The dataset we'll consider is 'diamond.csv' [[1]](https://www.kaggle.com/datasets/shivam2503/diamonds), which contains information about a variety of diamonds, such as their dimensions, the quality of their cut and their price. The goal of the lab is to define linear regression models that best estimate diamond prices from a set of predictor variables, and to understand the meaning of the fitted coefficients.**

**Dataset's column information :**

*   'price' : price in US dollars.
*   'carat' : weight of the diamond. 
*   'cut' : quality of the cut (Fair, Good, Very Good, Premium, Ideal)
*   'color' : diamond's color, from J (worst) to D (best).
*   'clarity' : how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
*   'x' : length in mm.
*   'y' : width in mm.
*   'z' : height in mm.
*   'table' : width of the top of the diamond relative to its widest point.
*   'depth' : 2z/(x+y).



**Import necessary libraries**

In [None]:
import statsmodels.api as sm
import numpy as np
import pandas as pd 
from patsy import dmatrices
import matplotlib.pyplot as plt 

**1) Load the dataset and take a look at its properties (shape, data types, etc.). Be careful to set the dataframe indices correctly. Check for missing values, and replace them appropriately if any are present.**
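
A minimal sketch of this step, assuming the CSV stores the row index in its first, unnamed column. The inline `csv_text` is made-up data standing in for 'diamond.csv', and median imputation is only one possible way to replace missing values.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file; the first (unnamed) column is the row index.
csv_text = """,carat,cut,price,x
1,0.23,Ideal,326,3.95
2,0.21,Premium,326,
3,0.23,Good,327,4.05
"""

# index_col=0 keeps the stored index from becoming an extra 'Unnamed: 0' column.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # number of missing values per column

# One possible treatment (an assumption, not the only choice):
# fill gaps in numeric columns with the column median.
num_cols = df.select_dtypes(include="number").columns
df_filled = df.copy()
df_filled[num_cols] = df[num_cols].fillna(df[num_cols].median())
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path to the CSV.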

**2) Generate scatter plots of the variable 'price' against the variables 'x', 'y' and 'z'. Do you notice anything strange ? How would you handle such cases ?** 
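
One hedged way to handle physically impossible measurements, assuming the strange points turn out to be dimensions recorded as 0 mm. The toy values below are invented for illustration; whether to drop or impute such rows is a judgment call.

```python
import numpy as np
import pandas as pd

# Invented toy rows; a dimension of 0 mm is physically impossible.
df = pd.DataFrame({
    "price": [326, 5139, 327, 2000],
    "x": [3.95, 0.0, 4.05, 5.10],
    "y": [3.98, 6.62, 4.07, 0.0],
    "z": [2.43, 0.0, 2.31, 3.15],
})

dims = ["x", "y", "z"]

# Option A: drop every row that has a non-positive dimension.
df_dropped = df[(df[dims] > 0).all(axis=1)]

# Option B: mark the impossible values as missing so they can be imputed later.
df_marked = df.copy()
df_marked[dims] = df_marked[dims].where(df_marked[dims] > 0, np.nan)
```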

**3) Select 'price' as the target variable and 'x' as the predictor. Fit a linear regression model to the data, and output the model's summary.**

* **3.1) Is there evidence of a linear relationship between the target and the predictor variables ? What can you say regarding the statistical significance of the estimated coefficients ?**

* **3.2) How do you interpret the value of the coefficients ?**

* **3.3) What are the estimates' 95% confidence intervals, and how do you interpret them ?**

**4) Add 'y' as another predictor variable, fit the model and output its summary.**

* **4.1) Is there still evidence of a linear relationship between the target and predictor variables ?**

* **4.2) How do you interpret the coefficients ?**



**5) Add an interaction term between 'x' and 'y', refit the model and output its summary.**

* **5.1) Does the model seem to be a better fit compared to the one with only 'x' and 'y' ?** 

* **5.2) How do you interpret the coefficients ?**

**6) Generate dummy variables out of the variables 'cut', 'color' and 'clarity'. Make sure that for each of those variables, one level was selected as the reference level (and consequently, that this level is not represented by a dummy variable).**

**Why do we need to have k-1 dummy variables, when k is the number of levels ?**
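
The sketch below illustrates the k-1 rule on an invented three-level column: with an intercept and all k dummies, the dummy columns sum to the intercept column, so the design matrix loses full column rank.

```python
import numpy as np
import pandas as pd

# Toy column with k = 3 levels (values invented for illustration).
df = pd.DataFrame({"cut": ["Fair", "Good", "Ideal", "Good", "Fair", "Ideal"]})

# drop_first=True drops the first level in sorted order ('Fair'), making it the reference.
dummies = pd.get_dummies(df["cut"], prefix="cut", drop_first=True, dtype=float)
print(dummies.columns.tolist())  # ['cut_Good', 'cut_Ideal']

# Keeping all k dummies next to an intercept makes the columns linearly dependent,
# so the least-squares coefficients are not uniquely defined.
all_dummies = pd.get_dummies(df["cut"], prefix="cut", dtype=float)
X_full = np.column_stack([np.ones(len(df)), all_dummies.to_numpy()])
X_ref = np.column_stack([np.ones(len(df)), dummies.to_numpy()])

print(np.linalg.matrix_rank(X_full), X_full.shape[1])  # rank is below the column count
print(np.linalg.matrix_rank(X_ref), X_ref.shape[1])    # full column rank
```

Each remaining dummy coefficient is then read as the difference from the reference level.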

**7) Refit the model using the dummy variables obtained from the variable 'color', and output its summary.**

* **7.1) Does the model seem to be a good fit ?**

* **7.2) Are all coefficients significant ? If not, what does it mean ?**

* **7.3) How do you interpret the coefficients ?**


**8) Refit the model, this time using all predictor variables (with the exception of 'price', of course), and output its summary.**

**What do you observe ? Does the model seem to be a better fit compared to the previous ones ? Are all coefficients still significant ?**

**9) We will now select candidate features to fit our model using a forward selection strategy. To this end, we will define different entering criteria for our candidate features :**
* Does the introduction of the feature decrease the MSE ? 
* Does the introduction of the feature decrease the AIC ? 
* Does the introduction of the feature decrease the BIC ? 
 
**For the last two criteria, define two new functions, neg_AIC(y_true, y_pred, n, k) and neg_BIC(y_true, y_pred, n, k), which respectively compute the negative AIC and BIC given the ground-truth y values (y_true), the predicted y values (y_pred), the number of samples (n) and the number of predictors (k). The AIC and BIC can be computed as follows :**

* AIC = 2*k + n*log(mse) 
* BIC = n*log(mse) + k*log(n)
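
One possible sketch of the two functions, following the formulas above term by term. The values are negated on the assumption that they will be used as scores where larger is better.

```python
import numpy as np

def neg_AIC(y_true, y_pred, n, k):
    """Negative AIC, with AIC = 2*k + n*log(mse)."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return -(2 * k + n * np.log(mse))

def neg_BIC(y_true, y_pred, n, k):
    """Negative BIC, with BIC = n*log(mse) + k*log(n)."""
    mse = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return -(n * np.log(mse) + k * np.log(n))

# Toy check: every prediction is off by 1, so mse = 1 and log(mse) = 0.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
print(neg_AIC(y_true, y_pred, n=4, k=2))  # only the penalty remains: -(2*2)
print(neg_BIC(y_true, y_pred, n=4, k=2))  # only the penalty remains: -(2*log(4))
```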

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error, r2_score


def forward_selection(df, model, target_column, columns, scoring_rule):
  # Greedy forward selection: at each round, add the candidate feature that most
  # improves the mean cross-validated score; stop when no candidate improves it.
  columns = list(columns)  # work on a copy so the caller's list is not modified
  features_to_keep = []
  best_score = -np.inf
  improved = True
  y = df[target_column].values
  while columns and improved:
    improved = False
    best_feat = None
    for col in columns:
      features_to_try = features_to_keep + [col]
      X = get_predictors(df, features_to_try)
      n, k = X.shape
      if scoring_rule == 'aic':
        scorer = make_scorer(neg_AIC, n=n, k=k, greater_is_better=True)
      elif scoring_rule == 'bic':
        scorer = make_scorer(neg_BIC, n=n, k=k, greater_is_better=True)
      else:
        # Any other value is passed to scikit-learn unchanged (scorer name or callable).
        scorer = scoring_rule
      cv_results = cross_validate(model, X, y, scoring=scorer, cv=10)
      score = cv_results['test_score'].mean()
      if score > best_score:
        best_feat = col
        improved = True
        best_score = score
    if best_feat is not None:
      columns.remove(best_feat)
      features_to_keep.append(best_feat)
  return features_to_keep, best_score

def get_predictors(df, cols):
  # Split the requested columns into categorical (string) and continuous ones.
  cat_pred = []
  cont_pred = []
  for col in cols:
    if isinstance(df[col].values[0], str):
      cat_pred.append(col)
    else:
      cont_pred.append(col)
  # Build the dummy variables once, after the loop (k-1 dummies per variable).
  if len(cat_pred) != 0:
    df_dummies = pd.get_dummies(df[cat_pred], drop_first=True, dtype=float)
  else:
    df_dummies = pd.DataFrame(index=df.index)
  df_cont = df[cont_pred]
  df_cat = pd.concat([df_dummies, df_cont], axis=1)

  return df_cat.values





**10) Use the function forward_selection() together with neg_AIC() and neg_BIC() to perform a forward selection on the dataframe features (with the exception of 'carat') and find which subset of features best fits the target variable 'price'. Also run a forward selection with the MSE as the entering criterion.**

**For each selection, report the best subset of features obtained, as well as the score obtained. What do you observe ?**  

10) The set of predictors obtained with the BIC or the AIC as the entering criterion is smaller than the one obtained with the MSE. This is expected: the AIC and the BIC penalize the inclusion of a new predictor in the model, whereas the MSE does not. As a general rule, the BIC tends to yield a smaller set than the AIC, because its penalty of log(n) per predictor exceeds the AIC's penalty of 2 as soon as n is 8 or more, and the AIC in turn tends to yield a smaller set than the MSE.

**11) Looking at the scatter plot of the variable 'price' against the variable 'x', a linear model might not be the best fit to explain the relation between the two variables.  Using a transformation of the variable 'x', try to obtain a better fit. Plot the linear regression line and the one obtained using the transformation of 'x'.**
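
A hedged sketch of how two candidate fits could be compared. The data are synthetic and the cubic transformation is an assumption chosen for illustration; other transformations may suit the real data better.

```python
import numpy as np

# Synthetic data (an assumption): price grows roughly with the cube of the length x.
rng = np.random.default_rng(2)
x = rng.uniform(4.0, 9.0, size=400)
price = 20.0 * x**3 + rng.normal(0.0, 300.0, size=400)

def r_squared(feature, target):
    """R-squared of a least-squares line fitted to target against feature."""
    slope, intercept = np.polyfit(feature, target, deg=1)
    residuals = target - (slope * feature + intercept)
    return 1.0 - residuals.var() / target.var()

r2_linear = r_squared(x, price)    # price ~ x
r2_cubic = r_squared(x**3, price)  # price ~ x^3 (transformed predictor)
print(r2_linear, r2_cubic)
```

The model stays linear in its coefficients; only the predictor is transformed, so the same fitting routine applies. Both fitted curves can then be drawn over the scatter plot by evaluating each fit on a sorted grid of 'x' values.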