# Answer

# Part 1 - Baseline Model

### 1.0) Import and inspect data

In [3]:
import pandas as pd

url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Advertising.csv"

df = pd.read_csv(url)

print(df.head())
print(df.info())
print(df.describe())

   Unnamed: 0     TV  Radio  Newspaper  Sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
None
       Unnamed: 0          TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000  200.000000
mean   100.500000  147.042500   23.264000   30.554000   14.022500
std     57.879185   85.854236   14.846809   21.778621    5.217457
min      1.000

### 1.1) Train test split

In [5]:
from sklearn.model_selection import train_test_split

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### 1.2) Fit OLS

In [13]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

### 1.3) Model Evaluation

In [14]:
print("=== PARAMETERS ===")
print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2, beta_3):")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef}")

=== PARAMETERS ===
Intercept (beta_0): 2.9372157346906143
Coefficients (beta_1, beta_2, beta_3):
  TV: 0.04695204776848461
  Radio: 0.17658643526817375
  Newspaper: 0.0018511533188922285
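
To make the fitted coefficients concrete, the sketch below plugs the first row shown by `df.head()` into the fitted equation by hand. The coefficient values are copied from the output above and rounded to six decimals; the names `intercept`, `coefs` and `budget` are illustrative and not part of the notebook.

```python
# Coefficients copied from the printed output above (rounded to 6 decimals).
intercept = 2.937216
coefs = {"TV": 0.046952, "Radio": 0.176586, "Newspaper": 0.001851}

# First row of the dataset, as shown by df.head().
budget = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}

# Linear model: Sales = beta_0 + sum over j of beta_j * x_j
predicted_sales = intercept + sum(coefs[name] * budget[name] for name in coefs)
print(f"Predicted sales: {predicted_sales:.2f} (observed: 22.1)")
```

Read this way, each additional unit of Radio spend is associated with roughly 0.18 additional units of Sales, holding the other budgets fixed, while the Newspaper coefficient is close to zero.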


In [15]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

linear_r2_test = r2_score(y_test, y_test_pred)
linear_rmse_test = sqrt(mean_squared_error(y_test, y_test_pred))
linear_mae_test = mean_absolute_error(y_test, y_test_pred)

linear_r2_train = r2_score(y_train, y_train_pred)
linear_rmse_train = sqrt(mean_squared_error(y_train, y_train_pred))
linear_mae_train = mean_absolute_error(y_train, y_train_pred)

r2_df = pd.DataFrame({
    'Model': ['Linear'],
    'R-squared train': [linear_r2_train],
    'R-squared test': [linear_r2_test],
    'RMSE train': [linear_rmse_train],
    'RMSE test': [linear_rmse_test],
    'MAE train': [linear_mae_train],
    'MAE test': [linear_mae_test]
})
r2_df

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
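
For reference, the three metrics reported above can be computed directly from their definitions. The following is a minimal NumPy sketch; the helper name `regression_metrics` and the toy values are assumptions for illustration and are not taken from the Advertising data.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (R-squared, RMSE, MAE) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    return r2, rmse, mae


# Toy values for illustration only.
r2, rmse, mae = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5])
print(f"R-squared={r2:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

RMSE and MAE are expressed in the units of Sales, while R-squared is unitless; RMSE penalizes large errors more heavily than MAE because the residuals are squared before averaging.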


# Part 2 - Overly Complex Model

### 2.0) Set polynomial features, transform and scale data

In [18]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

poly = PolynomialFeatures(degree=5)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

print("# of original features:", X.shape[1])
print("# of polynomial features:", X_train_poly.shape[1])

scaler = StandardScaler()

X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)

# of original features: 3
# of polynomial features: 56
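
The count of 56 can be checked by hand: with 3 inputs and all monomials up to degree 5 (including the constant term), the number of terms is the binomial coefficient C(3 + 5, 5). A short sketch, with illustrative variable names:

```python
from math import comb

n_features, degree = 3, 5

# Number of monomials of total degree <= 5 in 3 variables (constant included).
n_poly_terms = comb(n_features + degree, degree)

# Same count broken down by exact degree d = 0..5.
per_degree = [comb(n_features + d - 1, d) for d in range(degree + 1)]

print(n_poly_terms)  # 56
print(per_degree)    # [1, 3, 6, 10, 15, 21]
```

With only 140 training rows (70% of 200), 56 features leave few observations per parameter, which is part of why this model is prone to overfitting.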


### 2.1) Fitting the Model and predict

In [21]:
model = LinearRegression()
model.fit(X_train_poly_scaled, y_train)

y_poly_train_pred = model.predict(X_train_poly_scaled)
y_poly_test_pred = model.predict(X_test_poly_scaled)

### 2.2) Model Coefficient

In [27]:
feature_names = poly.get_feature_names_out(X.columns)

coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": model.coef_
})

print("\n=== INTERCEPT ===")
print(model.intercept_)

print("\n=== COEFFICIENTS ===")
print(coef_table)


=== INTERCEPT ===
13.791428571428598

=== COEFFICIENTS ===
                   feature   coefficient
0                        1  1.821154e+13
1                       TV  1.178025e+01
2                    Radio  3.462622e+00
3                Newspaper  2.693525e+00
4                     TV^2 -4.107031e+01
5                 TV Radio  8.288176e+00
6             TV Newspaper -1.391336e+01
7                  Radio^2  1.238073e+01
8          Radio Newspaper -4.750875e+00
9              Newspaper^2 -8.279650e+00
10                    TV^3  7.846092e+01
11              TV^2 Radio -2.367382e+01
12          TV^2 Newspaper  3.436589e+01
13              TV Radio^2 -8.765333e+00
14      TV Radio Newspaper -2.343996e+01
15          TV Newspaper^2  3.770616e+01
16                 Radio^3 -4.270588e+01
17       Radio^2 Newspaper -4.795659e-02
18       Radio Newspaper^2  1.402873e+01
19             Newspaper^3  2.110354e+00
20                    TV^4 -6.322125e+01
21              TV^3 Radio  5.518854e+

### 2.3) Performance Metrics

In [28]:
linear_poly_r2_test = r2_score(y_test, y_poly_test_pred)
linear_poly_rmse_test = sqrt(mean_squared_error(y_test, y_poly_test_pred))
linear_poly_mae_test = mean_absolute_error(y_test, y_poly_test_pred)

linear_poly_r2_train = r2_score(y_train, y_poly_train_pred)
linear_poly_rmse_train = sqrt(mean_squared_error(y_train, y_poly_train_pred))
linear_poly_mae_train = mean_absolute_error(y_train, y_poly_train_pred)

r2_df = pd.DataFrame({
    'Model': ['Linear','Linear Poly'],
    'R-squared train': [linear_r2_train,linear_poly_r2_train],
    'R-squared test': [linear_r2_test,linear_poly_r2_test],
    'RMSE train': [linear_rmse_train,linear_poly_rmse_train],
    'RMSE test': [linear_rmse_test,linear_poly_rmse_test],
    'MAE train': [linear_mae_train,linear_poly_mae_train],
    'MAE test': [linear_mae_test,linear_poly_mae_test]
})
r2_df

Unnamed: 0,Model,R-squared train,R-squared test,RMSE train,RMSE test,MAE train,MAE test
0,Linear,0.885005,0.922461,1.789726,1.388857,1.374654,1.054833
1,Linear Poly,0.969195,0.170779,0.92632,4.541836,0.735912,1.559126


### Answer to Question 1:

The Linear Polynomial model has 56 coefficients in total. Their magnitudes vary enormously: some exceed 10^10 (for example, about 1.8e+13 for the constant term), while others are close to zero (for example, about -4.8e-02 for Radio^2 Newspaper). Both the number of coefficients and their magnitudes hinder interpretability, making it very difficult to say which variables help explain the dependent variable (Sales) and how exactly they influence it.
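
One likely contributor to the huge value reported for the feature `1` is that `PolynomialFeatures` adds a constant bias column by default, and a constant column carries no information once it is standardized. The sketch below uses synthetic data (not the Advertising data) and degree 2 for brevity; `include_bias=False` is shown as one possible alternative, not as what the notebook did.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(20, 3))  # synthetic stand-in for TV, Radio, Newspaper

# Default settings: the first generated feature is the constant 1.
X_demo_poly = PolynomialFeatures(degree=2).fit_transform(X_demo)
X_demo_scaled = StandardScaler().fit_transform(X_demo_poly)

print(np.unique(X_demo_poly[:, 0]))        # the bias column contains only ones
print(np.abs(X_demo_scaled[:, 0]).max())   # after standardizing it is all zeros

# Possible alternative: drop the bias column and let the regressor fit the intercept.
n_without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo).shape[1]
print(X_demo_poly.shape[1], n_without_bias)
```

Because that column cannot affect predictions, whatever coefficient the solver assigns to it is probably a numerical artifact rather than an effect worth interpreting.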

Although the Linear Polynomial model achieves a higher R-squared on the training set (0.97 vs. 0.89 for the simple linear model), this metric falls sharply to 0.17 on the test set. The polynomial model beats the linear model on every metric on the training set but performs much worse on every metric on the test set (for example, a test RMSE of 4.54 vs. 1.39). This pattern is consistent with overfitting to the training set, which leaves the model of little practical use on unseen data.

It is worth mentioning a surprising pattern in the simple linear model: all of its performance metrics are better on the test set than on the training set, which is rather unusual. The differences are small, however (for example, an R-squared of 0.92 vs. 0.89), so the metrics have about the same magnitude on both sets.
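
A test score that beats the training score can simply be an effect of one particular random split. One way to check this, sketched here on synthetic data rather than the Advertising data, is k-fold cross-validation, which reports the spread of scores across several splits. The data-generating coefficients and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(200, 3))
# Synthetic target: two informative inputs, one irrelevant input, plus noise.
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R-squared per fold:", np.round(scores, 3))
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")
```

In the notebook, the same call could be applied to `X` and `y`; if the per-fold scores bracket both 0.89 and 0.92, the train/test gap is within normal split-to-split variation.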

In conclusion, the Baseline Model is much simpler, which favours interpretability, and it performs better on unseen data (the test set). It has only 4 parameters (the intercept, Beta_1, Beta_2 and Beta_3), which makes it easy to describe the relationship between each X variable and the output y. It also performs similarly on the training and test sets, which is a good sign of a proper fit.

# Part 3 - The Regularization Fix (Ridge & Lasso)

### 3.0) Fitting Ridge and Lasso

In [38]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=0.0005) #Used the same alpha as the instruction sheet
ridge.fit(X_train_poly_scaled, y_train)

lasso = Lasso(alpha=0.0001)  #Used the same alpha as the instruction sheet
lasso.fit(X_train_poly_scaled, y_train)

  model = cd_fast.enet_coordinate_descent(
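
The stray `cd_fast.enet_coordinate_descent` line above appears to be the tail of a scikit-learn `ConvergenceWarning`, which would mean the Lasso solver stopped before reaching its tolerance. A common remedy is to raise `max_iter`. The sketch below uses synthetic data, and the value 100000 is an arbitrary assumption, not a setting from the instruction sheet.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(140, 20))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=140)

max_iter = 100_000  # assumed value; the scikit-learn default is 1000
lasso_demo = Lasso(alpha=0.0001, max_iter=max_iter)

# Record any warnings raised during fitting instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lasso_demo.fit(X_demo, y_demo)

n_conv_warnings = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
print("iterations used:", lasso_demo.n_iter_)
print("convergence warnings:", n_conv_warnings)
```

If the warning persists in the notebook even with a larger `max_iter`, the reported Lasso coefficients and metrics should be read with some caution, since they may change once the solver fully converges.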




### 3.1) Predicting

In [39]:
y_ridge_train_pred = ridge.predict(X_train_poly_scaled)
y_ridge_test_pred  = ridge.predict(X_test_poly_scaled)

y_lasso_train_pred = lasso.predict(X_train_poly_scaled)
y_lasso_test_pred  = lasso.predict(X_test_poly_scaled)

### 3.2) Model coefficients

In [42]:
feature_names = poly.get_feature_names_out(X.columns)

ridge_coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": ridge.coef_
})

lasso_coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": lasso.coef_
})

print("=== RIDGE INTERCEPT ===")
print(ridge.intercept_)

print("\n=== RIDGE COEFFICIENTS ===")
print(ridge_coef_table)

print("\n=== LASSO INTERCEPT ===")
print(lasso.intercept_)

print("\n=== LASSO COEFFICIENTS ===")
print(lasso_coef_table)


=== RIDGE INTERCEPT ===
13.791428571428568

=== RIDGE COEFFICIENTS ===
                   feature  coefficient
0                        1     0.000000
1                       TV     9.910008
2                    Radio    -2.532399
3                Newspaper     0.602857
4                     TV^2   -22.098419
5                 TV Radio    14.139929
6             TV Newspaper    -3.043782
7                  Radio^2     9.781942
8          Radio Newspaper    -1.738526
9              Newspaper^2     0.743297
10                    TV^3    21.578210
11              TV^2 Radio   -16.531668
12          TV^2 Newspaper     6.210163
13              TV Radio^2   -17.475811
14      TV Radio Newspaper    -3.208278
15          TV Newspaper^2     6.625725
16                 Radio^3    -8.195824
17       Radio^2 Newspaper    -0.754891
18       Radio Newspaper^2     1.826712
19             Newspaper^3    -3.838098
20                    TV^4    -4.404186
21              TV^3 Radio    16.703143
22       

### 3.3) Computing Metrics

In [40]:
ridge_r2_train   = r2_score(y_train, y_ridge_train_pred)
ridge_r2_test    = r2_score(y_test,  y_ridge_test_pred)
ridge_rmse_train = sqrt(mean_squared_error(y_train, y_ridge_train_pred))
ridge_rmse_test  = sqrt(mean_squared_error(y_test,  y_ridge_test_pred))
ridge_mae_train  = mean_absolute_error(y_train, y_ridge_train_pred)
ridge_mae_test   = mean_absolute_error(y_test,  y_ridge_test_pred)

lasso_r2_train   = r2_score(y_train, y_lasso_train_pred)
lasso_r2_test    = r2_score(y_test,  y_lasso_test_pred)
lasso_rmse_train = sqrt(mean_squared_error(y_train, y_lasso_train_pred))
lasso_rmse_test  = sqrt(mean_squared_error(y_test,  y_lasso_test_pred))
lasso_mae_train  = mean_absolute_error(y_train, y_lasso_train_pred)
lasso_mae_test   = mean_absolute_error(y_test,  y_lasso_test_pred)

r2_df = pd.DataFrame({
    'Model': ['Linear','Linear Poly','Ridge Poly', 'Lasso Poly'],
    'R-squared train': [linear_r2_train,linear_poly_r2_train,ridge_r2_train, lasso_r2_train],
    'R-squared test': [linear_r2_test,linear_poly_r2_test,ridge_r2_test, lasso_r2_test],
    'RMSE train': [linear_rmse_train,linear_poly_rmse_train,ridge_rmse_train, lasso_rmse_train],
    'RMSE test': [linear_rmse_test,linear_poly_rmse_test,ridge_rmse_test, lasso_rmse_test],
    'MAE train': [linear_mae_train,linear_poly_mae_train,ridge_mae_train, lasso_mae_train],
    'MAE test': [linear_mae_test,linear_poly_mae_test,ridge_mae_test, lasso_mae_test]
})
r2_df

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
| 1 | Linear Poly | 0.969195 | 0.170779 | 0.92632 | 4.541836 | 0.735912 | 1.559126 |
| 2 | Ridge Poly | 0.997037 | 0.950861 | 0.287292 | 1.105633 | 0.219692 | 0.508102 |
| 3 | Lasso Poly | 0.991327 | 0.992479 | 0.491506 | 0.432559 | 0.330758 | 0.326672 |


### Answer to Question 2:

With Ridge and Lasso, the coefficients have much more interpretable magnitudes than in the Linear Polynomial model. Nevertheless, the Ridge coefficients remain relatively large, ranging between about -22 and +22, while the Lasso coefficients are more moderate, ranging between about -3 and +7. This change is driven by the penalty on large coefficients that these models add to the loss function. The resulting shrinkage reduces variance and overfitting compared with the unregularized polynomial model, which is free to use huge, unstable coefficients to chase noise in the training data.
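
The shrinkage effect can be seen in isolation on a small synthetic problem. Ridge adds `alpha` times the sum of squared coefficients to the squared-error loss, while Lasso adds `alpha` times the sum of their absolute values. The sketch below fits Ridge on two nearly collinear features, where unpenalized coefficients tend to be unstable, and shows the coefficient norm falling as `alpha` grows. The data and the `alpha` grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Second feature is almost a copy of the first, so the two are nearly collinear.
X_demo = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=100)])
y_demo = x1 + rng.normal(scale=0.1, size=100)

norms = {}
for alpha in [1e-6, 1e-2, 1.0, 100.0]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms[alpha] = float(np.linalg.norm(coef))  # L2 norm of the coefficient vector
    print(f"alpha={alpha:g}  coef={np.round(coef, 3)}  L2 norm={norms[alpha]:.3f}")
```

The very small `alpha` values used in the notebook (0.0005 and 0.0001) sit at the weak end of this range, which is consistent with the Ridge coefficients remaining fairly large.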

The Lasso model set 4 of the 56 polynomial coefficients exactly to zero, which suggests that at least some of the complex, high-degree terms add little to the prediction of sales. Looking at the larger coefficients, the main drivers appear to be the lower-order effects of TV and Radio, while Newspaper plays only a small role.
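
Counting the coefficients that Lasso sets exactly to zero is a one-line check. The sketch below demonstrates it on synthetic data in which only the first feature matters; the data, the `alpha` value and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only feature 0 influences the target; the other nine are pure noise.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(f"{n_zero} of {lasso_demo.coef_.size} coefficients are exactly zero")
print("coefficient on the informative feature:", round(float(lasso_demo.coef_[0]), 3))
```

In the notebook, the equivalent check would be `lasso_coef_table[lasso_coef_table["coefficient"] == 0]`, which also lists which features were dropped.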

Regarding the performance metrics, Lasso's scores are nearly identical on the training and test sets, suggesting a proper fit. Ridge shows a wider gap (an RMSE of 0.29 on train vs. 1.11 on test), though far smaller than that of the unregularized polynomial model. Moreover, while Ridge performs better on the training set, Lasso shows slight improvements in all metrics on the test set, repeating the surprising pattern observed in the linear regression.

In conclusion, the Lasso and Ridge models brought significant improvements over the Linear Polynomial model, underscoring the effectiveness of the penalty on large coefficients that these models include. This penalty protects the models from coefficients driven by noise in the training data.


### Answer to Question 3:

I’d say the Lasso model is my preferred tool for decision-making, but we must acknowledge that a simple linear regression tells essentially the same story, in a much simpler way.

Lasso takes the rich, flexible polynomial setup and fits it to the data in a "rational way". It keeps the useful signals and shrinks away the noise, yielding superior performance on new data and a more interpretable set of coefficients. It confirms that TV is the main driver of sales, with Radio playing a meaningful but secondary role, and Newspaper contributing very little at the margin. Nonetheless, the simple linear regression points in exactly the same direction and is usually easier to explain and communicate to non-specialist peers.

So my recommendation to the CMO would be: use Lasso as the main model for decision-making, as it significantly reduces errors, but don't disregard the simple linear model, as it tells much the same story in a simpler, more interpretable way.