# Homework: Week 1 - The Overfitting Trap in Marketing ROI

## Dataset:
https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Advertising.csv

## Part 1: The "Simple" Model (Baseline)
1.   **Split Your Data:** Before you do anything else, split your data into a training set and a test set using train_test_split (use test_size=0.3 and random_state=1). You will use the training set to fit all your models and the test set to evaluate them.

2.   **Fit & Interpret:** Fit a simple LinearRegression using only the three original features (TV, Radio, Newspaper).

3.   **Write Down Coefficients and Performance Metrics:** Note the coefficients for TV, Radio, and Newspaper, along with the model's performance metrics on both the training set and the test set. This will be your baseline model.

## Part 2: The "Overly Complex" Model (The Trap)

1.   **Create Polynomial Features:** Use PolynomialFeatures (try degree=5) to create a new, high-dimensional training set. This will create many new features (e.g., TV^2, Radio^3, TV * Radio). Make sure to fit_transform on your training data and only .transform your test data.
2.   **Scale Your Features:** Use StandardScaler. fit_transform on the polynomial training data and just .transform on the polynomial test data. (This is critical for regularization to work).
3.   **Fit the Overfit Model:** Fit a LinearRegression on this new, scaled, polynomial training set.
4.   **Check the Coefficients and Metrics:** Print model.coef_ and calculate the model's performance metrics.
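
The four steps above can be sketched as follows. This is a minimal illustration, not the reference solution: it uses synthetic data (an assumption, generated with a fixed seed) in place of Advertising.csv so that it runs without a download, and `include_bias=False` is a choice made here that the assignment does not specify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the three ad channels and sales (assumption, not the real CSV).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Step 1: learn the polynomial expansion on the training split, reuse it on the test split.
poly = PolynomialFeatures(degree=5, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Step 2: learn means and standard deviations on the training split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)

# Step 3: plain least squares on the expanded, scaled features.
overfit_model = LinearRegression().fit(X_train_scaled, y_train)

# Step 4: inspect coefficient size and compare train vs. test fit.
train_r2 = r2_score(y_train, overfit_model.predict(X_train_scaled))
test_r2 = r2_score(y_test, overfit_model.predict(X_test_scaled))
print("Number of features:", X_train_scaled.shape[1])
print("Largest |coefficient|:", np.abs(overfit_model.coef_).max())
print(f"R-squared train: {train_r2:.3f}, test: {test_r2:.3f}")
```

Fitting the transformer and the scaler only on the training split keeps test-set statistics from leaking into the model.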

### Question 1 (Observation):
* What do you observe about the coefficients? Are they large or small? Do they make any intuitive sense? What does this tell you about the risk of this model? (Hint: They will likely be huge and nonsensical, a classic sign of overfitting).
* Print the metrics for this complex model on the training set.
* Print the R-squared score for this same model on the test set.
* What do you observe? What does the difference between these two scores (and the baseline score from Part 1) tell you about this model? Is this a good model?

## Part 3: The Regularization Fix (Ridge & Lasso)
Now, let's fix the model from Part 2.
1.   **Fit Ridge:** Fit a Ridge model on the same scaled, polynomial training data.
2.   **Fit Lasso:** Fit a Lasso model on the same scaled, polynomial training data.
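
A minimal sketch of these two fits is shown below. It again uses synthetic data (an assumption) so that it runs on its own, and the `alpha` and `max_iter` values are illustrative choices, not values given in the assignment.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the advertising data (assumption, not the real CSV).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 3 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Same preprocessing as the overfit model: expand, then scale, fitting on train only.
poly = PolynomialFeatures(degree=5, include_bias=False)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(poly.fit_transform(X_train))
X_test_scaled = scaler.transform(poly.transform(X_test))

# Regularization strengths below are illustrative assumptions.
ridge = Ridge(alpha=1.0).fit(X_train_scaled, y_train)
lasso = Lasso(alpha=0.1, max_iter=100_000).fit(X_train_scaled, y_train)

# .score() returns R-squared for scikit-learn regressors.
scores = {
    name: (model.score(X_train_scaled, y_train), model.score(X_test_scaled, y_test))
    for name, model in [("Ridge", ridge), ("Lasso", lasso)]
}
for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: R-squared train = {train_r2:.3f}, test = {test_r2:.3f}")
```

In practice the regularization strength would usually be tuned, for example by cross-validation, instead of being fixed by hand.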

### Question 2 (Analysis & Performance):
* What do you observe about the coefficients from the two new models? Are they still large, or have they shrunk? Comment on the changes.
* Look at the coefficients from your Lasso model. How many features did it set to zero? What does this tell you about the 'true' drivers of sales?
* What are the performance metrics for your Ridge model on the train set and test set?
* What are the performance metrics for your Lasso model on the train set and test set?
* How do these scores compare to the 'overfit' model's test score? What does this prove about the value of regularization?
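
One possible way to answer the coefficient questions is to summarize each model's coefficient vector by its size and by the number of exact zeros. The sketch below is an assumption-laden illustration: the data are synthetic, the `alpha` values are arbitrary, and the `summary` dictionary is a hypothetical helper structure.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic training data standing in for the advertising set (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(140, 3))
y = 3 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.5, size=140)

X_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_poly)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                      # illustrative alpha
    "Lasso": Lasso(alpha=0.1, max_iter=100_000),    # illustrative alpha
}

summary = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    coefs = model.coef_
    summary[name] = {
        "l2_norm": float(np.linalg.norm(coefs)),     # overall coefficient size
        "max_abs": float(np.abs(coefs).max()),       # single largest coefficient
        "n_zero": int(np.sum(coefs == 0)),           # features dropped entirely
    }

for name, stats in summary.items():
    print(name, stats)
```

Ridge shrinks coefficients toward zero without removing them, while Lasso can set some exactly to zero, which is why the zero count is the quantity to watch for the Lasso model.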

### Question 3 (The Verdict):
In the end, after trying a simple model, an overfit complex model, and two regularized models, what is your final recommendation to the CMO? Which channels (TV, Radio, Newspaper) are the most reliable drivers of sales?

# Answer

## Part 1 - Baseline Model

### 1.0) Import and inspect data

In [3]:
import pandas as pd

url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Advertising.csv"

df = pd.read_csv(url)

print(df.head())
print(df.info())
print(df.describe())

   Unnamed: 0     TV  Radio  Newspaper  Sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   Radio       200 non-null    float64
 3   Newspaper   200 non-null    float64
 4   Sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB
None
       Unnamed: 0          TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000  200.000000
mean   100.500000  147.042500   23.264000   30.554000   14.022500
std     57.879185   85.854236   14.846809   21.778621    5.217457
min      1.000

### 1.1) Train test split

In [5]:
from sklearn.model_selection import train_test_split

X = df[["TV", "Radio", "Newspaper"]]
y = df["Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

### 1.2) Fit OLS

In [13]:
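from sklearn.linear_model import LinearRegression

# Fit ordinary least squares on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on both splits, used for the evaluation metrics
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)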

### 1.3) Model Evaluation

In [14]:
print("=== PARAMETERS ===")
print("Intercept (beta_0):", model.intercept_)
print("Coefficients (beta_1, beta_2, beta_3):")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef}")

=== PARAMETERS ===
Intercept (beta_0): 2.9372157346906143
Coefficients (beta_1, beta_2, beta_3):
  TV: 0.04695204776848461
  Radio: 0.17658643526817375
  Newspaper: 0.0018511533188922285
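
To make these parameters concrete, the sketch below applies them by hand to the first row shown by `df.head()` (TV = 230.1, Radio = 37.8, Newspaper = 69.2, Sales = 22.1). The variable names are illustrative; the numbers are copied from the outputs above.

```python
# Fitted parameters copied from the printed output.
intercept = 2.9372157346906143
coefs = {
    "TV": 0.04695204776848461,
    "Radio": 0.17658643526817375,
    "Newspaper": 0.0018511533188922285,
}

# First row of the dataset as printed by df.head().
row = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}
actual_sales = 22.1

# Each channel's contribution is coefficient * spend.
contributions = {name: coefs[name] * row[name] for name in coefs}
predicted_sales = intercept + sum(contributions.values())

for name, value in contributions.items():
    print(f"{name} contribution: {value:.3f}")
print(f"Predicted sales: {predicted_sales:.2f} (actual: {actual_sales})")
```

For this row the prediction comes out at roughly 20.5 against an observed 22.1, and the Newspaper term contributes only about 0.13 despite a spend of 69.2.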


In [15]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

linear_r2_test = r2_score(y_test, y_test_pred)
linear_rmse_test = sqrt(mean_squared_error(y_test, y_test_pred))
linear_mae_test = mean_absolute_error(y_test, y_test_pred)

linear_r2_train = r2_score(y_train, y_train_pred)
linear_rmse_train = sqrt(mean_squared_error(y_train, y_train_pred))
linear_mae_train = mean_absolute_error(y_train, y_train_pred)

r2_df = pd.DataFrame({
    'Model': ['Linear'],
    'R-squared train': [linear_r2_train],
    'R-squared test': [linear_r2_test],
    'RMSE train': [linear_rmse_train],
    'RMSE test': [linear_rmse_test],
    'MAE train': [linear_mae_train],
    'MAE test': [linear_mae_test]
})
r2_df

| | Model | R-squared train | R-squared test | RMSE train | RMSE test | MAE train | MAE test |
|---|---|---|---|---|---|---|---|
| 0 | Linear | 0.885005 | 0.922461 | 1.789726 | 1.388857 | 1.374654 | 1.054833 |
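
Because the same three metrics are needed again for every later model, the calculation could be wrapped in a small helper. The function name `regression_metrics` is hypothetical and this is only one way to organize it; the toy arrays at the bottom are made up to show the call.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return R-squared, RMSE, and MAE for one set of predictions."""
    return {
        "R-squared": float(r2_score(y_true, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
    }


# Toy example: two exact predictions and one that is off by 1.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

Calling the helper once with the training pair and once with the test pair gives the values needed for one row of the comparison table.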


## Part 2 - Overly Complex Model

In [None]:
from sklearn.preprocessing import PolynomialFeatures
# Note: the assignment suggests degree=5; degree=15 is used here.
poly_transform = PolynomialFeatures(degree=15)