# Segment 2 Lab 1

Linear regression, generalization and overfitting

## Disclaimer again!

I will go through this quite fast - if you're new to ML, use this to gain intuition!

In [None]:
# imports from incredibly common packages: numpy, matplotlib, scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# To set up our problem

## We have 100 Home Appliances.

Each of them has a **price** and a **height**.

Unknown to us - every home appliance has a price that is (strangely) closely related to its height!

The price is approximately given by:
$$
price = 200 + 100 \cdot height - 40 \cdot height^2 + 4 \cdot height^3
$$

This means that if we were to create 3 features:
1. height
2. height squared
3. height cubed

Then a linear regression model would be able to predict the price closely.

Let's see how this works.
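As a quick sanity check of the claim above, here is a minimal sketch that assumes noise-free prices and uses plain NumPy least squares (`np.linalg.lstsq`) instead of scikit-learn. The names `demo_heights`, `demo_prices`, `design` and `recovered` are illustrative only and are not used elsewhere in this lab.

```python
import numpy as np

# Hypothetical noise-free data: prices follow the cubic formula exactly
demo_heights = np.linspace(0, 10, 50)
demo_prices = 200 + 100*demo_heights - 40*demo_heights**2 + 4*demo_heights**3

# Design matrix: a column of ones for the intercept, then the 3 features
design = np.column_stack((
    np.ones_like(demo_heights),
    demo_heights,
    demo_heights**2,
    demo_heights**3,
))

# Ordinary least squares: find the weights that minimize the squared error
recovered, *_ = np.linalg.lstsq(design, demo_prices, rcond=None)
print(np.round(recovered, 3))
```

With no noise, the recovered weights should land very close to 200, 100, -40 and 4. The price is a cubic in height, but it is linear in the three features, which is why linear regression can fit it.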

In [None]:
# Generate synthetic data for Home Appliances: 100 points with prices that roughly follow a cubic curve

np.random.seed(42)
n_samples = 100
heights = np.random.uniform(0, 10, n_samples)
noise = np.random.normal(0, 50, n_samples)
prices = 200 + 100*heights - 40*heights**2 + 4*heights**3 + noise

In [None]:
# Plot the generated data

plt.figure(figsize=(10, 6))
plt.scatter(heights, prices, alpha=0.5)
plt.title('Home Appliance Price vs Height')
plt.xlabel('Height')
plt.ylabel('Price')
plt.show()

In [None]:
# Every numpy array has a shape that represents the dimensions of the array
# Our heights are currently a one dimensional array of floats:

print(heights.shape)
print(heights)

In [None]:
# We want to turn this into a 2D matrix, with a row for each of our 100 datapoints
# And columns for each feature (there is only 1)

X = heights.reshape(-1, 1)
print(X.shape)
print(X)
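If `reshape(-1, 1)` looks cryptic, this small sketch may help. It uses a made-up array of 4 heights (the names `small`, `as_column` and `as_row` are illustrative) to show what the `-1` does.

```python
import numpy as np

# Hypothetical mini example with 4 heights instead of 100
small = np.array([1.5, 2.0, 0.5, 3.0])
print(small.shape)   # 1D: just 4 values

# -1 asks NumPy to infer that dimension from the array's length
as_column = small.reshape(-1, 1)   # 4 rows, 1 column: one row per datapoint
as_row = small.reshape(1, -1)      # 1 row, 4 columns: one datapoint with 4 features
print(as_column.shape, as_row.shape)
```

scikit-learn expects inputs shaped as (number of datapoints, number of features), so the column form is the one we want here.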

In [None]:
# We can use this utility method from scikit-learn to split up our data into train and test:

X_train, X_test, y_train, y_test = train_test_split(X, prices, test_size=0.2, random_state=42)
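To see what the split actually does, here is a small sketch on stand-in data (the `demo_` and `a_`/`b_` names are hypothetical). It checks the 80/20 sizes, shows that a fixed `random_state` makes the split repeatable, and confirms that each row stays paired with its target after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shapes as X and prices
demo_X = np.arange(100, dtype=float).reshape(-1, 1)
demo_y = np.arange(100, dtype=float)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
a_train, a_test, ya_train, ya_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))          # same seed, same split
print(np.array_equal(a_test[:, 0], ya_test))   # rows stay paired with targets
```

This repeatability is why the lab can reuse `random_state=42` for every feature set and still compare results on the same held-out appliances.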

In [None]:
# Running Linear Regression is simply a couple of lines of code - predict price based only on height:

model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions

Now we have a trained model and we can make predictions. We will run the model on our test dataset.

We'll calculate 2 key metrics: Mean Squared Error and R squared.

**Mean Squared Error** is, as it sounds, the average of the squared differences between y (the actual price) and y hat (the predicted price).

**R squared** measures how well the predictions fit the truth:  
0% means that the model explains none of the variability and is no better than always guessing the mean value  
100% means that it fully explains the variability  
A negative number means that the model is worse than just guessing the average!
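Both metrics are short enough to compute by hand. The sketch below uses made-up toy values (not the appliance data) and plain NumPy. It follows the standard definitions that `mean_squared_error` and `r2_score` implement, and all variable names are illustrative.

```python
import numpy as np

# Hypothetical toy values, not taken from the appliance data
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean Squared Error: average of the squared differences
mse_by_hand = np.mean((y_true - y_pred) ** 2)

# R squared: 1 - (model's squared error) / (squared error of guessing the mean)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1 - np.sum((y_true - y_pred) ** 2) / ss_tot

# Always guessing the mean scores exactly 0
mean_guess = np.full_like(y_true, y_true.mean())
r2_mean_guess = 1 - np.sum((y_true - mean_guess) ** 2) / ss_tot

# Predictions that run the wrong way score below 0
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = 1 - np.sum((y_true - y_bad) ** 2) / ss_tot

print(mse_by_hand, r2_by_hand, r2_mean_guess, r2_bad)
```

Here the MSE is 0.5 and R squared is 0.9 (90%), the mean guess scores 0, and the backwards predictions score -3.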

In [None]:
# Run predictions on test dataset using model.predict

y_hat_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_hat_test)
r2 = r2_score(y_test, y_hat_test)

In [None]:
# Let's visualize the results

plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5)
plt.plot(X_test, y_hat_test, color='red', linewidth=2)
plt.title(f'Linear Regression with 1 feature (height) - performance on unseen Test data\nMSE: {mse:,.0f}\nR squared: {r2*100:.0f}%')
plt.xlabel('Height')
plt.ylabel('Price')
plt.show()

In [None]:
# Now let's add a SECOND feature to our input data, in addition to the height
# Let's add a feature which is the SQUARE of the height

X2 = np.column_stack((heights, heights**2))
print(X2.shape)
print(X2)

In [None]:
# Split into train and test and run Linear Regression

X_train, X_test, y_train, y_test = train_test_split(X2, prices, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_hat_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_hat_test)
r2 = r2_score(y_test, y_hat_test)

In [None]:
# Sort X_test for smooth curve plotting

X_test_sorted = X_test[X_test[:, 0].argsort()]
y_hat_test_sorted = model.predict(X_test_sorted)

In [None]:
# Plot the results

plt.figure(figsize=(10, 6))
plt.scatter(X_test[:, 0], y_test, alpha=0.5)
plt.plot(X_test_sorted[:, 0], y_hat_test_sorted, color='red', linewidth=2)
plt.title(f'Linear Regression with 2 features - performance on unseen Test data\nMSE: {mse:,.0f}\nR squared: {r2*100:.0f}%')
plt.xlabel('Height')
plt.ylabel('Price')
plt.show()

In [None]:
# Step 3: Polynomial Regression (Degree 3)

X3 = np.column_stack((heights, heights**2, heights**3))
X_train, X_test, y_train, y_test = train_test_split(X3, prices, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_hat_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_hat_test)
r2 = r2_score(y_test, y_hat_test)

In [None]:
# Sort X_test for smooth curve plotting

X_test_sorted = X_test[X_test[:, 0].argsort()]
y_hat_test_sorted = model.predict(X_test_sorted)

In [None]:
# Visualize the results

plt.figure(figsize=(10, 6))
plt.scatter(X_test[:, 0], y_test, alpha=0.5)
plt.plot(X_test_sorted[:, 0], y_hat_test_sorted, color='red', linewidth=2)
plt.title(f'Linear Regression with 3 features - performance on unseen Test data\nMSE: {mse:,.0f}\nR squared: {r2*100:.0f}%')
plt.xlabel('Height')
plt.ylabel('Price')
plt.show()

Remember the true formula is: $price = 200 + 100 \cdot height - 40 \cdot height^2 + 4 \cdot height^3$

In [None]:
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Conclusion

When we give our Linear Regression model three features:

height  
height squared  
height cubed  

It's able to fit the training data, and it generalizes well: the results look great on unseen test data.

And this makes sense because secretly, we generated data based on exactly those features.

Now let's demonstrate what it looks like to "overfit". What if we came up with a model with THIRTY features...

In [None]:
# Step 4: Overfitting Demonstration!

X_overfit = np.column_stack([heights**i for i in range(1,31)])

print(X_overfit.shape)
print(X_overfit[0])

In [None]:
# Linear Regression

X_train, X_test, y_train, y_test = train_test_split(X_overfit, prices, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_hat_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_hat_test)
r2 = r2_score(y_test, y_hat_test)

In [None]:
# Sort X_test for smooth curve plotting

X_test_sorted = X_test[X_test[:, 0].argsort()]
y_hat_test_sorted = model.predict(X_test_sorted)

In [None]:
# Plot the results

plt.figure(figsize=(10, 6))
plt.scatter(X_test[:, 0], y_test, alpha=0.5)
plt.plot(X_test_sorted[:, 0], y_hat_test_sorted, color='red', linewidth=2)
plt.title(f'Linear Regression with 30 features - performance on unseen Test data\nMSE: {mse:,.0f}\nR squared: {r2*100:.0f}%')
plt.xlabel('Height')
plt.ylabel('Price')
plt.show()

In [None]:
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Conclusion

And so you see the surprising results. We provided the Linear Regression with the same features as before, plus MORE features if it wanted them.

And something went wrong: we got worse results.

Why?

Because armed with the extra flexibility, the model was able to fit more closely to the training data.

But it was finding new patterns in the noise that DIDN'T REFLECT the underlying pattern.

So when we extended it to make new predictions, it did worse.

And that is over-fitting!
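One way to see this directly is to compare the error on the training data with the error on the test data for each model. The sketch below is a rough illustration, assuming the same data recipe as earlier in this lab. The helper `train_and_test_mse` and the `demo_` names are hypothetical, and the exact numbers will depend on the seed and on floating-point precision with such large powers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data recipe as earlier: cubic curve plus noise
rng = np.random.RandomState(42)
demo_heights = rng.uniform(0, 10, 100)
demo_prices = (200 + 100*demo_heights - 40*demo_heights**2
               + 4*demo_heights**3 + rng.normal(0, 50, 100))

def train_and_test_mse(degree):
    """Fit on powers 1..degree of height; return (train MSE, test MSE)."""
    features = np.column_stack([demo_heights**i for i in range(1, degree + 1)])
    F_train, F_test, p_train, p_test = train_test_split(
        features, demo_prices, test_size=0.2, random_state=42)
    fitted = LinearRegression().fit(F_train, p_train)
    return (mean_squared_error(p_train, fitted.predict(F_train)),
            mean_squared_error(p_test, fitted.predict(F_test)))

# Compare 1 feature, 3 features and 30 features side by side
results = {degree: train_and_test_mse(degree) for degree in (1, 3, 30)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:>2}: train MSE {train_mse:,.0f} | test MSE {test_mse:,.0f}")
```

The thing to look for is the gap between the two columns: a model that scores much better on the training data than on the test data is a sign of overfitting.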

## Exercise for you

Make it worse! Find ways to overfit even more dramatically.