# Polynomial Regression and Interaction Variables

In this notebook, we apply polynomial regression to the dataset introduced in the previous notebook, which relates students' grades to the number of hours they spent studying and to other factors. We will also add interaction features to our training data to improve our predictions.

FYI, the dataset is synthetic: I do not have any means of monitoring the time you spend studying my course.

## Importing the dependencies

First, we import all the dependencies needed for this lab. If you cannot run the following code cell, remember to [create an environment](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/), install the dependencies inside it (using the command `pip install -r requirements.txt`) and select it as your Jupyter kernel.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

In [None]:
df = pd.read_csv("./synthetic_student_data.csv")
df.head()

## Example with a single input feature

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(df["hours_studied"], df["grade"], c="b", marker="+")

ax.set_xlabel("Number of hours studied", fontsize="large")
ax.set_ylabel("Grade", fontsize="large")
ax.set_ylim((0,20))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
fig.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

X = df["hours_studied"].to_numpy()
Y = df["grade"].to_numpy()

poly_2 = PolynomialFeatures(2, include_bias=False)

X_poly2 = poly_2.fit_transform(X.reshape(-1, 1))

ax.scatter(X, Y, c="b", marker="+")

ax.set_xlabel("Number of hours studied", fontsize="large")
ax.set_ylabel("Grade", fontsize="large")
ax.set_ylim((0,20))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))

model = LinearRegression()
model.fit(X_poly2, Y)

# Predict the fitted curve over the full x range
x_vals = np.linspace(df["hours_studied"].min(), df["hours_studied"].max(), 100).reshape(-1, 1)
x_vals_poly2 = poly_2.transform(x_vals)  # Reuse the already fitted transformer instead of refitting it
y_vals = model.predict(x_vals_poly2)

ax.plot(x_vals, y_vals)
fig.tight_layout()
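
To make the previous cell less of a black box, here is a minimal sketch of what `PolynomialFeatures(2, include_bias=False)` produces for a single feature. The toy values and the names `x_toy`, `y_toy`, `poly_toy` and `toy_model` are made up for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy inputs (made up for illustration, not taken from the dataset)
x_toy = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
# Exact quadratic without noise: y = 1 + 2x + 3x^2
y_toy = 1.0 + 2.0 * x_toy.ravel() + 3.0 * x_toy.ravel() ** 2

poly_toy = PolynomialFeatures(2, include_bias=False)
x_toy_poly = poly_toy.fit_transform(x_toy)
print(x_toy_poly)  # Two columns: x and x squared

# The regression stays linear, it is simply fitted on the expanded columns
toy_model = LinearRegression().fit(x_toy_poly, y_toy)
print(toy_model.intercept_, toy_model.coef_)
```

Because the toy target is an exact quadratic, the fitted intercept and coefficients should come out very close to 1, 2 and 3. The "polynomial" part lives entirely in the feature expansion, while the model itself remains an ordinary linear regression.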

### Example of an overfitted regression

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

X = df["hours_studied"].to_numpy()
Y = df["grade"].to_numpy()

poly_12 = PolynomialFeatures(12, include_bias=False)  # This time we compute all the powers of our feature up to 12

X_poly12 = poly_12.fit_transform(X.reshape(-1, 1))

ax.scatter(X, Y, c="b", marker="+")

ax.set_xlabel("Number of hours studied", fontsize="large")
ax.set_ylabel("Grade", fontsize="large")
ax.set_ylim((0,20))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))

model = LinearRegression()
model.fit(X_poly12, Y)

# Predict the fitted curve over the full x range
x_vals = np.linspace(df["hours_studied"].min(), df["hours_studied"].max(), 100).reshape(-1, 1)
x_vals_poly12 = poly_12.transform(x_vals)  # Reuse the already fitted transformer instead of refitting it
y_vals = model.predict(x_vals_poly12)

ax.plot(x_vals, y_vals)
fig.tight_layout()
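
A plot is one way to spot overfitting; another is to compare the R² score on the training data with the score on held-out data. The sketch below does this for degrees 2 and 12 on made-up stand-in data (a gentle quadratic trend plus noise), since it is meant to run on its own. The names `hours`, `grades` and `scores` are illustrative, and scaling the input before expanding it is a choice made here to keep the high powers numerically manageable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up stand-in data: a gentle quadratic trend plus Gaussian noise
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=40).reshape(-1, 1)
grades = 2 + 1.5 * hours.ravel() - 0.08 * hours.ravel() ** 2 + rng.normal(0, 1, size=40)

h_train, h_test, g_train, g_test = train_test_split(
    hours, grades, test_size=0.3, random_state=0
)

scores = {}
for degree in (2, 12):
    # Scale first so that the high powers stay in a reasonable numeric range
    pipe = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression(),
    )
    pipe.fit(h_train, g_train)
    scores[degree] = (pipe.score(h_train, g_train), pipe.score(h_test, g_test))
    print(f"degree {degree}: train R² = {scores[degree][0]:.3f}, test R² = {scores[degree][1]:.3f}")
```

A training score that keeps rising while the test score stalls or drops is the usual sign that the extra powers are fitting noise rather than the underlying trend.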

In this case, we had to raise the degree a lot (up to 12) before overfitting appeared, because we had a single feature. With additional features, the number of polynomial terms grows much faster, so the same issue can appear at much lower degrees.
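
To give a rough idea of how quickly the number of terms grows, the sketch below counts the columns generated for a few combinations of input features and degree. The helper `n_poly_features` is a hypothetical name introduced here; it uses the standard count of monomials of degree at most `degree` in `n_inputs` variables, minus one because the bias column is excluded.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_inputs, degree):
    """Number of polynomial terms up to `degree`, without the constant term."""
    return comb(n_inputs + degree, degree) - 1


counts = []
for n_inputs, degree in [(1, 2), (1, 12), (3, 2), (3, 12)]:
    # Fitting on a dummy row is enough: only the number of columns matters
    expander = PolynomialFeatures(degree, include_bias=False).fit(np.zeros((1, n_inputs)))
    counts.append((n_inputs, degree, expander.n_output_features_))
    print(f"{n_inputs} feature(s), degree {degree}: "
          f"{expander.n_output_features_} columns "
          f"(formula: {n_poly_features(n_inputs, degree)})")
```

With a single feature, degree 12 yields 12 columns; with three features, the same degree yields 454, which is why a model with several inputs can overfit at a far lower degree.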

## Improved prediction with multiple variables and interaction

Here, we take into account several of the features in our dataset. The results would be much harder to plot, so we will simply display the coefficients. We will also generate interaction variables (products of two different features) to improve the quality of our regression.

In [None]:
input_features = ["hours_studied", "sleep_hours", "class_attendance"]
X = df[input_features].to_numpy()
Y = df["grade"].to_numpy()

poly_2 = PolynomialFeatures(2, include_bias=False)
X_poly2 = poly_2.fit_transform(X)

X_train, X_test, Y_train, Y_test = train_test_split(X_poly2, Y, test_size=0.2, random_state=4321)

scaler = StandardScaler()  # The features have very different ranges, so do not forget to scale them
X_train_scaled = scaler.fit_transform(X_train)  # The mean and std are computed on the training data only
X_test_scaled = scaler.transform(X_test)  # The test data is scaled with the training mean and std, so that no information from the test set leaks into training

model = LinearRegression()
model.fit(X_train_scaled, Y_train)

coeffs = model.coef_
bias = model.intercept_
print(f"Regression equation (with scaled features): grade = {bias:.3f}{''.join([f' + {coeffs[i]:.3f} × {input_features[i]}' for i in range(len(input_features))])}" \
      f" + {coeffs[3]:.3f} × {input_features[0]}² + {coeffs[4]:.3f} × {input_features[0]} × {input_features[1]} + {coeffs[5]:.3f} × {input_features[0]} × {input_features[2]}" \
      f" + {coeffs[6]:.3f} × {input_features[1]}² + {coeffs[7]:.3f} × {input_features[1]} × {input_features[2]} + {coeffs[8]:.3f} × {input_features[2]}²")

score = model.score(X_test_scaled, Y_test)
print(f"R² score = {score:.3f}")
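
The expand, split, scale and fit steps above can also be chained in a scikit-learn pipeline, which fits the scaler on the training data only without any extra bookkeeping. The sketch below is one possible way to write it, not the notebook's own method; it runs on made-up stand-in data (`X_demo`, `y_demo`) rather than on the CSV file so that it is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up stand-in data with three features and one interaction effect
rng = np.random.default_rng(4321)
n_samples = 300
X_demo = np.column_stack([
    rng.uniform(0, 10, n_samples),  # plays the role of hours_studied
    rng.uniform(4, 9, n_samples),   # plays the role of sleep_hours
    rng.uniform(0, 1, n_samples),   # plays the role of class_attendance
])
y_demo = (5 + 0.8 * X_demo[:, 0] + 0.5 * X_demo[:, 0] * X_demo[:, 2]
          + rng.normal(0, 1, n_samples))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4321
)

# Each step is fitted on the training data only when calling pipe.fit
pipe = make_pipeline(
    PolynomialFeatures(2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(Xd_train, yd_train)

demo_score = pipe.score(Xd_test, yd_test)  # The test data goes through the same fitted steps
print(f"R² score = {demo_score:.3f}")
print(pipe[-1].coef_)  # Coefficients of the final LinearRegression step
```

Calling `pipe.score` or `pipe.predict` on new data applies the polynomial expansion and the training-set scaling automatically, which removes the risk of forgetting a step or refitting a transformer by mistake.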