# Linear Regression

In this notebook, we are going to apply linear regression to a dataset of students' grades versus the number of hours they spent studying, along with other factors.

FYI, the dataset is synthetic: I do not have any means of monitoring the time you spend studying my course.

## Importing the dependencies

First, we are going to import all the dependencies that we will need for this lab. If you cannot run the following code cell, make sure to [create an environment](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/), install the dependencies inside it (using the command `pip install -r requirements.txt`), and select it as your Jupyter kernel.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
df = pd.read_csv("./synthetic_student_data.csv")
df.head()

## Example with a single input feature

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

ax.scatter(df["hours_studied"], df["grade"], c="b", marker="+")

ax.set_xlabel("Number of hours studied", fontsize="large")
ax.set_ylabel("Grade", fontsize="large")
ax.set_ylim((0,20))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
fig.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the full dataset once; the fitted lines are drawn on top of it
ax.scatter(df["hours_studied"], df["grade"], c="b", marker="+")

# x values spanning the full range of the data, used to draw every fitted line
x_vals = np.linspace(df["hours_studied"].min(), df["hours_studied"].max(), 100).reshape(-1, 1)

for i in [1, 2, 5, 50, len(df.index)]:  # Increase training data in steps
    X = df["hours_studied"].to_numpy()[:i]
    Y = df["grade"].to_numpy()[:i]

    # With a single point the slope is undetermined: the fitted line is horizontal
    model = LinearRegression()
    model.fit(X.reshape(-1, 1), Y)

    ax.plot(x_vals, model.predict(x_vals), label=f"{i} points")

# Axis formatting only needs to be applied once, after all lines are drawn
ax.set_xlabel("Number of hours studied", fontsize="large")
ax.set_ylabel("Grade", fontsize="large")
ax.set_ylim((0, 20))
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.legend(loc="lower right")
fig.tight_layout()
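
With a single input feature, the least-squares line has a simple closed form: the slope is the covariance of the feature and the target divided by the variance of the feature, and the intercept makes the line pass through the point of means. The sketch below checks this against `LinearRegression`. It uses hypothetical synthetic data (the `hours` and `grade` arrays and their generating parameters are assumptions made for illustration, not the contents of `synthetic_student_data.csv`).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data (assumption): not the notebook's CSV file
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)
grade = 5.0 + 1.2 * hours + rng.normal(0, 1.0, size=50)

# Closed-form least-squares solution for one feature
slope = np.cov(hours, grade, ddof=1)[0, 1] / np.var(hours, ddof=1)
intercept = grade.mean() - slope * hours.mean()

# Same fit with scikit-learn
sk_model = LinearRegression().fit(hours.reshape(-1, 1), grade)

print(f"closed form : grade = {intercept:.3f} + {slope:.3f} × hours")
print(f"scikit-learn: grade = {sk_model.intercept_:.3f} + {sk_model.coef_[0]:.3f} × hours")
```

Both lines should print the same equation up to floating-point precision.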

## Improved prediction with multiple variables and evaluation

Here, we will take into account several features of our dataset. The results would be much more complex to plot, so we will simply display the coefficients. We will also split our dataset into a training set and a test set to assess how well our model generalizes to data it has not seen.

In [None]:
input_features = ["hours_studied", "sleep_hours", "class_attendance"]
X = df[input_features].to_numpy()
Y = df["grade"].to_numpy()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=4321)

# The features have different units and ranges: scale them so that their coefficients are comparable
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Scale the test data with the mean and std of the training data, so that no information from the test set leaks into training
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, Y_train)

coeffs = model.coef_
bias = model.intercept_
terms = "".join(f" + {coeff:.3f} × {name}" for coeff, name in zip(coeffs, input_features))
print(f"Regression equation (with scaled features): grade = {bias:.3f}{terms}")

score = model.score(X_test_scaled, Y_test)
print(f"R² score = {score:.3f}")
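
The coefficients printed above apply to the standardized features, not to the raw values. Since `StandardScaler` computes `z = (x - mean_) / scale_`, they can be converted back to the original units. The sketch below illustrates this on hypothetical data (`X_demo`, `y_demo`, and their generating parameters are assumptions for illustration only) and checks the result against a model fitted directly on the unscaled features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data (assumption): three features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, 7.0, 80.0], scale=[2.0, 1.0, 10.0], size=(200, 3))
y_demo = (2.0 + 1.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + 0.05 * X_demo[:, 2] + rng.normal(0, 0.5, size=200))

scaler_demo = StandardScaler()
model_scaled = LinearRegression().fit(scaler_demo.fit_transform(X_demo), y_demo)

# Undo the standardization z = (x - mean_) / scale_
coef_raw = model_scaled.coef_ / scaler_demo.scale_
intercept_raw = model_scaled.intercept_ - np.sum(
    model_scaled.coef_ * scaler_demo.mean_ / scaler_demo.scale_
)

# Reference: ordinary least squares fitted directly on the raw features
model_raw = LinearRegression().fit(X_demo, y_demo)

print("coefficients match:", np.allclose(coef_raw, model_raw.coef_))
print("intercepts match:  ", np.isclose(intercept_raw, model_raw.intercept_))
```

For ordinary least squares, scaling changes how the coefficients read but not the predictions, which is why the two fits agree once converted.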

As a point of comparison for the R² score, let's train our regression on a single variable.

In [None]:
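X = df["hours_studied"].to_numpy().reshape(-1, 1)
Y = df["grade"].to_numpy()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=4321)

# As in the previous cell, fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)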

model = LinearRegression()
model.fit(X_train, Y_train)

coeffs = model.coef_
bias = model.intercept_
print(f"Regression equation (with scaled feature): grade = {bias:.3f} + {coeffs[0]:.3f} × hours_studied")

score = model.score(X_test, Y_test)
print(f"R² score = {score:.3f}")
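
For reference, the R² score returned by `model.score` is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A minimal sketch of that computation, on hypothetical synthetic data (all names and values below are assumptions for illustration), could look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data (assumption): one feature, noisy linear target
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 5.0 + 1.2 * x_demo[:, 0] + rng.normal(0, 1.5, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

demo_model = LinearRegression().fit(x_tr, y_tr)
y_pred = demo_model.predict(x_te)

# R² = 1 - SS_res / SS_tot, computed on the test set
ss_res = np.sum((y_te - y_pred) ** 2)       # squared errors of the model
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # squared errors of predicting the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f"manual R² = {r2_manual:.3f}, model.score = {demo_model.score(x_te, y_te):.3f}")
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean grade, and the score can be negative on test data if the model does worse than that.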