# Extending Linear Regression to Non-Linear Models

In the previous chapters, we explored how to use linear regression to model relationships in data — that is, when the target variable can be described as a weighted sum of input features:

$$
y = w_0 + w_1 x_1 + w_2 x_2 + \dots
$$

However, in many real-world scenarios, the relationship between variables is not linear. For example, in a parabolic relationship, the output depends on the square of an input:

$$
y = a x^2 + b
$$

The good news is: we can **extend linear regression to handle such non-linear patterns** by transforming our input features. This process — often called *feature engineering* — allows us to use standard linear regression techniques on transformed (non-linear) features.

In this notebook, you will learn:

- How to visualize non-linear relationships in data  
- How to manually and automatically create non-linear features  
- How to use polynomial regression with `scikit-learn`  
- How to evaluate model performance and detect **underfitting** and **overfitting**


In [None]:
# Load the UCI wheat seeds dataset and display the first few rows

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset from the UCI repository.
# Some rows contain repeated tabs, so split on any run of whitespace
# instead of silently skipping those rows.
seeds = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt',
    sep=r'\s+',
    names=[
        'area', 'perimeter', 'compactness', 'length', 'width',
        'symmetry_coef', 'length_groove', 'seed_type'
    ]
)

# Display the first few rows of the dataset
seeds.head()

In [None]:
# Visualize the relationship between width and length of seeds

sns.scatterplot(data=seeds, x="width", y="length")
plt.title("Scatterplot of Seed Width vs. Length")
plt.xlabel("Width")
plt.ylabel("Length")
plt.show()

In [None]:
# Fit a simple linear regression model predicting length from width

from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X=seeds[["width"]], y=seeds["length"])

# Create prediction line over a range of width values
x_range = np.arange(2.5, 4.1, 0.1).reshape(-1, 1)
y_pred = model.predict(x_range)

# Plot data and regression line
sns.scatterplot(data=seeds, x="width", y="length")
sns.lineplot(x=x_range.flatten(), y=y_pred, color="red", label="Linear Fit")
plt.title("Linear Regression: Length ~ Width")
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()

The straight line misses the slight curvature in the data. We might guess a formula that also involves a quadratic term: $y = 6.7 - 2\cdot \text{width} + 0.5 \cdot \text{width}^2$

In [None]:
# Manually plot a parabolic curve to show a possible non-linear trend

x_range = np.arange(2.5, 4.1, 0.1)
y_parabola = 6.7 - 2 * x_range + 0.5 * x_range**2  # Just a guessed quadratic formula

sns.scatterplot(data=seeds, x="width", y="length")
sns.lineplot(x=x_range, y=y_parabola, color="green", label="Manual Quadratic Curve")
plt.title("Illustrating a Non-Linear Relationship")
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()

## Turning Linear Features into Non-Linear Ones: Feature Engineering

What did we just do in the previous cell?

We manually created a curve using a combination of terms like $x$ and $x^2$.  
This gives us an important idea: we can **augment our dataset** by adding new features that are **non-linear transformations** of existing ones.

Instead of modeling:
$$
y = w_0 + w_1 \cdot x_1
$$

we now model:
$$
y = w_0 + w_1 \cdot x_1 + w_2 \cdot \underbrace{x_1^2}_{:=x_2}
$$

This is still a linear regression model — but with **non-linear features**.
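
To make this concrete, here is a minimal sketch on synthetic data (the values and the name `model_demo` are made up for illustration, not taken from the seeds dataset). The target is generated from a known quadratic, and ordinary linear regression on the columns `[x, x^2]` should recover its coefficients, because the model is still linear in the weights $w_0, w_1, w_2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from a known quadratic: y = 1 + 2*x + 3*x^2
x = np.linspace(-2, 2, 25)
y_demo = 1 + 2 * x + 3 * x**2

# Feature matrix with columns [x, x^2]; the model is linear in its weights
X_quad = np.column_stack([x, x**2])

model_demo = LinearRegression()
model_demo.fit(X_quad, y_demo)

print("intercept (w0):", model_demo.intercept_)
print("weights (w1, w2):", model_demo.coef_)
```

The fitted intercept and weights should match the constants used to generate the data, up to floating-point error.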

We’ll now add a new column `width2` (the squared width) to our dataset and use it alongside `width` to fit a better model.


In [None]:
# Add a non-linear (squared) feature manually

seeds["width2"] = seeds["width"] ** 2

# Define input features and target variable
X = seeds[["width", "width2"]]
y = seeds["length"]

In [None]:
# Fit a linear regression model on the original and squared width features

model_nonlin = linear_model.LinearRegression()
model_nonlin.fit(X=X, y=y)

In [None]:
# Create prediction inputs: both width and width^2 over a range
x_range = np.arange(2.5, 4.1, 0.1)
X_pred = np.stack([x_range, x_range**2], axis=1)

# Plot the data and both models (linear and degree-2)
sns.scatterplot(data=seeds, x="width", y="length")
sns.lineplot(x=x_range, y=model.predict(x_range[:, np.newaxis]), color="red", label="Linear")
sns.lineplot(x=x_range, y=model_nonlin.predict(X_pred), color="green", label="Quadratic (degree 2)")

plt.title("Comparing Linear and Quadratic Fits")
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()

### Why do we need `np.stack` here?

Our quadratic model was trained with **two input features**: `width` and `width2` (the squared width).  
So when we want to make predictions, we need to provide both of these features, in the same column order, for each input value.

We start with a vector of width values:

```python
x_range = np.arange(2.5, 4.1, 0.1)  # shape: (n_samples,)
```

Now we want to create a 2D array where:

- The **first column** is `x_range` (the width values),
- The **second column** is `x_range**2` (the squared values).

We use `np.stack` to combine them **column-wise**:

```python
X_pred = np.stack([x_range, x_range**2], axis=1)
```

Now `X_pred` has shape `(n_samples, 2)`, which matches what the model expects:  
a matrix where each row is an input `[width, width²]`.
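
The effect of the `axis` argument is easiest to see on a tiny array. The sketch below uses three made-up width values and compares `axis=1` with `axis=0`; it also shows `np.column_stack`, which gives the same result for this column-wise case.

```python
import numpy as np

widths = np.array([2.5, 3.0, 3.5])  # shape: (3,)

# axis=1: one row per sample, one column per feature -> shape (3, 2)
X_cols = np.stack([widths, widths**2], axis=1)

# axis=0: one row per feature -> shape (2, 3), NOT what the model expects
X_rows = np.stack([widths, widths**2], axis=0)

# np.column_stack is an equivalent shortcut for the column-wise case
X_alt = np.column_stack([widths, widths**2])

print(X_cols.shape, X_rows.shape)
print(np.array_equal(X_cols, X_alt))
```

Passing the `axis=0` version to `predict` would fail, because the model expects exactly two columns per row.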


In [None]:
# Compare model performance using Mean Squared Error (MSE)

from sklearn import metrics

mse_linear = metrics.mean_squared_error(seeds["length"], model.predict(seeds[["width"]]))
mse_quadratic = metrics.mean_squared_error(seeds["length"], model_nonlin.predict(seeds[["width", "width2"]]))

print(f"MSE (Linear Model):    {mse_linear:.5f}")
print(f"MSE (Quadratic Model): {mse_quadratic:.5f}")
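
Both errors above are computed on the same data the models were trained on. For least squares, adding a feature can never increase this training error, so a lower value alone does not prove that the quadratic model generalizes better. As a reminder of what the metric measures, the sketch below computes the MSE by hand on four made-up values and compares it with `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([5.0, 5.5, 6.0, 6.5])
y_hat = np.array([5.1, 5.4, 6.3, 6.5])

# MSE = mean of the squared differences between targets and predictions
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(f"manual:  {mse_manual:.4f}")
print(f"sklearn: {mse_sklearn:.4f}")
```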

## Automating Feature Expansion with `PolynomialFeatures`

Instead of manually adding new features like $x^2$, we can use `scikit-learn`’s `PolynomialFeatures` class to do this automatically.

This tool generates all polynomial combinations of the input features up to a specified degree.

For example, if we input just `width` and set `degree=2`, it will generate:

- a bias term (1),
- `width`,
- `width²`.

We’ll now redo our previous example using this more general and scalable approach.


In [None]:
# Automatically generate polynomial features up to degree 2

from sklearn.preprocessing import PolynomialFeatures

# Instantiate the transformer for degree 2 polynomials
poly = PolynomialFeatures(degree=2)

# Input feature (just 'width')
X = seeds[["width"]]

# Transform into [1, width, width^2]
X_poly = poly.fit_transform(X)

In [None]:
# Let's see what X looks like
X

In [None]:
# And the transformed features X_poly
X_poly

In [None]:
# Fit a linear regression model on the polynomial features (degree 2)

model_poly = linear_model.LinearRegression()
model_poly.fit(X=X_poly, y=y)

In [None]:
# Generate predictions using the fitted polynomial model

# Create width values and expand them into polynomial features
x_range = np.arange(2.5, 4.1, 0.1)
X_pred = poly.transform(x_range[:, np.newaxis])

# Plot the original data and both model fits
sns.scatterplot(data=seeds, x="width", y="length")
sns.lineplot(x=x_range, y=model.predict(x_range[:, np.newaxis]), color="red", label="Linear")
sns.lineplot(x=x_range, y=model_poly.predict(X_pred), color="green", label="Degree 2 (PolynomialFeatures)")
plt.title("PolynomialFeatures vs. Linear Regression")
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()
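
The transform-then-fit steps can also be bundled into a single estimator with `make_pipeline`. This is an alternative to the cells above, not what they do. The sketch below uses synthetic data generated from the guessed quadratic plus a little noise; the names `x_demo`, `y_demo`, and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the guessed quadratic plus small Gaussian noise (an assumption)
rng = np.random.default_rng(0)
x_demo = rng.uniform(2.5, 4.0, size=50)
y_demo = 6.7 - 2 * x_demo + 0.5 * x_demo**2 + rng.normal(0, 0.05, size=50)

# The pipeline expands the features and fits the regression in one object
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipe.fit(x_demo[:, np.newaxis], y_demo)

# predict() applies the same feature expansion automatically
x_new = np.array([[3.0], [3.5]])
print(pipe.predict(x_new))
```

With a pipeline there is no separate `poly.transform` call to remember at prediction time.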

In [None]:
# Fit a polynomial regression model of degree 4

# Prepare features and labels
X = seeds[["width"]]
y = seeds["length"]

# Create polynomial features up to degree 4
poly4 = PolynomialFeatures(degree=4)
X_poly4 = poly4.fit_transform(X)

# Fit the model
model_poly4 = linear_model.LinearRegression()
model_poly4.fit(X=X_poly4, y=y)

# Generate predictions
x_range = np.arange(2.5, 4.1, 0.1)
X_pred4 = poly4.transform(x_range[:, np.newaxis])

# Plot original data and all three fits
sns.scatterplot(data=seeds, x="width", y="length")
sns.lineplot(x=x_range, y=model.predict(x_range[:, np.newaxis]), color="red", label="Linear")
sns.lineplot(x=x_range, y=model_poly.predict(poly.transform(x_range[:, np.newaxis])), color="green", label="Degree 2")
sns.lineplot(x=x_range, y=model_poly4.predict(X_pred4), color="purple", label="Degree 4")
plt.title("Comparing Linear, Degree 2, and Degree 4 Fits")
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()

In [None]:
# Illustrate underfitting and overfitting by using a small sample

from sklearn.metrics import mean_squared_error

# Take a small random sample of the data
seeds_sample = seeds.sample(20, random_state=42)
X_sample = seeds_sample[["width"]]
y_sample = seeds_sample["length"]

# Plot the data and fit polynomials of degrees 1, 2, and 8
ax = sns.scatterplot(data=seeds_sample, x="width", y="length")
models = {}

for degree in [1, 2, 8]:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_sample)

    model_poly = linear_model.LinearRegression()
    model_poly.fit(X=X_poly, y=y_sample)

    models[degree] = model_poly

    # Generate predictions for plotting
    x_range = np.arange(2.5, 4.3, 0.1)
    X_pred = poly.transform(x_range[:, np.newaxis])
    y_pred = model_poly.predict(X_pred)

    # Print training error
    mse = mean_squared_error(y_sample, model_poly.predict(X_poly))
    print(f"MSE (degree {degree}): {mse:.5f}")

    # Plot the model
    sns.lineplot(x=x_range, y=y_pred, label=f"Degree {degree}")

ax.set_title("Fitting on a Small Sample")
ax.set_ylim((3, 7))
plt.xlabel("Width")
plt.ylabel("Length")
plt.legend()
plt.show()

In [None]:
# Validate models on a different data sample to assess generalization

# Take a new random sample that does not overlap with the first one
# (drop the rows already used for fitting before sampling)
seeds_sample2 = seeds.drop(seeds_sample.index).sample(20, random_state=22)
X_val = seeds_sample2[["width"]]
y_val = seeds_sample2["length"]

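# Predict using all three models on the new sample.
# PolynomialFeatures learns nothing from the data values, so a fresh transformer
# of the same degree yields the same columns the model was trained on.
y_pred_1 = models[1].predict(PolynomialFeatures(degree=1).fit_transform(X_val))
y_pred_2 = models[2].predict(PolynomialFeatures(degree=2).fit_transform(X_val))
y_pred_8 = models[8].predict(PolynomialFeatures(degree=8).fit_transform(X_val))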

# Plot residuals (difference between actual and predicted values)
fig, ax = plt.subplots()
plt.plot(y_val.values - y_pred_1, label="Degree 1")
plt.plot(y_val.values - y_pred_2, label="Degree 2")
plt.plot(y_val.values - y_pred_8, label="Degree 8")
ax.set_title("Residuals on Validation Set")
ax.set_ylabel("Prediction Error")
ax.legend()
plt.show()

# Print validation MSE for each model
print(f"MSE (degree 1): {mean_squared_error(y_val, y_pred_1):.5f}")
print(f"MSE (degree 2): {mean_squared_error(y_val, y_pred_2):.5f}")
print(f"MSE (degree 8): {mean_squared_error(y_val, y_pred_8):.5f}")

In [None]:
# Evaluate validation error across a range of polynomial degrees

# Define training and validation samples
seeds_sample = seeds.sample(20, random_state=42)
X_train = seeds_sample[["width"]]
y_train = seeds_sample["length"]

# Exclude the training rows so the validation sample is truly unseen
seeds_sample2 = seeds.drop(seeds_sample.index).sample(20, random_state=22)
X_val = seeds_sample2[["width"]]
y_val = seeds_sample2["length"]

# Evaluate models of increasing complexity (degrees 0 to 9)
errors = []

for degree in range(10):
    poly = PolynomialFeatures(degree=degree)

    X_train_poly = poly.fit_transform(X_train)
    X_val_poly = poly.transform(X_val)

    model_poly = linear_model.LinearRegression()
    model_poly.fit(X_train_poly, y_train)

    val_pred = model_poly.predict(X_val_poly)
    mse = mean_squared_error(y_val, val_pred)
    errors.append(mse)

# Plot validation error by model degree
fig, ax = plt.subplots()
ax.plot(range(10), errors, "-o")
ax.set_title("Validation Error vs. Polynomial Degree")
ax.set_xlabel("Polynomial Degree")
ax.set_ylabel("Mean Squared Error (MSE)")
plt.show()
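
It can help to track the training error next to the validation error. The sketch below does this on synthetic quadratic data (the generating formula, noise level, and sample sizes are assumptions chosen for illustration) and picks the degree with the lowest validation MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

def make_data(n):
    """Draw n noisy samples from y = 1 + 2x + 3x^2 (synthetic, for illustration)."""
    x = rng.uniform(-3, 3, size=n)
    y = 1 + 2 * x + 3 * x**2 + rng.normal(0, 1.0, size=n)
    return x[:, np.newaxis], y

X_tr, y_tr = make_data(40)
X_va, y_va = make_data(40)

train_errors, val_errors = [], []
for d in range(7):
    pf = PolynomialFeatures(degree=d)
    model_d = LinearRegression().fit(pf.fit_transform(X_tr), y_tr)
    train_errors.append(mean_squared_error(y_tr, model_d.predict(pf.transform(X_tr))))
    val_errors.append(mean_squared_error(y_va, model_d.predict(pf.transform(X_va))))

# Choose the degree that minimizes the validation error
best_degree = int(np.argmin(val_errors))
print("best degree by validation MSE:", best_degree)
```

The training error can only go down as the degree grows, while the validation error typically drops sharply once the model is flexible enough and then flattens or rises again.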

## Training and Validation Splits with `train_test_split`

In practice, we rarely sample training and validation data manually.

Instead, we use utility functions like `train_test_split` from `scikit-learn` to randomly split our dataset into **training** and **testing (validation)** sets.

This allows us to:

- Train our model on one portion of the data
- Evaluate how well it generalizes on unseen data
- Avoid overfitting by tuning model complexity based on validation performance

Let’s see how it works.

In [None]:
# Automatically split the dataset into training and testing sets

from sklearn.model_selection import train_test_split

# Split: 70% training, 30% test
seeds_train, seeds_test = train_test_split(seeds, test_size=0.3, random_state=0)

print(f"Training samples: {len(seeds_train)}")
print(f"Testing samples:  {len(seeds_test)}")
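
Once the data is split, the model is fitted on the training part only and scored on the held-out part. The sketch below shows one way this might look, using a synthetic frame `demo` in place of the seeds data (column names and the generating formula are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the seeds data (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"width": rng.uniform(2.5, 4.0, size=100)})
demo["length"] = (
    6.7 - 2 * demo["width"] + 0.5 * demo["width"] ** 2 + rng.normal(0, 0.1, size=100)
)

demo_train, demo_test = train_test_split(demo, test_size=0.3, random_state=0)

# Fit the feature transformer and the model on the training rows only
poly_split = PolynomialFeatures(degree=2)
X_train_poly = poly_split.fit_transform(demo_train[["width"]])
model_split = LinearRegression().fit(X_train_poly, demo_train["length"])

# Reuse the fitted transformer on the test rows, then score
X_test_poly = poly_split.transform(demo_test[["width"]])
mse_train = mean_squared_error(demo_train["length"], model_split.predict(X_train_poly))
mse_test = mean_squared_error(demo_test["length"], model_split.predict(X_test_poly))

print(f"Train MSE: {mse_train:.5f}")
print(f"Test MSE:  {mse_test:.5f}")
```

A test error far above the training error would be a sign of overfitting.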

## Exercises

These exercises will help you practice what you’ve learned using a new dataset.


### Dataset: `centripetal.csv`

This dataset was collected using a smartphone while someone rotated in place. It includes the following columns:

- `Angular velocity (rad/s)`: the rotational speed in **radians per second**  
- `Acceleration (m/s^2)`: the **inward acceleration** experienced during rotation (as on a carousel), measured in **m/s²**

There is a **non-linear relationship** between these two quantities:
$$
a = \omega^2 \cdot r
$$
where $a$ is the centripetal acceleration, $\omega$ is the angular speed, and $r$ is the radius (assumed constant here).


### Tasks

1. **Load and explore the data**  
   - Load the `centripetal.csv` file.  
   - Plot centripetal acceleration vs. angular speed.

2. **Fit a linear regression model**  
   - Use angular speed to predict acceleration.  
   - Compute and report the **mean squared error (MSE)**.

3. **Fit a polynomial regression model (degree 2)**  
   - Use `PolynomialFeatures` to generate features.  
   - Fit a model and compute the MSE.  
   - Which model fits the data better?

4. **Test generalization with train/test split**  
   - Use `train_test_split` to create an 80% training / 20% test split.  
   - Fit a **degree 10 polynomial model** on the training set.  
   - Plot the model’s predictions and evaluate its test error.  
   - What do you observe? Why might this happen?


💡 **Hint:** Centripetal acceleration increases **non-linearly** with angular speed. This is a great example of a case where a **linear model may be misleading**.