# Regression Case Study: Horsepower Prediction with Regression Models

**Dataset:** `FuelEconomy.csv`  
**Task:** Predict horsepower using regression  
**Models:**  
- Linear Regression  
- Polynomial Regression (degree 2, 3, 4)  
**Regularization:** **Not used** (as requested)

---

## What I will do in this notebook

### 1.1 Load and inspect the dataset
- Load the CSV into a pandas DataFrame.
- Display column names, shape, and summary statistics (`describe()`).
- Identify missing values (if any) and state how they are handled.

### 1.2 Train/Test split
- Randomly split the dataset into 70% training and 30% testing.
- Use a fixed random state for reproducibility.

### 1.3 Model training: linear + polynomial regression
- Train the following models to predict HP:
    - (a) Linear Regression
    - (b) Polynomial Regression (degree 2)
    - (c) Polynomial Regression (degree 3)
    - (d) Polynomial Regression (degree 4)

### 1.4 Model evaluation (train and test)
- For each model, report metrics on both train and test sets: MSE, MAE, $R^2$
- Present results in a clean table
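
For reference, the three metrics can be written out as below. These are the standard definitions; the symbols are notation introduced here for illustration: $n$ is the number of rows in the set being scored, $y_i$ the observed horsepower, $\hat{y}_i$ the predicted horsepower, and $\bar{y}$ the mean of the observed values.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

MSE is in squared horsepower units and MAE is in horsepower units, so lower is better for both; $R^2$ is unitless and is better the closer it is to 1.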

### 1.5 Discussion and interpretation


In [15]:

# ============================================================
# Imports
# ============================================================

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


In [16]:
# ============================================================
# Load dataset
# ============================================================

DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())

display(df.head())

print("\nSummary statistics:")
display(df.describe(include="all"))

print("\nMissing values per column:")
display(df.isna().sum())

Shape: (100, 2)

Columns:
['Horse Power', 'Fuel Economy (MPG)']


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| 0 | 118.770799 | 29.344195 |
| 1 | 176.326567 | 24.695934 |
| 2 | 219.262465 | 23.95201 |
| 3 | 187.310009 | 23.384546 |
| 4 | 218.59434 | 23.426739 |



Summary statistics:


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |



Missing values per column:


Horse Power           0
Fuel Economy (MPG)    0
dtype: int64

In [17]:
# ============================================================
# Utility functions
# ============================================================

TARGET_COL = "Horse Power"

def prepare_xy(df_in, target_col=TARGET_COL):
    """Drop missing rows, split into X and y."""
    df_clean = df_in.dropna().copy()
    X = df_clean.drop(columns=[target_col])
    y = df_clean[target_col]
    return X, y

def split_data(X, y, test_size=0.30, random_state=42):
    """70/30 random train-test split."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def compute_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

def run_models_and_evaluate(df_in, degrees=(1, 2, 3, 4),
                            target_col=TARGET_COL, test_size=0.30, random_state=42):
    """Train/evaluate linear (deg=1) + polynomial regression models.

    Returns a DataFrame of train and test metrics, one row per model.
    """
    X, y = prepare_xy(df_in, target_col=target_col)
    X_train, X_test, y_train, y_test = split_data(X, y, test_size=test_size, random_state=random_state)

    rows = []

    for deg in degrees:
        if deg == 1:
            model = LinearRegression()
            model_name = "Linear Regression"
        else:
            model = Pipeline([
                ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
                ("lr", LinearRegression())
            ])
            model_name = f"Polynomial Regression (degree={deg})"

        # Fit model
        model.fit(X_train, y_train)

        # Predict
        yhat_train = model.predict(X_train)
        yhat_test  = model.predict(X_test)

        # Metrics
        train_m = compute_metrics(y_train, yhat_train)
        test_m  = compute_metrics(y_test, yhat_test)

        rows.append({
            "Model": model_name,
            "Train MSE": train_m["MSE"],
            "Train MAE": train_m["MAE"],
            "Train R^2": train_m["R^2"],
            "Test MSE": test_m["MSE"],
            "Test MAE": test_m["MAE"],
            "Test R^2": test_m["R^2"],
        })

    return pd.DataFrame(rows)

results = run_models_and_evaluate(df)

display(results)


| | Model | Train MSE | Train MAE | Train R^2 | Test MSE | Test MAE | Test R^2 |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 357.69918 | 16.061689 | 0.90632 | 318.561087 | 14.940628 | 0.912561 |
| 1 | Polynomial Regression (degree=2) | 350.879731 | 15.995824 | 0.908106 | 331.105434 | 15.14833 | 0.909118 |
| 2 | Polynomial Regression (degree=3) | 345.108668 | 15.746762 | 0.909618 | 318.404012 | 14.764973 | 0.912604 |
| 3 | Polynomial Regression (degree=4) | 339.700171 | 15.508465 | 0.911034 | 313.798757 | 14.735471 | 0.913868 |


### 1.5 Discussion and interpretation

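The degree-4 polynomial regression model had the best test-set scores: the lowest test MSE (313.80) and MAE (14.74) and the highest $R^2$ (0.9139). Its margin over plain linear regression is small, however (test MSE 318.56, $R^2$ 0.9126), and the test set holds only about 30 rows, so this ranking is tentative rather than clear evidence that the higher-order terms generalize better.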

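Increasing the polynomial degree does not always improve test performance. Going from degree 1 to degree 2, test MSE rose from 318.56 to 331.11, test MAE rose from 14.94 to 15.15, and test $R^2$ fell from 0.9126 to 0.9091, even though every training metric improved with each increase in degree.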
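
Because the gaps between models are small, one way to check whether they are stable is to repeat the 70/30 split with several random seeds and look at the spread of test MSE per degree. The sketch below is an assumption-laden illustration, not part of the original analysis: `mse_over_seeds` is a hypothetical helper, the choice of 20 seeds is arbitrary, and the data is synthetic, so the numbers it prints are not results for `FuelEconomy.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def mse_over_seeds(X, y, degrees=(1, 2, 3, 4), seeds=range(20)):
    """Return {degree: array of test MSEs, one per random 70/30 split}."""
    scores = {deg: [] for deg in degrees}
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=seed
        )
        for deg in degrees:
            # degree=1 leaves the features unchanged, i.e. plain linear regression
            model = Pipeline([
                ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
                ("lr", LinearRegression()),
            ]).fit(X_tr, y_tr)
            scores[deg].append(mean_squared_error(y_te, model.predict(X_te)))
    return {deg: np.array(vals) for deg, vals in scores.items()}


# Synthetic stand-in data (assumption: NOT the FuelEconomy.csv values).
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 35, size=100).reshape(-1, 1)
y_demo = 500 - 12 * X_demo.ravel() + rng.normal(0, 15, size=100)

scores = mse_over_seeds(X_demo, y_demo)
for deg, vals in scores.items():
    print(f"degree={deg}: mean test MSE={vals.mean():.2f}, std={vals.std():.2f}")
```

If the standard deviation across seeds is larger than the difference between two models' mean test MSE, the ordering seen in a single split is probably not meaningful.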