# Regression Case Study: Electricity Consumption Prediction with Regression Models

**Dataset:** `electricity_consumption_based_weather_dataset.csv`  
**Task:** Predict daily electricity consumption  
**Models:**  
- Linear Regression  
- Polynomial Regression (degree 2, 3, 4)  
**Regularization:** **Not used** (as requested)

---

## What I will do in this notebook

### 2.1 Load and inspect the dataset
- Load the CSV into a pandas DataFrame.
- Display column names, shape, and summary statistics (describe()).
- Clearly identify the dependent variable: daily consumption.
- Identify missing values (if any) and state how they are handled: rows containing missing values are dropped.

### 2.2 Train/Test split
- Randomly split the dataset into 70% training and 30% testing.
- Use a fixed random state for reproducibility.

### 2.3 Model training: linear + polynomial regression
- Train the following models to predict daily electricity consumption (`daily_consumption`):
    - (a) Linear Regression
    - (b) Polynomial Regression (degree 2)
    - (c) Polynomial Regression (degree 3)
    - (d) Polynomial Regression (degree 4)

### 2.4 Model evaluation (train and test)
- For each model, report metrics on both train and test sets: MSE, MAE, $R^2$
- Present results in a clean table

### 2.5 Discussion and interpretation


In [7]:

# ============================================================
# Imports
# ============================================================

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


In [8]:
# ============================================================
# Load dataset
# ============================================================

DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())

display(df.head())

print("\nSummary statistics:")
display(df.describe(include="all"))

print("\nMissing values per column:")
display(df.isna().sum())


Shape: (1433, 6)

Columns:
['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']


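|   | date | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|------|------|------|------|------|-------------------|
| 0 | 2006-12-16 | 2.5 | 0.0 | 10.6 | 5.0 | 1209.176 |
| 1 | 2006-12-17 | 2.6 | 0.0 | 13.3 | 5.6 | 3390.46 |
| 2 | 2006-12-18 | 2.4 | 0.0 | 15.0 | 6.7 | 2203.826 |
| 3 | 2006-12-19 | 2.4 | 0.0 | 7.2 | 2.2 | 1666.194 |
| 4 | 2006-12-20 | 2.4 | 0.0 | 7.2 | 1.1 | 2225.748 |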



Summary statistics:


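|   | date | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|------|------|------|------|------|-------------------|
| count | 1433 | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| unique | 1433 |  |  |  |  |  |
| top | 2006-12-16 |  |  |  |  |  |
| freq | 1 |  |  |  |  |  |
| mean |  | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std |  | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min |  | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% |  | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% |  | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% |  | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |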



Missing values per column:


date                  0
AWND                 15
PRCP                  0
TMAX                  0
TMIN                  0
daily_consumption     0
dtype: int64
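
Only `AWND` has missing entries (15), so dropping incomplete rows should leave 1433 - 15 = 1418 rows. A minimal sketch of the split sizes this implies, assuming the same `test_size=0.30` and `random_state=42` used by the utility functions in this notebook. The array `row_ids` is a hypothetical placeholder standing in for the cleaned rows; it is not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows_total = 1433    # from df.shape
n_rows_missing = 15    # AWND is the only column with missing values
n_rows_clean = n_rows_total - n_rows_missing

# Placeholder for the cleaned rows (assumption for illustration);
# the notebook itself splits the feature matrix X and the target y.
row_ids = np.arange(n_rows_clean)
train_ids, test_ids = train_test_split(row_ids, test_size=0.30, random_state=42)

print("Rows after dropna:", n_rows_clean)
print("Train rows:", len(train_ids))
print("Test rows:", len(test_ids))
```

The training set therefore holds roughly a thousand rows, which is the budget available for fitting every coefficient of the models below.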

In [9]:
# ============================================================
# Utility functions
# ============================================================

TARGET_COL = "daily_consumption"

def prepare_xy(df_in, target_col=TARGET_COL):
    """Drop missing rows, split into X and y."""
    df_clean = df_in.dropna().copy()
    X = df_clean.drop(columns=[target_col, "date"])
    y = df_clean[target_col]
    return X, y

def split_data(X, y, test_size=0.30, random_state=42):
    """70/30 random train-test split."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def compute_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

def run_models_and_evaluate(df_in, degrees=(1, 2, 3, 4),
                            target_col=TARGET_COL, test_size=0.30, random_state=42):
    """Train/evaluate linear (deg=1) + polynomial regression models.

    Splits the cleaned data 70/30, fits one model per degree, and
    returns a DataFrame with train and test MSE, MAE, and R^2 per model.
    """
    X, y = prepare_xy(df_in, target_col=target_col)
    X_train, X_test, y_train, y_test = split_data(X, y, test_size=test_size, random_state=random_state)

    rows = []

    for deg in degrees:
        if deg == 1:
            model = LinearRegression()
            model_name = "Linear Regression"
        else:
            model = Pipeline([
                ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
                ("lr", LinearRegression())
            ])
            model_name = f"Polynomial Regression (degree={deg})"

        # Fit model
        model.fit(X_train, y_train)

        # Predict
        yhat_train = model.predict(X_train)
        yhat_test  = model.predict(X_test)

        # Metrics
        train_m = compute_metrics(y_train, yhat_train)
        test_m  = compute_metrics(y_test, yhat_test)

        rows.append({
            "Model": model_name,
            "Train MSE": train_m["MSE"],
            "Train MAE": train_m["MAE"],
            "Train R^2": train_m["R^2"],
            "Test MSE": test_m["MSE"],
            "Test MAE": test_m["MAE"],
            "Test R^2": test_m["R^2"],
        })

    return pd.DataFrame(rows)

results = run_models_and_evaluate(df)

display(results)


|   | Model | Train MSE | Train MAE | Train R^2 | Test MSE | Test MAE | Test R^2 |
|---|-------|-----------|-----------|-----------|----------|----------|----------|
| 0 | Linear Regression | 272403.396174 | 384.465016 | 0.276 | 248125.8 | 375.404537 | 0.299333 |
| 1 | Polynomial Regression (degree=2) | 264765.769932 | 379.648753 | 0.2963 | 255268.5 | 379.039083 | 0.279163 |
| 2 | Polynomial Regression (degree=3) | 259249.53487 | 375.952901 | 0.310961 | 265623.7 | 385.235167 | 0.249922 |
| 3 | Polynomial Regression (degree=4) | 251909.339001 | 372.116566 | 0.33047 | 12151490.0 | 578.642201 | -33.313844 |


### 2.5 Discussion and interpretation

The linear regression model performed best on the test set: it achieved the lowest test MSE (about 248,126) and MAE (about 375.4) and the highest test $R^2$ (about 0.299), indicating the best generalization to unseen data.

The polynomial models do not improve on linear regression. With each increase in degree, test MSE and MAE rise and test $R^2$ falls, even though the training metrics keep improving. Electricity consumption can depend nonlinearly on weather (e.g., heating/cooling thresholds), but the available features and data volume here are not sufficient for higher-degree polynomials to learn those patterns reliably.
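
To make the point about data volume concrete, the sketch below counts how many input columns `PolynomialFeatures` produces for the four weather features at each degree, using the same `include_bias=False` setting as the pipeline. The array `dummy_X` is a hypothetical placeholder; only its number of columns matters.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 4  # AWND, PRCP, TMAX, TMIN
dummy_X = np.zeros((1, n_inputs))  # placeholder; values are irrelevant here

feature_counts = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    poly.fit(dummy_X)
    feature_counts[degree] = poly.n_output_features_

for degree, count in feature_counts.items():
    print(f"degree={degree}: {count} features")
```

The count grows from 4 columns at degree 1 to 69 at degree 4, so the degree-4 model has many more coefficients to estimate from the same training rows.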

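Higher-degree models perform worse because of overfitting. For degree 4, training MSE falls to its lowest value (about 251,909), while test MSE jumps to about $1.2 \times 10^7$ and test $R^2$ becomes strongly negative (about $-33.3$), meaning the model predicts the test set far worse than simply predicting the mean would. Test MAE rises much less (to about 579), which suggests the blow-up is driven by a small number of extreme predictions. The model is fitting noise in the training data rather than learning generalizable structure.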
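
A negative $R^2$ can look surprising, so here is a small hand computation of $R^2 = 1 - SS_{res}/SS_{tot}$ on made-up numbers. The toy arrays and the helper `r2_by_hand` are illustrative assumptions, not values or code from this dataset.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]           # toy targets, mean = 6

r2_mean = r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0])      # always predict the mean
r2_decent = r2_by_hand(y_true, [2.0, 5.0, 8.0, 11.0])   # roughly right
r2_extreme = r2_by_hand(y_true, [3.0, 5.0, 7.0, 109.0]) # one wild prediction

print("predict the mean:      ", r2_mean)
print("reasonable predictions:", r2_decent)
print("one extreme prediction:", r2_extreme)
```

Predicting the mean gives $R^2 = 0$, so any value below zero means the squared errors exceed the spread of the targets themselves. In the toy case a single wild prediction is enough to push $R^2$ far below zero even though the other three predictions are exact.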

None of the models achieves good test performance; even the best one explains only about 30% of the test-set variance. One reason is insufficient feature information: electricity usage depends heavily on factors not included here, such as occupancy, building characteristics, and appliance usage, so weather alone cannot fully explain consumption. The other reason is high noise: the low test $R^2$ values indicate that a large portion of the variance is unexplained. For example, seasonal effects, holidays, and human behavior introduce variability that these models cannot capture.
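
One way to read the errors in the target's own units is to take the square root of each reported test MSE (RMSE) and compare it with the standard deviation of `daily_consumption` from the summary statistics. This is only a rough sketch: the MSE values are copied from the results table, and the standard deviation is computed over all 1433 rows rather than the test set, so the ratio is an approximation.

```python
import math

# Test MSE values copied from the results table above.
test_mse = {
    "Linear Regression": 248125.8,
    "Polynomial Regression (degree=2)": 255268.5,
    "Polynomial Regression (degree=3)": 265623.7,
    "Polynomial Regression (degree=4)": 12151490.0,
}

# Std of daily_consumption from df.describe() (all rows, not just the test set).
target_std = 606.819667

test_rmse = {name: math.sqrt(mse) for name, mse in test_mse.items()}

for name, rmse in test_rmse.items():
    ratio = rmse / target_std
    print(f"{name}: RMSE = {rmse:.1f} ({ratio:.2f} x target std)")
```

An RMSE close to the target's standard deviation means the model is only modestly better than predicting a constant, which matches the low $R^2$ values.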
