# **EE 344 HW 1 - Aarav Wadhwani**

## Setup

### 1) Imports

In [59]:
# ============================================================
# Imports
# ============================================================

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

### 2) Utility functions

Below, we run the same **regression experiments** (linear and polynomial models) on more than one dataset.  
To keep the notebook clean and avoid copy-pasting the same code, we define a few **helper functions**.

These functions handle the most common steps in any supervised learning workflow:
1. Preparing features **X** and target **y**
2. Splitting data into **train** and **test**
3. Training linear and polynomial regression models
4. Computing evaluation metrics
5. Visualizing predicted vs actual values on the test set
6. Printing a simplified version of the learned regression equation

---

### What each function does

- **`prepare_xy(df_in)`**  
  Removes rows with missing values and splits the dataset into:
  - **X** = input features (all columns except the target)  
  - **y** = target variable (by default `Horse Power`, set via `TARGET_COL`)

- **`split_data(X, y)`**  
  Performs a **70% / 30% train-test split** using a fixed `random_state` so that results are reproducible.

- **`compute_metrics(y_true, y_pred)`**  
  Computes three standard regression evaluation metrics:
  - **MSE (Mean Squared Error):** penalizes large errors more strongly  
  - **MAE (Mean Absolute Error):** average absolute prediction error  
  - **R² (Coefficient of Determination):** measures how well the model explains the variance in the data

- **`print_fitted_equation(...)`**  
  Prints the fitted regression model equation using the learned coefficients.  
  For polynomial regression, the number of terms can become very large, so the notebook prints only the **top terms** with the largest coefficient magnitude.

- **`plot_actual_vs_predicted_test(...)`**  
  Creates a scatter plot for the **test set** showing:
  - **Actual values** (blue circles)
  - **Predicted values** (red x’s)  
  This helps visually compare how close the predictions are to real values.

- **`run_models_and_evaluate(...)`**  
  This is the main driver function that runs everything for a given scenario:
  - Trains **Linear Regression** and **Polynomial Regression (degrees 2, 3, 4)**
  - Evaluates **train and test** performance using MSE, MAE, and R²
  - Records the train and test set sizes
  - Returns a clean results table for easy comparison

  As defined below, the driver does not itself print the fitted equation or draw the test-set scatter plots; its `top_k_terms` argument is accepted but unused.

---

✅ After this section, the rest of the notebook becomes much shorter and easier to read, because each scenario can reuse these helper functions.
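The "top terms" idea described for `print_fitted_equation` can be sketched as follows. This is a hypothetical, NumPy-only illustration rather than the notebook's own helper: the name `format_top_terms` and its arguments are assumptions. With a fitted scikit-learn pipeline, the term names could come from `PolynomialFeatures.get_feature_names_out()` and the coefficients from `LinearRegression.coef_` and `intercept_`.

```python
import numpy as np

def format_top_terms(intercept, coefs, names, top_k=15, target="y"):
    """Format a fitted linear equation, keeping only the top_k largest |coefficients|.

    Hypothetical helper for illustration; not part of the notebook.
    """
    coefs = np.asarray(coefs, dtype=float)
    # Indices of coefficients sorted by decreasing magnitude, truncated to top_k.
    order = np.argsort(-np.abs(coefs))[:top_k]
    terms = [f"{coefs[i]:+.4g}*{names[i]}" for i in order]
    omitted = len(coefs) - len(order)
    tail = f" ... [{omitted} smaller term(s) omitted]" if omitted > 0 else ""
    return f"{target} = {intercept:.4g} " + " ".join(terms) + tail

# Made-up coefficients, used only to show the formatting.
line = format_top_terms(1.5, [0.1, -3.0, 2.0], ["a", "b", "c"], top_k=2)
print(line)
```

Ranking by raw coefficient magnitude is only meaningful when the features are on comparable scales; for unscaled polynomial features, the largest coefficients are not necessarily the most influential terms.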


In [60]:
# ============================================================
# Utility functions
# ============================================================

TARGET_COL = "Horse Power"

def prepare_xy(df_in, target_col=TARGET_COL):
    """Drop missing rows, split into X and y."""
    df_clean = df_in.dropna().copy()
    X = df_clean.drop(columns=[target_col])
    y = df_clean[target_col]
    return X, y

def split_data(X, y, test_size=0.30, random_state=42):
    """70/30 random train-test split."""
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

def compute_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

def run_models_and_evaluate(df_in, degrees=(1, 2, 3, 4),
                            target_col=TARGET_COL, test_size=0.30, random_state=42,
                            top_k_terms=15):
    """Train/evaluate linear (deg=1) + polynomial regression models.

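    Rows with missing values are dropped before the split, and `top_k_terms`
    is currently unused.

    Returns a DataFrame of metrics, one row per model.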
    """
    X, y = prepare_xy(df_in, target_col=target_col)
    X_train, X_test, y_train, y_test = split_data(X, y, test_size=test_size, random_state=random_state)

    rows = []

    for deg in degrees:
        if deg == 1:
            model = LinearRegression()
            model_name = "Linear Regression"
        else:
            model = Pipeline([
                ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
                ("lr", LinearRegression())
            ])
            model_name = f"Polynomial Regression (degree={deg})"

        # Fit model
        model.fit(X_train, y_train)

        # Predict
        yhat_train = model.predict(X_train)
        yhat_test  = model.predict(X_test)

        # Metrics
        train_m = compute_metrics(y_train, yhat_train)
        test_m  = compute_metrics(y_test, yhat_test)

        rows.append({
            "Model": model_name,
            "Train MSE": train_m["MSE"],
            "Train MAE": train_m["MAE"],
            "Train R^2": train_m["R^2"],
            "Test MSE": test_m["MSE"],
            "Test MAE": test_m["MAE"],
            "Test R^2": test_m["R^2"],
            "Train size": len(X_train),
            "Test size": len(X_test),
        })

    return pd.DataFrame(rows)



# Part 1: Regression Case Study: Horse Power Prediction with Regression Models

**Dataset:** `FuelEconomy.csv`  
**Task:** Build regression models to predict horsepower (HP) from fuel economy (MPG), the only feature in the dataset.  


**Models:**  
- Linear Regression  
- Polynomial Regression (degree 2, 3, 4)

**Regularization:** **Not used** (as requested)

---

## Evaluation Metrics (Train & Test)

For each model, we report:
- Mean Squared Error (**MSE**)
- Mean Absolute Error (**MAE**)
- Coefficient of Determination (**R²**)
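For reference, the three metrics can be written out directly from their definitions. The sketch below uses NumPy only; `regression_metrics` is a hypothetical name, and the three-point example is made up for illustration. The notebook itself relies on the scikit-learn functions imported above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # mean of squared errors
    mae = np.mean(np.abs(residuals))                 # mean of absolute errors
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # 1 minus unexplained fraction
    return {"MSE": mse, "MAE": mae, "R^2": r2}

# Toy example (invented values): errors of -10, +10, and 0.
example = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(example)
```

Because R² compares the model's squared error with that of always predicting the mean of `y_true`, it becomes negative whenever the model does worse than that constant baseline.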



## 1) Load the dataset and inspect basic information

In this section, we load the dataset into a pandas DataFrame and perform a **basic sanity check** before building any models.

### What this code does
- **Loads the CSV file** into a pandas DataFrame (`df`)
- Prints the **shape** of the dataset:
  - number of rows = number of samples (vehicles)
  - number of columns = number of variables (the feature plus the target)
- Prints the **column names** to understand what information is available
- Displays the **first few rows** using `head()` to preview the data format and values
- Shows **summary statistics** using `describe()`:
  - for numeric columns: mean, standard deviation, min/max, quartiles, etc.
  - for non-numeric columns: count, unique values, most common value, etc.
- Checks for **missing values** in each column

### Why this matters
Machine learning models depend heavily on clean and well-structured data.  
Before training any regression model, we must confirm:
- the target column exists (here: **Horse Power**)
- columns have reasonable values and types
- there are no unexpected missing values that could break model training


In [61]:

# ============================================================
# Load dataset
# ============================================================

DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())

display(df.head())

print("\nSummary statistics:")
display(df.describe(include="all"))

print("\nMissing values per column:")
display(df.isna().sum())


Shape: (100, 2)

Columns:
['Horse Power', 'Fuel Economy (MPG)']


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| 0 | 118.770799 | 29.344195 |
| 1 | 176.326567 | 24.695934 |
| 2 | 219.262465 | 23.95201 |
| 3 | 187.310009 | 23.384546 |
| 4 | 218.59434 | 23.426739 |



Summary statistics:


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |



Missing values per column:


| Column | Missing values |
|---|---|
| Horse Power | 0 |
| Fuel Economy (MPG) | 0 |


## Results

Use the utility functions defined above to train and evaluate the models.

In [62]:
results = run_models_and_evaluate(
    df,
    degrees=(1, 2, 3, 4),
    target_col="Horse Power",
    top_k_terms=15
)

display(results)

| | Model | Train MSE | Train MAE | Train R^2 | Test MSE | Test MAE | Test R^2 | Train size | Test size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 357.69918 | 16.061689 | 0.90632 | 318.561087 | 14.940628 | 0.912561 | 70 | 30 |
| 1 | Polynomial Regression (degree=2) | 350.879731 | 15.995824 | 0.908106 | 331.105434 | 15.14833 | 0.909118 | 70 | 30 |
| 2 | Polynomial Regression (degree=3) | 345.108668 | 15.746762 | 0.909618 | 318.404012 | 14.764973 | 0.912604 | 70 | 30 |
| 3 | Polynomial Regression (degree=4) | 339.700171 | 15.508465 | 0.911034 | 313.798757 | 14.735471 | 0.913868 | 70 | 30 |


### Part 1 Discussion

• **Best test-set model:**
The degree-4 polynomial model performs best on the test set, with the lowest Test MSE (313.8), lowest Test MAE (14.74), and highest Test R^2 (0.9139). The margin over linear regression is small, however (Test MSE 313.8 vs. 318.6 on only 30 test samples), so it is at most weak evidence that higher-order terms capture additional nonlinear structure.

• **Effect of increasing polynomial degree:**
Increasing polynomial degree does not consistently improve test performance. The degree-2 model performs slightly worse than linear regression on the test set (MSE 331.1 vs. 318.6; R^2 0.9091 vs. 0.9126), while degrees 3 and 4 roughly match or slightly beat the linear model. Training error, by contrast, decreases steadily with degree (Train MSE from 357.7 down to 339.7), as expected whenever terms are added.

• **Weaker-performing models and possible causes:**
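The degree-2 model has the highest test error (MSE = 331.1) even though its training error is lower than the linear model's (350.9 vs. 357.7). Because a quadratic model contains the linear model as a special case, this is not underfitting relative to the linear fit; it is more consistent with the extra term fitting noise in the 70 training samples, or simply with sampling variation in a 30-sample test set. More broadly, all models are limited by the use of fuel economy alone, which cannot fully explain horsepower, and by noise or outliers in the data that increase prediction error.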

• **Overfitting vs. underfitting:**
Train and test metrics are very close for all models (e.g., degree-4 Train R^2 = 0.911 vs. Test R^2 = 0.914), indicating little overfitting. The relatively small gains from higher-degree models imply that the HP–fuel economy relationship is mostly linear, with only weak nonlinear effects.
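To put the "relatively small gains" in numbers, the reported test metrics can be compared directly. The short calculation below uses only the Test MSE values from the results table; reading the square root of the MSE as an error in horsepower units assumes the `Horse Power` column is measured in HP.

```python
import math

# Test MSE values copied from the Part 1 results table.
linear_test_mse = 318.561087
degree4_test_mse = 313.798757

# Fractional reduction in test MSE when moving from linear to degree 4.
relative_gain = (linear_test_mse - degree4_test_mse) / linear_test_mse

# RMSE expresses the typical error in the units of the target.
linear_rmse = math.sqrt(linear_test_mse)
degree4_rmse = math.sqrt(degree4_test_mse)

print(f"Relative Test MSE reduction: {relative_gain:.2%}")
print(f"Test RMSE: linear {linear_rmse:.2f}, degree-4 {degree4_rmse:.2f}")
```

The reduction is about 1.5% of the linear model's test MSE, which is small enough that a different random split of 100 samples could plausibly change the ranking.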


# Part 2: Regression Case Study: Daily Electricity Consumption Prediction with Regression Models

**Dataset:** `electricity_consumption_based_weather_dataset.csv`  
**Task:** Build regression models to predict daily electricity consumption (`daily_consumption`) from weather features (`AWND`, `PRCP`, `TMAX`, `TMIN`).  


**Models:**  
- Linear Regression  
- Polynomial Regression (degree 2, 3, 4)

**Regularization:** **Not used** (as requested)

---

## Evaluation Metrics (Train & Test)

For each model, we report:
- Mean Squared Error (**MSE**)
- Mean Absolute Error (**MAE**)
- Coefficient of Determination (**R²**)



## 1) Load the dataset and inspect basic information

In this section, we load the dataset into a pandas DataFrame and perform a **basic sanity check** before building any models.

### What this code does
- **Loads the CSV file** into a pandas DataFrame (`df2`) and drops the `date` column
- Prints the **shape** of the dataset:
  - number of rows = number of samples (daily records)
  - number of columns = number of variables (weather features plus the target)
- Prints the **column names** to understand what information is available
- Displays the **first few rows** using `head()` to preview the data format and values
- Shows **summary statistics** using `describe()`:
  - for numeric columns: mean, standard deviation, min/max, quartiles, etc.
  - for non-numeric columns: count, unique values, most common value, etc.
- Checks for **missing values** in each column

### Why this matters
Machine learning models depend heavily on clean and well-structured data.  
Before training any regression model, we must confirm:
- the target column exists (here: **daily_consumption**)
- columns have reasonable values and types
- there are no unexpected missing values that could break model training


In [63]:

# ============================================================
# Load dataset
# ============================================================

DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df2 = pd.read_csv(DATA_PATH)

# Since we are told to randomly split the data 70-30, the date becomes
# inconsequential, i.e., not a time series forecasting problem.
df2.drop(columns=["date"], inplace=True)

print("Shape:", df2.shape)
print("\nColumns:")
print(df2.columns.tolist())

display(df2.head())

print("\nSummary statistics:")
display(df2.describe(include="all"))

print("\nMissing values per column:")
display(df2.isna().sum())


Shape: (1433, 5)

Columns:
['AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| 0 | 2.5 | 0.0 | 10.6 | 5.0 | 1209.176 |
| 1 | 2.6 | 0.0 | 13.3 | 5.6 | 3390.46 |
| 2 | 2.4 | 0.0 | 15.0 | 6.7 | 2203.826 |
| 3 | 2.4 | 0.0 | 7.2 | 2.2 | 1666.194 |
| 4 | 2.4 | 0.0 | 7.2 | 1.1 | 2225.748 |



Summary statistics:


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |



Missing values per column:


| Column | Missing values |
|---|---|
| AWND | 15 |
| PRCP | 0 |
| TMAX | 0 |
| TMIN | 0 |
| daily_consumption | 0 |


## Results

Use the utility functions defined above to train and evaluate the models.

In [64]:
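# Pass the target explicitly instead of reassigning the module-level
# TARGET_COL, whose value is already bound as a default in the helpers.
results = run_models_and_evaluate(
    df2,
    degrees=(1, 2, 3, 4),
    target_col="daily_consumption",
    top_k_terms=15
)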

display(results)

| | Model | Train MSE | Train MAE | Train R^2 | Test MSE | Test MAE | Test R^2 | Train size | Test size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 272403.396174 | 384.465016 | 0.276 | 248125.8 | 375.404537 | 0.299333 | 992 | 426 |
| 1 | Polynomial Regression (degree=2) | 264765.769932 | 379.648753 | 0.2963 | 255268.5 | 379.039083 | 0.279163 | 992 | 426 |
| 2 | Polynomial Regression (degree=3) | 259249.53487 | 375.952901 | 0.310961 | 265623.7 | 385.235167 | 0.249922 | 992 | 426 |
| 3 | Polynomial Regression (degree=4) | 251909.339001 | 372.116566 | 0.33047 | 12151490.0 | 578.642201 | -33.313844 | 992 | 426 |
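
The train and test sizes in the table (992 + 426 = 1418) are smaller than the 1433 rows loaded because `prepare_xy` drops the 15 rows with a missing `AWND` value. The arithmetic can be checked with a short sketch. It assumes the test count is the ceiling of `test_size × n`, which reproduces the sizes reported in both parts; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, n_dropped=0, test_size=0.30):
    """Return (n_train, n_test) after dropping rows and splitting.

    Assumes the test count is rounded up, which matches the reported sizes.
    """
    n_used = n_rows - n_dropped
    n_test = math.ceil(test_size * n_used)
    return n_used - n_test, n_test

part1 = split_sizes(100)                  # FuelEconomy.csv: no missing values
part2 = split_sizes(1433, n_dropped=15)   # 15 rows with missing AWND removed
print(part1, part2)
```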


### Part 2 Discussion

• **Best generalizing model:**

Linear regression generalizes the best, with the lowest test MSE (2.48 × 10^5), lowest test MAE (375.4), and highest test R^2 (0.299). This suggests that, given the available features, weather explains only a limited and mostly linear component of daily electricity consumption.

• **Do polynomial models help?**

Polynomial models do not improve test performance. Although electricity consumption can depend nonlinearly on weather (for example, temperature thresholds triggering heating or cooling), the provided weather variables do not support learning these effects reliably, so added polynomial terms do not generalize.
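
One common way to express the temperature-threshold effect mentioned above is with heating and cooling degree-day features. The sketch below is an illustration only: it assumes `TMAX` and `TMIN` are in °C and uses a base temperature of 18 °C, neither of which is stated in the notebook, and `degree_days` is a hypothetical helper. Whether such features would improve the results here is untested.

```python
import numpy as np

BASE_TEMP_C = 18.0  # assumed comfort threshold; not taken from the notebook

def degree_days(tmax, tmin, base=BASE_TEMP_C):
    """Return (heating, cooling) degree-days from daily max/min temperatures."""
    tmean = (np.asarray(tmax, dtype=float) + np.asarray(tmin, dtype=float)) / 2.0
    hdd = np.maximum(base - tmean, 0.0)  # how far the day is below the base
    cdd = np.maximum(tmean - base, 0.0)  # how far the day is above the base
    return hdd, cdd

# First row of df2.head() (TMAX=10.6, TMIN=5.0) and a made-up hot day.
hdd, cdd = degree_days([10.6, 35.0], [5.0, 25.0])
print(hdd, cdd)
```

Each feature is zero on one side of the threshold and linear on the other, so a linear model on `hdd` and `cdd` can represent a V-shaped dependence on temperature without high-degree polynomial terms.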

• **Why higher-degree models perform worse:**

As polynomial degree increases, training error decreases (Train R^2 rises from 0.276 to 0.330), while test performance degrades. This is especially clear for the degree-4 model, where Test MSE jumps to 1.21 × 10^7 and Test R^2 drops to −33.31. The widening gap between train and test metrics indicates severe overfitting.
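
The negative Test R^2 can be sanity-checked against the reported Test MSE using the identity R^2 = 1 − MSE / Var(y), where Var(y) is the variance of the test targets. The sketch below uses only numbers copied from the results table; `implied_target_variance` is a hypothetical helper name.

```python
import math

# Test metrics copied from the Part 2 results table.
reported = {
    "linear":   {"test_mse": 248125.8,   "test_mae": 375.404537, "test_r2": 0.299333},
    "degree-4": {"test_mse": 12151490.0, "test_mae": 578.642201, "test_r2": -33.313844},
}

def implied_target_variance(test_mse, test_r2):
    """Solve R^2 = 1 - MSE / Var(y) for Var(y)."""
    return test_mse / (1.0 - test_r2)

summary = {}
for name, m in reported.items():
    var_y = implied_target_variance(m["test_mse"], m["test_r2"])
    rmse_to_mae = math.sqrt(m["test_mse"]) / m["test_mae"]
    summary[name] = (var_y, rmse_to_mae)
    print(f"{name}: implied Var(y_test) = {var_y:,.0f}, RMSE/MAE = {rmse_to_mae:.2f}")
```

Both rows imply nearly the same test-set variance (about 3.5 × 10^5), so the two metrics are mutually consistent, and the degree-4 test MSE is roughly 34 times that variance, far worse than predicting the test mean. The much larger RMSE-to-MAE ratio for degree 4 suggests that its error is dominated by a small number of extreme predictions rather than by uniformly poor fits.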

• **Why overall test performance is weak:**

All models have low test R^2 values (at most 0.30), indicating that weather alone does not explain most of the variability in electricity usage. Strong seasonal patterns are likely present, but the `date` column is dropped before modeling and the split is random, so the models have no way to represent seasonality or trends over time. In addition, key drivers such as occupancy, behavior, and calendar effects are missing. A temporal split with explicit time-based or seasonal features would be more appropriate for this problem.
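
A temporal split with seasonal features could look roughly like the sketch below. It is a hypothetical illustration, not something run in this notebook: the function names are assumptions, the dates are synthetic, and it presumes the `date` column can be parsed into calendar days.

```python
import numpy as np

def seasonal_features(dates):
    """Encode day-of-year as sin/cos so late December sits next to early January."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    year_start = dates.astype("datetime64[Y]").astype("datetime64[D]")
    day_of_year = (dates - year_start).astype(int)  # 0 for January 1
    angle = 2.0 * np.pi * day_of_year / 365.25
    return np.column_stack([np.sin(angle), np.cos(angle)])

def chronological_split(dates, test_fraction=0.30):
    """Return (train_idx, test_idx), with the most recent rows held out for testing."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    order = np.argsort(dates, kind="stable")
    n_test = int(np.ceil(test_fraction * len(order)))
    return order[:-n_test], order[-n_test:]

# Synthetic example: ten consecutive days.
dates = np.arange("2020-01-01", "2020-01-11", dtype="datetime64[D]")
train_idx, test_idx = chronological_split(dates)
season = seasonal_features(dates)
print(len(train_idx), len(test_idx), season.shape)
```

Holding out the latest rows means every test date lies after every training date, which mimics forecasting. The sin/cos pair gives a linear model a smooth annual cycle to fit without treating the date as an ordinary number.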