## Part 1: Fuel Economy → Horsepower Prediction

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

### 1.1 Load and Inspect the Dataset

In [3]:
DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)

print("\nColumn names:")
print(df.columns.tolist())

display(df.head())

print("\nSummary statistics:")
display(df.describe())

print("\nMissing values per column:")
display(df.isna().sum())



Dataset shape: (100, 2)

Column names:
['Horse Power', 'Fuel Economy (MPG)']


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| 0 | 118.770799 | 29.344195 |
| 1 | 176.326567 | 24.695934 |
| 2 | 219.262465 | 23.95201 |
| 3 | 187.310009 | 23.384546 |
| 4 | 218.59434 | 23.426739 |



Summary statistics:


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |



Missing values per column:


| Column | Missing values |
|---|---|
| Horse Power | 0 |
| Fuel Economy (MPG) | 0 |


**Missing Value Handling:**

We checked the dataset for missing values, and no missing values were found.  
Therefore, no additional data cleaning was required.



### 1.2 Train / Test Split (70% / 30%)

In [5]:
X = df.drop(columns=["Horse Power"])
y = df["Horse Power"]

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

# Perform train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


Feature matrix shape: (100, 1)
Target vector shape: (100,)
Training set size: (70, 1)
Test set size: (30, 1)


### 1.3 Model Training: Linear and Polynomial Regression

In [6]:
# Linear Regression model
lin_reg = LinearRegression()

# Fit model on training data
lin_reg.fit(X_train, y_train)

# Make predictions
y_train_pred_lin = lin_reg.predict(X_train)
y_test_pred_lin = lin_reg.predict(X_test)

# Print coefficients
print("Linear Regression intercept:", lin_reg.intercept_)
print("Linear Regression coefficient(s):", lin_reg.coef_)


Linear Regression intercept: 500.36382048365186
Linear Regression coefficient(s): [-12.3785158]


In [7]:
# Polynomial Regression (degree = 2)
poly2_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lr", LinearRegression())
])

# Fit model
poly2_model.fit(X_train, y_train)

# Predictions
y_train_pred_poly2 = poly2_model.predict(X_train)
y_test_pred_poly2 = poly2_model.predict(X_test)

print("Polynomial Regression (degree=2) model trained.")


Polynomial Regression (degree=2) model trained.


In [8]:
# Polynomial Regression (degree = 3)
poly3_model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("lr", LinearRegression())
])

poly3_model.fit(X_train, y_train)

y_train_pred_poly3 = poly3_model.predict(X_train)
y_test_pred_poly3 = poly3_model.predict(X_test)

print("Polynomial Regression (degree=3) model trained.")


Polynomial Regression (degree=3) model trained.


In [9]:
# Polynomial Regression (degree = 4)
poly4_model = Pipeline([
    ("poly", PolynomialFeatures(degree=4)),
    ("lr", LinearRegression())
])

poly4_model.fit(X_train, y_train)

y_train_pred_poly4 = poly4_model.predict(X_train)
y_test_pred_poly4 = poly4_model.predict(X_test)

print("Polynomial Regression (degree=4) model trained.")


Polynomial Regression (degree=4) model trained.


### 1.4 Model Evaluation (Train and Test)

In [10]:
results = []

# ---------- Linear Regression ----------
results.append({
    "Model": "Linear Regression",
    "Train MSE": mean_squared_error(y_train, y_train_pred_lin),
    "Train MAE": mean_absolute_error(y_train, y_train_pred_lin),
    "Train R2": r2_score(y_train, y_train_pred_lin),
    "Test MSE": mean_squared_error(y_test, y_test_pred_lin),
    "Test MAE": mean_absolute_error(y_test, y_test_pred_lin),
    "Test R2": r2_score(y_test, y_test_pred_lin),
})

# ---------- Polynomial Regression (Degree 2) ----------
results.append({
    "Model": "Polynomial (Degree 2)",
    "Train MSE": mean_squared_error(y_train, y_train_pred_poly2),
    "Train MAE": mean_absolute_error(y_train, y_train_pred_poly2),
    "Train R2": r2_score(y_train, y_train_pred_poly2),
    "Test MSE": mean_squared_error(y_test, y_test_pred_poly2),
    "Test MAE": mean_absolute_error(y_test, y_test_pred_poly2),
    "Test R2": r2_score(y_test, y_test_pred_poly2),
})

# ---------- Polynomial Regression (Degree 3) ----------
results.append({
    "Model": "Polynomial (Degree 3)",
    "Train MSE": mean_squared_error(y_train, y_train_pred_poly3),
    "Train MAE": mean_absolute_error(y_train, y_train_pred_poly3),
    "Train R2": r2_score(y_train, y_train_pred_poly3),
    "Test MSE": mean_squared_error(y_test, y_test_pred_poly3),
    "Test MAE": mean_absolute_error(y_test, y_test_pred_poly3),
    "Test R2": r2_score(y_test, y_test_pred_poly3),
})

# ---------- Polynomial Regression (Degree 4) ----------
results.append({
    "Model": "Polynomial (Degree 4)",
    "Train MSE": mean_squared_error(y_train, y_train_pred_poly4),
    "Train MAE": mean_absolute_error(y_train, y_train_pred_poly4),
    "Train R2": r2_score(y_train, y_train_pred_poly4),
    "Test MSE": mean_squared_error(y_test, y_test_pred_poly4),
    "Test MAE": mean_absolute_error(y_test, y_test_pred_poly4),
    "Test R2": r2_score(y_test, y_test_pred_poly4),
})

results_df = pd.DataFrame(results)
results_df


| | Model | Train MSE | Train MAE | Train R2 | Test MSE | Test MAE | Test R2 |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 357.69918 | 16.061689 | 0.90632 | 318.561087 | 14.940628 | 0.912561 |
| 1 | Polynomial (Degree 2) | 350.879731 | 15.995824 | 0.908106 | 331.105434 | 15.14833 | 0.909118 |
| 2 | Polynomial (Degree 3) | 345.108668 | 15.746762 | 0.909618 | 318.404012 | 14.764973 | 0.912604 |
| 3 | Polynomial (Degree 4) | 339.700171 | 15.508465 | 0.911034 | 313.798757 | 14.735471 | 0.913868 |


### 1.5 Discussion and Explanation

The results table in Section 1.4 shows that the degree 4 polynomial regression model performs best on the test set: it achieves the lowest test MSE (313.80), the lowest test MAE (14.74), and the highest test R² (0.9139). Its margin over the linear model and the lower-degree polynomial models is small, however, so it offers only a slightly better fit when predicting horsepower from fuel economy.

However, increasing the polynomial degree does not lead to a meaningful improvement. Training error declines steadily as the degree increases (train MSE falls from 357.70 for the linear model to 339.70 at degree 4), but the test metrics barely move and are not monotonic: the degree 2 model is slightly worse on the test set than the linear model (test R² 0.9091 vs. 0.9126), and test R² rises only from 0.9126 at degree 3 to 0.9139 at degree 4. In terms of generalization, higher degrees offer diminishing returns.
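To make the size of these differences concrete, the short sketch below recomputes each model's change in test R² relative to the linear baseline. The values are copied from the results table in Section 1.4; the variable names are illustrative only.

```python
# Test-set R² values copied from the results table in Section 1.4.
test_r2 = {
    "Linear Regression": 0.912561,
    "Polynomial (Degree 2)": 0.909118,
    "Polynomial (Degree 3)": 0.912604,
    "Polynomial (Degree 4)": 0.913868,
}

# Change in test R² relative to the linear baseline, rounded to 6 decimals.
baseline = test_r2["Linear Regression"]
gains = {name: round(score - baseline, 6) for name, score in test_r2.items()}

for name, gain in gains.items():
    print(f"{name:<24} {gain:+.6f}")
```

The degree 4 model gains about 0.0013 in test R² over the linear model, while the degree 2 model loses about 0.0034, so the ranking among the four models rests on very small differences.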

There is also no clear evidence of serious overfitting. The higher-degree models fit the training data better without any corresponding drop in test performance. The additional complexity is therefore not harmful here, possibly because the relationship between fuel economy and horsepower is smooth and there is only a single input feature.

Overall, all four models perform well, with test R² values above 0.90. The strong performance of the linear model indicates that fuel economy alone already accounts for most of the variation in horsepower. Polynomial regression offers at most a marginal improvement (test R² of 0.9139 at degree 4 vs. 0.9126 for the linear model), so larger gains would probably require additional features rather than a more flexible curve.
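Because the test set contains only 30 rows, differences of about 0.001 in R² may fall within split-to-split noise. One way to check this, sketched below under assumptions, is k-fold cross-validation over the same four degrees. The helper name `cv_r2_by_degree` is hypothetical, and the demo uses synthetic stand-in data rather than `FuelEconomy.csv`, so its printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def cv_r2_by_degree(X, y, degrees=(1, 2, 3, 4), n_splits=5, seed=42):
    """Return {degree: (mean R², std of R²)} from shuffled k-fold CV."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    summary = {}
    for degree in degrees:
        # Degree 1 is equivalent to plain linear regression.
        model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        summary[degree] = (scores.mean(), scores.std())
    return summary


# Synthetic stand-in data (an assumption, NOT the real dataset):
# a roughly linear MPG -> horsepower relationship with Gaussian noise.
rng = np.random.default_rng(0)
mpg = rng.uniform(10, 35, size=100)
hp = 500 - 12.4 * mpg + rng.normal(0, 18, size=100)

cv_summary = cv_r2_by_degree(mpg.reshape(-1, 1), hp)
for degree, (mean_r2, std_r2) in cv_summary.items():
    print(f"degree {degree}: R² = {mean_r2:.3f} ± {std_r2:.3f}")
```

On the real data, one would pass `X` and `y` from Section 1.2 instead of the synthetic arrays. If the fold-to-fold standard deviation is larger than the gaps between degrees, the simpler linear model is a reasonable default.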


## Part 2: Weather → Daily Electricity Consumption Prediction

### 2.1 Load and Inspect the Dataset


In [13]:
# Load the electricity consumption dataset
DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df2 = pd.read_csv(DATA_PATH)

print("Dataset shape:", df2.shape)

print("\nColumn names:")
print(df2.columns.tolist())

display(df2.head())

print("\nSummary statistics:")
display(df2.describe())

print("\nMissing values per column:")
display(df2.isna().sum())

Dataset shape: (1433, 6)

Column names:
['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']


| | date | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|---|
| 0 | 2006-12-16 | 2.5 | 0.0 | 10.6 | 5.0 | 1209.176 |
| 1 | 2006-12-17 | 2.6 | 0.0 | 13.3 | 5.6 | 3390.46 |
| 2 | 2006-12-18 | 2.4 | 0.0 | 15.0 | 6.7 | 2203.826 |
| 3 | 2006-12-19 | 2.4 | 0.0 | 7.2 | 2.2 | 1666.194 |
| 4 | 2006-12-20 | 2.4 | 0.0 | 7.2 | 1.1 | 2225.748 |



Summary statistics:


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |



Missing values per column:


| Column | Missing values |
|---|---|
| date | 0 |
| AWND | 15 |
| PRCP | 0 |
| TMAX | 0 |
| TMIN | 0 |
| daily_consumption | 0 |


**Missing Value Handling:**

The `AWND` column has 15 missing values (1,418 non-null entries out of 1,433 rows); all other columns are complete.  
These gaps are filled by mean imputation in Section 2.2, with the imputer fitted on the training split only.
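The mean imputation applied in Section 2.2 can also be placed inside the model pipeline, so that the imputer is always fitted on exactly the data the model is fitted on (useful if cross-validation is added later). The following is a minimal sketch, not the notebook's method: `make_imputed_poly_model` is a hypothetical helper and the demo array is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def make_imputed_poly_model(degree):
    """Hypothetical helper: impute -> polynomial expansion -> linear regression."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("poly", PolynomialFeatures(degree=degree)),
        ("lr", LinearRegression()),
    ])


# Made-up demo data with one missing entry in the first column.
X_demo = np.array([[1.0, 2.0], [np.nan, 3.0], [3.0, 5.0], [4.0, 4.0]])
y_demo = np.array([3.0, 5.0, 8.0, 8.0])

model = make_imputed_poly_model(degree=1).fit(X_demo, y_demo)

# The imputer stores the per-column means it learned during fit.
learned_mean = model.named_steps["impute"].statistics_[0]
print("Mean used to fill column 0:", learned_mean)
```

With this structure, calling `fit` on the raw training features and `predict` on the raw test features reproduces the fit-on-train, transform-on-test behavior without separate imputed arrays.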



### 2.2 Train / Test Split (70% / 30%)


In [21]:
# Convert date column to datetime
df2["date"] = pd.to_datetime(df2["date"])

# Convert date to numeric feature (days since first date)
df2["date_numeric"] = (df2["date"] - df2["date"].min()).dt.days

# Separate features and target variable
X2 = df2.drop(columns=["daily_consumption", "date"])
y2 = df2["daily_consumption"]

print("Feature matrix shape:", X2.shape)
print("Target vector shape:", y2.shape)

# Perform train/test split
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2,
    y2,
    test_size=0.30,
    random_state=42
)

print("Training set size:", X2_train.shape)
print("Test set size:", X2_test.shape)


Feature matrix shape: (1433, 5)
Target vector shape: (1433,)
Training set size: (1003, 5)
Test set size: (430, 5)


In [22]:
from sklearn.impute import SimpleImputer

# Impute missing values using mean strategy
imputer = SimpleImputer(strategy="mean")

X2_train_imputed = imputer.fit_transform(X2_train)
X2_test_imputed = imputer.transform(X2_test)

print("Missing values handled using mean imputation.")


Missing values handled using mean imputation.


### 2.3 Model Training: Linear and Polynomial Regression


In [23]:
# Linear Regression model
lin_reg_2 = LinearRegression()

lin_reg_2.fit(X2_train_imputed, y2_train)

y2_train_pred_lin = lin_reg_2.predict(X2_train_imputed)
y2_test_pred_lin = lin_reg_2.predict(X2_test_imputed)

print("Linear Regression model trained.")


Linear Regression model trained.


In [24]:
poly2_model_2 = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lr", LinearRegression())
])

poly2_model_2.fit(X2_train_imputed, y2_train)

y2_train_pred_poly2 = poly2_model_2.predict(X2_train_imputed)
y2_test_pred_poly2 = poly2_model_2.predict(X2_test_imputed)

print("Polynomial Regression (degree=2) model trained.")


Polynomial Regression (degree=2) model trained.


In [25]:
poly3_model_2 = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("lr", LinearRegression())
])

poly3_model_2.fit(X2_train_imputed, y2_train)

y2_train_pred_poly3 = poly3_model_2.predict(X2_train_imputed)
y2_test_pred_poly3 = poly3_model_2.predict(X2_test_imputed)

print("Polynomial Regression (degree=3) model trained.")


Polynomial Regression (degree=3) model trained.


In [26]:
poly4_model_2 = Pipeline([
    ("poly", PolynomialFeatures(degree=4)),
    ("lr", LinearRegression())
])

poly4_model_2.fit(X2_train_imputed, y2_train)

y2_train_pred_poly4 = poly4_model_2.predict(X2_train_imputed)
y2_test_pred_poly4 = poly4_model_2.predict(X2_test_imputed)

print("Polynomial Regression (degree=4) model trained.")


Polynomial Regression (degree=4) model trained.


### 2.4 Model Evaluation (Train and Test)


In [28]:
results_2 = []

# ---------- Linear Regression ----------
results_2.append({
    "Model": "Linear Regression",
    "Train MSE": mean_squared_error(y2_train, y2_train_pred_lin),
    "Train MAE": mean_absolute_error(y2_train, y2_train_pred_lin),
    "Train R2": r2_score(y2_train, y2_train_pred_lin),
    "Test MSE": mean_squared_error(y2_test, y2_test_pred_lin),
    "Test MAE": mean_absolute_error(y2_test, y2_test_pred_lin),
    "Test R2": r2_score(y2_test, y2_test_pred_lin),
})

# ---------- Polynomial Regression (Degree 2) ----------
results_2.append({
    "Model": "Polynomial (Degree 2)",
    "Train MSE": mean_squared_error(y2_train, y2_train_pred_poly2),
    "Train MAE": mean_absolute_error(y2_train, y2_train_pred_poly2),
    "Train R2": r2_score(y2_train, y2_train_pred_poly2),
    "Test MSE": mean_squared_error(y2_test, y2_test_pred_poly2),
    "Test MAE": mean_absolute_error(y2_test, y2_test_pred_poly2),
    "Test R2": r2_score(y2_test, y2_test_pred_poly2),
})

# ---------- Polynomial Regression (Degree 3) ----------
results_2.append({
    "Model": "Polynomial (Degree 3)",
    "Train MSE": mean_squared_error(y2_train, y2_train_pred_poly3),
    "Train MAE": mean_absolute_error(y2_train, y2_train_pred_poly3),
    "Train R2": r2_score(y2_train, y2_train_pred_poly3),
    "Test MSE": mean_squared_error(y2_test, y2_test_pred_poly3),
    "Test MAE": mean_absolute_error(y2_test, y2_test_pred_poly3),
    "Test R2": r2_score(y2_test, y2_test_pred_poly3),
})

# ---------- Polynomial Regression (Degree 4) ----------
results_2.append({
    "Model": "Polynomial (Degree 4)",
    "Train MSE": mean_squared_error(y2_train, y2_train_pred_poly4),
    "Train MAE": mean_absolute_error(y2_train, y2_train_pred_poly4),
    "Train R2": r2_score(y2_train, y2_train_pred_poly4),
    "Test MSE": mean_squared_error(y2_test, y2_test_pred_poly4),
    "Test MAE": mean_absolute_error(y2_test, y2_test_pred_poly4),
    "Test R2": r2_score(y2_test, y2_test_pred_poly4),
})

results_df_2 = pd.DataFrame(results_2)
results_df_2

| | Model | Train MSE | Train MAE | Train R2 | Test MSE | Test MAE | Test R2 |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 274691.822827 | 387.618122 | 0.273301 | 236487.153837 | 366.084087 | 0.313614 |
| 1 | Polynomial (Degree 2) | 263797.158438 | 380.878665 | 0.302123 | 232382.266938 | 362.173758 | 0.325528 |
| 2 | Polynomial (Degree 3) | 252639.506869 | 374.862007 | 0.33164 | 240390.816248 | 374.111119 | 0.302284 |
| 3 | Polynomial (Degree 4) | 230622.84363 | 360.596718 | 0.389886 | 640816.905897 | 460.790599 | -0.859923 |


### 2.5 Discussion and Explanation

According to the results table in Section 2.4, the degree 2 polynomial regression model generalizes best. It has the lowest test MSE (232,382.27), the lowest test MAE (362.17), and the highest test R² (0.3255) among all models. Relative to linear regression (test R² 0.3136), this suggests that a small amount of nonlinearity helps capture the relationship between the input features and daily electricity consumption.

Polynomial regression therefore improves on linear regression, but only modestly. The degree 2 model improves both training and test performance, which suggests that electricity consumption depends nonlinearly on weather variables such as temperature (`TMAX`, `TMIN`), wind speed (`AWND`), and precipitation (`PRCP`). Such effects are plausible because electricity consumption often rises disproportionately under extreme weather conditions.

However, higher-degree polynomial models do not generalize well. Training performance keeps improving as the degree increases (training R² reaches 0.3899 at degree 4), but test performance deteriorates: the degree 3 model already falls below the linear model on the test set (test R² 0.3023 vs. 0.3136), and at degree 4 the test MSE jumps to 640,816.91 and the test R² becomes negative (-0.8599), meaning the model predicts worse than simply using the mean of the test targets. This is a strong indicator of overfitting: the model fits the training data too closely to generalize to unseen data.
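One likely contributor is the growth in the number of polynomial terms. The sketch below counts the columns that `PolynomialFeatures` generates for the five inputs used here (`AWND`, `PRCP`, `TMAX`, `TMIN`, `date_numeric`) and compares them with the 1,003 training rows. Treat it as an illustration rather than a full diagnosis.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 5    # AWND, PRCP, TMAX, TMIN, date_numeric
n_train = 1003  # training rows reported in Section 2.2

feature_counts = {}
for degree in (1, 2, 3, 4):
    # Fitting on a dummy row is enough: the output width depends only on
    # the number of input columns and the degree.
    poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_inputs)))
    feature_counts[degree] = poly.n_output_features_

for degree, count in feature_counts.items():
    print(f"degree {degree}: {count} columns, "
          f"{n_train / count:.1f} training rows per column")
```

At degree 4 the expansion has 126 columns (including the bias term), roughly one for every eight training rows. In addition, `date_numeric` appears to span more than a thousand days, so its fourth power is on the order of 10^12 and the unscaled design matrix is likely poorly conditioned. Standardizing the features or adding regularization are common mitigations, though neither was tried here.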

Overall, the low test R² values across all models (at most 0.33) indicate that the weather features, together with the numeric date index, are insufficient to explain daily electricity consumption. Other factors, such as human behavior, occupancy patterns, economic activity, and seasonal effects, may be important drivers that the current features do not capture. Therefore, although polynomial regression offers a modest improvement over linear regression, model performance is ultimately limited by the available feature set.
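As one illustration of a feature that might capture seasonal effects, the `date` column could be encoded cyclically instead of (or alongside) the linear day count. This is a sketch under assumptions, not something evaluated in this notebook: `add_seasonal_features` and the column names `doy_sin` and `doy_cos` are hypothetical, and whether they improve test R² is untested.

```python
import numpy as np
import pandas as pd


def add_seasonal_features(frame, date_col="date"):
    """Return a copy of `frame` with sine/cosine day-of-year columns added."""
    out = frame.copy()
    day_of_year = pd.to_datetime(out[date_col]).dt.dayofyear
    # Map the day of year onto a circle so late December sits next to early January.
    angle = 2 * np.pi * day_of_year / 365.25
    out["doy_sin"] = np.sin(angle)
    out["doy_cos"] = np.cos(angle)
    return out


# Two made-up rows half a year apart, for demonstration only.
demo = pd.DataFrame({"date": ["2006-12-16", "2007-06-16"]})
demo_seasonal = add_seasonal_features(demo)
print(demo_seasonal)
```

Applied to `df2` before the split, the two new columns would enter the feature matrix like any other numeric feature.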
