#Automotive Data Modeling: MPG vs. Horsepower
By Benjamin Cabrera & Alexander Ohye

____________________________________________________________________________________________________________________________________________________________________

#Introduction

This project explores how automotive features can be used to predict fuel efficiency (miles per gallon, mpg) using polynomial regression techniques.

The dataset contains variables such as horsepower, cylinders, displacement, weight, acceleration, year, and origin.

The goal is to apply polynomial regression with different degrees (2, 3, and 4) to model the nonlinear relationship between horsepower and mpg, evaluate model performance with cross-validation, and select the degree that best balances accuracy and generalization.

The data frame has 392 observations of the following variables:

- mpg: Miles per gallon (target variable)

- cylinders: Number of cylinders

- displacement: Engine displacement (cubic inches)

- horsepower: Engine horsepower

- weight: Vehicle weight (lbs)

- acceleration: Time to accelerate from 0–60 mph (seconds)

- year: Model year

- origin: Origin of car (USA, Europe, Japan)

- name: Car name (string identifier, dropped in this project)

The outcome variable to predict is mpg.


In [None]:
# Import all of the packages needed for the assignment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

In [None]:
# Connect to google drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# Importing auto dataset

In [None]:
path_auto_data = "/content/drive/MyDrive/Colab Notebooks/Datasets_ML/Auto_ISLR.csv"

In [None]:
auto_df = pd.read_csv(path_auto_data)

In [None]:
auto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
 7   origin        392 non-null    int64  
 8   name          392 non-null    object 
dtypes: float64(3), int64(5), object(1)
memory usage: 27.7+ KB


The columns 'name' and 'origin' are dropped because they are not used in this analysis, which models mpg from horsepower alone.

In [None]:
# Dropping columns: name and origin

In [None]:
# Preprocessing
# Removing 'name'

auto_df.drop(['name'], axis=1, inplace=True)

In [None]:
# Preprocessing
# Removing 'origin'

auto_df.drop(['origin'], axis=1, inplace=True)

In [None]:
auto_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           392 non-null    float64
 1   cylinders     392 non-null    int64  
 2   displacement  392 non-null    float64
 3   horsepower    392 non-null    int64  
 4   weight        392 non-null    int64  
 5   acceleration  392 non-null    float64
 6   year          392 non-null    int64  
dtypes: float64(3), int64(4)
memory usage: 21.6 KB


____________________________________________________________________________________________________________________________________________________________________

**a)** Fit a polynomial with 'horsepower' as the only predictor, testing degrees 2, 3, and 4, and select the best degree among them.

The choice is justified by the 10-fold cross-validation results reported at the end of this part.
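For reference, the model family and the selection criterion used in this part can be written out as follows. This is a sketch in notation that is not in the original notebook: $z$ for standardized horsepower, $d$ for the polynomial degree, and $K = 10$ for the number of folds are symbols introduced here for illustration.

$$
\widehat{\text{mpg}} = \beta_0 + \beta_1 z + \beta_2 z^2 + \dots + \beta_d z^d,
\qquad
z = \frac{\text{horsepower} - \overline{\text{horsepower}}}{s_{\text{horsepower}}},
\qquad d \in \{2, 3, 4\}
$$

$$
\text{Mean CV MSE}(d) = \frac{1}{K} \sum_{k=1}^{K} \text{MSE}_k(d),
\qquad
\text{MSE}_k(d) = \frac{1}{n_k} \sum_{i \in \text{fold } k} \left( \text{mpg}_i - \widehat{\text{mpg}}_i \right)^2
$$

The degree with the lowest mean CV MSE is the one carried forward.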

In [None]:
# variable with all the predictors
X_all_auto = auto_df.drop('mpg', axis=1)

In [None]:
# variable with outcome
y_auto = auto_df['mpg']

In [None]:
# split the dataset
X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(X_all_auto, y_auto, test_size=0.2, random_state=1)

In [None]:
cv_set_up = KFold(n_splits=10, shuffle=True, random_state=1)

In [None]:
predictor_to_test = 'horsepower'

In [None]:
degrees_to_test = [2, 3, 4]

In [None]:
cv_for_diff_degrees_hp_df = pd.DataFrame(columns= ['Degree_Poly', 'Mean_CV_MSE'])

In [None]:
for i in degrees_to_test:

    # Scale horsepower, then expand it into polynomial terms of degree i
    poly_loop_pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('poly_features', PolynomialFeatures(degree=i)),  # degree i (changes with loop iterations)
    ])

    X_poly_degree_i = poly_loop_pipeline.fit_transform(X_train_auto[predictor_to_test].values.reshape(-1, 1))

    # Apply CV with polynomial of degree i
    mse_scores_cv = -cross_val_score(LinearRegression(), X_poly_degree_i, y_train_auto,
                                     scoring='neg_mean_squared_error', cv=cv_set_up)
    mean_mse = np.round(np.mean(mse_scores_cv), 2)

    cv_for_diff_degrees_hp_df = pd.concat([cv_for_diff_degrees_hp_df,
                                           pd.DataFrame({'Degree_Poly': [i], 'Mean_CV_MSE': [mean_mse]})],
                                          ignore_index=True)


# After the loop, add the percent change in mean CV MSE from the previous degree as a new column

cv_for_diff_degrees_hp_df['Pct_change_from_previous'] = np.round(cv_for_diff_degrees_hp_df['Mean_CV_MSE'].pct_change() * 100, 2)



In [None]:
cv_for_diff_degrees_hp_df

   Degree_Poly  Mean_CV_MSE  Pct_change_from_previous
0            2        18.81                       NaN
1            3        18.93                      0.64
2            4        19.43                      2.64
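The percent-change column can be checked by hand from the three mean CV MSE values in the table. The snippet below is a small sketch in plain Python; the dictionary `mean_cv_mse` is a name introduced here and simply restates the reported values.

```python
# Mean CV MSE per polynomial degree, copied from the table above
mean_cv_mse = {2: 18.81, 3: 18.93, 4: 19.43}

degrees = sorted(mean_cv_mse)
pct_change = {}
for prev, curr in zip(degrees, degrees[1:]):
    # Relative change in mean CV MSE when moving from degree `prev` to `curr`
    change = (mean_cv_mse[curr] - mean_cv_mse[prev]) / mean_cv_mse[prev] * 100
    pct_change[curr] = round(change, 2)

# Degree with the lowest mean CV MSE
best_degree = min(mean_cv_mse, key=mean_cv_mse.get)

print(pct_change)   # {3: 0.64, 4: 2.64}
print(best_degree)  # 2
```

Both increases are positive, so each added degree raises the cross-validated error rather than lowering it.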


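The second-degree polynomial is the best option when fitting a polynomial with 'horsepower' as the predictor. It has the lowest mean CV MSE (18.81), and the error rises with each added degree: to 18.93 at degree 3 (+0.64%) and to 19.43 at degree 4 (+2.64%). The extra terms therefore add complexity without improving out-of-sample performance.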
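In the comparison above, the scaler is fit on the whole training set before cross-validation. A common variant puts the scaler, the polynomial expansion, and the regression into a single `Pipeline`, so that scaling is re-fit inside each fold. The sketch below illustrates that variant on synthetic data: the arrays `hp` and `mpg` are hypothetical stand-ins rather than the Auto dataset, so the printed errors will not match the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for horsepower and mpg (hypothetical, for illustration only)
rng = np.random.default_rng(1)
hp = rng.uniform(50, 230, size=200).reshape(-1, 1)
mpg = 40 - 0.25 * hp.ravel() + 0.0005 * hp.ravel() ** 2 + rng.normal(0, 2, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=1)

mean_cv_mse = {}
for degree in [2, 3, 4]:
    # All preprocessing lives inside the pipeline, so each fold fits its own scaler
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("regression", LinearRegression()),
    ])
    scores = -cross_val_score(model, hp, mpg, scoring="neg_mean_squared_error", cv=cv)
    mean_cv_mse[degree] = float(np.mean(scores))

for degree, mse in mean_cv_mse.items():
    print(f"degree {degree}: mean CV MSE = {mse:.2f}")
```

With a single standardized predictor the difference between the two approaches is usually small, but keeping every step in the pipeline avoids leaking information from the validation fold into the preprocessing.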

____________________________________________________________________________________________________________________________________________________________________

**b)** Using the training data, fit a polynomial equation with the degree selected in part (a).

In [None]:
poly_2_hp_pipeline = Pipeline([('scaler', StandardScaler()), ('poly_features', PolynomialFeatures(degree=2))])

In [None]:
X_poly_hp_train = poly_2_hp_pipeline.fit_transform(X_train_auto['horsepower'].values.reshape(-1, 1))

In [None]:
reg_out_poly2_hp_train = LinearRegression().fit(X_poly_hp_train, y_train_auto)

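Intercept and coefficients of the second-degree polynomial for horsepower. The first coefficient is 0 because it belongs to the constant column added by PolynomialFeatures; the constant term is carried by the intercept instead.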


In [None]:
# Intercept
print ( reg_out_poly2_hp_train.intercept_)

21.723598005694893


In [None]:
# Coefficients
print ( reg_out_poly2_hp_train.coef_)

[ 0.         -7.82522286  1.68055535]


#Equation
Predicted mpg = 21.72 - 7.83 * (standardized horsepower) + 1.68 * (standardized horsepower)^2
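To make the equation concrete, the sketch below evaluates it at a few standardized horsepower values, using the intercept and coefficients printed above. The function name `predict_mpg` and the variable `z_turning` are introduced here for illustration. Converting a raw horsepower value to standardized units would need the training mean and standard deviation stored in the fitted scaler (its `mean_` and `scale_` attributes), which the notebook does not print, so the example stays in standardized units.

```python
# Intercept and coefficients as printed by the fitted model above
intercept = 21.723598005694893
coef_linear = -7.82522286
coef_quadratic = 1.68055535

def predict_mpg(z):
    """Predicted mpg for a standardized horsepower value z."""
    return intercept + coef_linear * z + coef_quadratic * z ** 2

# Standardized horsepower at which the fitted parabola reaches its minimum
z_turning = -coef_linear / (2 * coef_quadratic)

for z in (-1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f} -> predicted mpg = {predict_mpg(z):.2f}")
print(f"turning point at z = {z_turning:.2f}")
```

A car with average horsepower (z = 0) is predicted at the intercept, about 21.72 mpg. The positive quadratic coefficient means the curve bottoms out near z = 2.33; predictions beyond that point start to rise again, which is a property of the quadratic form rather than evidence about real vehicles.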

____________________________________________________________________________________________________________________________________________________________________

**c)** Estimate the test error of the equation fit in part (b). The error is reported as the root MSE (RMSE) so that it is in the same units as mpg.

In [None]:
X_poly_hp_test = poly_2_hp_pipeline.transform(X_test_auto['horsepower'].values.reshape(-1, 1))

In [None]:
y_pred_poly2_hp_test = reg_out_poly2_hp_train.predict(X_poly_hp_test)

In [None]:
# Test root MSE
root_mean_squared_error(y_test_auto, y_pred_poly2_hp_test)

4.625642193019177

In [None]:
# Test RMSE as a percentage of the mean mpg (coefficient of variation of the RMSE); lower is better.
print(root_mean_squared_error(y_test_auto, y_pred_poly2_hp_test) / np.mean(y_auto) * 100)

19.72898702684769


____________________________________________________________________________________________________________________________________________________________________

#Conclusion

Polynomial regression revealed a nonlinear relationship between horsepower and fuel efficiency. Cross-validation identified the quadratic (degree 2) model as the most reliable: the degree 3 and degree 4 polynomials had slightly higher mean CV MSE, so they added complexity and overfitting risk without any gain. The final model describes mpg with an interpretable quadratic curve and achieves a test RMSE of about 4.6 mpg (≈20% of mean mpg). This suggests that a simple nonlinear adjustment captures most of the pattern: mpg decreases as horsepower rises, but the rate of decrease flattens for high-horsepower vehicles. The project highlights how polynomial regression can uncover meaningful nonlinear trends while remaining interpretable.