# Fuel Economy Data &mdash; Part 3

### This notebook contains model improvements and feature engineering.

In [1]:
import pandas as pd
import numpy as np
import xgboost as xgb
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pdpbox import pdp, info_plots
from category_encoders import TargetEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingClassifier
from utils import get_val_scores

In [131]:
df = pd.read_csv('./data/fuel_eco_clean.csv')

In [7]:
# these columns were identified in an earlier part of this series
cols_to_drop = ['City MPG (FT1)', 'Unrounded City MPG (FT1)','City MPG (FT2)','Unrounded City MPG (FT2)',
 'Highway MPG (FT1)','Unrounded Highway MPG (FT1)','Highway MPG (FT2)','Unrounded Highway MPG (FT2)',
 'Unadjusted City MPG (FT1)','Unadjusted Highway MPG (FT1)','Unadjusted City MPG (FT2)',
 'Unadjusted Highway MPG (FT2)','Combined MPG (FT1)','Unrounded Combined MPG (FT1)','Combined MPG (FT2)',
 'Unrounded Combined MPG (FT2)',
 'My MPG Data','Composite City MPG','Composite Highway MPG','Composite Combined MPG','City Range (FT1)',
 'Range (FT1)','City Range (FT1)','Highway Range (FT1)','City Range (FT2)','Highway Range (FT2)',
 'Range (FT2) Clean','Save or Spend (5 Year)','Tailpipe CO2 (FT1)','Annual Fuel Cost (FT1)',
 'Annual Consumption in Barrels (FT1)','Tailpipe CO2 in Grams/Mile (FT1)','Fuel Economy Score',
 'GHG Score','City Gasoline Consumption (CD)','City Electricity Consumption',
 'Highway Gasoline Consumption (CD)','Highway Electricity Consumption','Combined Electricity Consumption',
 'Combined Gasoline Consumption (CD)','Annual Consumption in Barrels (FT1)','Annual Consumption in Barrels (FT2)',
 'Fuel Type','Fuel Type 1','Fuel Type 2','Alternative Fuel/Technology','Gas Guzzler Tax']

### Feature Engineering
<br>
The mean validation score once we removed all columns that give away the answer was 0.922756898. This is quite high, so there's not much room to improve. Nevertheless, I want to try some feature engineering.

##### Vehicle Volume&mdash; 2-Door Volume and 4-Door Volume
There aren't too many improvements I can think of, but maybe some value that represents the entire vehicle volume (Passenger Volume + Luggage Volume) could provide some useful information to the model. This is especially true because the EPA classifies cars differently from trucks, SUVs, and vans. Cars are split into classes (e.g. Subcompact) based on volume, whereas the other vehicles are classified according to gross vehicle weight rating (GVWR). So having one value that can apply to all classes of vehicles might be useful.
<br>
<br>
According to my data explorations below, it looks like about 2100 vehicles have values for both 2D and 4D volumes, so I can't just add all four columns together. So I'll add two new columns to the dataset: one that sums 2D Passenger and Luggage, and one that sums 4D Passenger and Luggage volumes.

In [128]:
df[(df['2D Passenger Volume'] > 0) & (df['4D Passenger Volume'] > 0)][['2D Passenger Volume',
                                                                      '4D Passenger Volume']]

Unnamed: 0,2D Passenger Volume,4D Passenger Volume
1319,76,84
1320,76,84
1321,76,84
1322,76,84
1390,76,84
...,...,...
37315,91,91
37322,91,91
37323,91,91
37324,91,91


In [129]:
df[(df['2D Luggage Volume'] > 0) & (df['4D Luggage Volume'] > 0)][['2D Luggage Volume',
                                                                      '4D Luggage Volume']]

Unnamed: 0,2D Luggage Volume,4D Luggage Volume
1319,12,13
1320,12,13
1321,12,13
1322,12,13
1390,12,11
...,...,...
37315,11,11
37322,11,11
37323,11,11
37324,11,11


In [132]:
df['2D Total Volume'] = df['2D Passenger Volume'] + df['2D Luggage Volume']
df['4D Total Volume'] = df['4D Passenger Volume'] + df['4D Luggage Volume']
df.head()

Unnamed: 0,Vehicle ID,Year,Make,Model,Class,Drive,Transmission,Transmission Descriptor,Engine Index,Engine Descriptor,...,Composite Combined MPG,Range (FT1),City Range (FT1),Highway Range (FT1),City Range (FT2),Highway Range (FT2),Range (FT2) Clean,Manufacturer Code Clean,2D Total Volume,4D Total Volume
0,26587,1984,Alfa Romeo,GT V6 2.5,Minicompact Cars,Unknown,Manual 5-Speed,Unknown,9001,(FFS),...,0,0,0.0,0.0,0.0,0.0,0,CRX,81,0
1,27705,1984,Alfa Romeo,GT V6 2.5,Minicompact Cars,Unknown,Manual 5-Speed,Unknown,9005,(FFS) CA model,...,0,0,0.0,0.0,0.0,0.0,0,CRX,81,0
2,26561,1984,Alfa Romeo,Spider Veloce 2000,Two Seaters,Unknown,Manual 5-Speed,Unknown,9002,(FFS),...,0,0,0.0,0.0,0.0,0.0,0,CRX,0,0
3,27681,1984,Alfa Romeo,Spider Veloce 2000,Two Seaters,Unknown,Manual 5-Speed,Unknown,9006,(FFS) CA model,...,0,0,0.0,0.0,0.0,0.0,0,CRX,0,0
4,27550,1984,AM General,DJ Po Vehicle 2WD,Special Purpose Vehicle 2WD,2-Wheel Drive,Automatic 3-Speed,Unknown,1830,(FFS),...,0,0,0.0,0.0,0.0,0.0,0,Unknown,0,0


In [133]:
mod = xgb.XGBRegressor(booster='gbtree',
                 n_estimators=100,
                 max_depth=3, learning_rate=.1,
                 nthread=-1, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        colsample_bynode=1, n_jobs=1, scale_pos_weight=1,
                 reg_alpha=0, reg_lambda=1,
                 base_score=0.5, seed=0, missing=None)
pipe = make_pipeline(TargetEncoder(), mod)



In [134]:
X = df.drop(cols_to_drop, axis=1)
y = df['Combined MPG (FT1)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, random_state=2021)

In [137]:
val_scores = cross_val_score(pipe, X=X_train, y=y_train, cv=5)
np.mean(val_scores)

0.9228215769245264

This barely made an impact: the mean validation score moved from 0.92276 to 0.92282, a gain of less than 0.0001. I think these columns can be dropped.
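
One way to judge whether a gain this small means anything is to compare it with how much the scores vary from fold to fold. The sketch below uses made-up fold scores (not the ones from this notebook) and a hypothetical helper, `summarize_gain`; it assumes both runs were scored on the same folds.

```python
import numpy as np

def summarize_gain(baseline_scores, new_scores):
    """Compare two sets of cross-validation scores computed on the same folds."""
    baseline = np.asarray(baseline_scores, dtype=float)
    new = np.asarray(new_scores, dtype=float)
    diffs = new - baseline  # per-fold change in score
    return {
        "mean_gain": diffs.mean(),
        "fold_std": baseline.std(ddof=1),  # spread of the baseline across folds
        "folds_improved": int((diffs > 0).sum()),
    }

# Illustrative numbers only; these are NOT the fold scores from this notebook.
baseline = [0.921, 0.925, 0.919, 0.924, 0.923]
with_feature = [0.921, 0.926, 0.919, 0.924, 0.924]

summary = summarize_gain(baseline, with_feature)
print(summary)
```

If the mean gain is much smaller than the fold-to-fold spread, as in this toy case, the new feature is probably not worth keeping.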

In [144]:
df = df.drop(['2D Total Volume', '4D Total Volume'], axis=1)

##### Supercharger and Turbocharger
Research on [Wikipedia](https://en.wikipedia.org/wiki/Supercharger) reveals that both superchargers and turbochargers increase the efficiency of internal combustion engines by forcing more air into the engine. They differ in how they are powered: superchargers are driven mechanically by the engine, while turbochargers are driven by a turbine in the exhaust stream.
<br>
<br>
Maybe the presence of either of these features could predict increased engine efficiency and therefore higher MPG. I'll encode such vehicles with a 1 in the new column. Vehicles without either of these features will get a 0.

In [150]:
df['Has Compressor'] = np.where((df['Supercharger'] == 'S') | (df['Turbocharger'] == 'T'),
                               1, 0)
df['Has Compressor'].mean()

0.15488153648361452

This new column shows that only 15% of vehicles have either feature, so I'm not expecting big gains in my model's scores, but I'll try anyway:

In [154]:
mod = xgb.XGBRegressor(booster='gbtree',
                 n_estimators=100,
                 max_depth=3, learning_rate=.1,
                 nthread=-1, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                        colsample_bynode=1, n_jobs=1, scale_pos_weight=1,
                 reg_alpha=0, reg_lambda=1,
                 base_score=0.5, seed=0, missing=None)
pipe = make_pipeline(TargetEncoder(), mod)

In [156]:
X = df.drop(cols_to_drop, axis=1)
y = df['Combined MPG (FT1)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, random_state=2021)

In [157]:
val_scores = cross_val_score(pipe, X=X_train, y=y_train, cv=5)
np.mean(val_scores)

0.9227644568697093

This helped even less than the previous addition, so the column will be dropped:

In [158]:
df = df.drop(['Has Compressor'], axis=1)

### Model Parameter Tuning

There isn't much room to improve the model because the validation scores are already high, but I do want to determine what the best parameters are.

In [160]:
learning_rates = [.1, .15]
colsample_bytrees = [.9, 1]

X = df.drop(cols_to_drop, axis=1)
y = df['Combined MPG (FT1)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, random_state=2021)

for depth in range(3,6,1):
    for learn_rate in learning_rates:
        for n_estimator in range(50, 200, 50):
            for col_samp in colsample_bytrees:
                mod = xgb.XGBRegressor(booster='gbtree',
                 n_estimators= n_estimator,
                 max_depth= depth, learning_rate= learn_rate,
                 nthread=-1, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree= col_samp, colsample_bylevel=1,
                        colsample_bynode=1, n_jobs=1, scale_pos_weight=1,
                 reg_alpha=0, reg_lambda=1,
                 base_score=0.5, seed=0, missing=None)
                pipe = make_pipeline(TargetEncoder(), mod)
                
                val_scores = cross_val_score(pipe, X=X_train, y=y_train, cv=5)
                print(f"Depth: {depth}. Learning Rate: {learn_rate}. Estimators: {n_estimator}. Columns Sampled: {col_samp}")
                print(f"Mean Validation Score: {np.mean(val_scores)}")
                print("\n")
            
        

Depth: 3. Learning Rate: 0.1. Estimators: 50. Columns Sampled: 0.9
Mean Validation Score: 0.9083722837962236


Depth: 3. Learning Rate: 0.1. Estimators: 50. Columns Sampled: 1
Mean Validation Score: 0.9090683809066327


Depth: 3. Learning Rate: 0.1. Estimators: 100. Columns Sampled: 0.9
Mean Validation Score: 0.9218412773345609


Depth: 3. Learning Rate: 0.1. Estimators: 100. Columns Sampled: 1
Mean Validation Score: 0.9227568988672396


Depth: 3. Learning Rate: 0.1. Estimators: 150. Columns Sampled: 0.9
Mean Validation Score: 0.9296366903315822


Depth: 3. Learning Rate: 0.1. Estimators: 150. Columns Sampled: 1
Mean Validation Score: 0.930486725761565


Depth: 3. Learning Rate: 0.15. Estimators: 50. Columns Sampled: 0.9
Mean Validation Score: 0.9166601589978193


Depth: 3. Learning Rate: 0.15. Estimators: 50. Columns Sampled: 1
Mean Validation Score: 0.9170277276831543


Depth: 3. Learning Rate: 0.15. Estimators: 100. Columns Sampled: 0.9
Mean Validation Score: 0.9292022960800524


...

Looks like the best parameter settings are:

- Depth: 5. 
- Learning Rate: 0.15. 
- Estimators: 150. 
- Columns Sampled: 0.9

<br>
Mean Validation Score: 0.9593254832151052
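
The nested loops above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same kind of exhaustive search and records the best combination. This is a minimal sketch rather than the code used in this notebook: it runs on synthetic data, and `GradientBoostingRegressor` and `StandardScaler` stand in for `XGBRegressor` and `TargetEncoder` so the example needs only scikit-learn. The names `X_demo`, `y_demo`, and `demo_pipe` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would pass X_train and y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as "<step>__<parameter>".
demo_pipe = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=0))
param_grid = {
    "gradientboostingregressor__max_depth": [2, 3],
    "gradientboostingregressor__learning_rate": [0.1, 0.15],
    "gradientboostingregressor__n_estimators": [50, 100],
}

search = GridSearchCV(demo_pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean validation score: {search.best_score_:.4f}")
```

The full table of scores is available in `search.cv_results_`, which avoids reading the best combination off printed output by eye.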


In [161]:
mod = xgb.XGBRegressor(booster='gbtree',
                 n_estimators= 150,
                 max_depth= 5, learning_rate= .15,
                 nthread=-1, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree= 0.9, colsample_bylevel=1,
                        colsample_bynode=1, n_jobs=1, scale_pos_weight=1,
                 reg_alpha=0, reg_lambda=1,
                 base_score=0.5, seed=0, missing=None)
pipe = make_pipeline(TargetEncoder(), mod)

In [162]:
X = df.drop(cols_to_drop, axis=1)
y = df['Combined MPG (FT1)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, random_state=2021)

In [163]:
pipe.fit(X_train, y_train);
feats = pd.DataFrame({ 'Importance': pipe.steps[1][1].feature_importances_,
                      'Column': X_train.columns})
feats.sort_values(by='Importance', ascending=False)

Unnamed: 0,Importance,Column
11,0.285692,Engine Displacement
3,0.235207,Model
10,0.182448,Engine Cylinders
33,0.084034,Hours to Charge (240V)
28,0.031845,Electric Motor
1,0.026519,Year
14,0.019902,City Utility Factor
27,0.013232,Start Stop Technology
6,0.011882,Transmission
8,0.011159,Engine Index


In [164]:
pipe.score(X_test, y_test)

0.9700961008051511

In [165]:
df['Predicted Combined MPG (FT1)'] = pipe.predict(X)
df[['Year', 'Make', 'Model', 'Combined MPG (FT1)', 'Predicted Combined MPG (FT1)']].head()

Unnamed: 0,Year,Make,Model,Combined MPG (FT1),Predicted Combined MPG (FT1)
0,1984,Alfa Romeo,GT V6 2.5,20,20.045708
1,1984,Alfa Romeo,GT V6 2.5,20,20.087807
2,1984,Alfa Romeo,Spider Veloce 2000,21,22.520048
3,1984,Alfa Romeo,Spider Veloce 2000,21,22.586445
4,1984,AM General,DJ Po Vehicle 2WD,17,17.648951


In [166]:
fig1 = px.scatter(df, x='Combined MPG (FT1)', y='Predicted Combined MPG (FT1)', 
                  title='Actual vs. Predicted Combined MPG',
                 trendline='ols')
fig1.show()

There are a few outliers, but the predictions are almost scarily accurate. Keep in mind that this plot covers every row in `X`, including the rows the model was trained on. Just by glancing at the regression line, it looks like the model overpredicts by more than it underpredicts. I would suspect the model is overfit, but the test score (0.970) was at least as high as the mean validation score (0.959).

The next question I could answer is whether the model systematically over- or underestimates values. What determines whether it over- or underestimates?
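
Before modeling the direction of the error, a quick summary of the signed error can show whether there is any systematic bias at all. The sketch below uses toy values and a hypothetical helper, `summarize_bias`; it follows the same sign convention as the next cell (actual minus predicted, so a negative error is an overprediction).

```python
import pandas as pd

def summarize_bias(actual, predicted):
    """Summarize signed error, defined as actual minus predicted."""
    error = pd.Series(actual, dtype=float) - pd.Series(predicted, dtype=float)
    return {
        "mean_error": error.mean(),
        "share_overpredicted": (error < 0).mean(),   # prediction above actual
        "share_underpredicted": (error > 0).mean(),  # prediction below actual
    }

# Toy values for illustration only, not taken from the dataset.
actual = [20, 21, 17, 30, 25]
predicted = [20.5, 22.0, 17.5, 29.0, 25.0]

bias = summarize_bias(actual, predicted)
print(bias)
```

A mean error close to zero with roughly equal shares on each side would suggest there is no systematic over- or underestimation.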

In [167]:
df['Error'] = df['Combined MPG (FT1)'] - df['Predicted Combined MPG (FT1)']

In [168]:
df['Error Encoded'] = np.where(df['Error'] < 0, 0, 1)

In [176]:
mod_error = xgb.XGBRegressor(booster='gbtree',
                 n_estimators= 150,
                 max_depth= 5, learning_rate= .15,
                 nthread=-1, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree= 0.9, colsample_bylevel=1,
                        colsample_bynode=1, n_jobs=1, scale_pos_weight=1,
                 reg_alpha=0, reg_lambda=1,
                 base_score=0.5, seed=0, missing=None)
pipe_error = make_pipeline(TargetEncoder(), mod_error)

In [177]:
X = df.drop(cols_to_drop+['Error', 'Error Encoded', 'Predicted Combined MPG (FT1)'], axis=1)
y = df['Error Encoded']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    shuffle=True, random_state=2021)

In [178]:
pipe_error.fit(X_train, y_train);
feats = pd.DataFrame({ 'Importance': pipe_error.steps[1][1].feature_importances_,
                      'Column': X_train.columns})
feats.sort_values(by='Importance', ascending=False)

Unnamed: 0,Importance,Column
3,0.253202,Model
9,0.076564,Engine Descriptor
12,0.0464,Turbocharger
10,0.043751,Engine Cylinders
11,0.037757,Engine Displacement
7,0.033825,Transmission Descriptor
6,0.03359,Transmission
35,0.032073,Manufacturer Code Clean
1,0.03188,Year
2,0.029759,Make


**Model** was the most important factor affecting whether my model overpredicts MPG. But how do other categories affect how variable my error is?

In [184]:
df.groupby('Engine Descriptor')['Error'].std().sort_values(ascending=False)

Engine Descriptor
(CALIF) CA model                    3.073261
Lead Acid                           2.750082
NiMH                                2.496296
Gasoline only                       2.162539
Hybrid; PR                          2.032521
                                      ...   
Stop-Start                               NaN
V-6        (FFS)      (S-CHARGE)         NaN
V-6 FFS                                  NaN
V8                                       NaN
Z/28                                     NaN
Name: Error, Length: 545, dtype: float32
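
The `NaN` entries above appear because pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a group with only one row. A small sketch on toy data shows how reporting the group size next to the spread makes this visible; the `min_rows` threshold is an assumption chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only: group "B" has a single row.
toy = pd.DataFrame({
    "Engine Descriptor": ["A", "A", "A", "B"],
    "Error": [0.5, -0.5, 1.0, 2.0],
})

spread = (
    toy.groupby("Engine Descriptor")["Error"]
    .agg(["std", "count"])
    .sort_values("std", ascending=False)
)
print(spread)

# Keep only groups with enough rows for the spread to be meaningful.
min_rows = 2  # assumed threshold
reliable = spread[spread["count"] >= min_rows]
print(reliable)
```

Small groups can also produce very large or very small spreads by chance, so the count is worth checking for the top entries as well.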

In [185]:
df.groupby('Turbocharger')['Error'].std().sort_values(ascending=False)

Turbocharger
T     1.149616
NT    0.996491
Name: Error, dtype: float32

In [186]:
df.groupby('Fuel Type')['Error'].std().sort_values(ascending=False)

Fuel Type
Electricity                    3.381321
Regular Gas and Electricity    2.024942
Diesel                         1.662870
CNG                            1.356145
Gasoline or propane            1.347546
Premium and Electricity        1.096220
Regular                        0.973239
Premium                        0.958811
Premium Gas or Electricity     0.844647
Midgrade                       0.819088
Premium or E85                 0.668996
Gasoline or E85                0.628353
Regular Gas or Electricity     0.515235
Gasoline or natural gas        0.447561
Name: Error, dtype: float32

In [190]:
df.groupby('Make')['Error'].std().sort_values(ascending=False)

Make
BYD                                   5.034926
Tesla                                 2.548594
Saleen                                1.801079
Dacia                                 1.716689
Grumman Olson                         1.520249
                                        ...   
S and S Coach Company  E.p. Dutton         NaN
Shelby                                     NaN
Superior Coaches Div E.p. Dutton           NaN
Vixen Motor Company                        NaN
Volga Associated Automobile                NaN
Name: Error, Length: 133, dtype: float32

In [191]:
df.groupby('Drive')['Error'].std().sort_values(ascending=False)

Drive
Unknown                       1.392894
2-Wheel Drive                 1.272439
All-Wheel Drive               1.124210
Front-Wheel Drive             1.121799
Rear-Wheel Drive              0.957333
4-Wheel Drive                 0.940725
Part-time 4-Wheel Drive       0.810321
4-Wheel or All-Wheel Drive    0.767708
Name: Error, dtype: float32

In [192]:
df.groupby('Engine Cylinders')['Error'].std().sort_values(ascending=False)

Engine Cylinders
0     3.353467
3     1.621734
4     1.190312
10    0.998502
6     0.888883
5     0.809266
8     0.798022
2     0.697512
12    0.690079
16    0.133298
Name: Error, dtype: float32

In [193]:
df.groupby('Class')['Error'].std().sort_values(ascending=False)

Class
Small Pickup Trucks                   1.428210
Compact Cars                          1.191542
Small Pickup Trucks 2WD               1.167969
Subcompact Cars                       1.136934
Small Sport Utility Vehicle 4WD       1.112278
Two Seaters                           1.105996
Minicompact Cars                      1.087710
Midsize Cars                          1.086846
Small Station Wagons                  1.032867
Midsize Station Wagons                1.030151
Small Sport Utility Vehicle 2WD       1.010669
Standard Sport Utility Vehicle 4WD    1.008568
Vans Passenger                        0.950384
Large Cars                            0.942736
Special Purpose Vehicle 2WD           0.920494
Sport Utility Vehicle - 2WD           0.916927
Standard Pickup Trucks 2WD            0.882467
Standard Pickup Trucks/2wd            0.876177
Midsize-Large Station Wagons          0.858878
Small Pickup Trucks 4WD               0.845485
Special Purpose Vehicles              0.837236
...

My model has the most trouble with 0-cylinder vehicles (i.e., electric vehicles), vehicles with an unknown drive type, and vehicles made by BYD and Tesla.

<br>
It makes sense that the model struggles with electric vehicles because they don't have cylinders, and the model puts a lot of weight on the number of engine cylinders. The problem is not the value 0 itself: because every electric vehicle shares that value, the model cannot use the column to tell them apart.
<br>
<br>
BYD has only 4 vehicles in the dataset. The small sample size plus the fact that they're all electric probably contributes to the higher error variability. The Tesla errors also make sense because Tesla makes electric vehicles exclusively, and we've already established that the model performs somewhat worse for these types of vehicles.

In [196]:
df[df['Fuel Type'] == 'Electricity']['Engine Cylinders'].describe()

count    133.0
mean       0.0
std        0.0
min        0.0
25%        0.0
50%        0.0
75%        0.0
max        0.0
Name: Engine Cylinders, dtype: float64

In [195]:
df[df['Make'] == 'BYD']['Fuel Type']

31019    Electricity
32183    Electricity
33350    Electricity
35836    Electricity
Name: Fuel Type, dtype: object

For future projects, I would expand this work by joining on other datasets, like sales data. I could use sales data to measure the predicted environmental impact for given vehicle types. I would also want to work more on models that can forecast, taking into account year-over-year changes in vehicle MPG.
<br>
<br>
I also found a dataset on FuelEconomy.gov with 2018 vehicle data. I would love to bring in that new data and predict the MPGs based on the model that learned from 1984-2017 data. It would be an excellent test to see if vehicle MPG can be accurately forecast based on previous models. Unfortunately, the dataset is in an extremely unusable state and would require significantly more work to clean than the dataset I got from Kaggle.
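
Even without the 2018 file, the forecasting idea could be approximated on the existing data by holding out the most recent model years instead of a random 20%. This is a sketch under assumptions, not something done in this notebook: `split_by_year` is a hypothetical helper, the rows are toy values, and the cutoff year is arbitrary.

```python
import pandas as pd

def split_by_year(frame, cutoff_year, year_col="Year"):
    """Train on rows up to and including cutoff_year, test on later rows."""
    train = frame[frame[year_col] <= cutoff_year]
    test = frame[frame[year_col] > cutoff_year]
    return train, test

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2017, 2018],
    "Combined MPG (FT1)": [24, 25, 26, 27, 28],
})

train, test = split_by_year(toy, cutoff_year=2016)
print(len(train), len(test))
```

Scoring the fitted pipeline on the later years would give a rough idea of how well it carries forward to unseen model years.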