Does Owning an Energy Efficient Vehicle Lead to Longer Driving Distances?

This project explores the question "Does owning an energy efficient vehicle lead to longer driving distances?" Using data from the 2017 National Household Travel Survey, I examine how a household's driving pattern correlates with owning an energy efficient vehicle, a category that includes hybrid electric vehicles (HEV), plug-in hybrid electric vehicles (PHEV), electric vehicles (EV), and other alternative fuel vehicles.

The question could be of interest to policy makers who provide financial incentives for purchasing energy efficient vehicles. Policy makers promote these vehicles in the hope of reducing the environmental impact of driving. However, if the notorious rebound effect exists, meaning "owning a green vehicle leads to more driving", the environmental benefit of a green vehicle is discounted. It would therefore be beneficial for policy makers to detect and quantify such a rebound effect.

A main difficulty in quantifying the rebound effect is selection bias: households that anticipate driving long distances have a greater incentive to purchase energy efficient vehicles because of the fuel cost savings. Ignoring this issue would over-estimate the rebound effect. To address it, I use propensity score matching: first pair up households that have similar characteristics and are equally likely to purchase an energy efficient vehicle, then compare their driving distances. Because the paired households are, by construction, equally likely to purchase an energy efficient vehicle, the purchase decision is quasi-random within a pair, which overcomes the selection bias problem.

The dataset contains information on:

  • households' size, income, state, urban/rural area, number of adults, number of vehicles, etc.
  • vehicles' fuel type, size, annual mileage, etc.

The following code first imports and cleans the dataset, then uses propensity score matching to estimate how much extra mileage is caused by owning a green vehicle.

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
import warnings
warnings.filterwarnings('ignore')
from pymatch.Matcher import Matcher
import statsmodels.api as sm
import seaborn as sns

Import and clean dataset

data = pd.read_csv('/Users/chengchen/Dropbox/NHTS_2018/data/NHTS2017/vehpub.csv')
data.head()
HOUSEID VEHID VEHYEAR VEHAGE MAKE MODEL FUELTYPE VEHTYPE WHOMAIN OD_READ ... HH_CBSA HBHTNRNT HBPPOPDN HBRESDN HTEEMPDN HTHTNRNT HTPPOPDN HTRESDN SMPLSRCE WTHHFIN
0 30000007 1 2007 10 49 49032 1 1 3 69000 ... XXXXX 20 1500 750 750 50 1500 750 2 187.31432
1 30000007 2 2004 13 49 49442 1 2 -8 164000 ... XXXXX 20 1500 750 750 50 1500 750 2 187.31432
2 30000007 3 1998 19 19 19014 1 1 1 120000 ... XXXXX 20 1500 750 750 50 1500 750 2 187.31432
3 30000007 4 1997 20 19 19021 1 1 2 -88 ... XXXXX 20 1500 750 750 50 1500 750 2 187.31432
4 30000007 5 1993 24 20 20481 1 4 2 300000 ... XXXXX 20 1500 750 750 50 1500 750 2 187.31432

5 rows × 49 columns

# drop observations with missing or invalid values
data = data[data.FUELTYPE<4]   # keep fuel types 1 (gas), 2 (diesel), 3 (hybrid/electric/alternative)
data = data[data.FUELTYPE>0] 
data = data[data.ANNMILES>0] 
data = data[data.HHFAMINC>0] 
data = data[data.HOMEOWN>0] 
data = data[data.VEHAGE>0] 
data = data[data.VEHTYPE>0]
# map each income category in the survey to its approximate midpoint dollar amount
income_dic = {1:5000, 2:12500, 3:20000, 4:30000, 5:42500, 6:62500, 7:87500, 8:112500, 9:137500, 10:175000, 11:225000}
data['income'] = data['HHFAMINC'].map(income_dic)
# 1: owns home, 0: does not own home
home_dic = {1:1, 2:0, 97:0}
data['homeown'] = data['HOMEOWN'].map(home_dic)
urban_dic = {1:'urban_area',2:'urban_cluster',3:'near_urban',4:'not_urban'}
data['urban'] = data['URBAN'].map(urban_dic)
vehtype_dic = {1: 'car',2: 'van',3: 'SUV',4: 'pickup',5: 'truck',6: 'RV',7: 'motorcycle',97: 'else'}
data['vehtype'] = data['VEHTYPE'].map(vehtype_dic)
fueltype_dic = {1: 'gas', 2: 'diesel', 3: 'hybrid/electric/alternative'}
data['fueltype'] = data['FUELTYPE'].map(fueltype_dic)

Summary Statistics

# Frequency of FuelType
data.groupby('fueltype')['fueltype'].count()
fueltype
diesel                           5362
gas                            170384
hybrid/electric/alternative      4966
Name: fueltype, dtype: int64
# Average Driving Distance Grouped by FuelType
data.groupby('fueltype')['ANNMILES'].mean()
fueltype
diesel                         11247.242260
gas                             9620.500235
hybrid/electric/alternative    12308.237213
Name: ANNMILES, dtype: float64

The table above shows the mean annual mileage by fuel type. The lower the fuel cost per mile, the more the vehicle tends to be driven.
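
The same comparison can be visualized directly from the dataframe; a minimal sketch using the matplotlib import above:

# bar chart of mean annual miles by fuel type
mean_miles = data.groupby('fueltype')['ANNMILES'].mean().sort_values()
ax = mean_miles.plot(kind='barh')
ax.set_xlabel('mean annual miles')
plt.show()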

More Summary Statistics

htype_dic = {1:  'biodiesel', 2:  'plug-in hybrid', 3:  'electric', 4:  'hybrid', -9: 'NA', -8: 'NA', -1: 'NA', 97: 'NA'}
data['hfuel'] = data['HFUEL'].map(htype_dic)
data.groupby(['fueltype','hfuel'])['ANNMILES'].mean()
fueltype                     hfuel         
diesel                       NA                11247.242260
gas                          NA                 9620.500235
hybrid/electric/alternative  NA                12387.113208
                             biodiesel         12541.944444
                             electric           8235.281720
                             hybrid            12874.606100
                             plug-in hybrid    11516.482587
Name: ANNMILES, dtype: float64

Within the energy efficient category, the annual mileage ranking is:
hybrid > biodiesel > plug-in hybrid > electric
(Limited battery range probably plays a role in holding down the mileage of pure electric vehicles.)

area_avg_mile = data.groupby(['HHSTATE','HH_CBSA'])['ANNMILES'].mean().to_frame()
# average annual driving distance of each region (State + Core Based Statistical Area)
data = pd.merge(data,area_avg_mile, how = 'right', left_on = ['HHSTATE','HH_CBSA'], right_index = True)
# add as new column in the dataframe
data = data.rename(columns = {'ANNMILES_x': 'ANNMILES', 'ANNMILES_y': 'area_avg_mile'})
data['relative_mile'] = data['ANNMILES']/data['area_avg_mile']
# relative driving mileage, compared to the average level of the local area
data.groupby(['fueltype','hfuel'])['relative_mile'].mean()
fueltype                     hfuel         
diesel                       NA                1.173388
gas                          NA                0.985955
hybrid/electric/alternative  NA                1.284829
                             biodiesel         1.306946
                             electric          0.888567
                             hybrid            1.351362
                             plug-in hybrid    1.223351
Name: relative_mile, dtype: float64

Energy efficient vehicles are driven more than their local average levels, which further confirms the driving pattern. So far we have only looked at raw data patterns without addressing the selection bias issue; the next section tackles it with propensity score matching.

Propensity Score Matching

The following section will match the treatment/control groups:

  • treatment group: hybrid/electric/alternative vehicles
  • control group: gasoline/diesel vehicles

The two groups are matched on both household and vehicle characteristics, using the Python package pymatch.

Calculating Propensity Score

trt_dic = {1: False, 2: False, 3: True}  # treated = hybrid/electric/alternative fuel vehicle
data['treatment'] = data['FUELTYPE'].map(trt_dic)
fields = ["income", "homeown", "urban", "vehtype", "VEHAGE", "HHSIZE", "HHSTATE", "HHVEHCNT", "treatment", "ANNMILES"] 
data_match = data[fields]
treatment = data_match[data_match.treatment == 1]
control = data_match[data_match.treatment == 0]
m = Matcher(treatment, control, yvar = "treatment", exclude = ["ANNMILES"])
Formula:
treatment ~ income+homeown+urban+vehtype+VEHAGE+HHSIZE+HHSTATE+HHVEHCNT
n majority: 175746
n minority: 4966
np.random.seed(1)
m.fit_scores(balance = True, nmodels = 100)
Fitting Models on Balanced Samples: 100\100
Average Accuracy: 76.45%

This step estimates the propensity score, i.e., the probability of purchasing a clean energy vehicle.
The treatment group (minority) has 4,966 observations while the control group (majority) has 175,746, so the sample is heavily imbalanced.
balance = True makes pymatch fit each logistic model on a sample with equal numbers of treatment and control observations.
nmodels = 100 repeats this on 100 random balanced draws, so that collectively the models see many more data points from the control (majority) group.

The average accuracy is 76.45%, suggesting the covariates have real predictive power for the purchase decision.
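
For intuition, here is a rough sketch of what fitting on balanced samples looks like, written with the pandas/statsmodels imports above. This only illustrates the idea, not pymatch's internal code, and the number of models is reduced to keep it quick.

# a rough sketch of the idea behind balance = True and nmodels (not pymatch's internal code):
# repeatedly fit a logistic model on samples with equal numbers of treated and control obs
X_all = pd.get_dummies(data_match.drop(columns=["ANNMILES", "treatment"]), drop_first=True)
y_all = data_match["treatment"].astype(int)
accuracies = []
for i in range(10):  # pymatch used nmodels = 100; 10 keeps the sketch quick
    ctrl_idx = control.sample(len(treatment), random_state=i).index
    idx = treatment.index.union(ctrl_idx)
    X = sm.add_constant(X_all.loc[idx].astype(float))
    y = y_all.loc[idx]
    fit = sm.Logit(y, X).fit(disp=0)
    accuracies.append(((fit.predict(X) > 0.5) == y).mean())
print('average in-sample accuracy: %.1f%%' % (100 * np.mean(accuracies)))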

m.predict_scores() # calculate p-score
m.plot_scores()

[figure: propensity score distributions, treatment vs. control]

The step above plots the propensity score distributions of the treatment and control groups.
At each score there is a positive probability of belonging to either group (common support).
The treatment group tends to have much higher scores than the control group (separability).
Both properties support the use of propensity score matching.
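
The plot can also be drawn by hand with the seaborn import above, assuming (as pymatch's plot_scores suggests) that the fitted scores are stored in m.data under a 'scores' column:

# equivalent plot drawn by hand with seaborn, under the assumption that
# m.data carries the fitted propensity scores in a 'scores' column
scores = m.data
sns.kdeplot(scores.loc[scores.treatment == 0, 'scores'], label='control')
sns.kdeplot(scores.loc[scores.treatment == 1, 'scores'], label='treatment')
plt.xlabel('propensity score')
plt.legend()
plt.show()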

Setting Matching Threshold

m.tune_threshold(method='random')

[figure: share of data retained vs. matching threshold]

Tune the threshold for matching: from the graph, a threshold of 0.0003 retains 100% of the data.

m.match(method = "min", nmatches = 1, threshold = 0.0003)
# matching the treatment data points with control data points

Preview Matched Data

m.matched_data.sort_values("match_id").head(6)
income homeown urban vehtype VEHAGE HHSIZE HHSTATE HHVEHCNT treatment ANNMILES scores match_id record_id
0 62500 1 not_urban car 7 2 NC 2 1 15000 0.578314 0 0
12165 62500 1 not_urban car 7 2 NC 2 0 10000 0.578314 0 12165
1 112500 1 not_urban car 12 3 NC 7 1 8571 0.368433 1 1
148095 20000 1 urban_cluster car 11 2 AZ 5 0 12000 0.368417 1 148095
2 112500 1 urban_area car 3 2 NC 2 1 8000 0.759758 2 2
143971 112500 1 urban_area car 3 2 NC 2 0 10500 0.759758 2 143971
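
Since each treated record is paired with its closest control, a control record can in principle be reused for several treated records. A quick check of how often that happens, using only pandas and the record_id column shown in the preview above:

# how often is each control record reused as a match?
matched = m.matched_data
control_matches = matched[matched.treatment == 0]
reuse_counts = control_matches.groupby('record_id').size().value_counts().sort_index()
print(reuse_counts)  # index = times a control record was used, value = number of such records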

Validity of Matching

categorical_results = m.compare_categorical(return_table = True)
categorical_results
var before after
0 urban 0.0 0.119381
1 vehtype 0.0 0.752881
2 HHSTATE 0.0 0.960683

The results above are p-values from chi-square tests of independence, before and after matching. After matching we want these p-values to be > 0.05, i.e., we fail to reject the null hypothesis that the frequencies of each categorical variable's values are independent of treatment/control group membership.
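
For reference, the "after" column for a single categorical variable can be reproduced, up to implementation details, with a standard chi-square test of independence from scipy; the exact p-value may differ slightly from pymatch's grouped test.

# chi-square test of independence between vehicle type and group membership on the matched data
from scipy.stats import chi2_contingency

table = pd.crosstab(m.matched_data['vehtype'], m.matched_data['treatment'])
chi2, pvalue, dof, expected = chi2_contingency(table)
print('vehtype vs. treatment after matching: p-value = %.3f' % pvalue)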

cc = m.compare_continuous(return_table = True)
cc
var ks_before ks_after grouped_chisqr_before grouped_chisqr_after std_median_diff_before std_median_diff_after std_mean_diff_before std_mean_diff_after
0 income 0.0 0.632 0.0 0.441 0.416377 0.00000 0.494024 0.005524
1 homeown 0.0 0.224 1.0 1.000 0.000000 0.00000 0.076510 -0.023737
2 VEHAGE 0.0 0.488 0.0 0.567 -0.501484 0.24207 -0.581891 -0.009603
3 HHSIZE 0.0 0.116 1.0 1.000 0.000000 0.00000 0.018855 0.048497
4 HHVEHCNT 0.0 0.874 1.0 1.000 0.000000 0.00000 -0.176163 0.011618

Tests included: the Kolmogorov-Smirnov goodness-of-fit test (KS test) and the chi-square distance. After matching we want the p-values from both the KS test and the grouped permutation of the chi-square distance to be > 0.05, and they both are.
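
The KS test itself is easy to run directly for any one covariate on the matched sample with scipy; this produces the kind of number reported in the ks_after column, though not necessarily the identical value.

# two-sample KS test for income, treatment vs. control, on the matched data
from scipy.stats import ks_2samp

matched = m.matched_data
stat, pvalue = ks_2samp(matched.loc[matched.treatment == 1, 'income'],
                        matched.loc[matched.treatment == 0, 'income'])
print('income after matching: KS statistic = %.3f, p-value = %.3f' % (stat, pvalue))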

Results

Next we assess the difference in driving distance between the treatment and control groups.

Estimate the difference with a linear regression on the matched data:

covariates = ["treatment", "income", "homeown", "VEHAGE", "HHSIZE" ,"HHVEHCNT"] 
y=m.matched_data["ANNMILES"]
X=sm.add_constant(pd.concat([m.matched_data[covariates], 
                             pd.get_dummies(m.matched_data["HHSTATE"]),
                             pd.get_dummies(m.matched_data["urban"]),
                             pd.get_dummies(m.matched_data["vehtype"])],axis=1))
linmodel = sm.OLS(y,X).fit()
print('results from propensity score matching')
print('Average Difference: %.2f' % linmodel.params.treatment)
print('t-statistic: %.2f' % linmodel.tvalues.treatment)  # .tvalues holds t-statistics, not standard errors
results from propensity score matching
Average Difference: 1203.45
t-statistic: 4.94

The regression above indicates that owning an energy efficient vehicle induces about 1,203 additional miles of annual driving, and the result is statistically significant.
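
As a sanity check, the simple difference in means on the matched sample (without the regression controls) can be computed as well; the number is not reproduced here, but it should land in the same neighborhood as the regression estimate.

# raw difference in mean annual miles on the matched sample, no regression adjustment
matched = m.matched_data
matched_diff = (matched[matched.treatment == 1]['ANNMILES'].mean()
                - matched[matched.treatment == 0]['ANNMILES'].mean())
print('Difference on matched sample: %.2f' % matched_diff)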

unmatched1 = data[data['treatment']==1]['ANNMILES'].mean()
unmatched0 = data[data['treatment']==0]['ANNMILES'].mean()
print('Difference without propensity score matching: %.2f' %(unmatched1 - unmatched0))
Difference without propensity score matching: 2638.11

The figure above is the raw difference in driving distance between the two groups, without matching. It is larger than the matched estimate, which is evidence of selection bias: households that buy energy efficient vehicles would likely drive more anyway. Still, the PSM result indicates that, even after accounting for this selection, owning an energy efficient vehicle changes driving behavior and induces more driving.

More Analysis

Another question policy makers might be interested in is: does the extra mileage of the energy efficient vehicle substitute for mileage of other vehicles owned by the household? If the total annual mileage of households that own energy efficient vehicles does not differ from that of similar households that do not, that would suggest substitution among household vehicles. The following section therefore aggregates the data to the household level and applies the same matching procedure to households.

hh_total_mile = data.groupby(['HOUSEID'])['ANNMILES'].sum() \
    .to_frame().rename(columns = {'ANNMILES': 'hh_total_mile'})
hh_own_clean_veh = ((data.groupby(['HOUSEID'])['treatment'].sum()>0) \
    .to_frame()*1).rename(columns = {'treatment': 'hh_own_clean_veh'})
n_clean_veh = data.groupby(['HOUSEID'])['treatment'].sum() \
    .to_frame().rename(columns = {'treatment': 'n_clean_veh'})
data_hh = pd.merge(data, hh_total_mile, how = 'inner', left_on = ['HOUSEID'], right_index = True)
data_hh = pd.merge(data_hh, hh_own_clean_veh, how = 'inner', left_on = ['HOUSEID'], right_index = True)
data_hh = pd.merge(data_hh, n_clean_veh, how = 'inner', left_on = ['HOUSEID'], right_index = True)
data_hh = data_hh.drop_duplicates(['HOUSEID'], keep='first')
data_hh = data_hh[data_hh['NUMADLT']>0]
data_hh['mile_per_adult'] = data_hh['hh_total_mile']/data_hh['NUMADLT']
data_hh['num_children'] = data_hh['HHSIZE'] - data_hh['NUMADLT']
fields = ["income", "homeown", "urban", "HHSIZE", "HHSTATE", "HHVEHCNT", "hh_own_clean_veh", "hh_total_mile","n_clean_veh", "mile_per_adult", "num_children"] 
data_match = data_hh[fields]
data_match.head()
treatment = data_match[data_match.hh_own_clean_veh == 1]
control = data_match[data_match.hh_own_clean_veh == 0]
m = Matcher(treatment, control, yvar = "hh_own_clean_veh", exclude = ["hh_total_mile","n_clean_veh","mile_per_adult"])
Formula:
hh_own_clean_veh ~ income+homeown+urban+HHSIZE+HHSTATE+HHVEHCNT+num_children
n majority: 90974
n minority: 4568
np.random.seed(1)
m.fit_scores(balance = True, nmodels = 100)
# average accuracy: 64.91%
Fitting Models on Balanced Samples: 100\100
Average Accuracy: 64.91%
m.predict_scores() # calculate p-score
m.plot_scores()

[figure: propensity score distributions, household-level treatment vs. control]

m.tune_threshold(method='random')

[figure: share of data retained vs. matching threshold (household-level match)]

m.match(method = "min", nmatches = 1, threshold = 0.0008)
m.matched_data.sort_values("match_id").head(6)
income homeown urban HHSIZE HHSTATE HHVEHCNT hh_own_clean_veh hh_total_mile n_clean_veh mile_per_adult num_children scores match_id record_id
0 62500 1 not_urban 2 NC 2 1 20000 1.0 10000.000000 0 0.372852 0 0
8288 62500 1 not_urban 2 NC 2 0 10000 0.0 5000.000000 0 0.372852 0 8288
1 112500 1 not_urban 3 NC 7 1 65374 1.0 21791.333333 0 0.554536 1 1
8861 112500 1 not_urban 3 NC 7 0 45850 0.0 15283.333333 0 0.554536 1 8861
2 112500 1 urban_area 2 NC 2 1 20000 2.0 10000.000000 0 0.540142 2 2
6376 112500 1 urban_area 2 NC 2 0 12800 0.0 6400.000000 0 0.540142 2 6376
categorical_results = m.compare_categorical(return_table = True)
categorical_results
var before after
0 urban 0.0 0.845934
1 HHSTATE 0.0 0.999939
cc = m.compare_continuous(return_table = True)
cc
var ks_before ks_after grouped_chisqr_before grouped_chisqr_after std_median_diff_before std_median_diff_after std_mean_diff_before std_mean_diff_after
0 income 0.0 0.998 0.0 0.998 0.856976 0.0 0.645195 0.005165
1 homeown 0.0 0.899 1.0 1.000 0.000000 0.0 0.180183 -0.003286
2 HHSIZE 0.0 0.512 1.0 1.000 0.000000 0.0 0.182328 0.024008
3 HHVEHCNT 0.0 0.842 1.0 1.000 0.000000 0.0 0.210103 0.019599
4 num_children 0.0 0.287 1.0 1.000 0.000000 0.0 0.059526 0.014982
covariates = ["hh_own_clean_veh", "income", "homeown", "HHSIZE" ,"HHVEHCNT"] 
linmodel = sm.OLS(m.matched_data["hh_total_mile"], 
                  pd.concat([m.matched_data[covariates], 
                             pd.get_dummies(m.matched_data["HHSTATE"]),
                             pd.get_dummies(m.matched_data["urban"])],axis=1)).fit()
print('the effect of owning a clean vehicle on household miles driven')
print('ATE:',linmodel.params.hh_own_clean_veh)
print('t-statistic:',linmodel.tvalues.hh_own_clean_veh)  # .tvalues holds t-statistics, not standard errors
the effect of owning a clean vehicle on household miles driven
ATE: 1729.01924946
t-statistic: 4.45509371037
# use "number of clean vehicle" as independent var instead
covariates2 = ["n_clean_veh", "income", "homeown", "HHSIZE" ,"HHVEHCNT"] 
linmodel2 = sm.OLS(m.matched_data["hh_total_mile"], 
                  pd.concat([m.matched_data[covariates2], 
                             pd.get_dummies(m.matched_data["HHSTATE"]),
                             pd.get_dummies(m.matched_data["urban"])],axis=1)).fit()
print('the effect of owning one more clean vehicle on household miles driven')
print('ATE:',linmodel2.params.n_clean_veh)
print('t-statistic:',linmodel2.tvalues.n_clean_veh)
the effect of owning one more clean vehicle on household miles driven
ATE: 1779.57031899
t-statistic: 5.34001409422
# use "mile per adult" as dependent var, "own clean veh dummy" as independent var
covariates3 = ["hh_own_clean_veh", "income", "homeown", "num_children" ,"HHVEHCNT"] 
linmodel3 = sm.OLS(m.matched_data["mile_per_adult"], 
                  pd.concat([m.matched_data[covariates3], 
                             pd.get_dummies(m.matched_data["HHSTATE"]),
                             pd.get_dummies(m.matched_data["urban"])],axis=1)).fit()
print('the effect of owning a clean vehicle on miles driven per adult in the household')
print('ATE:',linmodel3.params.hh_own_clean_veh)
print('t-statistic:',linmodel3.tvalues.hh_own_clean_veh)
the effect of owning a clean vehicle on miles driven per adult in the household
ATE: 881.540124761
t-statistic: 3.80814019464

Findings from this section:

  • Households that purchase clean vehicles drive more annual miles, which does not support the substitution hypothesis.
  • Adults in households with clean energy vehicles also drive more miles per person, which is consistent with the results above.

Combining these results, owning an energy efficient vehicle leads to more driving, and this extra driving does not come at the expense of mileage on the household's other vehicles. The rebound effect does exist for energy efficient vehicles.
