# Case Study 2 - Analyzing Fuel Economy Data

## 1. Assessing Data

Use pandas to explore ```all_alpha_08.csv``` and ```all_alpha_18.csv``` and answer the following questions about the characteristics of the datasets:

- number of samples in each dataset
- number of columns in each dataset
- duplicate rows in each dataset
- datatypes of columns
- features with missing values
- number of non-null unique values for features in each dataset
- what those unique values are and counts for each
- Number of rows with missing values in each dataset
- Types of fuels present in each dataset
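
The checklist above can be gathered into one small helper. The sketch below is illustrative only: `summarize` is a hypothetical name that does not appear elsewhere in this notebook, and the toy frame with made-up values stands in for the real CSV files.

```python
import pandas as pd

def summarize(df):
    """Collect the basic assessment numbers for one DataFrame (hypothetical helper)."""
    return {
        "samples": df.shape[0],
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "rows_with_missing": int(df.isnull().any(axis=1).sum()),
        "features_with_missing": df.columns[df.isnull().any()].tolist(),
        "unique_per_feature": df.nunique().to_dict(),  # nunique ignores nulls
    }

# toy data standing in for all_alpha_08.csv (values are made up)
toy = pd.DataFrame({
    "Model": ["A", "A", "B", "C"],
    "Fuel": ["Gasoline", "Gasoline", "diesel", None],
})
summary = summarize(toy)
print(summary)
```

On the real data the same call would be `summarize(df_08)` and `summarize(df_18)` once the files are loaded.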

In [117]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df_08 = pd.read_csv('all_alpha_08.csv')
df_18 = pd.read_csv('all_alpha_18.csv')

### Number of samples & columns in the fuel economy 2008 dataset

In [118]:
print(df_08.shape)
df_08.head()

(2404, 18)


Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Sales Area,Stnd,Underhood ID,Veh Class,Air Pollution Score,FE Calc Appr,City MPG,Hwy MPG,Cmb MPG,Unadj Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,B5,8HNXT03.7PKR,SUV,6,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no
3,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,FA,B5,8HNXT02.3DKR,SUV,6,Drv,17,22,19,24.1745,5,no
4,ACURA RL,3.5,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXV03.5HKR,midsize car,7,Drv,16,24,19,24.5629,5,no


### Number of samples & columns in the fuel economy 2018 dataset

In [119]:
print(df_18.shape)
df_18.head()

(1611, 18)


Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Stnd,Stnd Description,Underhood ID,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay,Comb CO2
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,T3B125,Federal Tier 3 Bin 125,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,CA,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
2,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,FA,T3B125,Federal Tier 3 Bin 125,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,CA,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402
4,ACURA TLX,2.4,4.0,AMS-8,2WD,Gasoline,CA,L3ULEV125,California LEV-III ULEV125,JHNXV02.4WH3,small car,3,23,33,27,6,No,330


### Duplicate rows in the fuel economy 2008 dataset

In [120]:
df_08.duplicated().sum()

25

### Duplicate rows in the fuel economy 2018 dataset

In [121]:
df_18.duplicated().sum()

0

### Datatypes of columns in the fuel economy 2008 dataset

In [122]:
df_08.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2404 entries, 0 to 2403
Data columns (total 18 columns):
Model                   2404 non-null object
Displ                   2404 non-null float64
Cyl                     2205 non-null object
Trans                   2205 non-null object
Drive                   2311 non-null object
Fuel                    2404 non-null object
Sales Area              2404 non-null object
Stnd                    2404 non-null object
Underhood ID            2404 non-null object
Veh Class               2404 non-null object
Air Pollution Score     2404 non-null object
FE Calc Appr            2205 non-null object
City MPG                2205 non-null object
Hwy MPG                 2205 non-null object
Cmb MPG                 2205 non-null object
Unadj Cmb MPG           2205 non-null float64
Greenhouse Gas Score    2205 non-null object
SmartWay                2404 non-null object
dtypes: float64(2), object(16)
memory usage: 338.1+ KB


### Datatypes of columns in the fuel economy 2018 dataset

In [123]:
df_18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 18 columns):
Model                   1611 non-null object
Displ                   1609 non-null float64
Cyl                     1609 non-null float64
Trans                   1611 non-null object
Drive                   1611 non-null object
Fuel                    1611 non-null object
Cert Region             1611 non-null object
Stnd                    1611 non-null object
Stnd Description        1611 non-null object
Underhood ID            1611 non-null object
Veh Class               1611 non-null object
Air Pollution Score     1611 non-null int64
City MPG                1611 non-null object
Hwy MPG                 1611 non-null object
Cmb MPG                 1611 non-null object
Greenhouse Gas Score    1611 non-null int64
SmartWay                1611 non-null object
Comb CO2                1611 non-null object
dtypes: float64(2), int64(2), object(14)
memory usage: 226.6+ KB


> The following features have different datatypes in 2008 and 2018:
> - Cyl (2018) - float
> - Cyl (2008) - string
> - Greenhouse Gas Score (2008) - string
> - Greenhouse Gas Score (2018) - int
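
A quick way to spot every such mismatch, rather than reading the two `info()` listings by eye, is to line the dtypes up in one table. This is a minimal sketch on made-up frames; `a` and `b` are placeholders for `df_08` and `df_18`.

```python
import pandas as pd

a = pd.DataFrame({"Cyl": ["(6 cyl)", "(4 cyl)"], "Displ": [3.7, 2.3]})
b = pd.DataFrame({"Cyl": [6.0, 4.0], "Displ": [3.5, 2.4]})

# align the two dtype listings on column name (as strings, for easy comparison)
dtypes = pd.DataFrame({
    "2008": a.dtypes.astype(str),
    "2018": b.dtypes.astype(str),
})

# keep only the columns whose dtypes disagree
mismatched = dtypes[dtypes["2008"] != dtypes["2018"]]
print(mismatched)
```

A column that exists in only one frame would show up with a missing entry on the other side, so it is flagged as a mismatch too.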

### Features with missing values in the fuel economy 2008 dataset

In [124]:
df_08.isnull().sum()

Model                     0
Displ                     0
Cyl                     199
Trans                   199
Drive                    93
Fuel                      0
Sales Area                0
Stnd                      0
Underhood ID              0
Veh Class                 0
Air Pollution Score       0
FE Calc Appr            199
City MPG                199
Hwy MPG                 199
Cmb MPG                 199
Unadj Cmb MPG           199
Greenhouse Gas Score    199
SmartWay                  0
dtype: int64

### Features with missing values in the fuel economy 2018 dataset

In [125]:
df_18.isnull().sum()

Model                   0
Displ                   2
Cyl                     2
Trans                   0
Drive                   0
Fuel                    0
Cert Region             0
Stnd                    0
Stnd Description        0
Underhood ID            0
Veh Class               0
Air Pollution Score     0
City MPG                0
Hwy MPG                 0
Cmb MPG                 0
Greenhouse Gas Score    0
SmartWay                0
Comb CO2                0
dtype: int64

### Number of unique values for each feature in the fuel economy 2008 dataset

In [126]:
df_08.nunique()

Model                   436
Displ                    47
Cyl                       8
Trans                    14
Drive                     2
Fuel                      5
Sales Area                3
Stnd                     12
Underhood ID            343
Veh Class                 9
Air Pollution Score      13
FE Calc Appr              2
City MPG                 39
Hwy MPG                  43
Cmb MPG                  38
Unadj Cmb MPG           721
Greenhouse Gas Score     20
SmartWay                  2
dtype: int64

### Number of unique values for each feature in the fuel economy 2018 dataset

In [127]:
df_18.nunique()

Model                   367
Displ                    36
Cyl                       7
Trans                    26
Drive                     2
Fuel                      5
Cert Region               2
Stnd                     19
Stnd Description         19
Underhood ID            230
Veh Class                 9
Air Pollution Score       6
City MPG                 58
Hwy MPG                  62
Cmb MPG                  57
Greenhouse Gas Score     10
SmartWay                  3
Comb CO2                299
dtype: int64

### Number of rows with missing values in fuel economy 2008 dataset

In [128]:
df_08.isnull().any(axis=1).sum()

199

### Number of rows with missing values in fuel economy 2018 dataset

In [129]:
df_18.isnull().any(axis=1).sum()

2

### Types of fuels present in 2008 dataset

In [130]:
df_08.Fuel.unique()

array(['Gasoline', 'ethanol/gas', 'ethanol', 'diesel', 'CNG'],
      dtype=object)

### Types of fuels present in 2018 dataset

In [131]:
df_18.Fuel.unique()

array(['Gasoline', 'Gasoline/Electricity', 'Diesel', 'Ethanol/Gas',
       'Electricity'], dtype=object)

## 2. Cleaning Column Labels

In [132]:
# view 2008 dataset
df_08.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Sales Area,Stnd,Underhood ID,Veh Class,Air Pollution Score,FE Calc Appr,City MPG,Hwy MPG,Cmb MPG,Unadj Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no


In [133]:
# view 2018 dataset
df_18.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Stnd,Stnd Description,Underhood ID,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay,Comb CO2
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,T3B125,Federal Tier 3 Bin 125,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386


### Drop Extraneous Columns

Drop features that aren't consistent (not present in both datasets) or aren't relevant to our questions.

Columns to Drop:
- From 2008 dataset: ```'Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'```
- From 2018 dataset: ```'Stnd', 'Stnd Description', 'Underhood ID', 'Comb CO2'```
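
Which labels are not shared can also be worked out in code instead of by inspection. The sketch below uses only the column labels shown in the `info()` outputs above, so no data is needed; the names `cols_08`, `cols_18`, `only_08` and `only_18` are introduced here for illustration.

```python
import pandas as pd

cols_08 = pd.Index([
    "Model", "Displ", "Cyl", "Trans", "Drive", "Fuel", "Sales Area", "Stnd",
    "Underhood ID", "Veh Class", "Air Pollution Score", "FE Calc Appr",
    "City MPG", "Hwy MPG", "Cmb MPG", "Unadj Cmb MPG",
    "Greenhouse Gas Score", "SmartWay",
])
cols_18 = pd.Index([
    "Model", "Displ", "Cyl", "Trans", "Drive", "Fuel", "Cert Region", "Stnd",
    "Stnd Description", "Underhood ID", "Veh Class", "Air Pollution Score",
    "City MPG", "Hwy MPG", "Cmb MPG", "Greenhouse Gas Score", "SmartWay",
    "Comb CO2",
])

only_08 = cols_08.difference(cols_18).tolist()  # labels missing from 2018
only_18 = cols_18.difference(cols_08).tolist()  # labels missing from 2008
print(only_08)
print(only_18)
```

`Sales Area` and `Cert Region` show up here because they are the same feature under two labels; they are renamed rather than dropped. `Stnd` and `Underhood ID` are shared by both files and are dropped only because they are not relevant to the questions.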

In [134]:
# drop columns from 2008 dataset
df_08.drop(['Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'], axis=1, inplace=True)

# confirm changes
df_08.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Sales Area,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no


In [135]:
# drop columns from 2018 dataset
df_18.drop(['Stnd', 'Stnd Description', 'Underhood ID', 'Comb CO2'], axis=1, inplace=True)

# confirm changes
df_18.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,small SUV,3,20,28,23,5,No


### Rename Columns

Change the "Sales Area" column label in the 2008 dataset to "Cert Region" for consistency.
Rename all column labels to replace spaces with underscores and convert everything to lowercase.

In [136]:
# rename Sales Area to Cert Region
df_08.rename(columns={'Sales Area': 'Cert Region'}, inplace=True)

# confirm changes
df_08.head(1)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no


In [137]:
# replace spaces with underscores and lowercase labels for 2008 dataset
df_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

# confirm changes
df_08.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,SUV,7,15,20,17,4,no


In [138]:
# replace spaces with underscores and lowercase labels for 2018 dataset
df_18.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

# confirm changes
df_18.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,small SUV,3,20,28,23,5,No


In [139]:
# confirm column labels for 2008 and 2018 datasets are identical
df_08.columns == df_18.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [140]:
# make sure they're all identical like this
(df_08.columns == df_18.columns).all()

True

In [141]:
# save new datasets for next section
df_08.to_csv('data_08.csv', index=False)
df_18.to_csv('data_18.csv', index=False)

## 3. Filter, Drop Nulls, Dedupe

In [142]:
# view dimensions of dataset
df_08.shape

(2404, 14)

In [143]:
# view dimensions of dataset
df_18.shape

(1611, 14)

### Filter by Certification Region

In [144]:
# filter datasets for rows following California standards
df_08 = df_08.query('cert_region == "CA"')
df_18 = df_18.query('cert_region == "CA"')

In [145]:
# confirm only certification region is California
df_08['cert_region'].unique()

array(['CA'], dtype=object)

In [146]:
# confirm only certification region is California
df_18['cert_region'].unique()

array(['CA'], dtype=object)

In [147]:
# drop certification region columns from both datasets

df_08.drop('cert_region', axis=1, inplace=True)
df_18.drop('cert_region', axis=1, inplace=True)

In [148]:
df_08.shape

(1084, 13)

In [149]:
df_18.shape

(798, 13)

### Drop Rows with Missing Values

In [150]:
# view missing value count for each feature in 2008
df_08.isnull().sum()

model                    0
displ                    0
cyl                     75
trans                   75
drive                   37
fuel                     0
veh_class                0
air_pollution_score      0
city_mpg                75
hwy_mpg                 75
cmb_mpg                 75
greenhouse_gas_score    75
smartway                 0
dtype: int64

In [151]:
# view missing value count for each feature in 2018
df_18.isnull().sum()

model                   0
displ                   1
cyl                     1
trans                   0
drive                   0
fuel                    0
veh_class               0
air_pollution_score     0
city_mpg                0
hwy_mpg                 0
cmb_mpg                 0
greenhouse_gas_score    0
smartway                0
dtype: int64

In [152]:
# drop rows with any null values in both datasets
df_08.dropna(inplace=True)
df_18.dropna(inplace=True)

In [153]:
# checks if any of columns in 2008 have null values - should print False
df_08.isnull().sum().any()

False

In [154]:
# checks if any of columns in 2018 have null values - should print False
df_18.isnull().sum().any()

False

### Dedupe Data

In [155]:
# print number of duplicates in 2008 datasets
df_08.duplicated().sum()

23

In [156]:
# print number of duplicates in 2018 datasets
df_18.duplicated().sum()

3

In [157]:
# drop duplicates in 2008 datasets
df_08.drop_duplicates(inplace=True)

In [158]:
# drop duplicates in 2018 datasets
df_18.drop_duplicates(inplace=True)

In [159]:
# print number of duplicates again to confirm dedupe - should be 0
df_08.duplicated().sum()

0

In [160]:
# print number of duplicates again to confirm dedupe - should be 0
df_18.duplicated().sum()

0

In [161]:
# save progress for the next section
df_08.to_csv('data_08.csv', index=False)
df_18.to_csv('data_18.csv', index=False)

## 4. Inspecting Data Types

Use the following Jupyter Notebook to inspect the datatypes of features in each dataset and think about what changes should be made to make them practical and consistent (in both datasets).

In [162]:
df_08 = pd.read_csv('data_08.csv')
df_18 = pd.read_csv('data_18.csv')

In [163]:
# Datatype of column in the fuel economy 2008 dataset
df_08.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 13 columns):
model                   986 non-null object
displ                   986 non-null float64
cyl                     986 non-null object
trans                   986 non-null object
drive                   986 non-null object
fuel                    986 non-null object
veh_class               986 non-null object
air_pollution_score     986 non-null object
city_mpg                986 non-null object
hwy_mpg                 986 non-null object
cmb_mpg                 986 non-null object
greenhouse_gas_score    986 non-null object
smartway                986 non-null object
dtypes: float64(1), object(12)
memory usage: 100.2+ KB


In [164]:
# Datatype of column in the fuel economy 2018 dataset
df_18.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 794 entries, 0 to 793
Data columns (total 13 columns):
model                   794 non-null object
displ                   794 non-null float64
cyl                     794 non-null float64
trans                   794 non-null object
drive                   794 non-null object
fuel                    794 non-null object
veh_class               794 non-null object
air_pollution_score     794 non-null int64
city_mpg                794 non-null object
hwy_mpg                 794 non-null object
cmb_mpg                 794 non-null object
greenhouse_gas_score    794 non-null int64
smartway                794 non-null object
dtypes: float64(2), int64(2), object(9)
memory usage: 80.7+ KB


In [165]:
# View first row of data in the fuel economy 2008 dataset
df_08.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,SUV,7,15,20,17,4,no


In [166]:
# View first row of data in the fuel economy 2018 dataset
df_18.head(1)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,small SUV,3,20,28,23,5,No


> 1) Need to extract int values from the strings in the 2008 ```cyl``` column and convert floats to ints in the 2018 ```cyl``` column to make the ```cyl``` column in both datasets clear and consistent.

> 2) For the ```air_pollution_score``` column: need to convert strings to floats in the 2008 column and ints to floats in the 2018 column.

> 3) The following features need to be converted from strings to floats in both datasets:
> - city_mpg
> - hwy_mpg
> - cmb_mpg

> 4) Need to convert strings to ints in the 2008 ```greenhouse_gas_score``` column to make the column consistent in both datasets.

## 5. Fixing Data Types

Make the following changes to make the datatypes consistent and practical to work with:

### 1) Fix ```cyl``` datatype
- 2008: extract int from string.
- 2018: convert float to int.

In [167]:
# check value counts for the 2008 cyl column
df_08['cyl'].value_counts()

(6 cyl)     409
(4 cyl)     283
(8 cyl)     199
(5 cyl)      48
(12 cyl)     30
(10 cyl)     14
(2 cyl)       2
(16 cyl)      1
Name: cyl, dtype: int64

In [168]:
# Extract int from strings in the 2008 cyl column

# str.extract() returns the part of each string captured by the regex group
# \d matches a digit character
# + matches one or more of the preceding token
# expand=False returns a Series rather than a one-column DataFrame
# astype(int) then converts the result to integers

df_08['cyl'] = df_08['cyl'].str.extract(r'(\d+)', expand=False).astype(int)

In [169]:
# Check value counts for 2008 cyl column again to confirm the change
df_08['cyl'].value_counts()

6     409
4     283
8     199
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64

In [170]:
# convert 2018 cyl column to int
df_18['cyl'] = df_18['cyl'].astype(int)

In [171]:
# Check value counts for 2018 cyl column again to confirm the change
df_18['cyl'].value_counts()

4     365
6     246
8     153
3      18
12      9
5       2
16      1
Name: cyl, dtype: int64

In [172]:
df_08.to_csv('data_08.csv', index=False)
df_18.to_csv('data_18.csv', index=False)

### 2) Fix ```air_pollution_score``` datatype
- 2008: convert string to float.
- 2018: convert int to float.

In [173]:
# try using Pandas to_numeric or astype function to convert the
# 2008 air_pollution_score column to float -- this won't work
df_08.air_pollution_score = df_08.air_pollution_score.astype(float)

ValueError: could not convert string to float: '6/4'

> The above did not work. According to the error, the value "6/4" cannot be converted to a float - let's check it out.

In [None]:
# To find out the position for the value - "6/4"

error_position = df_08.query('air_pollution_score == "6/4"')
print(error_position)

> According to the error and the query result above, the value at row 582 is "6/4".

In [None]:
# Figuring out the issue
df_08.iloc[582]

> According to [this link](http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore), which I found from the PDF documentation:

>    "If a vehicle can operate on more than one type of fuel, an estimate is provided for each fuel type."
    
>So all vehicles with more than one fuel type, or hybrids, like the one above (it uses ethanol AND gas) will have a string that holds two values - one for each.
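
`astype(float)` stops at the first value it cannot parse, so the error message names only one offender. One way to list every non-numeric value in a column is `pd.to_numeric` with `errors='coerce'`. The sketch below runs on a made-up Series, not on the real column.

```python
import pandas as pd

# made-up scores, including a few two-value strings
scores = pd.Series(["7", "6", "6/4", "9.5", "6/4", "3/2"])

# values that cannot be parsed become NaN instead of raising
parsed = pd.to_numeric(scores, errors="coerce")

# the original strings behind those NaNs, with their counts
bad_values = scores[parsed.isnull()].value_counts()
print(bad_values)
```

On the real data the first line would be replaced by `scores = df_08['air_pollution_score']`.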

In [None]:
# First, let's get all the hybrids in 2008
hb_08 = df_08[df_08['fuel'].str.contains('/')]
hb_08

In [None]:
# hybrids in 2018
hb_18 = df_18[df_18['fuel'].str.contains('/')]
hb_18

> Take each hybrid row and split them into two new rows - one with values for the first fuel type (values before the "/"), and the other with values for the second fuel type (values after the "/").
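
Before applying this to the real data, the whole split can be tried on a tiny frame. The sketch below uses made-up rows and only two of the columns; it relies on `pd.concat` to stack the pieces, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# made-up rows: one single-fuel vehicle and one dual-fuel vehicle
toy = pd.DataFrame({
    "model": ["CAR A", "CAR B"],
    "fuel": ["Gasoline", "ethanol/gas"],
    "city_mpg": ["15", "13/18"],
})

dual = toy[toy["fuel"].str.contains("/")]
first, second = dual.copy(), dual.copy()
for col in ["fuel", "city_mpg"]:
    first[col] = dual[col].str.split("/").str[0]   # value before the "/"
    second[col] = dual[col].str.split("/").str[1]  # value after the "/"

# swap the dual-fuel row for its two single-fuel replacements
result = pd.concat([toy.drop(dual.index), first, second], ignore_index=True)
result["city_mpg"] = result["city_mpg"].astype(float)
print(result)
```

After the split, the float conversion that failed earlier goes through because no value contains a "/" any more.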

In [None]:
# create two copies of the 2008 hybrids dataframe
df1 = hb_08.copy()  # data on first fuel type of each hybrid vehicle
df2 = hb_08.copy()  # data on second fuel type of each hybrid vehicle

# Each one should look like this
df1

In [None]:
# columns to split by "/"
split_columns = ['fuel', 'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg', 'greenhouse_gas_score']

# apply split function to each column of each dataframe copy
for c in split_columns:
    df1[c] = df1[c].apply(lambda x: x.split("/")[0])
    df2[c] = df2[c].apply(lambda x: x.split("/")[1])

In [None]:
# this dataframe holds info for the FIRST fuel type of the hybrid
# aka the values before the "/"s
df1

In [None]:
# this dataframe holds info for the SECOND fuel type of the hybrid
# aka the values after the "/"s
df2

In [None]:
# combine dataframes to add to the original dataframe
new_rows = pd.concat([df1, df2])

# now we have separate rows for each fuel type of each vehicle!
new_rows

In [None]:
# drop the original hybrid rows
df_08.drop(hb_08.index, inplace=True)

# add in our newly separated rows
df_08 = pd.concat([df_08, new_rows], ignore_index=True)

In [None]:
# check that all the original hybrid rows with "/"s are gone
df_08[df_08['fuel'].str.contains('/')]

In [None]:
df_08.shape

### Repeat process for the 2018 dataset

In [None]:
# create two copies of the 2018 hybrids dataframe, hb_18
df1 = hb_18.copy()
df2 = hb_18.copy()

### Split values for `fuel`, `city_mpg`, `hwy_mpg`, `cmb_mpg`
This is not required for `air_pollution_score` or `greenhouse_gas_score` because these columns are already ints in the 2018 dataset.

In [None]:
# list of columns to split
split_columns = ['fuel', 'city_mpg', 'hwy_mpg', 'cmb_mpg']

# apply split function to each column of each dataframe copy
for c in split_columns:
    df1[c] = df1[c].apply(lambda x: x.split("/")[0])
    df2[c] = df2[c].apply(lambda x: x.split("/")[1])

In [None]:
# append the two dataframes
new_rows = pd.concat([df1, df2])

# drop each hybrid row from the original 2018 dataframe
# do this by using Pandas drop function with hb_18's index
df_18.drop(hb_18.index, inplace=True)

# append new_rows to df_18
df_18 = pd.concat([df_18, new_rows], ignore_index=True)

In [None]:
# check that they're gone
df_18[df_18['fuel'].str.contains('/')]

In [None]:
df_18.shape

### Continue the changes needed for `air_pollution_score`:
- 2008: convert string to float
- 2018: convert int to float

In [None]:
# convert string to float for 2008 air pollution column
df_08.air_pollution_score = df_08.air_pollution_score.astype(float)

In [None]:
# convert int to float for 2018 air pollution column
df_18.air_pollution_score = df_18.air_pollution_score.astype(float)

In [None]:
df_08.to_csv('data_08.csv', index=False)
df_18.to_csv('data_18.csv', index=False)

### 3) Fix ```city_mpg```, ```hwy_mpg```, ```cmb_mpg``` datatypes
- 2008 and 2018: convert string to float.

In [None]:
# convert mpg columns to floats
mpg_columns = ['city_mpg', 'hwy_mpg', 'cmb_mpg']
for c in mpg_columns:
    df_18[c] = df_18[c].astype(float)
    df_08[c] = df_08[c].astype(float)

### 4) Fix ```greenhouse_gas_score``` datatype
- 2008: convert from string to int.

In [None]:
# convert from string to int
df_08['greenhouse_gas_score'] = df_08['greenhouse_gas_score'].astype(int)

### Check to make sure all datatype are fixed

In [None]:
df_08.dtypes

In [None]:
df_18.dtypes

In [None]:
df_08.dtypes == df_18.dtypes

In [None]:
# Save your new CLEAN datasets as new files!
df_08.to_csv('clean_08.csv', index=False)
df_18.to_csv('clean_18.csv', index=False)

## 6. Exploring with Visuals

Use histograms and scatterplots to explore clean_08.csv and clean_18.csv in the Jupyter notebook. 

In [None]:
df_08 = pd.read_csv('clean_08.csv')
df_18 = pd.read_csv('clean_18.csv')

### Compare the distributions of greenhouse gas score in 2008 and 2018.

In [None]:
df_08['greenhouse_gas_score'].hist();

In [None]:
df_18['greenhouse_gas_score'].hist();

> Distribution for 2008 is more skewed to the left.

### How has the distribution of combined mpg changed from 2008 to 2018?

In [None]:
df_08['cmb_mpg'].hist();

In [None]:
df_18['cmb_mpg'].hist();

> Became much more skewed to the right
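
Reading skew off a histogram is subjective, so a numeric check can back up the visual impression. `Series.skew()` returns a positive number for a right-skewed sample (long right tail) and a negative number for a left-skewed one. The values below are made up for illustration.

```python
import pandas as pd

right_tailed = pd.Series([18, 19, 19, 20, 20, 21, 22, 45])
left_tailed = pd.Series([2, 7, 8, 8, 9, 9, 10, 10])

skew_right = right_tailed.skew()  # positive: a few very high values
skew_left = left_tailed.skew()    # negative: a few very low values
print(skew_right, skew_left)
```

On the real data the same call would be `df_08['cmb_mpg'].skew()` and `df_18['cmb_mpg'].skew()`.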

### Describe the correlation between displacement and combined mpg.

In [None]:
df_08.plot(x='displ', y='cmb_mpg', kind='scatter');

In [None]:
df_18.plot(x='displ', y='cmb_mpg', kind='scatter');

> Negative correlation

### Describe the correlation between greenhouse gas score and combined mpg.

In [None]:
df_08.plot(x='greenhouse_gas_score', y='cmb_mpg', kind='scatter');

In [None]:
df_18.plot(x='greenhouse_gas_score', y='cmb_mpg', kind='scatter');

> Positive correlation
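
The direction seen in the scatterplots can be put into a number with `Series.corr`, which computes the Pearson correlation coefficient by default. The sketch below uses made-up values, so only the signs are meaningful.

```python
import pandas as pd

toy = pd.DataFrame({
    "displ": [1.6, 2.0, 3.5, 5.0, 6.2],
    "cmb_mpg": [34, 30, 22, 17, 15],
    "greenhouse_gas_score": [9, 8, 5, 3, 2],
})

r_displ = toy["displ"].corr(toy["cmb_mpg"])               # bigger engine, lower mpg
r_ghg = toy["greenhouse_gas_score"].corr(toy["cmb_mpg"])  # higher score, higher mpg
print(round(r_displ, 2), round(r_ghg, 2))
```

On the real data, `df_08['displ'].corr(df_08['cmb_mpg'])` gives the coefficient for the 2008 plot.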

## 7. Conclusions & Visuals

In [None]:
# load datasets
df_08 = pd.read_csv('clean_08.csv')
df_18 = pd.read_csv('clean_18.csv')

In [None]:
df_08.head(1)

### Q1: Are more unique models using alternative sources of fuel? By how much?

Let's first look at what the sources of fuel are and which ones are alternative sources.

In [None]:
df_08.fuel.value_counts()

In [None]:
df_18.fuel.value_counts()

Looks like the alternative sources of fuel available in 2008 are CNG and ethanol, and those in 2018 are ethanol and electricity. (You can use Google if you aren't sure which ones are alternative sources of fuel!)

In [None]:
# how many unique models used alternative sources of fuel in 2008
alt_08 = df_08.query('fuel in ["CNG", "ethanol"]').model.nunique()
alt_08

In [None]:
# how many unique models used alternative sources of fuel in 2018
alt_18 = df_18.query('fuel in ["Ethanol", "Electricity"]').model.nunique()
alt_18

In [None]:
plt.bar(["2008", "2018"], [alt_08, alt_18])
plt.title("Number of Unique Models Using Alternative Fuels")
plt.xlabel("Year")
plt.ylabel("Number of Unique Models");

Since 2008, the number of unique models using alternative sources of fuel increased by 24. We can also look at proportions.

In [None]:
# total unique models each year
total_08 = df_08.model.nunique()
total_18 = df_18.model.nunique()
total_08, total_18

In [None]:
prop_08 = alt_08/total_08
prop_18 = alt_18/total_18
prop_08, prop_18

In [None]:
plt.bar(["2008", "2018"], [prop_08, prop_18])
plt.title("Proportion of Unique Models Using Alternative Fuels")
plt.xlabel("Year")
plt.ylabel("Proportion of Unique Models");

### Q2: How much have vehicle classes improved in fuel economy?  

Let's look at the average fuel economy for each vehicle class for both years.

In [None]:
veh_08 = df_08.groupby('veh_class').cmb_mpg.mean()
veh_08

In [None]:
veh_18 = df_18.groupby('veh_class').cmb_mpg.mean()
veh_18

In [None]:
# how much they've increased by for each vehicle class
inc = veh_18 - veh_08
inc

In [None]:
# only plot the classes that exist in both years
inc.dropna(inplace=True)
plt.subplots(figsize=(8, 5))
plt.bar(inc.index, inc)
plt.title('Improvements in Fuel Economy from 2008 to 2018 by Vehicle Class')
plt.xlabel('Vehicle Class')
plt.ylabel('Increase in Average Combined MPG');
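
The `dropna` call is needed because subtracting two Series aligns them on their index labels: a vehicle class that appears in only one year has no partner and comes out as `NaN`. A small sketch with made-up means (the class names are taken from outputs earlier in this notebook, and `toy_08` and `toy_18` are placeholder names):

```python
import pandas as pd

toy_08 = pd.Series({"SUV": 18.0, "midsize car": 21.0, "small car": 21.0})
toy_18 = pd.Series({"midsize car": 27.0, "small SUV": 24.0, "small car": 25.0})

inc = toy_18 - toy_08   # labels are matched, not positions
shared = inc.dropna()   # keep classes present in both years
print(inc)
print(shared)
```

Here `SUV` and `small SUV` each exist in one year only, so they drop out and only the two shared classes remain.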

### Q3: What are the characteristics of SmartWay vehicles? Have they changed over time?

We can analyze this by filtering each dataframe by SmartWay classification and exploring these datasets.

In [None]:
# smartway labels for 2008
df_08.smartway.unique()

In [None]:
# get all smartway vehicles in 2008
smart_08 = df_08.query('smartway == "yes"')

In [None]:
# explore smartway vehicles in 2008
smart_08.describe()

Use what you've learned so far to further explore this dataset on 2008 smartway vehicles.

In [None]:
# smartway labels for 2018
df_18.smartway.unique()

In [None]:
# get all smartway vehicles in 2018
smart_18 = df_18.query('smartway in ["Yes", "Elite"]')

In [None]:
smart_18.describe()

Use what you've learned so far to further explore this dataset on 2018 smartway vehicles.

### Q4: What features are associated with better fuel economy?

You can explore trends between cmb_mpg and the other features in this dataset, or filter this dataset like in the previous question and explore the properties of that dataset. For example, you can select all vehicles with above-average fuel economy like this.

In [None]:
top_08 = df_08.query('cmb_mpg > cmb_mpg.mean()')
top_08.describe()

In [None]:
top_18 = df_18.query('cmb_mpg > cmb_mpg.mean()')
top_18.describe()
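
Filtering on the mean keeps the above-average vehicles, which is not necessarily half of the rows. When an exact top 50% is wanted, the median is the cut-off to use. A sketch on made-up values with one outlier:

```python
import pandas as pd

toy = pd.DataFrame({"cmb_mpg": [15, 17, 18, 19, 20, 21, 22, 60]})

above_mean = toy[toy["cmb_mpg"] > toy["cmb_mpg"].mean()]      # mean is 24
above_median = toy[toy["cmb_mpg"] > toy["cmb_mpg"].median()]  # median is 19.5
print(len(above_mean), len(above_median))
```

The single high value pulls the mean up, so only one row is above it, while the median splits the eight rows evenly.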

### Q5: For all of the models that were produced in 2008 that are still being produced in 2018, how much has the mpg improved and which vehicle improved the most?

This question concerns models that have been updated since 2008 and are still being produced in 2018. To answer it, we need a way to compare models that exist in both datasets.

### Create combined dataset

1) Rename 2008 columns to distinguish them from 2018 columns after the merge

To do this, use Pandas' rename() with a lambda function. In the lambda function, take the first 10 characters of the column label and concatenate them with _2008. (Only take the first 10 characters to prevent really long column names.)

In [175]:
# rename 2008 columns
df_08.rename(columns=lambda x: x[:10] + "_2008", inplace=True)

In [176]:
# view to check names
df_08.head()

Unnamed: 0,model_2008_2008,displ_2008_2008,cyl_2008_2008,trans_2008_2008,drive_2008_2008,fuel_2008_2008,veh_class__2008,air_pollut_2008,city_mpg_2_2008,hwy_mpg_20_2008,cmb_mpg_20_2008,greenhouse_2008,smartway_2_2008
0,ACURA MDX,3.7,6,Auto-S5,4WD,Gasoline,SUV,7,15,20,17,4,no
1,ACURA RDX,2.3,4,Auto-S5,4WD,Gasoline,SUV,7,17,22,19,5,no
2,ACURA RL,3.5,6,Auto-S5,4WD,Gasoline,midsize car,7,16,24,19,5,no
3,ACURA TL,3.2,6,Auto-S5,2WD,Gasoline,midsize car,7,18,26,21,6,yes
4,ACURA TL,3.5,6,Auto-S5,2WD,Gasoline,midsize car,7,17,26,20,6,yes
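
The labels in the output above carry a doubled suffix (`model_2008_2008`), which is what happens when the rename cell is executed more than once on the same dataframe. Below is a sketch of a rename that is safe to re-run; `add_year_suffix` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def add_year_suffix(label, suffix="_2008", keep=10):
    """Append suffix to the first `keep` characters, unless it is already there."""
    if label.endswith(suffix):
        return label
    return label[:keep] + suffix

toy = pd.DataFrame({"model": ["ACURA MDX"], "greenhouse_gas_score": [4]})
toy = toy.rename(columns=add_year_suffix)
toy = toy.rename(columns=add_year_suffix)  # second run changes nothing
print(toy.columns.tolist())
```

Reloading `clean_08.csv` before renaming achieves the same effect without a helper.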


In [177]:
# merge datasets
df_combined = df_08.merge(df_18, left_on='model_2008', right_on='model', how='inner')

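KeyError: 'model_2008'

> The merge fails because the rename cell was run twice: the labels in the output above show a doubled suffix (for example `model_2008_2008`), so there is no `model_2008` column. Reload `clean_08.csv` and run the rename cell exactly once before merging.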

In [None]:
# view to check merge
df_combined.head()

Save the combined dataset

In [None]:
df_combined.to_csv('combined_dataset.csv', index=False)
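
As an alternative to renaming before the merge, `DataFrame.merge` accepts a `suffixes` argument that is applied to overlapping column names. The sketch below uses made-up rows. With this approach the labels are not truncated to 10 characters, so some differ from the ones used in this notebook (for example `greenhouse_gas_score_2008` instead of `greenhouse_2008`).

```python
import pandas as pd

toy_08 = pd.DataFrame({"model": ["CAR A", "CAR B"], "cmb_mpg": [19.0, 21.0]})
toy_18 = pd.DataFrame({"model": ["CAR A", "CAR C"], "cmb_mpg": [23.0, 30.0]})

# inner join keeps only models present in both years;
# overlapping non-key columns from the left frame get "_2008"
merged = toy_08.merge(toy_18, on="model", how="inner", suffixes=("_2008", ""))
print(merged)
```

Only `CAR A` survives the inner join, with its 2008 value in `cmb_mpg_2008` and its 2018 value in `cmb_mpg`.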

1) Create a new dataframe, ```model_mpg```, that contains the mean combined mpg values in 2008 and 2018 for each unique model

To do this, group by model and find the mean ```cmb_mpg_2008``` and mean ```cmb_mpg``` for each.

In [178]:
combined_df = pd.read_csv('combined_dataset.csv')

In [179]:
# get mean values for each model
model_mpg = combined_df.groupby('model')[['cmb_mpg_2008', 'cmb_mpg']].mean()
model_mpg.head()

model,cmb_mpg_2008,cmb_mpg
ACURA RDX,19.0,22.5
AUDI A3,23.333333,28.0
AUDI A4,21.0,27.0
AUDI A6,19.666667,25.666667
AUDI A8 L,16.5,22.0


2) Create a new column, ```mpg_change```, with the change in mpg 

Subtract the mean mpg in 2008 from that in 2018 to get the change in mpg

In [180]:
# add column for mpg_change
model_mpg['mpg_change'] = model_mpg['cmb_mpg'] - model_mpg['cmb_mpg_2008']

3) Find the vehicle that improved the most

Find the max mpg change, and then use query or indexing to see what model it is!

In [182]:
model_mpg.sort_values(by='mpg_change', ascending=False).head(10)

model,cmb_mpg_2008,cmb_mpg,mpg_change
VOLVO XC 90,15.666667,32.2,16.533333
CHEVROLET Malibu,22.333333,33.0,10.666667
CHEVROLET Equinox,19.0,27.833333,8.833333
AUDI S4,15.5,24.0,8.5
AUDI S5,16.0,24.0,8.0
VOLKSWAGEN Passat,21.25,29.0,7.75
MERCEDES-BENZ C300,18.0,25.666667,7.666667
SUBARU Impreza,21.75,29.0,7.25
MAZDA 3,24.4,30.833333,6.433333
AUDI A6,19.666667,25.666667,6.0


> Model "Volvo XC 90" improved the most!
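
The top model can also be pulled out directly instead of read off a sorted table. Below is a sketch using `idxmax`, with three rows copied from the output above:

```python
import pandas as pd

model_mpg = pd.DataFrame(
    {
        "cmb_mpg_2008": [15.666667, 22.333333, 19.0],
        "cmb_mpg": [32.2, 33.0, 27.833333],
    },
    index=pd.Index(
        ["VOLVO XC 90", "CHEVROLET Malibu", "CHEVROLET Equinox"], name="model"
    ),
)
model_mpg["mpg_change"] = model_mpg["cmb_mpg"] - model_mpg["cmb_mpg_2008"]

best_model = model_mpg["mpg_change"].idxmax()          # index label of the largest change
best_change = model_mpg.loc[best_model, "mpg_change"]  # the change itself
print(best_model, round(best_change, 2))
```

`idxmax` returns only the first label if several models tie for the largest change, so the sorted view remains useful for seeing the runners-up.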