# Fuel Economy Testing
Exploration and analysis of fuel economy data obtained from vehicle testing done by the EPA (Environmental Protection Agency) at the National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan.

Data for the years 2008 and 2018 were chosen to explore how car emissions and fuel economy changed over a decade.

Source: [EPA Fuel Economy Data](https://www.fueleconomy.gov/feg/download.shtml)

## 1) Assessing 
General exploration of datasets for number of samples, missing values, duplicate rows, etc. 

Files: `all_alpha_18.csv`, `all_alpha_08.csv`

In [None]:
# import and load dataset
import pandas as pd

df_18 = pd.read_csv('all_alpha_18.csv')
df_08 = pd.read_csv('all_alpha_08.csv')

In [None]:
# Number of samples / columns in each dataset
print("2018 samples and columns: {}".format(df_18.shape))
print("2008 samples and columns: {}".format(df_08.shape))

In [None]:
# Columns and datatypes info
df_18.info()
df_08.info()

### Missing Values 

In [None]:
# Columns with missing values
missing_18 = df_18.columns[df_18.isnull().any()]
print('Columns with missing values in 2018 data: {}'.format(missing_18))

# Rows with null values in each column
df_18.isnull().sum()

In [None]:
# Columns with missing values
missing_08 = df_08.columns[df_08.isnull().any()]
print('Columns with missing values in 2008 data: {}'.format(missing_08))

# Rows with null values in each column
df_08.isnull().sum()

The 2008 dataset has missing values in more columns than the 2018 dataset, which is missing values only in the `Displ` and `Cyl` columns.
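To compare the two years directly, the per-column null counts can be placed side by side. The following is a minimal sketch that uses small made-up frames in place of `df_08` and `df_18`; the helper name `missing_summary` and all toy values are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return null counts per column, with one output column per dataset.

    `frames` maps a label (e.g. '2008') to a DataFrame. A column that is
    absent from one dataset appears as NaN in that dataset's output column.
    """
    counts = {label: df.isnull().sum() for label, df in frames.items()}
    summary = pd.DataFrame(counts)
    # Keep only the columns that have a missing value in at least one dataset
    return summary[(summary.fillna(0) > 0).any(axis=1)]

# Toy stand-ins for df_08 and df_18 (values are made up)
toy_08 = pd.DataFrame({
    'Model': ['ACURA MDX', 'ACURA RDX'],
    'Displ': [3.7, np.nan],
    'Cyl': ['(6 cyl)', None],
    'Drive': ['4WD', None],
})
toy_18 = pd.DataFrame({
    'Model': ['ACURA MDX', 'ACURA RDX'],
    'Displ': [np.nan, 2.0],
    'Cyl': [6.0, 4.0],
    'Drive': ['4WD', '2WD'],
})

summary = missing_summary({'2008': toy_08, '2018': toy_18})
print(summary)
```

With the real data, the call would be `missing_summary({'2008': df_08, '2018': df_18})`.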

### Duplicate Rows 

In [79]:
# Duplicate rows in the 2018 dataset (keep=False marks every occurrence)
duplicate_18 = df_18[df_18.duplicated(keep=False)]

# Uncomment to explore duplicate rows
#print("2018 Duplicate Rows: \n{}".format(duplicate_18))

# Number of duplicated rows, not counting the last occurrence of each
df_18.duplicated(keep='last').sum()

0

In [80]:
# Duplicate rows in the 2008 dataset (keep=False marks every occurrence)
duplicate_08 = df_08[df_08.duplicated(keep=False)]

# Uncomment to explore duplicate rows
#print("2008 Duplicate Rows: \n{}".format(duplicate_08))

# Number of duplicated rows, not counting the last occurrence of each
df_08.duplicated(keep='last').sum()

25

The 2018 dataset contains no duplicate rows, while the 2008 dataset contains 25 duplicated rows (counting every occurrence except the last).
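The counts above depend on the `keep` argument of `DataFrame.duplicated`. As a small illustration on a made-up frame (the name `toy` and its values are assumptions, not rows from the EPA data), `keep=False` flags every member of a duplicate group, while `keep='last'` and the default `keep='first'` flag only the redundant copies:

```python
import pandas as pd

# Toy frame: rows 0, 2 and 3 are identical; row 1 is unique
toy = pd.DataFrame({'Model': ['A', 'B', 'A', 'A'],
                    'Cyl':   [4, 6, 4, 4]})

# keep=False flags every member of a duplicate group
all_members = toy.duplicated(keep=False).sum()

# keep='last' flags all but the last occurrence in each group
redundant = toy.duplicated(keep='last').sum()

# keep='first' (the default) flags all but the first occurrence
redundant_first = toy.duplicated().sum()

print(all_members, redundant, redundant_first)  # prints: 3 2 2
```

A count taken with `keep='last'` is therefore the number of rows that would be removed by dropping duplicates, not the number of rows that take part in a duplicate group.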

In [None]:
# Number of unique values in each column
#print("\nUnique values in 2018 dataset: \n{}".format(df_18.nunique()))
#print("\nUnique values in 2008 dataset: \n{}".format(df_08.nunique()))

# Unique values and counts for each in specific columns 
print(df_18['Cyl'].value_counts())
print(df_08['Cyl'].value_counts())

## 2) Cleaning Column Labels
Drop extraneous columns and rename applicable columns.
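Before dropping or renaming anything, it can help to see which labels the two datasets share and which appear in only one of them. The sketch below shows one way to do that with pandas `Index` set operations. The frames are empty toy stand-ins: the 2008 labels are a subset of that dataset's columns, while the 2018 labels (in particular `Cert Region`) are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Toy stand-ins with labels only; the 2018 label set is hypothetical
toy_08 = pd.DataFrame(columns=['Model', 'Displ', 'Sales Area', 'Underhood ID'])
toy_18 = pd.DataFrame(columns=['Model', 'Displ', 'Cert Region', 'Underhood ID'])

# Labels present in one dataset but not the other
only_08 = toy_08.columns.difference(toy_18.columns)
only_18 = toy_18.columns.difference(toy_08.columns)

# Labels present in both datasets
shared = toy_08.columns.intersection(toy_18.columns)

print("Only in 2008:", list(only_08))
print("Only in 2018:", list(only_18))
print("Shared:", list(shared))
```

Labels that appear on only one side are candidates for renaming or dropping. With the real data, the same calls would be made on `df_08.columns` and `df_18.columns`.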

In [None]:
# View 2018 dataset
df_18.head(1)

In [81]:
# View 2008 dataset
df_08.head(1)

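| Unnamed: 0 | Model | Displ | Cyl | Trans | Drive | Fuel | Sales Area | Stnd | Underhood ID | Veh Class | Air Pollution Score | FE Calc Appr | City MPG | Hwy MPG | Cmb MPG | Unadj Cmb MPG | Greenhouse Gas Score | SmartWay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | CA | U2 | 8HNXT03.7PKR | SUV | 7 | Drv | 15 | 20 | 17 | 22.0527 | 4 | no |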


### Drop Extraneous Columns 