In [2]:
import pandas as pd
import numpy as np

In [28]:
# Loading the DataSets
all_08 = pd.read_csv('./DataSet/all_alpha_08.csv')
all_18 = pd.read_csv('./DataSet/all_alpha_18.csv')

In [4]:
all_08.shape, all_18.shape

((2404, 18), (1611, 18))

## Cleaning Column Labels

**Drop extraneous columns**  
  
Drop features that aren't consistent (not present in both datasets) or aren't relevant to our questions, using pandas' `drop()` method.
Columns to drop:
  
- From 2008 dataset: `'Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'`
- From 2018 dataset: `'Stnd', 'Stnd Description', 'Underhood ID', 'Comb CO2'`
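
One way to find the inconsistent columns programmatically is to compare the two column indexes with `Index.difference` and `Index.intersection`. The sketch below uses small stand-in frames (`df_08` and `df_18` are hypothetical names, and they carry only a subset of the real labels), so it illustrates the technique rather than reproducing the full datasets.

```python
import pandas as pd

# Stand-in frames: only the column labels matter for this comparison.
df_08 = pd.DataFrame(columns=['Model', 'Sales Area', 'Stnd', 'FE Calc Appr', 'Cmb MPG'])
df_18 = pd.DataFrame(columns=['Model', 'Cert Region', 'Stnd', 'Comb CO2', 'Cmb MPG'])

# Labels that appear in one dataset but not the other.
only_in_08 = df_08.columns.difference(df_18.columns)
only_in_18 = df_18.columns.difference(df_08.columns)

# Labels that both datasets share.
shared = df_08.columns.intersection(df_18.columns)

print(list(only_in_08))
print(list(only_in_18))
print(list(shared))
```

Note that this comparison only flags labels that differ. A column such as `'Stnd'` exists in both datasets and is dropped because it is not relevant, and `'Sales Area'` / `'Cert Region'` show up as a mismatch that is resolved by renaming rather than dropping.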

In [29]:
all_08.columns, all_18.columns

(Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Sales Area', 'Stnd',
        'Underhood ID', 'Veh Class', 'Air Pollution Score', 'FE Calc Appr',
        'City MPG', 'Hwy MPG', 'Cmb MPG', 'Unadj Cmb MPG',
        'Greenhouse Gas Score', 'SmartWay'],
       dtype='object'),
 Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Cert Region',
        'Stnd', 'Stnd Description', 'Underhood ID', 'Veh Class',
        'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
        'Greenhouse Gas Score', 'SmartWay', 'Comb CO2'],
       dtype='object'))

In [30]:
all_08.drop(columns=['Stnd', 'Underhood ID', 'FE Calc Appr', 'Unadj Cmb MPG'], inplace=True)
all_18.drop(columns=['Stnd', 'Stnd Description', 'Underhood ID', 'Comb CO2'], inplace=True)

**Rename Columns**  
  
- Change the `"Sales Area"` column label in the 2008 dataset to `"Cert Region"` for consistency.
- Rename all column labels to replace spaces with underscores and convert everything to lowercase. (Underscores are much easier to work with in Python than spaces. For example, a label containing a space can't be selected with `df.column_name`, only with `df['column name']`, and it has to be wrapped in backticks inside `query()`. Being consistent with lowercase and underscores also makes column names easy to remember.)
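
The difference can be seen on a small example. This is a minimal sketch with a made-up frame (`df` and its values are assumptions for illustration only); it applies the same rename lambda used in this notebook and then selects rows with both `query()` and attribute access.

```python
import pandas as pd

# Hypothetical toy data with a space in one column label.
df = pd.DataFrame({'Cmb MPG': [20, 35, 28], 'Model': ['A', 'B', 'C']})

# With the original label, query() needs backtick quoting,
# and df.Cmb MPG is not valid Python at all.
spaced = df.query('`Cmb MPG` > 25')

# After stripping, lowercasing, and replacing spaces with underscores,
# the label works as a plain identifier.
clean = df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'))
by_query = clean.query('cmb_mpg > 25')
by_attr = clean[clean.cmb_mpg > 25]

print(list(clean.columns))
print(by_query)
```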

In [31]:
all_08.rename(columns={"Sales Area": "Cert Region"}, inplace=True)
all_08.columns

Index(['Model', 'Displ', 'Cyl', 'Trans', 'Drive', 'Fuel', 'Cert Region',
       'Veh Class', 'Air Pollution Score', 'City MPG', 'Hwy MPG', 'Cmb MPG',
       'Greenhouse Gas Score', 'SmartWay'],
      dtype='object')

In [32]:
all_08.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
all_18.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)

In [33]:
all_08.columns, all_18.columns

(Index(['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'cert_region',
        'veh_class', 'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
        'greenhouse_gas_score', 'smartway'],
       dtype='object'),
 Index(['model', 'displ', 'cyl', 'trans', 'drive', 'fuel', 'cert_region',
        'veh_class', 'air_pollution_score', 'city_mpg', 'hwy_mpg', 'cmb_mpg',
        'greenhouse_gas_score', 'smartway'],
       dtype='object'))

## Check That the Column Labels Match

In [36]:
all_08.columns == all_18.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [37]:
(all_08.columns == all_18.columns).all()

True
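
The elementwise `==` comparison works here because both indexes have 14 labels; if the lengths differed, pandas would raise an error instead of returning `False`. As an alternative, `Index.equals` returns a single boolean in either case. The sketch below uses short hypothetical indexes (`a`, `b`, `c`) rather than the real column sets.

```python
import pandas as pd

# Hypothetical column indexes for illustration.
a = pd.Index(['model', 'displ', 'cyl'])
b = pd.Index(['model', 'displ', 'cyl'])
c = pd.Index(['model', 'displ'])

# Same labels in the same order.
same = a.equals(b)

# Different lengths: equals() reports a mismatch without raising.
different = a.equals(c)

print(same, different)
```

In this notebook the equivalent check would be `all_08.columns.equals(all_18.columns)`.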

## Save the Cleaned Datasets

In [34]:
all_08.to_csv('./DataSet/all_alpha_08_cleaned.csv', index=False)
all_18.to_csv('./DataSet/all_alpha_18_cleaned.csv', index=False)
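
Passing `index=False` keeps the row index out of the file, so reading the CSV back yields the same columns that were written. A quick round-trip check can confirm this; the sketch below writes a made-up two-row frame to a temporary directory instead of touching `./DataSet/`, and all names in it are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a cleaned dataset.
df = pd.DataFrame({'model': ['A', 'B'], 'cmb_mpg': [20, 35]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned.csv')
    # index=False: do not write the row index as an extra column.
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame has the same labels and shape as the original.
same_columns = list(reloaded.columns) == list(df.columns)
print(same_columns, reloaded.shape)
```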