# Project 4: EV Cars

## Part I - Project Intro and Data Cleaning

Authors: Aichieh Lin, Bede Young, and Charles Ramey

Date: 05/05/2023

---

## Problem Statement

In a bid to improve air quality, manufacturers and policy makers are working to reduce carbon emissions from road vehicles by improving both the facilities and availablity of electric vehicles. Policy makers are aiming to meet net zero emission targets, and have requested analysis into the progress and effectiveness of increasing EV registration on air pollution levels. As part of progressing towards net zero emissions, policy makers are seeking to determine how many EV registrations will yield a 10% improvement to the air quality index. This project analyses the trends in EV registration and data collected on pollution levels, and uses a regression model to predict air quality based on EV registration numbers.

#### Notebook Links

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-1_eda.ipynb)

Part III - Modeling
- [`Part-3_modeling.ipynb`](../code/Part-3_modeling.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)

### Contents

- [Background](#Background)
- [Data Import](#Data-Import)
- [Cleaning](#Cleaning)

## Background

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

### Library Imports

In [1]:
import numpy as np
import pandas as pd

import os

## Data Import & Cleaning

### Dataset 1: Annual Air Quality Index by County, 2000 - 2022
Data source: United States Environmental Protection Agency, https://aqs.epa.gov/aqsweb/airdata/download_files.html

In [2]:
# Create a files variable that contains all of our data files.
files = os.listdir("../data")

In [3]:
files[:]

['.ipynb_checkpoints',
 'annual_aqi_by_county_2000.csv',
 'annual_aqi_by_county_2001.csv',
 'annual_aqi_by_county_2002.csv',
 'annual_aqi_by_county_2003.csv',
 'annual_aqi_by_county_2004.csv',
 'annual_aqi_by_county_2005.csv',
 'annual_aqi_by_county_2006.csv',
 'annual_aqi_by_county_2007.csv',
 'annual_aqi_by_county_2008.csv',
 'annual_aqi_by_county_2009.csv',
 'annual_aqi_by_county_2010.csv',
 'annual_aqi_by_county_2011.csv',
 'annual_aqi_by_county_2012.csv',
 'annual_aqi_by_county_2013.csv',
 'annual_aqi_by_county_2014.csv',
 'annual_aqi_by_county_2015.csv',
 'annual_aqi_by_county_2016.csv',
 'annual_aqi_by_county_2017.csv',
 'annual_aqi_by_county_2018.csv',
 'annual_aqi_by_county_2019.csv',
 'annual_aqi_by_county_2020.csv',
 'annual_aqi_by_county_2021.csv',
 'annual_aqi_by_county_2022.csv',
 'Electric_Vehicle_Population_Data.csv',
 'geopandas-data',
 'historical-station-counts.xlsx',
 'state-populations-2020-2022.csv',
 'state-populations.csv',
 'vehicle_counts_by_state.csv']

In [4]:
# Read in data from 2000
year_2000 = pd.read_csv("../data/annual_aqi_by_county_2000.csv")
year_2000.head()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2000,257,111,96,39,10,1,0,205,129,54,0,0,215,42,0
1,Alabama,Clay,2000,271,155,101,14,1,0,0,177,90,46,0,0,188,83,0
2,Alabama,Colbert,2000,106,38,66,2,0,0,0,124,80,58,0,0,0,106,0
3,Alabama,DeKalb,2000,354,169,123,58,4,0,0,159,115,51,0,0,297,56,1
4,Alabama,Elmore,2000,242,125,78,37,2,0,0,166,112,50,0,0,242,0,0


In [5]:
# function to read data
def process_data(file):
    df = pd.read_csv("../data/" + file)
    return df

In [6]:
file_list = [file for file in files if file.startswith("annual_aqi_by_county_")]
file_list = [process_data(file) for file in file_list]

In [7]:
air_quality = pd.concat(file_list, axis=0)

In [8]:
air_quality.columns = air_quality.columns.str.lower().str.replace(' ', '_', regex=False)

In [9]:
# info
air_quality.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24442 entries, 0 to 965
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   state                                24442 non-null  object
 1   county                               24442 non-null  object
 2   year                                 24442 non-null  int64 
 3   days_with_aqi                        24442 non-null  int64 
 4   good_days                            24442 non-null  int64 
 5   moderate_days                        24442 non-null  int64 
 6   unhealthy_for_sensitive_groups_days  24442 non-null  int64 
 7   unhealthy_days                       24442 non-null  int64 
 8   very_unhealthy_days                  24442 non-null  int64 
 9   hazardous_days                       24442 non-null  int64 
 10  max_aqi                              24442 non-null  int64 
 11  90th_percentile_aqi                  24442 

In [10]:
# check null value
air_quality.isna().sum()

state                                  0
county                                 0
year                                   0
days_with_aqi                          0
good_days                              0
moderate_days                          0
unhealthy_for_sensitive_groups_days    0
unhealthy_days                         0
very_unhealthy_days                    0
hazardous_days                         0
max_aqi                                0
90th_percentile_aqi                    0
median_aqi                             0
days_co                                0
days_no2                               0
days_ozone                             0
days_pm2.5                             0
days_pm10                              0
dtype: int64

In [11]:
air_quality.describe()

Unnamed: 0,year,days_with_aqi,good_days,moderate_days,unhealthy_for_sensitive_groups_days,unhealthy_days,very_unhealthy_days,hazardous_days,max_aqi,90th_percentile_aqi,median_aqi,days_co,days_no2,days_ozone,days_pm2.5,days_pm10
count,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0,24442.0
mean,2010.727518,288.01211,215.566811,64.366869,6.535717,1.347721,0.150642,0.04435,133.47713,64.744334,37.893053,2.765199,8.892112,160.193151,100.362327,15.799321
std,6.617716,100.823494,89.398696,51.899912,11.704127,4.983134,1.29625,0.682415,287.1969,22.562806,11.793375,21.111848,31.766668,120.035352,102.783951,53.726892
min,2000.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,8.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2005.0,214.0,147.0,24.0,0.0,0.0,0.0,0.0,90.0,51.0,33.0,0.0,0.0,0.0,0.0,0.0
50%,2011.0,355.0,227.0,54.0,2.0,0.0,0.0,0.0,115.0,61.0,39.0,0.0,0.0,178.0,77.0,0.0
75%,2016.0,365.0,292.0,92.0,8.0,1.0,0.0,0.0,153.0,77.0,44.0,0.0,0.0,242.0,156.0,2.0
max,2022.0,366.0,366.0,339.0,148.0,114.0,74.0,37.0,20646.0,306.0,132.0,366.0,365.0,366.0,366.0,366.0


In [12]:
air_quality = air_quality[air_quality['max_aqi'] <= 500]

In [13]:
air_quality['state'].unique()

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Canada',
       'Colorado', 'Connecticut', 'Country Of Mexico', 'Delaware',
       'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virgin Islands', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'], dtype=object)

In [14]:
air_quality = air_quality[~air_quality['state'].isin(['Canada', 'Country Of Mexico', 'Puerto Rico'])]

In [15]:
quality_cols = [
    'good_days', 
    'moderate_days',
    'unhealthy_for_sensitive_groups_days',
    'unhealthy_days',
    'very_unhealthy_days',
    'hazardous_days'
]

cols_to_sum = [col for col in quality_cols if col != 'good_days']

# Create the new column
air_quality['bad_days'] = air_quality[cols_to_sum].sum(axis=1)

In [16]:
day_count_cols = [
    'good_days',
    'moderate_days',
    'unhealthy_for_sensitive_groups_days',
    'unhealthy_days', 'very_unhealthy_days',
    'hazardous_days',
    'days_co',
    'days_no2',
    'days_ozone',
    'days_pm2.5',
    'days_pm10',
    'bad_days'
]

# iterate through the columns and divide each by 'days_with_aqi'
for col in day_count_cols:
    air_quality[f'pct_{col}'] = air_quality[col] / air_quality['days_with_aqi']

In [17]:
air_quality = air_quality.drop(columns=day_count_cols)

In [18]:
air_quality.head()

Unnamed: 0,state,county,year,days_with_aqi,max_aqi,90th_percentile_aqi,median_aqi,pct_good_days,pct_moderate_days,pct_unhealthy_for_sensitive_groups_days,pct_unhealthy_days,pct_very_unhealthy_days,pct_hazardous_days,pct_days_co,pct_days_no2,pct_days_ozone,pct_days_pm2.5,pct_days_pm10,pct_bad_days
0,Alabama,Baldwin,2000,257,205,129,54,0.431907,0.373541,0.151751,0.038911,0.003891,0.0,0.0,0.0,0.836576,0.163424,0.0,0.568093
1,Alabama,Clay,2000,271,177,90,46,0.571956,0.372694,0.051661,0.00369,0.0,0.0,0.0,0.0,0.693727,0.306273,0.0,0.428044
2,Alabama,Colbert,2000,106,124,80,58,0.358491,0.622642,0.018868,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.641509
3,Alabama,DeKalb,2000,354,159,115,51,0.477401,0.347458,0.163842,0.011299,0.0,0.0,0.0,0.0,0.838983,0.158192,0.002825,0.522599
4,Alabama,Elmore,2000,242,166,112,50,0.516529,0.322314,0.152893,0.008264,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.483471


In [19]:
air_quality.describe()

Unnamed: 0,year,days_with_aqi,max_aqi,90th_percentile_aqi,median_aqi,pct_good_days,pct_moderate_days,pct_unhealthy_for_sensitive_groups_days,pct_unhealthy_days,pct_very_unhealthy_days,pct_hazardous_days,pct_days_co,pct_days_no2,pct_days_ozone,pct_days_pm2.5,pct_days_pm10,pct_bad_days
count,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0,23964.0
mean,2010.733058,288.965281,122.487314,64.755884,37.991571,0.758967,0.215727,0.020798,0.004078,0.000406,2.4e-05,0.007047,0.025359,0.525244,0.37277,0.06958,0.241033
std,6.613098,99.916815,47.026673,22.109457,11.525058,0.175744,0.150139,0.034934,0.014157,0.003448,0.000416,0.057132,0.091762,0.384015,0.364293,0.221722,0.175744
min,2000.0,1.0,8.0,4.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2005.0,219.0,90.0,51.0,33.0,0.652055,0.098361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.102804
50%,2011.0,356.0,115.0,61.0,39.0,0.790935,0.192308,0.005587,0.0,0.0,0.0,0.0,0.0,0.569863,0.27949,0.0,0.209065
75%,2016.0,365.0,152.0,77.0,44.0,0.897196,0.308411,0.026946,0.00274,0.0,0.0,0.0,0.0,0.917412,0.630137,0.002956,0.347945
max,2022.0,366.0,500.0,215.0,126.0,1.0,0.92623,0.419355,0.312329,0.202186,0.019178,1.0,1.0,1.0,1.0,1.0,0.967742


In [20]:
# save csv file
air_quality.to_csv('../output/air_quality_by_county.csv', index=False)

In [21]:
air_quality_by_state = air_quality.groupby(['state', 'year']).mean(numeric_only=True).reset_index()
air_quality_by_state

Unnamed: 0,state,year,days_with_aqi,max_aqi,90th_percentile_aqi,median_aqi,pct_good_days,pct_moderate_days,pct_unhealthy_for_sensitive_groups_days,pct_unhealthy_days,pct_very_unhealthy_days,pct_hazardous_days,pct_days_co,pct_days_no2,pct_days_ozone,pct_days_pm2.5,pct_days_pm10,pct_bad_days
0,Alabama,2000,201.090909,151.136364,95.818182,52.727273,0.460557,0.439145,0.086704,0.012601,0.000994,0.000000,0.005216,0.003041,0.387520,0.478625,0.125597,0.539443
1,Alabama,2001,219.000000,137.047619,80.523810,45.571429,0.595764,0.360452,0.036892,0.005848,0.001044,0.000000,0.008871,0.001003,0.444249,0.426518,0.119359,0.404236
2,Alabama,2002,238.818182,143.818182,79.636364,42.909091,0.640923,0.308082,0.044374,0.006334,0.000286,0.000000,0.007181,0.003711,0.509011,0.364713,0.115384,0.359077
3,Alabama,2003,234.000000,132.750000,73.500000,43.416667,0.631339,0.338255,0.028684,0.001721,0.000000,0.000000,0.004489,0.001829,0.486147,0.402407,0.105127,0.368661
4,Alabama,2004,236.800000,120.200000,69.800000,42.280000,0.684022,0.293580,0.021230,0.001057,0.000111,0.000000,0.001093,0.003191,0.523395,0.375292,0.097029,0.315978
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1190,Wyoming,2018,329.666667,106.500000,58.611111,39.611111,0.789477,0.206487,0.003579,0.000457,0.000000,0.000000,0.000000,0.007854,0.779945,0.047889,0.164312,0.210523
1191,Wyoming,2019,343.888889,98.000000,51.944444,38.833333,0.871004,0.125178,0.003362,0.000457,0.000000,0.000000,0.000000,0.004811,0.818655,0.018977,0.157558,0.128996
1192,Wyoming,2020,338.352941,131.705882,50.352941,34.882353,0.876343,0.113637,0.007989,0.001768,0.000263,0.000000,0.000000,0.061075,0.654444,0.079556,0.204924,0.123657
1193,Wyoming,2021,349.941176,128.823529,65.235294,36.294118,0.783710,0.199038,0.016930,0.000161,0.000000,0.000161,0.000323,0.061565,0.693599,0.089626,0.154887,0.216290


In [22]:
# save csv file
air_quality_by_state.to_csv('../output/air_quality_by_state.csv', index=False)

### Dataset 2: Vehicle Registration Counts by State, 2016 - 2021
Data source: U.S. Department of Energy, https://afdc.energy.gov/vehicle-registration

### Dataset 3: Electric Vehicle Population Data
Data source: data.gov, https://catalog.data.gov/dataset/electric-vehicle-population-data

### Dataset 4: Alternative Fueling Station Counts by State, 2007 - 2022
Data source: U.S. Department of Energy, https://afdc.energy.gov/stations/states

---
#### Notebook Links

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-1_eda.ipynb)

Part III - Modeling
- [`Part-3_modeling.ipynb`](../code/Part-3_modeling.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)