**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Bianca Gao | A16866233 
- Minsang Kim | A16636382
- Sebastian | A16419096
- Andrew Tran | A17644760
- Sukhman Virk | A16228396

# Research Question

Is there a positive correlation between quality of life in U.S. states and counties, measured from the year 2000 onward by factors such as AQI and unemployment, among other factors, and the use of electric cars in those counties and states? We aim to check whether electric vehicles are more likely to be found in regions where quality of life is higher.

## Background and Prior Work

Electric vehicle (EV) sales in the United States increased from 0.2% to 4.6% of total car sales between 2011 and 2021 <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). Sales and the number of electric vehicles on the road are projected to increase over the next few years as well. According to S&P Global Mobility, 40% of passenger vehicles are expected to be electric vehicles by 2030 <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). As current California residents, we find this anticipated growth interesting: as of September 2021, the state ranked first for Tesla sales <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). 

The growing market can be attributed to environmental concerns, money-saving concerns, government incentives (tax credits, rebates, programs), and job creation from the production of EVs <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). However, this growth raises the question of access: who can drive the sales for this market, and who can participate? 

As governments attempt to transition to cleaner energy usage and promote the consumption of greener, electric vehicles, certain groups of people are excluded from the movement despite needing the benefits. Studies have shown that low-income, underrepresented communities live in areas of more transport emissions and lower air quality, so an increase in EVs would greatly improve living conditions <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4). However, does this translate to electric vehicle numbers and distribution in these communities, and what variables contribute to this? 

Our data aims to support the idea that a large-scale electric vehicle transition is not yet equitable as we attempt to discover correlations among income level, quality of life, and electric vehicle purchase. 

1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://www.bls.gov/opub/btn/volume-12/charging-into-the-future-the-transition-to-electric-vehicles.htm 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://www.spglobal.com/mobility/en/research-analysis/ev-chargers-how-many-do-we-need.html 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) https://www.cross-sell.com/tesla-sales-data#:~:text=September%202021,24.77%25%20of%20Tesla%20September%20sales 
4. <a name="cite_note-4"></a> [^](#cite_ref-4) https://sciencepolicyreview.org/2021/08/equity-transition-electric-vehicles/ 

# Hypothesis


Considering that clean energy is a generally progressive policy and the fact that clean energy is known to be more expensive than its alternatives, we expect there to be a positive relationship between those with a high quality of life and adoption of electric vehicles, because those with an already high quality of life can afford to adopt an environmentally friendly option. However, considering there are many other relevant factors that affect quality of life, along with the fact that different states have different policies and views on clean energy and are implementing them at different rates, we hypothesize that the correlation may be fairly weak.

# Setup

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: City/ZIP/County/FIPS - Quality of Life (US)
  - Link to the dataset: https://www.kaggle.com/datasets/zacvaughan/cityzipcountyfips-quality-of-life
  - Number of observations: 3134 
  - Number of variables: 34
- Dataset #2
  - Dataset Name: 1980-2021 Yearly Air Quality Index from the EPA
  - Link to the dataset: https://www.kaggle.com/datasets/threnjen/40-years-of-air-quality-index-from-the-epa-yearly
  - Number of observations: 21083
  - Number of variables: 21
- Dataset #3
  - Dataset Name: City and County Vehicle Inventories
  - Link to the dataset: https://catalog.data.gov/dataset/city-and-county-vehicle-inventories-f07a0 
  - Number of observations: 50815 (only using 3094) 
  - Number of variables: 51 (only using 29)

Dataset #1 considers various characteristics of US states, counties, and cities, including population, city type, unemployment rates, cost of living, crime rates, and median income levels, to describe quality of life per location. It combines several other datasets into one comparison table of metrics, so some columns come from different years (for example, 2016 reported crime rates versus 2022 unemployment rates). We plan to use the city type, unemployment, cost of living, median income, and air quality (good days/total) columns to determine quality of life per region. 

Dataset #2 is a set of yearly air quality index reports from various US metro areas and includes geographic data as well. The data is sourced from the US Environmental Protection Agency and reports air quality index information by state and county. We plan to compute ‘AQI%Good’ by dividing the number of recorded ‘Good’ days by the total number of recorded days. After filtering the data to the years 2000-2021, we will use it as an estimate of the air quality at a location and compare it with the rest of our data. 

Dataset #3, from the US Department of Energy, has been downloaded in CSV format. The dataset contains vehicle registrations by vehicle type (car vs. truck), fuel type, and model year. It also contains the percentage of vehicles in each jurisdiction for each vehicle category, from before 1980 to 2018. We will filter by fuel type (“Electric vehicle”) for the years 2000 and up, looking at the percentage of electric vehicles in each county. 

We are narrowing our time scope to 2000-2024, starting at the turn of the century, when electric vehicles began their rise in popularity. We want to take advantage of the location data present in all of the datasets to explore relationships between the individual quality of life variables that we’ve selected and the density of electric vehicles in a region. 

All datasets will be combined by state and county names associated with the data.
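
The dataset previews in this notebook suggest that the three tables do not share one key format: the quality of life table uses state abbreviations and county names ending in "County", the AQI table uses full state names with bare county names, and the vehicle table uses abbreviations with bare county names. The following is a minimal sketch of one way to align them before merging, not the notebook's actual merge step. It uses toy rows with made-up values; the partial `STATE_ABBR` mapping, the `normalize_county` helper, and the lowercase `state`/`county` key columns are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical partial mapping; a real run would need every state and territory.
STATE_ABBR = {"Alabama": "AL", "Arizona": "AZ"}

def normalize_county(name):
    """Lowercase a county name and drop a trailing ' County' suffix."""
    name = name.strip().lower()
    suffix = " county"
    return name[: -len(suffix)] if name.endswith(suffix) else name

# Toy rows mimicking the three key formats (values are made up).
qol = pd.DataFrame({"LSTATE": ["AL", "AZ"],
                    "NMCNTY": ["Jefferson County", "Cochise County"],
                    "Unemployment": [3.0, 4.0]})
aqi = pd.DataFrame({"State": ["Alabama", "Alabama", "Arizona"],
                    "County": ["Jefferson", "Jefferson", "Cochise"],
                    "Year": [2020, 2021, 2021],
                    "AQI%Good": [0.60, 0.50, 0.86]})
ev = pd.DataFrame({"state_abbr": ["AL", "AZ"],
                   "county_name": ["Jefferson", "Cochise"],
                   "2016": [0.002, 0.004]})

# Build the same (state, county) key on every table.
qol["state"] = qol["LSTATE"]
qol["county"] = qol["NMCNTY"].map(normalize_county)
aqi["state"] = aqi["State"].map(STATE_ABBR)
aqi["county"] = aqi["County"].map(normalize_county)
ev["state"] = ev["state_abbr"]
ev["county"] = ev["county_name"].map(normalize_county)

# The AQI table has one row per county per year, so average it to one row per county.
aqi_county = aqi.groupby(["state", "county"], as_index=False)["AQI%Good"].mean()

keys = ["state", "county"]
merged = (qol[keys + ["Unemployment"]]
          .merge(aqi_county, on=keys, how="inner")
          .merge(ev[keys + ["2016"]], on=keys, how="inner"))
print(merged)
```

An inner join silently drops counties that fail to match in any table. Passing `indicator=True` to a single outer `merge` is one way to see which counties are lost before committing to the inner join.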

## Dataset: City and County Quality of Life

In [9]:
qol_data = pd.read_csv('QOL(County Level).csv')
# Refining the dataset
relevant_columns = ['LSTATE', 'NMCNTY', 'ULOCALE', 'Unemployment', '2022 Median Income', 'Cost of Living', 'AQI%Good']
qol_refined = qol_data[relevant_columns]

# Creating a copy to avoid SettingWithCopyWarning when modifying columns
qol_cleaned = qol_refined.copy()

# Removing trailing percentage signs using rstrip and converting to float
qol_cleaned['Unemployment'] = qol_cleaned['Unemployment'].str.rstrip('%').astype('float')
qol_cleaned['AQI%Good'] = qol_cleaned['AQI%Good'].str.rstrip('%').astype('float')

# Removing dollar signs, commas, using regex and converting to float
qol_cleaned['2022 Median Income'] = qol_cleaned['2022 Median Income'].replace({r'\$': '', ',': ''}, regex=True).astype('float')
qol_cleaned['Cost of Living'] = qol_cleaned['Cost of Living'].replace({r'\$': '', ',': ''}, regex=True).astype('float')

qol_cleaned

Unnamed: 0,LSTATE,NMCNTY,ULOCALE,Unemployment,2022 Median Income,Cost of Living,AQI%Good
0,VA,Charles City County,42-Rural: Distant,3.21,78038.78,75531.37,93.76
1,TX,McMullen County,43-Rural: Remote,1.81,67513.81,63913.28,75.33
2,TX,Terrell County,43-Rural: Remote,3.54,55946.62,64361.02,75.33
3,AK,Skagway Municipality,43-Rural: Remote,7.19,85446.30,87709.32,87.86
4,GA,Baker County,42-Rural: Distant,4.19,52946.23,59389.29,83.30
...,...,...,...,...,...,...,...
3129,CA,Orange County,21-Suburb: Large,3.27,104789.65,111909.21,68.46
3130,AZ,Maricopa County,43-Rural: Remote,3.46,78828.41,82847.38,68.39
3131,TX,Harris County,21-Suburb: Large,4.41,73169.57,68223.99,75.33
3132,IL,Cook County,21-Suburb: Large,5.30,82910.04,81548.25,80.20


## Dataset: 1980-2021 Yearly Air Quality Index from the EPA

In [10]:
# Read csv file. Must download from https://www.kaggle.com/datasets/threnjen/40-years-of-air-quality-index-from-the-epa-yearly/
aqi_data = pd.read_csv('aqi_yearly_1980_to_2021.csv')
# Filter out miscellaneous columns
aqi_data = aqi_data.iloc[:, :5]
# Filter out years before 2000 (copy so the new column is not assigned to a slice)
aqi_data = aqi_data[aqi_data['Year'] >= 2000].copy()
# Calculate AQI%Good as the proportion (0-1) of days with AQI readings that were Good days
aqi_data['AQI%Good'] = (aqi_data['Good Days'] / aqi_data['Days with AQI']).round(2)


aqi_data.head()
# aqi_data['Year'].value_counts()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,AQI%Good
0,Alabama,DeKalb,2021,58,58,1.0
1,Alabama,Jefferson,2021,60,33,0.55
2,Alaska,Denali,2021,59,59,1.0
3,Arizona,Apache,2021,87,87,1.0
4,Arizona,Cochise,2021,90,77,0.86


## Dataset: City and County Vehicle Inventories

In [11]:
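def removePercentage(str_in):
    # Non-string values (e.g. the 0.0 used to fill nulls) have no .strip(); convert them directly
    if not isinstance(str_in, str):
        return float(str_in)
    # Strip whitespace and any '%' symbol before converting to float
    return float(str_in.strip().replace('%', ''))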

In [12]:
#Loading data + resetting column names 
vehicle_inventory = pd.read_csv('2016cityandcountylightdutyvehicleinventory.csv', header = 1)

#Filtering by fuel type (electric vehicles only) and keeping the first 29 columns
vehicle_inventory = vehicle_inventory[vehicle_inventory['fuel_type'] == 'Electric vehicle']
vehicle_inventory = vehicle_inventory.iloc[: , :29]

#Removing Null values and replacing with 0.0 as percent 
vehicle_inventory.iloc[: , 7:29] = vehicle_inventory.iloc[: , 7:29].fillna(0.00)

#Using function to remove '%' symbol from data 
vehicle_inventory.iloc[: , 7:29] = vehicle_inventory.iloc[: , 7:29].map(removePercentage)
vehicle_inventory.reset_index(drop=True, inplace=True)


vehicle_inventory

Unnamed: 0,state_abbr,geoid,county_id,county_name,fuel_type_org,fuel_type,class,before 1980,1980-99,1990-99,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,AL,1001,1001,Autauga,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.00381,0.0019,0.00381,0.0,0.0
1,AL,1003,1003,Baldwin,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.00193,0.00193,0.0029,0.0029,0.0029,0.00483,0.00531,0.0029,0.0,0.0
2,AL,1005,1005,Barbour,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.00381,0.0,0.0,0.0,0.0,0.0
3,AL,1007,1007,Bibb,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AL,1009,1009,Blount,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3089,WY,56037,56037,Sweetwater,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3090,WY,56039,56039,Teton,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.00375,0.01874,0.01125,0.01499,0.04498,0.0,0.0
3091,WY,56041,56041,Uinta,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3092,WY,56043,56043,Washakie,ELECTRIC VEHICLE,Electric vehicle,Car,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
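
The overview describes looking at the percentage of electric vehicles for the years 2000 and up. One possible way to sketch that step is to sum the per-year columns into a single value per county. This assumes the year columns are labeled with four-digit strings, as in the preview above, and that the per-year values are shares of the same county total so they can be added; neither assumption is confirmed by the dataset documentation here. The toy rows and the `ev_share_2000_plus` column name are hypothetical.

```python
import pandas as pd

# Toy rows shaped like the cleaned vehicle table (values are made up).
toy = pd.DataFrame({
    "state_abbr": ["AL", "WY"],
    "county_name": ["Baldwin", "Teton"],
    "before 1980": [0.0, 0.0],
    "2015": [0.005, 0.015],
    "2016": [0.003, 0.045],
})

# Keep only columns whose label is a four-digit year from 2000 onward;
# bucket labels such as "before 1980" are skipped.
year_cols = [c for c in toy.columns
             if c.isdigit() and len(c) == 4 and int(c) >= 2000]

# Sum the per-model-year shares into one EV share per county.
toy["ev_share_2000_plus"] = toy[year_cols].astype(float).sum(axis=1)
print(toy[["state_abbr", "county_name", "ev_share_2000_plus"]])
```

Selecting columns by label rather than by position keeps the step working if the column order changes.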


# Ethics & Privacy

The data collected might inadvertently exclude certain populations, especially those in rural areas or regions with limited internet access, leading to an incomplete representation. There is also a risk of bias in the data collection methods, as regions with poor air quality or lower income might have less data available on EV ownership. The data may be further biased if it is primarily sourced from online platforms or from regions with higher internet penetration, which would exclude people who do not engage online. To guard against this, the data should be diverse enough to cover both rural and urban areas. 

To detect biases, we'll conduct preliminary analyses to identify any underrepresented regions or demographics. During analysis, we'll use statistical methods to account for potential biases and ensure our findings are generalizable. A further concern beyond these biases is that the analysis could reinforce stereotypes tied to attributes such as race and gender. To handle this, we need to keep our analysis of the collected data inclusive. 

We plan to conduct demographic and geographic distribution analyses during EDA to surface potential biases. Density plots and histograms will help us observe skew caused by over- or under-represented regions. These visualizations show the overall shape of the data, making it easier to spot regions that deviate from our expectations.
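
As a rough illustration of the check for underrepresented regions described above, the sketch below compares how many counties per state appear in an air quality table against a quality of life table. It uses toy data with made-up rows; the `coverage` table, its column names, and the 0.5 cutoff are hypothetical choices, not values from our analysis.

```python
import pandas as pd

# Toy (state, county) pairs; rows are made up for illustration.
qol_keys = pd.DataFrame({
    "state": ["AL", "AL", "AL", "AL", "WY", "WY"],
    "county": ["autauga", "baldwin", "barbour", "bibb", "teton", "uinta"],
})
aqi_keys = pd.DataFrame({
    "state": ["AL", "WY", "WY"],
    "county": ["baldwin", "teton", "uinta"],
})

# Count distinct counties per state in each table.
total = qol_keys.groupby("state")["county"].nunique().rename("counties_total")
covered = aqi_keys.groupby("state")["county"].nunique().rename("counties_with_aqi")

# States missing from the AQI table get a count of 0.
coverage = pd.concat([total, covered], axis=1).fillna(0)
coverage["share_covered"] = coverage["counties_with_aqi"] / coverage["counties_total"]

# Hypothetical cutoff: flag states where fewer than half of the counties have AQI data.
underrepresented = coverage[coverage["share_covered"] < 0.5]
print(underrepresented)
```

A table like this could complement the planned histograms by naming the specific states where conclusions rest on only a few counties.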

# Team Expectations 


- Complete assigned aspects of the project on time and with adequate effort (as much as you can contribute)! 
- Try to attend all meetings throughout the quarter and be on time for them, unless missing one is unavoidable.
- Communicate promptly and effectively on Discord. Whether anything will or won’t be worked on, it should be communicated.
- Openly discuss and point out problems if something doesn’t look right so we can quickly fix or improve it. 
- Respect each other’s opinions and perspectives even in disagreements. Be willing to compromise and find solutions that work for the team.

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/24  |  4 PM | Look through past projects  | Look through past projects together and discuss corresponding answers to the google form | 
| 10/25  |  7 PM |  Complete assigned questions for Part 1 of Final Project  | Consolidate ideas and submit form | 
| 11/1  | 5 PM  | Look over Project Proposal guidelines  | Come up with research topic/question and divide up work |
| 11/7  | 5 PM  | Background research and finalize project details (data, question, etc) – ensuring everyone is on the same page  | Discuss datasets, wrangling, and possible analytical approaches – assign group members to lead parts |
| 11/14  | 5 PM  | Import and wrangle data  | Review/edit wrangling – discuss analysis plan |
| 11/15 | 5 PM | Finalize wrangling | Checkpoint #1 - Data |
| 11/29 | 5 PM  | Finalize EDA | Checkpoint #2 - EDA + begin analysis |
| 12/5 | 5 PM| Complete analysis: draft results/conclusion/discussion  | Discuss/edit full project  |
| 12/12-13 | Before 11:59 PM | Final review | Final Project Video and Submission  |