# COGS 108 - Data Checkpoint

# Names

- Asher Av
- Quoc-Zuy  Do
- Hector Gallo
- Jeremy Nurding
- Andres Villegas

<a id='research_question'></a>
# Research Question

Is there a statistically significant relationship between COVID-19 cases and the levels of NO<sub>2</sub> in the atmosphere in San Diego county during the years 2020 and 2021?

# Dataset(s)

**Data Set Name: COVID-19 Data - US Counties from NYTimes**
- Link to Dataset: https://github.com/nytimes/covid-19-data
- Number of Observations: 2,170,941
- <ins>Description of Dataset:</ins> This dataset is compiled by the New York Times from official reports across the U.S. and records the cumulative number of cases and deaths reported in each county and state since the start of the COVID-19 pandemic. It contains 6 columns: date, county, state, fips, cases, and deaths. The fips column holds a FIPS code, a geographic identifier for the county the data was reported for, which makes it easy to associate rows with other datasets.

**Data Set Name: EPA 2020 Air Quality**
- Link to Dataset: https://www.epa.gov/outdoor-air-quality-data/download-daily-data 
- Number of Observations: 2,869
- <ins>Description of Dataset:</ins> The EPA download tool lets us select a specific air pollutant, year, county, and site. This dataset contains NO<sub>2</sub> concentrations in the atmosphere at eight different sites in San Diego county for the year 2020. For each site, there are daily records of the NO<sub>2</sub> concentration on the given date. The dataset contains 20 columns: Date, Source, Site ID, POC, Daily Max 1-hour NO<sub>2</sub> Concentration, Units, Daily AQI Value, Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Desc, CBSA Code, CBSA Name, State Code, State, County Code, County, Site Latitude, and Site Longitude. The file has 2,870 rows, 2,869 of which contain observations.

**Data Set Name: EPA 2021 Air Quality**
- Link to Dataset: https://www.epa.gov/outdoor-air-quality-data/download-daily-data 
- Number of Observations: 2,011
- <ins>Description of Dataset:</ins> This dataset contains NO<sub>2</sub> concentrations in the atmosphere at eight different sites in San Diego county for the year 2021. For each site, there are daily records of the NO<sub>2</sub> concentration on the given date. The dataset contains 20 columns: Date, Source, Site ID, POC, Daily Max 1-hour NO<sub>2</sub> Concentration, Units, Daily AQI Value, Site Name, Daily Obs Count, Percent Complete, AQS Parameter Code, AQS Parameter Desc, CBSA Code, CBSA Name, State Code, State, County Code, County, Site Latitude, and Site Longitude. The file has 2,012 rows, 2,011 of which contain observations. It has the same format as the other EPA dataset, except that it covers 2021 instead of 2020.

**Combining Data Sets for Analysis:** For the time being, we are not planning to combine the three datasets into a single dataframe. Each dataset tracks a variable measured over time (cases for the COVID-19 dataset and NO<sub>2</sub> levels for the EPA datasets), so they are better left in separate dataframes where we can plot their respective trends over time. In addition, the EPA datasets contain daily NO<sub>2</sub> measurements from several locations within San Diego county, so each date appears once per site. As a result, the row counts do not match those of the COVID-19 dataset, which makes a one-to-one pairing of dates between the COVID-19 and EPA datasets difficult. Going forward, we will first observe the general trends over time in each dataset before moving on to a more sophisticated statistical analysis.
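
If a one-to-one pairing of dates is wanted later, one possible approach is to average the NO<sub>2</sub> readings across sites for each day, which leaves one EPA row per date. This is an assumption on our part and not something this checkpoint does. The sketch below uses small made-up frames in place of the real data, and it assumes the COVID-19 `date` column has already been converted with `pd.to_datetime`.

```python
import pandas as pd

# Toy stand-ins for the cleaned frames (values are made up for illustration)
epa = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'NO2': [20, 30, 10, 14],
    'Site': ['Chula Vista', 'San Diego - Sherman Elementary School',
             'Chula Vista', 'San Diego - Sherman Elementary School'],
})
covid = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02']),
    'cases': [5, 8],
})

# Collapse the per-site readings into one county-wide mean per day
daily_no2 = epa.groupby('Date', as_index=False)['NO2'].mean()

# With one row per date on both sides, the frames can be joined on the date
merged = covid.merge(daily_no2, left_on='date', right_on='Date', how='inner')
print(merged[['date', 'cases', 'NO2']])
```

Taking the mean is only one choice. A median or a per-site analysis may suit the question better, depending on how much the sites differ.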

# Setup

##### Import Modules

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# nytimes daily covid dataset
covid_df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

# EPA 2020 NO2 dataset
epa2020_df = pd.read_csv('https://raw.githubusercontent.com/asherbav/covid_pollution_files/main/epa2020.csv')

# EPA 2021 NO2 dataset 
epa2021_df = pd.read_csv('https://raw.githubusercontent.com/asherbav/covid_pollution_files/main/epa2021.csv')


# Data Cleaning

---
### COVID-19 Dataset Cleaning
The first thing that we want to do is take a look at the original datasets to see what they look like. We first take a look at the COVID-19 Dataset titled: "COVID-19 Data - US Counties from NYTimes". 

In [3]:
covid_df

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0.0
1,2020-01-22,Snohomish,Washington,53061.0,1,0.0
2,2020-01-23,Snohomish,Washington,53061.0,1,0.0
3,2020-01-24,Cook,Illinois,17031.0,1,0.0
4,2020-01-24,Snohomish,Washington,53061.0,1,0.0
...,...,...,...,...,...,...
2203466,2022-02-10,Sweetwater,Wyoming,56037.0,10790,119.0
2203467,2022-02-10,Teton,Wyoming,56039.0,9560,15.0
2203468,2022-02-10,Uinta,Wyoming,56041.0,5564,36.0
2203469,2022-02-10,Washakie,Wyoming,56043.0,2267,42.0


We then check whether there are any null values that we might want to remove, using `isnull().sum()`. We see that there are many null values in the deaths column and the fips column.

In [4]:
covid_df.isnull().sum()

date          0
county        0
state         0
fips      20470
cases         0
deaths    50429
dtype: int64

However, we do not need either of these columns for our analysis. Our group is investigating the relationship between COVID-19 cases in San Diego county and the levels of NO<sub>2</sub> in the atmosphere, so the number of deaths over this period is not necessary. In addition, since our scope is narrowed to San Diego county, we do not need the FIPS geographic identifiers or the state column.

In [5]:
col_drop = ['deaths', 'fips', 'state']

covid_df = covid_df.drop(col_drop, axis = 1)
covid_df = covid_df.dropna()


Now we filter the data to keep only San Diego county, which narrows our geographical scope. We can then remove the county column entirely, because all the remaining rows are for San Diego county.

In [6]:
covid_df_sd = covid_df[covid_df['county'] == 'San Diego']
covid_df_sd = covid_df_sd.drop(['county'], axis = 1)

At this point we are done cleaning the COVID-19 dataset. We run `isnull().sum()` again to double check that all the NULL/NaN values have been removed from our dataset properly.

In [7]:
covid_df_sd.isnull().sum()


date     0
cases    0
dtype: int64

At this point the COVID-19 data set is in a state we can use for our analysis.

In [8]:
covid_df_sd

Unnamed: 0,date,cases
118,2020-02-10,1
128,2020-02-11,1
138,2020-02-12,1
149,2020-02-13,1
160,2020-02-14,1
...,...,...
2187426,2022-02-06,749616
2190679,2022-02-07,755597
2193932,2022-02-08,761903
2197187,2022-02-09,763555
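
Because the `cases` column is cumulative, later analysis may need the number of new cases per day instead. The sketch below shows one way to derive it with `diff()`. The `new_cases` column name and the toy values are our own illustration, not part of the dataset.

```python
import pandas as pd

# Toy cumulative counts (made-up values), shaped like covid_df_sd
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04']),
    'cases': [1, 1, 4, 9],
})

# The difference between consecutive days gives the new cases. The first row
# has no previous day, so it is filled with the first cumulative value.
toy['new_cases'] = toy['cases'].diff().fillna(toy['cases']).astype(int)
print(toy)
```

This assumes the rows are sorted by date with one row per day. If the cumulative count is ever revised downward, the difference for that day would be negative and would need separate handling.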


---
### EPA 2020 Air Quality Data Set Cleaning
We repeat the steps used for the COVID-19 dataset above for the datasets titled "EPA 2020 Air Quality" and "EPA 2021 Air Quality". We display each dataset to get an idea of what it looks like, search for any NULL/NaN values, and check the data types of all the columns to make sure everything is in order.

In [9]:
epa2020_df

Unnamed: 0,Date,Source,Site ID,POC,Daily Max 1-hour NO2 Concentration,UNITS,DAILY_AQI_VALUE,Site Name,DAILY_OBS_COUNT,PERCENT_COMPLETE,AQS_PARAMETER_CODE,AQS_PARAMETER_DESC,CBSA_CODE,CBSA_NAME,STATE_CODE,STATE,COUNTY_CODE,COUNTY,SITE_LATITUDE,SITE_LONGITUDE
0,01/01/2020,AQS,60730001,1,18,ppb,17,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
1,01/02/2020,AQS,60730001,1,35,ppb,33,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
2,01/03/2020,AQS,60730001,1,36,ppb,34,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
3,01/04/2020,AQS,60730001,1,31,ppb,29,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
4,01/05/2020,AQS,60730001,1,36,ppb,34,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2864,12/27/2020,AQS,60731026,1,11,ppb,10,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2865,12/28/2020,AQS,60731026,1,12,ppb,11,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2866,12/29/2020,AQS,60731026,1,29,ppb,27,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2867,12/30/2020,AQS,60731026,1,31,ppb,29,San Diego - Sherman Elementary School,20,83.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665


In [10]:
epa2020_df.isnull().sum()

Date                                  0
Source                                0
Site ID                               0
POC                                   0
Daily Max 1-hour NO2 Concentration    0
UNITS                                 0
DAILY_AQI_VALUE                       0
Site Name                             0
DAILY_OBS_COUNT                       0
PERCENT_COMPLETE                      0
AQS_PARAMETER_CODE                    0
AQS_PARAMETER_DESC                    0
CBSA_CODE                             0
CBSA_NAME                             0
STATE_CODE                            0
STATE                                 0
COUNTY_CODE                           0
COUNTY                                0
SITE_LATITUDE                         0
SITE_LONGITUDE                        0
dtype: int64

Since there are no null values present, we will not be excluding any rows. 

In [11]:
epa2020_df.dtypes

Date                                   object
Source                                 object
Site ID                                 int64
POC                                     int64
Daily Max 1-hour NO2 Concentration      int64
UNITS                                  object
DAILY_AQI_VALUE                         int64
Site Name                              object
DAILY_OBS_COUNT                         int64
PERCENT_COMPLETE                      float64
AQS_PARAMETER_CODE                      int64
AQS_PARAMETER_DESC                     object
CBSA_CODE                               int64
CBSA_NAME                              object
STATE_CODE                              int64
STATE                                  object
COUNTY_CODE                             int64
COUNTY                                 object
SITE_LATITUDE                         float64
SITE_LONGITUDE                        float64
dtype: object

Since the Date column stores strings (dtype `object`) rather than datetime objects, we will need to convert it to datetime.

A majority of the columns will not be used for our analysis:
1) After looking at the unique values in each column, we noticed that we can safely drop the 'COUNTY', 'COUNTY_CODE', 'STATE', 'STATE_CODE', 'CBSA_NAME', 'CBSA_CODE', 'AQS_PARAMETER_DESC', 'AQS_PARAMETER_CODE', 'UNITS', and 'Source' columns, as each contains only a single value. To elaborate, these values are 'San Diego' (code 73) for the county, 'California' (code 6) for the state, 'San Diego-Carlsbad, CA' for the CBSA, NO<sub>2</sub> for the air quality parameter, ppb (parts per billion) for the unit, and 'AQS' for the source.

2) Furthermore, the 'SITE_LATITUDE', 'SITE_LONGITUDE', 'Site ID', 'POC', 'PERCENT_COMPLETE', and 'DAILY_OBS_COUNT' columns are not going to be used for the analysis that we will be performing.

In short, the columns that we're interested in are the date, the NO<sub>2</sub> levels, the AQI (Air Quality Index), and the particular sites.
An additional task is to rename the multi-word labels to more concise ones.
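
The single-value check described above can be sketched with `nunique()`. This is a minimal illustration on a toy frame that borrows a few column names and values from the EPA output. The variable names `unique_counts` and `single_value_cols` are our own.

```python
import pandas as pd

# Toy frame with a few of the EPA columns
toy = pd.DataFrame({
    'Date': ['01/01/2020', '01/02/2020', '01/03/2020'],
    'Source': ['AQS', 'AQS', 'AQS'],
    'UNITS': ['ppb', 'ppb', 'ppb'],
    'DAILY_AQI_VALUE': [17, 33, 34],
})

# Count the distinct values in every column
unique_counts = toy.nunique()

# Columns with exactly one distinct value carry no information for the analysis
single_value_cols = unique_counts[unique_counts == 1].index.tolist()
print(single_value_cols)
```

Running the same check on the full EPA frames would be one way to confirm the contents of the drop list before applying it.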


In [12]:
same_value_list = ['COUNTY', 'COUNTY_CODE','STATE','STATE_CODE','CBSA_NAME','CBSA_CODE','AQS_PARAMETER_DESC','AQS_PARAMETER_CODE','UNITS','Source']
not_used_list = ['SITE_LATITUDE','SITE_LONGITUDE','Site ID','POC','PERCENT_COMPLETE','DAILY_OBS_COUNT']
updated_labels = {'Daily Max 1-hour NO2 Concentration': 'NO2', 'DAILY_AQI_VALUE': 'AQI', 'Site Name': 'Site'}

In [13]:
# Drop columns that contain only a single value
epa2020_df = epa2020_df.drop(same_value_list, axis = 1)

# Drop columns that are not going to be used
epa2020_df = epa2020_df.drop(not_used_list, axis = 1)

# Rename columns
epa2020_df = epa2020_df.rename(columns=updated_labels)

# Convert date to datetime object
epa2020_df['Date'] = pd.to_datetime(epa2020_df['Date'])

At this point the EPA 2020 Air Quality dataset is in a state we can use for our analysis. 

In [14]:
epa2020_df

Unnamed: 0,Date,NO2,AQI,Site
0,2020-01-01,18,17,Chula Vista
1,2020-01-02,35,33,Chula Vista
2,2020-01-03,36,34,Chula Vista
3,2020-01-04,31,29,Chula Vista
4,2020-01-05,36,34,Chula Vista
...,...,...,...,...
2864,2020-12-27,11,10,San Diego - Sherman Elementary School
2865,2020-12-28,12,11,San Diego - Sherman Elementary School
2866,2020-12-29,29,27,San Diego - Sherman Elementary School
2867,2020-12-30,31,29,San Diego - Sherman Elementary School


Since the 2021 dataset from the EPA is of a similar format to the 2020 dataset, the same rationale that we used for cleaning the 2020 dataset will be applied.
We begin by viewing the dataset in its current state.

In [15]:
epa2021_df

Unnamed: 0,Date,Source,Site ID,POC,Daily Max 1-hour NO2 Concentration,UNITS,DAILY_AQI_VALUE,Site Name,DAILY_OBS_COUNT,PERCENT_COMPLETE,AQS_PARAMETER_CODE,AQS_PARAMETER_DESC,CBSA_CODE,CBSA_NAME,STATE_CODE,STATE,COUNTY_CODE,COUNTY,SITE_LATITUDE,SITE_LONGITUDE
0,01/01/2021,AQS,60730001,1,32,ppb,30,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
1,01/02/2021,AQS,60730001,1,32,ppb,30,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
2,01/03/2021,AQS,60730001,1,22,ppb,21,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
3,01/04/2021,AQS,60730001,1,27,ppb,25,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
4,01/05/2021,AQS,60730001,1,29,ppb,27,Chula Vista,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.631242,-117.059088
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2006,08/27/2021,AQS,60731026,2,12,ppb,11,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2007,08/28/2021,AQS,60731026,2,9,ppb,8,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2008,08/29/2021,AQS,60731026,2,10,ppb,9,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665
2009,08/30/2021,AQS,60731026,2,14,ppb,13,San Diego - Sherman Elementary School,22,92.0,42602,Nitrogen dioxide (NO2),41740,"San Diego-Carlsbad, CA",6,California,73,San Diego,32.710177,-117.142665


In [16]:
epa2021_df.isnull().sum()

Date                                  0
Source                                0
Site ID                               0
POC                                   0
Daily Max 1-hour NO2 Concentration    0
UNITS                                 0
DAILY_AQI_VALUE                       0
Site Name                             0
DAILY_OBS_COUNT                       0
PERCENT_COMPLETE                      0
AQS_PARAMETER_CODE                    0
AQS_PARAMETER_DESC                    0
CBSA_CODE                             0
CBSA_NAME                             0
STATE_CODE                            0
STATE                                 0
COUNTY_CODE                           0
COUNTY                                0
SITE_LATITUDE                         0
SITE_LONGITUDE                        0
dtype: int64

Since there are no null values present, we will not be excluding any rows.

In [17]:
epa2021_df.dtypes

Date                                   object
Source                                 object
Site ID                                 int64
POC                                     int64
Daily Max 1-hour NO2 Concentration      int64
UNITS                                  object
DAILY_AQI_VALUE                         int64
Site Name                              object
DAILY_OBS_COUNT                         int64
PERCENT_COMPLETE                      float64
AQS_PARAMETER_CODE                      int64
AQS_PARAMETER_DESC                     object
CBSA_CODE                               int64
CBSA_NAME                              object
STATE_CODE                              int64
STATE                                  object
COUNTY_CODE                             int64
COUNTY                                 object
SITE_LATITUDE                         float64
SITE_LONGITUDE                        float64
dtype: object

As with the 2020 dataset, the Date column needs to be converted to datetime.

In [18]:
# Drop columns that contain only a single value
epa2021_df = epa2021_df.drop(same_value_list, axis = 1)

# Drop columns that are not going to be used
epa2021_df = epa2021_df.drop(not_used_list, axis = 1)

# Rename columns
epa2021_df = epa2021_df.rename(columns=updated_labels)

# Convert date to datetime object
epa2021_df['Date'] = pd.to_datetime(epa2021_df['Date'])

At this point the EPA 2021 Air Quality dataset is in a state we can use for our analysis. 

In [19]:
epa2021_df

Unnamed: 0,Date,NO2,AQI,Site
0,2021-01-01,32,30,Chula Vista
1,2021-01-02,32,30,Chula Vista
2,2021-01-03,22,21,Chula Vista
3,2021-01-04,27,25,Chula Vista
4,2021-01-05,29,27,Chula Vista
...,...,...,...,...
2006,2021-08-27,12,11,San Diego - Sherman Elementary School
2007,2021-08-28,9,8,San Diego - Sherman Elementary School
2008,2021-08-29,10,9,San Diego - Sherman Elementary School
2009,2021-08-30,14,13,San Diego - Sherman Elementary School
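
Since both EPA datasets go through identical cleaning steps, those steps could be wrapped in a single helper. The sketch below is one possible shape for it. The function name `clean_epa` is hypothetical, the explicit `format='%m/%d/%Y'` assumes every date follows the MM/DD/YYYY pattern seen in the output above, and the toy frame stands in for the real data.

```python
import pandas as pd

def clean_epa(df, drop_cols, rename_map):
    """Drop unused columns, rename labels, and parse the Date column."""
    cleaned = df.drop(columns=drop_cols).rename(columns=rename_map)
    # Assumption: dates are MM/DD/YYYY strings, so the format is stated explicitly
    cleaned['Date'] = pd.to_datetime(cleaned['Date'], format='%m/%d/%Y')
    return cleaned

# Toy frame with a subset of the raw EPA columns
toy = pd.DataFrame({
    'Date': ['01/01/2021', '01/02/2021'],
    'Source': ['AQS', 'AQS'],
    'Daily Max 1-hour NO2 Concentration': [32, 32],
    'DAILY_AQI_VALUE': [30, 30],
    'Site Name': ['Chula Vista', 'Chula Vista'],
})
labels = {'Daily Max 1-hour NO2 Concentration': 'NO2',
          'DAILY_AQI_VALUE': 'AQI',
          'Site Name': 'Site'}

cleaned = clean_epa(toy, ['Source'], labels)
print(cleaned.dtypes)
```

With the real data, the call would look like `clean_epa(epa2020_df, same_value_list + not_used_list, updated_labels)`, applied to the raw frames as loaded from the CSV files.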
