# COGS 108 - Data Checkpoint

# Names

- Asher Av
- Quoc-Zuy  Do
- Hector Gallo
- Jeremy Nurding
- Andres Villegas

<a id='research_question'></a>
# Research Question

Is there a statistically significant relationship between COVID-19 cases and the levels of NO<sub>2</sub> in the atmosphere in San Diego County during the years 2020 and 2021?

# Dataset(s)


**Data Set Name: COVID-19 Data - US Counties from NYTimes**
- Link to Dataset: https://github.com/nytimes/covid-19-data
- Number of Observations: 2,170,941
- <ins>Description of Dataset:</ins> This dataset is compiled by the New York Times from official reports across the U.S. and records the cumulative number of cases and deaths reported in each county and state since the start of the COVID-19 pandemic. It contains 6 columns: date, county, state, fips, cases, and deaths. The fips column holds a FIPS code, a geographic identifier for the county each row describes, which makes it easy to join this dataset with other datasets.

**Data Set Name: Air Quality Across Countries in COVID-19**
- Link to Dataset: https://www.kaggle.com/aestheteaman01/air-quality-across-countries-in-covid19/version/3?select=USA.csv
- Number of Observations: 179,365
- <ins>Description of Dataset:</ins> This dataset contains air quality measurements from 2020-2021 for cities in several countries, including Brazil, Canada, France, Italy, India, and the USA. The parameters used to describe the air quality of each city are: Carbon Monoxide (CO), Dew (dew), Humidity (humidity), Nitrogen Dioxide (NO2), Ozone (O3), Particulate Matter 10 (pm10), Particulate Matter 2.5 (pm25), Pressure (pressure), Sulphur Dioxide (SO2), Temperature, Wind Gusts, and Wind Speed.

**Data Set Name: Air Quality Statistics by County, 2020**
- Link to Dataset: https://www.epa.gov/air-trends/air-quality-cities-and-counties 
- Number of Observations: 1,144
- <ins>Description of Dataset:</ins> The data set “Air Quality Statistics by County, 2020” presents the highest recorded levels of air pollutants for counties across the United States. Rows are organized alphabetically by state, and counties are listed alphabetically within each state. The file has 1,162 rows in total, of which 1,144 contain air pollutant observations for specific counties. It contains 13 columns, from left to right: State, County, County FIPS Code, 2010 population, CO, Pb, NO2 AM (ppb), NO2 hr (ppb), O3, PM10, PM2.5 Wtd AM, PM2.5 hr, and SO2. Each pollutant column shows the highest level of that pollutant in the given area.

# Setup

##### Import Modules

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# nytimes daily covid dataset
covid_df = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

# EPA 2020 NO2 dataset
epa2020_df = pd.read_csv('https://raw.githubusercontent.com/asherbav/covid_pollution_files/main/epa2020.csv')

# EPA 2021 NO2 dataset 
epa2021_df = pd.read_csv('https://raw.githubusercontent.com/asherbav/covid_pollution_files/main/epa2021.csv')


# Data Cleaning

---
### COVID-19 Dataset Cleaning
We start by looking at the original datasets to see how they are structured. The first is the COVID-19 dataset, "COVID-19 Data - US Counties from NYTimes".

In [None]:
covid_df

We then check for null values that we might want to remove, using `isnull().sum()`. There are null values in both the deaths column and the fips column.
We also check the column types to make sure they are in the desired format. The fips column is stored as floats instead of integers as expected. The date column is worth confirming as well, since `read_csv` does not parse dates into datetime objects unless asked to.

In [None]:
print(covid_df.isnull().sum())
print('-----------------') 
covid_df.dtypes
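
The dtype checks described above can be reproduced on a few made-up rows. This is only a sketch: the CSV text and the names `csv_text`, `toy_df`, and `toy_clean` are assumptions for illustration, not part of the NYTimes data. It shows that one missing FIPS value makes the whole column float, and that dates stay unparsed until `pd.to_datetime` is applied.

```python
import io

import pandas as pd

# Made-up rows shaped like the NYTimes county file (values are illustrative).
csv_text = """date,county,state,fips,cases,deaths
2020-03-01,San Diego,California,06073,1,0
2020-03-02,San Diego,California,06073,2,0
2020-03-03,Unknown,California,,5,
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

# Dtypes as read: one missing FIPS value makes the column float64,
# and the date column is not a datetime column yet.
raw_fips_dtype = toy_df['fips'].dtype
raw_date_dtype = toy_df['date'].dtype
print(toy_df.isnull().sum())

# Parse the dates explicitly.
toy_df['date'] = pd.to_datetime(toy_df['date'])

# Once rows with a missing FIPS code are gone, the cast to int is safe.
toy_clean = toy_df.dropna(subset=['fips']).copy()
toy_clean['fips'] = toy_clean['fips'].astype(int)
print(toy_clean.dtypes)
```

Passing `parse_dates=['date']` to `pd.read_csv` is another way to get datetime values at load time.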


However, we will not need the deaths column for our analysis, so we can drop it entirely. We do want to keep the fips data, since creating heatmaps of the COVID-19 distribution across regions will let us analyze which regions are most heavily impacted by the virus. This requires removing the entries that contain null values, using `dropna()` from pandas.

We also want to change the type of the FIPS codes to integers so they are in the expected format.

In [None]:
covid_df = covid_df.drop(['deaths'], axis = 1)
covid_df = covid_df.dropna()
covid_df['fips'] = covid_df['fips'].astype(int)


At this point we are done cleaning the COVID-19 dataset. We run `isnull()` again to double-check that all the NULL/NaN values have been removed from our dataset. We also display the dataset and can see that the FIPS codes are now integers, as expected. The COVID-19 dataset is now in a state where it can be used in our project.

In [None]:
print(covid_df.isnull().sum())
print('-----------------') 
covid_df
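
Because the research question concerns San Diego County and the cases column is cumulative, a later step will likely need to filter by FIPS code and turn running totals into daily counts. The sketch below shows one way to do that on made-up numbers. `SAN_DIEGO_FIPS`, `toy_covid`, `sd_df`, and `new_cases` are hypothetical names, and the values are not real counts.

```python
import pandas as pd

# FIPS 06073 = California (06) + San Diego County (073); stored as the int 6073.
SAN_DIEGO_FIPS = 6073

# Made-up cumulative counts for two counties (illustrative only).
toy_covid = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-10', '2020-03-10', '2020-03-11',
                            '2020-03-11', '2020-03-12']),
    'county': ['San Diego', 'Los Angeles', 'San Diego',
               'Los Angeles', 'San Diego'],
    'state': ['California'] * 5,
    'fips': [6073, 6037, 6073, 6037, 6073],
    'cases': [2, 20, 5, 27, 9],
})

# Keep only San Diego County rows, in date order.
sd_df = (toy_covid[toy_covid['fips'] == SAN_DIEGO_FIPS]
         .sort_values('date')
         .reset_index(drop=True))

# 'cases' is a running total, so the day-over-day difference gives new cases.
# The first day has no previous row, so it keeps its cumulative value.
sd_df['new_cases'] = sd_df['cases'].diff().fillna(sd_df['cases']).astype(int)
print(sd_df[['date', 'cases', 'new_cases']])
```

If a cumulative total is ever revised downward, the difference for that day would be negative, so the derived column may need a sanity check before analysis.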

---
### Kaggle Air Quality Data Set Cleaning
We repeat the steps used for the COVID dataset above on the Kaggle air quality dataset. We display the dataset to get an idea of what it looks like, search for any NULL/NaN values, and check the types of all the columns to make sure everything is in order.

In [None]:
#print(kagaq_df.isnull().sum())
print('-----------------') 
#print(kagaq_df.dtypes)
print('-----------------') 
#print(kagaq_df['Country'].value_counts())
#kagaq_df

From the above analysis we can see that there are no null values and that the data type of each column is as expected. This dataset is already clean, so there is little left to do. However, it has one extraneous column, "Country". This particular file only contains data for the United States of America, so the column provides no additional information and we can safely drop it.

In [None]:
#kagaq_df = kagaq_df.drop(['Country'], axis = 1)

Next we redisplay the data set and confirm that the column was removed properly.

In [None]:
#kagaq_df

---
### EPA 2020 Air Quality Data Set Cleaning
First we display the data set to get an idea of what we're looking at.

In [None]:
epa2020_df

After looking at the unique values of each column, we noticed that we can safely drop the 'COUNTY_CODE', 'COUNTY', 'STATE', 'STATE_CODE', 'CBSA_NAME', 'CBSA_CODE', 'AQS_PARAMETER_DESC', 'AQS_PARAMETER_CODE', 'UNITS', and 'Source' columns, as each of them holds a single value across the entire dataset.
In summary, the dataset describes the following: the county is 'San Diego' (with its associated county code), the CBSA is 'San Diego-Carlsbad, CA', the air quality parameter is NO<sub>2</sub>, the unit is ppb (parts per billion), and the source is 'AQS'.

##### Unique values from columns

- County codes: all the same
- County: all the same
- State: all the same
- CBSA name: all the same
- CBSA code: all the same
- AQS_PARAMETER_DESC: all the same
- AQS_PARAMETER_CODE: all the same
- UNITS: all the same, ppb (parts per billion)
- Source: all the same, AQS
- PERCENT_COMPLETE = [92., 100., 83., 88., 75., 79., 96.]
- Site name = ['Chula Vista', 'Alpine', 'Camp Pendleton', 'Donovan', 'Kearny Villa Rd.', 'San Diego -Rancho Carmel Drive', 'El Cajon - Lexington Elementary School', 'San Diego - Sherman Elementary School']
- DAILY_OBS_COUNT = [22, 24, 20, 21, 18, 19, 23]
- DAILY AQI VALUE = [17, 33, 34, 29, 38, 35, 28, 24, 30, 26, 21, 27, 36, 31, 25, 32, 15, 23, 40, 16, 13, 39, 8, 22, 11, 12, 18, 7, 19, 20, 10, 14, 6, 4, 9, 3, 5, 2, 41, 37, 42, 1, 53, 43, 55, 44, 45, 46, 47, 52, 48, 49, 51, 50]
- Daily Max 1-hour NO2 Concentration = [18, 35, 36, 31, 40, 37, 30, 25, 32, 28, 22, 29, 38, 33, 26, 34, 16, 24, 42, 17, 14, 41, 27, 8, 23, 12, 13, 19, 7, 20, 21, 11, 15, 9, 6, 4, 10, 3, 5, 2, 43, 39, 45, 44, 1, 56, 46, 58, 47, 48, 49, 50, 55, 51, 52, 54, 53]
- POC = [1, 2]
- Site ID = [60730001, 60731006, 60731008, 60731014, 60731016, 60731017, 60731022, 60731026]
- Dates range from 1/01/2020 to 12/31/2020
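
Single-valued columns like the ones listed above can also be found programmatically instead of by inspecting each column by hand. The sketch below uses `DataFrame.nunique` on a small made-up frame. `toy_epa`, `constant_cols`, and `toy_epa_trimmed` are hypothetical names, and the rows are not real measurements.

```python
import pandas as pd

# Made-up frame with a few columns named like those in the EPA file.
toy_epa = pd.DataFrame({
    'Date': ['01/01/2020', '01/01/2020', '01/02/2020'],
    'Site ID': [60730001, 60731006, 60730001],
    'Daily Max 1-hour NO2 Concentration': [18, 35, 36],
    'UNITS': ['ppb', 'ppb', 'ppb'],
    'COUNTY': ['San Diego', 'San Diego', 'San Diego'],
    'Source': ['AQS', 'AQS', 'AQS'],
})

# Count distinct values per column; NaN counts as its own value here.
unique_counts = toy_epa.nunique(dropna=False)

# Columns with exactly one distinct value carry no information.
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)

# Drop them in a single call.
toy_epa_trimmed = toy_epa.drop(columns=constant_cols)
print(toy_epa_trimmed.columns.tolist())
```

Keeping an explicit list of dropped columns, as the notebook does, is still useful as a record of what was removed.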

In [None]:
# Columns that hold a single value across the whole dataset
constant_cols = ['COUNTY', 'COUNTY_CODE', 'STATE', 'STATE_CODE',
                 'CBSA_NAME', 'CBSA_CODE', 'AQS_PARAMETER_DESC',
                 'AQS_PARAMETER_CODE', 'UNITS', 'Source']

# Drop them all in one call
epa2020_df = epa2020_df.drop(columns=constant_cols)


In [None]:
epa2020_df

In [None]:
epa2020_df.isnull().sum()

There are no null values in the dataset, so we do not have any rows to remove.

The 2021 dataset is in a very similar situation, so we drop the same columns from it as well.

In [None]:
# Same single-valued columns as in the 2020 dataset
constant_cols = ['COUNTY', 'COUNTY_CODE', 'STATE', 'STATE_CODE',
                 'CBSA_NAME', 'CBSA_CODE', 'AQS_PARAMETER_DESC',
                 'AQS_PARAMETER_CODE', 'UNITS', 'Source']

# Drop them all in one call
epa2021_df = epa2021_df.drop(columns=constant_cols)

In [None]:
epa2021_df
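
With both EPA years cleaned, one possible way to line them up with the COVID-19 data is to stack the two years, reduce the per-site readings to one value per day, and join on the date. The sketch below illustrates that idea on made-up numbers. The column names 'Date' and 'Daily Max 1-hour NO2 Concentration' are taken from the unique-values listing, while the month/day/year date format, the choice to average across sites, and all variable names are assumptions rather than the notebook's actual method.

```python
import pandas as pd

# Made-up per-site NO2 readings for each year (illustrative only).
toy_2020 = pd.DataFrame({
    'Date': ['12/30/2020', '12/30/2020', '12/31/2020'],
    'Site ID': [60730001, 60731006, 60730001],
    'Daily Max 1-hour NO2 Concentration': [20, 30, 24],
})
toy_2021 = pd.DataFrame({
    'Date': ['01/01/2021', '01/01/2021'],
    'Site ID': [60730001, 60731006],
    'Daily Max 1-hour NO2 Concentration': [10, 16],
})

# Made-up cumulative case counts for the same county.
toy_cases = pd.DataFrame({
    'date': pd.to_datetime(['2020-12-30', '2020-12-31', '2021-01-01']),
    'cases': [100, 110, 125],
})

no2_col = 'Daily Max 1-hour NO2 Concentration'

# Stack the two years and parse the dates (format is an assumption).
epa_all = pd.concat([toy_2020, toy_2021], ignore_index=True)
epa_all['date'] = pd.to_datetime(epa_all['Date'], format='%m/%d/%Y')

# One value per day: the mean across monitoring sites.
daily_no2 = (epa_all.groupby('date', as_index=False)[no2_col]
             .mean()
             .rename(columns={no2_col: 'mean_no2_ppb'}))

# Keep only dates present in both tables.
merged = toy_cases.merge(daily_no2, on='date', how='inner')
print(merged)
```

Other daily summaries, such as the maximum across sites, would be equally easy to swap in at the `groupby` step.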

In [None]:

#new_header = epa_df.iloc[2] #grab the first row for the header
#epa_df = epa_df[3:] #take the data less the header row
#epa_df.columns = new_header #set the header row as the df header
#epa_df.head()