### 90803 Data Cleaning and Question Definition
### Data Cleaning: COVID-19 Datasets

**Team 14**

Chi-Shiun Tsai & Colton Lapp

This notebook cleans the county-level COVID-19 dataset published by The New York Times.

### 0. Importing libraries

In [1]:
import numpy as np
import pandas as pd
from datetime import datetime
from census import Census
from us import states

### 1. Reading datasets
We tried to read the file directly from Google Drive, but this did not work reliably. To recreate this notebook from scratch, you may have to download the file manually from the link in the cell below and save it as `../data/COVID-19 Cases/us-counties-2020.csv`.

In [2]:
# COVID-19 data

# Try to read in from the Google Drive link
try:
    covid_20 = pd.read_csv('https://drive.google.com/uc?export=download&id=1anYHHwpp1gISwfRaIBu3xgZmoIeGd9aD', sep=',', lineterminator='\n', dtype={'fips': str})
except Exception:
    print("Could not read covid data from google drive link. Please manually download from this link:\n \
            https://drive.google.com/uc?export=download&id=1anYHHwpp1gISwfRaIBu3xgZmoIeGd9aD \n \
            and read in below manually")
    
# Fall back to a local copy if the download above failed
if 'covid_20' not in locals():
    try:
        covid_20 = pd.read_csv('../data/COVID-19 Cases/us-counties-2020.csv', dtype={'fips': str})
    except Exception:
        print("Please download file in correct directory")

Could not read covid data from google drive link. Please manually download from this link:
             https://drive.google.com/uc?export=download&id=1anYHHwpp1gISwfRaIBu3xgZmoIeGd9aD 
             and read in below manually
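
The fallback logic above can also be expressed as a small helper that tries each source in order. This is only a sketch, not the loader the notebook uses: the name `read_first_available` is hypothetical, and it assumes nothing beyond `pd.read_csv` accepting either a URL or a local path.

```python
import pandas as pd


def read_first_available(sources, **read_csv_kwargs):
    """Return the first CSV in `sources` that loads successfully.

    `sources` is an ordered list of URLs or file paths (hypothetical helper).
    """
    errors = {}
    for source in sources:
        try:
            return pd.read_csv(source, **read_csv_kwargs)
        except Exception as exc:  # network errors, missing files, parse errors
            errors[source] = exc
    raise FileNotFoundError(f"No source could be read: {errors}")
```

Passing `dtype={'fips': str}` through `read_csv_kwargs` would keep the leading zeros of the FIPS codes, as in the cell above.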


#### Fetch county-level population data from the Census API to merge with the COVID data

In [3]:
# Population data

# Initialize Census object using census API key
census_object = Census("bf6690c63fb4bbd43ccae839241e9be45bfc0881")

# 2020 population data
pop_2020 = census_object.acs5.state_county(fields = ('NAME', 'B01003_001E'),
                                      state_fips = "*",
                                      county_fips = "*",
                                      year = 2020)
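
Keeping the API key out of the notebook avoids publishing it together with the code. A minimal sketch, assuming the key is stored in an environment variable called `CENSUS_API_KEY` (a name chosen here purely for illustration):

```python
import os


def get_census_key(var_name="CENSUS_API_KEY"):
    """Read the Census API key from the environment.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Usage (not run here): census_object = Census(get_census_key())
```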

In [4]:
# Convert the API response to a DataFrame and display it
pop_2020_df = pd.DataFrame(pop_2020)
pop_2020_df

Unnamed: 0,NAME,B01003_001E,state,county
0,"Autauga County, Alabama",55639.0,01,001
1,"Baldwin County, Alabama",218289.0,01,003
2,"Barbour County, Alabama",25026.0,01,005
3,"Bibb County, Alabama",22374.0,01,007
4,"Blount County, Alabama",57755.0,01,009
...,...,...,...,...
3216,"Renville County, Minnesota",14572.0,27,129
3217,"Roseau County, Minnesota",15259.0,27,135
3218,"Sherburne County, Minnesota",96015.0,27,141
3219,"Steele County, Minnesota",36710.0,27,147


#### Create county fips code by combining state and county fips codes

In [5]:
pop_2020_df['fips'] = pop_2020_df['state'] + pop_2020_df['county']
pop_2020_df

Unnamed: 0,NAME,B01003_001E,state,county,fips
0,"Autauga County, Alabama",55639.0,01,001,01001
1,"Baldwin County, Alabama",218289.0,01,003,01003
2,"Barbour County, Alabama",25026.0,01,005,01005
3,"Bibb County, Alabama",22374.0,01,007,01007
4,"Blount County, Alabama",57755.0,01,009,01009
...,...,...,...,...,...
3216,"Renville County, Minnesota",14572.0,27,129,27129
3217,"Roseau County, Minnesota",15259.0,27,135,27135
3218,"Sherburne County, Minnesota",96015.0,27,141,27141
3219,"Steele County, Minnesota",36710.0,27,147,27147
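
String concatenation only yields a valid five-digit code when both parts keep their leading zeros. If the state and county codes were to arrive as integers instead (an assumption for this sketch; the Census output above already uses strings), they could be padded with `str.zfill` first. The sample values are taken from the table above.

```python
import pandas as pd

# Toy frame with integer codes, as they might look after a CSV round trip
codes = pd.DataFrame({"state": [1, 27], "county": [1, 129]})

# Pad state to 2 digits and county to 3 digits, then concatenate
codes["fips"] = (
    codes["state"].astype(str).str.zfill(2)
    + codes["county"].astype(str).str.zfill(3)
)
print(codes["fips"].tolist())  # ['01001', '27129']
```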


Our analysis is at the yearly level, so for each county we keep only the most recent 2020 record, i.e. the cumulative number of cases and deaths as of the last day of 2020.

In [6]:
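def keep_last(df):
    # Note: the in-place operations below also modify the dataframe passed in
    df['date'] = pd.to_datetime(df['date'])
    # Newest records first, so the first entry per group is the latest one
    df.sort_values('date', inplace=True, ascending=False)
    df = df.groupby(by=['county', 'state'], as_index=False).first()
    # Keep only the year of the latest record
    df['year'] = df['date'].dt.year
    df.drop(columns=['date'], inplace=True)
    return df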

covid = keep_last(covid_20)
covid['fips'] = covid['fips'].astype(str)
covid

Unnamed: 0,county,state,fips,cases,deaths,year
0,Abbeville,South Carolina,45001,1275,25.0,2020
1,Acadia,Louisiana,22001,5082,153.0,2020
2,Accomack,Virginia,51001,1698,27.0,2020
3,Ada,Idaho,16001,38417,355.0,2020
4,Adair,Iowa,19001,606,17.0,2020
...,...,...,...,...,...,...
3268,Yuma,Arizona,04027,27366,519.0,2020
3269,Yuma,Colorado,08125,549,13.0,2020
3270,Zapata,Texas,48505,958,10.0,2020
3271,Zavala,Texas,48507,944,25.0,2020
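
One subtlety: `groupby(...).first()` returns the first non-null value in each column by default, so a value missing on the latest date would be taken from an earlier date. The sketch below assumes that the literal last row per county is wanted; it selects whole rows and leaves the input frame unmodified. The function name `keep_last_row` and the toy data are illustrative only.

```python
import pandas as pd


def keep_last_row(df, keys=("county", "state")):
    """Return the most recent row per group without mutating `df`."""
    out = df.assign(date=pd.to_datetime(df["date"]))
    # Oldest first, so keep="last" retains the newest row of each group
    out = out.sort_values("date", kind="stable")
    out = out.drop_duplicates(subset=list(keys), keep="last")
    out = out.assign(year=out["date"].dt.year).drop(columns="date")
    return out.reset_index(drop=True)


toy = pd.DataFrame({
    "date": ["2020-12-30", "2020-12-31", "2020-12-31"],
    "county": ["Ada", "Ada", "Yuma"],
    "state": ["Idaho", "Idaho", "Arizona"],
    "cases": [100, 110, 50],
})
latest = keep_last_row(toy)
```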


### 2. Data cleaning

In [7]:
# Check for missing values
covid.isnull().sum(axis=0)

county     0
state      0
fips       0
cases      0
deaths    78
year       0
dtype: int64

In [8]:
covid[covid['deaths'].isnull()]['state'].unique()

array(['Puerto Rico'], dtype=object)

In this dataset, no death counts are recorded for Puerto Rico. Since our analysis focuses on the contiguous United States, we drop the Puerto Rico rows.

In [9]:
covid.drop(covid[covid['deaths'].isnull()].index, inplace=True)
covid

Unnamed: 0,county,state,fips,cases,deaths,year
0,Abbeville,South Carolina,45001,1275,25.0,2020
1,Acadia,Louisiana,22001,5082,153.0,2020
2,Accomack,Virginia,51001,1698,27.0,2020
3,Ada,Idaho,16001,38417,355.0,2020
4,Adair,Iowa,19001,606,17.0,2020
...,...,...,...,...,...,...
3268,Yuma,Arizona,04027,27366,519.0,2020
3269,Yuma,Colorado,08125,549,13.0,2020
3270,Zapata,Texas,48505,958,10.0,2020
3271,Zavala,Texas,48507,944,25.0,2020
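
An equivalent and slightly more direct way to drop rows with a missing value in one column is `dropna(subset=...)`. A minimal sketch on a toy frame (the rows are illustrative stand-ins, not a slice of the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["Ada", "Adjuntas", "Yuma"],
    "state": ["Idaho", "Puerto Rico", "Arizona"],
    "deaths": [355.0, None, 519.0],
})

# Which states have missing death counts, and in how many rows?
missing_by_state = toy[toy["deaths"].isna()].groupby("state").size()

# Keep only rows where `deaths` is present; this returns a new frame
toy_clean = toy.dropna(subset=["deaths"])
```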


In [10]:
covid[covid['fips'] == 'None']['county'].unique()

array(['Joplin', 'Kansas City', 'New York City', 'Unknown'], dtype=object)

In [11]:
# Join two dataframes
covid_df = covid.merge(pop_2020_df, how='left', on='fips')
covid_df

Unnamed: 0,county_x,state_x,fips,cases,deaths,year,NAME,B01003_001E,state_y,county_y
0,Abbeville,South Carolina,45001,1275,25.0,2020,"Abbeville County, South Carolina",24582.0,45,001
1,Acadia,Louisiana,22001,5082,153.0,2020,"Acadia Parish, Louisiana",62371.0,22,001
2,Accomack,Virginia,51001,1698,27.0,2020,"Accomack County, Virginia",32560.0,51,001
3,Ada,Idaho,16001,38417,355.0,2020,"Ada County, Idaho",469473.0,16,001
4,Adair,Iowa,19001,606,17.0,2020,"Adair County, Iowa",7048.0,19,001
...,...,...,...,...,...,...,...,...,...,...
3190,Yuma,Arizona,04027,27366,519.0,2020,"Yuma County, Arizona",211931.0,04,027
3191,Yuma,Colorado,08125,549,13.0,2020,"Yuma County, Colorado",10013.0,08,125
3192,Zapata,Texas,48505,958,10.0,2020,"Zapata County, Texas",14243.0,48,505
3193,Zavala,Texas,48507,944,25.0,2020,"Zavala County, Texas",11930.0,48,507
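
With a left join, rows that find no match in the population table are kept with missing values rather than dropped. The sketch below shows one way such rows could be surfaced, using the `indicator` and `validate` parameters of `DataFrame.merge`. The two small frames are stand-ins for `covid` and `pop_2020_df`.

```python
import pandas as pd

cases = pd.DataFrame({
    "county": ["Ada", "Joplin"],
    "fips": ["16001", "None"],
    "cases": [38417, 3975],
})
pop = pd.DataFrame({
    "fips": ["16001", "04027"],
    "B01003_001E": [469473.0, 211931.0],
})

# indicator=True adds a `_merge` column; validate checks that fips is unique in `pop`
merged = cases.merge(pop, how="left", on="fips",
                     indicator=True, validate="many_to_one")
unmatched = merged.loc[merged["_merge"] == "left_only", "county"].tolist()
```

Here `unmatched` lists the counties whose population is still missing after the join.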


#### Investigate counties without fips codes

In [12]:
covid_df[covid_df['fips'] == 'None']['county_x'].unique()

array(['Joplin', 'Kansas City', 'New York City', 'Unknown'], dtype=object)

In [13]:
# Drop unknown counties
covid_df.drop(covid_df[covid_df['county_x'] == 'Unknown'].index, inplace=True)

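#### Fill in population and county code for MSAs with a missing code
Joplin, Kansas City, and New York City have no FIPS code, so the merge left their population missing. We first fill in their 2020 populations by hand; further below, we assign each of them the FIPS code of a county within the MSA.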

In [14]:
# Joplin population 2020: 51703
# Kansas city: 507932
# New York city: 8804190

# Fill in missing population values
covid_df.loc[covid_df['county_x'] == 'Joplin', 'B01003_001E'] = 51703
covid_df.loc[covid_df['county_x'] == 'Kansas City', 'B01003_001E'] = 507932
covid_df.loc[covid_df['county_x'] == 'New York City', 'B01003_001E'] = 8804190

In [15]:
# Check
covid_df[covid_df['county_x'] == 'Joplin']

Unnamed: 0,county_x,state_x,fips,cases,deaths,year,NAME,B01003_001E,state_y,county_y
1453,Joplin,Missouri,,3975,74.0,2020,,51703.0,,
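
The three assignments can also be driven by a single mapping, which keeps the hard-coded figures in one place. A sketch on a toy frame, using the population values from the cell above; the dictionary name `city_population` is hypothetical.

```python
import pandas as pd

city_population = {"Joplin": 51703, "Kansas City": 507932, "New York City": 8804190}

toy = pd.DataFrame({
    "county_x": ["Joplin", "Ada", "New York City"],
    "B01003_001E": [None, 469473.0, None],
})

# Fill only missing populations; existing values are left unchanged
toy["B01003_001E"] = toy["B01003_001E"].fillna(toy["county_x"].map(city_population))
```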


#### Calculate cases and deaths per capita

In [16]:
covid_df['cases'] = covid_df['cases']/covid_df['B01003_001E']
covid_df['deaths'] = covid_df['deaths']/covid_df['B01003_001E']

In [17]:
covid_df

Unnamed: 0,county_x,state_x,fips,cases,deaths,year,NAME,B01003_001E,state_y,county_y
0,Abbeville,South Carolina,45001,0.051867,0.001017,2020,"Abbeville County, South Carolina",24582.0,45,001
1,Acadia,Louisiana,22001,0.081480,0.002453,2020,"Acadia Parish, Louisiana",62371.0,22,001
2,Accomack,Virginia,51001,0.052150,0.000829,2020,"Accomack County, Virginia",32560.0,51,001
3,Ada,Idaho,16001,0.081830,0.000756,2020,"Ada County, Idaho",469473.0,16,001
4,Adair,Iowa,19001,0.085982,0.002412,2020,"Adair County, Iowa",7048.0,19,001
...,...,...,...,...,...,...,...,...,...,...
3190,Yuma,Arizona,04027,0.129127,0.002449,2020,"Yuma County, Arizona",211931.0,04,027
3191,Yuma,Colorado,08125,0.054829,0.001298,2020,"Yuma County, Colorado",10013.0,08,125
3192,Zapata,Texas,48505,0.067261,0.000702,2020,"Zapata County, Texas",14243.0,48,505
3193,Zavala,Texas,48507,0.079128,0.002096,2020,"Zavala County, Texas",11930.0,48,507
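
Because the cell above overwrites `cases` and `deaths`, the raw counts are no longer available afterwards. One alternative, sketched below, writes the rates to new columns. The column names `cases_per_100k` and `deaths_per_100k` are hypothetical, and the notebook itself keeps plain per-capita fractions. The sample row is Abbeville, South Carolina, from the tables above.

```python
import pandas as pd

toy = pd.DataFrame({"cases": [1275], "deaths": [25.0], "B01003_001E": [24582.0]})

# Rates per 100,000 residents, stored next to the raw counts
toy["cases_per_100k"] = toy["cases"] / toy["B01003_001E"] * 100_000
toy["deaths_per_100k"] = toy["deaths"] / toy["B01003_001E"] * 100_000
```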


In [18]:
covid_df[covid_df['fips'] == 'None']

Unnamed: 0,county_x,state_x,fips,cases,deaths,year,NAME,B01003_001E,state_y,county_y
1453,Joplin,Missouri,,0.076881,0.001431,2020,,51703.0,,
1469,Kansas City,Missouri,,0.05914,0.000693,2020,,507932.0,,
2061,New York City,New York,,0.048938,0.002856,2020,,8804190.0,,


#### Fill in missing FIPS values

In [19]:
# Jasper County, Missouri: 29097
# Jackson County, Missouri: 29095
# New York County, New York: 36061

# Assign as strings so the fips column keeps a single type
covid_df.loc[covid_df['county_x'] == 'Joplin', 'fips'] = '29097'
covid_df.loc[covid_df['county_x'] == 'Kansas City', 'fips'] = '29095'
covid_df.loc[covid_df['county_x'] == 'New York City', 'fips'] = '36061'
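
The other entries in the `fips` column are five-character strings, so values assigned by hand should match that format. A sketch of a check that flags anything else, run here on a toy series (the values are illustrative):

```python
import pandas as pd

fips = pd.Series(["45001", "04027", 29095, "None"])

# A valid county FIPS here is a str of exactly five digits
is_valid = fips.map(lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit())
invalid_entries = fips[~is_valid].tolist()
```

In this toy series the integer `29095` and the placeholder `'None'` are flagged, while the two string codes pass.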

In [20]:
# Keep only the columns needed for the final dataframe and restore their names
covid_df_cleaned = covid_df[['county_x', 'state_x', 'fips', 'cases', 'deaths', 'year']].rename(
    columns={'county_x': 'county', 'state_x': 'state'})
covid_df_cleaned

Unnamed: 0,county,state,fips,cases,deaths,year
0,Abbeville,South Carolina,45001,0.051867,0.001017,2020
1,Acadia,Louisiana,22001,0.081480,0.002453,2020
2,Accomack,Virginia,51001,0.052150,0.000829,2020
3,Ada,Idaho,16001,0.081830,0.000756,2020
4,Adair,Iowa,19001,0.085982,0.002412,2020
...,...,...,...,...,...,...
3190,Yuma,Arizona,04027,0.129127,0.002449,2020
3191,Yuma,Colorado,08125,0.054829,0.001298,2020
3192,Zapata,Texas,48505,0.067261,0.000702,2020
3193,Zavala,Texas,48507,0.079128,0.002096,2020


### 3. Saving cleaned dataset

In [21]:
covid_df_cleaned.to_csv('../data/data_cleaned/covid.csv', index=False)
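
When the saved file is read back, `fips` should be parsed as a string, as in the `read_csv` call at the top of this notebook. Otherwise codes such as `04027` lose their leading zero. A small round-trip sketch using an in-memory buffer instead of the real path:

```python
import io
import pandas as pd

df = pd.DataFrame({"county": ["Yuma"], "fips": ["04027"]})
buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
as_int = pd.read_csv(buffer)                       # fips inferred as an integer
buffer.seek(0)
as_str = pd.read_csv(buffer, dtype={"fips": str})  # leading zero preserved
```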

### References

* Data source: https://github.com/nytimes/covid-19-data
* https://pandas.pydata.org/docs/reference/api/pandas.isnull.html