# Collecting Covid Data
Here we collect the most recent covid data, combine it with the latest enrollment data from our camper database, and export the results for exploration and visualization.

1. Import and prepare latest county-level data from *The New York Times*
2. Import and prepare latest state-level data from *The Atlantic* covid tracker API
3. Query latest enrolled camper data from our existing database
4. Join datasets 
5. Export to GitHub


__Note on sources:__ 
We use data compiled by news organizations to build the most reliable, up-to-date, and expansive picture of covid levels in each county and state. These sources also give us data that has already been prepared and screened by experts.

Why include state data and not simply aggregate counties? We want to add a crucial metric to our dashboard later that is only present in the state-level data: the total number of tests in each state and the percent that come back positive.

In [1]:
import pandas as pd
import requests
import io
from sqlalchemy import create_engine, MetaData, Table

## 1. Prepare County Data

Source: https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv

In [2]:
nyt_covid_url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'

download_nyt = requests.get(nyt_covid_url).content

# The date column is converted explicitly in a later cell, so no date parsing is requested here
nyt_covid_df = pd.read_csv(io.StringIO(download_nyt.decode('utf-8')))

print(nyt_covid_df.head())

         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1       0
1  2020-01-22  Snohomish  Washington  53061.0      1       0
2  2020-01-23  Snohomish  Washington  53061.0      1       0
3  2020-01-24       Cook    Illinois  17031.0      1       0
4  2020-01-24  Snohomish  Washington  53061.0      1       0


To match our FIPS data, we need to remove the trailing '.0' from the fips column. (When we tried loading fips as an integer, Python threw an error, so we loaded it with the default data type to see what was going on.)

First, let's check null values, columns, and data types. 

In [3]:
print(nyt_covid_df.shape)
print('======')
print(nyt_covid_df.dtypes)
print('======')
print(nyt_covid_df.isna().sum())

(677122, 6)
date       object
county     object
state      object
fips      float64
cases       int64
deaths      int64
dtype: object
date         0
county       0
state        0
fips      6488
cases        0
deaths       0
dtype: int64
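
The float64 type on fips is most likely a side effect of the missing values: a regular integer column cannot hold NaN, so pandas falls back to float. As a minimal sketch on made-up rows (not the real file), pandas' nullable `Int64` dtype is one way such a column could be loaded as integers directly:

```python
import io
import pandas as pd

# Made-up sample mimicking the NYT layout; the second row has no fips value.
sample_csv = (
    "date,county,state,fips,cases,deaths\n"
    "2020-01-21,Snohomish,Washington,53061,1,0\n"
    "2020-03-01,Unknown,Rhode Island,,2,0\n"
)

# Default parsing: the missing value forces the whole column to float64.
default_df = pd.read_csv(io.StringIO(sample_csv))
print(default_df['fips'].dtype)

# Nullable integer dtype: the missing value becomes <NA>, the rest stay integers.
nullable_df = pd.read_csv(io.StringIO(sample_csv), dtype={'fips': 'Int64'})
print(nullable_df['fips'].dtype)
```

We stick with the default load in this notebook, since the rows without fips get dropped anyway.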


Since our goal is to match our camper zip codes with this data, rows without a fips value can't be matched to any county and are of no use to us. Let's drop them from the dataset.

In [4]:
nyt_covid_df.dropna(inplace=True)

print(nyt_covid_df.isna().sum())

date      0
county    0
state     0
fips      0
cases     0
deaths    0
dtype: int64


We do care, however, about the trailing '.0' at the end of the fips. To remove them, we need to temporarily convert the data type to string.

In [5]:
nyt_covid_df['fips'] = nyt_covid_df['fips'].astype('str')

print(nyt_covid_df.dtypes)

date      object
county    object
state     object
fips      object
cases      int64
deaths     int64
dtype: object


Remove the trailing '.0', convert fips to integer, and convert date to datetime.

In [6]:
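# regex=True is required in recent pandas versions; the $ anchor strips only a trailing '.0'
nyt_covid_df['fips'] = nyt_covid_df['fips'].str.replace(r'\.0$', '', regex=True)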
nyt_covid_df['fips'] = nyt_covid_df['fips'].astype('int')

nyt_covid_df['date'] = pd.to_datetime(nyt_covid_df['date'])

print(nyt_covid_df.head())
print('============')
print(nyt_covid_df.dtypes)

        date     county       state   fips  cases  deaths
0 2020-01-21  Snohomish  Washington  53061      1       0
1 2020-01-22  Snohomish  Washington  53061      1       0
2 2020-01-23  Snohomish  Washington  53061      1       0
3 2020-01-24       Cook    Illinois  17031      1       0
4 2020-01-24  Snohomish  Washington  53061      1       0
date      datetime64[ns]
county            object
state             object
fips               int32
cases              int64
deaths             int64
dtype: object


We only care about the most recent case data, so let's remove as much unneeded data as possible. We first sort by date, most recent first. Then we drop duplicate fips values, keeping the first (most recent) row for each county.

In [7]:
nyt_sorted = nyt_covid_df.sort_values('date', ascending=False)

print(nyt_sorted.head())

             date      county     state   fips  cases  deaths
677121 2020-10-28      Weston   Wyoming  56045    148       0
674955 2020-10-28       Mason  Kentucky  21161    176       2
674965 2020-10-28      Morgan  Kentucky  21175    133       0
674964 2020-10-28  Montgomery  Kentucky  21173    498       4
674963 2020-10-28      Monroe  Kentucky  21171    298       4


In [8]:
print(nyt_sorted.shape)

nyt_truncated = nyt_sorted.drop_duplicates(['fips'], keep='first')

print(nyt_truncated.shape)

(670634, 6)
(3215, 6)
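
To see why sorting and then dropping duplicates leaves exactly one row per county, here is a small sketch on invented numbers. The `groupby` version is only a cross-check we add for illustration; it is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data: two counties with several report dates each (case counts are invented).
toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-10-26', '2020-10-27', '2020-10-28',
                            '2020-10-27', '2020-10-28']),
    'fips': [53061, 53061, 53061, 17031, 17031],
    'cases': [10, 12, 15, 100, 130],
})

# Newest rows first, then keep the first row seen for each fips.
latest = (toy.sort_values('date', ascending=False)
             .drop_duplicates(['fips'], keep='first'))

# Cross-check: select the row holding the maximum date within each fips group.
latest_alt = toy.loc[toy.groupby('fips')['date'].idxmax()]

print(latest.sort_values('fips'))
print(latest_alt)
```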


Marvelous. It looks like our county-level COVID data is ready to integrate with our camper data.

## 2. Prepare State Data

Source: https://covidtracking.com/data/api

First let's download the data and check the output.

In [9]:
# Define request
atlantic_api_url = 'https://api.covidtracking.com'
query_string = '/v1/states/current.csv'  # we're requesting current data from all states in csv format
columns_to_fetch = ['date', 'state', 'positive', 'negative', 'positiveIncrease']

# Perform request
atlantic_request_url = atlantic_api_url + query_string
download_atlantic = requests.get(atlantic_request_url).content
state_covid_df = pd.read_csv(io.StringIO(download_atlantic.decode('utf-8')), usecols=columns_to_fetch)

# Check columns and shape
print(state_covid_df.columns)
print(state_covid_df.shape)

Index(['date', 'state', 'positive', 'negative', 'positiveIncrease'], dtype='object')
(56, 5)


In [10]:
print(state_covid_df.head())

       date state  positive  negative  positiveIncrease
0  20201028    AK     15155    565444               357
1  20201028    AL    187706   1159926              1269
2  20201028    AR    108640   1212191               961
3  20201028    AS         0      1616                 0
4  20201028    AZ    241165   1495814              1043


In [11]:
state_covid_df['date'] = pd.to_datetime(state_covid_df['date'], format='%Y%m%d')

print(state_covid_df['date'].dtypes)
print(state_covid_df.head())

datetime64[ns]
        date state  positive  negative  positiveIncrease
0 2020-10-28    AK     15155    565444               357
1 2020-10-28    AL    187706   1159926              1269
2 2020-10-28    AR    108640   1212191               961
3 2020-10-28    AS         0      1616                 0
4 2020-10-28    AZ    241165   1495814              1043
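
The explicit `format` in the cell above matters because the API delivers dates as plain integers like 20201028. A short sketch of the difference, where casting to string first is our own precaution rather than something the cell above does:

```python
import pandas as pd

raw_dates = pd.Series([20201028, 20201027])  # integers, as they arrive in the csv

# With an explicit format the digits are read as year, month, day.
parsed = pd.to_datetime(raw_dates.astype(str), format='%Y%m%d')

# Without a format, a bare integer is treated as nanoseconds since 1970-01-01.
misread = pd.to_datetime(20201028)

print(parsed[0])
print(misread)
```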


### 2a. Add state populations
We preformatted this data to match our needs here, but it comes from the U.S. Census Bureau.

Source: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html

In [12]:
state_pop_url = 'https://raw.githubusercontent.com/amcgaha/camp-community-covid-levels/main/state_pop_2019_census_bureau.csv'

download_pop = requests.get(state_pop_url).content

state_pop_df = pd.read_csv(io.StringIO(download_pop.decode('utf-8')))

print(state_pop_df.head())

  state_long_name state  pop_2019
0         Alabama    AL   4903185
1          Alaska    AK    731545
2         Arizona    AZ   7278717
3        Arkansas    AR   3017804
4      California    CA  39512223


Merge state populations with state covid data.

In [13]:
state_joined_df = pd.merge(state_covid_df, state_pop_df, on='state', how='left')

print(state_joined_df.head())

        date state  positive  negative  positiveIncrease state_long_name  \
0 2020-10-28    AK     15155    565444               357          Alaska   
1 2020-10-28    AL    187706   1159926              1269         Alabama   
2 2020-10-28    AR    108640   1212191               961        Arkansas   
3 2020-10-28    AS         0      1616                 0             NaN   
4 2020-10-28    AZ    241165   1495814              1043         Arizona   

    pop_2019  
0   731545.0  
1  4903185.0  
2  3017804.0  
3        NaN  
4  7278717.0  


We don't typically have campers from American Samoa (AS), so we can drop this. Let's get rid of all nulls so we can perform calculations safely.

In [14]:
state_no_null = state_joined_df.dropna(axis=0)

print(state_no_null.head())

        date state  positive  negative  positiveIncrease state_long_name  \
0 2020-10-28    AK     15155    565444               357          Alaska   
1 2020-10-28    AL    187706   1159926              1269         Alabama   
2 2020-10-28    AR    108640   1212191               961        Arkansas   
4 2020-10-28    AZ    241165   1495814              1043         Arizona   
5 2020-10-28    CA    908713  17314883              4515      California   

     pop_2019  
0    731545.0  
1   4903185.0  
2   3017804.0  
4   7278717.0  
5  39512223.0  
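
Before dropping rows, it can be worth confirming exactly which states failed to match the population table. A minimal sketch using three rows taken from the outputs above and the `indicator` option of `pd.merge`; the variable names here are ours:

```python
import pandas as pd

# Three states from the outputs above; 'AS' has no entry in the population table.
covid = pd.DataFrame({'state': ['AK', 'AS', 'AZ'],
                      'positive': [15155, 0, 241165]})
pop = pd.DataFrame({'state': ['AK', 'AZ'],
                    'pop_2019': [731545, 7278717]})

# indicator=True adds a '_merge' column saying where each row was found.
checked = pd.merge(covid, pop, on='state', how='left', indicator=True)

# Rows found only in the left table are the ones dropna() would remove.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'state'].tolist()
print(unmatched)
```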


### 2b. Add columns with calculations

In [15]:
# suppress a false warning that arises after adding columns with calculations
pd.options.mode.chained_assignment = None

# percent positive cases compared to total tests
state_no_null['positive_of_total'] = state_no_null['positive'] / (state_no_null['positive'] + state_no_null['negative'])

# recent increase per 100,000 people in the state
state_no_null['increase_per_100k'] = state_no_null['positiveIncrease'] / (state_no_null['pop_2019'] / 100000)

# check output, sorted by highest increase
print(state_no_null.sort_values('increase_per_100k', ascending=False).head())

         date state  positive  negative  positiveIncrease state_long_name  \
45 2020-10-28    SD     42000    209296              1270    South Dakota   
18 2020-10-28    KS     82045    550988              3369          Kansas   
31 2020-10-28    ND     39907    247617               777    North Dakota   
53 2020-10-28    WI    221559   1795161              4130       Wisconsin   
55 2020-10-28    WY     12146    113686               340         Wyoming   

     pop_2019  positive_of_total  increase_per_100k  
45   884659.0           0.167134         143.558139  
18  2913314.0           0.129606         115.641500  
31   762062.0           0.138795         101.960208  
53  5822434.0           0.109861          70.932534  
55   578759.0           0.096526          58.746387  
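
As a sanity check, the two calculated columns can be reproduced by hand for a single state. This sketch plugs in the South Dakota values from the output above:

```python
# Values copied from the South Dakota row of the output above.
positive = 42000
negative = 209296
positive_increase = 1270
pop_2019 = 884659

# Share of all reported tests that came back positive.
positive_of_total = positive / (positive + negative)

# New cases scaled to a population of 100,000.
increase_per_100k = positive_increase / (pop_2019 / 100000)

print(round(positive_of_total, 6))
print(round(increase_per_100k, 6))
```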


Rename dataframe to fit with later context.

In [16]:
state_final_df = state_no_null

## 3. Query Camper Database

After the database has been updated using this program (https://github.com/amcgaha/camp-community-covid-levels/blob/main/update_campers_in_database_public.ipynb), we can import that data to match with our covid data.

In [17]:
password = '*********'

engine = create_engine(f'postgresql://postgres:{password}@localhost:5432/grp_data')

metadata = MetaData()

connection = engine.connect()

In [18]:
stmt = """
    SELECT c.camper_id, h.zipcode, h.state, a.session_id, counties.county_id
    FROM campers AS c
    JOIN households AS h USING(household_id)
    JOIN applications AS a USING(camper_id)
    JOIN counties USING(zipcode)
    WHERE a.application_date > '2020-09-01';
"""

query_result = pd.read_sql(stmt, con=connection)

query_result['county_id'] = query_result['county_id'].astype('int') 

print(query_result.head())

   camper_id zipcode state                      session_id  county_id
0    2424030   28601    NC              western_expedition      37035
1    2587669   37350    TN                       session_2      47065
2    2601673   32746    FL  leadership_in_training_(lit)_1      12117
3    2683141   29464    SC              western_expedition      45019
4    2748592   27608    NC              western_expedition      37183


In [19]:
connection.close()
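
For readers without access to our Postgres database, the shape of the query can be tried on a throwaway in-memory SQLite database. This is only a sketch: the table and column names come from the query above, but the column types, the rows, the `:cutoff` parameter, and the use of SQLite are all our assumptions.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Same joins as the query above; :cutoff is a bound parameter (an assumption).
stmt = """
SELECT c.camper_id, h.zipcode, h.state, a.session_id, counties.county_id
FROM campers AS c
JOIN households AS h USING(household_id)
JOIN applications AS a USING(camper_id)
JOIN counties USING(zipcode)
WHERE a.application_date > :cutoff;
"""

# closing() makes sure the connection is closed even if the query fails.
with closing(sqlite3.connect(':memory:')) as conn:
    # Invented schema and rows, only detailed enough for the joins to run.
    conn.executescript("""
        CREATE TABLE campers (camper_id INTEGER, household_id INTEGER);
        CREATE TABLE households (household_id INTEGER, zipcode TEXT, state TEXT);
        CREATE TABLE applications (camper_id INTEGER, session_id TEXT,
                                   application_date TEXT);
        CREATE TABLE counties (zipcode TEXT, county_id TEXT);
        INSERT INTO campers VALUES (1, 10), (2, 20);
        INSERT INTO households VALUES (10, '28601', 'NC'), (20, '37350', 'TN');
        INSERT INTO applications VALUES (1, 'session_2', '2020-09-15'),
                                        (2, 'session_2', '2020-08-01');
        INSERT INTO counties VALUES ('28601', '37035'), ('37350', '47065');
    """)
    toy_result = pd.read_sql(stmt, con=conn, params={'cutoff': '2020-09-01'})

# Same type conversion as in the notebook cell above.
toy_result['county_id'] = toy_result['county_id'].astype('int')
print(toy_result)
```

Only the camper whose application date falls after the cutoff comes back, which is the filtering we rely on in the real query.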

## 4. Merge county, state, and camper data

In [20]:
nyt_query_ready = nyt_truncated.drop('state', axis=1)
nyt_query_ready = nyt_query_ready.rename({'fips': 'county_id',
                                         'cases':'cases_county',
                                         'deaths': 'deaths_county'}, axis=1)

In [21]:
query_county = pd.merge(query_result, nyt_query_ready, on='county_id', how='left')

print(query_county.head())

   camper_id zipcode state                      session_id  county_id  \
0    2424030   28601    NC              western_expedition      37035   
1    2587669   37350    TN                       session_2      47065   
2    2601673   32746    FL  leadership_in_training_(lit)_1      12117   
3    2683141   29464    SC              western_expedition      45019   
4    2748592   27608    NC              western_expedition      37183   

        date      county  cases_county  deaths_county  
0 2020-10-28     Catawba        4435.0           62.0  
1 2020-10-28    Hamilton       12312.0          111.0  
2 2020-10-28    Seminole       10188.0          241.0  
3 2020-10-28  Charleston       17181.0          277.0  
4 2020-10-28        Wake       21400.0          270.0  
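
A left merge like the one above keeps every camper, even when no county row matches. The sketch below, on invented rows, shows two checks that could be added: `validate='many_to_one'` to confirm each county appears only once in the covid data, and a count of campers left without a match.

```python
import pandas as pd

# Invented rows: two campers share a county, and one county has no covid row.
campers = pd.DataFrame({'camper_id': [1, 2, 3],
                        'county_id': [37035, 37035, 99999]})
county_covid = pd.DataFrame({'county_id': [37035, 47065],
                             'cases_county': [4435, 12312]})

# validate='many_to_one' raises a MergeError if county_id repeats on the right.
merged = pd.merge(campers, county_covid, on='county_id',
                  how='left', validate='many_to_one')

# Every camper is kept; the unmatched county shows up as NaN,
# which turns the integer case counts into floats.
print(len(merged) == len(campers))
print(merged['cases_county'].isna().sum())
```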


In [22]:
query_county_state = pd.merge(query_county, state_final_df, on='state', how='left', suffixes=['_county', '_state'])

query_county_state = query_county_state.rename({'positive': 'postive_state',
                                               'negative': 'negative_state',
                                               'positiveIncrease': 'increase_state',
                                               'pop_2019': 'pop_2019_county',
                                               'positive_of_total': 'positive_of_total_state',
                                               'increase_per_100k': 'increase_per_100k_state'}, axis=1)
pd.options.mode.chained_assignment = None
# Caution: pop_2019 is the state population from section 2a, so this column holds
# county cases per 100,000 state residents, not per 100,000 county residents
query_county_state['cases_per_100k_county'] = query_county_state['cases_county'] / (query_county_state['pop_2019_county'] / 100000)

print(query_county_state.head())

   camper_id zipcode state                      session_id  county_id  \
0    2424030   28601    NC              western_expedition      37035   
1    2587669   37350    TN                       session_2      47065   
2    2601673   32746    FL  leadership_in_training_(lit)_1      12117   
3    2683141   29464    SC              western_expedition      45019   
4    2748592   27608    NC              western_expedition      37183   

  date_county      county  cases_county  deaths_county date_state  \
0  2020-10-28     Catawba        4435.0           62.0 2020-10-28   
1  2020-10-28    Hamilton       12312.0          111.0 2020-10-28   
2  2020-10-28    Seminole       10188.0          241.0 2020-10-28   
3  2020-10-28  Charleston       17181.0          277.0 2020-10-28   
4  2020-10-28        Wake       21400.0          270.0 2020-10-28   

   postive_state  negative_state  increase_state state_long_name  \
0       266136.0       3656906.0          2253.0  North Carolina   
1       25
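
One caveat about the rate computed above: `pop_2019` comes from the state population table in section 2a, so county cases are being divided by the state's population. A true county-level rate would need county populations, which this notebook does not load. A minimal sketch of that calculation, where the `pop_county` column and every number are hypothetical:

```python
import pandas as pd

# Hypothetical inputs: 'pop_county' is NOT part of this notebook's data, and
# the ids, case counts, and populations are invented for illustration.
county_df = pd.DataFrame({
    'county_id': [11111, 22222],
    'cases_county': [500, 1200],
    'pop_county': [100000, 400000],
})

# Cases per 100,000 county residents.
county_df['cases_per_100k_county'] = (
    county_df['cases_county'] / (county_df['pop_county'] / 100000)
).round(2)

print(county_df)
```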

Finally, let's select our final columns and put them in an order that makes sense.

In [23]:
final_reordered_df = query_county_state[['camper_id', 
                                         'session_id', 
                                         'zipcode', 
                                         'county_id', 
                                         'county', 
                                         'state', 
                                         'state_long_name', 
                                         'cases_per_100k_county', 
                                         'date_county', 
                                         'positive_of_total_state', 
                                         'increase_per_100k_state', 
                                         'date_state']]

final_reordered_df['positive_of_total_state'] = final_reordered_df['positive_of_total_state'] * 100
final_reordered_df['cases_per_100k_county'] = round(final_reordered_df['cases_per_100k_county'], 2)
final_reordered_df['increase_per_100k_state'] = round(final_reordered_df['increase_per_100k_state'], 2)

print(final_reordered_df.head())

   camper_id                      session_id zipcode  county_id      county  \
0    2424030              western_expedition   28601      37035     Catawba   
1    2587669                       session_2   37350      47065    Hamilton   
2    2601673  leadership_in_training_(lit)_1   32746      12117    Seminole   
3    2683141              western_expedition   29464      45019  Charleston   
4    2748592              western_expedition   27608      37183        Wake   

  state state_long_name  cases_per_100k_county date_county  \
0    NC  North Carolina                  42.29  2020-10-28   
1    TN       Tennessee                 180.29  2020-10-28   
2    FL         Florida                  47.44  2020-10-28   
3    SC  South Carolina                 333.69  2020-10-28   
4    NC  North Carolina                 204.04  2020-10-28   

   positive_of_total_state  increase_per_100k_state date_state  
0                 6.783919                    21.48 2020-10-28  
1                 7.04

## 5. Save CSV to Repository
With this combined and current data, we can now move on to exploration and visualization. Let's upload the dataframe as a csv to the project repository.

In [24]:
# Raw string so the Windows backslashes need no escaping
final_reordered_df.to_csv(r'C:\Users\avery\OneDrive\Documents\GitHub\camp-community_covid_levels\latest_combined_covid_data.csv')
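
For a version that does not depend on one machine's folder layout, the export could be built with `pathlib`. In this sketch a temporary folder stands in for the local clone of the repository and a one-row frame stands in for `final_reordered_df`. Passing `index=False` is our suggestion, not what the cell above does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame; in the notebook this would be final_reordered_df.
demo_df = pd.DataFrame({'camper_id': [2424030], 'state': ['NC']})

# A temporary folder stands in for the local clone of the repository.
repo_dir = Path(tempfile.mkdtemp())
out_path = repo_dir / 'latest_combined_covid_data.csv'

# index=False leaves the pandas row index out of the file.
demo_df.to_csv(out_path, index=False)

written = out_path.read_text().splitlines()
print(written[0])
```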