# OVERVIEW #

This notebook imports United States COVID-19 case data through an API, converts the data into useful datatypes, and then exports a Pandas DataFrame as a .csv file on your local machine. The raw COVID-19 data has more than 8 million rows (8,405,079 in this run); the exported DataFrame has only a few hundred rows.

*After the first run, do not re-run this notebook unless you want to download the COVID data again and update it*

In [4]:
#import Pandas, Numpy, and MatPlotLib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [5]:
#import requests in order to call the COVID data API
#import time in order to pause after an API call
#import pprint in order to pretty-print the data.
import requests
import time
from pprint import pprint


# SECTION 1: Execute an API call to the CDC Public Use Data Website #

In [6]:
#Send one GET request to the COVID data API and store the response in a variable
    #$order sorts the rows by CDC Report Date
    #$limit is larger than the number of rows in the CDC data, so all rows arrive in one response
r = requests.get('https://data.cdc.gov/resource/vbim-akqf.json?$order=cdc_report_dt&$limit=15000000')

#Stop here if the request failed, rather than parsing an error response
r.raise_for_status()

#Parse the JSON response into Python objects, then pause for 1 second before anything else runs
covid_json = r.json()
time.sleep(1)
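
A single request for millions of rows can be slow or time out. As a hedged alternative, the sketch below pages through the data with the `$limit` and `$offset` query parameters and pauses between requests, which is what the `time` import is suited for. The names `fetch_all` and `cdc_page`, the page size, and the timeout are assumptions for illustration and are not part of the original notebook.

```python
import time


def fetch_all(get_page, page_size=50000, pause_seconds=1.0):
    """Collect rows page by page until a page comes back short."""
    rows = []
    offset = 0
    while True:
        page = get_page(offset, page_size)
        rows.extend(page)
        # A page shorter than page_size means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
        time.sleep(pause_seconds)  # pause between requests
    return rows


def cdc_page(offset, limit):
    """Hypothetical helper: request one page from the endpoint used above."""
    import requests  # imported here so the sketch loads even without requests

    response = requests.get(
        'https://data.cdc.gov/resource/vbim-akqf.json',
        params={'$order': 'cdc_report_dt', '$limit': limit, '$offset': offset},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()


# Usage (makes real network calls, so it is left commented out):
# covid_json = fetch_all(cdc_page)
```

Because `fetch_all` only depends on a function that returns one page, it can be tried against an in-memory list before pointing it at the real API.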

In [7]:
#pretty-print the first 2 entries to preview the data.
pprint(covid_json[:2])

[{'age_group': '20 - 29 Years',
  'cdc_report_dt': '2020-01-01T00:00:00.000',
  'current_status': 'Probable Case',
  'death_yn': 'No',
  'hosp_yn': 'No',
  'icu_yn': 'Missing',
  'medcond_yn': 'No',
  'onset_dt': '2020-01-01T00:00:00.000',
  'race_ethnicity_combined': 'Multiple/Other, Non-Hispanic',
  'sex': 'Female'},
 {'age_group': '40 - 49 Years',
  'cdc_report_dt': '2020-01-01T00:00:00.000',
  'current_status': 'Probable Case',
  'death_yn': 'Unknown',
  'hosp_yn': 'Unknown',
  'icu_yn': 'Missing',
  'medcond_yn': 'Unknown',
  'onset_dt': '2020-01-01T00:00:00.000',
  'race_ethnicity_combined': 'White, Non-Hispanic',
  'sex': 'Male'}]
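
Before choosing columns, it can help to see which values a field actually takes. The sketch below tallies one key across a list of records with the standard library; the two records are abbreviated copies of the previewed entries, and `count_values` is a hypothetical helper name.

```python
from collections import Counter

# Abbreviated copies of the two records previewed above
sample_records = [
    {'cdc_report_dt': '2020-01-01T00:00:00.000', 'death_yn': 'No', 'hosp_yn': 'No'},
    {'cdc_report_dt': '2020-01-01T00:00:00.000', 'death_yn': 'Unknown', 'hosp_yn': 'Unknown'},
]


def count_values(records, key):
    """Tally how often each value of `key` appears; absent keys count as 'absent'."""
    return Counter(record.get(key, 'absent') for record in records)


print(count_values(sample_records, 'death_yn'))
```

Running the same helper on the full `covid_json` list would show every category present in a field, which is useful when deciding how to count it later.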


**I want to compare the following COVID19 data points:** 

**case report date** = 'cdc_report_dt' - number of cases reported (the data includes both confirmed and probable cases)

**death status** = 'death_yn' - number of deaths

**hospitalization status** = 'hosp_yn' - number of people hospitalized
    
These are the key COVID-19 counts reported in the news, so these are the data columns used to create a Pandas DataFrame.

*Note: ideally, I would use the date the positive result was confirmed, but it is missing from some reported cases*

In [8]:
#Right now, the API data is a list of dictionaries (one dictionary per case)

print(type(covid_json))

<class 'list'>


In [9]:
#Loop through the list and collect each desired datapoint, by its key, into its own list.
#Only three of the available keys are used here

case_date, death, hospital = [],[],[]

for data in covid_json:
    case_date.append(data['cdc_report_dt'])
    death.append(data['death_yn'])
    hospital.append(data['hosp_yn'])


In [10]:
#Now that each datapoint is separated, add them back into a dictionary 
#This will make sure that each datapoint column is labelled
covid_values = {'cdc_report_dt': case_date, 'death_yn': death, 'hosp_yn': hospital}

#Convert the dictionary into a DataFrame
covid_df = pd.DataFrame(covid_values)
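
As a possible shortcut, pandas can build the same kind of DataFrame straight from a list of dictionaries, with the `columns` argument selecting which keys to keep. This is a minimal sketch on two abbreviated sample records, not the notebook's own method; the names `records`, `wanted`, and `sample_df` are illustrative.

```python
import pandas as pd

# Two abbreviated records in the same shape as the API data
records = [
    {'cdc_report_dt': '2020-01-01T00:00:00.000', 'death_yn': 'No',
     'hosp_yn': 'No', 'sex': 'Female'},
    {'cdc_report_dt': '2020-01-01T00:00:00.000', 'death_yn': 'Unknown',
     'hosp_yn': 'Unknown', 'sex': 'Male'},
]

# Keys not listed in `wanted` (here, 'sex') are left out of the DataFrame
wanted = ['cdc_report_dt', 'death_yn', 'hosp_yn']
sample_df = pd.DataFrame(records, columns=wanted)

print(sample_df)
```

One difference from the explicit loop: if a record lacks one of the wanted keys, this approach fills in a missing value instead of raising a `KeyError`.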


In [11]:
#Check the COVID DataFrame with a summary

covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8405079 entries, 0 to 8405078
Data columns (total 3 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   cdc_report_dt  object
 1   death_yn       object
 2   hosp_yn        object
dtypes: object(3)
memory usage: 192.4+ MB


# SECTION 2: Data Conversion #

**Here is a list of next steps:**

1. convert the COVID dates into standard date format

2. group the covid data by days (cdc_report_dt) - each day will have total number of cases, deaths, hospitalizations

In [12]:
#Preview the COVID DataFrame created in Section 1
covid_df.head()

Unnamed: 0,cdc_report_dt,death_yn,hosp_yn
0,2020-01-01T00:00:00.000,No,No
1,2020-01-01T00:00:00.000,Unknown,Unknown
2,2020-01-01T00:00:00.000,No,Yes
3,2020-01-01T00:00:00.000,No,No
4,2020-01-01T00:00:00.000,No,No


#### Convert the COVID dates into standard date format ####

In [13]:
#Use the pandas to_datetime function to convert the date strings into datetime64 values
rpt_date = pd.to_datetime(covid_df["cdc_report_dt"])
report_date = pd.DataFrame(rpt_date)

In [14]:
#Check the datatype - it has changed from object to datetime64
report_date.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8405079 entries, 0 to 8405078
Data columns (total 1 columns):
 #   Column         Dtype         
---  ------         -----         
 0   cdc_report_dt  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 64.1 MB


#### Create a count of cases reported each day ####

In [15]:
#Count the number of cases each day by adding a column with '1' for each case
    #The numpy 'where' function fills the column based on the condition I define
    #np.where(condition, value if condition is true, value if condition is false)
    #my code puts '1' in the new column if the report date is present and '0' if it is missing (NaT)
    
report_date['case_reported'] = np.where(report_date['cdc_report_dt'].notna(), 1, 0)

#here is a preview of the new DataFrame
report_date.head()

Unnamed: 0,cdc_report_dt,case_reported
0,2020-01-01,1
1,2020-01-01,1
2,2020-01-01,1
3,2020-01-01,1
4,2020-01-01,1
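
The date conversion and the case flag can be tried on a small sample that includes bad input. This is a sketch under assumptions: the sample values are made up for illustration, and `errors='coerce'` is an optional setting (not used in the cells above) that turns unparseable entries into `NaT` instead of raising an error.

```python
import numpy as np
import pandas as pd

# Two valid timestamps, one missing value, and one unparseable string
raw_dates = pd.Series([
    '2020-01-01T00:00:00.000',
    '2020-11-19T00:00:00.000',
    None,
    'not a date',
])

# errors='coerce' converts anything that cannot be parsed into NaT
parsed = pd.to_datetime(raw_dates, errors='coerce')

# 1 where a report date is present, 0 where it is missing
case_reported = np.where(parsed.notna(), 1, 0)

print(parsed.dtype)
print(case_reported)
```

Testing for missing dates with `notna()` matters because a datetime column never equals a placeholder string, so a string comparison would mark every row as a case.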


#### Re-Create the COVID DataFrame with the new date format ####

In [16]:
#Create a separate dataframe with each column from the original 
deceased = pd.DataFrame(covid_df["death_yn"])
hospital = pd.DataFrame(covid_df["hosp_yn"])

#Concatenate the report_date, deceased, and hospital DataFrames
covid_data = pd.concat([report_date, deceased, hospital], axis=1)
covid_data.sort_values('death_yn')

Unnamed: 0,cdc_report_dt,case_reported,death_yn,hosp_yn
4202539,2020-08-30,1,Missing,Missing
4721680,2020-09-16,1,Missing,Missing
4721678,2020-09-16,1,Missing,No
4721672,2020-09-16,1,Missing,Missing
4721667,2020-09-16,1,Missing,No
...,...,...,...,...
2252430,2020-07-03,1,Yes,Yes
2252479,2020-07-03,1,Yes,Yes
1078867,2020-05-21,1,Yes,Missing
2252382,2020-07-03,1,Yes,Yes


#### Create a count of number of deaths and hospitalizations reported each day ####

In [17]:
#Count the number of deaths each day by adding a column with '1' for each death or '0' if no death 
    #death = Yes ('1'), No ('0'), Missing ('0'), or Unknown ('0')
    #This code is using a for loop to create the column

num_deaths = []

for value in covid_data["death_yn"]:
    if value == 'Yes':
        num_deaths.append(1)
    else: 
        num_deaths.append(0)

covid_data["deaths"] = num_deaths
#print(covid_data)
covid_data_r = pd.DataFrame(covid_data)
covid_data_r.sort_values('deaths')

Unnamed: 0,cdc_report_dt,case_reported,death_yn,hosp_yn,deaths
0,2020-01-01,1,No,No,0
5594742,2020-10-11,1,Unknown,Unknown,0
5594741,2020-10-11,1,Missing,Missing,0
5594740,2020-10-11,1,No,No,0
5594739,2020-10-11,1,Missing,Missing,0
...,...,...,...,...,...
3722278,2020-08-16,1,Yes,Yes,1
274808,2020-04-07,1,Yes,Yes,1
274809,2020-04-07,1,Yes,Yes,1
4327847,2020-09-04,1,Yes,Yes,1


In [18]:
#Count the number of hospitalizations each day by adding a column with '1' for each hospitalization or '0' if none
    #hospitalization = Yes ('1'), No ('0'), Missing ('0'), or Unknown ('0')
    #This code is using a for loop to create the column

num_hospital = []

for value in covid_data_r["hosp_yn"]:
    if value == 'Yes':
        num_hospital.append(1)
    else: 
        num_hospital.append(0)

covid_data_r["hospitalizations"] = num_hospital
#print(covid_data)
covid_data_r2 = pd.DataFrame(covid_data_r)
covid_data_r2.sort_values('hospitalizations')

Unnamed: 0,cdc_report_dt,case_reported,death_yn,hosp_yn,deaths,hospitalizations
0,2020-01-01,1,No,No,0,0
5563416,2020-10-10,1,Missing,Missing,0,0
5563415,2020-10-10,1,Missing,Missing,0,0
5563414,2020-10-10,1,No,No,0,0
5563413,2020-10-10,1,Unknown,Unknown,0,0
...,...,...,...,...,...,...
5658641,2020-10-12,1,No,Yes,0,1
6582085,2020-10-30,1,Missing,Yes,0,1
3610775,2020-08-13,1,Yes,Yes,1,1
3610756,2020-08-13,1,No,Yes,0,1
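
On millions of rows, a Python for loop is slow. A vectorized comparison is one alternative that should produce the same 1/0 columns; the sketch below shows it on a small made-up sample and is not the notebook's original approach.

```python
import pandas as pd

# Small sample covering all four status values
sample = pd.DataFrame({
    'death_yn': ['No', 'Yes', 'Missing', 'Unknown'],
    'hosp_yn': ['Yes', 'Yes', 'Missing', 'No'],
})

# eq('Yes') gives True/False per row; astype(int) turns that into 1/0
sample['deaths'] = sample['death_yn'].eq('Yes').astype(int)
sample['hospitalizations'] = sample['hosp_yn'].eq('Yes').astype(int)

print(sample)
```

As in the loops above, only 'Yes' counts as 1; 'No', 'Missing', and 'Unknown' all become 0.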


#### Create a new DataFrame which only shows counts of each datapoint ####

In [19]:
#keep only the report date and the new count columns from the previous DataFrame
    
covid_counts = covid_data_r2[["cdc_report_dt", "case_reported", "deaths", "hospitalizations"]]
covid_counts.tail()

Unnamed: 0,cdc_report_dt,case_reported,deaths,hospitalizations
8405074,2020-11-19,1,0,0
8405075,2020-11-19,1,0,0
8405076,2020-11-19,1,0,0
8405077,2020-11-19,1,0,0
8405078,2020-11-19,1,0,0


# SECTION 3: Group the COVID data and export as .csv # 

Cases reported, deaths, and hospitalizations will be grouped by the total number reported each day

In [20]:
#Group the COVID data by report date, give it a new variable name
covid_by_day = covid_counts.groupby(['cdc_report_dt'])

#Create a total count for each value, by day
daily_cases = covid_by_day['case_reported'].sum()
daily_deaths = covid_by_day['deaths'].sum()
daily_hospitalizations = covid_by_day['hospitalizations'].sum()

#Create a new DataFrame with these values
covid_daily_counts = pd.concat([daily_cases, daily_deaths, daily_hospitalizations], axis=1)
covid_daily_counts.columns = ["Cases", "Deaths", "Hospitalizations"]
covid_daily_counts

cdc_report_dt,Cases,Deaths,Hospitalizations
2020-01-01,12,0,1
2020-01-02,3,0,0
2020-01-03,2,0,0
2020-01-05,1,0,0
2020-01-08,1,0,0
...,...,...,...
2020-11-15,85177,467,2198
2020-11-16,106744,562,2673
2020-11-17,155142,844,4959
2020-11-18,102010,475,2700
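
The three separate sums and the concatenation can also be written as a single grouped sum over several columns. The sketch below uses a tiny made-up sample to show the idea; the variable names are illustrative.

```python
import pandas as pd

# Three cases: two reported on 2020-01-01 and one on 2020-01-02
sample = pd.DataFrame({
    'cdc_report_dt': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02']),
    'case_reported': [1, 1, 1],
    'deaths': [0, 1, 0],
    'hospitalizations': [1, 1, 0],
})

# Sum all three count columns per day in one step, then rename them
daily = (
    sample.groupby('cdc_report_dt')[['case_reported', 'deaths', 'hospitalizations']]
    .sum()
    .rename(columns={
        'case_reported': 'Cases',
        'deaths': 'Deaths',
        'hospitalizations': 'Hospitalizations',
    })
)

print(daily)
```

Days with no reported cases do not appear in the result, which is why some dates are absent from the grouped output above.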


In [21]:
#Now there are fewer than 500 rows of data, and this can be saved as a .csv file
covid_daily_counts.to_csv('covid_data.csv') 
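
When the saved file is read back later, the report dates come in as plain text unless pandas is told to parse them. The sketch below round-trips a two-row sample through an in-memory buffer instead of a file on disk; the buffer and the variable names are assumptions for illustration, and the two rows copy the first two days of the grouped output above.

```python
import io

import pandas as pd

# First two days from the grouped output, indexed by report date
daily = pd.DataFrame(
    {'Cases': [12, 3], 'Deaths': [0, 0], 'Hospitalizations': [1, 0]},
    index=pd.DatetimeIndex(['2020-01-01', '2020-01-02'], name='cdc_report_dt'),
)

# Write to an in-memory buffer (stands in for 'covid_data.csv')
buffer = io.StringIO()
daily.to_csv(buffer)
buffer.seek(0)

# index_col restores the date index; parse_dates=True converts it back to datetimes
restored = pd.read_csv(buffer, index_col='cdc_report_dt', parse_dates=True)

print(restored)
```

With a real file, the same `read_csv` arguments would apply to `'covid_data.csv'` in place of the buffer.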