# The US Department of Health and Human Services provides federal-level collection and publishing of COVID-19 testing and patient outcome data. To better understand the current state of COVID-19 testing in the US, we’d like you to create a Python project and documentation for the following metrics:

# •	The total number of PCR tests performed as of yesterday in the United States.



# •	The 7-day rolling average number of new cases per day for the last 30 days.

# •	The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.

###### •	The data is assumed to be in a JSON file with the following structure: [{"state":"xx","state_name":"xxx","state_fips":"xx","fema_region":"xxx…","overall_outcome":"xxx…","date":"xxxx…","new_results_reported":"xxx…","total_results_reported":"xxx…"},{….}]



###### •	The source data may not be up to date, so "yesterday" is defined as the date immediately before the most recent date available in the source data set.
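The two assumptions above can be illustrated with a minimal sketch. The two records below are made up purely for illustration (they are not taken from the real data set), and the explicit type conversions are a defensive assumption, since the source delivers every field as a string.

```python
import io
import pandas as pd

# Hypothetical sample in the assumed structure; all values are invented.
sample_json = """[
  {"state": "AL", "state_name": "Alabama", "state_fips": "01",
   "fema_region": "Region 4", "overall_outcome": "Positive",
   "date": "2021-01-01T00:00:00.000",
   "new_results_reported": "10", "total_results_reported": "10"},
  {"state": "AL", "state_name": "Alabama", "state_fips": "01",
   "fema_region": "Region 4", "overall_outcome": "Negative",
   "date": "2021-01-02T00:00:00.000",
   "new_results_reported": "25", "total_results_reported": "25"}
]"""

df_sample = pd.read_json(io.StringIO(sample_json))

# Make the column types explicit instead of relying on inference.
df_sample["date"] = pd.to_datetime(df_sample["date"])
df_sample["new_results_reported"] = pd.to_numeric(df_sample["new_results_reported"])

# "Yesterday" = the date immediately before the most recent date in the data.
unique_dates = df_sample["date"].drop_duplicates().sort_values()
most_recent = unique_dates.iloc[-1]
yesterday = unique_dates.iloc[-2]
print("most recent:", most_recent.date(), "| yesterday:", yesterday.date())
```

Deriving "yesterday" from the data rather than from the system clock keeps the metrics reproducible when the source lags behind the calendar.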

In [300]:
# Load pandas; pd.read_json parses the JSON source directly, so the json module is not needed
import pandas as pd

##### Source location and file name definition

In [301]:
source_location = 'https://healthdata.gov/resource/'
source_file = 'j8mb-icvb.json'

##### Data ingestion

In [302]:
df_covid = pd.read_json(source_location + source_file)
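One thing worth checking at this step: Socrata-style (SODA) endpoints such as this one usually return only a limited number of rows per request by default (1000 according to the SODA documentation), so a single `read_json` call may not bring back the full data set. The sketch below pages through the endpoint with the `$limit`, `$offset` and `$order` query parameters. The helper name `load_all_pages`, the page size and the `reader` argument are assumptions introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

def load_all_pages(base_url, reader=pd.read_json, page_size=50000, max_pages=100):
    """Fetch a SODA resource page by page and concatenate the results.

    `reader` is any callable that takes a URL and returns a DataFrame
    (pd.read_json by default); it is injectable so the paging logic can
    be exercised without network access.
    """
    frames = []
    for page in range(max_pages):
        offset = page * page_size
        # $order=:id asks for a stable row order across pages (assumed supported).
        url = f"{base_url}?$limit={page_size}&$offset={offset}&$order=:id"
        chunk = reader(url)
        if chunk.empty:
            break
        frames.append(chunk)
        if len(chunk) < page_size:
            # A short page means there is nothing left to fetch.
            break
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With the variables defined earlier, usage would look like `df_covid = load_all_pages(source_location + source_file)`. Comparing `len(df_covid)` before and after is a quick way to see whether the default request was truncated.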

In [303]:
# Sort the data by date (oldest first) and rebuild the index
df_covid = df_covid.sort_values(by='date').reset_index(drop=True)

### Metrics:  The total number of PCR tests performed as of yesterday in the United States.

In [304]:
# Sum new_results_reported across all states and outcomes for each date (US-wide daily test count)
df_pcr_totals = df_covid.groupby('date').new_results_reported.agg(df_pcr_totals='sum')
df_pcr_totals = df_pcr_totals.reset_index()

In [305]:
# Print yesterday's PCR test count. "Yesterday" is the date immediately before the most recent date in the data,
# i.e. the second-to-last row. Note: this value is the sum of new_results_reported for that single date.
print('The total number of PCR tests performed as of yesterday in the United States :', df_pcr_totals['df_pcr_totals'].iloc[-2], 'tests')

The total number of PCR tests performed as of yesterday in the United States : 5609 tests
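The figure above is the sum of `new_results_reported` for a single date. If "as of yesterday" is read as a running total, one possible approach is to accumulate the daily sums up to and including that date. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up data: two states, three dates.
toy = pd.DataFrame({
    "state": ["AL", "AL", "AL", "AK", "AK", "AK"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "new_results_reported": [10, 20, 30, 1, 2, 3],
})

# Daily US-wide totals, oldest date first.
daily = toy.groupby("date")["new_results_reported"].sum().sort_index()

# Running total of tests through each date.
cumulative = daily.cumsum()

# "Yesterday" is the second-to-last date in the data.
yesterday = cumulative.index[-2]
total_as_of_yesterday = int(cumulative.iloc[-2])
print(f"Tests performed through {yesterday.date()}: {total_as_of_yesterday}")
```

An alternative, assuming `total_results_reported` in the source is already cumulative per state and outcome, would be to sum that column for yesterday's date; that assumption would need to be checked against the data set's documentation.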


### Metrics: The 7-day rolling average number of new cases per day for the last 30 days.


##### -  New cases are assumed to be positive PCR tests

In [306]:
# Keep only positive PCR results from the original data set, then sum them per date for the whole US
df_newcases1 = df_covid[df_covid.overall_outcome == "Positive"].copy()
df_newcases = df_newcases1.groupby('date').new_results_reported.agg(new_cases='sum')
df_newcases = df_newcases.reset_index()

In [307]:
# Get the unique dates in the data set, sort them oldest to newest, and keep the 30 most recent
date_list = df_covid.date.unique()
date_list.sort()
days_list = date_list[-30:]
last_30days = pd.DataFrame({'date':days_list})

In [308]:
# Keep the daily positive PCR totals for the last 30 dates only (dates with no positive rows become NaN)
df_rolling_last30days = pd.merge(last_30days, df_newcases, on=['date'], how = "left")

In [309]:
# Compute the 7-day rolling average of new cases within the 30-day window.
# The first 6 days have no full 7-day window, and any window containing a NaN day is NaN as well.
df_pcr_rolling_avg = df_rolling_last30days['new_cases'].rolling(7).mean()
df_pcr_rolling_avg = df_pcr_rolling_avg.reset_index() 
df_pcr_rolling_avg = df_pcr_rolling_avg.rename(columns={'new_cases':'7day_rolling_avg'})
df_pcr_rolling_avg['day'] = df_pcr_rolling_avg['index']+1
df_pcr_rolling_avg =df_pcr_rolling_avg[['day','7day_rolling_avg']]


In [310]:
#To present results
print('The 7-day rolling average number of new cases per day for the last 30 days.\n',df_pcr_rolling_avg)

The 7-day rolling average number of new cases per day for the last 30 days.
     day  7day_rolling_avg
0     1               NaN
1     2               NaN
2     3               NaN
3     4               NaN
4     5               NaN
5     6               NaN
6     7       3270.714286
7     8       3248.000000
8     9       3239.000000
9    10       3330.714286
10   11       3120.428571
11   12       2981.000000
12   13       2904.571429
13   14       2755.428571
14   15       2554.428571
15   16       2514.142857
16   17       2521.285714
17   18       2453.000000
18   19       2332.428571
19   20       2267.142857
20   21       2284.571429
21   22       2415.714286
22   23       2316.714286
23   24       2178.142857
24   25       2097.142857
25   26       2081.857143
26   27       1921.142857
27   28       1854.714286
28   29       1797.714286
29   30               NaN
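The leading `NaN` values appear because the rolling window is applied after the data has been cut down to 30 days, so the first six days have fewer than seven observations behind them. One way to get a value for every one of the 30 days, sketched below on synthetic data, is to compute the rolling mean over the full history first and only then keep the last 30 rows. The series `daily_cases` is invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily case counts for 40 consecutive days: 1, 2, ..., 40.
dates = pd.date_range("2021-01-01", periods=40, freq="D")
daily_cases = pd.Series(np.arange(1, 41, dtype=float), index=dates, name="new_cases")

# Rolling mean over the full history, so each of the last 30 days
# has a complete 7-day window behind it.
rolling_full = daily_cases.rolling(window=7).mean()

# Slice to the last 30 days only after the rolling mean is computed.
last_30 = rolling_full.tail(30)
print(last_30.head())
```

Dates with no reported positives would still need a decision (for example treating them as 0 with `fillna(0)`, or leaving them missing); which is appropriate depends on whether a missing date means "no cases" or "not yet reported".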


### Metrics: The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.

In [311]:
# Keep the last 30 dates of PCR test rows from the original data set
df_covid_last_30days = pd.merge(last_30days, df_covid, on=['date'], how = "left")
# Total PCR tests performed per state during the last 30 days
df_pcr_totals = df_covid_last_30days.groupby(['state_name']).new_results_reported.agg(tests_performed='sum')
df_pcr_totals = df_pcr_totals.reset_index()

In [312]:
# Keep only positive results from the last-30-days data set
df_pcr_positive1 = df_covid_last_30days[df_covid_last_30days.overall_outcome == "Positive"].copy()
# Add up all the positive tests in the last 30 days per state
df_pcr_positive = df_pcr_positive1.groupby(['state_name']).new_results_reported.agg(positive_tests_performed='sum')


In [313]:
# Merge the two data sets: total PCR tests per state and positive PCR tests per state for the last 30 days
df_positivity_rate = pd.merge(df_pcr_totals, df_pcr_positive, on=['state_name'], how = "left")
df_positivity_rate.sort_values(by=['state_name'], inplace = True)


In [314]:
# Guard against NaN values before dividing: a missing positive count becomes 0 and a missing test count becomes 1.
# This should not occur with this data, but the guard is cheap.
df_positivity_rate['positive_tests_performed'] = df_positivity_rate['positive_tests_performed'].fillna(0)
df_positivity_rate['tests_performed'] = df_positivity_rate['tests_performed'].fillna(1)
#To calculate "positivity rate"
df_positivity_rate['positivity_rate'] = df_positivity_rate['positive_tests_performed']/df_positivity_rate['tests_performed']
df_positivity_rate = df_positivity_rate[['state_name','positivity_rate']]

In [315]:
# Sort by positivity rate, highest first, and keep the top 10 states
df_positivity_rate = df_positivity_rate.sort_values(by='positivity_rate', ascending=False).reset_index(drop=True)
df_states = df_positivity_rate.head(10)
df_states = df_states[['state_name','positivity_rate']]
print('The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.\n',df_states)

The 10 states with the highest test positivity rate (positive tests / tests performed) for tests performed in the last 30 days.
   state_name  positivity_rate
0    Alabama          0.16799
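For reference, the positivity-rate calculation (positive tests / tests performed) can also be written more compactly with two grouped sums and an index-aligned division. The sketch below runs on a made-up frame (`toy`) with invented state names and counts, so it illustrates the technique only and says nothing about the real ranking.

```python
import pandas as pd

# Made-up results for three states; state "C" has no positive rows.
toy = pd.DataFrame({
    "state_name": ["A", "A", "B", "B", "C"],
    "overall_outcome": ["Positive", "Negative", "Positive", "Negative", "Negative"],
    "new_results_reported": [10, 90, 30, 70, 50],
})

# Total tests performed per state.
tests = toy.groupby("state_name")["new_results_reported"].sum()

# Positive tests per state.
positives = (
    toy[toy["overall_outcome"] == "Positive"]
    .groupby("state_name")["new_results_reported"]
    .sum()
)

# Align positives to every state (0 where a state had none), then divide.
rate = positives.reindex(tests.index, fill_value=0) / tests

# Highest positivity first; keep the top N (2 here, 10 in the real metric).
top = rate.sort_values(ascending=False).head(2)
print(top)
```

Using `reindex(..., fill_value=0)` plays the same role as the left merge followed by `fillna(0)`: states without any positive rows stay in the result with a rate of 0 instead of dropping out.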
