# Visualizing COVID-19 Internationally and Locally

This notebook uses the Johns Hopkins University (JHU) COVID-19 data repository to plot data on cases, recoveries, and deaths.

#### Notes
- As of 3/24/20, "recovered" is no longer maintained, and the data is replaced by new "confirmed" and "deaths" .csv files. "Recovered" plots are removed from this notebook
- Unfortunately, the "global" time series datasets no longer include data for individual US states. State-level data has to be assembled into a new dataframe from the directory of daily reports pulled directly from GitHub


### CURRENTLY WORKING ON:
- Formatting dates in `format_datetime()` function (see notes)

### Notes on countries'/states' containment efforts and events
*Insert notes here about containment efforts performed*
- States instituted isolation:
- Hong Kong eased restrictions 3/22/20

- MS governor overrode local ordinances 3/26/2020
- Anti containment protests 4-17-20 through ____
- Many states started reopening on 5-1-20
- George Floyd Protests in MN 5/26-, spreading to many other cities throughout the week

In [1]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from datetime import datetime, timedelta
import os

# plotly and cufflinks
import cufflinks as cf
import plotly
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True) # lets the Jupyter notebook load plotly.js (JavaScript) to render the visualizations
cf.go_offline() # make cufflinks go offline

import plotly.graph_objects as go

%matplotlib inline

### Create date row for large table
- Some dates are formatted inconsistently: the dates for 1/23, 1/24, 1/25, 1/28, and 1/29 have 2-digit years, and some use `/` while others use `-` between month, day, and year
- `if/elif` statements determine which format applies. Within the old format, `try/except` picks the correct style (4-digit or 2-digit year)
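
The format detection described above can also be written as a list of candidate formats tried in order. The following is a minimal sketch, not this notebook's implementation: `KNOWN_FORMATS` and `parse_mixed_date` are hypothetical names, and the format list is an assumption based on the variations noted above.

```python
from datetime import datetime

# Assumed set of date layouts; 4-digit-year formats are tried before 2-digit ones
KNOWN_FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%m-%d-%Y", "%m-%d-%y"]

def parse_mixed_date(text):
    """Return a datetime.date for the first format in KNOWN_FORMATS that matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError("Unrecognized date format: {!r}".format(text))

print(parse_mixed_date("1/29/20"))
print(parse_mixed_date("01-23-2020"))
```

Adding a newly discovered layout then only requires a new entry in the list rather than another `elif` branch.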

#### Add "Date" field to all rows
Extracted from "Last Updated" column

# Plot rates based on JHU timecourse data

In [2]:
# Get data from Github
#confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv'
#deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv'
#recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv'

# NEW DATA STRUCTURES implemented 3/24/20
confirmed_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
deaths_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
recovered_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

timecourse_confirmed = pd.read_csv(confirmed_url)
timecourse_deaths = pd.read_csv(deaths_url)
timecourse_recovered = pd.read_csv(recovered_url)

In [3]:
# join 3 dataframes into one with status
timecourse_confirmed["Status"] = "Confirmed"
timecourse_deaths["Status"] = "Deaths"
timecourse_recovered['Status'] = 'Recovered'

timecourse = pd.concat([timecourse_confirmed, timecourse_deaths, timecourse_recovered])

timecourse.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,6/19/20,6/20/20,6/21/20,6/22/20,6/23/20,6/24/20,6/25/20,6/26/20,6/27/20,Status
0,,Afghanistan,33.0,65.0,0,0,0,0,0,0,...,27878,28424,28833,29157,29481,29640,30175,30451,30616,Confirmed
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,1838,1891,1962,1995,2047,2114,2192,2269,2330,Confirmed
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,11504,11631,11771,11920,12076,12248,12445,12685,12968,Confirmed
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,855,855,855,855,855,855,855,855,855,Confirmed
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,172,176,183,186,189,197,212,212,259,Confirmed


In [4]:
# display country names
timecourse["Country/Region"].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Benin', 'Bhutan', 'Bolivia',
       'Bosnia and Herzegovina', 'Brazil', 'Brunei', 'Bulgaria',
       'Burkina Faso', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada',
       'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia',
       'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Diamond Princess', 'Cuba', 'Cyprus',
       'Czechia', 'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador',
       'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon',
       'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala',
       'Guinea', 'Guyana', 'Haiti', 'Holy See', 'Honduras', 'Hungary',
       'Iceland', 'India

In [5]:
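# aggregate all statuses by country
# "Province/State" is dropped as well so that sum() only sees numeric columns
drop_cols = ["Province/State", "Lat", "Long"]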

# select countries to display
countries = ["US", "United Kingdom", "Italy", "France", "China", "Korea, South", "Germany", "Japan", "Sweden",
            "Finland", "Spain"]

by_country = timecourse.drop(drop_cols, axis=1).groupby(by=["Country/Region", "Status"]).sum().transpose()[countries]
by_country.rename(columns={'Korea, South':'South Korea'}, inplace=True) # change "Korea, South" to "South Korea"

by_country.head()

Country/Region,US,US,US,United Kingdom,United Kingdom,United Kingdom,Italy,Italy,Italy,France,...,Japan,Sweden,Sweden,Sweden,Finland,Finland,Finland,Spain,Spain,Spain
Status,Confirmed,Deaths,Recovered,Confirmed,Deaths,Recovered,Confirmed,Deaths,Recovered,Confirmed,...,Recovered,Confirmed,Deaths,Recovered,Confirmed,Deaths,Recovered,Confirmed,Deaths,Recovered
1/22/20,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1/23/20,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1/24/20,2,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
1/25/20,2,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
1/26/20,5,0,0,0,0,0,0,0,0,3,...,1,0,0,0,0,0,0,0,0,0


In [6]:
# Select 'confirmed' from the second level
# Use the cross section (.xs) method to get confirmed cases only
by_country.xs('Confirmed', axis=1, level=1).head()

Country/Region,US,United Kingdom,Italy,France,China,South Korea,Germany,Japan,Sweden,Finland,Spain
1/22/20,1,0,0,0,548,1,0,2,0,0,0
1/23/20,1,0,0,0,643,1,0,2,0,0,0
1/24/20,2,0,0,2,920,2,0,2,0,0,0
1/25/20,2,0,0,3,1406,2,0,2,0,0,0
1/26/20,5,0,0,3,2075,3,0,4,0,0,0


In [7]:
# plot cumulative confirmed cases on a log scale
by_country.xs('Confirmed', axis=1, level=1).iplot(
    kind='lines', yaxis_type='log', theme='ggplot',
    title='COVID-19 Confirmed Cases by Country (Updated {})'.format(by_country.index.tolist()[-1]),
    yaxis_title='Number of cases', xaxis_title='Date')

### Deaths

In [8]:
# plot cumulative deaths on a log scale
by_country.xs('Deaths', axis=1, level=1).iplot(
    kind='lines', yaxis_type='log', theme='ggplot',
    title='COVID-19 Deaths by Country (Updated {})'.format(by_country.index.tolist()[-1]),
    yaxis_title='Number of deaths', xaxis_title='Date')

### Recovered

In [9]:
# plot cumulative recoveries on a log scale
by_country.xs('Recovered', axis=1, level=1).iplot(
    kind='lines', yaxis_type='log', theme='ggplot',
    title='COVID-19 Recoveries by Country (Updated {})'.format(by_country.index.tolist()[-1]),
    yaxis_title='Number of recoveries', xaxis_title='Date')

## Plot new cases by day

In [10]:
by_country.info()

<class 'pandas.core.frame.DataFrame'>
Index: 158 entries, 1/22/20 to 6/27/20
Data columns (total 33 columns):
(US, Confirmed)                158 non-null int64
(US, Deaths)                   158 non-null int64
(US, Recovered)                158 non-null int64
(United Kingdom, Confirmed)    158 non-null int64
(United Kingdom, Deaths)       158 non-null int64
(United Kingdom, Recovered)    158 non-null int64
(Italy, Confirmed)             158 non-null int64
(Italy, Deaths)                158 non-null int64
(Italy, Recovered)             158 non-null int64
(France, Confirmed)            158 non-null int64
(France, Deaths)               158 non-null int64
(France, Recovered)            158 non-null int64
(China, Confirmed)             158 non-null int64
(China, Deaths)                158 non-null int64
(China, Recovered)             158 non-null int64
(South Korea, Confirmed)       158 non-null int64
(South Korea, Deaths)          158 non-null int64
(South Korea, Recovered)       158 non-n

In [13]:
# use .diff() to subtract the previous row, turning cumulative totals into daily counts
country_daily = by_country.diff()  # append .tail(60) to keep only the last 60 days

# a negative value appears whenever a cumulative total decreases from one day to the next
country_daily[country_daily < 0] = 0 # replace all negative values with 0
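
To make the effect of `.diff()` and the zero-clipping concrete, here is a small sketch on made-up cumulative counts (the numbers and column names are illustrative, not JHU data). It uses `.clip(lower=0)`, an alternative to the boolean-mask assignment that gives the same result.

```python
import pandas as pd

# Made-up cumulative counts; column "A" drops from 3 to 2 on the fourth day
cumulative = pd.DataFrame(
    {"A": [1, 3, 3, 2, 6], "B": [0, 0, 5, 9, 9]},
    index=["1/22/20", "1/23/20", "1/24/20", "1/25/20", "1/26/20"],
)

daily = cumulative.diff()            # first row is NaN: there is no previous day
daily_clipped = daily.clip(lower=0)  # negative daily counts become 0, NaN stays NaN

print(daily_clipped)
```

Note that clipping hides the correction rather than redistributing it, so the clipped daily counts no longer sum to the final cumulative total.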

In [29]:
# plot line graph of country by new daily confirmed cases
country_daily.xs('Confirmed', axis=1, level=1).iplot(scale='log', title='Daily New Confirmed Cases')

In [28]:
window_size = 5
rolling = country_daily.xs('Confirmed', axis=1, level=1).rolling(window=window_size, min_periods=3).mean()
rolling.iplot(scale='log', yaxis_title='Number of cases', xaxis_title='Date',
              title='Daily New Confirmed Cases (Rolling {} Day Average)'.format(window_size))
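
How `window` and `min_periods` interact is easiest to see on a tiny series. This sketch uses made-up values: with `window=5, min_periods=3`, the first two entries stay NaN, and the third and fourth entries average over the three and four values available so far.

```python
import pandas as pd

daily_counts = pd.Series([1, 2, 3, 4, 5, 6])  # made-up daily counts
smoothed = daily_counts.rolling(window=5, min_periods=3).mean()

# index 2 averages (1, 2, 3); index 4 averages (1..5); index 5 averages (2..6)
print(smoothed.tolist())
```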

In [30]:
# plot line graph of country by new daily deaths
country_daily.xs('Deaths', axis=1, level=1).iplot(scale='log', title='Daily New Confirmed Deaths')

In [31]:
window_size = 5
rolling = country_daily.xs('Deaths', axis=1, level=1).rolling(window=window_size, min_periods=3).mean()
rolling.iplot(yaxis_title='Number of deaths', xaxis_title='Date',
              title='Daily New Confirmed Deaths (Rolling {} Day Average)'.format(window_size))

# Plot COVID-19 cases by state

Dictionary to change state names to abbreviations and vice-versa

Pull data directly from github

### Current issues
- Summed case counts are unreliable, especially in the early phases
    - Currently, only the data reported on a given day is kept for that day. This prevents over-counting cases in the same location; the drawback is that on days when fewer municipalities report new cases, the existing cases of the non-reporting municipalities are not counted. **How do I best handle this??**
    - If I don't remove duplicate data, the result is worse, but I haven't figured out why
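
One possible way to handle the under-counting on days when a location does not report is to pivot to one column per location, forward-fill each location's last known cumulative count, and only then sum. This is a sketch on made-up data, not a tested fix for the JHU files; the column names `location`, `date`, and `confirmed` are hypothetical.

```python
import pandas as pd

# Made-up cumulative reports; location "Y" does not report on 2020-03-02
reports = pd.DataFrame({
    "location": ["X", "Y", "X", "X", "Y"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-01", "2020-03-02",
                            "2020-03-03", "2020-03-03"]),
    "confirmed": [1, 2, 3, 4, 5],
})

# One row per date, one column per location; duplicates collapse to the largest count
wide = reports.pivot_table(index="date", columns="location",
                           values="confirmed", aggfunc="max")

# Carry the last known cumulative count forward; locations with no report yet count as 0
wide = wide.ffill().fillna(0)

total = wide.sum(axis=1)
print(total)
```

Forward-filling relies on the counts being cumulative, so a missing report is read as "unchanged" rather than "zero".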

In [2]:
### make a method to generate a list of date strings

# make a single date from today - 60 days
t = datetime.today().date()
d = timedelta(days=60)
print('default t-d string output:', str(t - d))

# reformat t-d string to be mm-dd-yyyy
print('formatted using strftime')
print((t-d).strftime("%m-%d-%Y"))

default t-d string output: 2020-04-30
formatted using strftime
04-30-2020


In [3]:
# create iterator for days

days = 30
date_iter = [(datetime.today().date() - timedelta(days=d)).strftime("%m-%d-%Y") 
             for d in range(days, 0, -1)]
date_iter

['05-30-2020',
 '05-31-2020',
 '06-01-2020',
 '06-02-2020',
 '06-03-2020',
 '06-04-2020',
 '06-05-2020',
 '06-06-2020',
 '06-07-2020',
 '06-08-2020',
 '06-09-2020',
 '06-10-2020',
 '06-11-2020',
 '06-12-2020',
 '06-13-2020',
 '06-14-2020',
 '06-15-2020',
 '06-16-2020',
 '06-17-2020',
 '06-18-2020',
 '06-19-2020',
 '06-20-2020',
 '06-21-2020',
 '06-22-2020',
 '06-23-2020',
 '06-24-2020',
 '06-25-2020',
 '06-26-2020',
 '06-27-2020',
 '06-28-2020']

In [4]:
def create_date_list(d, method='days'):
    """Return a list of mm-dd-yyyy date strings running up to (but not including) today.

    method='days': d is the number of days prior to today to start from
    method='date': d is the start date (a date object or a parseable date string)
    """
    today = datetime.today().date()
    if method == 'days':
        n_days = d
    elif method == 'date':
        # the original branch referenced an undefined name; derive the day count from the start date
        n_days = (today - pd.to_datetime(d).date()).days
    else:
        raise ValueError("method must be 'days' or 'date'")
    return [(today - timedelta(days=n)).strftime("%m-%d-%Y")
            for n in range(n_days, 0, -1)]
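
For comparison, `pd.date_range` can produce the same kind of list from explicit start and end dates. This is a minimal sketch; `date_strings` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

def date_strings(start, end):
    """Return every date from start to end (inclusive) as an mm-dd-yyyy string."""
    return pd.date_range(start=start, end=end, freq="D").strftime("%m-%d-%Y").tolist()

print(date_strings("2020-05-30", "2020-06-01"))
```

Because both endpoints are explicit, the result does not depend on the day the notebook is run.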

#### Manually read .csv files from original repo

In [5]:
# create list of dates
dateList = create_date_list(60)
suffixes = [s + '.csv' for s in dateList]

# open files, concatenate into list of urls
main_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'
urls = list(map(lambda x: main_url + x, suffixes))

#### Data formatting has changed a bit over time.

##### Dates
- Early files use `M/DD/YY HH:MM` or `MM/DD/YYYY HH:MM`. Later files returned to this format
- Files in between use `YYYY-MM-DDTHH:MM:SS`

##### Locations
- `Admin2` refers to county/municipality level location data
- "Combined_Key" is not needed for my analysis; it's an amalgam of place names
- "Country/Region" changed to "Country_Region" in later timepoints
- "Province/State" changed to "Province_State"
- "Latitude" --> "Lat"
- "Longitude" --> "Long_"

In [6]:
# open all .csvs and concatenate them into one big dataframe
# Rename discordant columns upon opening each dataframe for uniformity
rename_map = {'Lat': 'Latitude',
              'Long_': 'Longitude',
              'Country_Region': 'Country/Region',
              'Province_State': 'Province/State',
              'Last_Update': 'Last Update'}

frames = []
for url in urls:
    # rename() skips keys that are not present, so no per-column checks are needed
    df = pd.read_csv(url).rename(columns=rename_map)
    # Add a file_date field based on the date in the file name
    df['file_date'] = datetime.strptime(url.split(".")[-2][-10:], '%m-%d-%Y').date()
    frames.append(df)

data = pd.concat(frames, axis=0, sort=True)


In [7]:
data['file_date'].nunique()

60

In [8]:
data.head()

Unnamed: 0,Active,Admin2,Case-Fatality_Ratio,Combined_Key,Confirmed,Country/Region,Deaths,FIPS,Incidence_Rate,Last Update,Latitude,Longitude,Province/State,Recovered,file_date
0,31,Abbeville,,"Abbeville, South Carolina, US",31,US,0,45001.0,,2020-05-01 02:32:28,34.223334,-82.461707,South Carolina,0,2020-04-30
1,120,Acadia,,"Acadia, Louisiana, US",130,US,10,22001.0,,2020-05-01 02:32:28,30.295065,-92.414197,Louisiana,0,2020-04-30
2,260,Accomack,,"Accomack, Virginia, US",264,US,4,51001.0,,2020-05-01 02:32:28,37.767072,-75.632346,Virginia,0,2020-04-30
3,655,Ada,,"Ada, Idaho, US",671,US,16,16001.0,,2020-05-01 02:32:28,43.452658,-116.241552,Idaho,0,2020-04-30
4,1,Adair,,"Adair, Iowa, US",1,US,0,19001.0,,2020-05-01 02:32:28,41.330756,-94.471059,Iowa,0,2020-04-30


#### Handle dates
Date formats:
- YYYY-MM-DD hh:mm:ss
- M/DD/YYYY hh:mm
- YYYY-MM-DDThh:mm:ss


#### Functions below
`format_date()`: Returns a date only

`format_datetime()`: Returns a date and timestamp

In [9]:
# format dates

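def format_date(s):
    """Parse a 'Last Update' string in any of the known formats and return the date only."""
    if "T" in s:  # YYYY-MM-DDThh:mm:ss
        date = datetime.strptime(s.split('T')[0], '%Y-%m-%d').date()
    elif "/" in s:  # M/DD/YYYY or M/DD/YY
        try:
            date = datetime.strptime(s.split(" ")[0], '%m/%d/%Y').date()
        except ValueError:  # 4-digit year did not match, fall back to 2-digit year
            date = datetime.strptime(s.split(" ")[0], '%m/%d/%y').date()
    else:  # YYYY-MM-DD hh:mm:ss
        date = datetime.strptime(s.split(" ")[0], '%Y-%m-%d').date()
    return date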

# tests
print(format_date('2020-03-23 23:19:21'))
print(format_date('2020-03-20T23:19:21'))
print(format_date('3/22/2020 23:19:21'))

2020-03-23
2020-03-20
2020-03-22


In [10]:
## TODO
# APPROACH 2
# For each 
# format dates

def format_datetime(s):
    """Parse a 'Last Update' string in any of the known formats, keeping the timestamp."""
    if "T" in s:  # YYYY-MM-DDThh:mm:ss
        date = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S')
    elif "/" in s and len(s.split(' ')[1]) <= 5:  # time part without seconds
        try:
            date = datetime.strptime(s, '%m/%d/%Y %H:%M')
        except ValueError:  # fall back to 2-digit year
            date = datetime.strptime(s, '%m/%d/%y %H:%M')
    elif "/" in s and len(s.split(' ')[1]) > 5: # likely to capture those containing seconds
        try:
            date = datetime.strptime(s, '%m/%d/%Y %H:%M:%S')
        except ValueError:  # fall back to 2-digit year
            date = datetime.strptime(s, '%m/%d/%y %H:%M:%S')
    else:  # YYYY-MM-DD hh:mm:ss
        date = datetime.strptime(s, '%Y-%m-%d %H:%M:%S')
    return date

# tests
print(format_datetime('2020-03-23 23:19:21'))
print(format_datetime('2020-03-20T23:19:21'))
print(format_datetime('3/22/2020 23:19:21'))
print(format_datetime('3/22/2020 23:19'))
print(format_datetime('2/1/2020 1:52'))

2020-03-23 23:19:21
2020-03-20 23:19:21
2020-03-22 23:19:21
2020-03-22 23:19:00
2020-02-01 01:52:00


In [11]:
# apply date to dataframe
data['Date Updated'] = data['Last Update'].apply(format_date)
data['Last Update'] = data['Last Update'].apply(format_datetime)
data.head()

Unnamed: 0,Active,Admin2,Case-Fatality_Ratio,Combined_Key,Confirmed,Country/Region,Deaths,FIPS,Incidence_Rate,Last Update,Latitude,Longitude,Province/State,Recovered,file_date,Date Updated
0,31,Abbeville,,"Abbeville, South Carolina, US",31,US,0,45001.0,,2020-05-01 02:32:28,34.223334,-82.461707,South Carolina,0,2020-04-30,2020-05-01
1,120,Acadia,,"Acadia, Louisiana, US",130,US,10,22001.0,,2020-05-01 02:32:28,30.295065,-92.414197,Louisiana,0,2020-04-30,2020-05-01
2,260,Accomack,,"Accomack, Virginia, US",264,US,4,51001.0,,2020-05-01 02:32:28,37.767072,-75.632346,Virginia,0,2020-04-30,2020-05-01
3,655,Ada,,"Ada, Idaho, US",671,US,16,16001.0,,2020-05-01 02:32:28,43.452658,-116.241552,Idaho,0,2020-04-30,2020-05-01
4,1,Adair,,"Adair, Iowa, US",1,US,0,19001.0,,2020-05-01 02:32:28,41.330756,-94.471059,Iowa,0,2020-04-30,2020-05-01


### Fixing dates
#### Special note about dates contained in these files
Be careful of duplicates on a given day. The daily reports have some line items that were last updated on previous days (for example, 2-22-2020, https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/02-22-2020.csv)

- Have to figure out a way to *not* multi-count these occurrences on a given day, so more processing will be needed before `sum()` operations can be used.

#### Approaches attempted
- I added a `file_date` column upon import that labels the date from the url. Should be able to combine that plus the "Date Updated" field to make a unique ID for each country and avoid double-counting
    - Idea 1: For each municipality or item, keep the item that has the latest (largest) file_date, remove earlier ones
    - Easier idea (performed): Check the above link, and go visit the file from the "last updated" day - it is probably easiest and quickest to simply remove records that don't match the file_date and then have `connectgaps=True` in the graph
        - **This does not yield accurate data overall. Some of the updates that occur later than the given time are not included, or they are double-counted**
        - *Best approach:* For each *date* (`datetime.date()`), keep the latest `datetime.time()`
        
##### Approach 2
- For each *date* (`datetime.date()`), keep the latest (largest) `datetime()` (includes update time)
- Applying the `format_datetime()` function element-wise (via `.apply()`) means that "Last Update" is now properly formatted with date and time
- Function to keep only the largest "Last Update" value for each "Date Updated" -- **This may also prove erroneous**

##### Approach 3
- **CURRENTLY WORKING ON** For items with equal `Date Updated`, keep the one with highest `file_date` value

##### Approach 4
- What if I just use the file_date?
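
Approach 3 (for items with equal `Date Updated`, keep the one with the highest `file_date`) could be sketched with `sort_values` followed by `drop_duplicates(keep='last')`. The rows below are made up for illustration, and this is only one possible reading of the approach; on the real data the duplicate key would probably also need `Admin2`, which is an assumption.

```python
import pandas as pd
from datetime import date

# Made-up records: the New York row updated on 4/22 appears in two daily files
records = pd.DataFrame({
    "Province/State": ["New York", "New York", "New York", "Ohio"],
    "Date Updated": [date(2020, 4, 22), date(2020, 4, 22), date(2020, 4, 23), date(2020, 4, 22)],
    "file_date": [date(2020, 4, 22), date(2020, 4, 23), date(2020, 4, 23), date(2020, 4, 22)],
    "Confirmed": [100, 100, 120, 10],
})

deduped = (
    records.sort_values("file_date")  # the latest file ends up last within each group
    .drop_duplicates(subset=["Province/State", "Date Updated"], keep="last")
    .sort_values(["Province/State", "Date Updated"])
    .reset_index(drop=True)
)

print(deduped)
```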

In [12]:
testData = data.copy()

test = testData[(testData['Date Updated'] >= pd.Timestamp(2020, 4, 20).date()) & 
                (testData['Date Updated'] <= pd.Timestamp(2020, 4, 25).date()) &
                (testData['Province/State'] == 'New York')]

#test.groupby('file_date').sum()
test[test['file_date'] == pd.to_datetime('04-23-2020').date()].sum()
#test[test['Last Update'] == test['Last Update'].max()]

Active                 0.0
Admin2                 0.0
Case-Fatality_Ratio    0.0
Combined_Key           0.0
Confirmed              0.0
Country/Region         0.0
Deaths                 0.0
FIPS                   0.0
Incidence_Rate         0.0
Last Update            0.0
Latitude               0.0
Longitude              0.0
Province/State         0.0
Recovered              0.0
file_date              0.0
Date Updated           0.0
dtype: float64

## Filter for US and states

In [13]:
# drop unneeded columns
#drop_cols = ['Combined_Key', 'Last Update', 'Latitude', 'Longitude', 'Country/Region', 'Admin2', 'file_date']
drop_cols = ['Combined_Key', 'Latitude', 'Longitude', 'Country/Region', 'Admin2']

states = data[data['Country/Region'] == 'US'].drop(drop_cols, axis=1)

# Reorder columns for easier reading
newcols = ['Province/State', 'FIPS', 'file_date','Date Updated', 'Active', 'Confirmed', 'Deaths', 'Recovered']
states = states[newcols]

# Rename "Date Updated" to "Date"
#states.rename(columns={'Date Updated':'Date'}, inplace=True)
states.rename(columns={'file_date':'Date'}, inplace=True)
states.head()

Unnamed: 0,Province/State,FIPS,Date,Date Updated,Active,Confirmed,Deaths,Recovered
0,South Carolina,45001.0,2020-04-30,2020-05-01,31,31,0,0
1,Louisiana,22001.0,2020-04-30,2020-05-01,120,130,10,0
2,Virginia,51001.0,2020-04-30,2020-05-01,260,264,4,0
3,Idaho,16001.0,2020-04-30,2020-05-01,655,671,16,0
4,Iowa,19001.0,2020-04-30,2020-05-01,1,1,0,0


#### Cleaning Province/State data
1. Manually fix any small errors (e.g., 'Chicago')
2. Convert abbreviations to full state names using `abbrev_us_state`
3. Drop cruise ships not assigned to a state
4. Drop territories
5. Rename things that need renaming

In [14]:
# rename 'Chicago' entries to 'Chicago, IL'
# (replace(..., inplace=True) returns None, so reassign the result instead)
states = states.replace('Chicago', 'Chicago, IL')

In [15]:
# United States of America Python Dictionary to translate States,
# Districts & Territories to Two-Letter codes and vice versa.
#
# https://gist.github.com/rogerallen/1583593
#
# Dedicated to the public domain.  To the extent possible under law,
# Roger Allen has waived all copyright and related or neighboring
# rights to this code.

us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Palau': 'PW',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

# thank you to @kinghelix and @trevormarburger for this idea
abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))

# Simple test examples
if __name__ == '__main__':
    print("Wisconsin --> WI?", us_state_abbrev['Wisconsin'] == 'WI')
    print("WI --> Wisconsin?", abbrev_us_state['WI'] == 'Wisconsin')

Wisconsin --> WI? True
WI --> Wisconsin? True


In [16]:
# take the ones with abbreviations and convert to state names using abbrev_us_state
def convert_to_state(s):
    s = s.split(',')[-1].strip()
    
    # fix D.C. abbreviation
    if s == "D.C.":
        s = "DC"
    
    if len(s) == 2 and s in abbrev_us_state:
        return abbrev_us_state[s]
    elif '(' in s and len(s.split(' ')[0]) == 2: # has parenthesis and state abbrev (cruise ship)
        return abbrev_us_state[s.split(' ')[0]]
    else:
        return s
    
states['State'] = states['Province/State'].apply(convert_to_state)
states.head()

Unnamed: 0,Province/State,FIPS,Date,Date Updated,Active,Confirmed,Deaths,Recovered,State
0,South Carolina,45001.0,2020-04-30,2020-05-01,31,31,0,0,South Carolina
1,Louisiana,22001.0,2020-04-30,2020-05-01,120,130,10,0,Louisiana
2,Virginia,51001.0,2020-04-30,2020-05-01,260,264,4,0,Virginia
3,Idaho,16001.0,2020-04-30,2020-05-01,655,671,16,0,Idaho
4,Iowa,19001.0,2020-04-30,2020-05-01,1,1,0,0,Iowa


In [17]:
# Drop territories & cruises
to_remove = ['Virgin Islands', 'United States Virgin Islands', 'Guam', 'Puerto Rico', 'US',
            'Diamond Princess', 'Grand Princess', 'Unassigned Location (From Diamond Princess)',
            'Grand Princess Cruise Ship', 'Northern Mariana Islands', 'Recovered',
            'American Samoa', 'U.S.', 'Wuhan Evacuee'] # define territories to drop
states = states[~states['State'].isin(to_remove)] # drop territories

print(states['State'].nunique()) # check number of states

51


### Plot state data

In [18]:
# group by state
grouped_states = states.drop(['Province/State', 'FIPS'], axis=1).groupby(
    by=['State', 'Date']).sum()

# fill in active cases
grouped_states['Active'] = grouped_states['Confirmed'] - grouped_states['Deaths'] - grouped_states['Recovered']

grouped_states.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Active,Confirmed,Deaths,Recovered
State,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama,2020-04-30,6816,7088,272,0
Alabama,2020-05-01,7005,7294,289,0
Alabama,2020-05-02,7323,7611,288,0
Alabama,2020-05-03,7598,7888,290,0
Alabama,2020-05-04,7814,8112,298,0


In [19]:
# plot confirmed cases
states_confirmed = grouped_states.reset_index().pivot(index='Date', columns='State', values='Confirmed').iplot(
    kind='lines', yaxis_type='log', theme='ggplot', connectgaps=True, asFigure=True,
    title='COVID-19 Confirmed Cases by State', yaxis_title='Number of Cases', xaxis_title='Date')

states_confirmed

In [20]:
# plot deaths by state
states_deaths = grouped_states.reset_index().pivot(index='Date', columns='State', values='Deaths').iplot(
    kind='lines', yaxis_type='log', theme='ggplot', connectgaps=True, asFigure=True,
    title='COVID-19 Deaths by State', yaxis_title='Number of Deaths', xaxis_title='Date')

states_deaths

In [21]:
# plot active cases by state
states_active = grouped_states.reset_index().pivot(index='Date', columns='State', values='Active').iplot(
    kind='lines', yaxis_type='log', theme='ggplot', connectgaps=True, asFigure=True,
    title='Active COVID-19 Cases by State', yaxis_title='Number of Cases', xaxis_title='Date')

states_active

In [22]:
# save figures
plotly.io.write_html(states_confirmed, file='states-confirmed.html')
plotly.io.write_html(states_deaths, file='states-deaths.html')

OSError: [Errno 22] Invalid argument

## Plot number of new cases for each day
1. Format data using `pivot()` to get an arrangement like this:

| Date | state1 | state2 | state3 | ... |
|------|--------|--------|--------|-----|
| date | xxx    | xxx    | xxx    | ... |
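
As a small illustration of that arrangement, the sketch below pivots a made-up long-form table (one row per state and date) into one row per date and one column per state, then takes the day-over-day difference. The values are invented; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up long-form data: one row per (Date, State)
long_form = pd.DataFrame({
    "Date": ["2020-05-01", "2020-05-01", "2020-05-02", "2020-05-02"],
    "State": ["Alabama", "Alaska", "Alabama", "Alaska"],
    "Confirmed": [10, 1, 15, 1],
})

# rows = dates, columns = states, cells = cumulative confirmed cases
wide = long_form.pivot(index="Date", columns="State", values="Confirmed")

# new cases per day; the first date has no previous row, so it is NaN
new_cases = wide.diff()

print(wide)
print(new_cases)
```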

### TODO: Figure out why some of the numbers look wrong
- For states like IL and NY, day 0 shows their total case load

In [23]:
# make DataFrames of daily cases and deaths with rows=date, cols=state
states_daily_confirmed = grouped_states.reset_index().pivot(index='Date', columns='State', values='Confirmed').diff()
states_daily_confirmed[states_daily_confirmed < 0] = 0 # replace all negative values with 0

states_daily_deaths = grouped_states.reset_index().pivot(index='Date', columns='State', values='Deaths').diff()
states_daily_deaths[states_daily_deaths < 0] = 0 # replace all negative values with 0

In [24]:
# make a function to make it easier to select dates from a dataframe

def select_dates(first, last):
    # return a list of dates, properly formatted
    dates = [pd.to_datetime(first).date(),
            pd.to_datetime(last).date()]
    return dates

first, last = select_dates('2020-04-21', '2020-04-25') # input dates here

# filter dataframe for dates
test_dates = states[(states['Date'] >= first)
                   & (states['Date'] <= last)]

test_dates.info() # print info of filtered dateframe

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 9 columns):
Province/State    0 non-null object
FIPS              0 non-null float64
Date              0 non-null object
Date Updated      0 non-null object
Active            0 non-null int64
Confirmed         0 non-null int64
Deaths            0 non-null int64
Recovered         0 non-null int64
State             0 non-null object
dtypes: float64(1), int64(4), object(4)
memory usage: 0.0+ bytes


In [25]:
states_daily_confirmed.iplot(kind='lines', theme='ggplot', connectgaps=True, asFigure=True,
    title='New COVID-19 Cases by State', yaxis_title='Number of Cases', xaxis_title='Date')

In [27]:
window_size = 7
rolling = states_daily_confirmed.rolling(window=window_size, min_periods=3).mean()
states_confirmed_plot = rolling.iplot(scale='log', yaxis_title='Number of cases', xaxis_title='Date',
              title='Daily New Confirmed Cases by State (Rolling {} Day Average)'.format(window_size), asFigure = True)

states_confirmed_plot

In [28]:
# save states plot
plotly.io.write_html(states_confirmed_plot, file='states-daily_confirmed.html')

Plot graphs of new cases

In [64]:
# use plotly express
import plotly.express as px

#fig = px.scatter(grouped_states.reset_index(), x='Date', y='New cases', color='State', log_y=True)
fig = px.line(grouped_states.reset_index(), x='Date', y='New cases', color='State', log_y=True,
             title='USA states: New daily cases')


fig.show()
#px.bar(combine_us_data, x='Province_State', y='Count', text='Count', barmode='group', color='Case',
#            title='USA State-wise combined number of confirmed deaths, recoveries, and active cases')

plotly.io.write_html(fig, file='states-new_cases.html')

ValueError: Value of 'y' is not the name of a column in 'data_frame'. Expected one of ['State', 'Date', 'Active', 'Confirmed', 'Deaths', 'Recovered'] but received: New cases
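
The `ValueError` above says that `grouped_states` has no 'New cases' column at this point; the cell that created it is not part of this export, although later outputs do show 'New cases' and 'New deaths'. One plausible way to derive such columns, sketched here on made-up data, is a per-state `diff()` of the cumulative counts. Whether the notebook computed them this way is an assumption.

```python
import pandas as pd

# Made-up cumulative counts indexed by (State, Date), mimicking grouped_states
toy = pd.DataFrame({
    "State": ["Alabama"] * 3 + ["Alaska"] * 3,
    "Date": ["2020-05-01", "2020-05-02", "2020-05-03"] * 2,
    "Confirmed": [10, 15, 25, 1, 1, 3],
    "Deaths": [0, 1, 1, 0, 0, 0],
})
toy_grouped = toy.set_index(["State", "Date"]).sort_index()

# diff() within each state, so one state's first day is not compared to another state's last day
toy_grouped["New cases"] = toy_grouped.groupby(level="State")["Confirmed"].diff()
toy_grouped["New deaths"] = toy_grouped.groupby(level="State")["Deaths"].diff()

print(toy_grouped)
```

Each state's first date comes out as NaN, which matches the NaN shown for Alabama in the cross-section output further down.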

#### Plot new deaths

In [43]:
fig = px.line(grouped_states.reset_index(), x='Date', y='New deaths', color='State', log_y=True,
             title='USA states: Deaths per day')
fig.show()

In [44]:
grouped_states.loc['Minnesota', pd.to_datetime('04-17-2020').date()] # check data manually

Active        1959.0
Confirmed     2070.0
Deaths         111.0
Recovered        0.0
New cases      261.0
New deaths      24.0
Name: (Minnesota, 2020-04-17), dtype: float64

In [45]:
# how to cross section by a specific date
grouped_states.xs(pd.to_datetime('2020-03-13').date(), level=1, axis=0).head(10)

Unnamed: 0_level_0,Active,Confirmed,Deaths,Recovered,New cases,New deaths
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alabama,5,5,0,0,,
Alaska,1,1,0,0,1.0,0.0
Arizona,8,9,0,1,0.0,0.0
Arkansas,6,6,0,0,0.0,0.0
California,272,282,4,6,61.0,0.0
Colorado,49,49,0,0,4.0,0.0
Connecticut,11,11,0,0,6.0,0.0
Delaware,4,4,0,0,3.0,0.0
District of Columbia,10,10,0,0,0.0,0.0
Florida,48,50,2,0,15.0,0.0


In [46]:
## make bar graph for the number of new cases in the past 7 days

# Get last 7 days of data, average and produce dataframe with relevant data
avg_new_cases = grouped_states.reset_index()[grouped_states.reset_index()['Date'] >= (datetime.today()-timedelta(days=7)).date()].groupby(
    by=['State']).mean().drop(['Active', 'Confirmed', 'Deaths', 'Recovered'], axis=1).reset_index()

# plot bar
fig = px.bar(avg_new_cases, x='State', y='New cases', barmode='group', title='Average number of new daily cases for the week leading up to {}'.format(datetime.today().date()))
fig.show()

plotly.io.write_html(fig, file='states-new_cases_7-day_avg_{}.html'.format(datetime.today().date()))

In [47]:
## make bar graph for the average number of new daily deaths over the past 7 days

# Get last 7 days of data, average and produce dataframe with relevant data
flat_states = grouped_states.reset_index()
week_ago = (datetime.today()-timedelta(days=7)).date()
avg_new_cases = flat_states[flat_states['Date'] > week_ago].groupby(
    by=['State']).mean(numeric_only=True).drop(['Active', 'Confirmed', 'Deaths', 'Recovered'], axis=1).reset_index()

# plot bar
px.bar(avg_new_cases, x='State', y='New deaths', barmode='group', 
       title='Average number of new daily deaths for the week leading up to {}'.format(datetime.today().date()))

In [48]:
# Example of using timedelta to get a moving time window based on current day
(datetime.today()-timedelta(days=7)).date()

datetime.date(2020, 5, 15)
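
The `timedelta` window can be wrapped in a small helper. A sketch, assuming a flat frame with a `Date` column of `datetime.date` values. The name `last_n_days` and the `demo` numbers are hypothetical, and a fixed `today` is passed in so the result is reproducible.

```python
import pandas as pd
from datetime import date, timedelta

def last_n_days(df, n, today):
    """Return rows whose 'Date' falls in the n calendar days ending on `today`."""
    start = today - timedelta(days=n - 1)
    return df[(df["Date"] >= start) & (df["Date"] <= today)]

# Ten consecutive days of made-up counts for one state.
demo = pd.DataFrame({
    "State": ["Minnesota"] * 10,
    "Date": [date(2020, 5, 13) + timedelta(days=i) for i in range(10)],
    "New cases": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

window = last_n_days(demo, 7, today=date(2020, 5, 22))
avg = window.groupby("State")["New cases"].mean()
print(avg)
```

Note that comparing `Date >= today - timedelta(days=7)` spans eight calendar dates (today and the seven before it), which is why the helper subtracts `n - 1`.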

## Plot bars of current totals (using Bonny's demo below)

In [58]:
today = grouped_states.xs(pd.to_datetime('05-21-2020').date(), level=1).reset_index()
today.head()

Unnamed: 0,State,Active,Confirmed,Deaths,Recovered,New cases,New deaths
0,Alabama,12759,13288,529,0,236.0,7.0
1,Alaska,391,401,10,0,0.0,0.0
2,Arizona,14584,15348,764,0,442.0,17.0
3,Arkansas,5348,5458,110,0,455.0,3.0
4,California,84448,88031,3583,0,2034.0,86.0


In [59]:
# plot bar of total confirmed, deaths
combine_us_data = pd.melt(today, id_vars='State', 
                          value_vars=['Confirmed', 'Deaths'],
                          value_name='Count', var_name='Case')
combine_us_data.head()

Unnamed: 0,State,Case,Count
0,Alabama,Confirmed,13288
1,Alaska,Confirmed,401
2,Arizona,Confirmed,15348
3,Arkansas,Confirmed,5458
4,California,Confirmed,88031
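
For reference, a sketch of what `pd.melt` does to the wide table, using the Alabama and Alaska rows from `today.head()`; the `wide` and `long_form` names are illustrative only.

```python
import pandas as pd

wide = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Confirmed": [13288, 401],
    "Deaths": [529, 10],
})

# One row per (State, Case) pair: the column names listed in value_vars
# move into 'Case' and their values into 'Count'.
long_form = pd.melt(wide, id_vars="State",
                    value_vars=["Confirmed", "Deaths"],
                    var_name="Case", value_name="Count")
print(long_form)
```

The long layout is what lets `px.bar(..., color='Case', barmode='group')` draw one bar per case type for each state.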


In [61]:
# plot bar
fig = px.bar(combine_us_data, x='State', y='Count', text='Count', barmode='group', color='Case',
            title='USA state-wise number of confirmed cases and deaths')
fig.show()

Next, load each state's population and build a state-to-population dictionary

In [124]:
# import state populations
state_pops = pd.read_csv('state_pops.csv')
state_pops = state_pops[['State', 'Pop']]
# convert to dictionary
state_pops = state_pops.set_index('State')['Pop'].to_dict()

In [126]:
# add population to combine_us_data
combine_us_data['Pop'] = combine_us_data['State'].apply(
    lambda x: state_pops[x])
combine_us_data.head()

Unnamed: 0,State,Case,Count,Pop
0,Alabama,Confirmed,13288,4908621
1,Alaska,Confirmed,401,734002
2,Arizona,Confirmed,15348,7378494
3,Arkansas,Confirmed,5458,3038999
4,California,Confirmed,88031,39937489
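
The `apply(lambda x: state_pops[x])` lookup raises a `KeyError` if any state is missing from the population file. A sketch of an alternative using `Series.map`, which leaves `NaN` for missing keys so they can be listed. The "Unlisted Territory" row is hypothetical and is included only to show the missing-key case.

```python
import pandas as pd

state_pops = {"Alabama": 4908621, "Alaska": 734002}  # values from the table above

demo = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Unlisted Territory"],  # last row is hypothetical
    "Count": [13288, 401, 0],
})

# map() looks each state up in the dictionary; unknown states become NaN.
demo["Pop"] = demo["State"].map(state_pops)

# Report which states had no population entry instead of failing.
missing = demo.loc[demo["Pop"].isna(), "State"].tolist()
print(missing)
```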


In [128]:
# get per capita rates
perCapita = combine_us_data.copy()
perCapita['Per 100k'] = perCapita['Count']/perCapita['Pop']*100000
perCapita.drop('Count', axis=1, inplace=True)
perCapita.head()

Unnamed: 0,State,Case,Pop,Per 100k
0,Alabama,Confirmed,4908621,270.707394
1,Alaska,Confirmed,734002,54.632004
2,Arizona,Confirmed,7378494,208.009927
3,Arkansas,Confirmed,3038999,179.598611
4,California,Confirmed,39937489,220.42197
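
A quick check of the per-100k arithmetic on two rows from the table above, plus a rounded column for bar labels. The `Label` column is an illustrative addition, not part of the original notebook.

```python
import pandas as pd

demo = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Count": [13288, 401],
    "Pop": [4908621, 734002],
})

# cases per 100,000 residents = count / population * 100,000
demo["Per 100k"] = demo["Count"] / demo["Pop"] * 100_000

# Rounded copy for display, so bar labels are not printed with six decimals.
demo["Label"] = demo["Per 100k"].round(1)
print(demo)
```

Passing `text='Label'` instead of `text='Per 100k'` to `px.bar` should then show the shorter values on the bars.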


In [131]:
# plot bar of per capita counts
fig = px.bar(perCapita, x='State', y='Per 100k', text='Per 100k', barmode='group', color='Case',
            title='USA State-wise combined number of confirmed cases and deaths per 100k')
fig.show()

In [148]:
# dictionary for red or blue state
colors = pd.read_csv('state_color.csv')
colors.head()

Unnamed: 0,rank,State,By governor,By past 4 elections
0,1,California,Blue,Blue
1,2,Texas,Red,Red
2,3,Florida,Red,Red
3,4,New York,Blue,Blue
4,5,Pennsylvania,Blue,Blue


In [150]:
# create dictionaries
governor = colors.set_index('State')['By governor'].to_dict()
elections = colors.set_index('State')['By past 4 elections'].to_dict()

In [151]:
# classify by governorship

# combine_us_data with color added
combine_us_data['Color'] = combine_us_data['State'].apply(
    lambda x: governor[x])
combine_us_data.head()

Unnamed: 0,State,Case,Count,Pop,Color
0,Alabama,Confirmed,13288,4908621,Red
1,Alaska,Confirmed,401,734002,Red
2,Arizona,Confirmed,15348,7378494,Red
3,Arkansas,Confirmed,5458,3038999,Red
4,California,Confirmed,88031,39937489,Blue


In [156]:
## classify by governorship and population
# DataFrame today with color added
today['Color'] = today['State'].apply(
    lambda x: governor[x])
today['Pop'] = today['State'].apply(
    lambda x: state_pops[x])

political_groups = today.groupby(['Color']).sum(numeric_only=True)  # skip the text 'State' column
political_groups.head()

Color,Active,Confirmed,Deaths,Recovered,New cases,New deaths,Pop
Blue,1021348,1093627,72279,0,14356.0,758.0,180033912
Red,457918,480199,22281,0,10879.0,504.0,151285080


In [163]:
today.head()

Unnamed: 0,State,Active,Confirmed,Deaths,Recovered,New cases,New deaths,Color,Pop
0,Alabama,12759,13288,529,0,236.0,7.0,Red,4908621
1,Alaska,391,401,10,0,0.0,0.0,Red,734002
2,Arizona,14584,15348,764,0,442.0,17.0,Red,7378494
3,Arkansas,5348,5458,110,0,455.0,3.0,Red,3038999
4,California,84448,88031,3583,0,2034.0,86.0,Blue,39937489


In [160]:
# per capita numbers
political_groups = political_groups[['Confirmed', 'Deaths', 'Pop']].copy()  # copy so the column assignments below don't trigger SettingWithCopyWarning

# convert political_groups into per_capita
political_groups['Confirmed per 100k'] = political_groups['Confirmed']/political_groups['Pop']*100000
political_groups['Deaths per 100k'] = political_groups['Deaths']/political_groups['Pop']*100000

political_groups

Color,Confirmed,Deaths,Pop,Confirmed per 100k,Deaths per 100k
Blue,1093627,72279,180033912,607.456111,40.147436
Red,480199,22281,151285080,317.413323,14.727824
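
The group rates above divide summed counts by summed population, which weights each state by its size. A sketch of how that differs from averaging the individual state rates, using only the four Red states visible in `today.head()`, so the numbers will not match the full table.

```python
import pandas as pd

red = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "Confirmed": [13288, 401, 15348, 5458],
    "Pop": [4908621, 734002, 7378494, 3038999],
})

# Pooled rate: total cases over total population (what the cell above does).
pooled = red["Confirmed"].sum() / red["Pop"].sum() * 100_000

# Unweighted mean of state rates: every state counts equally,
# so a small state like Alaska pulls the figure down.
mean_of_rates = (red["Confirmed"] / red["Pop"] * 100_000).mean()

print(round(pooled, 1), round(mean_of_rates, 1))
```

For a per-capita comparison between groups, the pooled figure is usually the one wanted, since it answers "how many cases per 100k people living in these states".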


In [165]:
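# map each label to its color explicitly rather than relying on the order the labels first appear
px.bar(today, x='State', y='Deaths', color='Color', color_discrete_map={'Red': 'red', 'Blue': 'blue'})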

## Plotly Express functions tutorial
From Bonny McClain demos

In [49]:
import plotly.express as px

# get dataset and perform functions
covid_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/05-04-2020.csv')
covid_data['Active'] = covid_data['Confirmed'] - covid_data['Deaths'] - covid_data['Recovered']
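# select several columns with a list (double brackets); tuple-style selection was removed in newer pandas
result = covid_data.groupby('Province_State')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()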
print(result)

                       Province_State  Confirmed  Deaths  Recovered  Active
0                             Alabama       8112     298          0    7814
1                              Alaska        370       9          0     361
2                             Alberta       5836     104          0    5732
3                            Anguilla          3       0          3       0
4                               Anhui        991       6        985       0
5                             Arizona       8924     362          0    8562
6                            Arkansas       3491      80          0    3411
7                               Aruba        100       2         81      17
8        Australian Capital Territory        107       3        103       1
9                             Beijing        593       9        554      30
10                            Bermuda        115       7         54      54
11   Bonaire, Sint Eustatius and Saba          6       0          0       6
12          

In [50]:
combine_us_data = covid_data[covid_data['Country_Region'] == 'US'].drop(['Country_Region', 'Lat', 'Long_'], axis=1)
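case_cols = ['Confirmed', 'Deaths', 'Recovered', 'Active']
# keep rows with at least one reported count; summing only the case columns avoids text and ID columns
combine_us_data = combine_us_data[combine_us_data[case_cols].sum(axis=1) > 0]
combine_us_data = combine_us_data.groupby(['Province_State'])[case_cols].sum().reset_index()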
combine_us_data.head()

Unnamed: 0,Province_State,Confirmed,Deaths,Recovered,Active
0,Alabama,8112,298,0,7814
1,Alaska,370,9,0,361
2,Arizona,8924,362,0,8562
3,Arkansas,3491,80,0,3411
4,California,55884,2278,0,53606


In [51]:
combine_us_data = pd.melt(combine_us_data, id_vars='Province_State', 
                          value_vars=['Confirmed', 'Deaths', 'Recovered', 'Active'],
                          value_name='Count', var_name='Case')
fig = px.bar(combine_us_data, x='Province_State', y='Count', text='Count', barmode='group', color='Case',
            title='USA state-wise combined number of confirmed cases, deaths, recoveries, and active cases')
fig.show()

In [52]:
combine_us_data.head()

Unnamed: 0,Province_State,Case,Count
0,Alabama,Confirmed,8112
1,Alaska,Confirmed,370
2,Arizona,Confirmed,8924
3,Arkansas,Confirmed,3491
4,California,Confirmed,55884


In [53]:
grouped_states.head()

State,Date,Active,Confirmed,Deaths,Recovered,New cases,New deaths
Alabama,2020-03-13,5,5,0,0,,
Alabama,2020-03-14,6,6,0,0,1.0,0.0
Alabama,2020-03-15,12,12,0,0,6.0,0.0
Alabama,2020-03-16,29,29,0,0,17.0,0.0
Alabama,2020-03-17,39,39,0,0,10.0,0.0
