<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Python-set-up" data-toc-modified-id="Python-set-up-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Python set-up</a></span></li><li><span><a href="#Get-US-population-data" data-toc-modified-id="Get-US-population-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get US population data</a></span></li><li><span><a href="#get-the-data" data-toc-modified-id="get-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>get the data</a></span></li><li><span><a href="#Semilog-plot-of-US-States" data-toc-modified-id="Semilog-plot-of-US-States-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Semilog plot of US States</a></span></li><li><span><a href="#Plot-of-new-vs-cumulative" data-toc-modified-id="Plot-of-new-vs-cumulative-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Plot of new vs cumulative</a></span></li><li><span><a href="#Regional-per-capita" data-toc-modified-id="Regional-per-capita-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Regional per capita</a></span></li><li><span><a href="#Growth-factor" data-toc-modified-id="Growth-factor-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Growth factor</a></span></li><li><span><a href="#Plot-new-cases:-raw-and-smoothed" data-toc-modified-id="Plot-new-cases:-raw-and-smoothed-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Plot new cases: raw and smoothed</a></span></li><li><span><a href="#Bring-it-all-together" data-toc-modified-id="Bring-it-all-together-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Bring it all together</a></span></li></ul></div>

## Python set-up

In [1]:
# imports
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
from pathlib import Path

# pandas
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

# scraping
from selenium.webdriver import Chrome
import re

# local imports
sys.path.append(r'../bin')
import plotstuff as ps

# plotting
plt.style.use('ggplot')
%matplotlib inline

# save location
CHART_DIRECTORY = '../charts'
Path(CHART_DIRECTORY).mkdir(parents=True, exist_ok=True)
CHART_DIRECTORY += '/zzUS-'

## Get US population data

In [2]:
wiki = 'https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population'
browser = Chrome('../Chrome/chromedriver')
browser.get(wiki)

In [3]:
html = browser.find_element_by_xpath('//table')
html = '<table>' + html.get_attribute('innerHTML') + '</table>'
html = re.sub('<span[^>]*>Sort Table[^/]*/span>', '', html)
population = pd.read_html(html)[0]
population = population[['State', 'Population estimate, July 1, 2019[2]']]
population = population.set_index('State')
population = population[population.columns[0]]
population = population[:-4] # drop various totals
population = population.rename({'U.S. Virgin Islands': 'Virgin Islands'})

In [4]:
browser.quit()

## get the data

In [5]:
source = 'https://covidtracking.com/'
url = source + 'api/v1/states/daily.json'
data = pd.read_json(url)
source = 'Source: ' + source

In [6]:
cases = data.pivot(index='date', columns='state', values='positive').astype(float)
cases.index = pd.DatetimeIndex(cases.index.astype(str))
deaths = data.pivot(index='date', columns='state', values='death').astype(float)
deaths.index = pd.DatetimeIndex(deaths.index.astype(str))
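
The reshaping in the cell above can be tried on a toy frame. This is a minimal sketch, assuming the feed has one row per state per day with `date` stored as an integer such as 20200301; the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format rows that mimic the shape of daily.json
toy = pd.DataFrame({
    'date': [20200301, 20200301, 20200302, 20200302],
    'state': ['NY', 'WA', 'NY', 'WA'],
    'positive': [1, 10, 3, None],
})

# One row per date, one column per state
wide = toy.pivot(index='date', columns='state', values='positive').astype(float)

# Integer dates such as 20200301 parse cleanly once cast to strings
wide.index = pd.DatetimeIndex(wide.index.astype(str))
print(wide)
```

Missing reports stay as `NaN` in the wide frame, which is why later cells call `fillna(0)` or `dropna()` before plotting.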

## Semilog plot of US States

In [7]:
def plot_semi_log_trajectory(data, mode, threshold, source):
    
    styles = ['-'] #, '--', '-.', ':'] # 4 lines 
    markers = list('PXo^v<>D*pH.d') # 13 markers
    colours = ['maroon', 'brown', 'olive', 'red', 
               'darkorange', 'darkgoldenrod', 'green',  
               'blue', 'purple', 'black', 'teal'] # 11 colours

    ax = plt.subplot(111)
    ax.set_title(f'COVID-19 US States: Number of {mode}')
    ax.set_xlabel('Days from the notional ' +
                f'{int(threshold)}th {mode[:-1]}')
    ax.set_ylabel(f'Cumulative {mode} (log scale)')
    ax.set_yscale('log')

    fig = ax.figure

    endpoints = {}
    color_legend = {}
    for i, name in enumerate(data.columns):
        # Get x and y data for this state
        # - where two sequential days have the same 
        #   value let's assume the second day is 
        #   because of no reporting, and remove the 
        #   second/subsequent data points.
        y = data[name].dropna()
        #print(f'{name}: \n{y}')
        y = y.drop_duplicates(keep='first')
        x = y.index.values
        y = y.values
        
        # let's not worry about the very short runs
        if len(y) <= 2:
            continue
    
        # shift x so that x=0 falls where the line crosses the threshold
        if y[0] == threshold:
            adjust = 0
        else:
            span = y[1] - y[0]
            adjust = (threshold - y[0]) / span
        x = x - adjust
        endpoints[name] = [x[-1], y[-1]]
        
        # and plot
        s = styles[i % len(styles)]
        m = markers[i % len(markers)]
        c = colours[i % len(colours)]
        lw = 1
        ax.plot(x, y, label=f'{name} ({int(y[-1])})', 
                #marker=m, 
                linewidth=lw, color=c, linestyle=s)
        color_legend[name] = c 

    # label each end-point
    xmin, xmax = ax.get_xlim()
    ax.set_xlim(xmin, xmax+(xmax*0.02))
    for label in endpoints:
        x, y = endpoints[label]
        ax.text(x=x+(xmax*0.01), y=y, s=f'{label}',
                size='small', color=color_legend[label],
                bbox={'alpha':0.5, 'facecolor':'white'})
    
    # etc.
    ax.legend(loc='upper left', ncol=4, fontsize=7)
    fig.set_size_inches(8, 8)
    fig.text(0.99, 0.005, source,
                ha='right', va='bottom',
                fontsize=9, fontstyle='italic',
                color='#999999')
    fig.tight_layout(pad=1)
    fig.savefig(f'{CHART_DIRECTORY}!semilog-comparison-{mode}', dpi=125)
    plt.show()
    plt.close()
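
The x-axis adjustment inside `plot_semi_log_trajectory` is a linear interpolation between the first two points of each series. A small worked sketch with made-up values:

```python
import numpy as np

threshold = 100
y = np.array([90.0, 130.0, 210.0])   # first value sits just below the threshold
x = np.arange(len(y), dtype=float)   # days 0, 1, 2

# fraction of a day between y[0] and y[1] at which the threshold is crossed
adjust = 0.0 if y[0] == threshold else (threshold - y[0]) / (y[1] - y[0])
x_shifted = x - adjust               # day 0 now sits at the crossing point

print(adjust)  # 0.25
print(x_shifted)
```

Because the y-axis is logarithmic, a linear interpolation only approximates the point where the plotted line crosses the threshold.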

In [8]:
def prepare_comparative_data(data, threshold):
    
    # focus on data at/above threshold (and just before)
    mask = data >= threshold
    for col, i in enumerate(mask.columns):
        ilocate = mask.index.get_loc(mask[i].idxmax()) - 1
        # ilocate < 0: there is no earlier day to keep
        if ilocate >= 0 and data[i].iloc[ilocate+1] > threshold:
            mask.iloc[ilocate, col] = True
    data = data.where(mask, other=np.nan)

    # Rebase the data in terms of days starting at 
    # day at or immediately before the threshold
    nans_in_col = data.isna().sum()
    for i in nans_in_col.index:
        data[i] = data[i].shift(-nans_in_col[i])
    data.index = range(len(data))
    
    return data
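
The rebasing step can be illustrated on a single series. This is a minimal sketch with made-up numbers; `align_to_threshold` is a hypothetical helper written for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

def align_to_threshold(series, threshold):
    """Keep values from the day before the threshold is crossed, indexed from 0."""
    at_or_above = (series >= threshold).to_numpy()
    if not at_or_above.any():
        return pd.Series(dtype=float)  # never reaches the threshold
    first = int(np.argmax(at_or_above))  # position of first value >= threshold
    # keep the preceding day too, unless the threshold was hit exactly
    # or there is no preceding day
    start = first - 1 if first > 0 and series.iloc[first] > threshold else first
    return series.iloc[start:].reset_index(drop=True)

toy = pd.Series([2.0, 40.0, 90.0, 130.0, 210.0])
aligned = align_to_threshold(toy, threshold=100)
print(aligned.tolist())  # [90.0, 130.0, 210.0]
```

Keeping the one value below the threshold gives the plotting function two points to interpolate between when it places the crossing at day zero.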

In [9]:
def semilog(data, mode, threshold, source):
    x = prepare_comparative_data(data, threshold)
    plot_semi_log_trajectory(x, mode, threshold, source)

## Plot of new vs cumulative

In [10]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

abbrev_us_state = dict(map(reversed, us_state_abbrev.items()))

In [11]:
def plot_new_and_cum_cases(states_new, states_cum, mode, lfooter=''):
    
    for name in states_cum.columns:
            
        ps.plot_new_cum(
            states_new[name], states_cum[name], mode, name, 
            title=f'COVID-19 {mode.title()}: {name}',
            rfooter=source,
            lfooter=lfooter,
            savefig=f'{CHART_DIRECTORY}'+
                f'{name}-new-vs-cum-{mode}-{lfooter}.png',
        ) 

In [12]:
def joint(cases, deaths, mode):
    cases = cases.sort_index().fillna(0).diff().rolling(7).mean().copy()
    cases = ps.negative_correct(cases)
    deaths = deaths.sort_index().fillna(0).diff().rolling(7).mean().copy()
    deaths = ps.negative_correct(deaths)
    
    for state in cases.columns:
        name = state
        
        # plot cases
        ax = plt.subplot(111)
        labels = [f'{p.day}/{p.month}' for p in cases.index]
        ax.plot(labels, cases[state].values,
                color='darkorange', label='New cases (left)')
        ax.set_title(f'COVID-19 in {name} {mode}')
        ax.set_ylabel(f'Num. per Day {mode}\n7-day rolling average')
        
        # plot deaths
        axr = ax.twinx()
        axr.plot(labels, deaths[state],
                 lw=2.0, color='royalblue', label='New deaths (right)')
        axr.set_ylabel(None)
        axr.grid(False)

        # manually label the x-axis
        MAX_LABELS = 9
        ticks = ax.xaxis.get_major_ticks()
        if len(ticks):
            modulus = int(np.floor(len(ticks) / MAX_LABELS) + 1)
            for i in range(len(ticks)):
                if i % modulus:
                    ticks[i].label1.set_visible(False)

        # put in a legend
        ax.legend(loc='upper left')
        axr.legend(loc='center left')

        # wrap-up
        fig = ax.figure
        fig.set_size_inches(8, 4)
        fig.tight_layout(pad=1)
        fig.savefig(f'{CHART_DIRECTORY}{state}-cases-v-deaths-{mode}.png', dpi=125)

        #plt.show()
        plt.close()
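
The tick-thinning arithmetic in `joint` can be checked in isolation. A minimal sketch; `visible_tick_positions` is a hypothetical helper that reproduces only the modulus logic, without any plotting.

```python
import numpy as np

MAX_LABELS = 9

def visible_tick_positions(n_ticks, max_labels=MAX_LABELS):
    """Positions that keep their label when every `modulus`-th tick is shown."""
    modulus = int(np.floor(n_ticks / max_labels) + 1)
    return [i for i in range(n_ticks) if i % modulus == 0]

print(visible_tick_positions(30))  # [0, 4, 8, 12, 16, 20, 24, 28]
```

Adding one to the floored ratio keeps the number of visible labels at or below `MAX_LABELS` for any tick count.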
        

## Regional per capita

In [13]:
def regional(df, mode):
    
    regions = {
        'Far West': ['Alaska', 'California', 'Hawaii', 'Nevada', 'Oregon', 'Washington'],
        'Rocky Mountains': ['Colorado', 'Idaho', 'Montana', 'Utah', 'Wyoming'],
        'Southwest': ['Arizona', 'New Mexico', 'Oklahoma', 'Texas'],
        'South': ['Alabama', 'Arkansas', 'Kentucky', 'Louisiana', 'Mississippi', 'Tennessee'],
        'Southeast': ['Florida', 'Georgia', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia'],
        'Plains': ['Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota',  'South Dakota'],
        'Great Lakes': ['Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin'],
        'Mideast': ['Delaware', 'District of Columbia', 'Maryland', 'New Jersey', 'New York', 'Pennsylvania'],
        'New England': ['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont'],
        'Other': ['American Samoa', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'Virgin Islands'],
    }
    
    ps.plot_regional_per_captia(df, mode, regions, population, 
        tight=1,
        rfooter=source,
        savefig_prefix=CHART_DIRECTORY
    )

## Growth factor

In [14]:
def plot_growth_factor(states_new, mode):
    for name in states_new.columns:
        ps.plot_growth_factor(states_new[name], 
            title=f'{name}: weekly growth - new COVID-19 {mode.lower()}',
            ylabel='Growth factor',
            xlabel=None,
            figsize=(8, 4),
            savefig=f'{CHART_DIRECTORY}{name}-growth-chart-{name}-{mode}.png',
            rfooter=source,
            lfooter=f'Weekly rolling average daily new {mode.lower()} this week / last week'
        )
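
The footer text describes the growth factor as this week's rolling average of daily new counts divided by last week's. How `ps.plot_growth_factor` computes it is not shown in this notebook; the following is only a sketch of that ratio, assuming a 7-day window and a 7-day lag, with made-up numbers. `weekly_growth_factor` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def weekly_growth_factor(daily_new, window=7):
    """Ratio of the rolling mean to the same rolling mean one window earlier."""
    smoothed = daily_new.rolling(window).mean()
    previous = smoothed.shift(window).replace(0, np.nan)  # avoid dividing by zero
    return smoothed / previous

# Hypothetical series: 10 new cases a day for a week, then 20 a day
daily_new = pd.Series([10.0] * 7 + [20.0] * 7)
growth = weekly_growth_factor(daily_new)
print(growth.iloc[-1])  # 2.0
```

A value above 1 means the weekly average is rising, below 1 that it is falling. The first 13 days are `NaN` because two full windows are needed before the ratio is defined.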

## Plot new cases: raw and smoothed

In [15]:
def plot_new_original_smoothed(states_new, mode):
    HMA = 15
    ROLLING_PERIOD = 7
    rolling_all = pd.DataFrame()
    for name in states_new.columns:
        title = f'{name} (new COVID-19 {mode} per day)'
        ps.plot_orig_smooth(states_new[name].copy(), 
            HMA, 
            mode,
            'Australia', # this is used to get starting point for series
            title=title, 
            ylabel=f'New {mode} per day',
            tight=1.25,
            rfooter=source,
            savefig=f'{CHART_DIRECTORY}{title}.png'
        )
        
    # gross numbers per state
    for name in states_new.columns:
        rolling_all[name] = states_new[name].rolling(ROLLING_PERIOD).mean()
        
    rolling_all = rolling_all.iloc[-1].sort_values() # latest
    title = f'COVID19 Daily New {mode.title()} ({ROLLING_PERIOD} day average)'
    ps.plot_barh(rolling_all.round(2),
        title=title,
        figsize=(8,8),
        savefig=f'{CHART_DIRECTORY}!bar-chart-{title}.png',
        rfooter=source
    )
        
    # latest per-capita comparison
    power = 6
    pop_factor = int(10 ** power)
    title = f"COVID19 Daily New {mode.title()} ({ROLLING_PERIOD} day average per $10^{power}$ pop'n)"
    rolling_all = rolling_all[population.index] # same order as population
    rolling_all = ((rolling_all / population) * pop_factor).round(2)
    ps.plot_barh(rolling_all.sort_values(),
        title=title,
        figsize=(8,8),
        savefig=f'{CHART_DIRECTORY}!bar-chart-{title}.png',
        rfooter=source
    )

## Bring it all together

In [16]:
cases.columns = cases.columns.map(abbrev_us_state)
cases_pc = cases.div(population / 1_000_000, axis=1)

deaths.columns = deaths.columns.map(abbrev_us_state)
deaths_pc = deaths.div(population / 1_000_000, axis=1)
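
The renaming and per-million scaling above rely on pandas label alignment. A minimal sketch with a two-state subset; the populations and counts are round made-up numbers, and the `toy_`-prefixed names are hypothetical, chosen to avoid clobbering the notebook's own variables.

```python
import pandas as pd

# Hypothetical two-state subset of the lookup and population tables
toy_abbrev_to_name = {'NY': 'New York', 'WA': 'Washington'}
toy_population = pd.Series({'New York': 20_000_000, 'Washington': 8_000_000})

toy_cum = pd.DataFrame({'NY': [100.0, 200.0], 'WA': [40.0, 80.0]})

# Replace postal abbreviations with full names so labels match the population index
toy_cum.columns = toy_cum.columns.map(toy_abbrev_to_name)

# axis=1 aligns the population Series against the column labels
toy_per_million = toy_cum.div(toy_population / 1_000_000, axis=1)
print(toy_per_million)
```

Any abbreviation missing from the dictionary maps to `NaN` as a column label, so it is worth checking the columns after the `map` call.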

In [17]:
def main():
    
    modes = ['cases', 'deaths']
    frames = [cases.copy().fillna(0), deaths.copy().fillna(0)]
    
    for mode, uncorrected_cumulative in zip(modes, frames):
    
        # data transformation - correct for data glitches
        (uncorrected_daily_new, 
         corrected_daily_new, 
         corrected_cumulative) = ps.dataframe_correction(uncorrected_cumulative)
        
        print(uncorrected_daily_new.tail(7))
        
        # New cases original and smoothed
        plot_new_original_smoothed(corrected_daily_new.copy(), mode)
        
        # regional plots
        regional(corrected_daily_new.copy(), mode)
        
        # new v cum plots
        plot_new_and_cum_cases(corrected_daily_new.copy(), corrected_cumulative.copy(), mode, 
                               lfooter='Any extreme outliers have been adjusted')
        plot_new_and_cum_cases(uncorrected_daily_new.copy(), uncorrected_cumulative.copy(), mode, 
                               lfooter='Original data')
                               
        # Growth rates
        plot_growth_factor(corrected_daily_new.copy(), mode)

    #joint(cases.copy(), deaths.copy(), '')
    #joint(cases_pc, deaths_pc, 'per million pop')
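
The run of `main()` reports days on which a cumulative count went down. `ps.dataframe_correction` is a local helper whose source is not shown in this notebook; purely as an illustration, such negatives can be located by differencing a cumulative series (the values here are made up).

```python
import pandas as pd

# Hypothetical cumulative count with one downward revision
cumulative = pd.Series(
    [100.0, 120.0, 115.0, 140.0],
    index=pd.date_range('2020-08-13', periods=4, name='date'),
)

daily_new = cumulative.diff()          # day-on-day change
negatives = daily_new[daily_new < 0]   # days where the total went down
print(negatives)
```

Downward revisions like this usually reflect a reporting correction, which is why the corrected and uncorrected series are plotted separately.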

In [18]:
main()

Negatives in Arkansas
date
2020-08-15   -400.0
Name: Arkansas, dtype: float64
Data too sparse in American Samoa (max_consecutive=0)
Negatives in Connecticut
date
2020-05-27   -15.0
2020-08-18   -12.0
Name: Connecticut, dtype: float64
Negatives in Delaware
date
2020-07-25   -27.0
2020-08-27   -10.0
2020-08-30    -6.0
Name: Delaware, dtype: float64
Negatives in Guam
date
2020-05-04   -1.0
Name: Guam, dtype: float64
Negatives in Hawaii
date
2020-05-15   -1.0
2020-05-23   -5.0
Name: Hawaii, dtype: float64
Negatives in Louisiana
date
2020-06-19   -119.0
Name: Louisiana, dtype: float64
Negatives in Massachusetts
date
2020-09-02   -7757.0
Name: Massachusetts, dtype: float64
Spikes in Massachusetts
date    2020-06-01
spike  3599.955421
mean    476.177139
zeros     0.000000
Data too sparse in Northern Mariana Islands (max_consecutive=3)
Negatives in Montana
date
2020-05-05   -1.0
2020-06-06   -1.0
Name: Montana, dtype: float64
Negatives in New Jersey
date
2020-07-19   -31.0
Name: New Jersey, dt

Data too sparse in Alaska (max_consecutive=7)
Negatives in Alabama
date
2020-09-25   -15.0
Name: Alabama, dtype: float64
Negatives in Arkansas
date
2020-04-22   -1.0
2020-08-02   -2.0
2020-08-16   -1.0
Name: Arkansas, dtype: float64
Spikes in Arkansas
date   2020-09-16
spike  147.000000
mean    11.785714
zeros    1.000000
Data too sparse in American Samoa (max_consecutive=0)
Negatives in Arizona
date
2020-07-27   -1.0
2020-08-31   -1.0
2020-09-07   -2.0
Name: Arizona, dtype: float64
Negatives in Colorado
date
2020-04-25    -2.0
2020-07-06   -10.0
2020-12-06    -1.0
Name: Colorado, dtype: float64
Negatives in District of Columbia
date
2020-04-07   -2.0
Name: District of Columbia, dtype: float64
Negatives in Delaware
date
2020-08-25   -1.0
Name: Delaware, dtype: float64
Spikes in Delaware
date   2020-07-24
spike   49.000000
mean     1.071429
zeros    6.000000
Spikes in Georgia
date   2020-11-03
spike  480.000000
mean    30.214286
zeros    0.000000
Data too sparse in Guam (max_consecutive

In [19]:
print('Finished')

Finished
