# Homework 2

This is the second homework for EN685.621. This notebook runs through the results with commentary; the code for the analysis is housed in a GitHub repository at https://github.com/choct155/en685_621. Because the homework anticipates future use of the tools developed here, a first-pass library has been constructed.

## Problem 1

Develop code to evaluate the following statistics for a given collection of values:

+ Minimum
+ Maximum
+ Mean
+ Trimmed Mean
+ Standard Deviation
+ Covariance
+ Skewness
+ Kurtosis

This first problem uses code built from scratch (that is, without libraries like `numpy` or `scipy`). The analysis for Problems 2-4, however, relies on established libraries, which provide I/O and data munging functionality that would have been a much heavier load to build from scratch.
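For readers without the repository at hand, here is a minimal from-scratch sketch of the statistics listed above. It is not the repository's `Moment` class: the function names are illustrative, and the conventions are assumptions (an $n - 1$ denominator for standard deviation and covariance, population moments for skewness and kurtosis, and a trimmed mean that drops `k` values from each end).

```python
import math
from typing import Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def trimmed_mean(xs: Sequence[float], k: int) -> float:
    """Mean after dropping the k smallest and k largest values (assumed convention)."""
    if 2 * k >= len(xs):
        raise ValueError("k trims away every observation")
    return mean(sorted(xs)[k:len(xs) - k])


def central_moment(xs: Sequence[float], order: int) -> float:
    """Population central moment of the given order (1/n denominator)."""
    mu = mean(xs)
    return sum((x - mu) ** order for x in xs) / len(xs)


def std_dev(xs: Sequence[float]) -> float:
    """Sample standard deviation (n - 1 denominator)."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def skewness(xs: Sequence[float]) -> float:
    """Third standardized moment: m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs: Sequence[float]) -> float:
    """Fourth standardized moment: m4 / m2^2 (not excess kurtosis)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2
```

The minimum and maximum are omitted because Python's built-in `min` and `max` already cover them. Other conventions (for example, bias-corrected skewness or excess kurtosis) would give different numbers, so results should only be compared against a library that uses the same definitions.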

In [1]:
from algorithms.iris.Reader import IrisReader
from algorithms.iris.AttrStats import AttrStats, Stats
from algorithms.iris.DataGenerator import DataGenerator
from algorithms.iris.IrisOps import IrisOps
from algorithms.data_structures.Vector import Vector
from algorithms.stats.Moment import Moment
from algorithms.covid.Config import Config
from algorithms.covid.Reader import CovidReader
from algorithms.covid.DataPreparer import DataPreparer
from algorithms.covid.CaseAnalyzer import CaseAnalyzer
from algorithms.covid.CaseVisualizer import CaseVisualizer
from algorithms.covid.CaseStats import CaseStats, CaseStatsByGeo, CaseStatOps
from typing import Dict
import pandas as pd
import numpy as np
import plotly.graph_objects as go

iris_reader: IrisReader = IrisReader()
iris_reader.load()
raw_data: Dict[str, np.ndarray] = iris_reader.data

raw_data

{'setosa': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],


To demonstrate the use of the univariate statistical routines, we will rely on the sepal length and width (for covariance) data for the setosa species in the iris data set.

In [2]:
sepal_setosa_length: np.ndarray = IrisOps.getCol(raw_data["setosa"], "sepal_length")
sepal_setosa_width: np.ndarray = IrisOps.getCol(raw_data["setosa"], "sepal_width")
len_vec: Vector = Vector(sepal_setosa_length)
wid_vec: Vector = Vector(sepal_setosa_width)
print([val for val in len_vec])
print([val for val in wid_vec])

[5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0]
[3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3.0, 3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3.0, 3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3.0, 3.8, 3.2, 3.7, 3.3]


Each of the statistics is displayed in turn.

In [3]:
dict(
    minimum=len_vec.min(),
    maximum=len_vec.max(),
    mean=Moment.mean(len_vec),
    trimmed_mean=Moment.trimmed_mean(len_vec, 5),
    standard_dev=Moment.std_dev(len_vec),
    covariance=Moment.covariance(len_vec, wid_vec),
    skewness=Moment.skewness(len_vec),
    kurtosis=Moment.kurtosis(len_vec)
)

{'minimum': 4.3,
 'maximum': 5.8,
 'mean': 5.005999999999999,
 'trimmed_mean': 5.029999999999999,
 'standard_dev': 0.013428571428571472,
 'covariance': 0.09921632653061219,
 'skewness': 7.000000000000001,
 'kurtosis': 49.0}

## Problem 2

With respect to part (a), the code that generates synthetic observations sits in the aforementioned repository in the subpackage called `iris`. For part (b), we will make use of a class called `DataGenerator`, which has a static method that captures each step in the generation process. In the following cell, we will capture the runtime of the generation process for the setosa data.

In [4]:
%%timeit
DataGenerator.gen_synthetic_data(raw_data["setosa"], 100)

202 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
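The generation steps themselves live in the repository and are not reproduced in this notebook. As a rough illustration only, one common way to produce synthetic observations is to fit a multivariate Gaussian to the observed rows and sample from it. The sketch below assumes that approach; `gen_synthetic_gaussian` is a hypothetical helper and may well differ from what `DataGenerator.gen_synthetic_data` actually does.

```python
import numpy as np


def gen_synthetic_gaussian(observed: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Draw n synthetic rows from a Gaussian fitted to `observed` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mu = observed.mean(axis=0)             # per-feature mean
    cov = np.cov(observed, rowvar=False)   # feature-by-feature sample covariance
    return rng.multivariate_normal(mu, cov, size=n)


# First six setosa rows from the output shown earlier
# (sepal length, sepal width, petal length, petal width).
sample = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
])
synthetic = gen_synthetic_gaussian(sample, 100)
print(synthetic.shape)
```

Sampling from a fitted Gaussian preserves the mean and covariance of the observed features in expectation, but it can produce values outside the observed range, including negative measurements.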


Part (c) draws on the code from part (b) to generate synthetic data. That code is integrated into a plotting function in `IrisOps` that plots comparisons of observed and synthetic data for each species of flower, satisfying the requirements of part (d).

In [5]:
IrisOps.compare_synth_data(raw_data, "setosa", ("sepal_length", "sepal_width"))

In [6]:
IrisOps.compare_synth_data(raw_data, "versicolor", ("sepal_length", "sepal_width"))

In [7]:
IrisOps.compare_synth_data(raw_data, "virginica", ("sepal_length", "sepal_width"))

## Problem 3

In part (a), we are asked to develop an algorithm that performs better than $O(n^3)$ and provides the following information:

+ For each state determine the number of confirmed cases by day.
+ Calculate the maximum number of cases in any given day by state.
+ By state, once the first confirmed case is identified, calculate the mean number of daily cases from that point on.

For my part, the best way to achieve a reasonably efficient implementation seemed to be a single pass over the data. One approach is to fold over the series, updating the statistics incrementally with each new data point.

```python
@staticmethod
def fold_stats(data: pd.Series) -> List[CaseStats]:

    def loop(data: List[Rational], agg: List[CaseStats]=[]) -> List[CaseStats]:
        if len(data) == 0:
            return agg
        elif len(agg) == 0:
            first: CaseStats = CaseStats(data[0], data[0], data[0], data[0], 0)
            return loop(data[1:], [first])
        else:
            last_stat: CaseStats = agg[-1]
            next_stock: Rational = data[0]
            next_flow: Rational = next_stock - last_stat.stock
            next_max_flow: Rational = next_flow if next_flow > last_stat.max_flow else last_stat.max_flow
            next_onset_days: int = (
                0 if next_max_flow == 0
                else last_stat.onset_days + 1
            )
            next_avg_flow: Rational = (
                0 if next_onset_days == 0
                else (((next_onset_days - 1) * last_stat.avg_flow) + next_flow) / next_onset_days
            )
            next_stat: CaseStats = CaseStats(
                next_stock,
                next_flow,
                next_max_flow,
                next_avg_flow,
                next_onset_days
            )
            return loop(data[1:], agg + [next_stat])
    return loop(data)
```

The struct `CaseStats` is just a dataclass:

```python
@dataclass
class CaseStats:
    """
    Struct holding relevant information that can be cumulatively extracted
    from a series of cumulative COVID-19 cases within an arbitrary cell (e.g.
    city, state, or country).

    Keyword arguments:
    stock      -- Cumulative count of cases
    flow       -- Daily new cases
    max_flow   -- Maximum count of new cases in a day for the days since onset
    avg_flow   -- Average count of new cases in a day for the days since onset
    onset_days -- Days since the first case (inclusive of the first day)
    """
    stock: Rational
    flow: Rational
    max_flow: Rational
    avg_flow: Rational
    onset_days: int
```
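The recursive `fold_stats` above slices `data[1:]` and builds a new list with `agg + [next_stat]` on every step, so it copies more and more as the series grows, and it is bounded by Python's recursion limit. The same single-pass logic can be sketched as a plain loop. This is an illustrative rewrite, not the repository's code: `fold_stats_iter` is a hypothetical name, and the local `CaseStats` uses `float` fields so the block stands alone.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CaseStats:
    """Local stand-in for the dataclass shown above (float fields for simplicity)."""
    stock: float
    flow: float
    max_flow: float
    avg_flow: float
    onset_days: int


def fold_stats_iter(data: Sequence[float]) -> List[CaseStats]:
    """Iterative counterpart to the recursive fold (hypothetical helper)."""
    out: List[CaseStats] = []
    for stock in data:
        if not out:
            # Mirror the recursive base case: seed every field with the first value.
            out.append(CaseStats(stock, stock, stock, stock, 0))
            continue
        last = out[-1]
        flow = stock - last.stock
        max_flow = max(flow, last.max_flow)
        # The onset clock starts on the first day with a positive flow.
        onset_days = 0 if max_flow == 0 else last.onset_days + 1
        # Running mean: reweight the previous average and fold in the new flow.
        avg_flow = (
            0 if onset_days == 0
            else ((onset_days - 1) * last.avg_flow + flow) / onset_days
        )
        out.append(CaseStats(stock, flow, max_flow, avg_flow, onset_days))
    return out


stats = fold_stats_iter([0, 0, 1, 3, 6])
print(stats[-1])
```

Appending to a single list keeps the work per observation constant, so the whole pass is linear in the length of the series.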

In part (b), we can directly test the runtime of this algorithm, but first we need to load and prepare the data.

In [8]:
covid_config: Config = Config()
covid_reader: CovidReader = CovidReader(covid_config)
covid: Dict[str, pd.DataFrame] = covid_reader.load()
    
dp: Dict[str, DataPreparer] = {k:DataPreparer(v) for (k, v) in covid.items()}
dp

{'confirmed': <algorithms.covid.DataPreparer.DataPreparer at 0x7fd5fa1420a0>,
 'deaths': <algorithms.covid.DataPreparer.DataPreparer at 0x7fd5fa142100>,
 'recovered': <algorithms.covid.DataPreparer.DataPreparer at 0x7fd5fa1426d0>}

In part (c) of this problem we will be focusing on the max and mean confirmed new cases by state, so for part (b), let us capture the cumulative cases by state. To test the algorithm, we only need one series and we will choose the great state of Texas.

In [9]:
confirmed_state: pd.DataFrame = DataPreparer.to_state_series(
    dp["confirmed"],
    ["UID", "iso2", "iso3", "code3", "FIPS", "Lat", "Long_", "Combined_Key"],
    ["Admin2", "Province_State", "Country_Region"]
)

texas: pd.Series = confirmed_state["Texas"]
texas

date
2020-01-22         0
2020-01-23         0
2020-01-24         0
2020-01-25         0
2020-01-26         0
               ...  
2020-06-18    101259
2020-06-19    105394
2020-06-20    109581
2020-06-21    112944
2020-06-22    117790
Name: Texas, Length: 153, dtype: int64

We are now in a position to complete part (b) by testing the runtime of the algorithm, `fold_stats`.

In [10]:
%%timeit
CaseAnalyzer.fold_stats(texas)

24.7 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


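Hmmm, this is notably slower than the synthetic data generator from Problem 2 (roughly 25 ms versus 200 µs per loop), although the algorithm is also more complicated. In any event, here is what the actual output looks like.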

In [11]:
pd.DataFrame(
    list(
        map(CaseStatOps.toCaseStatsTup, CaseAnalyzer.fold_stats(texas))
    )
)

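| | stock | flow | max_flow | avg_flow | onset_days |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.000000 | 0 |
| 1 | 0 | 0 | 0 | 0.000000 | 0 |
| 2 | 0 | 0 | 0 | 0.000000 | 0 |
| 3 | 0 | 0 | 0 | 0.000000 | 0 |
| 4 | 0 | 0 | 0 | 0.000000 | 0 |
| ... | ... | ... | ... | ... | ... |
| 148 | 101259 | 3560 | 4130 | 955.273585 | 106 |
| 149 | 105394 | 4135 | 4135 | 984.990654 | 107 |
| 150 | 109581 | 4187 | 4187 | 1014.638889 | 108 |
| 151 | 112944 | 3363 | 4187 | 1036.183486 | 109 |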


For part (c), we need to determine the max and mean number of daily cases by state. For this we can leverage a special purpose function that relies on `fold_stats`.

In [12]:
pd.DataFrame(
    list(
        map(CaseStatOps.toCaseStatsByGeoTup, CaseAnalyzer.max_mean_by_area(confirmed_state))
    )
).sort_values(by=["avg_flow"])

| | geo | stock | flow | max_flow | avg_flow | onset_days |
|---|---|---|---|---|---|---|
| 26 | Montana | 734 | 17 | 33 | 7.196078 | 102 |
| 1 | Alaska | 758 | 6 | 29 | 7.431373 | 102 |
| 11 | Hawaii | 816 | 2 | 63 | 7.555556 | 108 |
| 45 | Vermont | 1163 | 4 | 72 | 10.869159 | 107 |
| 50 | Wyoming | 1230 | 33 | 126 | 11.941748 | 103 |
| 48 | West Virginia | 2552 | 19 | 125 | 26.309278 | 97 |
| 19 | Maine | 2971 | 14 | 78 | 28.84466 | 103 |
| 34 | North Dakota | 3313 | 25 | 135 | 32.165049 | 103 |
| 12 | Idaho | 4256 | 250 | 250 | 41.72549 | 102 |
| 29 | New Hampshire | 5558 | 14 | 217 | 49.185841 | 113 |
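As a cross-check on the fold-based approach, the same two summary numbers can be sketched with vectorized `pandas` operations. This is an assumption-laden sketch, not the repository's `max_mean_by_area`: the helper name `max_mean_by_state` is hypothetical, and it assumes a frame of cumulative counts with dates as rows and states as columns, with onset defined as the first day showing a positive number of new cases.

```python
import pandas as pd


def max_mean_by_state(cumulative: pd.DataFrame) -> pd.DataFrame:
    """Max and post-onset mean of daily new cases per column (hypothetical helper)."""
    # Daily new cases are the first difference of the cumulative counts.
    flows = cumulative.diff().fillna(0)
    # True from the first day with a positive flow onward, column by column.
    since_onset = flows.gt(0).cumsum().gt(0)
    return pd.DataFrame({
        "max_flow": flows.max(),
        # Days before onset become NaN and are skipped by mean().
        "avg_flow": flows.where(since_onset).mean(),
    })


toy = pd.DataFrame({"A": [0, 0, 1, 3, 6], "B": [0, 2, 2, 5, 5]})
summary = max_mean_by_state(toy)
print(summary)
```

Each step touches every cell a constant number of times, so the cost is linear in the size of the table, comfortably inside the $O(n^3)$ bound from part (a).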


## Problem 4

For part (b) of this problem, we can focus on just the US totals.

In [13]:
# Capture US Totals
confirmed_us: pd.Series = DataPreparer.to_state_series(
    dp["confirmed"],
    ["UID", "iso2", "iso3", "code3", "FIPS", "Lat", "Long_", "Combined_Key"],
    ["Admin2", "Province_State", "Country_Region"]
).sum(axis=1)
    
deaths_us: pd.Series = DataPreparer.to_state_series(
    dp["deaths"],
    ["UID", "iso2", "iso3", "code3", "FIPS", "Lat", "Long_", "Combined_Key", "Population"],
    ["Admin2", "Province_State", "Country_Region"]
).sum(axis=1)
    
recovered_us: pd.Series = dp["recovered"].process_recovered()
    
us_series: Dict[str, pd.Series] = dict(confirmed=confirmed_us, deaths=deaths_us, recovered=recovered_us)

# Capture case stats
us_stats: Dict[str, pd.DataFrame] = {
    k:pd.DataFrame(list(map(CaseStatOps.toCaseStatsTup, CaseAnalyzer.fold_stats(v))))
    for (k,v) in us_series.items()
}
    
# Capture comparisons of stock values across statuses
stock_viz: CaseVisualizer = CaseVisualizer(
    "stock", 
    us_stats["confirmed"], 
    us_stats["deaths"], 
    us_stats["recovered"], 
    confirmed_us.index
)
stock_viz.load()
stock_viz.data

| | confirmed | deaths | recovered | date |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 2020-01-22 |
| 1 | 1 | 0 | 0 | 2020-01-23 |
| 2 | 2 | 0 | 0 | 2020-01-24 |
| 3 | 2 | 0 | 0 | 2020-01-25 |
| 4 | 5 | 0 | 0 | 2020-01-26 |
| ... | ... | ... | ... | ... |
| 148 | 2184494 | 118269 | 599115 | 2020-06-18 |
| 149 | 2215929 | 118961 | 606715 | 2020-06-19 |
| 150 | 2248179 | 119556 | 617460 | 2020-06-20 |
| 151 | 2274285 | 119812 | 622133 | 2020-06-21 |


A convenience method here will be helpful to knock out visualizations by measure (e.g. stock -> cumulative cases).

In [14]:
def plot_measure(measure: str, ttl: str) -> None:
    """Please forgive the broken closure in the name of convenience!"""
    case_viz: CaseVisualizer = CaseVisualizer(
        measure,
        us_stats["confirmed"],
        us_stats["deaths"],
        us_stats["recovered"],
        confirmed_us.index
    )
    case_viz.load()
    case_viz.plot_measure_comparison(ttl)
    
plot_measure("stock", "Comparison of Cumulative Cases by Status")
