## Overview

This week you will be forecasting the incidence of COVID-19 cases. However, before forecasting you will begin by investigating the relationship between diabetes prevalence and COVID-19 incidence. 

The assignment is split into three notebooks:

  1. Inspect correlations between diabetes and COVID-19 prevalence

  2. Implement an LSTM model for COVID-19 forecasting and evaluate it on county-level data

  3. Visualize the performance of two models trained on nation-level or county-level data

In [None]:
import plotly.figure_factory as ff
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import json
from scipy.stats import pearsonr
import data_cleaners as dc

In [None]:
%load_ext autoreload
%autoreload 2

## Pipeline task overview: Forecasting COVID-19 incidence



Recall from your previous courses that these tasks can typically be described by the following components: 

 1. Data collection - <font color='green'>Done</font>
 2. Data cleaning / transformation - <font color='green'>Done</font>
 3. Dataset splitting - <font color='green'>Done</font>
 4. Model training - <font color='magenta'>You will do</font>
 5. Model evaluation - <font color='magenta'>You will do</font>
 6. Repeat 1-5 to perform model selection - <font color='magenta'>You will do</font>
 7. Presentation of findings (Visualization) - <font color='magenta'>You will do</font>



In [None]:
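def get_covid_data():
    cub = dc.CUBData()
    covid = cub.us_cases
    # Build a 'State_County' key for each row so it can be matched against the county dataset.
    covid['key'] = ['{}_{}'.format(state, county) for county, state in zip(covid['Admin2'], covid['Province_State'])]
    # Keep only the columns whose names contain '/' (presumably the per-date count columns).
    s = covid.filter(like='/')

    # The last of those columns is used as the total count.
    total_counts = s.iloc[:, -1]
    # Copy the selection so the assignment below does not write into a view of `covid`.
    covids = covid[['key', 'Province_State']].copy()
    covids['total_counts'] = total_counts
    return covids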

## <font color='magenta'>Your task</font>

In this notebook you will be inspecting correlations between diabetes and COVID-19 prevalence. 

## <font color='magenta'>Task One</font>
Load the two datasets, `covid_data` and `counties`. The county data is stored in "../../assets/assignment4/County_Demo_1012.csv".

Merge `covid_data` and `counties` using the column `key` from `covid_data` and the column `s_county` from `counties`.


Save your new dataframe in the variable `merged_covid`.

**To receive credit you must use a Pandas merge**
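
As a reminder of how `pd.merge` handles join keys that have different names in each frame, here is a minimal sketch on two small, made-up frames. Only the column names `key` and `s_county` come from the task; the frame names, the `population` column, and all values are assumptions for illustration, and the choice of join type for the real data is left to you.

```python
import pandas as pd

# Toy stand-ins for the real datasets (all values are invented).
toy_covid = pd.DataFrame({
    "key": ["Alabama_Autauga", "Alabama_Baldwin", "Alaska_Anchorage"],
    "total_counts": [100, 250, 75],
})
toy_counties = pd.DataFrame({
    "s_county": ["Alabama_Autauga", "Alabama_Baldwin", "Wyoming_Albany"],
    "population": [55000, 210000, 38000],
})

# left_on / right_on let the two frames use different names for the join key.
# The default how="inner" keeps only the keys that appear in both frames.
toy_merged = pd.merge(toy_covid, toy_counties, left_on="key", right_on="s_county")
print(toy_merged)
```

Note that both key columns survive the merge, and that rows without a match on the other side are dropped under the default inner join.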

In [None]:
merged_covid = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Two</font>

Now you will look at how various factors in the county dataset are correlated both with the proportion of a population that is diabetic, and the proportion of the population with COVID-19 over time. 

Using the function `pearsonr`, compute the correlation coefficient between each factor in the list `factors` (all of which are columns in the `merged_covid` dataframe) and the `mean_cases` column computed below.

Write a function that takes the column of interest and returns a dictionary whose keys are the factors and whose values are tuples of `(correlation coefficient, p-value)`, sorted by correlation coefficient. Then report the five factors with the largest correlation coefficients in the cells below.
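
To illustrate the mechanics, the following sketch runs `pearsonr` over a few columns of a synthetic frame and orders the results. The frame, the helper name `toy_correlations`, the descending sort order, and the decision to drop rows with missing values are all assumptions made for this example; they are not the graded solution.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic data: two columns strongly tied to `target`, one unrelated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "target": x,
    "strong_pos": x + rng.normal(scale=0.1, size=200),
    "strong_neg": -x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

def toy_correlations(df, target, columns):
    """Map each column to (correlation coefficient, p-value) against `target`."""
    results = {}
    for col in columns:
        # Drop rows with missing values first so NaNs do not contaminate the result.
        pair = df[[col, target]].dropna()
        r, p = pearsonr(pair[col], pair[target])
        results[col] = (float(r), float(p))
    # Dicts preserve insertion order, so rebuilding from a sorted list keeps the ranking.
    # Assumption: largest coefficient first.
    return dict(sorted(results.items(), key=lambda kv: kv[1][0], reverse=True))

toy_result = toy_correlations(toy, "target", ["noise", "strong_neg", "strong_pos"])
for name, (r, p) in toy_result.items():
    print(name, round(r, 3), p)
```

Sorting on the signed coefficient puts strong negative correlations last; sorting on the absolute value instead would rank by strength regardless of direction.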

In [None]:
merged_covid['mean_cases'] = merged_covid['total_counts']/merged_covid['population']

In [None]:
factors = ['white_pct',
       'black_pct', 'hispanic_pct', 'nonwhite_pct', 'foreignborn_pct',
       'female_pct', 'age29andunder_pct', 'age65andolder_pct', 'median_hh_inc',
       'clf_unemploy_pct', 'lesshs_pct', 'lesscollege_pct', 'rural_pct',
       'popdensity', 'housedensity', 'km_from_equator', 'poor_or_fair_health',
       'adult_smoking', 'adult_obesity', 'percent_uninsured',
       'social_association_rate', 'air_quality_avg_pm', 
       'percent_insufficient_sleep', 'percent_uninsured_adults',
       'percent_uninsured_children']

In [None]:
def get_factors(col_of_interest):
    to_return_dict = {}
    
    # your code goes here
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return to_return_dict

result_dict = get_factors("mean_cases")

In [None]:
#hidden tests are within this cell

In [None]:
#Fill in column name of the top factor most correlated with mean cases
top_factor = ''
#Fill in column name of the second factor most correlated with mean cases
second_factor = ''
#Fill in column name of the third factor most correlated with mean cases
third_factor = ''
#Fill in column name of the fourth factor most correlated with mean cases
fourth_factor = ''
#Fill in column name of the fifth factor most correlated with mean cases
fifth_factor = ''

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

Compute the correlation coefficient between all factors in the list `factors` which are columns in the `merged_covid` dataframe and the `percent_diabetic` column.

In [None]:
result_dict = get_factors("percent_diabetic")


In [None]:
#Fill in column name of the top factor most correlated with percent_diabetic
top_factor = ''
#Fill in column name of the second factor most correlated with percent_diabetic
second_factor = ''
#Fill in column name of the third factor most correlated with percent_diabetic
third_factor = ''
#Fill in column name of the fourth factor most correlated with percent_diabetic
fourth_factor = ''
#Fill in column name of the fifth factor most correlated with percent_diabetic
fifth_factor = ''

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Three</font>

Using the same `merged_covid` dataframe, prepare a Pandas DataFrame `df_states` with three columns, `total_diabetic`, `percent_diabetic`, and `percent_covid`, for each state. Do not reset the index; the autograder relies on it to match your results.



To pass the autograder, we suggest that `df_states` contain the following columns:
```python
['Province_State', 'population', 'total_diabetic', 'total_counts', 'percent_diabetic', 'percent_covid', 'code']
```

To add the `code` column to `df_states`, take a look at the file "../../assets/assignment4/states.json".

The first row of your DataFrame should look like the following: 

|    | Province_State   |   population |   total_diabetic |   total_counts |   percent_diabetic |   percent_covid | code   |
|---:|:-----------------|-------------:|-----------------:|---------------:|-------------------:|----------------:|:-------|
|  0 | Alabama          |      4831672 |           670972 |        1273213 |             13.887 |         26.3514 | AL     |
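
The sample row is consistent with state-level percentages computed as `100 * total / population` (for example, 670972 / 4831672 is about 13.887%). The sketch below shows that kind of aggregation on a toy frame. The frame, its values, the conversion from county percentages to head counts, and the `toy_codes` lookup are assumptions for illustration; in particular, the real `states.json` may be laid out differently.

```python
import pandas as pd

# Toy county-level frame (all values are invented).
toy_counties = pd.DataFrame({
    "Province_State": ["Alabama", "Alabama", "Alaska"],
    "population": [1000, 3000, 2000],
    "percent_diabetic": [10.0, 20.0, 5.0],  # county-level percentages
    "total_counts": [100, 300, 400],
})

# Convert county percentages to head counts so they can be summed per state.
toy_counties["total_diabetic"] = (
    toy_counties["population"] * toy_counties["percent_diabetic"] / 100
)

# as_index=False keeps Province_State as a regular column, as in the sample row.
toy_states = (
    toy_counties
    .groupby("Province_State", as_index=False)[["population", "total_diabetic", "total_counts"]]
    .sum()
)

# Recompute percentages from the state totals rather than averaging county percentages.
toy_states["percent_diabetic"] = 100 * toy_states["total_diabetic"] / toy_states["population"]
toy_states["percent_covid"] = 100 * toy_states["total_counts"] / toy_states["population"]

# Hypothetical name -> code lookup standing in for the contents of states.json.
toy_codes = {"Alabama": "AL", "Alaska": "AK"}
toy_states["code"] = toy_states["Province_State"].map(toy_codes)
print(toy_states)
```

Averaging the county percentages directly would weight a small county the same as a large one, which is why the sketch sums head counts first.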

In [None]:
# Set up the df_states dataframe with the columns 'total_diabetic', 'percent_diabetic', and 'percent_covid';
# also add 'code' to get the alphabetic code for the state (to use with Plotly).

df_states = None

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
#hidden tests are within this cell

## <font color='magenta'>Task Four</font>

Using the prepared `df_states` DataFrame, you will recreate the plot below.

This plot is a choropleth rendered with the Plotly library.

You can follow the example here: [Plotly Choropleth](https://plotly.com/python/choropleth-maps/)

Pay attention to detail: to get full points, you will need to match the colors and titles of the subplots, and show both plots side by side. A JSON file containing alphabetic state codes is provided in the assets folder ('../../assets/assignment4/states.json'). Note that the color scale used to create the chart is called 'Blues'.

In [None]:
from IPython.display import SVG, display
def show_svg():
    display(SVG(filename="../../assets/assignment4/national_covid_and_diabetes.svg"))
show_svg()

In [None]:
# Your code to replicate the visualization goes here

# YOUR CODE HERE
raise NotImplementedError()