# Wrangling and Visualization

In this notebook we use COVID-19 data, which should be familiar to all of you, to practice merging datasets and visualizing the results.

# COVID Dataset

Load the following dataset from the CDC:

```
COVID_by_State.csv
```
## Load and Inspect

1. Inspect the CDC data and make sure it looks reasonable.  
2. Select a subset of numerical columns and produce a scatter matrix  
3. Plot the time series for the U.S. total cases by date  
4. Plot the time series for the U.S. total deaths by date  
5. Put both plots on the same graph  

## Massachusetts

Repeat steps 3 to 5 for the state of Massachusetts:

3. Plot the time series for Massachusetts total cases by date  
4. Plot the time series for Massachusetts total deaths by date  
5. Put both plots on the same graph  

## Census Data

Load the U.S. Census data with population by State:

```
nst-est2019-popchg2010_2019.csv
```

1. Inspect the data to make sure it looks reasonable
2. Merge the Census data with the CDC data
3. Compute the cases per 100,000 people for all 50 states
4. Compute the deaths per 100,000 people for all 50 states
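
The merge and the per-100,000 rates could be sketched as follows. Everything about the Census side is an assumption: the column names `NAME` and `POPESTIMATE2019`, the idea that the Census file identifies states by full name, and the lookup table with hypothetical columns `state_name` and `code` (something like `state_codes.csv` may play that role). The populations and counts are made up.

```python
import pandas as pd

# Made-up CDC-style rows: cumulative totals on two dates for two states.
covid_sample = pd.DataFrame({
    "submission_date": pd.to_datetime(
        ["2021-03-10", "2021-03-11", "2021-03-10", "2021-03-11"]),
    "state": ["KS", "KS", "AR", "AR"],
    "tot_cases": [2900, 3000, 1450, 1500],
    "tot_death": [29, 30, 14, 15],
})

# Assumed Census layout; the populations are round made-up numbers.
census_sample = pd.DataFrame({
    "NAME": ["Kansas", "Arkansas"],
    "POPESTIMATE2019": [3_000_000, 3_000_000],
})

# Hypothetical lookup from state name to two-letter code.
state_codes_sample = pd.DataFrame({
    "state_name": ["Kansas", "Arkansas"],
    "code": ["KS", "AR"],
})

# Totals are cumulative, so keep only the most recent row per state.
latest = (covid_sample.sort_values("submission_date")
          .drop_duplicates("state", keep="last"))

# Attach the two-letter code to the Census rows, then join on that code.
census_coded = census_sample.merge(
    state_codes_sample, left_on="NAME", right_on="state_name", how="inner")
merged = latest.merge(
    census_coded, left_on="state", right_on="code", how="inner")

# Rates per 100,000 people.
merged["cases_per_100k"] = merged["tot_cases"] * 100_000 / merged["POPESTIMATE2019"]
merged["deaths_per_100k"] = merged["tot_death"] * 100_000 / merged["POPESTIMATE2019"]
```

An inner join keeps only jurisdictions present in both tables. Whether that leaves exactly the 50 states depends on the rows in each file, so the row count of `merged` is worth checking.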

## Produce Box Plots

1. To show cases per 100,000 people across all 50 states
2. To show deaths per 100,000 people across all 50 states

In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
location = '../../data/'
files = os.listdir(location)
files

['responses.csv',
 'CrossStats20150102.txt',
 'auto_2020.xlsx',
 'multiple_choice.csv',
 'iris_names.txt',
 'state_codes.csv',
 'iris.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'questions.csv',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'Candidate Assessment.xlsx',
 'nst-est2019-popchg2010_2019.csv']

## Load and Inspect

1. Inspect the CDC data and make sure it looks reasonable.  
2. Select a subset of numerical columns and produce a scatter matrix  
3. Plot the time series for the U.S. total cases by date  
4. Plot the time series for the U.S. total deaths by date  
5. Put both plots on the same graph  

In [3]:
from datetime import datetime
# Parse submission_date strings written as month/day/year.
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')

In [4]:
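# Note: date_parser is deprecated as of pandas 2.0; on newer versions,
# date_format='%m/%d/%Y' is the replacement.
covid = pd.read_csv(location + 'COVID_by_State.csv',
                    parse_dates=['submission_date'], date_parser=dateparse)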

In [5]:
covid.head()

Unnamed: 0,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
0,2021-03-11,KS,297229,241035.0,56194.0,0,0.0,4851,,,0,0.0,03/12/2021 03:20:13 PM,Agree,
1,2020-07-28,MP,40,40.0,0.0,0,0.0,2,2.0,0.0,0,0.0,07/29/2020 02:34:46 PM,Agree,Agree
2,2020-02-04,AR,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Not agree
3,2020-08-22,AR,56199,,,547,0.0,674,,,11,0.0,08/23/2020 02:15:28 PM,Not agree,Not agree
4,2020-10-22,MP,88,88.0,0.0,0,0.0,2,2.0,0.0,0,0.0,10/23/2020 01:44:31 PM,Agree,Agree
