# Wrangling and Visualization

In data science, you often need to combine data from multiple providers to solve a problem. Each provider may format the fields you would merge on (dates, names, telephone numbers, etc.) differently.
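As a minimal sketch of what "different ways of expressing data" means in practice, the example below builds two toy tables whose date and state fields use different conventions, normalizes them, and then merges. The tables, column names, and values are all made up for illustration and are not taken from the CDC or Census files.

```python
import pandas as pd

# Toy data from two hypothetical providers (all values are made up).
provider_a = pd.DataFrame({
    "date": ["03/01/2020", "03/02/2020"],   # month/day/year strings
    "state": ["MA", "MA"],
    "cases": [1, 3],
})
provider_b = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02"],   # ISO-style strings
    "state": [" ma", "ma "],                # lowercase with stray whitespace
    "tests": [10, 20],
})

# Normalize both date columns to real datetime values.
provider_a["date"] = pd.to_datetime(provider_a["date"], format="%m/%d/%Y")
provider_b["date"] = pd.to_datetime(provider_b["date"], format="%Y-%m-%d")

# Normalize the state key to trimmed, uppercase abbreviations.
provider_b["state"] = provider_b["state"].str.strip().str.upper()

# With matching keys, the tables can be joined row by row.
merged = provider_a.merge(provider_b, on=["date", "state"], how="inner")
print(merged)
```

Without the normalization steps, the inner join would return no rows, because none of the raw keys match exactly.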

This week's live session will demonstrate these concepts with COVID-19 data from the CDC and Census data from the U.S. Census Bureau.

# COVID Dataset

Load the following dataset from the CDC:

```
COVID_by_State.csv
```

Inspect the data to make sure it looks reasonable.
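One possible way to inspect a freshly loaded table is sketched below on a toy DataFrame. The columns `state` and `new_case` are assumptions about the CDC file (only `submission_date` appears in this notebook), and the values are made up.

```python
import pandas as pd

# Toy stand-in for the loaded CDC data; real column names may differ.
covid_demo = pd.DataFrame({
    "submission_date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-01"]),
    "state": ["MA", "MA", "NY"],
    "new_case": [1, 3, 2],
})

print(covid_demo.head())                     # first few rows
print(covid_demo.dtypes)                     # dates should be datetime64, counts numeric
print(covid_demo.isna().sum())               # number of missing values per column
print(covid_demo["new_case"].describe())     # summary statistics for one count column

# The covered date range is a quick sanity check on the parsing step.
date_range = (covid_demo["submission_date"].min(),
              covid_demo["submission_date"].max())
print(date_range)
```

Things worth looking for include dates parsed as plain strings, unexpected missing values, and negative daily counts.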

## Scatter Matrix

Select a subset of numerical columns and produce a scatter matrix.


## U.S. Plots

1. Plot the time series of U.S. new cases by date
2. Plot the time series of U.S. new deaths by date
3. Put both plots on the same graph

## Massachusetts (Breakout)

Repeat the same plots for the state of Massachusetts.

# Census Data

Load the U.S. Census data with population by State:

```
nst-est2019-popchg2010_2019.csv
```

Inspect the data to make sure it looks reasonable.

## Cases Per 100,000

1. Compute the total cases per 100,000 people for all 50 states
2. Compute the total deaths per 100,000 people for all 50 states
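The per-capita rate is the count multiplied by 100,000 and divided by the population. The sketch below assumes the COVID table identifies states by abbreviation while the Census table uses full names in a `NAME` column with a `POPESTIMATE2019` population column; those names, the lookup dictionary, and all values are illustrative assumptions.

```python
import pandas as pd

# Toy cumulative totals per state (made-up values, assumed column names).
totals = pd.DataFrame({
    "state": ["MA", "NY"],
    "tot_cases": [500, 2000],
    "tot_death": [10, 50],
})

# Toy Census table (made-up populations, assumed column names).
census_demo = pd.DataFrame({
    "NAME": ["Massachusetts", "New York"],
    "POPESTIMATE2019": [1_000_000, 4_000_000],
})

# Hypothetical lookup that translates full names into abbreviations.
name_to_abbrev = {"Massachusetts": "MA", "New York": "NY"}
census_demo["state"] = census_demo["NAME"].map(name_to_abbrev)

# Join on the shared key, then scale the counts by population.
rates = totals.merge(census_demo[["state", "POPESTIMATE2019"]], on="state")
rates["cases_per_100k"] = rates["tot_cases"] * 100_000 / rates["POPESTIMATE2019"]
rates["deaths_per_100k"] = rates["tot_death"] * 100_000 / rates["POPESTIMATE2019"]
print(rates)
```

An inner merge silently drops rows that fail to match, so comparing the row count of `rates` with the expected number of states is a useful check.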

## Produce Box Plots
1. To show total cases per 100,000 people across all 50 states
2. To show total deaths per 100,000 people across all 50 states

## Scatter Plot
1. Produce a scatter plot of deaths vs. cases for all 50 states and add useful hover tooltips

## Produce a Pareto Chart (Breakout)
1. Of new cases per 100,000 people

In [1]:
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
location = '../../data/'
files = os.listdir(location)
files

['CrossStats20150102.txt',
 'multiple_choice.csv',
 'iris_names.txt',
 'iris.csv',
 'nst-est2019-popchg2010-2019.pdf',
 'mount_rainier_daily.csv',
 'COVID_by_State.csv',
 'Candidate Assessment.xlsx',
 'nst-est2019-popchg2010_2019.csv']

# CDC Data

## Load & Examine CDC Data

Let's load it:

In [3]:
from datetime import datetime
# Parse dates written as month/day/year, e.g. 03/01/2020
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')

In [4]:
# Note: date_parser is deprecated as of pandas 2.0; date_format='%m/%d/%Y' replaces it there
covid = pd.read_csv(location + 'COVID_by_State.csv',
                    parse_dates=['submission_date'], date_parser=dateparse)

# Census Data

## Load & Examine Census Data

Let's load it:

In [18]:
census = pd.read_csv(location + 'nst-est2019-popchg2010_2019.csv')