# Fill-in-the-blank exercises

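Read the instructions for each exercise, then create the visualization by replacing every `----` blank in the code cells. A single `-` after a colon (as in `'----:-'`) stands for the one-letter data type code.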

In [None]:
import pandas as pd
import altair as alt
from altair import datum

# avoid MaxRowsError
import vegafusion as vf
vf.enable()

# Historical Polio cases by state from 1928-1969

[According to the US Centers for Disease Control (CDC)](https://www.cdc.gov/features/poliofacts/):

    Polio, or poliomyelitis, is a crippling and potentially deadly infectious disease. It is caused by the poliovirus. The virus spreads from person to person and can invade an infected person’s brain and spinal cord, causing paralysis (can’t move parts of the body).

    Thanks to the polio vaccine, dedicated health care professionals, and parents who vaccinate their children on schedule, polio has been eliminated in this country for more than 30 years.
    

### We will visualize some historical data about polio cases across the US.

**Note: The data are normalized to cases per 100,000 people for each state.**

- Data was [downloaded from visdatasets](https://visdatasets.github.io/)
- Originally retrieved from [Project Tycho](https://www.tycho.pitt.edu/); aggregated into yearly values.
- For later reading, see this [good article on visualizations of this data](http://www.randalolson.com/2016/03/04/revisiting-the-vaccine-visualizations/)

For this example we'll load data from an Excel file. If the workbook has more than one sheet, specify the sheet name; otherwise pandas reads the first sheet by default.

In [None]:
polio = pd.read_excel('data/polio_incidence_rates_united_states.xlsx', 
                      sheet_name='polio_incidence_rates')
polio.head()

## Timeline of total incidence per year (summed over all states)

Start exploring by making a **line chart** that plots the sum of all the polio cases per year.

- Time on the horizontal axis
- Sum of all the polio cases per year (across all states) on the vertical axis

In [None]:
alt.Chart(----).mark_----().encode(
    x = '----:-',
    y = '----:-'
).properties(
    width = 600
)

## Timeline showing (detail) all states

We can add a `detail=` encoding channel to split the data according to some categorical variable. This draws a separate mark for each unique entry in that category, adding a finer level of detail without tying that variable to any other visual property such as color or symbol type.

Add a `detail=` section to the previous encoding to create more detail – one line for each state.

In [None]:
alt.Chart(----).mark_----(opacity=0.3).encode(
    x = '----:-',
    y = '----:-',
    detail = '----:-'
).properties(
    width = 600
)

## Making a simple DataFrame to hold the year of the polio vaccine introduction

We'll use this one-row DataFrame to draw a rule (a solid reference line) that annotates some of the charts.

In [None]:
vacc = pd.DataFrame([{"Introduction": 1955}])

## Timeline of all states overlaid with mean cases across states

### Now we'll practice layering up multiple charts using the `+` operator.

- Make the bottom layer just like your previous plot, with one line per state.
    - Try adding `strokeWidth=0.5, color='lightgray'` to make this part less prominent
- Put over that a single bolder line showing the **mean** number of cases per year (across all states)
- Also add a vertical rule at 1955, the year the vaccine was introduced

**Note that we can layer charts using data from different DataFrames!**

In [None]:
state_lines = alt.Chart(----).----

mean_line = alt.Chart(----).mark_line(strokeWidth=3).encode(----)

rule = alt.Chart(vacc).mark_rule().encode(
    x='Introduction:O'
)

state_lines + mean_line + rule

## Median line with upper and lower quartile boundaries

Data patterns are often clearer when you show less detail. We can use an area plot to show the lower and upper quartile bounds around a (layered) median line.

- Remember to use median as the aggregation function for the line
- `mark_area()` has both `y` and `y2` encoding channels
    - these will be the lower and upper bounds of the area plotted
- `q1()` and `q3()` are aggregation functions which calculate the lower and upper quartile, respectively

In [None]:
base = alt.Chart(polio).properties(width=500)

line = base.mark_line().encode(----)

iqr_band = base.mark_area(opacity=0.3).encode(
    x ='----:-',
    y = 'q1(----)',
    y2 = 'q3(----)'
)

rule = alt.Chart(----).----

iqr_band + line + rule

## Heatmap of cases by state and year

A heatmap is a compact representation that can show a lot of data at once. Here we'll show every state's disease rate for every year by laying the values out on a grid and mapping each rate to a color.

- Use rectangle marks
- show number of cases in color
- states on the left (vertical axis)
- years on the bottom (horizontal axis)

We're changing from the default color scheme and increasing the contrast by setting the domain (the range of values over which the color varies) to something narrower than the full extent of the data.

[Vega-Lite color schemes](https://vega.github.io/vega/docs/schemes/)

*(Note: To see the trend more clearly, limit the color scale domain to 0 through 50.)*

In [None]:
alt.Chart(----).mark_----().encode(
    x = '----:-',
    y = '----:-',
    color = alt.Color('----:-').scale(scheme='reds', domain=[0,50])
).properties(
    width = 500,
    height = 500
)

## Heatmap with states sorted by sum of cases & vaccine introduction rule

As we've seen before, alphabetical ordering of categories is fine for lookup, but often hides the patterns in the data. It's better if we can order the categories (here the states) by some reasonable criteria.

So, we'll do basically the same heatmap as above, but this time we'll improve it by:

- sorting the states by the sum of the number of cases (over all years)
- adding a vertical rule at the year when the vaccine was introduced (layered over the heatmap)

Remember that we can sort the Y axis by following `alt.Y()` with `.sort()`:

```
alt.Y(__).sort(field="__", op="__", order="__")
```

In [None]:
heatmap = alt.Chart(----).----

rule = alt.Chart(----).----

---- + ----