## Altair for quick exploration of NC NO<sub>2</sub> emissions data

Altair is a Python module for declarative data visualization, built on Vega-Lite.

https://altair-viz.github.io/index.html

It's nice because the interface (syntax) is quite consistent, and it is conceptually similar to Tableau, both in how plots are specified and in the fact that it works with tidy data.

In [1]:
import pandas as pd
import altair as alt
import time

In [2]:
df = pd.read_csv(
    "./data/AirDataEPA/NC_NO2_hourly_2018.csv",
    parse_dates = {"tstamp":["Date Local", "Time Local"]},
    dtype = {'Site Num':'str'},
    encoding='utf-8'
).rename(
    columns = {
        "State Name": "state",
        "County Name": "county",
        "Site Num": "site",
        "Sample Measurement": "measure"
    }
).set_index('tstamp')
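
Recent pandas releases (2.2 and later, as far as I know) deprecate passing a dict to `parse_dates` to combine columns. A minimal sketch of an alternative builds the timestamp with `pd.to_datetime` after reading. It assumes the same column names, and a two-row inline CSV stands in for the real file; `csv_text`, `raw` and `sample` are names made up for this example.

```python
import io

import pandas as pd

# Two-row stand-in for the EPA file, using the same column names.
csv_text = """State Name,County Name,Site Num,Date Local,Time Local,Sample Measurement
North Carolina,Forsyth,0022,2018-01-01,00:00,4.4
North Carolina,Forsyth,0022,2018-01-01,01:00,5.1
"""

# Keep the site number as a string so leading zeros survive.
raw = pd.read_csv(io.StringIO(csv_text), dtype={"Site Num": "str"})

# Build the timestamp from the two local date/time columns.
raw["tstamp"] = pd.to_datetime(raw["Date Local"] + " " + raw["Time Local"])

sample = (
    raw.drop(columns=["Date Local", "Time Local"])
    .rename(
        columns={
            "State Name": "state",
            "County Name": "county",
            "Site Num": "site",
            "Sample Measurement": "measure",
        }
    )
    .set_index("tstamp")
)
print(sample)
```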

#### Adding a convenience column with county + site

In [3]:
df['county_site'] = df['county'] + " " + df['site']
df.head()

| tstamp | site | measure | state | county | county_site |
|---|---|---|---|---|---|
| 2018-01-01 00:00:00 | 22 | 4.4 | North Carolina | Forsyth | Forsyth 0022 |
| 2018-01-01 01:00:00 | 22 | 5.1 | North Carolina | Forsyth | Forsyth 0022 |
| 2018-01-01 02:00:00 | 22 | 3.6 | North Carolina | Forsyth | Forsyth 0022 |
| 2018-01-01 03:00:00 | 22 | 4.1 | North Carolina | Forsyth | Forsyth 0022 |
| 2018-01-01 04:00:00 | 22 | 4.6 | North Carolina | Forsyth | Forsyth 0022 |


#### With more than 5000 rows, the data needs to be kept out of the HTML Altair generates

See solutions to plotting large data sets: https://altair-viz.github.io/user_guide/faq.html#maxrowserror-how-can-i-plot-large-datasets

Here I'm using the data_server solution: https://pypi.org/project/altair-data-server/

```
pip install altair_data_server
```

In [4]:
alt.data_transformers.enable('data_server')

DataTransformerRegistry.enable('data_server')

## Altair has aggregation built in

We can just specify the aggregation function in the declaration of the plot.

In [5]:
alt.Chart(df).mark_bar().encode(
    y = 'county_site',
    x = 'mean(measure)'
)
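
To check what the bars should show, the same per-site mean can be computed in pandas. This is only a sketch on made-up numbers; the `toy` frame and the second site label are hypothetical and not taken from the EPA file.

```python
import pandas as pd

# Made-up measurements for two hypothetical sites.
toy = pd.DataFrame(
    {
        "county_site": ["Forsyth 0022", "Forsyth 0022", "Wake 0014", "Wake 0014"],
        "measure": [4.0, 6.0, 10.0, 20.0],
    }
)

# The same aggregation that 'mean(measure)' asks Altair to do, one value per bar.
site_means = toy.groupby("county_site")["measure"].mean()
print(site_means)
```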

## Altair built-in time aggregation and resampling

These are called timeunit transforms: https://altair-viz.github.io/user_guide/transform/timeunit.html

#### Altair prefers that you explicitly specify the type of variable

- `N` – Nominal *(unordered categorical)*
- `O` – Ordinal *(ordered categorical)*
- `Q` – Quantitative *(numbers)*
- `T` – Temporal *(time)*

### Daily emissions patterns heatmap

In [6]:
alt.Chart(df.reset_index()).mark_rect().encode(
    y = 'county_site:N',
    x = 'hours(tstamp):T',
    color = 'mean(measure)'
)
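
The grid behind this heatmap is a site by hour-of-day table of means. A rough pandas equivalent, sketched on four made-up readings for a single site (the `toy` frame is hypothetical), is a pivot table on the hour of the timestamp index:

```python
import pandas as pd

# Two days of made-up readings at hours 0 and 1 for one site.
toy = pd.DataFrame(
    {
        "county_site": ["Forsyth 0022"] * 4,
        "measure": [4.0, 6.0, 8.0, 10.0],
    },
    index=pd.to_datetime(
        [
            "2018-01-01 00:00",
            "2018-01-01 01:00",
            "2018-01-02 00:00",
            "2018-01-02 01:00",
        ]
    ).rename("tstamp"),
)

# Rows = site, columns = hour of day, cells = mean measurement.
hourly = toy.assign(hour=toy.index.hour).pivot_table(
    index="county_site", columns="hour", values="measure", aggfunc="mean"
)
print(hourly)
```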

### Line plot only requires switching three things

- `mark_rect()` to `mark_line()`
- `y` to `color`
- `color` to `y`

In [7]:
alt.Chart(df.reset_index()).mark_line().encode(
    color = 'county_site:N',
    x = 'hours(tstamp):T',
    y = 'mean(measure)'
)

### Easy switch to months instead of hours

In [8]:
alt.Chart(df.reset_index()).mark_line().encode(
    color = 'county_site:N',
    x = 'month(tstamp):T',
    y = 'mean(measure)'
)

## Speed

For some reason `%timeit` can't really be used to measure the whole rendering, so the cells need to be executed to see the difference.

**See how much faster this is when we let Pandas do the aggregation and only feed Altair a small dataset!**

In [9]:
grp = df.groupby(['county_site',df.index.month]).agg({'measure':'mean'}).reset_index()

alt.Chart(grp).mark_line().encode(
    color='county_site:N',
    x='tstamp:Q',
    y='measure'
)
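
The pandas half of this can at least be timed by hand with `time.perf_counter`. The sketch below uses synthetic hourly data for two hypothetical sites instead of the EPA file, and it measures only the aggregation, not the rendering in the browser. It shows how few rows are left for Altair to handle.

```python
import time

import numpy as np
import pandas as pd

# Synthetic hourly series for 2018 (8760 hours) at two hypothetical sites.
hours = pd.date_range("2018-01-01", periods=8760, freq="60min", name="tstamp")
rng = np.random.default_rng(0)
big = pd.concat(
    [
        pd.DataFrame(
            {"county_site": site, "measure": rng.uniform(0, 40, len(hours))},
            index=hours,
        )
        for site in ["Site A", "Site B"]
    ]
)

# Time only the monthly aggregation done by pandas.
start = time.perf_counter()
small = (
    big.groupby(["county_site", big.index.month])
    .agg({"measure": "mean"})
    .reset_index()
)
elapsed = time.perf_counter() - start

print(f"{len(big)} rows reduced to {len(small)} in {elapsed:.4f} s")
```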

---

### Hierarchy a bit awkward

Altair can handle two levels of hierarchy in "grouping" by combining an axis with facets. Since not every site occurs in every county, each facet needs its own y scale, which is what `resolve_scale(y='independent')` does. This is roughly the equivalent of `observed=True` for Pandas categoricals.

In [10]:
alt.Chart(df).mark_bar().encode(
    y = 'site:N',
    x = 'mean(measure):Q',
    row = 'county:N'
).resolve_scale(y='independent')
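
For comparison, the pandas behaviour mentioned here can be seen on a toy frame (made-up values, hypothetical second site). With categorical group keys, `observed=False` yields every county and site combination, while `observed=True` keeps only the pairs that actually occur:

```python
import pandas as pd

# Two counties, each with a single site of its own.
toy = pd.DataFrame(
    {
        "county": pd.Categorical(["Forsyth", "Wake"]),
        "site": pd.Categorical(["0022", "0014"]),
        "measure": [4.4, 9.8],
    }
)

# All 2 x 2 combinations, with NaN for the pairs that never occur.
all_combos = toy.groupby(["county", "site"], observed=False)["measure"].mean()

# Only the county/site pairs present in the data.
seen_only = toy.groupby(["county", "site"], observed=True)["measure"].mean()

print(len(all_combos), len(seen_only))
```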

### Could have created a county + site field in Altair

In [11]:
alt.Chart(df).mark_bar().encode(
    x = 'mean(measure):Q',
    y = 'county_site_calc:N'
).transform_calculate(
    county_site_calc = 'datum.county + " " + datum.site'
)