# Altair with simple, sample data

- **[Altair documentation](https://altair-viz.github.io/index.html)**


- [Vega-Lite site](https://vega.github.io/vega-lite/)
- [Vega-Lite documentation](https://vega.github.io/vega-lite/docs/)
- [Vega-Lite 2.0 Medium article](https://medium.com/@uwdata/introducing-vega-lite-2-0-de6661c12d58)
- [Vega-Lite 2.0 OpenVisConf 2017 talk](https://www.youtube.com/watch?v=9uaHRWj04D4)
- [About the Vega project](https://vega.github.io/vega/about/)

#### This is just a made-up data set inspired by a [Nature Methods article](https://www.nature.com/articles/nmeth.2807)

In [23]:
import pandas as pd
import altair as alt

## Load in sample data

We'll load all data into a Pandas DataFrame. A DataFrame is a data structure meant for "tabular data", which is laid out like a spreadsheet. DataFrames also have built-in functions that can modify and display the data.

This pretend data set has values for five items in five categories. It gives us a chance to play around with various visual representations. **The best choice depends on which comparisons are most important to the story you're trying to tell!**

In [24]:
df_orig = pd.read_csv('data/NatureVegValues.csv')

### Preview the data

You can just type the name of the dataframe to get a printout of the contents.

The original data from the Nature Methods article had a very generic five items in five categories. Here I've tried to make it more specific: **Let's say we have five different vegetables, and the values are the number of people that thought these vegetables were very Green, Yellow, Cheap, Tasty or Gross.**

*Please excuse the numbers before the names of both items and categories. The data patterns are clearer with the original sorting, when instead of vegetables they had Item 1, Item 2... and instead of characteristics they had Category 1, Category 2. Since the default sorting in Altair (and almost all software) is alphabetical, and getting it not to sort would have introduced extra code, I decided on this non-ideal compromise to keep the sort order with my new descriptor names.*

In [25]:
df_orig

| |veg|1_Green|2_Yellow|3_Cheap|4_Tasty|5_Gross
|---|---|---|---|---|---|---
|0|1_Corn|6|29|18|30|7
|1|2_Squash|8|27|17|13|11
|2|3_Brussel sprouts|10|21|16|4|19
|3|4_Green beans|20|17|16|9|7
|4|5_Peas|23|5|15|19|2


## Make data "tidy"

The data isn't in the right form for visualization in Altair (or Tableau or ggplot2). Right now it's "wide" and it needs to be "tall". Once we "tidy" the data, each column will have only one type of information, and the same types of data won't be spread across multiple columns.

See my previous *Tidy Data in Python with JupyterLab*
[repository](https://github.com/emonson/tidy-data-python) and [video](https://library.capture.duke.edu/Panopto/Pages/Viewer.aspx?id=d8a3efe2-48d7-4505-acd3-a943013c2442)

In Pandas we do this by using the `melt()` function. We specify a list of columns that won't be pivoted using the `id_vars=` argument (here the list only has one item in it), and all other columns will be pivoted. We also specify a name for the new column that holds the former column headers (`var_name=`) and a name for the new column that holds the values (`value_name=`). The `head()` method lets you view the first set of rows.

In [26]:
df = df_orig.melt(id_vars=["veg"], var_name="trait", value_name="votes")
df.head(10)

| |veg|trait|votes
|---|---|---|---
|0|1_Corn|1_Green|6
|1|2_Squash|1_Green|8
|2|3_Brussel sprouts|1_Green|10
|3|4_Green beans|1_Green|20
|4|5_Peas|1_Green|23
|5|1_Corn|2_Yellow|29
|6|2_Squash|2_Yellow|27
|7|3_Brussel sprouts|2_Yellow|21
|8|4_Green beans|2_Yellow|17
|9|5_Peas|2_Yellow|5
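
To see the reshaping in isolation, without needing the CSV file, here is a minimal sketch on a cut-down table. The two-row, two-trait DataFrame is a stand-in built from the first values shown above; the names `wide` and `tall` are just for illustration.

```python
import pandas as pd

# Cut-down "wide" table: one row per vegetable, one column per trait
wide = pd.DataFrame({
    "veg": ["1_Corn", "2_Squash"],
    "1_Green": [6, 8],
    "2_Yellow": [29, 27],
})

# Keep "veg" as the identifier; every other column is unpivoted into
# a (trait, votes) pair, one row per original cell
tall = wide.melt(id_vars=["veg"], var_name="trait", value_name="votes")

print(tall)
```

Each of the four original cells becomes its own row, so the table grows from 2 rows to 4 while shrinking to 3 columns.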


## Method chaining

In Altair we construct a visualization by **chaining method calls together, with a dot between each one**. The resulting "declaration" of the visualization follows this pattern:

- `alt` – refers to the Altair module through its abbreviated name, set in the import above
- `Chart()` – feed in the Pandas DataFrame our data values come from
- `mark_xxxx()` – sets the [mark type](https://altair-viz.github.io/user_guide/marks.html) to use – here a rectangle
- `encode()` – specifies the "[encoding channels](https://altair-viz.github.io/user_guide/encoding.html#encoding-channels)" for this visualization, things like x (horizontal axis), y (vertical axis), color, tooltip, shape, size, etc.
- `transform_xxxx()` – [data transformations](https://altair-viz.github.io/user_guide/transform.html) like filter, calculate, aggregate, lookup, etc. 

## Heatmap

A heatmap is a very compact visual representation of the data, very similar to the original table, where rectangles are colored by the values in each cell. *Note that `mark_rect()` makes a rectangle that will always fill the cell, which is perfect for making heatmaps. A `mark_square()` is a square which can have size variation.*

We're not really good at quantitatively comparing color values, though, so this isn't a great representation if you want people to accurately detect the numerical patterns. Also, note that Cheap values aren't distinguishable.

**Let's practice typing the specification!** 

Always start with `alt.Chart().mark_rect().encode()` and then fill in the DataFrame and the encodings.

```
alt.Chart(df).mark_rect().encode(
    x = 'trait',
    y = 'veg',
    color = 'votes'
)
```

![goal veg heatmap](images/vegheatmap.png)

## Vega-Lite specification

What Altair really produces is a Vega-Lite JSON declarative specification for building the visualization, and JupyterLab has Vega and Vega-Lite built in for rendering. This is a nice separation of concerns, where Altair just needs to know how to make JSON, and the renderer knows how to actually create the visuals!

We can see the specification behind each visualization by using `.to_json()` or `.to_dict()`. Here we'll use the latter, because the printout is more compact, but feel free to try the former.

**Note that all of the data is included in the JSON!!**

- [Medium article on Vega-Lite adoption](https://medium.com/@robin.linacre/why-im-backing-vega-lite-as-our-default-tool-for-data-visualisation-51c20970df39)

In [27]:
heatmap = alt.Chart(df).mark_rect().encode(
    x = 'trait',
    y = 'veg',
    color = 'votes'
)
heatmap.to_dict()

{'config': {'view': {'continuousWidth': 300, 'continuousHeight': 300}},
 'data': {'name': 'data-39f9d319c3f1fa0701a2f2e8a04001de'},
 'mark': {'type': 'rect'},
 'encoding': {'color': {'field': 'votes', 'type': 'quantitative'},
  'x': {'field': 'trait', 'type': 'nominal'},
  'y': {'field': 'veg', 'type': 'nominal'}},
 '$schema': 'https://vega.github.io/schema/vega-lite/v5.8.0.json',
 'datasets': {'data-39f9d319c3f1fa0701a2f2e8a04001de': [{'veg': '1_Corn',
    'trait': '1_Green',
    'votes': 6},
   {'veg': '2_Squash', 'trait': '1_Green', 'votes': 8},
   {'veg': '3_Brussel sprouts', 'trait': '1_Green', 'votes': 10},
   {'veg': '4_Green beans', 'trait': '1_Green', 'votes': 20},
   {'veg': '5_Peas', 'trait': '1_Green', 'votes': 23},
   {'veg': '1_Corn', 'trait': '2_Yellow', 'votes': 29},
   {'veg': '2_Squash', 'trait': '2_Yellow', 'votes': 27},
   {'veg': '3_Brussel sprouts', 'trait': '2_Yellow', 'votes': 21},
   {'veg': '4_Green beans', 'trait': '2_Yellow', 'votes': 17},
   {'veg': '5_Peas

## All data still included, even with a filter applied

One limitation of Altair is that **all of the data is included in the Vega-Lite JSON, even if you're using Altair to filter down to a subset, or not using part of it!**

Here we'll filter down to just a single value of data. *Note that `datum` is the name for a single row (record) of the data, so `datum.trait` refers to that row's `trait` value.*

In [28]:
from altair import datum

In [29]:
one_square_heatmap = alt.Chart(df).mark_rect().encode(
    x = 'trait',
    y = 'veg'
).transform_filter(
    (datum.trait == '1_Green') & (datum.veg == '1_Corn')
)

one_square_heatmap

...but all the data is still included in the JSON

In [30]:
one_square_heatmap.to_dict()

{'config': {'view': {'continuousWidth': 300, 'continuousHeight': 300}},
 'data': {'name': 'data-39f9d319c3f1fa0701a2f2e8a04001de'},
 'mark': {'type': 'rect'},
 'encoding': {'x': {'field': 'trait', 'type': 'nominal'},
  'y': {'field': 'veg', 'type': 'nominal'}},
 'transform': [{'filter': "((datum.trait === '1_Green') && (datum.veg === '1_Corn'))"}],
 '$schema': 'https://vega.github.io/schema/vega-lite/v5.8.0.json',
 'datasets': {'data-39f9d319c3f1fa0701a2f2e8a04001de': [{'veg': '1_Corn',
    'trait': '1_Green',
    'votes': 6},
   {'veg': '2_Squash', 'trait': '1_Green', 'votes': 8},
   {'veg': '3_Brussel sprouts', 'trait': '1_Green', 'votes': 10},
   {'veg': '4_Green beans', 'trait': '1_Green', 'votes': 20},
   {'veg': '5_Peas', 'trait': '1_Green', 'votes': 23},
   {'veg': '1_Corn', 'trait': '2_Yellow', 'votes': 29},
   {'veg': '2_Squash', 'trait': '2_Yellow', 'votes': 27},
   {'veg': '3_Brussel sprouts', 'trait': '2_Yellow', 'votes': 21},
   {'veg': '4_Green beans', 'trait': '2_Yellow'

## Other visual encodings

Let's generate some alternative visual encodings of the data, each better suited to a particular comparison we want to make easy for the audience.

### Circle with size variation

**Note that now we're specifying the variable types**, in this case so Altair will give us a categorical (nominal) color scheme instead of ordinal. **It's a good idea to always specify the data types!**

[Encoding data types](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types)

|Data Type|Shorthand Code|Description
|---|---|---
|quantitative|Q|a continuous real-valued quantity
|ordinal|O|a discrete ordered quantity
|nominal|N|a discrete unordered category
|temporal|T|a time or date value
|geojson|G|a geographic shape

**Try switching the color variable (encoding data) type from "nominal" to "ordinal" and see what changes.**


In [31]:
alt.Chart(df).mark_circle().encode(
    x = 'trait:N',
    y = 'veg:N',
    color = 'trait:N',
    size = 'votes:Q'
)

## Comparing summed levels with Stacked bars

Stacked bars might be a reasonable visual representation if part of the main story is the overall sums within vegetables. First let's start with just the sums of votes.

**We can apply the `sum()` aggregation function by wrapping the field name in it, inside the quoted string!**


In [32]:
alt.Chart(df).mark_bar().encode(
    x = 'sum(votes):Q',
    y = 'veg:N'
)
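
To check the bar lengths numerically, the same per-vegetable sums can be computed in Pandas. This is only a cross-check sketch: the table is rebuilt here from the values printed earlier so the block runs on its own, and `totals` is an illustrative name.

```python
import pandas as pd

# Values copied from the wide table printed earlier in this notebook
wide = pd.DataFrame({
    "veg": ["1_Corn", "2_Squash", "3_Brussel sprouts", "4_Green beans", "5_Peas"],
    "1_Green": [6, 8, 10, 20, 23],
    "2_Yellow": [29, 27, 21, 17, 5],
    "3_Cheap": [18, 17, 16, 16, 15],
    "4_Tasty": [30, 13, 4, 9, 19],
    "5_Gross": [7, 11, 19, 7, 2],
})
tidy = wide.melt(id_vars=["veg"], var_name="trait", value_name="votes")

# Group by vegetable and add up the votes, which is the aggregation
# that 'sum(votes):Q' requests for each bar
totals = tidy.groupby("veg")["votes"].sum()
print(totals)
```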

### Now add traits color

*Note the stacking order, which seems strange to me: the trait at the top of the legend ends up at the far end of the stacked bar...*

In [33]:
alt.Chart(df).mark_bar().encode(
    x = 'sum(votes):Q',
    y = 'veg:N',
    color = 'trait:N'
)

---

## EXERCISE 1

Before we proceed, try making another stacked bar chart, but this time with 

- a vertical bar for each trait (with labels along the bottom), 
- each bar representing the sum of the votes within the trait, and 
- a different color for each vegetable.

![goal vert bars](images/vegverttraitbars.png)

```
alt.Chart(----).mark_----().encode(
    x = ----,
    y = ----,
    color = ----
)
```

- Use the code above as a hint, but **type instead of copy/paste** 
- Replace the dashes with correct code

---

## Comparing within vegetable traits as faceted bars

If we want people to be able to make comparisons across traits within each vegetable, say whether Yellow-ness or Tasty-ness is larger for Corn, we need to give them a common baseline.

We can make what are sometimes called "small multiples" in Altair using `facet()` to specify that facets, or unique values, of a categorical variable should be split off and arranged along either rows or columns of the overall visualization. **Visuals shown within each facet are only the subset of the data corresponding to that category!**

*Note that we could alternatively have specified `row = 'veg:O'` right within the encoding instead of needing the `.facet()` section. The advantage of using `.facet()` is that you can make faceted views of more complicated charts.*

**Try changing the row facet to a column facet.** (You'll probably want to modify the height and width properties.) I prefer the row-faceted version because I have an easier time comparing across vegetables.

In [34]:
alt.Chart(df).mark_bar().encode(
    x = 'trait:N',
    y = 'sum(votes):Q',
    color = 'trait:N'
).properties(
    width = 160,
    height = 80
).facet(
    row='veg:N'
)

## Comparing within trait bars across vegetables

The common baseline again gives us easy comparisons within a trait.

In [35]:
alt.Chart(df).mark_bar().encode(
    x = 'sum(votes):Q',
    y = 'veg:N',
    color = 'trait:N'
).properties(
    width = 80,
    height = 120
).facet(
    column='trait:N'
)

## Dot plot for the same comparison

Another way to give the trait votes within a vegetable a common baseline is to make a dot plot. This works well as long as there isn't too much value overlap.

- By default, grid lines are drawn on continuous-scale axes but not on nominal or ordinal ones. Here I want lines along each item to help guide the eye. [Top-level configuration docs](https://altair-viz.github.io/user_guide/configuration.html).
- We also could have specified the circle size and opacity in the `encode()` section with `size = alt.value(150), opacity = alt.value(1.0)`

In [36]:
alt.Chart(df).mark_circle(size=150,opacity=1.0).encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
).configure_axisY(
    grid=True
).configure_axisX(
    grid=False
)

## Layering dots and lines – guides the eye

**It's not usually a great idea to connect categorical variables with lines, since there is nothing between the vegetables**, but let's try it here to see if it helps guide the eye.

**Altair lets us use a `+` to layer individual charts and match their axes.**

In [37]:
dots = alt.Chart(df).mark_circle(size=150,opacity=1.0).encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines = alt.Chart(df).mark_line().encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines + dots

## Changing the size of the layered plot

The aspect ratio of this plot makes it hard to follow some of the steep lines, so let's change the size. *We can set properties on the layered plot by putting the layered charts in parentheses and chaining on the properties.*


In [38]:
dots = alt.Chart(df).mark_circle(size=150,opacity=1.0).encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines = alt.Chart(df).mark_line().encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
)

(lines + dots).properties(
    width = 200,
    height = 200
)

## Reducing repeated elements with a base plot

We can see in the example above that the lines and dots plots are almost exactly the same. Altair lets us use that and just add the differences.

- We'll define a base plot with all of the common elements, and 
- then add the differences to each right before layering

In [77]:
base = alt.Chart(df).encode(
    x = 'votes:Q',
    y = 'veg:N',
    color = 'trait:N'
).properties(
    width = 200,
    height = 200
)

base.mark_line() + base.mark_circle(size=150,opacity=1.0)

---

## EXERCISE 2

Layer (superimpose) two visualizations:

1. Vertical bar chart showing
    - traits across the horizontal (bottom) axis
    - mean votes (across all the vegetables) going up the vertical axis (bar height)
    - `color = alt.value('lightgray')`
1. Point plot of all individual votes
    - use `mark_point()`
    - traits again along the horizontal axis
    - votes up the vertical axis
    - color by trait
    
![goal bars points](images/vegbarspoints.png)
    
*Point plot should be on top of the bar chart!*

---

# Saving to HTML files

Saving an Altair visualization to an HTML file is very easy! As we'll see below, you just have to chain a `.save('filename.html')` command on to the end of the specification.

## *ASIDE: complication*

**Unfortunately, there is one complication – if you've called `alt.data_transformers.enable('json')` to avoid the MaxRowsError, you might want to turn that off before you save, so you can easily double-click to view your HTML file.** Otherwise, you must at least understand that you'll need to run (or have access to) a web server to view your HTML visualizations.

## JSON Data Transformer – effect on saved files

If you have set `alt.data_transformers.enable('json')` to avoid the MaxRowsError, Altair will automatically, behind the scenes, save your data to a JSON file on your local filesystem (hard drive), and just reference that file name instead of embedding the data in the JSON specification of the Vega-Lite visualization.

When you save an HTML file from Altair, the JSON specification it saves along with the file is exactly like it was in your Jupyter Notebook. If you have enabled the 'json' data transformer, the HTML file will reference the same JSON file URL for your data instead of embedding all the data in the JSON (and thus the HTML file).

The problem comes when you try to double-click on the HTML file from your hard drive to view it. 

- If the data is embedded, there is no problem – you will be able to see your static *or interactive* Vega-Lite chart that Altair has generated.
- **If the Vega-Lite chart refers to the data through a local file (URL), the page won't display properly when you double-click on it! Grabbing a local file is considered a [CORS request](https://en.wikipedia.org/wiki/Cross-origin_resource_sharing) in this scenario, and so *for security reasons* it isn't executed.**

### HTML files referring to local JSON files need a server

What you have to do to view any HTML-embedded Vega-Lite visualization that refers to a local JSON file is to put it (and the JSON data file) on a web server and view it through your browser. If you don't have easy access to a web server, you can run a temporary one locally by going into the directory with the files in a terminal on the Mac, or in the Anaconda prompt on Windows, and type:

`python -m http.server`

That should print out a message saying:

`Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) …`

which means you can go to http://0.0.0.0:8000/ in your browser and see the current directory (if that address doesn't load, try http://localhost:8000/ instead). Click on the HTML file you want to view and the visualization should work fine.


## Changing back to default data transformer before saving HTML

**If you want to be able to just double-click your HTML file and view your static or interactive visualization, your data will need to be embedded directly in the HTML file.** To do this, you'll need to switch the data transformer back to 'default' before saving.

*Note: if you want to see the available transformers, you can use the `.names()` method. You can also write your own transformers, say to [save the JSON files to a sub-directory](https://altair-viz.github.io/user_guide/data_transformers.html#storing-json-data-in-a-separate-directory).*

In [40]:
alt.data_transformers.names()

['csv', 'default', 'json']

In [41]:
alt.data_transformers.enable('default')

DataTransformerRegistry.enable('default')

---

## Saving to HTML and JSON files

Documentation: [Saving Altair charts](https://altair-viz.github.io/user_guide/saving_charts.html)

Now, it's easy to save out an HTML file for the visualization, or the JSON specification. **These will get saved in the same directory that JupyterLab is running in.**

In [42]:
bars = alt.Chart(df).mark_bar().encode(
    x = 'sum(votes):Q',
    y = 'veg:N',
    color = 'trait:N'
)

bars.save('stacked_bars.html')
bars.save('stacked_bars.json')

## SVG and PNG require additional installs

Documentation: [Saving as PNG, SVG, and PDF](https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format)

**You can always render your visualization in the notebook and use the circular `...` button next to the visualization to render to SVG or PNG.** This method uses the web browser you're using for your notebook to render to the file. You can also do the same from the HTML you generated in the previous step.

From the docs, "Saving these images requires an additional extension to run the javascript code necessary to interpret the Vega-Lite specification and output it in the form of an image. There are two packages that can be used to enable image export: 
[vl-convert](https://github.com/vega/vl-convert) or 
[altair_saver](http://github.com/altair-viz/altair_saver/)."

**What you want for now is vl-convert**, which can save to PNG and SVG. altair-saver can save to PDF, but requires extra dependencies and right now is not compatible with Altair 5.

*If you changed the cell below to a Code cell and executed these, they would probably give you an error on these machines.*

---

# *EXTRAS*

## Tooltip and stacking order

In the original stacked bar, the color stacking puts the first legend item furthest right, so **here we've added an "order"**. Also, a **tooltip** is easy to add.

In [43]:
alt.Chart(df).mark_bar().encode(
    x = 'sum(votes):Q',
    y = 'veg:N',
    color = 'trait:N',
    order = alt.Order('trait:N', sort='ascending'),
    tooltip=['veg','trait','sum(votes)']
).properties(height=150)

## *EXTRA:* changing the color scheme

The stacked bar chart you created in the first exercise re-used the same colors for the items that we've been using for the categories.

**It's not a great idea to have the same color meaning two different things in nearby visualizations. We can switch the set of colors Altair uses by specifying a `Scale()` as an extra argument to `Color()`.**

*See the [Vega documentation](https://vega.github.io/vega/docs/schemes/) for a list of available color schemes*

In [44]:
alt.Chart(df).mark_bar().encode(
    x = 'trait:N',
    y = 'sum(votes):Q',
    color = alt.Color('veg:N', scale=alt.Scale(scheme='set2'))
).properties(width=150)

In [47]:
alt.value('lightgray')

{'value': 'lightgray'}