# Altair basics

- **[Altair home page](https://altair-viz.github.io/index.html)**


- [Vega-Lite site](https://vega.github.io/vega-lite/)
- [Vega-Lite documentation](https://vega.github.io/vega-lite/docs/)
- [Vega-Lite 2.0 Medium article](https://medium.com/@uwdata/introducing-vega-lite-2-0-de6661c12d58)
- [Vega-Lite 2.0 OpenVisConf 2017 talk](https://www.youtube.com/watch?v=9uaHRWj04D4)
- [About the Vega project](https://vega.github.io/vega/about/)

**This is just a made-up data set inspired by a [Nature Methods article](https://www.nature.com/articles/nmeth.2807)**

## Big learning objectives

- Exploration
- Basic specification (marks, encoding)
- Plot size
- Layering
- Faceting
- Configuration
- Saving charts

---

*To preserve the mystery, select from the notebook menus*

`Edit -> Clear All Outputs`

---

In [1]:
import pandas as pd
import altair as alt

## Load in sample data

We'll load all data into a pandas DataFrame. A DataFrame is a data structure meant for "tabular data", which is like a spreadsheet. DataFrames also have built-in functions that can modify and display the data.

This pretend data set has values for five items in five categories. It gives us a chance to play around with various visual representations. **The best choice depends on which comparisons are most important to the story you're trying to tell!**

*Please excuse the numbers before the names of both items and categories. The data patterns are clearer with the original sorting, when the vegetables were Item 1, Item 2... and the characteristics were Category 1, Category 2... Since the default sorting in Altair (and almost all software) is alphabetical, and overriding it would have introduced extra code, I decided on this non-ideal compromise to keep the sort order with my new descriptor names.*


In [2]:
df = pd.read_csv('data/NatureVegSurveyScores.tsv', sep='\t', encoding='utf-8')
df.sample(10)

Unnamed: 0,respID,veg,trait,response
1226,r26,5_Peas,5_Gross,1
460,r10,2_Squash,5_Gross,2
1248,r48,5_Peas,5_Gross,1
1145,r45,5_Peas,3_Cheap,4
1160,r10,5_Peas,4_Tasty,4
18,r18,1_Corn,1_Green,2
400,r0,2_Squash,4_Tasty,3
1062,r12,5_Peas,2_Yellow,1
927,r27,4_Green beans,4_Tasty,2
962,r12,4_Green beans,5_Gross,1


### Data is already "tidy"

Hadley Wickham's [original tidy data paper](http://vita.had.co.nz/papers/tidy-data.html)
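
As a rough illustration of what "tidy" means here, the sketch below reshapes a tiny made-up wide table (one column per survey question) into the one-row-per-response layout that Altair works best with. The column names `Corn_Green` and `Corn_Yellow` are hypothetical; the workshop file is already tidy, so this step is not needed for it.

```python
import pandas as pd

# Made-up "wide" survey table: one row per respondent, one column per question
wide = pd.DataFrame({
    'respID': ['r0', 'r1'],
    'Corn_Green': [1, 2],
    'Corn_Yellow': [5, 5],
})

# melt() stacks the question columns into rows: one row per (respondent, question)
tidy = wide.melt(id_vars='respID', var_name='question', value_name='response')

# Split the combined column name into the two variables it encodes
tidy[['veg', 'trait']] = tidy['question'].str.split('_', expand=True)
tidy = tidy.drop(columns='question')
```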


### Original survey data looked like this, in "wide" form

The original data from the Nature Methods article had a very generic five items in five categories. Here I've tried to make it more concrete: **Let's say we have five different vegetables, and we're asking people if these vegetables are very Green, Yellow, Cheap, Tasty or Gross.**

*Note: this data is really just synthetic – we made it up to show how different visual representations change how we see the data patterns.*

**Fifty people responded to the survey questions like this:**

```
"Corn is very Green"

(5) Strongly agree
(4) Agree
(3) Neutral
(2) Disagree
(1) Strongly disagree
```

and our data consists of the numeric versions of the responses.

In [3]:
df.pivot_table(index='respID', columns=['veg','trait'], values='response').head()

veg,1_Corn,1_Corn,1_Corn,1_Corn,1_Corn,2_Squash,2_Squash,2_Squash,2_Squash,2_Squash,...,4_Green beans,4_Green beans,4_Green beans,4_Green beans,4_Green beans,5_Peas,5_Peas,5_Peas,5_Peas,5_Peas
trait,1_Green,2_Yellow,3_Cheap,4_Tasty,5_Gross,1_Green,2_Yellow,3_Cheap,4_Tasty,5_Gross,...,1_Green,2_Yellow,3_Cheap,4_Tasty,5_Gross,1_Green,2_Yellow,3_Cheap,4_Tasty,5_Gross
respID
r0,1,5,3,5,1,1,4,5,3,1,...,4,5,3,1,1,2,2,4,2,1
r1,2,5,3,4,1,1,3,1,3,2,...,3,2,2,4,1,5,1,3,5,1
r10,1,5,3,5,1,1,5,2,3,2,...,3,3,3,1,1,3,3,4,4,1
r11,3,5,2,5,4,2,5,2,4,3,...,4,4,4,4,1,1,3,3,4,1
r12,5,5,2,5,1,2,5,2,4,4,...,3,2,5,2,1,2,1,1,1,2


## Table to visualize & summarize the data

**Altair isn't great for making tables, so we'll just use a Pandas pivot table**

- The table is compact
- it communicates values very precisely
- but it's hard for us to look at a bunch of numbers, take them in, and see patterns in them
- it also doesn't tell any particular story

In [4]:
(df.pivot_table(index='veg',
               columns='trait',
               values='response',
               aggfunc='mean')
 .style.format(precision=1)
#.background_gradient(vmin=1, vmax=5)
)

trait,1_Green,2_Yellow,3_Cheap,4_Tasty,5_Gross
veg
1_Corn,1.8,4.7,3.3,4.9,1.9
2_Squash,2.0,4.5,3.2,2.7,2.4
3_Brussel sprouts,2.3,3.7,3.1,1.5,3.5
4_Green beans,3.6,3.2,3.0,2.1,1.9
5_Peas,4.0,1.7,2.9,3.5,1.3


## Method chaining

In Altair we construct a visualization by **chaining method calls together, with a dot between each**. The chain is a "declaration" of the visualization and follows this pattern:

- `alt` – calls the Altair module through its abbreviated name stated above
- `Chart()` – feed in the Pandas DataFrame our data values come from
- `mark_xxxx()` – sets the [mark type](https://altair-viz.github.io/user_guide/marks.html) to use – here a rectangle
- `encode()` – specifies the "[encoding channels](https://altair-viz.github.io/user_guide/encoding.html#encoding-channels)" for this visualization, things like x (horizontal axis), y (vertical axis), color, tooltip, shape, size, etc.
- `transform_xxxx()` – [data transformations](https://altair-viz.github.io/user_guide/transform.html) like filter, calculate, aggregate, lookup, etc. 

## Heatmap

A heatmap is a very compact visual representation of the data, very similar to the original table, where rectangles are colored by the values in each cell. *Note that `mark_rect()` makes a rectangle that will always fill the cell, which is perfect for making heatmaps. A `mark_square()` is a square which can have size variation.*

We're not really good at quantitatively comparing color values, though, so this isn't a great representation if you want people to accurately detect the numerical patterns. Also, note that Cheap values aren't distinguishable.

**Let's practice typing the specification! Delete and retype the code.**

Always start with `alt.Chart().mark_rect().encode()` and then fill in the DataFrame and encoding

```
alt.Chart(df).mark_rect().encode(
    x = 'trait',
    y = 'veg',
    color = 'mean(response)'
)
```

- The heatmap is compact and eye-catching
- it tends to show blocks of light and dark, which isn't great for this data
- it also doesn't tell any particular story here

---

# Comparing trends within traits across vegetables

### Specifying the encoding data types

Let's start generating other alternative visual encodings for the data that will be better suited for particular comparisons we're trying to make easy for the audience.

- **Note that now we're specifying the variable types**, in this case so Altair will give us a categorical (nominal) color scheme instead of ordinal.
- **It's a good idea to always specify the data types!**

[Encoding data types documentation](https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types)

|Data Type|Shorthand Code|Description
|---|---|---
|quantitative|Q|a continuous real-valued quantity
|ordinal|O|a discrete ordered quantity
|nominal|N|a discrete unordered category
|temporal|T|a time or date value
|geojson|G|a geographic shape

## Circle area encoding

- The color makes us see this in columns, because we group items with similar visual properties
- that starts telling the story of how the traits vary across the vegetables 
- but we're not great at quantitatively comparing areas
- note that these patterns wouldn't be clear without this particular sorting

**Try switching the *color* variable (encoding data) type from "nominal" to "ordinal" and see what changes.**

In [5]:
alt.Chart(df).mark_circle().encode(
    x = 'trait:N',
    y = 'veg:N',
    color = 'trait:N',
    size = 'mean(response):Q'
)

## Changing plot size

- Width and height go in `.properties()`
- We can also add a tooltip
- Also showing you can use a square mark instead of a circle for a similar effect

In [6]:
alt.Chart(df).mark_square().encode(
    x = 'trait:N',
    y = 'veg:N',
    color = 'trait:N',
    size = 'mean(response):Q',
    tooltip = ['mean(response):Q']
).properties(
    width = 200,
    height = 200
)

---

## EXERCISE 1

Try making a bar chart with:

- a **horizontal bar for each vegetable** (with labels on the left), 
- each bar length representing the **mean numeric response** within the vegetable

![goal horiz_bars](images/veghorizbars.png)

```
alt.Chart(----).mark_bar().encode(
    x = '----:-',
    y = '----:-'
)
```

- Use the code above as a hint, but **type instead of copy/paste** 
- Replace the dashes with correct code

---

## Line plot

Another way to show the trends within the traits across the vegetables would be to make a line plot.

**It's not usually a great idea to connect categorical variables with lines, since there is nothing between the vegetables**, but let's try it here to guide the eye.


In [7]:
alt.Chart(df).mark_line().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

## Layering charts

[Layered and multi-view charts documentation]()

**Let's put a dot on the lines to make it clear where the data points are.**

- **Altair lets us use a `+` to layer individual charts and match their axes:** `lines + dots`
- An alternative syntax is: `alt.layer(lines, dots)`
- Adding dots to lines is so common that there is a shortcut without layering: `.mark_line(point=True)`

In [8]:
dots = alt.Chart(df).mark_circle().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines = alt.Chart(df).mark_line().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines + dots

## Changing the size of the layered plot

- The aspect ratio of this plot makes it hard to follow some of the steep lines, so let's change the size
- *We can set properties on the layered plot by wrapping the layered expression in parentheses and chaining `.properties()` onto it*


In [9]:
dots = alt.Chart(df).mark_circle().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

lines = alt.Chart(df).mark_line().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

(lines + dots).properties(
    width = 250,
    height = 200
)

## Reducing repeated elements with a base plot

We can see in the example above that the lines and dots plots are almost exactly the same. Altair lets us use that and just add the differences.

- We'll define a base plot with all of the common elements, and 
- then add the differences to each right before layering
- *Note: base is not a special word – that's just the variable name I used here*

In [10]:
base = alt.Chart(df).encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
).properties(
    width = 250,
    height = 200
)

base.mark_line() + base.mark_circle()

## Layering to add labels

`.mark_text()` is used for labels

**Properties of the marks themselves – not encodings that bind the marks to the data – are put in as arguments for the mark method**

In [11]:
base = alt.Chart(df).encode(
    y = 'veg:N',
    x = 'mean(response):Q'
)

alt.layer(
    base.mark_bar(color = 'orange'),
    base.mark_text().encode(
        text = 'mean(response):Q'
    )
)

### With some extra formatting

- Modifying the x,y,text,color elements themselves requires object rather than shorthand string notation
- Customizing the *scale* of the X axis
- Customizing the *format* of the text

In [12]:
base = alt.Chart(df).encode(
    y = 'veg:N',
    x = alt.X('mean(response):Q').scale(domain=[0,4])
)

alt.layer(
    base.mark_bar(color = '#C84E00'),
    base.mark_text(dx = 5, align='left').encode(
        text = alt.Text('mean(response):Q').format('.1f')
    )
)

## Faceted plots

We can make what are sometimes called "small multiples" in Altair using `facet()` to specify that facets, or unique values, of a categorical variable should be split off and arranged along either rows or columns of the overall visualization. **Visuals shown within each facet are only the subset of the data corresponding to that category!**

- We could alternatively have specified `column = 'trait:N'` within the encoding without a `.facet()` section.
- The advantage of using `.facet()` is that you can make faceted views of more complicated charts.

**Try changing the column facet to a row facet.** (You might want to modify the height and width.) I prefer the row-faceted version because I have an easier time comparing across vegetables.

- *Note that since the traits are directly labeled we don't need a color legend.*

In [13]:
alt.Chart(df).mark_bar().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = alt.Color('trait:N').legend(None)
).properties(
    width = 80,
    height = 120
).facet(
    column='trait:N'
)

---

# Looking at traits of each vegetable

## EXERCISE 2: Vertical (row) faceting 

- This is great for comparing traits within a vegetable
- The lack of a common baseline makes it harder to compare quantitatively across vegetables, but it's easier than side-by-side grouped bars

**CREATE** – *(again, type rather than copy/paste)*

- Vertical bars now, but still representing the mean response
- One bar for each trait, still colored by trait
- Row facet by vegetable

```
alt.Chart(df).mark_bar().encode(
    x = '----:-',
    y = alt.Y('----:-').title('Avg response'),
    color = alt.Color('----:-').legend(None)
).properties(
    width = 140,
    height = 70
).facet(
    row='----:-'
)
```

![goal vert_bars](images/VerticalVegFacetedBars.png)

---

## Pie charts

Just so you can see how to do pies

In [14]:
alt.Chart(df).mark_arc().encode(
    theta = 'mean(response):Q',
    color = 'trait:N'
).properties(
    width = 80,
    height = 80
).facet(
    column='veg:N'
)

## Introducing configuration – Dot plot with horizontal grid lines

**[Top-level configuration docs](https://altair-viz.github.io/user_guide/configuration.html)** *(I find it quite difficult to find what I need in this documentation, even though it's very complete.)*

- We can try a dot plot again, this time focusing on the characteristics of each vegetable
- By default, grid lines are drawn for quantitative axes but not categorical ones. Here I want lines along each item to help guide the eye and associate the dots with the vegetables, even though they're colored by traits.
- We also could have specified the circle size and opacity in the `encode()` section with `size = alt.value(150), opacity = alt.value(1.0)`

In [15]:
alt.Chart(df).mark_circle(size=150,opacity=1.0).encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
).properties(
    width = 400,
    height = 150,
).configure_axisX(
    grid=False
).configure_axisY(
    grid=True
)

## More configuration + Title

**Let's make this a more complete figure.** I find configuration of things like font sizes, colors, and placement to be difficult and annoying in almost every piece of visualization software (Excel, Tableau, etc.), but it's often necessary to customize these settings to tell your story effectively!

- Titles are a little strange to specify – they go in the `.Chart()`
- If you want multiple lines in either the title or subtitle – *or any text in Altair* – put the lines as a list of strings: `["line1","line2"]`

In [16]:
alt.Chart(
    df, 
    title = alt.TitleParams(
        "Corn and peas have the strongest characteristics",
        subtitle = ["Corn is very yellow and tasty – Peas are very green, but not as tasty", 
                    "Squash is also very yellow"]
    )
).mark_circle(
    size = 150,
    opacity = 1.0
).encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
).properties(
    width = 400,
    height = 150,
).configure_axisX(
    grid=False
).configure_axisY(
    grid=True
).configure_axis(
    titleFontSize = 14,
    labelFontSize = 12
).configure_legend(
    titleFontSize = 14,
    labelFontSize = 12
).configure_title(
    fontSize = 20,
    anchor = 'start',
    offset = 15,
    color = "#333333",
    subtitleColor = "#666666"
)

---

## Saving to HTML and JSON files

Documentation: [Saving Altair charts](https://altair-viz.github.io/user_guide/saving_charts.html)

Now, it's easy to save out an HTML file for the visualization, or the JSON specification. **These will get saved in the same directory that JupyterLab is running in.**

- Remember that all the data will get embedded in the HTML!
- See the end of the [41_MIDS_LibPageviews](41_MIDS_LibPageviews.ipynb) notebook for how to use VegaFusion to just embed the aggregated data

In [17]:
bars = alt.Chart(df).mark_bar().encode(
    x = 'mean(response):Q',
    y = 'veg:N',
    color = 'trait:N'
)

bars.save('stacked_bars.html')
bars.save('stacked_bars.json')

## SVG and PNG require additional installs

Documentation: [Saving as PNG, SVG, and PDF](https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format)

**You can always render your visualization in the notebook and use the circular `...` button next to the visualization to render to SVG or PNG.** This method uses the web browser you're using for your notebook to render to the file. You can also do the same from the HTML you generated in the previous step.

From the docs, "Saving these images requires an additional extension to run the javascript code necessary to interpret the Vega-Lite specification and output it in the form of an image. There are two packages that can be used to enable image export: 
[vl-convert](https://github.com/vega/vl-convert) or 
[altair_saver](http://github.com/altair-viz/altair_saver/)."

**What you want for now is vl-convert**, which can save to PNG and SVG. altair-saver can save to PDF, but requires extra dependencies and right now is not compatible with Altair 5.

```
pip install vl-convert-python
```

In [18]:
bars.save('stacked_bars.svg')
bars.save('stacked_bars.png', scale_factor=4.0)

WARN Stacking is applied even though the aggregate function is non-summative ("mean").
WARN Stacking is applied even though the aggregate function is non-summative ("mean").


---

## *EXTRA:* changing the color scheme

- It's not a great idea to have the same color meaning two different things in nearby visualizations
- We can switch the set of colors Altair uses by specifying a `scheme` inside a `Scale()` object, which is passed to the color's `.scale()` method

*See the [Vega documentation](https://vega.github.io/vega/docs/schemes/) for a list of available color schemes*

In [19]:
alt.Chart(df).mark_bar().encode(
    x = 'trait:N',
    y = 'mean(response):Q',
    color = alt.Color('veg:N').scale(alt.Scale(scheme='set2'))
).properties(width=150)