# Data Visualization in Python
## Introduction to Altair

Question
* How can we create modern, interactive plots?

Objective
* Understand the basics of the grammar of graphics.
* Create a first `alt.Chart` object.
* Save a figure as an image or as an interactive version.

In [None]:
import pandas as pd

# Load the cleaned data
surveys_complete = pd.read_csv('../data/surveys_0_NA.csv')
surveys_complete

## Grammar of graphics
* Introduced by Leland Wilkinson in the 1990s
  (his book *The Grammar of Graphics* was published in 1999).
* A common, structured language for describing and
  understanding the various aspects of any visualization.

Interesting explanatory video:

[![YouTube - A Grammar of Graphics](https://img.youtube.com/vi/RCaFBJWXfZc/default.jpg)](https://www.youtube.com/watch?v=RCaFBJWXfZc)

### Seven central elements
1. Data
   * What we represent.
2. Aesthetic
   * How we represent it.
3. Coordinate system
   * How we position our data.
4. Scale
   * How we move from data to aesthetic representation.
5. Geometric objects
   * Which type of geometric object is used.
6. Statistics
   * How data is computed or transformed.
      * For example, the count, quartiles, regression lines, etc.
7. Facets
   * How we divide our data into multiple sets or graphs.

![Facets by year and sex](../images/facet-year-sex.png)

In the above figure, we have:
1. Data
   * Fields: `weight`, `hindfoot_length`, `species_id`
   * Data distribution by `year` and by `sex`
2. Aesthetic
   * Position according to `hindfoot_length` and `weight`
   * Color according to `species_id`
3. Coordinate system
   * Cartesian coordinate system
      * Two axes: X and Y
4. Scale
   * X (`hindfoot_length`): from 0 to 65
   * Y (`weight`): from 4 to 512
      * Logarithmic scale to base 2
   * Colors (`species_id`): blue, orange, red, etc.
5. Geometric objects
   * Semi-transparent circles
6. Statistics
   * Data represented as is, without transformation
7. Facets
   * Facets by `year` and by `sex`

### Grammar of interactive graphics
Adding interaction elements on top of the grammar of graphics
* Events
  * Click on or near an item
  * Hovering over an item
  * Pan (drag)
  * Zoom
* Selection
  * What is selected: point, region, subset
* Actions
  * What happens to the chart

Here is a video about it:

[![YouTube - Vega-Lite - A grammar of Interactive Graphics](https://img.youtube.com/vi/rydth27fB3Q/default.jpg)](https://www.youtube.com/watch?v=rydth27fB3Q)

## Visualization Libraries
Several visualization libraries explicitly rely
on the grammar of graphics in their design:
* **R**: ggplot2
* **JavaScript**: Vega-Lite, nd3
* **Python**: Altair, Bokeh, Plotly, Plotnine, Seaborn (>=0.12)

Which Python library to choose?
* It depends on the type of library you need.
* It depends on what is popular in your research area.

|              Type               |     Name     | Forks | Stars |
| ------------------------------- | ------------ | ----: | ----: |
| Traditional                     | `matplotlib` |  7900 | 21.2k |
| Traditional                     | `seaborn`    |  2000 | 13.1k |
| Grammar of graphics             | `plotnine`   |   233 |  4.2k |
| Grammar of interactive graphics | `altair`     |   805 |  9.8k |
| Grammar of interactive graphics | `bokeh`      |  4200 | 19.8k |
| Grammar of interactive graphics | `plotly`     |  2600 | 17.1k |
| Specialized                     | `folium`     |  2200 |  7.1k |
| Specialized                     | `geoplotlib` |   175 |  1.0k |

### Why `altair`? Why not `matplotlib`?

While `matplotlib` is a widely used and very flexible
visualization library, writing plots with it is not
as intuitive as with `altair`. Moreover, the `altair`
[library](https://altair-viz.github.io/index.html)
makes it easy to create highly informative
[interactive graphics](https://vega.github.io/vega-lite/)
from data stored in Pandas objects.

We will see different visualization concepts that can
be reproduced more or less easily with other libraries
such as `matplotlib`, `plotnine`, `plotly` and `seaborn`.

In [None]:
import altair as alt

Because the charts generated by Altair are not just static
images, each chart embeds its data, so a notebook can become
quite heavy when it contains multiple plots.
By default, Altair raises an error for DataFrames of more than
5000 rows, but we can disable that limit at our own risk.

In [None]:
alt.data_transformers.disable_max_rows()

## Plotting with `altair`
`altair` charts are built step by step from
a `Chart` object constructed with a DataFrame:
* **Choosing the type of chart** -
  The first mandatory method starts with `mark_`.
  For example, `mark_point()`.
  Since no channel is encoded yet, all points are drawn
  on top of each other, which is expected.

In [None]:
# New Chart object and choice of type of chart
alt.Chart(surveys_complete).mark_point()

* **Encoding channels** - Then we need to
  [encode](https://altair-viz.github.io/user_guide/encodings/)
  channels that link fields of the DataFrame to
  elements of the chart. The main parameters of `encode()` are:
  `x`, `y`, `color`, `shape` and `size`.

In [None]:
# Once the axes are defined, the points take their position
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
)

* **Interactive navigation** - When a chart is made _interactive_,
  we can zoom in and out, and pan the chart by dragging it with the mouse.

In [None]:
# Enable interactions with the mouse
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
).interactive()

* **Temporary columns** - To add some noise and reduce the
  overlap between points, we can create temporary columns with
  [`transform_calculate()`](https://altair-viz.github.io/user_guide/transform/calculate.html).
  Since these columns do not exist in the DataFrame, Altair
  cannot infer their type, so we have to specify
  [the data type](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types)
  explicitly.

In [None]:
# Temporary columns with noise
alt.Chart(surveys_complete).transform_calculate(
    noisy_length='datum.hindfoot_length + random() - 0.5',
    noisy_weight='datum.weight + random() - 0.5',
).mark_point().encode(
    x=alt.X('noisy_length').type('quantitative'),
    y=alt.Y('noisy_weight').type('quantitative'),
).interactive()
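
In the expressions above, `random()` returns a uniform value in [0, 1), so `random() - 0.5` shifts each point by at most half a unit in either direction. A roughly equivalent computation done beforehand with NumPy could look like this sketch, where the `sample` DataFrame, the seed and the name `jittered` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up records standing in for the survey data (illustrative only)
sample = pd.DataFrame({
    'hindfoot_length': [32, 33, 21, 36],
    'weight': [40, 48, 20, 120],
})

# Seeded generator so that the noise is reproducible
rng = np.random.default_rng(seed=42)

# rng.random(n) draws n uniform values in [0, 1), like random() in the chart
jittered = sample.assign(
    noisy_length=sample['hindfoot_length'] + rng.random(len(sample)) - 0.5,
    noisy_weight=sample['weight'] + rng.random(len(sample)) - 0.5,
)

# Largest displacement introduced along the X axis
max_shift = (jittered['noisy_length'] - jittered['hindfoot_length']).abs().max()
```

Unlike `transform_calculate()`, this approach stores the noisy columns in the DataFrame, so Altair can infer their type.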

* **Displaying values interactively** -
  Encode the `tooltip` channel with a list of fields
  to display when the mouse pointer hovers over a point.

In [None]:
# Display values of selected fields when moving the mouse
chart = alt.Chart(surveys_complete).transform_calculate(
    noisy_length='datum.hindfoot_length + random() - 0.5',
    noisy_weight='datum.weight + random() - 0.5',
).mark_point().encode(
    x=alt.X('noisy_length').type('quantitative'),
    y=alt.Y('noisy_weight').type('quantitative'),
    tooltip=['plot_id', 'species_id', 'hindfoot_length', 'weight'],
).interactive()
chart

* **Saving the figure** - It is possible to save the chart in the
  [format of our choice](https://altair-viz.github.io/user_guide/saving_charts.html).

In [None]:
chart.save('weight_length.html')

In [None]:
# Other formats: PDF (heavier), PNG, SVG
# These may need an extra package such as vl-convert-python
try:
    chart.save('weight_length.png')
    chart.save('weight_length.svg')
except Exception as err:
    print('Error:', err)
    print('-> Better to use the (•••) menu of the chart instead')

### Exercise - Create a bar chart
From the `surveys_complete` DataFrame, create a bar chart that
shows the count of records for each `plot_id`. Instructions:
* Use
  [`mark_bar()`](https://altair-viz.github.io/gallery/simple_bar_chart.html)
  to generate the
  [chart](https://altair-viz.github.io/gallery/simple_histogram.html)
* For the X axis, specify the `'plot_id'` field and the
  [`'ordinal'` type](https://altair-viz.github.io/user_guide/encodings/#encoding-data-types)
* For the Y axis, specify `'count()'` as a temporary field computed
  automatically by Altair, which saves us from using `groupby()`
* Activate the `tooltip` channel with `'count()'`

(7 min.)

In [None]:
alt.Chart(surveys_complete).mark_bar().encode(
    x=alt.X('plot_id').type('ordinal'),
    y=alt.Y('count()'),
    tooltip=['count()'],
)
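
For comparison, this is roughly the aggregation that `'count()'` spares us from writing with Pandas. The sketch uses a made-up `sample` DataFrame rather than the survey data.

```python
import pandas as pd

# Made-up records: two rows for plot 1, one for plot 2, three for plot 3
sample = pd.DataFrame({'plot_id': [1, 1, 2, 3, 3, 3]})

# Number of records per plot, which is what count() computes for each bar
counts = sample.groupby('plot_id').size()
print(counts.to_dict())  # {1: 2, 2: 1, 3: 3}
```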

## Key points
* **Altair module**
  * `import altair as alt`
  * Disable the row limit: `alt.data_transformers.disable_max_rows()`
* **Creating a new empty chart**
  * `chart = alt.Chart(df)`
* **Temporary columns**
  * `chart.transform_calculate(col2='datum.col1 + random()-0.5')`
* **Choosing a type of chart**
  * `chart.mark_point()`
  * `chart.mark_bar()`
* **Assigning data fields to encoding channels**
  * `chart.encode(...)`
  * Encoding channels:
    * `x=alt.X('varX')` and `y=alt.Y('varY')`
      * `.type('type')`, with the
        [different types](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types):
        * Continuous quantity: `'quantitative'`, `'var:Q'`
        * Discrete ordered quantity: `'ordinal'`, `'var:O'`
        * Discrete unordered category: `'nominal'`, `'var:N'`
        * Time or date value: `'temporal'`, `'var:T'`
    * `tooltip=['field_name1', 'field_name2', 'field_name3', ...]`
* **Other properties of the chart**
  * `chart.interactive()`
* **Saving the figure**
  * `chart.save("chart.html")`
  * `chart.save("chart.png")`
  * `chart.save("chart.svg")`