# Data Analysis and Visualization in Python
## Making Plots With Altair
Questions
* How can I visualize data in Python?
* How can I create modern, interactive plots?
* What is the "grammar of graphics"?

Objectives
* Create an `alt.Chart` object.
* Build complex plots using a step-by-step approach.
  * Change the aesthetics of a plot such as color.
  * Edit the title and the axis labels.
* Create scatter plots, time series plots and box plots.
* Create a collection of plots splitting the data by a "factor" variable.
* Save a figure as an image or as an interactive version.

In [None]:
import pandas as pd

# Load and clean the data
surveys_complete = pd.read_csv('../data/surveys.csv').dropna()
surveys_complete

## Why `altair`? Why not `matplotlib`?

While `matplotlib` is a widely used and
quite flexible visualization library, the code needed
to build a plot does not follow a specific *grammar*.

In this chapter, we have decided to present the `altair` library which
[facilitates the creation of highly informative charts](https://altair-viz.github.io/index.html)
from data stored in Pandas objects.
It is based on the grammar of interactive graphics of
[Vega-Lite](https://vega.github.io/vega-lite/),
which makes the programming both elegant and powerful.

We will see different visualization concepts that can
be reproduced more or less easily with other libraries
such as `matplotlib`, `plotnine`, `plotly` and `seaborn`.

In [None]:
import altair as alt

Because the charts generated by Altair are not just static
images, each chart embeds its data, so the output can be quite heavy
and it accumulates when a notebook contains multiple plots.
By default, Altair refuses to process DataFrames of more than 5000
records, but we can disable that limit at our own risk.

In [None]:
alt.data_transformers.disable_max_rows()

## Plotting with `altair`
`altair` charts are built step by step from
a `Chart` object constructed with a DataFrame:
* **Choosing the type of chart** -
  The first mandatory method starts with `mark_`,
  for example `mark_point()`.
  Until position channels are encoded, all points are drawn
  at the same location and overlap, which is expected.

In [None]:
# New Chart object and choice of type of chart
alt.Chart(surveys_complete).mark_point()

* **Encoding channels** - Then we need to
  [encode channels](https://altair-viz.github.io/user_guide/encodings/),
  which link fields of the DataFrame to elements
  of the chart. The main parameters of `encode()` are:
  `x`, `y`, `color`, `shape` and `size`.

In [None]:
# Once the axes are defined, the points take their position
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
)

* **Interactive navigation** - When a chart is made _interactive_,
  we can zoom in and out, and drag the chart with the mouse.

In [None]:
# Enable interactions with the mouse
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
).interactive()

* **Temporary columns** - To add some noise and separate
  overlapping points, we can create temporary columns with
  [`transform_calculate()`](https://altair-viz.github.io/user_guide/transform/calculate.html).
  Because these columns do not exist in the DataFrame, Altair
  cannot infer their type, so we have to specify
  [the data type](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types).
In [None]:
# Temporary columns with noise
alt.Chart(surveys_complete).transform_calculate(
    noisy_length='datum.hindfoot_length + random() - 0.5',
    noisy_weight='datum.weight + random() - 0.5',
).mark_point().encode(
    x=alt.X('noisy_length').type('quantitative'),
    y=alt.Y('noisy_weight').type('quantitative'),
).interactive()

* **Having values displayed interactively** -
  Encode the `tooltip` channel with a list of
  fields to display when moving the mouse pointer.

In [None]:
# Display values of selected fields when moving the mouse
chart = alt.Chart(surveys_complete).transform_calculate(
    noisy_length='datum.hindfoot_length + random() - 0.5',
    noisy_weight='datum.weight + random() - 0.5',
).mark_point().encode(
    x=alt.X('noisy_length').type('quantitative'),
    y=alt.Y('noisy_weight').type('quantitative'),
    tooltip=['plot_id', 'species_id', 'hindfoot_length', 'weight'],
).interactive()
chart

* **Saving the figure** -
  A chart can be saved as an interactive HTML page
  or as a static image such as a PNG file.

In [None]:
chart.save('weight_length.html')

In [None]:
try:
    chart.save('weight_length.png')
except Exception as err:
    print('Error:', err)
    print('-> We had better use the (•••) button')

### Exercise - Create a bar chart
From the `surveys_complete` DataFrame, create a histogram that
shows the count of records for each `plot_id`. Instructions:
* Use
  [`mark_bar()`](https://altair-viz.github.io/gallery/simple_bar_chart.html)
  to generate the
  [chart](https://altair-viz.github.io/gallery/simple_histogram.html)
* For the X axis, specify the `'plot_id'` field and the
  [`'ordinal'` type](https://altair-viz.github.io/user_guide/encodings/#encoding-data-types)
* For the Y axis, specify `'count()'` as a temporary field computed
  automatically by Altair, which saves us from using `groupby()`
* Activate the `tooltip` channel with `'count()'`

(4 min.)

In [None]:
###(surveys_complete)###
    ###('plot_id').type('ordinal'),
    ###('count()'),
    ###['count()'],
)

## Building your plots iteratively
Reminder: every Altair chart is a `Chart`
object constructed with a DataFrame.
Then, a `mark_*()` method is called to specify the
type of chart, and some data fields are assigned
to encoding channels via the `encode()` method.

* We can then modify the chart in order to display more information.
  For example, with transparency:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
).configure_mark(
    opacity=0.05,
)

* To get a unique color per species, we need to encode
  the `species_id` field to the `color` channel:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
    color=alt.Color('species_id'),
).configure_mark(
    opacity=0.05,
)

* Because the colors are reused for multiple species, it is
  better to activate the `tooltip` channel with `species_id`:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight'),
    color=alt.Color('species_id'),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
)

* The Y axis can be configured with a logarithmic scale:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight').scale(type='log', base=2),
    color=alt.Color('species_id'),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
).properties(
    height=384,
)
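
On a logarithmic scale in base 2, the position of a value is proportional to its base-2 logarithm, so every doubling of the weight covers the same distance on the axis. A short worked example with made-up weights:

```python
import math

# Hypothetical weights in grams, each one the double of the previous
weights = [8, 16, 32, 64, 128, 256]

# Position of each weight on a base-2 logarithmic axis
positions = [math.log2(w) for w in weights]

# Distance between consecutive weights on that axis
steps = [b - a for a, b in zip(positions, positions[1:])]

print(positions)  # [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(steps)      # [1.0, 1.0, 1.0, 1.0, 1.0]
```

This is why small and large species can both be read on the same chart: the range from 8 g to 16 g takes as much room as the range from 128 g to 256 g.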

* The title and axis labels can be set:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length').title('Hindfoot length (mm)'),
    y=alt.Y('weight').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id'),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
).properties(
    height=384,
    title='Weight by the hindfoot length',
)

### Exercise - Enrich the bar chart
Modify the chart from the previous exercise by
encoding the `sex` field to a specific color scale:
* The `'sex'` field must be encoded to the `color` channel.
  The `.scale()` method can then associate domain values `'F'`
  and `'M'` to colors `'orange'` and `'green'`, respectively.
  See [an example here](https://altair-viz.github.io/user_guide/customization.html#color-domain-and-range)
* In the `tooltip` channel, add `'sex'` at the beginning of the list
* Activate the `xOffset` channel and see what it does to the bar-plot

(4 min.)

In [None]:
alt.Chart(surveys_complete).mark_bar().encode(
    x=alt.X('plot_id').type('ordinal'),
    y=alt.Y('count()'),
    color=alt.###(###).scale(
        ###=['F', 'M'],
        ###=['orange', 'green'],
    ),
    #xOffset='sex',
    tooltip=[###, 'count()'],
).properties(
    width=480,  # Fix the chart width (pixels)
)

## Plotting time series data
* Let’s visualize the number of records per year for each species

In [None]:
alt.Chart(surveys_complete).mark_line().encode(
    x=alt.X('year').type('ordinal'),
    y=alt.Y('count()').scale(type='log', base=2),
    color=alt.Color('species_id'),
)

* And now, the median weight per month for each species

In [None]:
alt.Chart(surveys_complete).mark_line().encode(
    x=alt.X('month').type('ordinal'),
    y=alt.Y('weight').aggregate('median'),
    color=alt.Color('species_id'),
    tooltip=['species_id'],
)
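
For comparison, the aggregations used in the two charts above roughly correspond to the following `groupby()` operations in Pandas. The DataFrame `demo` is made up for the illustration and only mimics the column names of the survey data.

```python
import pandas as pd

# Hypothetical records mimicking the columns of surveys_complete
demo = pd.DataFrame({
    'year': [1990, 1990, 1990, 1991, 1991],
    'month': [1, 1, 2, 1, 1],
    'species_id': ['DM', 'DM', 'PF', 'DM', 'PF'],
    'weight': [40.0, 44.0, 8.0, 42.0, 7.0],
})

# Equivalent of y='count()' with x='year' and color='species_id'
counts = demo.groupby(['year', 'species_id']).size()

# Equivalent of .aggregate('median') with x='month' and color='species_id'
medians = demo.groupby(['month', 'species_id'])['weight'].median()

print(counts)
print(medians)
```

With Altair, the grouping is implied by the other encoded channels, so no intermediate table has to be created.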

### Exercise - Plotting time series data
`1`. Use the `pd.to_datetime()` function to generate a new
`date` column from the columns `year`, `month` and `day`. (3 min.)

In [None]:
# Decade 1990 - to avoid April and September 2000
dec_1990 = surveys_complete[
    surveys_complete['year'].isin(range(1990, 2000))].copy()

dec_1990['date'] = ###['year', 'month', 'day']###
dec_1990['date']

`2`. Visualize the median weight of each species by the `date`.
(3 min.)

In [None]:
alt.Chart(###).mark_line().encode(
    x=alt.X(###),
    y=alt.Y('weight').###('median'),
    color=alt.Color('species_id'),
    tooltip=['species_id', 'date'],
)

## Faceting
`altair` has a special technique called *faceting*
that splits one plot into multiple plots
based on a factor variable included in the dataset.

* With the different values of `sex`:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight').scale(type='log', base=2),
    color=alt.Color('species_id'),
    facet=alt.Facet('sex'),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
).properties(
    width=240,
    height=384,
)

* With the numerous values of `plot_id`:

In [None]:
alt.Chart(surveys_complete).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight').scale(type='log', base=2),
    color=alt.Color('species_id'),
    facet=alt.Facet('plot_id').columns(5),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
).properties(
    width=90,
    height=60,
)

* To create a grid of facets such that each row of facets
  corresponds to one value of a variable, and each column
  of facets corresponds to one value of a second variable,
  we use the encoding channels `row` and `column`:

In [None]:
# Only keep three years
surveys2000 = surveys_complete[surveys_complete['year'].isin([2000, 2001, 2002])]

alt.Chart(surveys2000).mark_point().encode(
    x=alt.X('hindfoot_length'),
    y=alt.Y('weight').scale(type='log', base=2),
    color=alt.Color('species_id'),
    row=alt.Row('sex'),
    column=alt.Column('year'),
    tooltip=['species_id'],
).configure_mark(
    opacity=0.05,
).properties(
    width=128,
    height=128,
)

### Exercise - Faceting
* Create one facet per value of `sex`
* Each facet will have:
  * Years on the X axis
  * The average weight on the Y axis
  * One colored line per species

(5 min.)

In [None]:
alt.Chart(surveys_complete).###().encode(
    x=alt.X('year').type###,
    y=alt.Y('weight').aggregate###,
    color=###,
    ###('sex'),
).properties(
    width=256,
)

## Plotting distributions
* A boxplot can be used:

In [None]:
alt.Chart(surveys_complete).mark_boxplot().encode(
    x=alt.X('species_id').title('Species identifier'),
    y=alt.Y('weight').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id').legend(None),
)
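
As a reminder of what a boxplot summarizes, the sketch below computes the usual ingredients with Pandas on made-up weights: the quartiles, the interquartile range (IQR) and whiskers limited to 1.5 times the IQR, which is assumed to be the default rule of `mark_boxplot()`. The exact quartile values computed by the chart may differ slightly from those of Pandas.

```python
import pandas as pd

# Hypothetical weights (g) of one species, with one extreme value
weights = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 100.0])

q1 = weights.quantile(0.25)
median = weights.median()
q3 = weights.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 x IQR from the box are drawn as individual points
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
inside = weights[(weights >= lower_bound) & (weights <= upper_bound)]
outliers = weights[(weights < lower_bound) | (weights > upper_bound)].tolist()

# The whiskers stop at the most extreme values that are not outliers
whisker_low, whisker_high = inside.min(), inside.max()

print('Box:', q1, median, q3)
print('Whiskers:', whisker_low, whisker_high)
print('Outliers:', outliers)
```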

* Narrow facets can be used to display multiple point clouds:

In [None]:
alt.Chart(surveys_complete).transform_calculate(
    noise='random() - 0.5',  # Horizontal position in the facet
    noisy_w='datum.weight + random() - 0.5',
).mark_circle(size=4).encode(
    x=alt.X('noise').type('quantitative').axis(None).title(None),
    y=alt.Y('noisy_w:Q').scale(type='log', base=2).title('Weight (g)'),
    color=alt.Color('species_id').legend(None),
    column=alt.Column('species_id').title('Weights by species'),
).configure_mark(
    opacity=0.25,  # Opacity factor of mark_circle()
).configure_facet(
    spacing=0,     # Delete the margin between each facet
).configure_view(
    stroke=None,   # Remove the box around each facet
).properties(
    width=18,      # Each facet width
)

### Exercise - Distributions
For this exercise, we want to display the
full species names on the X axis of a boxplot.

`1`. Compute the left-join of `surveys_complete`
and all the species details in `species.csv`. (3 min.)

In [None]:
species_df = pd.read_csv('../data/species.csv')

left_join = pd.###(
    left=###, right=###,
    on=###, how=###)

left_join.columns

`2`. Create the boxplot:
* The full species names on the X axis, with the label "Species"
* The weights on the Y axis, with a logarithmic
  scale in base 2 and with the label "Weight (g)"
* One color for each species identifier
* A title for the chart

(6 min.)

In [None]:
alt.Chart(###)###.encode(
    x=alt.X(###).title('Species'),
    y=alt.Y('weight').scale(type='log', base=2).###('Weight (g)'),
    color=alt.Color(###).legend(None),
).properties(
    ###='Distribution of weights by species',
)

## Key points
* **Altair module**
  * `import altair as alt`
  * Deactivate the limit: `alt.data_transformers.disable_max_rows()`
* **Creating a new empty chart**
  * `chart = alt.Chart(df)`
* **Temporary columns**
  * `chart.transform_calculate(col2='datum.col1 + random()-0.5')`
* **Choosing a type of chart**
  * `chart.mark_point()`
  * `chart.mark_bar()`
  * `chart.mark_line()`
  * `chart.mark_boxplot()`
  * `chart.mark_circle(size=N)`
* **Assigning data fields to encoding channels**:
  * `chart.encode(...)`
  * Encoding channels:
    * `x=alt.X('varX')` and `y=alt.Y('varY')`
      * `.type('type')`, with the
        [different types](https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types):
        * Continuous quantity: `'quantitative'`, `'var:Q'`
        * Discrete ordered quantity: `'ordinal'`, `'var:O'`
        * Discrete unordered category: `'nominal'`, `'var:N'`
        * Time or date value: `'temporal'`, `'var:T'`
      * `.aggregate(...)`,
        with either `'mean'`, `'median'`, etc.
      * `.scale(type='log', base=2)`
      * `.title('Name for the X or Y axis')`
    * `color=alt.Color('field_name_for_colors')`
      * `.legend(None)`
      * `.scale(domain=[...], range=['#114499', ...])`
    * `facet=alt.Facet('field_name_for_facets')`
      * `.columns(N)`
    * `row=alt.Row('field_name_for_facet_rows')`
    * `column=alt.Column('field_name_for_facet_columns')`
    * `tooltip=['field_name1', 'field_name2', 'field_name3', ...]`
* **Other properties of the chart**
  * `chart.interactive()`
  * `chart.configure_mark(opacity=0.05)`
  * `chart.properties(...)`
    * `width=400`
    * `height=300`
    * `title='Whole figure title'`
  * `chart.configure_facet(spacing=0)`
  * `chart.configure_view(stroke=None)`
* **Saving the figure**
  * `chart.save("chart.html")`
  * `chart.save("chart.png")`