# Encoding Channels

With a basic framework of _data types_, _marks_, and _encoding channels_, we can concisely create a wide variety of visualizations. In the previous notebook you were exposed to attribute data types. A visualization represents data using a collection of _graphical marks_ (bars, lines, points, etc.). The attributes of a mark, such as its position, shape, size, or color, serve as _channels_ through which we can encode underlying data values. In other words, channels are a way to change the appearance of marks based on the data's attributes. In this notebook, we will use the `mark_point` graphical mark and different channels to create visual encodings.

When discussing data items, the word __attribute__ is used to signify the data that describes the item. In the context of visualizations, we will use the words __field__ and __attribute__ interchangeably.
At the heart of Altair is the use of *encodings* that bind data fields (with a given data type) to available encoding *channels* of a chosen *mark* type. In this notebook we'll examine the following encoding channels:

- `x`: Horizontal (x-axis) position of the mark.
- `y`: Vertical (y-axis) position of the mark.
- `size`: Size of the mark. May correspond to area or length, depending on the mark type.
- `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
- `opacity`: Mark opacity, ranging from 0 (fully transparent) to 1 (fully opaque).
- `shape`: Plotting symbol shape for `point` marks.
- `tooltip`: Tooltip text to display upon mouse hover over the mark.
- `order`: Mark ordering, determines line/area point order and drawing order.

For a complete list of available channels, see the [Altair encoding documentation](https://altair-viz.github.io/user_guide/encoding.html).




## Learning Goals
Those who actively work through this notebook will be able to:
- Identify encoding channels and describe how they are utilized in Altair.
- Create scatter plots and bubble plots.
- Demonstrate how tooltips can be added to visualizations.

## Global Development Data

We will be visualizing the global health and population data that you were introduced to in the preceding notebook.

In [1]:
import pandas as pd
import altair as alt

In [2]:
from vega_datasets import data as vega_data
data = vega_data.gapminder()

For each `country` and `year` (in 5-year intervals), we have measures of fertility in terms of the number of children per woman (`fertility`), life expectancy in years (`life_expect`), and total population (`pop`).
From the preceding notebook, we know that there are 693 items and 6 attributes.
We also know the data types for the attributes:
 - Quantitative:  `pop`, `life_expect`, and `fertility`
 - Nominal: `country`
 - Ordinal: `year`

Using pandas, we can create a summary of each attribute. For the quantitative attributes, we will include the minimum and maximum values. For the others, we will just get a sense of the unique values that exist.

In [3]:
data.agg(
    {
        "year":['unique'],
        "country":['unique'],
        "cluster":['unique'],
        "pop": ['min', 'max'],
        "life_expect": ['min', 'max'],
        "fertility": ['min', 'max'],

    }
)

| | year | country | cluster | pop | life_expect | fertility |
|---|---|---|---|---|---|---|
| unique | [1955, 1960, 1965, 1970, 1975, 1980, 1985, 199... | [Afghanistan, Argentina, Aruba, Australia, Aus... | [0, 3, 4, 1, 5, 2] | NaN | NaN | NaN |
| min | NaN | NaN | NaN | 53865.0 | 23.599 | 0.94 |
| max | NaN | NaN | NaN | 1303182000.0 | 82.603 | 8.5 |


Mousing over the cell values for the unique row for `year` and `country` will allow you to see all the unique values for that attribute.
Note that `NaN` is shown for summaries not included.

There is also the `cluster` attribute, with integer values between 0 and 5. Should we treat this attribute as quantitative, nominal, or ordinal data? What might it represent? We'll try to solve this mystery as we visualize the data!
In future notebooks we will explore patterns across time, but for today let's filter to include only items for the year 2000:

In [4]:
data2000 = data.loc[data['year'] == 2000]
data2000.head(5)

| | year | country | cluster | pop | life_expect | fertility |
|---|---|---|---|---|---|---|
| 9 | 2000 | Afghanistan | 0 | 23898198 | 42.129 | 7.4792 |
| 20 | 2000 | Argentina | 3 | 37497728 | 74.34 | 2.35 |
| 31 | 2000 | Aruba | 3 | 69539 | 73.451 | 2.124 |
| 42 | 2000 | Australia | 4 | 19164620 | 80.37 | 1.756 |
| 53 | 2000 | Austria | 1 | 8113413 | 78.98 | 1.382 |


## X

The `x` encoding channel sets a mark's horizontal position (x-coordinate). In addition, default choices of axis and title are made automatically. In the chart below, the choice of a quantitative data type results in a continuous linear axis scale:

In [5]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility')
)

## Y

The `y` encoding channel sets a mark's vertical position (y-coordinate). Here we encode the `cluster` field on the `y` channel.

In [6]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility'),
    alt.Y('cluster')
)

The data type of the `cluster` field is automatically inferred by Altair, which treats it as quantitative. However, both our preliminary summary of the attributes and the plot itself show that `cluster` takes only integer values: no point falls on a value ending in .5. Let's explicitly treat `cluster` as an ordinal (`O`) data type. The result is a discrete axis that includes a sized band, with a default step size, for each unique value:

In [7]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('cluster:O')
)

_What happens to the chart above if you swap the `O` and `Q` field types?_

Let's step away from cluster for a little while and focus on relationships between attributes that we are more familiar with.
If we instead add the `life_expect` field as a quantitative (`Q`) variable, the result is a scatter plot with linear scales for both axes:

In [8]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q')
)

By default, axes for linear quantitative scales include zero to ensure a proper baseline for comparing ratio-valued data. In some cases, however, a zero baseline may be meaningless or you may want to focus on interval comparisons. To disable automatic inclusion of zero, configure the scale mapping using the encoding `scale` attribute:

In [9]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q', scale=alt.Scale(zero=False)),
    alt.Y('life_expect:Q', scale=alt.Scale(zero=False))
)

Now the axis scales no longer include zero by default. Some padding still remains, as the axis domain end points are automatically snapped to _nice_ numbers like multiples of 5 or 10.

_What happens if you also add `nice=False` to the scale attribute above?_

We have used the `x` and `y` channels to encode 2 attributes. What if we wanted to encode another attribute? What other channels are available to us?

## Size

The `size` encoding channel sets a mark's size or extent. The meaning of the channel can vary based on the mark type. For `point` marks, the `size` channel maps to the pixel area of the plotting symbol, such that the diameter of the point matches the square root of the size value.
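
To make the area-versus-diameter relationship concrete, here is a small arithmetic sketch. It is an illustration only: `point_diameter` is a hypothetical helper that applies the square-root rule stated above, not a function provided by Altair.

```python
import math


def point_diameter(size):
    """Approximate diameter in pixels of a point mark with the given size,
    using the rule stated above: diameter = sqrt(size)."""
    return math.sqrt(size)


for size in (100, 400, 1000):
    print(size, round(point_diameter(size), 1))
# 100 10.0
# 400 20.0
# 1000 31.6
```

Because size corresponds to area, quadrupling the size value only doubles the diameter of the point.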

Let's augment our scatter plot by encoding population (`pop`) on the `size` channel. As a result, the chart now also includes a legend for interpreting the size values.
By using the `size` channel to encode an additional quantitative attribute, we have moved away from a standard **scatter plot** to the lesser-known **bubble plot**.

In [10]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q')
)

In some cases we might be unsatisfied with the default size range. To provide a customized span of sizes, set the `range` parameter of the `scale` attribute to an array indicating the smallest and largest sizes. Here we update the size encoding to range from 0 pixels (for zero values) to 1,000 pixels (for the maximum value in the scale domain):

In [11]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000]))
)

## Color and Opacity

The `color` encoding channel sets a mark's color. The style of color encoding is highly dependent on the data type: nominal data will default to a multi-hued qualitative color scheme, whereas ordinal and quantitative data will use perceptually ordered color gradients.

Here, we encode the `cluster` field using the `color` channel and a nominal (`N`) data type, resulting in a distinct hue for each cluster value. Can you start to guess what the `cluster` field might indicate?


In [12]:
alt.Chart(data2000).mark_point().encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N')
)

If we prefer filled shapes, we can pass a `filled=True` parameter to the `mark_point` method:

In [13]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N')
)

By default, Altair uses a bit of transparency to help combat over-plotting. We are free to further adjust the opacity, either by passing a default value to the `mark_*` method, or using a dedicated encoding channel.

Here we demonstrate how to provide a constant value to an encoding channel instead of binding a data field:

In [14]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5)
)

## Shape

The `shape` encoding channel sets the geometric shape used by `point` marks. Unlike the other channels we have seen so far, the `shape` channel cannot be used by other mark types. The shape encoding channel should only be used with nominal data, as perceptual rank-order and magnitude comparisons are not supported.

Let's encode the `cluster` field using `shape` as well as `color`. Using multiple channels for the same underlying data field is known as a *redundant encoding*. The resulting chart combines both color and shape information into a single symbol legend:

In [15]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Shape('cluster:N')
)

## Tooltips
By this point, you might feel a bit frustrated: we've built up a chart, but we still don't know what countries the visualized points correspond to! Let's add interactive tooltips to enable exploration.
Tooltips are the simplest form of interaction supported in Altair.
When the user hovers over a mark, a tooltip reveals additional data from the row that the mark represents.
Over the course of the term, you will be introduced to other ways to author interactions in Altair.

The `tooltip` encoding channel determines tooltip text to show when a user moves the mouse cursor over a mark. Let's add a tooltip encoding for the `country` field, then investigate which countries are being represented.


In [16]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Tooltip('country')
)

## Ordering

As you mouse around, you may notice that you cannot select some of the points. For example, the largest dark blue circle corresponds to India, which is drawn on top of a country with a smaller population, preventing the mouse from hovering over that country. To fix this problem, we can use the `order` encoding channel.

The `order` encoding channel determines the order of data points, affecting both the order in which they are drawn and, for `line` and `area` marks, the order in which they are connected to one another.

Let's order the values in descending rank order by the population (`pop`), ensuring that smaller circles are drawn later than larger circles:

In [17]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Tooltip('country:N'),
    alt.Order('pop:Q', sort='descending')
)

Now we can identify the smaller country being obscured by India: it's Bangladesh!

We can also now figure out what the `cluster` field represents. Mouse over the various colored points to formulate your own explanation.

At this point we've added tooltips that show only a single property of the underlying data record. To show multiple values, we can provide the `tooltip` channel an array of encodings, one for each field we want to include:

In [18]:
alt.Chart(data2000).mark_point(filled=True).encode(
    alt.X('fertility:Q'),
    alt.Y('life_expect:Q'),
    alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
    alt.Color('cluster:N'),
    alt.OpacityValue(0.5),
    alt.Order('pop:Q', sort='descending'),
    tooltip = [
        alt.Tooltip('country:N'),
        alt.Tooltip('fertility:Q'),
        alt.Tooltip('life_expect:Q')
    ]   
)

Now we can see multiple data fields upon mouse over!

## Customizing Appearance
In a later notebook we will take a deep dive into plot configurations.
Here is a brief introduction.
Using the comments to the right as a guide, you can change the appearance of the plot.


In [20]:
alt.Chart(data2000, title='My plot title').mark_point().encode(       # Change the title of the plot
    alt.X('fertility:Q', title='Fertility'),                          # Change the x-axis title
    alt.Y('life_expect:Q', scale=alt.Scale(zero=False)),              # Change the y-axis scale to not include zero
    alt.Size('pop:Q', scale=alt.Scale(range=(100, 1000))),            # Change the range (min, max) of the size scale to enlarge points
    alt.Tooltip('country')                                            # Add country name on hover
).configure_axis(
    labelFontSize=14,                                                 # Change the font size of the axis labels (the numbers)
    titleFontSize=20                                                  # Change the font size of the axis title
).configure_legend(
    titleFontSize=14                                                  # Change the font size of the legend title
).configure_title(
    fontSize=30                                                       # Change the font size of the chart title
)

## Summary
In this notebook, you saw how visual channels can be used to encode data. You were also given a glimpse into the role of interaction (via tooltips).
The only remaining building block for creating charts in Altair is graphical marks.
In the next three notebooks, we will dive deeper into creating visualizations by exploring various graphical marks.
At this point, you should be well-equipped to further explore the space of encodings. For a comprehensive reference, including features we've skipped, please see the Altair [encoding](https://altair-viz.github.io/user_guide/encoding.html) documentation.