# Tutorial 11. Color

Roughly 8% of men and 0.5% of women have red-green color blindness, the most common form of color vision deficiency. Designing accessible visualizations helps everyone interpret your charts more easily.

In this tutorial, we'll explore Altair's color encoding options.


## Learning Goals
Those who actively work through this notebook will be able to:
- Choose appropriate color schemes for your data
- Use pre-made and custom color schemes
- Selectively highlight and annotate data with color and text
- Directly label data instead of using legends

In [None]:
import pandas as pd
import altair as alt

!pip install vega_datasets palmerpenguins -q

from vega_datasets import data
from palmerpenguins import load_penguins



## Color Schemes

Altair offers many color schemes. See the full list in the [Vega documentation](https://vega.github.io/vega/docs/schemes/).
The schemes fall into three main categories:
- categorical
- sequential
- diverging

---
### Categorical
Altair's default categorical scheme is **"Tableau10"** – a 10-color palette starting with blue, orange, and red. Use this for discrete categories (like country names or product types).

Let's start by creating a scatter plot that we will reuse as we explore different color schemes.
<br>
**Specifications**
- Use the `circle` mark
- Encode `flipper_length_mm` on the `x` channel
- Encode `body_mass_g` on the `y` channel
- Encode `species` on the `color` channel
<br> **Styling**
- Set the width and height to 250
- For both position channels, set the scale so that it doesn't start at zero


In [None]:
penguins = load_penguins()

scatter_penguins = alt.Chart(penguins).mark_circle(size=40).encode(
    x=alt.X('flipper_length_mm', scale=alt.Scale(zero=False)),
    y=alt.Y('body_mass_g', scale=alt.Scale(zero=False)),
    color=alt.Color('species')).properties(width=250,height=250)

scatter_penguins

You can change the colormap (or colorscheme) by specifying its name as a string
to `scheme` inside `alt.Scale`.
[All the available colormaps can be viewed on this page](https://vega.github.io/vega/docs/schemes/),
which also lists what type of data the colormap is useful for
(categorical, sequential, diverging, cyclic).

### Sequential
For **numerical variables**, Altair automatically uses sequential colormaps that show progression from low to high values. Best practice: assign lighter colors (like light blue) to low values so they blend with the background, making high values stand out.


Let's reuse the `scatter_penguins` chart we created, but encode `flipper_length_mm` on the color channel to see what the default scheme is.


In [None]:
scatter_penguins.encode(color='flipper_length_mm')

`Viridis` is a well-researched color scheme,
originally developed for matplotlib and now used in many different places.
Compared to the default above,
it lets you see changes in detail slightly better
because it spans more hues,
which can also give rise to a very slight extra highlighting effect.

Let's re-use the base chart and specify the `viridis` color scheme. 


In [None]:
scatter_penguins.encode(color=alt.Color('flipper_length_mm', scale=alt.Scale(scheme='viridis')))

You can reverse a color scale
the same way we learned to reverse axis scales.

In [None]:
scatter_penguins.encode(color=alt.Color('flipper_length_mm', scale=alt.Scale(scheme='viridis', reverse=True)))

### Diverging
For variables with a **natural midpoint** (like correlations ranging from -1 to 1), use diverging colormaps. Sequential schemes make zero appear more important than -1, which distorts the data. Diverging schemes use contrasting colors on each side of the midpoint.

#### Correlation Heatmap
Let's wrangle the penguins dataset. We'll select only numerical columns, remove the year column, and compute the correlation matrix.


In [None]:
# Load dataframe and select numeric columns, then drop year
num_df = penguins.select_dtypes(include='number').drop(columns=['year'])

# Compute correlation matrix, stack to long form
corr_df = num_df.corr()
corr_long = corr_df.stack().reset_index(name='corr')
corr_long.head(5)
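
To see where the `level_0` and `level_1` column names used in the heatmap come from, here is a small sketch of the wide-to-long step on a made-up two-column DataFrame (the `toy` data is an illustrative assumption, not the penguins data).

```python
import pandas as pd

# Toy data for illustration only; `b` is exactly twice `a`.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})

toy_corr = toy.corr()  # wide form: a 2 x 2 correlation matrix

# stack() moves the column labels into a second index level;
# reset_index() turns the two unnamed index levels into the
# columns `level_0` and `level_1`, and names the values `corr`.
toy_long = toy_corr.stack().reset_index(name='corr')

print(list(toy_long.columns))  # ['level_0', 'level_1', 'corr']
print(len(toy_long))           # one row per pair of variables
```

Each row of the long form is one cell of the matrix, which is the shape `mark_rect` needs to draw one rectangle per variable pair.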

Now let's create the heatmap. 
- Encode `level_0` on the `x` channel
- Encode `level_1` on the `y` channel
- Encode `corr` on the `color` channel

In [None]:

penguin_heatmap = alt.Chart(corr_long).mark_rect().encode(
    x=alt.X('level_0:N', title='var1', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('level_1:N', title='var2'),
    color=alt.Color('corr:Q'),
    tooltip=[alt.Tooltip('level_0:N', title='var1'),
             alt.Tooltip('level_1:N', title='var2'),
             alt.Tooltip('corr:Q', format='.2f', title='corr')]
).properties(width=220, height=220)

penguin_heatmap

#### Making the Scale Diverging

The heatmap above uses the default sequential scheme.
Instead, we can choose a color scheme
that is more suitable for showing diverging values
and define the color domain manually.



In [None]:
penguin_heatmap.encode(
    color=alt.Color('corr', 
                    scale=alt.Scale(domain=(-1, 1), 
                                    scheme='purpleorange')))

#### Setting the Midpoint

In the previous chart, `domain=(-1, 1)` matches the full range a correlation can take.
An alternative to setting the color scheme explicitly
is to set `domainMid=0`,
in which case Altair understands that this is a diverging variable
with a natural midpoint and uses the default diverging color scheme.

- Encode `corr` on the `color` channel and make the scale diverging with `scale=alt.Scale(domain=[-1, 1], domainMid=0)`

In [None]:
penguin_heatmap.encode(color=alt.Color('corr:Q', scale=alt.Scale(domain=[-1, 1], domainMid=0)))

Altair provides access to many color schemes from different sources. For example, [ColorBrewer schemes](https://colorbrewer2.org/) are built in and widely used for categorical, sequential, and diverging data. Specify them by name in your color encoding (e.g., `'blues'`, `'redyellowgreen'`, `'set1'`).



## Highlighting

Use color to highlight specific elements, like the year with the highest wheat price below. Adding a text annotation with the exact value helps reinforce the highlight.

**Setting color in encoding vs. mark:**
To override color in a layered chart, set it in the `encoding`, not the `mark`. Since the base chart already has color encoding, it takes precedence over mark-level color settings. Use `alt.value()` to pass a literal color value instead of a column name.

**Removing visual clutter:**
Once you've annotated the exact values, gridlines become redundant, since their job is to help readers estimate values and compare distant elements. Remove them for a cleaner look. While chart outlines work well with gridlines (they blend together), they look awkward alone, so remove the outline too.

---

In [None]:
# Reduce the number of bars for clarity; copy so we can safely add a column
wheat = data.wheat().query('year > 1700').copy()

# Flag the year to be highlighted in a new boolean column
wheat['highlight'] = wheat['year'] == 1810

bars = alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y=alt.Y('wheat', axis=alt.Axis(grid=False)),
    color=alt.Color('highlight', legend=None))

(bars + bars.mark_text(dy=-5).encode(text='wheat', color=alt.value('black'))).configure_view(strokeWidth=0)

## Effect of Data Type on Color Scales

Altair automatically selects color schemes based on your data type. Customize these defaults using the `scale()` method in your color encoding.

Below, we'll visualize the same data three ways—encoding color as **quantitative**, **ordinal**, and **nominal**—to see how Altair adjusts the color scheme for each type.


In [None]:
cars = data.cars()
base = alt.Chart(cars).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
).properties(
    width=140,
    height=140
)

alt.hconcat(
   base.encode(color='Cylinders:Q').properties(title='quantitative'),
   base.encode(color='Cylinders:O').properties(title='ordinal'),
   base.encode(color='Cylinders:N').properties(title='nominal'),
)

## Color Domain and Range

Create custom color scales using the `domain` and `range` parameters in the `scale()` method. **Domain** specifies the data values, and **range** specifies the corresponding colors.

This works for **continuous scales** to highlight specific value ranges:


In [None]:
domain = [5, 8, 10, 12, 25]
range_ = ['#9cc8e2', '#9cc8e2', 'red', '#5ba3cf', '#125ca4']

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color=alt.Color('Acceleration').scale(domain=domain, range=range_)
)

It also works for **discrete scales**, where each category in the domain is mapped to the color at the same position in the range:

In [None]:
domain = ['Europe', 'Japan', 'USA']
range_ = ['seagreen', 'firebrick', 'rebeccapurple']

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color=alt.Color('Origin').scale(domain=domain, range=range_)
)

## Exercises

### **Exercise 1: Color Scheme Exploration (Easy)**

Using the `cars` dataset, create a scatter plot showing `Horsepower` vs `Miles_per_Gallon` with points colored by `Origin`.

Create **three versions side by side** using `alt.hconcat()`:
1. Default colors (no scheme specified)
2. Using the `'set2'` color scheme
3. Using the `'dark2'` color scheme

**Requirements:**
- Each chart should be 200x200 pixels
- Add a title to each chart indicating which color scheme is used
- All three charts should share the same x and y scales

**Hint:** Use a base chart and modify only the color encoding for each version.

---

### **Exercise 2: Temperature Diverging Heatmap (Easy-Medium)**

Create a heatmap showing temperature anomalies (deviations from a baseline).

First, create mock temperature anomaly data:
```python
import numpy as np
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = list(range(2015, 2025))
# Named `records` rather than `data` so the `vega_datasets` import is not overwritten
records = []
for year in years:
    for month in months:
        anomaly = np.random.uniform(-2.5, 2.5)  # Temperature anomaly in °C
        records.append({'year': year, 'month': month, 'anomaly': anomaly})
temp_df = pd.DataFrame(records)
```

**Requirements:**
- Use `mark_rect()` to create a heatmap
- Encode `month` on x-axis (in correct calendar order)
- Encode `year` on y-axis
- Use a diverging color scheme with 0 as the midpoint
- Color scheme should show negative anomalies in blue and positive in red
- Include tooltips showing year, month, and anomaly value
- Add an appropriate title

---

### **Exercise 3: Selective Species Highlighting (Medium)**

Using the `penguins` dataset, create a scatter plot that highlights **only the Gentoo species** in orange while showing the other two species in gray.

**Requirements:**
- Create a new column called `'is_gentoo'` that is `True` for Gentoo penguins and `False` for others
- Plot `bill_length_mm` vs `bill_depth_mm`
- Use conditional color encoding to color Gentoo in `'orange'` and others in `'lightgray'`
- Remove the color legend
- Add a text annotation near the Gentoo cluster identifying it
- Set appropriate chart dimensions

---

### **Exercise 4: Interactive Country Highlighter (Hard)**

Create an **interactive visualization** that allows users to select a country and see it highlighted across multiple views.

The OWID energy dataset is available here:
https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/owid_energy_clean.csv

Use the `owid_energy_clean.csv` dataset, restricted to the year with the most complete data.

**Requirements:**
- Create a selection parameter that allows clicking on countries
- Build **three linked charts**:
  1. **Scatter plot**: `gdp` vs `energy_per_capita` with countries colored by selection
  2. **Bar chart**: Top 15 countries by `primary_energy_consumption` with selected country highlighted
  3. **Bar chart**: Renewable vs fossil energy share for the selected country
  
- Selected countries should be shown in a bright color (e.g., `'red'` or `'orange'`)
- Non-selected countries should be muted (e.g., `'lightgray'`)
- Use `alt.hconcat()` or `alt.vconcat()` to arrange the charts
- Add tooltips to all charts showing relevant metrics
- Include clear titles and axis labels


#### **Bonus Challenge:**
Add a **custom color scale** where:
- Countries with high renewable energy share (>50%) use shades of green
- Countries with low renewable energy share (<20%) use shades of brown
- Countries in between use yellow/orange
- The selected country always appears in red regardless of its renewable share


## Summary

Visualization research includes a subfield at the intersection of perceptual psychology, cognitive science, and computer science that explores the impact of color on sense-making.
If you are interested in learning more about this topic, see the lecture slides on Color.

Read through the Altair [color documentation](https://altair-viz.github.io/user_guide/generated/core/altair.Color.html) to learn more about color specifications in Altair.
To explore different color schemes, visit:
- [https://colorbrewer2.org/](https://colorbrewer2.org/#)
- [https://carto.com/carto-colors/](https://carto.com/carto-colors/)

