<a href="https://colab.research.google.com/github/dataprogpy/dataprogpy.github.io/blob/main/starter_files/04_introduction_to_altair.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Data Visualization with Altair


## 1: Setup and Imports

Ensure necessary libraries are installed. In Google Colab, some common libraries are pre-installed. If running locally or in a different environment, you might need to run:

`!pip install altair vega_datasets polars`


In [2]:

import altair as alt
import polars as pl
from vega_datasets import data as vega_data # For loading sample datasets

# Enable the Altair renderer for Google Colab
# If you're using a different environment (like JupyterLab or a classic Notebook),
# you might need to change this. Examples:
# alt.renderers.enable('jupyterlab')
# alt.renderers.enable('notebook')
# alt.renderers.enable('default')
alt.renderers.enable('colab')

print("Libraries imported and Altair renderer enabled for Colab.")


Libraries imported and Altair renderer enabled for Colab.



### 1.1 Why Visualize Data? Anscombe's Quartet

Summary statistics can be informative, but they don't tell the whole story.

Anscombe's Quartet demonstrates this perfectly.


In [3]:
# Load Anscombe's Quartet from vega_datasets
anscombe_pd_df = vega_data.anscombe() # This loads as a Pandas DataFrame

# Convert to a Polars DataFrame (as we're using Polars in this course)
anscombe_pl_df = pl.from_pandas(anscombe_pd_df)

# Let's inspect the data structure
print("First few rows of Anscombe's Quartet (Polars DataFrame):")
print(anscombe_pl_df.head())

First few rows of Anscombe's Quartet (Polars DataFrame):
shape: (5, 3)
┌────────┬─────┬──────┐
│ Series ┆ X   ┆ Y    │
│ ---    ┆ --- ┆ ---  │
│ str    ┆ i64 ┆ f64  │
╞════════╪═════╪══════╡
│ I      ┆ 10  ┆ 8.04 │
│ I      ┆ 8   ┆ 6.95 │
│ I      ┆ 13  ┆ 7.58 │
│ I      ┆ 9   ┆ 8.81 │
│ I      ┆ 11  ┆ 8.33 │
└────────┴─────┴──────┘


In [4]:

# Now, let's calculate key summary statistics for each dataset within the quartet.
# We'll group by the 'Series' column, which identifies each dataset.

summary_stats = anscombe_pl_df.group_by("Series").agg(
    pl.mean("X").alias("Mean_X"),
    pl.std("X").alias("StdDev_X"),
    pl.mean("Y").alias("Mean_Y"),
    pl.std("Y").alias("StdDev_Y"),
    pl.corr("X", "Y").alias("Correlation_XY")
).sort("Series") # Sort for consistent display

print("\nSummary Statistics for Anscombe's Quartet:")
print(summary_stats)



Summary Statistics for Anscombe's Quartet:
shape: (4, 6)
┌────────┬────────┬──────────┬──────────┬──────────┬────────────────┐
│ Series ┆ Mean_X ┆ StdDev_X ┆ Mean_Y   ┆ StdDev_Y ┆ Correlation_XY │
│ ---    ┆ ---    ┆ ---      ┆ ---      ┆ ---      ┆ ---            │
│ str    ┆ f64    ┆ f64      ┆ f64      ┆ f64      ┆ f64            │
╞════════╪════════╪══════════╪══════════╪══════════╪════════════════╡
│ I      ┆ 9.0    ┆ 3.316625 ┆ 7.5      ┆ 2.03289  ┆ 0.816186       │
│ II     ┆ 9.0    ┆ 3.316625 ┆ 7.500909 ┆ 2.031657 ┆ 0.816237       │
│ III    ┆ 9.0    ┆ 3.316625 ┆ 7.5      ┆ 2.030424 ┆ 0.816287       │
│ IV     ┆ 9.0    ┆ 3.316625 ┆ 7.500909 ┆ 2.030579 ┆ 0.816521       │
└────────┴────────┴──────────┴──────────┴──────────┴────────────────┘


**Note: The summary statistics (mean, std dev, correlation) are nearly identical for all four datasets!**

### 1.2 Visualizing Anscombe's Quartet with Altair

 Now, let's see what these datasets *look* like.
 We will create a scatter plot for each dataset.

 **The core Altair syntax: `alt.Chart(data).mark_type().encode(visual_channels)`**

 We specify the data type for X and Y as Quantitative ('Q') using a colon.

 This helps Altair apply appropriate scales and axes.
 Example: `x='X:Q'`


In [5]:
def dataset_mapper(val: str) -> str:
  dataset_map = {
    'I': 'Linear',
    'II': 'Non-linear',
    'III': 'Linear with outlier',
    'IV': 'Vertical line with outlier'
  }
  return dataset_map.get(val, 'Unknown')

anscombe = anscombe_pl_df.select(
    pl.col('X', 'Y'),
    pl.col('Series')
      .map_elements(dataset_mapper, return_dtype=pl.String)
)

(
    alt.Chart(anscombe)
    .mark_point(size=60)
    .encode(
        alt.X('X:Q', scale=alt.Scale(domain=[0, 20])),
        alt.Y('Y:Q', scale=alt.Scale(domain=[0, 14])),
        alt.Color('Series:N')
          .legend(title="Dataset"),
        alt.Facet('Series:N')
        .columns(2)
        .title(None)
    )
    .properties(
        title="Anscombe's Quartet",
        width=200,
        height=200,
    )
)

How does the visualization compare with the summary statistics?
- Dataset I: Appears to be a linear relationship.
- Dataset II: Shows a clear non-linear (curved) relationship.
- Dataset III: A linear relationship with a significant outlier.
- Dataset IV: Most X values are constant, with one influential outlier.

Visualization helps identify patterns, anomalies, and guides further analysis.



### 1.3 Understanding Altair's Declarative Nature & Core Idea

Recall: `object = data + behavior`

You've just used this syntax to create the Anscombe plots!
- `alt.Chart(anscombe)`: Create a `Chart` object with `anscombe` **data**.
- `.mark_point()`: Ask the chart object to use `point` as the **mark** type.
- `.encode(x='X:Q', y='Y:Q')`: Ask the chart to **encode** the `X` and `Y` columns,
  treated as **Q**uantitative data, as the visual properties (channels) `x-position` and `y-position`.

Every chart you create in Altair follows this basic structure. This consistent structure is key to Altair's power and ease of use.
By focusing on what you want to represent, you can build a wide variety of charts.

In [6]:
from vega_datasets import data as vega_data

# Load as Pandas DataFrame first
cars_pd_df = vega_data.cars()

# Convert to Polars DataFrame
cars_pl_df = pl.from_pandas(cars_pd_df)


In [7]:
# Always a good idea to inspect your data
cars_pl_df.head()
cars_pl_df.shape
cars_pl_df.null_count() # Check for missing values

Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
str,f64,i64,f64,f64,i64,f64,datetime[ns],str
"""chevrolet chevelle malibu""",18.0,8,307.0,130.0,3504,12.0,1970-01-01 00:00:00,"""USA"""
"""buick skylark 320""",15.0,8,350.0,165.0,3693,11.5,1970-01-01 00:00:00,"""USA"""
"""plymouth satellite""",18.0,8,318.0,150.0,3436,11.0,1970-01-01 00:00:00,"""USA"""
"""amc rebel sst""",16.0,8,304.0,150.0,3433,12.0,1970-01-01 00:00:00,"""USA"""
"""ford torino""",17.0,8,302.0,140.0,3449,10.5,1970-01-01 00:00:00,"""USA"""


(406, 9)

Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
u32,u32,u32,u32,u32,u32,u32,u32,u32
0,8,0,0,6,0,0,0,0



## 2: Your First Plots with Altair

### 2.0 Load and Prepare the Dataset for Visualization

In [4]:
# Load the 'cars' dataset from vega_datasets
cars_pd = vega_data.cars()

# Convert to a Polars DataFrame
cars_pl = pl.from_pandas(cars_pd)

print("Cars dataset loaded into a Polars DataFrame:")

cars_pl.head()

print(f"\nShape of the cars dataset: {cars_pl.shape}")

# Quick check for missing values (important before plotting some fields)
print("\nNull counts per column:")
cars_pl.null_count()


# For some plots, it's useful to drop rows with missing values in key columns.
# For example, if plotting Horsepower vs. Miles_per_Gallon:
cars_pl_cleaned = cars_pl.drop_nulls(subset=['Horsepower', 'Miles_per_Gallon', 'Origin', 'Name'])

print(f"\nShape after dropping some nulls: {cars_pl_cleaned.shape}")

Cars dataset loaded into a Polars DataFrame:

Shape of the cars dataset: (406, 9)

Null counts per column:

Shape after dropping some nulls: (392, 9)


### 2.1 Scatter Plots: Exploring Relationships

- **Scatter plots** help visualize the relationship between two quantitative variables.
- **Mark**: `mark_point()` or `mark_circle()`
- **Encodings**:
  - `x`,
  - `y`,
  - `color`,
  - `size`,
  - `tooltip`



#### **Student Task**:
1. Modify the scatter plot below to show 'Acceleration:Q' on the x-axis and 'Weight_in_lbs:Q' on the y-axis.
2. Keep the color encoding by 'Origin'.
3. Update the tooltips and title appropriately.
4. Comment out the code that is not relevant for `Acceleration` and `Weight_in_lbs`

In [149]:
# Data prep
mpg_col = pl.col('Miles_per_Gallon')
MPG_summary = cars_pl_cleaned.select(
    mpg_col.mean().alias('Mean_MPG'),
    mpg_col.median().alias('Median_MPG'),
    mpg_col.min().alias('Min_MPG'),
    mpg_col.max().alias('Max_MPG'),
    mpg_col.lt(mpg_col.median()).sum().alias('Below_Median'),
    mpg_col.lt(mpg_col.mean()).sum().alias('Below_Mean')
).to_dicts()[0]

# Incremental chart build

base = alt.Chart(cars_pl_cleaned)

points = base.mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N',
    tooltip=['Name:N', 'Horsepower:Q', 'Miles_per_Gallon:Q']
)

median_rule = base.mark_rule(color="lightpink", strokeDash=[2,2], opacity=0.5).encode(  # "lightpink" is the valid CSS color name
    y = alt.datum(MPG_summary['Median_MPG']),
)

max_rule = base.mark_rule(color="green", strokeDash=[2,2], opacity=0.5).encode(
    y = alt.datum(MPG_summary['Max_MPG']),
)

median_text = base.mark_text(
      dx=650,
      dy=5,
      fontSize=12,
      fontWeight='normal'
    ).encode(
    y = alt.datum(MPG_summary['Median_MPG']),
    x = alt.datum(0),
    text = alt.value('Median MPG'),

)

# Chart display
(points + median_rule + max_rule + median_text).properties(
    title='Car Horsepower vs. Miles per Gallon by Origin',
    width=600,
    height=400
)

In [None]:
# Your code here

### 2.2 Bar Charts: Comparing Categories or Showing Counts

- **Bar charts** compare a quantitative measure across different categories.
- **Mark**: `mark_bar()`
- **Encodings**:
  - `x` (categorical),
  - `y` (quantitative/aggregate),
  - `color`

In [154]:
# Example 1: Number of cars from each 'Origin' (Frequency)
(
    alt.Chart(cars_pl_cleaned)
    .mark_bar()
    .encode(
      x='Origin:N',
      y='count():Q', # Altair's way to count occurrences
      color='Origin:N', # Optional: color bars by origin
      tooltip=['Origin:N', 'count():Q']
    )
    .properties(
      title='Number of Cars by Origin',
      width=300,
      height=300
    )
)

In [156]:
# Example 2: Average 'Horsepower' for each 'Origin'.
# Altair can do simple aggregations like 'average', 'sum', 'min', 'max'.
alt.Chart(cars_pl_cleaned).mark_bar().encode(
    x='Origin:N',
    y='average(Horsepower):Q',
    color='Origin:N',
    tooltip=['Origin:N', 'average(Horsepower):Q']
).properties(
    title='Average Horsepower by Origin',
    height=300,
    width=300
)

In [158]:
# Alternatively, pre-aggregate with Polars for more complex scenarios or clarity:
avg_hp_origin_polars = cars_pl_cleaned.group_by('Origin').agg(
    pl.mean('Horsepower').alias('Mean_Horsepower')
).sort('Origin')

alt.Chart(avg_hp_origin_polars).mark_bar().encode(
    x='Origin:N',
    y='Mean_Horsepower:Q',
    color='Origin:N', # Or a fixed color: alt.value('steelblue')
    tooltip=['Origin:N', 'Mean_Horsepower:Q']
).properties(
    title='Average Horsepower by Origin (Polars Pre-aggregated)',
    height=300,
    width=300
)



#### Student Task:

Create a bar chart showing the average 'Displacement' for cars with different numbers of 'Cylinders'.
  - X-axis: 'Cylinders:O' (Ordinal, as cylinder count has an order)
  - Y-axis: Average 'Displacement'
  - Pre-aggregate the data using Polars.
  - Add appropriate tooltips and a title.


In [None]:
# YOUR CODE HERE for Student Task (Bar Chart)


### 2.3 Line Charts: Showing Trends Over Time or Sequence

- **Line charts** are excellent for visualizing trends.
- **Mark**: `mark_line()`
- **Encodings**:
  - `x` (temporal or ordered)
  - `y` (quantitative)
  - `color` (for multiple series)


In [168]:
# Load the 'seattle-weather' dataset
weather_pd = vega_data.seattle_weather()
weather_pl = pl.from_pandas(weather_pd)
weather_pl.head()

date,precipitation,temp_max,temp_min,wind,weather
datetime[ns],f64,f64,f64,f64,str
2012-01-01 00:00:00,0.0,12.8,5.0,4.7,"""drizzle"""
2012-01-02 00:00:00,10.9,10.6,2.8,4.5,"""rain"""
2012-01-03 00:00:00,0.8,11.7,7.2,2.3,"""rain"""
2012-01-04 00:00:00,20.3,12.2,5.6,4.7,"""rain"""
2012-01-05 00:00:00,1.3,8.9,2.8,6.1,"""rain"""


In [166]:

alt.Chart(weather_pl).mark_line().encode(
    x='date:T',
    y='temp_max:Q',
    color='weather:N', # Different line for each weather type
    tooltip=['date:T', 'temp_max:Q', 'weather:N']
).properties(
    title='Maximum Daily Temperature in Seattle by Weather Condition',
    width=600,
    height=350
)

#### Student Task:

Using the `seattle_weather` dataset:
1. Create a line chart showing the 'precipitation' over 'date'.
2. Only include data for days where precipitation was greater than 0. (Hint: Filter with Polars first).
3. Do NOT color by weather type for this one (i.e., a single line).
4. Add appropriate tooltips and a title.


In [169]:
# YOUR CODE HERE

### 2.4 Histograms: Visualizing Distributions

- **Histograms** show the distribution of a single quantitative variable.
- **Mark**: `mark_bar()`
- **Encodings**:
  - `x` (quantitative, binned)
  - `y` (count)


In [159]:
# Use `alt.X()` for more control over binning.
alt.Chart(cars_pl_cleaned).mark_bar().encode(
    alt.X(
        'Miles_per_Gallon:Q',
        bin=alt.Bin(maxbins=15), # Explicitly cap the number of bins
        title='Miles per Gallon'),
    y='count():Q',
    tooltip=[alt.Tooltip('Miles_per_Gallon:Q', bin=True), 'count():Q'] # Show binned range in tooltip
).properties(
    title='Distribution of Miles per Gallon',
    width=400
)


#### Student Task:
 1. Create a histogram for the 'Horsepower' column from the `cars_pl_cleaned` DataFrame.
 2. Experiment with the `maxbins` parameter (e.g., 10, 20, 30) to see how it changes the plot.
 3. Add appropriate tooltips and a title.

In [None]:
# Your code goes here

### 2.5 Workflow Tip: Incremental Development

  When building visualizations:
  1. Start with the most basic chart: `alt.Chart(data).mark_type().encode(x=..., y=...)`
  2. Gradually add complexity: color, size, tooltips, specific binning, etc.
  3. Refine aesthetics and properties: titles, labels, chart width/height.

This makes it easier to understand how each piece contributes and to debug if something goes wrong.

**Example: Building a scatter plot incrementally**

**Step 1**: Basic scatter
```py
chart_step1 = alt.Chart(cars_pl_cleaned).mark_point().encode(
    x='Displacement:Q',
    y='Acceleration:Q'
)
chart_step1.display()
```

**Step 2**: Add color by Origin
```py
chart_step2 = alt.Chart(cars_pl_cleaned).mark_point().encode(
    x='Displacement:Q',
    y='Acceleration:Q',
    color='Origin:N'
)
chart_step2.display()
```

**Step 3**: Add tooltips and a title
```py
chart_step3 = alt.Chart(cars_pl_cleaned).mark_point().encode(
    x='Displacement:Q',
    y='Acceleration:Q',
    color='Origin:N',
    tooltip=['Name:N', 'Displacement:Q', 'Acceleration:Q']
).properties(
    title='Car Displacement vs. Acceleration by Origin',
    width=500
)
chart_step3.display()
```

In [10]:
alt.Chart(
    cars_pl_df.drop_nulls(subset=['Horsepower'])
    ).mark_bar().encode(
    alt.X(
        'Horsepower:Q',
        bin=alt.Bin(maxbins=20),
        title='Horsepower Bins',
        ),
    alt.Y('count():Q'),
    tooltip=[alt.Tooltip('Horsepower:Q', bin=True), 'count():Q']
).properties(
    title='Distribution of Car Horsepower',
    width=400
)

## 3: Enhancing Your Visualizations


### 3.1 Chart Properties: Titles and Sizing

**Let's start with a basic scatter plot from the previous section.**


In [7]:
base_scatter = alt.Chart(cars_pl).mark_circle(size=30, opacity=0.7).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Origin:N',
    tooltip=['Name:N', 'Horsepower:Q', 'Miles_per_Gallon:Q']
)

# Add a title and set a specific width and height
scatter_with_properties = base_scatter.properties(
    title='Enhanced: Car Horsepower vs. MPG by Origin',
    width=500,
    height=300
)

print("\nScatter Plot with Title and Custom Size:")
scatter_with_properties


Scatter Plot with Title and Custom Size:



### 3.2 Customizing Axes: Labels, Formatting

- Default axis titles are just the column names. Let's make them more descriptive.
- We use `alt.X()` and `alt.Y()`, and pass `alt.Axis()` to the `axis` argument.


In [10]:

scatter_custom_axes = alt.Chart(cars_pl).mark_circle(size=60, opacity=0.7).encode(
    x=alt.X('Horsepower:Q',
            axis=alt.Axis(title='Vehicle Horsepower (HP)', grid=True), # Custom title, ensure grid is on
            scale=alt.Scale(zero=False) # Don't necessarily start axis at zero if data is far from it
           ),
    y=alt.Y('Miles_per_Gallon:Q',
            axis=alt.Axis(title='Fuel Efficiency (Miles per Gallon)', format='~s'), # SI unit format
            scale=alt.Scale(zero=False)
           ),
    color='Origin:N',
    tooltip=['Name:N', 'Horsepower:Q', 'Miles_per_Gallon:Q']
).properties(
    title='Horsepower vs. MPG with Custom Axes',
    width=500,
    height=300
)

print("\nScatter Plot with Custom Axis Titles and Formatting:")
scatter_custom_axes



Scatter Plot with Custom Axis Titles and Formatting:



#### Student Task:

**Axis Customization**

1. Take the Average Horsepower by Origin bar chart from Section 2.2.
   (It is recreated below as `avg_hp_by_origin_base`, based on `cars_pl`.)
2. Customize the Y-axis title to "Average Horsepower (HP)".
3. Customize the X-axis title to "Country of Origin".
4. Add a main title: "Vehicle Power Comparison by Origin".


In [11]:

# Recreate the base chart for the task:
avg_hp_by_origin_base = alt.Chart(cars_pl).mark_bar().encode(
    x='Origin:N',
    y='average(Horsepower):Q',
    color='Origin:N',
    tooltip=['Origin:N', 'average(Horsepower):Q']
)


In [12]:

# YOUR CODE HERE for Student Task (Axis Customization)



### 3.3 Working with Color: Schemes and Scales

- Using a different color scheme for the 'Origin' nominal variable.
- The `scale` property within an encoding channel (like alt.Color) allows this.


In [18]:
scatter_custom_color_scheme = alt.Chart(cars_pl).mark_circle(size=60, opacity=0.7).encode(
    x=alt.X('Horsepower:Q', axis=alt.Axis(title='Vehicle Horsepower (HP)')),
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Fuel Efficiency (MPG)')),
    color=alt.Color('Origin:N', scale=alt.Scale(scheme='tableau10')), # Using 'tableau10' scheme
    tooltip=['Name:N', 'Horsepower:Q', 'Miles_per_Gallon:Q']
).properties(
    title='Scatter Plot with "tableau10" Color Scheme',
    width=500,
    height=300
)

print("\nScatter Plot with Custom Color Scheme:")
scatter_custom_color_scheme



Scatter Plot with Custom Color Scheme:


In [20]:

# Example with a quantitative color scale (e.g., color by 'Acceleration')
scatter_quantitative_color = alt.Chart(cars_pl).mark_circle(size=60, opacity=0.8).encode(
    x=alt.X('Horsepower:Q', axis=alt.Axis(title='Vehicle Horsepower (HP)')),
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Fuel Efficiency (MPG)')),
    color=alt.Color('Acceleration:Q', scale=alt.Scale(scheme='viridis')), # Sequential scheme for quantitative
    tooltip=['Name:N', 'Horsepower:Q', 'Miles_per_Gallon:Q', 'Acceleration:Q']
).properties(
    title='Color Encoded by "Acceleration" (Quantitative)',
    width=500,
    height=300
)
print("\nScatter Plot with Quantitative Color Scheme:")
scatter_quantitative_color



Scatter Plot with Quantitative Color Scheme:



### 3.4 Informative Tooltips

**Tooltips can be customized for better readability and more information.**


In [14]:
scatter_enhanced_tooltips = alt.Chart(cars_pl).mark_circle(size=60, opacity=0.7).encode(
    x=alt.X('Horsepower:Q', axis=alt.Axis(title='Vehicle Horsepower (HP)')),
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Fuel Efficiency (MPG)')),
    color=alt.Color('Origin:N', scale=alt.Scale(scheme='category10')),
    tooltip=[
        alt.Tooltip('Name:N', title='Car Model'), # Custom title for 'Name'
        alt.Tooltip('Origin:N', title='Origin Country'),
        alt.Tooltip('Horsepower:Q', title='HP', format='.0f'), # Format as integer
        alt.Tooltip('Miles_per_Gallon:Q', title='MPG', format='.1f'), # Format to one decimal place
        alt.Tooltip('Year:T', title='Manufacture Year', format='%Y') # Format year
    ]
).properties(
    title='Scatter Plot with Enhanced Tooltips',
    width=500,
    height=300
)

print("\nScatter Plot with Enhanced Tooltips:")
scatter_enhanced_tooltips



Scatter Plot with Enhanced Tooltips:



### 3.5 Saving Your Charts

- You can save your Altair charts to HTML, PNG, SVG, etc.
- HTML is interactive and usually works without extra setup.
- PNG/SVG export needs an extra package: recent Altair versions ask for `vl-convert-python`, as the error message in the output below shows.
- The code below saves the chart with enhanced tooltips to an HTML file.
- This file will be saved in your Colab environment's file system.
- You can then download it from the file pane on the left.

In [17]:
scatter_enhanced_tooltips.save('cars_scatter_plot.html')
print("\nSaving'cars_scatter_plot.html'. Check your Colab files.\n")

# Saving as PNG (requires additional setup, e.g. !pip install vl-convert-python)
try:
  scatter_enhanced_tooltips.save('cars_scatter_plot.png')
  print("Chart saved to 'cars_scatter_plot.png'.")
except Exception as e:
  print(f"Could not save as PNG directly. Error: {e}")
  print("You might need to install a driver like 'altair_viewer' or 'vl-convert'.")
  print("Alternatively, in Colab, you can often right-click the displayed chart and 'Save image as...'")



Saving'cars_scatter_plot.html'. Check your Colab files.

Could not save as PNG directly. Error: Saving charts in 'png' format requires the vl-convert-python package: see https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format
You might need to install a driver like 'altair_viewer' or 'vl-convert'.
Alternatively, in Colab, you can often right-click the displayed chart and 'Save image as...'



### 3.6 Incremental Development Exercise

**Task: Create a polished histogram of 'Acceleration' from the `cars_pl` dataset.**

 Follow these incremental steps:


 1. Base Histogram:
    - Data: `cars_pl`
    - Mark: `mark_bar()`
    - Encodings:
        - X-axis: 'Acceleration:Q', binned (e.g., `maxbins=15`)
        - Y-axis: 'count():Q'
    Display this base chart.


In [None]:
# YOUR CODE for Step 1
base_accel_hist = alt.Chart(cars_pl).mark_bar().encode(
# Your encodings here
)
base_accel_hist



 2. Add Properties:
    - Add a title: "Distribution of Vehicle Acceleration".
    - Set width to 400.
    Display this chart.


In [None]:

# YOUR CODE for Step 2
accel_hist_props = base_accel_hist.properties(
# your properties here
)
accel_hist_props



 3. Customize Axes and Tooltips:
    - X-axis: Title "Acceleration (0-60 mph time)", format values with one decimal place (e.g., `format='.1f'`).
    - Y-axis: Title "Number of Car Models".
    - Tooltips: Show the binned 'Acceleration' range and the count.
    Display this chart.


In [None]:
# YOUR CODE for Step 3
accel_hist_final = alt.Chart(cars_pl).mark_bar().encode(
# encode data here
 ).properties(
# specify properties here
 )
accel_hist_final


In [None]:
#4. (Optional) Save your final polished histogram as an HTML file.
accel_hist_final.save('acceleration_histogram.html')

## 4: Interactive Visualization


### 4.0 Setup

In [27]:
# Load the 'cars' dataset again for these examples
cars_pd = vega_data.cars()
cars_pl = pl.from_pandas(cars_pd).drop_nulls(
    subset=['Horsepower', 'Miles_per_Gallon', 'Origin', 'Name', 'Cylinders']
)

# And the 'movies' dataset for a different example
movies_pd = vega_data.movies()

In [35]:
print(movies_pd.dtypes)

Title                      object
US_Gross                  float64
Worldwide_Gross           float64
US_DVD_Sales              float64
Production_Budget         float64
Release_Date               object
MPAA_Rating                object
Running_Time_min          float64
Distributor                object
Source                     object
Major_Genre                object
Creative_Type              object
Director                   object
Rotten_Tomatoes_Rating    float64
IMDB_Rating               float64
IMDB_Votes                float64
dtype: object


In [36]:
object_cols = movies_pd.select_dtypes(include='object').columns
for col in object_cols:
    movies_pd[col] = movies_pd[col].astype(str)
movies_pl = pl.from_pandas(movies_pd).drop_nulls(
    subset=['IMDB_Rating', 'Rotten_Tomatoes_Rating', 'Major_Genre', 'Release_Date']
)
movies_pl.select(pl.col('Release_Date')).head()

Release_Date
str
"""Oct 09 1998"""
"""Jul 01 1986"""
"""Dec 31 2046"""
"""Oct 07 1963"""
"""Dec 11 1968"""


In [38]:
movies_pl = movies_pl.with_columns(
    # Parse the string date "Oct 09 1998" into a Datetime type
    # Then extract the year as an integer (Int32 is sufficient for years)
    pl.col("Release_Date").str.strptime(pl.Datetime, "%b %d %Y").dt.year().alias("Release_Year")
)
movies_pl.head()

Title,US_Gross,Worldwide_Gross,US_DVD_Sales,Production_Budget,Release_Date,MPAA_Rating,Running_Time_min,Distributor,Source,Major_Genre,Creative_Type,Director,Rotten_Tomatoes_Rating,IMDB_Rating,IMDB_Votes,Release_Year
str,f64,f64,f64,f64,str,str,f64,str,str,str,str,str,f64,f64,f64,i32
"""Slam""",1009819.0,1087521.0,,1000000.0,"""Oct 09 1998""","""R""",,"""Trimark""","""Original Screenplay""","""Drama""","""Contemporary Fiction""","""None""",62.0,3.4,165.0,1998
"""Pirates""",1641825.0,6341825.0,,40000000.0,"""Jul 01 1986""","""R""",,"""None""","""None""","""None""","""None""","""Roman Polanski""",25.0,5.8,3275.0,1986
"""Duel in the Sun""",20400000.0,20400000.0,,6000000.0,"""Dec 31 2046""","""None""",,"""None""","""None""","""None""","""None""","""None""",86.0,7.0,2906.0,2046
"""Tom Jones""",37600000.0,37600000.0,,1000000.0,"""Oct 07 1963""","""None""",,"""None""","""None""","""None""","""None""","""None""",81.0,7.0,4035.0,1963
"""Oliver!""",37402877.0,37402877.0,,10000000.0,"""Dec 11 1968""","""None""",,"""Sony Pictures""","""None""","""Musical""","""None""","""None""",84.0,7.5,9111.0,1968


### 4.1 Defining Selections: `selection_point` and `selection_interval`

**Selections define how users can interact with the chart.**



**Example 1:** Single point selection (e.g., on click)
  - We'll use this to highlight a point or a group of related points.
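  - `empty='none'` means if nothing is explicitly selected, the selection is empty.
  - `empty='all'` means if nothing is explicitly selected, everything is considered part of the selection.
  - Altair 5 prefers the boolean forms `empty=False` and `empty=True`; the string forms used in this notebook are older spellings and may trigger a deprecation warning.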

In [39]:
cars_origin_selection = alt.selection_point( # selection_point can select multiple points with shift-click
    fields=['Origin'], # Clicking a point will select all points with the same 'Origin'
    bind='legend', # Bind selection to legend clicks as well
    empty='all' # if nothing is selected, all are considered "selected" for initial state
)


**Example 2:** Interval selection (brushing)
- This allows selecting a range of data by dragging a rectangle.
- We can specify which channels the brush applies to (e.g., 'x', 'y', or both).

In [46]:
rating_brush = alt.selection_interval(
    encodings=['x',], # Allow brushing along the x-axis (IMDB Rating)
    empty='all'
)

### 4.2 Using Selections: Conditional Encodings for Highlighting

 We use `alt.condition(selection, value_if_true, value_if_false)`
 to change visual properties based on selection status.

 **Click Interactivity**: Highlight cars by 'Origin' on click (using the legend binding)
 - Create a scatter plot of Horsepower vs. Miles_per_Gallon.
 - Points belonging to the selected 'Origin' (chosen by clicking the legend) will be nearly opaque (opacity 0.9).
 - Others will be mostly transparent (opacity 0.1).

In [41]:
scatter_highlight_origin = alt.Chart(cars_pl).mark_circle(size=80).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.Color('Origin:N', legend=alt.Legend(title='Click Legend to Select Origin')),
    opacity=alt.condition(cars_origin_selection, alt.value(0.9), alt.value(0.1)), # 0.9 if selected, 0.1 if not
    tooltip=['Name:N', 'Origin:N', 'Horsepower:Q', 'Miles_per_Gallon:Q']
).add_params(
    cars_origin_selection # Add the selection definition to the chart
).properties(
    title='Click Legend: Highlight Cars by Origin',
    width=500,
    height=300
)

print("\nScatter Plot with Legend-based Highlighting:")
scatter_highlight_origin
# Try clicking on 'USA', 'Europe', 'Japan' in the legend.


Scatter Plot with Legend-based Highlighting:


**Brush Interactivity**: Using an interval selection to change color

- Scatter plot of IMDB Rating vs. Rotten Tomatoes Rating for movies.
- Use `rating_brush` to select a range of IMDB ratings.
- Selected points will be colored 'steelblue', others 'lightgray'.

In [47]:
movies_scatter_brush_color = alt.Chart(movies_pl.sample(n=500, seed=42)).mark_circle(size=60).encode(
    x=alt.X('IMDB_Rating:Q', scale=alt.Scale(domain=[0,10])),
    y=alt.Y('Rotten_Tomatoes_Rating:Q', scale=alt.Scale(domain=[0,100])),
    color=alt.condition(rating_brush, alt.value('steelblue'), alt.value('lightgray')),
    tooltip=['Title:N', 'IMDB_Rating:Q', 'Rotten_Tomatoes_Rating:Q', 'Major_Genre:N']
).add_params(
    rating_brush # Add the interval selection
).properties(
    title='Brush on X-axis (IMDB Rating) to Highlight Movies',
    width=500,
    height=350
)
print("\nScatter Plot with Interval Brush for Color Change:")
movies_scatter_brush_color
# Try clicking and dragging along the x-axis (IMDB Rating).


Scatter Plot with Interval Brush for Color Change:


#### Student Task: Conditional Sizing
1. Create a new single point selection called `click_select_point`. This time, don't specify `fields` or `encodings`.
   This means it will select individual data points on click. Set `empty='none'`.
2. Create a scatter plot of 'Acceleration' vs. 'Weight_in_lbs' from the `cars_pl` dataset.
3. Use `alt.condition()` with `click_select_point` to make the clicked point `size=200` and other points `size=50`.
4. Color points by 'Cylinders:O' (Ordinal, as it has an order).
5. Add tooltips for 'Name', 'Acceleration', 'Weight_in_lbs', 'Cylinders'.
6. Add the selection to the chart and give it an appropriate title.


In [None]:
# YOUR CODE HERE for Student Task (Conditional Sizing)


In [52]:
# @title
# Example Solution:
click_select_point = alt.selection_point(empty='none') # Note: selection_point instead of selection_single for multi-select with shift

scatter_conditional_size = alt.Chart(cars_pl).mark_circle(opacity=0.7).encode(
    x='Acceleration:Q',
    y='Weight_in_lbs:Q',
    color='Cylinders:O', # Ordinal for color scale if desired, or Nominal
    size=alt.condition(click_select_point, alt.value(200), alt.value(50)),
    tooltip=['Name:N', 'Acceleration:Q', 'Weight_in_lbs:Q', 'Cylinders:O']
).add_params(
    click_select_point
).properties(
    title='Click a Point to Enlarge It',
    width=500,
    height=350
)
scatter_conditional_size

### 4.3 Using Selections: Filtering Data (Simple Linked View - Optional/Advanced)
- Selections can filter data for the same chart or other charts.
- `.transform_filter(selection_name)`

Example: A bar chart showing average MPG for car origins,
linked to a scatter plot where you can select origins via the legend.
We will reuse `cars_origin_selection` which is bound to the legend.

#### Define Data

In [53]:
# Source data shared by the scatter plot (the controller) and the bar chart
source_data = cars_pl.filter(pl.col('Miles_per_Gallon').is_not_null())

#### Define Selection

In [54]:
cars_origin_selection = alt.selection_point(fields=['Origin'], bind='legend', empty='all')

#### Define Base Chart

In [55]:
scatter_controller = alt.Chart(source_data).mark_circle(size=80).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.Color('Origin:N', legend=alt.Legend(title='Click Legend to Filter Bar Chart')),
    opacity=alt.condition(cars_origin_selection, alt.value(0.9), alt.value(0.1)),
    tooltip=['Name:N', 'Origin:N']
).add_params(
    cars_origin_selection # This selection will drive the filter
).properties(
    width=350,
    height=250,
    title='Controller: Click Legend'
)

#### Define Projection

In [56]:
# Bar chart (will be filtered by the selection from the scatter plot's legend)
bar_filtered = alt.Chart(source_data).mark_bar().encode(
    x='Origin:N',
    y='average(Miles_per_Gallon):Q',
    color='Origin:N',
    tooltip=['Origin:N', 'average(Miles_per_Gallon):Q']
).transform_filter(
    cars_origin_selection # Filter this chart based on the selection
).properties(
    width=250,
    height=250,
    title='Filtered: Avg MPG by Origin'
)

#### Link Base and Projection Charts

In [57]:
linked_chart_example = scatter_controller | bar_filtered # Display side-by-side

print("\nLinked View Example (Legend Click Filters Bar Chart):")
linked_chart_example
# Click items in the legend of the scatter plot. Observe how the bar chart updates.
# Hold the SHIFT key to select multiple origins from the legend.
# If nothing is selected in the legend, the bar chart shows all origins, because an empty selection matches everything.
# With empty=False (empty='none' in older Altair versions), the bar chart would start out empty.


Linked View Example (Legend Click Filters Bar Chart):


Key points to remember about linked views:
- The `cars_origin_selection` is defined once and added to `scatter_controller`.
- The `bar_filtered` chart uses `.transform_filter(cars_origin_selection)` to react to changes in that selection state.
- This is a basic example. More complex dashboards can be built by linking multiple charts and selections. See the Altair gallery for more examples: [https://altair-viz.github.io/gallery/index.html#gallery-category-interactive-charts](https://altair-viz.github.io/gallery/index.html#gallery-category-interactive-charts).
- For this course, understanding conditional encoding and simple filtering is the main goal for interactivity.

## 5: Data Transformations for Visualization

### 5.1 Recap: Why Transform Data?
- To focus on relevant subsets (filtering).
- To create more meaningful features for plotting (deriving new columns).
- To summarize data for overview charts (aggregation).

We'll primarily use Polars for this, leveraging skills you're already building.

### 5.2 Polars for Pre-processing (Primary Focus)

**Example 1: Filtering data with Polars before plotting**

Let's plot Horsepower vs. MPG only for cars made in 'USA' after 1975.

In [58]:
cars_usa_post1975_pl = cars_pl.filter(
    (pl.col('Origin') == 'USA') & (pl.col('Year') > pl.datetime(1975, 1, 1)) # Assuming 'Year' is datetime
)

Alternatively, extract the year as an integer and filter on that (note that `.dt.year()` requires 'Year' to be a date or datetime column):

In [59]:
cars_pl_with_year_num = cars_pl.with_columns(pl.col('Year').dt.year().alias('Year_Number'))
cars_usa_post1975_pl = cars_pl_with_year_num.filter(
   (pl.col('Origin') == 'USA') & (pl.col('Year_Number') > 1975)
)

To make sure `Year_Number` exists for the examples that follow, re-create the DataFrame from the source data (dropping rows with nulls) and add the column:

In [61]:
cars_pl_reloaded = pl.from_pandas(vega_data.cars()).drop_nulls().with_columns(
    pl.col("Year").dt.year().alias("Year_Number") # Ensure Year_Number exists
)
cars_usa_post1975_pl = cars_pl_reloaded.filter(
    (pl.col('Origin') == 'USA') & (pl.col('Year_Number') > 1975)
)

In [62]:
scatter_filtered_cars = alt.Chart(cars_usa_post1975_pl).mark_circle(size=60).encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color='Cylinders:O',
    tooltip=['Name:N', 'Year_Number:O']
).properties(
    title='USA Cars (Post-1975): Horsepower vs. MPG'
)

print("\nScatter plot of filtered USA cars (post-1975):")
scatter_filtered_cars


Scatter plot of filtered USA cars (post-1975):


**Example 2: Creating a new categorical column with Polars**

Create 'Weight_Category' (Light, Medium, Heavy) based on 'Weight_in_lbs'.

In [63]:
cars_with_weight_cat = cars_pl_reloaded.with_columns(
    pl.when(pl.col('Weight_in_lbs') < 2500).then(pl.lit('Light'))
    .when(pl.col('Weight_in_lbs') < 3500).then(pl.lit('Medium'))
    .otherwise(pl.lit('Heavy'))
    .alias('Weight_Category')
)

In [64]:
# Plot average MPG by this new 'Weight_Category'
avg_mpg_by_weight_cat = cars_with_weight_cat.group_by('Weight_Category').agg(
    pl.mean('Miles_per_Gallon').alias('Avg_MPG')
).sort('Weight_Category') # Sorting for consistent bar order if desired


In [68]:
bar_weight_cat = alt.Chart(avg_mpg_by_weight_cat).mark_bar().encode(
    x='Weight_Category:N', # Order might not be ideal, could make it Ordinal with sort order
    y='Avg_MPG:Q',
    color='Weight_Category:N',
    tooltip=['Weight_Category:N', 'Avg_MPG:Q']
).properties(
    title='Average MPG by Derived Weight Category',
    width=300,
    height=300
)

print("\nBar chart using a derived categorical column:")
bar_weight_cat


Bar chart using a derived categorical column:


**Example 3: Aggregating data with Polars for summary charts**

Average IMDB rating of movies per Major_Genre, only for genres with at least 10 movies.

In [70]:
genre_counts = movies_pl.group_by('Major_Genre').agg(
    pl.len().alias('Movie_Count'),
    pl.mean('IMDB_Rating').alias('Avg_IMDB_Rating')
).filter(
    pl.col('Movie_Count') >= 10 # Keep genres with a decent number of movies
).sort('Avg_IMDB_Rating', descending=True)

In [71]:
bar_avg_rating_genre = alt.Chart(genre_counts).mark_bar().encode(
    x=alt.X('Avg_IMDB_Rating:Q', title='Average IMDB Rating'),
    y=alt.Y('Major_Genre:N', sort='-x'), # Sort bars by the x-value (avg rating)
    tooltip=['Major_Genre:N', 'Avg_IMDB_Rating:Q', 'Movie_Count:Q']
).properties(
    title='Average IMDB Rating by Major Genre (Min. 10 Movies)',
    width=500
)

print("\nBar chart of aggregated movie ratings per genre:")
bar_avg_rating_genre


Bar chart of aggregated movie ratings per genre:


#### Student Task:

**Polars Pre-processing**
1. Using the `cars_pl_reloaded` DataFrame:

   - Filter for cars that have more than 4 Cylinders AND whose 'Origin' is 'Europe' or 'Japan'.
   - Create a new column 'HP_per_Cylinder' = Horsepower / Cylinders.
   - Calculate the average 'HP_per_Cylinder' for each 'Origin' within this filtered group.
2. Create a bar chart in Altair showing the average 'HP_per_Cylinder' by 'Origin' for this selection.
   - Ensure your chart has a title and informative tooltips.

In [72]:
# YOUR CODE HERE for Student Task (Polars Pre-processing & Altair Chart)

In [76]:
# @title
# Example Solution:
filtered_cars_hp_cyl = cars_pl_reloaded.filter(
    (pl.col('Cylinders') > 4) & (pl.col('Origin').is_in(['Europe', 'Japan']))
).with_columns(
    (pl.col('Horsepower') / pl.col('Cylinders')).alias('HP_per_Cylinder')
)

avg_hp_per_cyl_origin = filtered_cars_hp_cyl.group_by('Origin').agg(
    pl.mean('HP_per_Cylinder').alias('Avg_HP_per_Cylinder')
)

hp_per_cyl_bar_chart = alt.Chart(avg_hp_per_cyl_origin).mark_bar().encode(
    x='Origin:N',
    y='Avg_HP_per_Cylinder:Q',
    color='Origin:N',
    tooltip=['Origin:N', 'Avg_HP_per_Cylinder:Q']
).properties(
    title='Avg. HP per Cylinder (Europe/Japan, >4 Cylinders)',
    height=200,
    width=200
)
hp_per_cyl_bar_chart

### 5.3 Altair's Built-in Transformations (Brief Overview)

Altair can perform some transformations directly. Useful for simple cases or interactivity.

- Example: `transform_calculate` (deriving a field within Altair)
- Let's convert Horsepower to kilowatts (1 hp ≈ 0.7457 kW) for a plot.

In [77]:
# Note: Vega expression syntax 'datum.FieldName'
# Only the first 50 rows are used, to keep the example small
scatter_hp_to_kw = alt.Chart(cars_pl_reloaded.head(50)).mark_point().encode(
    x='Kilowatts:Q', # Use the newly calculated field
    y='Miles_per_Gallon:Q',
    tooltip=['Name:N', 'Kilowatts:Q']
).transform_calculate(
    Kilowatts = "datum.Horsepower * 0.7457"
).properties(
    title='Kilowatts (Calculated in Altair) vs. MPG'
)

print("\nScatter plot with Altair's `transform_calculate`:")
scatter_hp_to_kw


Scatter plot with Altair's `transform_calculate`:


- Example: `transform_filter` (filtering within Altair)
- Show only cars with Horsepower > 200. The condition is written as a Vega expression string.

In [78]:
scatter_high_hp_altair_filter = alt.Chart(cars_pl_reloaded).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    tooltip=['Name:N', 'Horsepower:Q']
).transform_filter(
    'datum.Horsepower > 200'
).properties(
    title='Cars with Horsepower > 200 (Altair Filter)'
)

print("\nScatter plot with Altair's `transform_filter`:")
scatter_high_hp_altair_filter


Scatter plot with Altair's `transform_filter`:


`transform_aggregate` and `transform_bin` are often used implicitly via encoding shorthands like `y='average(Miles_per_Gallon):Q'` or `alt.X('Horsepower:Q', bin=True)`.

### Key takeaway:
- Polars is generally preferred for complex or reusable data transformations.
- Altair's transforms are handy for simple, chart-specific adjustments and interactivity.

Understanding both helps you use these libraries effectively, a key goal of this course.

## 6: Best Practices & Data Storytelling Workflow

This section is more about principles and critical thinking than new code.

### 6.1 Key Principles for Effective Visualization (Summary)

1. Choose the Right Chart:
   - What question are you answering?
   - What type of data do you have (Quantitative, Nominal, Ordinal, Temporal)?
   - Match chart type (Scatter, Bar, Line, Histogram, etc.) appropriately.

2. Clarity and Simplicity:
   - "Less is More" - avoid chart junk.
   - Clear titles, axis labels (with units!), legends.
   - Logical ordering of elements.
   - Appropriate scales (e.g., y-axis starting at zero for bar charts of magnitude).
   - Purposeful use of color.

3. Audience Awareness:
   - Who are you communicating with? (e.g., executives, technical peers)
   - Tailor complexity and detail accordingly.
   - Ensure the key message is clear to *them*.

4. Iteration and Experimentation:
   - Your first chart is rarely your last.
   - Start simple, then refine. Try different approaches.
   - Seek feedback if possible.

5. Evaluating "Good Enough" for Presentation:
   - Is it Accurate? Clear? Complete (enough)?
   - Is it Honest and Ethical? (Avoid misleading visuals).
   - Does it fulfill its Purpose?

### 6.2 Visualization in the Data Analysis Workflow

Visualization is not just an end-product; it's a tool used throughout analysis:
- Data Cleaning: Identifying issues, outliers.
- Exploratory Data Analysis (EDA): Discovering patterns, relationships, forming hypotheses.
- Model Building: Understanding features, evaluating model performance.
- Communication: Sharing insights and telling stories with data.

### 6.3 Strategic Use of Resources

- **Library Documentation**: [Altair](https://altair-viz.github.io/index.html), [Polars](https://docs.pola.rs/), [scikit-learn](https://scikit-learn.org/stable/user_guide.html) websites are your friends!
- **Online Communities**: Stack Overflow, blogs, forums.
- **Generative AI Tools**: Use strategically for help, but always verify.
- **Practice, Practice, Practice!**
- **Translate Problems**: Learn to break down business questions into data analysis and visualization tasks.

### 6.4 Reflective Exercise / Discussion Prompts

Consider a recent business report or news article you've read that included a chart.

Reflect on the following:
1. What type of chart was it? Was it appropriate for the data and message?
2. Was the chart clear and easy to understand? What made it so (or not so)?
   - Think about title, labels, color, scale.
3. Who do you think the intended audience was? Was the chart well-suited for them?
4. Did the chart seem to tell an honest story, or could it have been misleading in any way?
5. What was the key takeaway message from the chart?

_(No code to write here, but these are good points for discussion or self-reflection to internalize the best practices.)_


Example Scenario for Discussion:

Imagine you need to present to your manager the sales performance of three different product categories over the last four quarters.

- What chart type(s) would you consider? Why?
- What are 2-3 key things you would ensure for clarity for your manager?
- If you were the manager on the receiving end of the chart, what would your train of thought look like, and how could the chart's author guide it to the right conclusions?