# Python for Data Visualization
### A Progressive Jupyter Notebook Lecture

You have already learned principles of information visualization:

- marks and channels (Bertin)
- the importance of perception (Ware)
- clarity, minimalism, integrity (Tufte)
- practical design advice (Few)
- how to use headlines, captions and annotations
- how to tell a story with data

In this notebook we will apply those ideas in Python.

Your goal is not only to make a plot but to **communicate a message**.

## 1. Setup

We load the main libraries we will use:

- `pandas` for data
- `matplotlib` for core plotting
- `seaborn` for nicer defaults and statistical plots

We also set a default figure size so plots are not tiny.

In [None]:
import pandas as pd          # work with tables of data (DataFrames)
import matplotlib.pyplot as plt  # basic plotting library
import seaborn as sns            # plotting library built on top of matplotlib

# Set a nice default visual style for seaborn
sns.set_theme()

# Make figures a bit bigger by default
plt.rcParams['figure.figsize'] = (8, 5)

## 2. Marks and Channels in Python

Let us connect your InfoVis theory to Python code.

- **Marks**: the basic objects we draw (points, lines, bars)
- **Channels**: how we encode data (position, length, color, size, shape)

In Python:

- a scatter plot uses points
- a line chart uses connected points
- a bar chart uses rectangles of different lengths
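
As a minimal sketch of how these three marks map to matplotlib calls, the cell below draws the same four values as points, as a line, and as bars. The values are made up for illustration and are not course data.

```python
import matplotlib.pyplot as plt

# Made-up values, used only to illustrate the three mark types
x = [1, 2, 3, 4]
y = [2, 5, 3, 6]

fig, axes = plt.subplots(1, 3, figsize=(9, 3))

axes[0].scatter(x, y)   # mark: points; channel: position
axes[0].set_title('Points')

axes[1].plot(x, y)      # mark: a line connecting the points
axes[1].set_title('Line')

axes[2].bar(x, y)       # mark: rectangles; channel: length
axes[2].set_title('Bars')

fig.tight_layout()
plt.show()
```

The data is identical in all three panels; only the mark changes, which is why the choice of mark is a design decision rather than a property of the data.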

We will start with a simple dataset.

## 3. A Simple Dataset: Temperatures

We create a small `DataFrame` using `pandas.DataFrame`.

A `DataFrame` is like an Excel sheet in Python: rows and columns.

In [None]:
data = {
    'month': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
    'temp':  [2,3,6,9,13,16,18,18,15,10,6,3]
}

# Create a DataFrame from a Python dictionary
df = pd.DataFrame(data)

df  # show the table
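
Before plotting, it is often useful to inspect the table. The following sketch rebuilds the same temperature table so it can run on its own, then looks at its size, column types, and a simple filter. The threshold of 15 °C is an arbitrary choice for illustration.

```python
import pandas as pd

# Same monthly temperatures as in the cell above
df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp':  [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

print(df.shape)            # (number of rows, number of columns)
print(df.dtypes)           # data type of each column
print(df['temp'].mean())   # average of the temperature column

# Boolean filtering: keep rows with a temperature of at least 15
warm = df[df['temp'] >= 15]
print(warm['month'].tolist())
```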

## 4. First Plot: A Line Chart

We use `plt.plot` from matplotlib to draw a simple line chart.

A line plot is appropriate because we have:

- a quantitative variable (temperature)
- ordered categories (months)
- a temporal pattern

This matches the **effectiveness** principle you saw in class: position along a common scale is the channel we read most accurately for quantitative values.

In [None]:
# Basic line plot: x-values are months, y-values are temperatures
plt.plot(df['month'], df['temp'])
plt.show()

## 5. Improving the Design

The plot above is technically correct but communicates little: it has no title, no axis labels, and nothing that guides the eye.

Let us improve it using principles from Tufte, Few, Munzner and the storytelling lecture:

- clear title
- axis labels for context
- remove chart junk
- use color intentionally
- thicker line for focus

We use:
- `plt.title` to set a title
- `plt.xlabel` / `plt.ylabel` to label axes
- `sns.despine` to remove the top and right borders (spines).

In [None]:
plt.plot(df['month'], df['temp'], color='black', linewidth=2)  # thicker black line

plt.title('Average Monthly Temperature in Copenhagen (°C)')  # chart title
plt.xlabel('Month')                                          # x-axis label
plt.ylabel('Temperature (°C)')                               # y-axis label

sns.despine()  # remove top and right borders to reduce clutter
plt.show()

## 6. Annotations: Turning a Chart into a Message

Annotations guide the viewer’s attention, just like in the storytelling lecture.

We use:
- `plt.scatter` to highlight one point
- `plt.text` to add a short label near that point.

Let us emphasize the warmest month.

In [None]:
plt.plot(df['month'], df['temp'], color='black', linewidth=2)

plt.title('Summer Peak in Copenhagen Temperatures')
plt.xlabel('Month')
plt.ylabel('Temperature (°C)')

# Find the row label of the maximum temperature.
# idxmax returns the first maximum: Jul and Aug tie at 18, so this picks Jul.
max_idx = df['temp'].idxmax()

# Draw a red dot on the warmest month
plt.scatter(df.loc[max_idx, 'month'], df.loc[max_idx, 'temp'], color='red', s=80)

# Add a short text label above the red dot
plt.text(df.loc[max_idx, 'month'], df.loc[max_idx, 'temp'] + 0.5,
         'Warmest month', ha='center', color='red')

sns.despine()
plt.show()
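
When a label cannot sit directly on the point, an arrow can connect the two. The sketch below uses matplotlib's `annotate` for this, with the same temperature table rebuilt so the cell runs on its own. The label position (x = 2, y = 17) is an arbitrary choice for this example, and the x-coordinates are the numeric positions 0 to 11 that matplotlib assigns to the month labels.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp':  [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

fig, ax = plt.subplots()
ax.plot(df['month'], df['temp'], color='black', linewidth=2)

# Row label of the first maximum, then its position on the x-axis
max_idx = df['temp'].idxmax()
x_pos = df.index.get_loc(max_idx)   # categorical x-positions are 0, 1, 2, ...
y_peak = df.loc[max_idx, 'temp']

# xy is the point being annotated; xytext is where the label text is placed
note = ax.annotate(
    'Warmest month',
    xy=(x_pos, y_peak),
    xytext=(2, 17),                  # arbitrary label position for this sketch
    arrowprops={'arrowstyle': '->', 'color': 'red'},
    color='red',
)

ax.set_xlabel('Month')
ax.set_ylabel('Temperature (°C)')
plt.show()
```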

## 7. Bar Charts: Categorical Comparisons

Bars use **length**, which is a highly effective channel for comparison.

Here we use `sns.barplot` from seaborn, which works directly with a `DataFrame`.

Arguments:
- `data=df` tells seaborn which table to use
- `x='month'` chooses the column for the x-axis
- `y='temp'` chooses the column for the y-axis
- `hue='month'` assigns a color to each month
- `palette='Blues'` uses the sequential ColorBrewer palette `Blues`
- `dodge=False` keeps each bar centered on its category instead of shifting bars sideways to make room for every hue level
- `legend=False` hides the legend because it would only repeat the category labels on the x-axis.

In [None]:
sns.barplot(
    data=df,
    x='month',
    y='temp',
    hue='month',
    palette='Blues',
    dodge=False,
    legend=False
)

plt.title('Temperature by Month')
sns.despine()
plt.show()

## 8. When to Use Which Chart

Connecting theory to practice:

- **Line chart**: changes over time
- **Bar chart**: comparing categories
- **Scatter plot**: relationship between two quantitative variables
- **Histogram**: distribution of a single quantitative variable

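Choosing the wrong chart can violate the **expressiveness** principle (Munzner): the chart then suggests something the data does not contain, or hides something it does.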

## 9. Scatter Plots: Relationship Between Two Variables

Here we simulate a simple dataset with:
- `numpy.random.randint` to create random integers
- a `DataFrame` to hold the data

We use `sns.scatterplot` to plot study time against exam score. Both columns are drawn independently at random, so expect no real trend in this example.

In [None]:
import numpy as np  # numerical library, here for random numbers

np.random.seed(0)  # make the example reproducible

df_scatter = pd.DataFrame({
    'hours_studied': np.random.randint(1, 10, 50),   # random hours between 1 and 9
    'exam_score': np.random.randint(40, 100, 50)     # random scores between 40 and 99
})

sns.scatterplot(data=df_scatter, x='hours_studied', y='exam_score')
plt.title('Relationship Between Study Time and Exam Scores')
sns.despine()
plt.show()
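
A scatter plot is most informative when the two variables are actually related. As a sketch, the cell below simulates an exam score that depends on study time plus random noise, and then measures the strength of the relationship. The baseline of 45, the slope of 5 points per hour, and the noise level of 8 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator, so the example is reproducible

hours = rng.integers(1, 10, size=50)            # whole hours from 1 to 9
noise = rng.normal(loc=0, scale=8, size=50)     # assumed spread around the trend
score = (45 + 5 * hours + noise).clip(0, 100)   # assumed linear trend, kept in 0-100

df_linked = pd.DataFrame({'hours_studied': hours, 'exam_score': score})

# Pearson correlation between the two columns (between -1 and 1)
r = df_linked['hours_studied'].corr(df_linked['exam_score'])
print(round(r, 2))
```

Passing `df_linked` to `sns.scatterplot` with the same `x` and `y` column names would then show an upward trend.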

## 10. Histograms: Distributions

Histograms show how values are distributed.

We use `sns.histplot` and pass a single column.
- `bins=10` splits the range of scores into 10 equal-width bins, one bar per bin.


In [None]:
sns.histplot(df_scatter['exam_score'], bins=10)
plt.title('Distribution of Exam Scores')
plt.xlabel('Exam score')
plt.ylabel('Count')
sns.despine()
plt.show()
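
To see what binning does behind the plot, the sketch below counts values per bin with `numpy.histogram`. The scores are stand-in random integers generated in this cell, not the exact values from `df_scatter`.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(40, 100, size=50)   # stand-in data: 50 scores from 40 to 99

# counts: how many scores fall in each bin; edges: where the bins start and end
counts, edges = np.histogram(scores, bins=10)

print(len(counts))    # number of bars
print(len(edges))     # one more edge than bars
print(counts.sum())   # every score falls into exactly one bin
```

Changing `bins` changes the shape of the histogram without changing the data, so it is worth trying a few values before settling on one. `sns.histplot` also accepts `binwidth` if you prefer to fix the width of each bin instead of the number of bins.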

## 11. Designing with Color

Color is a powerful but risky channel.

You learned that:

- color should encode categories or magnitude
- do not use random colors
- use colorblind-safe palettes

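Here we use the sequential ColorBrewer palette `Blues` and follow the seaborn API by passing `hue` and turning off the legend when it is redundant. Note that `hue='month'` assigns shades in month order, so the darkness of a bar reflects its position in the year, not its temperature.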

In [None]:
sns.barplot(
    data=df,
    x='month',
    y='temp',
    hue='month',
    palette='Blues',
    dodge=False,
    legend=False
)
plt.title('Using a Sequential Color Palette')
sns.despine()
plt.show()
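
One way to make color track magnitude is to compute the bar colors from the temperature values. The sketch below does this with plain matplotlib, using `Normalize` to scale temperatures to the 0 to 1 range of a colormap. The gray bar outline is an added assumption so that the lightest bars stay visible on a white background.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp':  [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

# Scale temperatures to the 0-1 range that a colormap expects
norm = Normalize(vmin=df['temp'].min(), vmax=df['temp'].max())
colors = plt.cm.Blues(norm(df['temp'].to_numpy()))   # one RGBA row per bar

fig, ax = plt.subplots()
ax.bar(df['month'], df['temp'], color=colors, edgecolor='gray')
ax.set_title('Bar Color Follows Temperature')
ax.set_ylabel('Temperature (°C)')
plt.show()
```

Color here repeats what bar length already shows. This kind of redundant encoding can reinforce a message, but it adds no new information.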

## 12. Small Exercise: Improve a Bad Plot

Run the cell below to see a purposely terrible plot.

Then:

1. Identify what is wrong.
2. Redesign it using the principles from class.
3. Explain your design choices in a short Markdown cell.

In [None]:
plt.plot(df['month'], df['temp'], color='lime', linewidth=0.5)
plt.title('temp')
plt.show()

## 13. Cleaning Data (Brief Intro)

Most real data is messy. Before plotting, you often need to:

- rename columns
- convert types
- handle missing values
- filter rows
- create derived variables

Here is a slightly messy example to clean up.

We create a `DataFrame` with:
- extra spaces in column names and values
- inconsistent capitalization
- missing values (`None`)
- a placeholder string (`'N/A'`) in a column that should be numeric.

In [None]:
df2 = pd.DataFrame({
    'Year ': ['2020 ', ' 2021', '2022', None],
    ' Sales ': ['1000', ' 1200', '1300 ', ' N/A '],
    'Region': ['North', ' south ', None, 'EAST'],
    ' Notes ': ['good', ' delayed ', '  OK', ''],
})

df2  # show the dirty table

Now we clean this table step by step.

We use:
- `df2.columns.str.strip()` to remove spaces from column names
- `str.strip()` on string columns to remove spaces around values
- `replace({'N/A': None})` to treat `'N/A'` as missing
- `.astype('Int64')` to convert cleaned strings to integer numbers, allowing missing values
- `str.capitalize()` to tidy up capitalization.

Notice how the result looks much more uniform and ready for plotting.

In [None]:
# 1. Strip whitespace from column names
df2.columns = df2.columns.str.strip()

# 2. Clean 'Year' column: remove spaces and convert to integer (with missing allowed)
df2['Year'] = df2['Year'].str.strip()
df2['Year'] = df2['Year'].astype('Int64')  # nullable integer type

# 3. Clean 'Sales' column: remove spaces, handle 'N/A', convert to integer
df2['Sales'] = (
    df2['Sales']
      .str.strip()
      .replace({'N/A': None})   # turn 'N/A' into a missing value
      .astype('Int64')
)

# 4. Clean 'Region' column: remove spaces and standardize capitalization
df2['Region'] = df2['Region'].str.strip().str.capitalize()

# 5. Clean 'Notes' column: remove spaces and turn empty strings into missing values
#    (dictionary form, as in step 3, behaves the same across pandas versions)
df2['Notes'] = df2['Notes'].str.strip().replace({'': None})

df2  # show the cleaned table
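
When a column contains several different non-numeric placeholders, listing each one in `replace` becomes tedious. An alternative sketch is `pd.to_numeric` with `errors='coerce'`, which turns every value that cannot be parsed into a missing value. The small series below is a stand-in for the `Sales` column.

```python
import pandas as pd

# Stand-in for a messy numeric column
raw = pd.Series(['1000', ' 1200', '1300 ', ' N/A ', None])

# errors='coerce' converts anything unparseable into a missing value
sales = pd.to_numeric(raw.str.strip(), errors='coerce').astype('Int64')

print(sales.dtype)          # nullable integer type
print(sales.isna().sum())   # number of missing values
print(sales.sum())          # missing values are skipped by default
```

The trade-off is that `errors='coerce'` also hides genuine typos such as `'12OO'`, so it is worth counting the missing values before and after the conversion.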

## 14. When to Use pandas, matplotlib, and seaborn

We now briefly discuss when each tool is most appropriate.

- **pandas**
  - Good for quick, simple plots directly from a `DataFrame`.
  - Example: `df.plot(kind='line')`, `df.plot(kind='bar')`.
  - Nice when you are already doing data cleaning and transformation with pandas.

- **matplotlib**
  - Low-level control over every visual element.
  - Good when you want to precisely tune layout, annotations, and styles.
  - The foundation that many other libraries build on.

- **seaborn**
  - Built on top of matplotlib, with smarter defaults.
  - Great for statistical plots and working directly with tidy `DataFrame`s.
  - Handles grouping, color mapping and simple statistics easily.

In this course, you will mostly use `pandas` for data, `seaborn` for high-level plots, and `matplotlib` for detailed fine-tuning and annotations.

In [None]:
# Example: pandas quick plot (very simple interface)
df.set_index('month')['temp'].plot(kind='line', title='Quick Temperature Plot with pandas')
plt.ylabel('Temperature (°C)')
plt.show()

In [None]:
# Example: seaborn statistical plot (adds a regression line)
sns.regplot(data=df_scatter, x='hours_studied', y='exam_score')
plt.title('Regression line with seaborn')
sns.despine()
plt.show()
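
For completeness, here is a sketch of the kind of fine-tuning that matplotlib allows through its `Axes` object. The axis limits, tick positions, and the dashed reference line at the annual mean are illustrative choices, not requirements.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp':  [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(df['month'], df['temp'], color='black', linewidth=2)

ax.set_title('Fine-Tuning with matplotlib', loc='left')   # left-aligned title
ax.set_ylabel('Temperature (°C)')
ax.set_ylim(0, 20)                                        # fixed y-axis range
ax.set_yticks([0, 5, 10, 15, 20])                         # fewer, rounder ticks

# Hide the top and right spines directly on the Axes
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

# Dashed horizontal reference line at the annual mean
ax.axhline(df['temp'].mean(), color='gray', linestyle='--', linewidth=1)
plt.show()
```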

## 15. Mini Case Study: A Small Data Story

We will now create a small data story using the classic `tips` dataset from seaborn.

We load it with `sns.load_dataset('tips')`, which returns a tidy `DataFrame`. The data is downloaded on first use, so this cell needs an internet connection.

**Question:** Do dinner customers tip more than lunch customers?

In [None]:
# Load an example dataset (seaborn downloads it on first use and caches it)
tips = sns.load_dataset('tips')

tips.head()  # show the first rows

In [None]:
# Bar plot of average tip by meal time
sns.barplot(
    data=tips,
    x='time',
    y='tip',
    hue='time',
    errorbar=None,       # do not draw error bars
    palette='Blues',
    dodge=False,
    legend=False
)
plt.title('Average Tip by Meal Time')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')
sns.despine()
plt.show()

Now let us turn this into a clearer story with a descriptive title and an annotation.

We use `plt.text` again to add a short message inside the plot area.

In [None]:
sns.barplot(
    data=tips,
    x='time',
    y='tip',
    hue='time',
    errorbar=None,
    palette='Blues',
    dodge=False,
    legend=False
)

plt.title('Dinner Customers Tip More Than Lunch Customers')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')

# Add a short annotation above the bars.
# The two categories sit at x=0 and x=1, so x=0.5 is midway between them.
plt.text(0.5, tips['tip'].mean() + 1.0,
         'On average, dinner tips are higher',
         ha='center')

sns.despine()
plt.show()
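
Before putting a claim in a title, it helps to check it numerically. The sketch below computes the average tip per meal time with `groupby`. It uses a few hypothetical rows with the same column names as `tips`; these are not real data, and the numbers say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical rows with the same column names as `tips`; not real data
toy = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'tip':  [2.0, 3.0, 3.0, 4.0, 5.0],
})

mean_tip = toy.groupby('time')['tip'].mean()   # one average per meal time
difference = mean_tip['Dinner'] - mean_tip['Lunch']

print(mean_tip)
print(difference)
```

The same two lines work on the real `tips` table. There the `time` column is categorical, so depending on your pandas version you may want to pass `observed=True` to `groupby`.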

## 16. Exporting Your Figure

You can save your final figure to a file, for example to include it in a report.

We use `plt.savefig` and give it a filename.
- `dpi=300` sets the resolution to 300 dots per inch, which is sharp enough for printing.
- `bbox_inches='tight'` trims extra whitespace around the figure.

In [None]:
plt.figure()

sns.barplot(
    data=tips,
    x='time',
    y='tip',
    hue='time',
    errorbar=None,
    palette='Blues',
    dodge=False,
    legend=False
)
plt.title('Dinner Customers Tip More Than Lunch Customers')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')

sns.despine()

# Save the current figure to a PNG file
plt.savefig('my_first_data_story.png', dpi=300, bbox_inches='tight')
plt.show()
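
The file extension decides the output format. As a sketch, the cell below saves one figure both as PNG (raster) and as PDF (vector). The bar heights are placeholders rather than real averages, and the files go into a temporary folder so that nothing in your working directory is overwritten.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['Lunch', 'Dinner'], [1, 2])   # placeholder values, not real averages
ax.set_title('Placeholder figure')

out_dir = tempfile.mkdtemp()          # temporary folder for this example
png_path = os.path.join(out_dir, 'story.png')
pdf_path = os.path.join(out_dir, 'story.pdf')

# Save before showing: in notebooks the figure is usually closed once displayed
fig.savefig(png_path, dpi=300, bbox_inches='tight')   # raster, for slides and web
fig.savefig(pdf_path, bbox_inches='tight')            # vector, scales without blur

print(os.path.getsize(png_path) > 0, os.path.getsize(pdf_path) > 0)
plt.close(fig)
```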

## 17. Simple Interactivity in Notebooks

Interactivity can help viewers explore data, but as you learned in the storytelling lecture, more interaction can also mean less story.

Here we create a very simple interactive plot using `ipywidgets.interact` to filter the tips data by minimum tip amount.

This is useful when you want to:

- let the viewer explore thresholds or ranges
- keep the visual form fixed while changing the data subset

> Note: This works best in Jupyter or Colab with widget support enabled.

In [None]:
from ipywidgets import interact  # small library for interactive controls in notebooks

@interact(min_tip=(0.0, 10.0, 0.5))
def plot_tip_distribution(min_tip=0.0):
    # Filter rows where tip is at least min_tip
    subset = tips[tips['tip'] >= min_tip]
    plt.figure()
    sns.histplot(subset['tip'], bins=10)
    plt.title(f'Distribution of Tips (min_tip = {min_tip})')
    plt.xlabel('Tip (US dollars)')
    plt.ylabel('Count')
    sns.despine()
    plt.show()

In this example:

- the **visual design** stays constant (a histogram)
- the **interaction** changes which data points are shown
- the **message** can shift slightly depending on the filter, but the viewer still knows what they are looking at

This is the kind of light, focused interactivity that fits well with the principles you saw in the storytelling lecture.
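
One detail helps keep the visual form truly fixed: using the same bin edges for every filter setting, so that bars stay in the same place while the slider moves. The sketch below shows the idea with `numpy.histogram` and a fixed `range`. The tip amounts are stand-in random numbers, and `tip_counts` is a hypothetical helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in tip amounts between 1 and 10 dollars; not the real `tips` data
tip_values = rng.uniform(1, 10, size=200)

def tip_counts(values, min_tip, bins=10, value_range=(0, 10)):
    """Count tips >= min_tip using the same bin edges for every filter."""
    subset = values[values >= min_tip]
    counts, edges = np.histogram(subset, bins=bins, range=value_range)
    return counts, edges

counts_all, edges_all = tip_counts(tip_values, min_tip=0.0)
counts_high, edges_high = tip_counts(tip_values, min_tip=5.0)

print(np.array_equal(edges_all, edges_high))   # same edges for both filters
print(counts_high[:5].sum())                   # bins below 5 dollars are empty
```

In seaborn, the `binrange` argument of `sns.histplot` plays the same role as `range` here.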

## 18. Final Exercise: Your First Python Data Story

Using any small dataset (for example `tips`, `penguins`, or your own CSV file), create **one visualization** that communicates **one clear message**.

Your plot must include:

- a descriptive title
- axis labels
- at least one annotation
- clean formatting (no unnecessary clutter)
- appropriate color use

In a short Markdown cell, explain:

1. What is the main message?
2. Why did you choose this chart type?
3. Which visual channels encode which attributes?
4. How did you apply at least one principle from Bertin, Tufte, Ware, Few, or Munzner?

This is a small rehearsal for your larger storytelling with data assignment.