# Python for Data Visualization
### A Progressive Jupyter Notebook Lecture

You have already learned principles of information visualization:

- marks and channels (Bertin)
- the importance of perception (Ware)
- clarity, minimalism, integrity (Tufte)
- practical design advice (Few)
- how to use headlines, captions and annotations
- how to tell a story with data

In this notebook we will apply those ideas in Python.

Your goal is not only to make a plot but to **communicate a message**.

## 1. Setup

We load the main libraries we will use:

- `pandas` for data
- `matplotlib` for core plotting
- `seaborn` for nicer defaults and statistical plots

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()
plt.rcParams['figure.figsize'] = (8, 5)
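
Seaborn's plotting arguments have changed between releases, so it can help to record which versions are installed before running the rest of the notebook. A minimal sketch (the `versions` dictionary is only an illustrative name):

```python
import matplotlib
import pandas as pd
import seaborn as sns

# Collect and print the installed version of each library used here.
versions = {
    'pandas': pd.__version__,
    'matplotlib': matplotlib.__version__,
    'seaborn': sns.__version__,
}
for name, version in versions.items():
    print(f'{name}: {version}')
```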

## 2. Marks and Channels in Python

Let us connect your InfoVis theory to Python code.

- **Marks**: the basic objects we draw (points, lines, bars)
- **Channels**: how we encode data (position, length, color, size, shape)

In Python:

- a scatter plot (`plt.scatter`, `sns.scatterplot`) draws points
- a line chart (`plt.plot`) draws connected points
- a bar chart (`plt.bar`, `sns.barplot`) draws rectangles whose lengths encode values
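
To make the link between marks and code concrete, here is a small sketch that draws one made-up series three times: as points, as a line, and as bars. The values and the names `x`, `y`, `fig`, and `axes` are illustrative assumptions, not lecture data.

```python
import matplotlib.pyplot as plt

# Hypothetical values, used only to show the three mark types.
x = ['A', 'B', 'C', 'D']
y = [3, 7, 5, 9]

fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharey=True)

axes[0].scatter(x, y)   # mark: point, channel: vertical position
axes[0].set_title('Points')

axes[1].plot(x, y)      # mark: line, channel: position and slope
axes[1].set_title('Line')

axes[2].bar(x, y)       # mark: rectangle, channel: length
axes[2].set_title('Bars')

fig.tight_layout()
plt.show()
```

The data are identical in all three panels; only the mark changes, and with it what the eye compares.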

We will start with a simple dataset.

## 3. A Simple Dataset: Temperatures

In [None]:
data = {
    'month': ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'],
    'temp':  [2,3,6,9,13,16,18,18,15,10,6,3]
}

df = pd.DataFrame(data)
df

## 4. First Plot: A Line Chart

A line plot is appropriate because we have:

- a quantitative variable (temperature)
- ordered categories (months)
- a temporal pattern

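This matches the **effectiveness** principle you saw in class: the most important attribute (temperature) is encoded with the most accurately perceived channel (position on a common scale).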

In [None]:
plt.plot(df['month'], df['temp'])
plt.show()
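
Matplotlib places string labels on the axis in the order in which they first appear, so the chart above follows the calendar only because the rows are already sorted. If rows might arrive in another order, one option is to store the months as an ordered categorical before sorting. The sketch below uses a hypothetical three-row frame called `shuffled`.

```python
import pandas as pd
import matplotlib.pyplot as plt

month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical rows that arrive out of calendar order.
shuffled = pd.DataFrame({'month': ['Mar', 'Jan', 'Feb'], 'temp': [6, 2, 3]})

# An ordered categorical makes sort_values() follow the calendar, not the alphabet.
shuffled['month'] = pd.Categorical(shuffled['month'],
                                   categories=month_order, ordered=True)
ordered = shuffled.sort_values('month').reset_index(drop=True)

# Convert back to plain strings for plotting.
plt.plot(ordered['month'].astype(str), ordered['temp'])
plt.show()
```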

## 5. Improving the Design

The plot above is technically correct but communicates little: it has no title, no axis labels, and no indication of units or place.

Let us improve it using principles from Tufte, Few, Munzner and the storytelling lecture:

- clear title
- axis labels for context
- remove chart junk
- use color intentionally
- thicker line for focus

In [None]:
plt.plot(df['month'], df['temp'], color='black', linewidth=2)

plt.title('Average Monthly Temperature in Copenhagen (°C)')
plt.xlabel('Month')
plt.ylabel('Temperature (°C)')

sns.despine()
plt.show()
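
The same design can also be written with Matplotlib's object-oriented interface, which keeps an explicit handle on the figure and axes. This is a sketch of an alternative style, not a replacement for the cell above; the data are repeated so that the cell runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp': [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(df['month'], df['temp'], color='black', linewidth=2)
ax.set_title('Average Monthly Temperature in Copenhagen (°C)')
ax.set_xlabel('Month')
ax.set_ylabel('Temperature (°C)')

# Hide the top and right borders, as sns.despine() does by default.
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

plt.show()
```

Working on `ax` directly becomes useful once a figure holds more than one panel.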

## 6. Annotations: Turning a Chart into a Message

Annotations guide the viewer’s attention, as discussed in the storytelling lecture.

Let us emphasize the warmest month.

In [None]:
plt.plot(df['month'], df['temp'], color='black', linewidth=2)

plt.title('Summer Peak in Copenhagen Temperatures')
plt.xlabel('Month')
plt.ylabel('Temperature (°C)')

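# idxmax() returns the first maximum; Jul and Aug tie at 18 °C, so this marks Jul.
max_idx = df['temp'].idxmax()
peak_month = df.loc[max_idx, 'month']
peak_temp = df.loc[max_idx, 'temp']
plt.scatter(peak_month, peak_temp, color='red', s=80)

plt.text(peak_month, peak_temp + 0.5,
         'Warmest month', ha='center', color='red')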

sns.despine()
plt.show()
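
An arrow can tie the label to the data point more firmly than free-floating text. The sketch below uses `Axes.annotate` and marks every month that shares the maximum; the name `peak` and the text offsets are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'temp': [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
})

# All rows that share the maximum temperature.
peak = df[df['temp'] == df['temp'].max()]

fig, ax = plt.subplots()
ax.plot(df['month'], df['temp'], color='black', linewidth=2)
ax.scatter(peak['month'], peak['temp'], color='red', s=80, zorder=3)

# On a categorical axis the labels sit at positions 0, 1, 2, ...,
# which here coincide with the DataFrame index.
ax.annotate('Warmest months: ' + ' and '.join(peak['month']),
            xy=(peak.index[0], peak['temp'].iloc[0]),
            xytext=(-60, 20), textcoords='offset points', ha='right',
            arrowprops=dict(arrowstyle='->', color='red'), color='red')

ax.set_xlabel('Month')
ax.set_ylabel('Temperature (°C)')
plt.show()
```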

## 7. Bar Charts: Categorical Comparisons

Bars use **length**, which is a highly effective channel for comparison.

Use bar charts for comparing discrete categories.

In [None]:
sns.barplot(data=df, x='month', y='temp', color='steelblue')
plt.title('Temperature by Month')
sns.despine()
plt.show()
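
Months have a natural order, but many categories do not. For unordered categories, sorting the bars by value usually makes comparison easier, and because bars encode length the value axis should start at zero. A sketch with made-up counts (the `fruit` frame is hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts for unordered (nominal) categories.
fruit = pd.DataFrame({
    'fruit': ['Apple', 'Banana', 'Cherry', 'Date'],
    'count': [12, 30, 7, 18],
})

# Ascending order puts the longest bar at the top of a horizontal chart.
ranked = fruit.sort_values('count', ascending=True)

fig, ax = plt.subplots()
ax.barh(ranked['fruit'], ranked['count'], color='steelblue')
ax.set_xlim(left=0)  # bar length is only meaningful from a zero baseline
ax.set_xlabel('Count')
ax.set_title('Bars Sorted by Value')
plt.show()
```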

## 8. When to Use Which Chart

Connecting theory to practice:

- **Line chart**: changes over time
- **Bar chart**: comparing categories
- **Scatter plot**: relationship between two quantitative variables
- **Histogram**: distribution of a single quantitative variable

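Choosing the wrong chart can violate the **expressiveness** principle (Munzner): the encoding should show all of the information in the data and nothing that is not there. A line chart across unordered categories, for example, suggests a trend that does not exist.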

## 9. Scatter Plots: Relationship Between Two Variables

In [None]:
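import numpy as np

np.random.seed(0)  # fixed seed so the figure is reproducible

# Synthetic data: the two columns are drawn independently,
# so no real relationship between them should be expected.
df_scatter = pd.DataFrame({
    'hours_studied': np.random.randint(1, 10, 50),  # integers 1-9
    'exam_score': np.random.randint(40, 100, 50)    # integers 40-99
})

sns.scatterplot(data=df_scatter, x='hours_studied', y='exam_score')
plt.title('Relationship Between Study Time and Exam Scores')
sns.despine()
plt.show()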
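
The two columns in the cell above are drawn independently, so the point cloud shows no trend. To see what a relationship looks like, the sketch below simulates scores that rise with study time. The baseline of 45 points, the gain of 5 points per hour, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)

hours = rng.integers(1, 10, size=50)  # integers 1-9
# Assumed toy model: baseline 45, +5 points per hour, plus random noise.
score = 45 + 5 * hours + rng.normal(0, 5, size=50)

df_trend = pd.DataFrame({'hours_studied': hours, 'exam_score': score})

# Pearson correlation summarizes the strength of the linear relationship.
r = df_trend['hours_studied'].corr(df_trend['exam_score'])

ax = sns.scatterplot(data=df_trend, x='hours_studied', y='exam_score')
ax.set_title(f'Simulated Study Time vs. Exam Score (r = {r:.2f})')
sns.despine()
plt.show()
```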

## 10. Histograms: Distributions

Histograms show how values are distributed.

In [None]:
sns.histplot(df_scatter['exam_score'], bins=10)
plt.title('Distribution of Exam Scores')
sns.despine()
plt.show()
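
The number of bins changes what a histogram appears to say. The sketch below plots the same simulated scores with 5 and with 30 bins; the normal distribution with mean 70 and standard deviation 10 is an assumption made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=200)  # hypothetical exam scores

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)

# Few bins: smooth outline, but detail is lost.
counts_coarse, _, _ = axes[0].hist(scores, bins=5)
axes[0].set_title('5 bins')

# Many bins: more detail, but also more noise.
counts_fine, _, _ = axes[1].hist(scores, bins=30)
axes[1].set_title('30 bins')

for ax in axes:
    ax.set_xlabel('Exam score')
axes[0].set_ylabel('Number of students')
plt.show()
```

Trying a few bin counts before settling on one is a cheap way to check that the shape you report is not an artifact of the binning.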

## 11. Designing with Color

Color is a powerful but risky channel.

You learned that:

- color should encode categories or magnitude
- do not use random colors
- use colorblind-safe palettes

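Let us use the sequential ColorBrewer palette `Blues`, available through seaborn, so that darker bars mean warmer months.

In [None]:
# Map color to temperature (magnitude), not to the position of the bar.
cmap = sns.color_palette('Blues', as_cmap=True)
norm = plt.Normalize(vmin=0, vmax=df['temp'].max())
plt.bar(df['month'], df['temp'], color=cmap(norm(df['temp'].to_numpy())))
plt.title('Using a Sequential Color Palette')
sns.despine()
plt.show()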
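
When color encodes categories rather than magnitude, a qualitative palette is the better fit. The sketch below takes two colors from seaborn's `colorblind` palette to distinguish two series. The second series and the names `City A` and `City B` are made up for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical series: one color per category.
series = {
    'City A': [2, 3, 6, 9, 13, 16, 18, 18, 15, 10, 6, 3],
    'City B': [8, 9, 12, 15, 19, 23, 26, 26, 22, 17, 12, 9],
}

palette = sns.color_palette('colorblind', len(series))

fig, ax = plt.subplots()
for color, (name, temps) in zip(palette, series.items()):
    ax.plot(months, temps, color=color, linewidth=2, label=name)

ax.set_ylabel('Temperature (°C)')
ax.legend(frameon=False)
sns.despine()
plt.show()
```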

## 12. Small Exercise: Improve a Bad Plot

Run the cell below to see a purposely terrible plot.

Then:

1. Identify what is wrong.
2. Redesign it using the principles from class.
3. Explain your design choices in a short Markdown cell.

In [None]:
plt.plot(df['month'], df['temp'], color='lime', linewidth=0.5)
plt.title('temp')
plt.show()

## 13. Cleaning Data (Brief Intro)

Most real data is messy. Before plotting, you often need to:

- rename columns
- convert types
- handle missing values
- filter rows
- create derived variables

Here is a tiny example.

In [None]:
df2 = pd.DataFrame({
    'Year': ['2020', '2021', '2022'],
    'Sales ': ['1000', ' 1200', '1300 ']
})

df2

In [None]:
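# Remove stray whitespace from the column names ('Sales ' becomes 'Sales').
df2.columns = df2.columns.str.strip()
# Strip whitespace from the values, then convert the strings to integers.
df2['Sales'] = df2['Sales'].str.strip().astype(int)
df2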
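
The example above covers renaming and type conversion. The sketch below extends the same idea to missing values, filtering, and a derived variable. The extra row with `'n/a'` and the column name `Sales_k` are hypothetical additions for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    'Year': ['2020', '2021', '2022', '2023'],
    'Sales ': ['1000', ' 1200', 'n/a', '1300 '],
})

# Rename: strip whitespace from every column name.
clean = raw.rename(columns=lambda c: c.strip())

# Convert types: unparseable entries such as 'n/a' become NaN.
clean['Year'] = clean['Year'].astype(int)
clean['Sales'] = pd.to_numeric(clean['Sales'].str.strip(), errors='coerce')

# Handle missing values: here we simply drop rows without a sales figure.
clean = clean.dropna(subset=['Sales'])

# Derived variable: sales in thousands.
clean['Sales_k'] = clean['Sales'] / 1000

print(clean)
```

Dropping rows is only one option; whether to drop, fill, or flag missing values depends on the message the chart is meant to carry.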

## 14. Mini Case Study: A Small Data Story

We will now create a small data story using the classic `tips` dataset from seaborn.

**Question:** Do dinner customers tip more than lunch customers?

In [None]:
tips = sns.load_dataset('tips')

tips.head()

In [None]:
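# errorbar=None hides the confidence intervals (seaborn >= 0.12; older versions use ci=None).
sns.barplot(data=tips, x='time', y='tip', errorbar=None, color='steelblue')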
plt.title('Average Tip by Meal Time')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')
sns.despine()
plt.show()

Now let us turn this into a clearer story with a descriptive title and an annotation.

In [None]:
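# errorbar=None hides the confidence intervals (seaborn >= 0.12; older versions use ci=None).
sns.barplot(data=tips, x='time', y='tip', errorbar=None, color='steelblue')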

plt.title('Dinner Customers Tip More Than Lunch Customers')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')

plt.text(0.5, tips['tip'].mean() + 1.0,
         'On average, dinner tips are higher',
         ha='center')

sns.despine()
plt.show()
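
Before committing to a headline, it is worth checking the numbers behind it. The sketch below computes the group means and builds the headline from them. To stay self-contained it uses a tiny made-up frame called `bills` in place of `tips`, so the values shown are not the real dataset's.

```python
import pandas as pd

# Hypothetical stand-in for the `tips` data.
bills = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'tip': [2.0, 2.5, 3.0, 3.0, 3.5, 4.0],
})

# Average tip per meal time, and the gap between the two groups.
means = bills.groupby('time')['tip'].mean()
diff = means['Dinner'] - means['Lunch']

# Put the measured difference into the headline instead of a vague claim.
headline = f'Dinner tips average ${diff:.2f} more than lunch tips'
print(means)
print(headline)
```

To run this on the real data, replace `bills` with `tips`. Because `time` is a categorical column there, recent pandas versions may warn unless `observed` is passed to `groupby` explicitly.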

## 15. Exporting Your Figure

You can save your final figure to a file, for example to include it in a report.

In [None]:
plt.figure()

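# errorbar=None hides the confidence intervals (seaborn >= 0.12; older versions use ci=None).
sns.barplot(data=tips, x='time', y='tip', errorbar=None, color='steelblue')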
plt.title('Dinner Customers Tip More Than Lunch Customers')
plt.xlabel('Meal time')
plt.ylabel('Average tip (US dollars)')

sns.despine()

plt.savefig('my_first_data_story.png', dpi=300, bbox_inches='tight')
plt.show()
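
Reports and slides often need different formats: PNG for raster output, SVG or PDF for vector output that scales without blurring. The sketch below saves one figure in two formats and confirms that the files exist. The bar heights, the file name `story`, and the temporary output folder are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(['Lunch', 'Dinner'], [2.5, 3.5], color='steelblue')  # hypothetical values
ax.set_title('Example Figure for Export')
ax.set_ylabel('Average tip (US dollars)')

# Write into a temporary folder so the example leaves no files behind in your project.
out_dir = Path(tempfile.mkdtemp())
png_path = out_dir / 'story.png'
svg_path = out_dir / 'story.svg'

fig.savefig(png_path, dpi=300, bbox_inches='tight')  # raster, 300 dots per inch
fig.savefig(svg_path, bbox_inches='tight')           # vector, resolution independent

plt.close(fig)  # release the figure once it has been saved
print(png_path.exists(), svg_path.exists())
```

Calling `fig.savefig` on the figure object, rather than `plt.savefig`, makes it explicit which figure is being written.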

## 16. Final Exercise: Your First Python Data Story

Using any small dataset (for example `tips`, `penguins`, or your own CSV file), create **one visualization** that communicates **one clear message**.

Your plot must include:

- a descriptive title
- axis labels
- at least one annotation
- clean formatting (no unnecessary clutter)
- appropriate color use

In a short Markdown cell, explain:

1. What is the main message?
2. Why did you choose this chart type?
3. Which visual channels encode which attributes?
4. How did you apply at least one principle from Bertin, Tufte, Ware, Few, or Munzner?

This is a small rehearsal for your larger storytelling-with-data assignment.