# Python for Social Science

<img src="../figures/PySocs_banner.png" width="50%" align="left">

# Data Visualization

>*“The preliminary examination of data by simple graphical methods is always desirable. The purpose of such diagrams is not to substitute for adequate statistical treatment, but to afford a preliminary, and often valuable, means of detecting gross errors, of studying the general character of the material, and of suggesting appropriate statistical tests.”*

This statement by Sir Ronald A. Fisher highlights the importance of using graphs and figures. Fisher emphasized that visualizations should clarify the data rather than substitute for analytical methods.

This underscores the role of visualization as a tool for exploration and communication, making it useful for identifying patterns, errors, or anomalies.

## Visualization using `matplotlib`

Python offers a variety of packages for creating data visualizations, but none is more fundamental than `matplotlib`, which will be the focus of this session. The name `matplotlib` is derived from `MATLAB`, and the traditional interface of `matplotlib` mimics that of its predecessor.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

`pyplot` is a collection of functions that make `matplotlib` work like `MATLAB`. By convention, `pyplot` is aliased to `plt`, which we just did in the above import cell.

### Different Interfaces to `matplotlib`

One aspect of `matplotlib` that can confuse some users is that there are several ways to access its functionalities:

- **Method 1**: A high-level interface similar to MATLAB.
- **Method 2**: A low-level object-oriented method.
- **Method 3**: Methods offered by Pandas.

**Method 1**

Here’s a "one-liner" using **Method 1** to generate a simple line graph:

In [None]:
# Method 1

arr = pd.Series(range(10))

# One-liner
plt.plot(arr)

# Add options
plt.title("Title")
plt.xlabel("x-axis")
plt.ylabel("y-axis");

💡Although not explicitly shown, the simple `plot()` function call above creates two distinct objects: a **figure** and a default **axes**. The **figure** represents the entire white canvas, while the **axes** refers to the rectangular plot area within the canvas.
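
To make the implicit objects visible, the following minimal sketch retrieves them with `plt.gcf()` ("get current figure") and `plt.gca()` ("get current axes"). The data values are arbitrary and serve only as an illustration.

```python
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])  # implicitly creates a Figure and an Axes

fig = plt.gcf()  # the Figure that plt.plot() created behind the scenes
ax = plt.gca()   # the Axes that plt.plot() drew on

# Once retrieved, the implicit objects can be customized like any others
ax.set_title("Title set through the Axes object")
fig.set_size_inches(6, 4)
```

This is mainly useful for understanding what Method 1 does; in practice, creating `fig` and `ax` explicitly is usually clearer.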

**Method 2**

Using **Method 2**, the following example illustrates a more modern object-oriented interface:

In [None]:
# Method 2

arr = pd.Series(range(10))

# Two steps
fig = plt.figure() # create a plot figure object
ax = fig.add_subplot() # create an axes object inside fig

ax.plot(arr)

ax.set_title("Title")
ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis");

In **Method 2**, a plot figure and an axes object are explicitly created before drawing a plot. Another source of confusion is that this *two-step* approach is often performed in a single step, as follows:

In [None]:
# Method 2B

arr = pd.Series(range(10))

fig, ax = plt.subplots() # create a plot figure object and an axes object

ax.plot(arr)

ax.set_title("Title")
ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis");

`plt.subplots()` creates both a `Figure` and an `Axes` in a single call:
- `fig` → the container (the entire window or canvas)
- `ax` → the plotting area (the coordinate system where your data is drawn)

**Method 3**

Lastly, Pandas is tightly integrated with `matplotlib`. In **Method 3**, Pandas uses Matplotlib as its backend and accesses Matplotlib functions through the `.plot()` accessor, rather than re-implementing plotting functionality.

In [None]:
# Method 3

arr = pd.Series(range(10))

# Use pandas plot directly
arr.plot(
    title="Title",
    xlabel="x-axis",
    ylabel="y-axis",
    figsize=(6,4)
)

This method is brief and ideal for rapid exploratory tasks.

- `arr.plot()` automatically creates the figure and axes.
- `title`, `xlabel`, and `ylabel` can be passed as arguments.
- `.plot()` returns the Matplotlib `Axes` object, so you can keep it (with the assignment `ax = arr.plot(...)`) if you need finer customization afterward.
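
As a minimal sketch of the last point, the Axes returned by `.plot()` can be captured and then refined with ordinary Matplotlib methods. The dashed mean line is an illustrative addition, not part of the original example.

```python
import pandas as pd

arr = pd.Series(range(10))

# pandas creates the figure and axes, then returns the Axes object
ax = arr.plot(title="Title", figsize=(6, 4))

# Further customization uses the regular Matplotlib Axes API
ax.axhline(arr.mean(), linestyle="--", color="gray")  # reference line at the mean
ax.set_xlabel("x-axis")
ax.set_ylabel("y-axis")
```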

✅ Using `pd.Series.plot()` / `pd.DataFrame.plot()`

Advantages:
- Concise: One-liner plotting for quick EDA (Exploratory Data Analysis).
- Built-in labeling: Column names (or index names) are automatically used for labels and legends.
- Fast iteration: Great when you’re inspecting multiple variables quickly in a notebook.

Drawbacks:
- Less flexibility: Only a subset of Matplotlib’s full plotting functions are exposed directly.
- For more complex plots (multiple axes, subplots, regression overlays, custom annotations), you’ll often need to drop down into Matplotlib anyway.
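
The two approaches can also be combined. The sketch below assumes a small made-up DataFrame (the column names `a` and `b` are hypothetical) and passes Matplotlib-created axes to pandas through the `ax=` argument, then adds an annotation with plain Matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, for illustration only
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) ** 2})

# Matplotlib controls the layout
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# pandas draws onto the axes it is given
df["a"].plot(ax=ax_left, title="Column a")
df["b"].plot(ax=ax_right, title="Column b")

# Matplotlib-only customization on top of the pandas plot
ax_right.annotate("largest value", xy=(9, 81), xytext=(2, 70),
                  arrowprops={"arrowstyle": "->"})
fig.tight_layout()
```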

### Incremental Construction of Complex Visualizations

One of Matplotlib's greatest strengths is its ability to allow incremental construction of visualizations. This approach enables you to build complex figures step by step, rather than requiring everything to be defined at once.

In the following example, we illustrate some of the key ways Matplotlib supports incremental construction of complex figures.

In [None]:
x = np.linspace(-8, 8, 100)
p = 1 / (1 + np.exp(-x))  # cumulative logistic function
r = p + np.random.normal(0, 0.01, size=len(p))  # p + random noise

fig, ax = plt.subplots(figsize=(8, 5))

# logistic curve
ax.plot(x, p, linewidth=2, color="navy",
        label=r"$p(x)=\frac{1}{1 + e^{-x}}$")

# scatter points (transparent red)
ax.scatter(x, r, color="red", alpha=0.4, s=30, edgecolor="none",
           label="Simulated data")

# tangent line at (0, 0.5)
ax.axline((0, 0.5), slope=0.25, color="black", linestyle="--",
          label="Tangent line")

# reference lines
ax.axvline(x=0, linestyle=":", color="gray")
ax.axhline(y=0.5, linestyle=":", color="gray")

# labels and legend
ax.set_xlabel("x")
ax.set_ylabel("p(x)")
ax.legend()
ax.set_title("Logistic Function with Simulated Data", fontsize=14);
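
The tangent line above uses `slope=0.25`. That value follows from the derivative of the logistic function, $p'(x) = p(x)\,(1 - p(x))$, which at $x = 0$ gives $0.5 \times 0.5 = 0.25$. The sketch below checks this numerically with a central difference; the helper name `logistic` and the step size `h` are illustrative choices.

```python
import numpy as np

def logistic(x):
    """Cumulative logistic function p(x) = 1 / (1 + exp(-x))."""
    return 1 / (1 + np.exp(-x))

h = 1e-5  # small step for the central difference

# Numerical slope at x = 0
slope_numeric = (logistic(h) - logistic(-h)) / (2 * h)

# Analytical slope at x = 0: p(0) * (1 - p(0))
slope_formula = logistic(0) * (1 - logistic(0))

print(slope_numeric, slope_formula)
```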

### Multiple Subplots

Matplotlib utilizes the concept of *subplots*, which enables the display of multiple plots within a single figure object. This makes it easier to compare different views of data. The distinction between **figure** and **axes** elements provides convenient management of multi-plot layouts, letting us control the overall figure as well as each plot.

In [None]:
x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()

ax.plot(x , np.sin(x)) # Draw explicitly on 'ax'
ax.set_title("sine(x)")

fig.suptitle("Trig Functions");  # Control figure-wide properties

Again, you explicitly manage:
- The Figure (`fig`) → overall container.
- The Axes (`ax`) → the actual plotting area.

We can add a second axes manually afterward with `fig.add_axes()`. For example,

In [None]:
x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()

ax.plot(x, np.sin(x))
ax.set_title("sin(x)")
fig.suptitle("Trig Functions")  # Control figure-wide properties

# add a second axes in the same figure
ax2 = fig.add_axes([0.365, 0.58, 0.25, 0.25])  # inset at top-center [left, bottom, width, height]
ax2.plot(x, np.cos(x), color="orange")
ax2.set_title("cos(x)");

More commonly, we size the first axes so that it leaves some whitespace in the figure, which allows a second axes to be placed next to it without overlapping.

In [None]:
fig = plt.figure(figsize=(8, 4))

# First axes (occupies the left half, not the whole figure)
ax1 = fig.add_axes([0.1, 0.1, 0.35, 0.8])  # [left, bottom, width, height]
ax1.plot(x, np.sin(x))
ax1.set_title("sin(x)")

# Second axes (occupies the right half)
ax2 = fig.add_axes([0.55, 0.1, 0.35, 0.8])
ax2.plot(x, np.cos(x), color="orange")
ax2.set_title("cos(x)")

fig.suptitle("Trig Functions");  # Control figure-wide properties

Manual placement with `fig.add_axes()` is occasionally useful (for insets, for example), but the figure/axes distinction really shines when multiple plots are pre-arranged in a grid within a single figure object.

Suppose you want to create four plots in a single figure object, arranged in a 2x2 grid.

In [None]:
fig, ax = plt.subplots(2, 2) # 2x2 grid of axes

# top left
ax[0, 0].plot(x, np.sin(x))
ax[0, 0].set_title("sin(x)")

# top right
ax[0, 1].plot(x, np.cos(x), "-", color="orange")
ax[0, 1].set_title("cos(x)")

# bottom left
ax[1, 0].hist(np.random.standard_normal(100), 
              bins=20, 
              color="blue", 
              alpha=0.2)
ax[1, 0].set_title("Histogram")

# bottom right
ax[1, 1].scatter(np.arange(30), 
                 np.arange(30) + 3 * np.random.standard_normal(30))
ax[1, 1].set_title("Scatterplot")

# figure
fig.suptitle("Multiple Plots in 2x2") # Control figure-wide properties
fig.tight_layout(rect=[0, 0, 1, 0.95]); # adjust layout to avoid overlap

The `rect=[0, 0, 1, 0.95]` option of the last line defines the rectangle in normalized figure coordinates (`[left, bottom, right, top]`) that the subplots will fit inside. For example, `rect=[0, 0, 1, 0.95]` reserves the top 5% of the figure for a suptitle.
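
When every panel is built the same way, indexing each axes by hand becomes repetitive. As a sketch of one alternative, the 2D array returned by `plt.subplots()` can be flattened with `.flat` and filled in a loop. The choice of functions and the `sharex=True` option are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

# Panel titles mapped to the functions to draw (illustrative choices)
funcs = {"sin(x)": np.sin, "cos(x)": np.cos, "tanh(x)": np.tanh, "sqrt(x)": np.sqrt}

# sharex=True links the x-axes so all panels use the same x-range
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(8, 6))

# axes.flat iterates over the 2x2 array row by row
for ax, (name, func) in zip(axes.flat, funcs.items()):
    ax.plot(x, func(x))
    ax.set_title(name)

fig.suptitle("Four Functions in a 2x2 Grid")
fig.tight_layout(rect=[0, 0, 1, 0.95])  # leave the top 5% for the suptitle
```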

## Loading Datasets from R packages

Before exploring more advanced visualization techniques, we’ll show how to use real-world datasets available in some R packages. The `statsmodels` module offers access to R datasets from CRAN using `statsmodels.datasets.get_rdataset()`.

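Below is a table of social science datasets that can be loaded with `statsmodels.datasets.get_rdataset()`, along with the Python code to load each one. These classic datasets are commonly used in economics, political science, and sociology. Datasets marked with an asterisk (`*`) are used in this session. Because `get_rdataset()` downloads files from the online Rdatasets collection, an internet connection is required, and a dataset/package pair that is not part of that collection raises an error.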

| Dataset    | R Package / Source | Description                                | Load in Python                                         |
| ---------- | ------------------ | ------------------------------------------ | ------------------------------------------------------ |
| *`Fair`     | `Ecdat`            | Extramarital affairs data   | `sm.datasets.get_rdataset("Fair", "Ecdat").data`       |
| *`Guerry`   | `HistData`         | Moral statistics of 19th-century France    | `sm.datasets.get_rdataset("Guerry", "HistData").data`  |
| `anes96`   | `carData`          | American National Election Survey, 1996    | `sm.datasets.get_rdataset("anes96", "carData").data`   |
| `nes96`    | `carData`          | Another 1996 US election survey            | `sm.datasets.get_rdataset("nes96", "carData").data`    |
| `usair`    | `datasets`         | US air pollution & health data             | `sm.datasets.get_rdataset("usair", "datasets").data`   |
| *`CPS1985`  | `AER`              | U.S. Current Population Survey 1985        | `sm.datasets.get_rdataset("CPS1985", "AER").data`      |
| `longley`  | `datasets`         | Macroeconomic data for regression examples | `sm.datasets.get_rdataset("longley", "datasets").data` |
| `Grunfeld` | `Ecdat`            | Investment data for firms                  | `sm.datasets.get_rdataset("Grunfeld", "Ecdat").data`   |


For example, `sm.datasets.get_rdataset("Fair", "Ecdat")` fetches the dataset, `Fair`, from the R package named `Ecdat`:

In [None]:
fair = sm.datasets.get_rdataset("Fair", "Ecdat")
df_fair = fair.data
df_fair.head()

Here are the variables contained in the dataset, along with brief descriptions.

| **Variable** | **Description**                                                                       |
| ------------ | ------------------------------------------------------------------------------------- |
| `sex`        | Gender of the respondent: *male* or *female*                                          |
| `age`        | Age of the respondent                                                                 |
| `ym`         | Number of years married                                                               |
| `child`      | Presence of children: *yes* or *no*                                                   |
| `religious`  | Self-reported religiosity, on a scale from 1 (anti) to 5 (very religious)             |
| `education`  | Years of education completed                                                          |
| `occupation` | Occupation level, coded from 1 to 7 based on Hollingshead classification              |
| `rate`       | Self-rating of marriage happiness, on a scale from 1 (very unhappy) to 5 (very happy) |
| `nbaffairs`  | Number of extramarital affairs in the past year                                       |



## Line Charts

A line chart is one of the most common types of visualizations in Matplotlib. It is typically used to show trends over time or ordered categories.

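- In Matplotlib, a line chart is created using the function `plt.plot()` (or the equivalent method `ax.plot()`).
- When two arrays are passed, the first represents the x-values (independent variable) and the second represents the y-values (dependent variable). When only one array is passed, it is treated as the y-values and plotted against its positions (0, 1, 2, ...) or, for a Pandas Series, its index.
- Line charts are great for showing continuous data and spotting patterns, trends, and relationships.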
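
The difference between the one-argument and two-argument forms of `plot()` can be sketched as follows. The arrays are arbitrary illustrative values.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([2.0, 4.0, 8.0])
x = np.array([10, 20, 30])

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# One argument: values are y, x defaults to the positions 0, 1, 2
(line_y_only,) = ax_left.plot(y)
ax_left.set_title("plot(y)")

# Two arguments: first is x, second is y
(line_xy,) = ax_right.plot(x, y)
ax_right.set_title("plot(x, y)")

print(line_y_only.get_xdata())  # x-values Matplotlib generated
print(line_xy.get_xdata())      # x-values we supplied
```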

In [None]:
fig, ax = plt.subplots(2, 2)

rng = np.random.default_rng(42) # create a generator with seed 42

x = rng.standard_normal(100).cumsum()

ax[0, 0].plot(x, linestyle="-", color="k")   # solid black
ax[0, 1].plot(x, linestyle="--", color="r")  # dashed red
ax[1, 0].plot(x, linestyle=":", color="b")   # dotted blue
ax[1, 1].plot(x, linestyle="-.", color="c"); # dash dot cyan

### Practice Exercise 1

Generate an array `x` with 200 values evenly spaced from 0 to 2π using `np.linspace` and `np.pi`.

Plot both functions, `y1 = sin(x)` and `y2 = cos(x)`, on the same chart.

Include a title, axis labels, and a legend:
- Title: "Sine and Cosine Functions"
- X-axis label: "x (radians)"
- Y-axis label: "y value"
- Add a legend to identify each function.

**Step 1**: Generate `x`, `y1`, and `y2`

In [None]:
# YOUR CODE HERE


**Step 2**: Plot both functions using **Method 1**, e.g., `plt.plot()`:

In [None]:
# YOUR CODE HERE - Method 1


**Step 3**: Plot both functions using **Method 2**, e.g., `fig, ax = plt.subplots()`:

In [None]:
# YOUR CODE HERE - Method 2
  

### The Guerry Dataset

The Guerry dataset is a classic historical dataset in social science, compiled by the French lawyer and statistician André-Michel Guerry in the 1830s. It provides detailed information on moral and social statistics across the 86 départements of France.

🔹 Purpose

Guerry aimed to understand patterns of crime, education, wealth, and social behavior in 19th-century France. By mapping and analyzing these variables, he sought to explore the relationship between social conditions and moral outcomes, predating modern quantitative social science.

The dataset includes, among others, these variables:

| Variable      | Description             | Type           | Notes / Units                                                             |
| ------------- | ----------------------- | -------------- | ------------------------------------------------------------------------- |
| `dept`        | Department ID           | Integer        | Numeric code of the département                                           |
| `Department`  | Department name         | String         | Name of the department                                                    |
| `Crime_pers`  | Crimes against persons  | Numeric        | Population per crime against persons (larger values mean fewer crimes)    |
| `Crime_prop`  | Crimes against property | Numeric        | Population per crime against property (larger values mean fewer crimes)   |
| `Literacy`    | Literacy                | Numeric        | Percentage of military conscripts who can read and write                  |
| `Donations`   | Donations to the poor   | Numeric        | Measure of donations to the poor                                          |
| `Infants`     | Illegitimate births     | Numeric        | Population per illegitimate birth                                         |
| `Suicides`    | Suicides                | Numeric        | Population per suicide                                                    |
| `Wealth`      | Wealth                  | Numeric (rank) | Ranked index based on per capita tax on personal property                 |
| `Commerce`    | Commerce and industry   | Numeric (rank) | Rank of the number of patents relative to population                      |
| `Clergy`      | Clergy                  | Numeric (rank) | Rank of the number of Catholic priests in active service relative to population |
| `Region`      | Region                  | Factor/String  | Region of France, coded `N`, `S`, `E`, `W`, `C` (missing for Corsica)     |


In [None]:
# Load Guerry dataset from HistData
data_guerry = sm.datasets.get_rdataset("Guerry", "HistData").data

# Select variables to plot: Crime_pers, Crime_prop, Literacy, Region
df_guerry = data_guerry[['Department', 'Crime_pers', 'Crime_prop', 'Literacy', 'Region']]

# Sort by Department name alphabetically
df_guerry = df_guerry.sort_values('Department')
df_guerry.head(10)

In [None]:
# Set the plot size
plt.figure(figsize=(14, 6))

# Plot each variable as a line
plt.plot(df_guerry['Department'], df_guerry['Crime_pers'], marker='o', label='Crime Against Persons')
plt.plot(df_guerry['Department'], df_guerry['Crime_prop'], marker='s', label='Crime Against Property')
plt.plot(df_guerry['Department'], df_guerry['Literacy'], marker='^', label='Literacy (%)')

# Add labels and title
plt.xlabel('Department')
plt.ylabel('Value')
plt.title('Crime and Literacy by French Department (Guerry Dataset, 1830s)')
plt.xticks(rotation=90)  # rotate department names
plt.legend()

plt.grid(True)
plt.tight_layout()

`Literacy` is measured as a percentage (0–100), while the crime variables are recorded as population per crime, with values in the thousands. On a shared axis, the raw `Literacy` line therefore appears almost flat. To compare the series visually, we can rescale `Crime_pers` and `Crime_prop` (here by dividing by 1,000, written as `/ 100_000 * 100`) so that their range is closer to that of `Literacy`.

In [None]:
# Rescale crime variables
df_guerry['Crime_pers_scaled'] = df_guerry['Crime_pers'] / 100_000 * 100
df_guerry['Crime_prop_scaled'] = df_guerry['Crime_prop'] / 100_000 * 100

# Plot
plt.figure(figsize=(14, 6))

plt.plot(df_guerry['Department'], df_guerry['Crime_pers_scaled'], marker='o', label='Crime Against Persons (rescaled)')
plt.plot(df_guerry['Department'], df_guerry['Crime_prop_scaled'], marker='s', label='Crime Against Property (rescaled)')
plt.plot(df_guerry['Department'], df_guerry['Literacy'], marker='^', label='Literacy (%)')

plt.xlabel('Department')
plt.ylabel('Value')
plt.title('Crime and Literacy by French Department (Guerry Dataset, 1830s)')
plt.xticks(rotation=90)
plt.legend()

plt.grid(True)
plt.tight_layout()
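
Rescaling is one option. Another common approach, not used in this session, is to keep the raw values and give each series its own y-axis with `ax.twinx()`. The sketch below uses made-up values (the column names `year`, `large_scale`, and `percent` are hypothetical and are not the Guerry data).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values on two very different scales (not the Guerry data)
df_toy = pd.DataFrame({
    "year": [1830, 1831, 1832, 1833],
    "large_scale": [28_000, 15_000, 22_000, 9_000],
    "percent": [37, 51, 13, 46],
})

fig, ax_left = plt.subplots(figsize=(6, 4))
ax_right = ax_left.twinx()  # second y-axis that shares the same x-axis

ax_left.plot(df_toy["year"], df_toy["large_scale"], marker="o", color="tab:blue")
ax_right.plot(df_toy["year"], df_toy["percent"], marker="^", color="tab:orange")

# Color-code the axis labels so readers can match line and scale
ax_left.set_ylabel("large_scale (raw units)", color="tab:blue")
ax_right.set_ylabel("percent (0-100)", color="tab:orange")
ax_left.set_xlabel("year")
fig.tight_layout()
```

Dual axes keep the original units readable, but the visual relationship between the two lines depends on the axis limits, so they should be interpreted with care.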

Currently, the x-axis is sorted by `Department`. To sort the series by another variable, e.g., `Literacy`, you can reorder the DataFrame prior to plotting. This will ensure all three series remain consistently aligned.

In [None]:
# Rescale crime variables
df_guerry['Crime_pers_scaled'] = df_guerry['Crime_pers'] / 100_000 * 100
df_guerry['Crime_prop_scaled'] = df_guerry['Crime_prop'] / 100_000 * 100

# Sort by Literacy
df_sorted = df_guerry.sort_values(by='Literacy')  # returns a new DataFrame; df_guerry itself is unchanged

# Plot
plt.figure(figsize=(14, 6))

plt.plot(df_sorted['Department'], df_sorted['Crime_pers_scaled'], marker='o',
         label='Crime Against Persons (rescaled)')
plt.plot(df_sorted['Department'], df_sorted['Crime_prop_scaled'], marker='s',
         label='Crime Against Property (rescaled)')
plt.plot(df_sorted['Department'], df_sorted['Literacy'], marker='^',
         label='Literacy (%)')

plt.xlabel('Department')
plt.ylabel('Value')
plt.title('Crime and Literacy by French Department (Guerry Dataset, 1830s)')
plt.xticks(rotation=90)
plt.legend()

plt.grid(True)
plt.tight_layout()
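
The reason for assigning the result to a new name is that `sort_values()` returns a sorted copy by default and leaves the original DataFrame untouched. A minimal sketch with made-up rows (the names `X`, `Y`, `Z` and their values are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows, for illustration only
df = pd.DataFrame({"Department": ["X", "Y", "Z"], "Literacy": [37, 51, 13]})

# sort_values() returns a new, sorted DataFrame
df_sorted = df.sort_values(by="Literacy")

print(df["Department"].tolist())         # ['X', 'Y', 'Z'] (original order kept)
print(df_sorted["Department"].tolist())  # ['Z', 'X', 'Y'] (ordered by Literacy)
```

Because every column is reordered together, all series plotted from `df_sorted` stay aligned with the same departments.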

## Bar Charts

Bar charts are one of the most common tools for visualizing categorical data. They provide a clear and effective way to compare the sizes of different categories or the magnitudes of summary statistics.

A bar chart visually represents categorical data using rectangular bars. The height (or length, in the case of horizontal bars) of each bar corresponds to the value of its category. This value can represent a count or a statistic, such as the mean. Bar charts are ideal for comparing quantities across various categories.

In Matplotlib, you can create bar charts by using the `plt.bar()` function for vertical bars or the `plt.barh()` function for horizontal bars.

### Vertical Bar Chart

A small bookstore wants to visualize its monthly sales for four genres: Fiction, Non-Fiction, Science, and History. The sales (in units) are:

| Genre       | Sales |
| ----------- | ----- |
| Fiction     | 120   |
| Non-Fiction | 90    |
| Science     | 60    |
| History     | 30    |

- Create a vertical bar chart showing the sales of each genre.
- Customize the chart with a title "Monthly Book Sales" and label the axes appropriately.
- Change the bar color to "lightgreen".

In [None]:
# Data
genres = ['Fiction', 'Non-Fiction', 'Science', 'History']
sales = [120, 90, 60, 30]

# Create vertical bar chart
plt.bar(genres, sales, color='lightgreen')

# Add title and axis labels
plt.title('Monthly Book Sales')
plt.xlabel('Book Genre')
plt.ylabel('Units Sold');

Horizontal Orientation with `barh()`.

In [None]:
# Create horizontal bar chart
plt.barh(genres, sales, color='magenta')

# Add title and axis labels
plt.title('Monthly Book Sales')
plt.ylabel('Book Genre')
plt.xlabel('Units Sold');

Individual bar colors

In [None]:
# Create horizontal bar chart with individual bar colors
genre_colors = ['pink', 'black', 'skyblue', 'grey']

plt.barh(genres, sales, label=genres, color=genre_colors)

# Add title and axis labels
plt.title('Monthly Book Sales')
plt.ylabel('Book Genre')
plt.xlabel('Units Sold')
plt.legend();

Using a color palette

In [None]:
# Choose a colormap
cmap = plt.cm.viridis # equivalently, cmap = plt.get_cmap("viridis")

# Sample one evenly spaced color per genre from the colormap
genre_colors = cmap(np.linspace(0, 1, len(genres)))

plt.barh(genres, sales, label=genres, color=genre_colors)

# Add title and axis labels
plt.title('Monthly Book Sales')
plt.ylabel('Book Genre')
plt.xlabel('Units Sold')
plt.legend();

### Perceptually uniform colormaps

You may want to try these other color maps. They are more suitable for continuous variables than for categorical ones, but they do offer some aesthetic appeal.

| Name      | Description                | Example use                                                   |
| --------- | -------------------------- | ------------------------------------------------------------- |
| `viridis` | Dark purple → yellow       | Default in many plots, great for general-purpose numeric data |
| `plasma`  | Dark purple → yellow-red   | Slightly warmer than viridis                                  |
| `inferno` | Black → yellow             | High-contrast, good for dark backgrounds                      |
| `magma`   | Dark purple → light orange | Smooth, subtle gradient                                       |
| `cividis` | Blue → yellow              | Colorblind-friendly                                           |
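
Any of these colormaps can be sampled in the same way. As a sketch, the code below draws one color per category from `cividis` and converts the result to hex strings, which can be convenient for reuse outside Matplotlib. The variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

genres = ['Fiction', 'Non-Fiction', 'Science', 'History']

cmap = plt.get_cmap("cividis")

# Calling a colormap with values in [0, 1] returns an array of RGBA rows
colors = cmap(np.linspace(0, 1, len(genres)))

# Convert each RGBA row to a hex string such as '#rrggbb'
hex_colors = [to_hex(c) for c in colors]

print(colors.shape)
print(hex_colors)
```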

## Pandas shortcut to `matplotlib`

Both `pd.Series` and `pd.DataFrame` objects in Pandas have a `.plot()` method. Pandas uses its built-in plotting accessor, which serves as a wrapper around `matplotlib`.

For example, we can create a `pd.Series` by combining the `genres` and `sales` lists from the example above, and then call the `.plot()` method on the resulting Series.

In [None]:
# Data
genres = ['Fiction', 'Non-Fiction', 'Science', 'History']
sales = [120, 90, 60, 30]

# Create a pd.Series
sales_by_genre = pd.Series(sales, index=genres, name="Sales")

ax = sales_by_genre.plot(kind="bar", color=genre_colors, rot=0)

plt.title("Monthly Book Sales by Genre")
plt.ylabel("Sales");


### Practice Exercise 2

In studies of human behavior, researchers are often interested in how personal characteristics influence social outcomes. One area of interest is the relationship between religiosity and marital behavior. The **Fair** dataset from the **Ecdat** package contains survey data on individuals’ marital histories, including the number of extramarital affairs and measures of religiosity.

For this exercise, we will use the `df_fair` DataFrame to explore this relationship by creating a bar chart that visualizes the average number of affairs grouped by levels of religiosity. This helps illustrate potential patterns between personal beliefs and reported marital behavior.

**Step 1**

Compute the average number of affairs (`nbaffairs`) by the religiosity ratings (`religious`) and save it as `mean_affairs`:

In [None]:
# YOUR CODE HERE


**Step 2**

Using the Pandas shortcut to `matplotlib`, create a bar chart for `mean_affairs` with the following customizations.

- title: "Mean Number of Affairs by Religiousness (Fair Dataset)"
- x-axis label: "Religiousness (1=least, 5=most)"
- y-axis label: "Mean Number of Affairs"

In [None]:
# YOUR CODE HERE


## Stacked and Juxtaposed Bar Charts

**Juxtaposed** (grouped) bar chart: places bars for different groups side by side, making it easy to compare categories across groups.

**Stacked** bar chart: stacks group bars on top of each other within each category, making it easy to see the total distribution but harder to compare groups directly.

When to use which?
- Use juxtaposed bars when you want to emphasize differences between groups.
- Use stacked bars when you want to emphasize the composition within a category.

**Example**

Suppose a survey asked respondents about their level of agreement with the statement:

"Government should provide more support for higher education."

Responses were: Agree, Neutral, Disagree.
We also recorded respondents’ political affiliation (Group A vs. Group B).

We want to visualize how opinions differ between groups.

Using the following dataset, let's create both a stacked bar chart and a juxtaposed bar chart to compare opinions between groups:

| Political Affiliation | Agree | Neutral | Disagree |
| --------------------- | ----- | ------- | -------- |
| Group A               | 40    | 15      | 20       |
| Group B               | 25    | 30      | 35       |


In [None]:
# Raw data
data = {
    "Group": ["Group A", "Group B"],
    "Agree": [40, 25],
    "Neutral": [15, 30],
    "Disagree": [20, 35]
}

df = pd.DataFrame(data)
# Use Group as index
df = df.set_index("Group")
print(df)

Let's create a stacked bar chart showing response distributions by political affiliation.

In [None]:
# Plot stacked
df.plot(kind="bar", stacked=True, figsize=(6,4))

plt.ylabel("Number of respondents")
plt.title("Opinions by Political Affiliation (Stacked)")
plt.xticks(rotation=0)
plt.legend(title="Response");

Let's create a juxtaposed bar chart for the same data.

In [None]:
df.plot(
    kind="bar",
    stacked=False,
    figsize=(6,4))

plt.ylabel("Number of respondents")
plt.title("Opinions by Political Affiliation (Juxtaposed)")
plt.xticks(rotation=0)
plt.legend(loc="upper center", title="Response");
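
Because the two groups have different numbers of respondents (75 and 90), raw counts can make compositions hard to compare. One possible extension, sketched below, is to convert each row to shares before stacking so that every bar has height 1. The name `shares` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame(
    {"Agree": [40, 25], "Neutral": [15, 30], "Disagree": [20, 35]},
    index=pd.Index(["Group A", "Group B"], name="Group"),
)

# Divide each row by its row total so that every row sums to 1
shares = df.div(df.sum(axis=1), axis=0)

ax = shares.plot(kind="bar", stacked=True, figsize=(6, 4), rot=0)
ax.set_ylabel("Share of respondents")
ax.set_title("Opinions by Political Affiliation (Stacked Shares)")
# Place the legend outside the plotting area so it does not cover the bars
ax.legend(title="Response", bbox_to_anchor=(1.02, 1), loc="upper left")
```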

### Practice Exercise 3

Using `df_guerry`, create a bar chart of the average `Crime_pers_scaled` and `Crime_prop_scaled` by `Region`.

**Step 1**

Compute the mean of `Crime_pers_scaled` and `Crime_prop_scaled` for each region and save the result as `regions_means`.

In [None]:
# YOUR CODE HERE


**Step 2**

Create a **stacked** bar chart for `Crime_pers_scaled` and `Crime_prop_scaled` by `Region` with the following details:

- title: Average Crime by Region (Guerry Dataset, 1830s)
- y-axis label: Average Crime Rates
- x-axis label: Region
- color of bars: "red", "black"
- legend title: Crime Type
- legend labels: "Personal Crime", "Property Crime"

In [None]:
# YOUR CODE HERE
# Plot a stacked bar chart with pandas


### Customizing Plots using `plt` vs. `ax`

You've seen how to customize plots using two approaches: the `plt` functions and `ax` methods.

- `plt` functions modify the current figure globally.
- `ax` methods affect only the specific axes object, making them safer when working with multiple subplots or writing reusable code.

While you can mix both styles, the Object-Oriented approach is generally preferred for its clarity and control.

Here’s a table summarizing common Matplotlib plot customizations and how to set them using `plt` (pyplot) vs. `ax` (object-oriented Axes):

| Option        | `plt` (pyplot)                                  | `ax` (Axes / OO style)                                  | Notes                                   |
| ------------- | ----------------------------------------------- | ------------------------------------------------------- | --------------------------------------- |
| Figure size   | `plt.figure(figsize=(6,4))`                     | `fig, ax = plt.subplots(figsize=(6,4))`                 | Creates the figure and axes             |
| Plot type     | `plt.bar(x, y)` / `plt.plot(x, y)`              | `ax.bar(x, y)` / `ax.plot(x, y)`                        | Any plot method works on Axes           |
| Title         | `plt.title("My Title")`                         | `ax.set_title("My Title")`                              | OO style is per-Axes                    |
| X-axis label  | `plt.xlabel("X Label")`                         | `ax.set_xlabel("X Label")`                              |                                         |
| Y-axis label  | `plt.ylabel("Y Label")`                         | `ax.set_ylabel("Y Label")`                              |                                         |
| X-tick labels | `plt.xticks(rotation=45)`                       | `ax.tick_params(axis="x", labelrotation=45)`            | Rotates labels without resetting them   |
| Y-tick labels | `plt.yticks(rotation=45)`                       | `ax.tick_params(axis="y", labelrotation=45)`            | Similar to X-ticks                      |
| Grid          | `plt.grid(axis="y", linestyle="--", alpha=0.7)` | `ax.grid(axis="y", linestyle="--", alpha=0.7)`          | Works for both single and multiple axes |
| Legend        | `plt.legend(["A","B"], title="Legend")`         | `ax.legend(["A","B"], title="Legend")`                  | OO style allows per-Axes control        |
| Tight layout  | `plt.tight_layout()`                            | `ax.figure.tight_layout()`                              | Adjusts spacing automatically           |
| Color         | `plt.bar(x, y, color="skyblue")`                | `ax.bar(x, y, color="skyblue")`                         | Works for bar, plot, scatter, etc.      |


## Scatter Plots

A scatter plot is a simple visualization tool commonly used as an initial step in exploratory data analysis to illustrate the relationship between two numerical variables. Each observation in the dataset is represented as a point, with its position determined by the values of the two variables.

In social science research, scatter plots are often employed to investigate questions such as:
- Is there a relationship between education level and income?
- Is there a connection between unemployment rates and crime rates across different regions?
- How does age influence political participation?

Let's create a scatterplot of `Crime_pers` and `Crime_prop` in pure `matplotlib`.

In [None]:
# Pure `matplotlib`
plt.figure(figsize=(6, 6))
plt.scatter(df_guerry['Crime_pers'], df_guerry['Crime_prop'], color='black')

plt.xlabel("Population per Crime Against Persons")
plt.ylabel("Population per Crime Against Property")
plt.title("Scatterplot of Crime Rates (Guerry Dataset)")
plt.grid(True, linestyle="--", alpha=0.6)

Let's reproduce this in Pandas using `.plot(kind="scatter")`.

In [None]:
# Pandas `.plot()` method
ax = df_guerry.plot(
    x='Crime_pers',
    y='Crime_prop',
    kind='scatter',
    color='red',
    figsize=(6, 6),
    title="Scatterplot of Crime Rates (Guerry Dataset)"
)

ax.set_xlabel("Population per Crime Against Persons")
ax.set_ylabel("Population per Crime Against Property")
ax.grid(True, linestyle="--", alpha=0.6)

### Encoding Data with Visual Aesthetics 

In many plots, we can use visual features like color, shape, or size to represent another variable. For instance, in a scatter plot, the size of each point could show literacy rate, and the color could show the region. This makes it easier to see patterns, compare values, and notice unusual cases, all at a glance.

In [None]:
# Scale Literacy so marker sizes are reasonable
size_scale = 5

ax = df_guerry.plot(
    x='Crime_pers',
    y='Crime_prop',
    kind='scatter',
    color="red",
    s=df_guerry['Literacy'] * size_scale,  # bubble size from Literacy
    alpha=0.4,  # transparency for readability
    figsize=(8, 6),
    title="Crime Rates with Marker Size by Literacy (Guerry Dataset)"
)

ax.set_xlabel("Crime Against Persons (per capita)")
ax.set_ylabel("Crime Against Property (per capita)")
ax.grid(True, linestyle="--", alpha=0.6)
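
The plot above encodes only one extra variable (marker size). As a minimal sketch of how a categorical variable could also be mapped to color, the example below draws one scatter layer per category. The toy DataFrame, its column names, and the palette are hypothetical and chosen purely for illustration; they are not part of the Guerry data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "region": ["N", "S", "N", "E", "S", "E"],
    "literacy": [20, 35, 50, 15, 60, 40],
})

# One color per category (arbitrary palette, an assumption of this sketch)
palette = {"N": "tab:blue", "S": "tab:orange", "E": "tab:green"}

fig, ax = plt.subplots(figsize=(6, 4))

# Draw one layer per region so each category gets its own legend entry
for region, group in toy.groupby("region"):
    ax.scatter(
        group["x"],
        group["y"],
        s=group["literacy"] * 5,   # marker size encodes literacy
        color=palette[region],     # color encodes region
        alpha=0.6,
        label=region,
    )

ax.legend(title="Region")
ax.grid(True, linestyle="--", alpha=0.6)
```

Plotting each category separately is only one option; it keeps the legend simple at the cost of a short loop.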

### Practice Exercise 4

Load the CPS1985 dataset, which comes from the R package AER and can be retrieved through the `statsmodels` library, and store it as `df_cps1985`. This dataset contains 534 observations on 11 variables related to wages and demographic characteristics.

Visualize the relationship between the number of years of education (`education`) and hourly wage (`wage`) using a scatterplot.

**Step 1** Load Data

Using `sm.datasets.get_rdataset()` load "CPS1985" in the "AER" package.

In [None]:
# YOUR CODE HERE


**Step 2** Inspect Data

Display the first few rows to understand its structure.

In [None]:
# YOUR CODE HERE


**Step 3** Data Wrangling

Collapse all education values ≤ 8 into a single value of 8.

In [None]:
# YOUR CODE HERE


**Step 4** Create a scatterplot

Create a scatterplot displaying the relationship between years of education (`education`) and hourly wage (`wage`) across the 534 individuals in the dataset.

Use the `alpha=0.7` parameter to add transparency to the points, and enable the grid with a dashed line style for improved readability.

In [None]:
# YOUR CODE HERE
# Pandas method


### More Encoding Data with Visual Aesthetics

In our previous discussion, we demonstrated how to encode variables using visual aesthetics to convey multiple dimensions of information in a single plot. In this more detailed example, we use a bubble map to represent cities in California, incorporating the following elements:

- **Position (x and y coordinates)**: This indicates the geographic location of cities based on their longitude and latitude. 
- **Color (hue)**: We use a log transformation of the total population, which enables viewers to easily distinguish between small and large cities.
- **Size (area of markers)**: The size of each marker represents the total land area of each city, adding another dimension of scale. 
- **Transparency (alpha)**: Adjusting the transparency helps reduce overplotting in densely populated areas, improving the overall readability of the map.
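
To see why a log transformation is a reasonable choice for the color channel, here is a small sketch with made-up population figures (hypothetical values, not taken from the California data). It compares how much of the color range the smaller cities occupy before and after applying `np.log10()`.

```python
import numpy as np

# Hypothetical city populations spanning several orders of magnitude
population = np.array([1_000, 25_000, 400_000, 4_000_000])
log_pop = np.log10(population)

# Fraction of the full color range covered by the three smaller cities
raw_share = (population[2] - population[0]) / (population[-1] - population[0])
log_share = (log_pop[2] - log_pop[0]) / (log_pop[-1] - log_pop[0])

print(f"Raw scale: {raw_share:.2f} of the range")
print(f"Log scale: {log_share:.2f} of the range")
```

On the raw scale, the three smaller cities are squeezed into a narrow slice of the colormap, so they would look nearly identical; on the log scale they are spread over most of it.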

In [None]:
pd_cities = pd.read_csv('../data/california_cities.csv', index_col=0)
pd_cities.head()

The following example is adapted from VanderPlas (2023, p. 273), where the color and size of the points are mapped to two features (Population and Area) of the data.

In [None]:
ax = pd_cities.plot.scatter(
    x="longd",
    y="latd",
    c=np.log10(pd_cities["population_total"]),  # log-color scale
    cmap="plasma",
    s=pd_cities["area_total_km2"],              # bubble size
    linewidth=0,
    alpha=0.5,
    label=None,  # suppresses legend
    clim=(3, 7), # sets the colormap limits
    grid=True,
    figsize=(8, 6)
)

# Axis labels and title
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("California Cities: Population (color) and Area (size)");


## Histogram and Density Plots

Histograms and density plots are fundamental tools for visualizing the distribution of a single numeric variable.

- **Histogram**: Divides the range of the variable into bins and counts how many observations fall into each bin. Useful for seeing frequency distributions and detecting skewness or outliers.
- **Density Plot**: A smoothed version of a histogram that estimates the probability density function of the variable. Provides a continuous view of the distribution.
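
As a minimal sketch of what "binning" and "density" mean numerically, the example below uses `np.histogram()` on a small made-up array (hypothetical values, not from `pd_cities`). It shows the raw counts per bin and checks that, with `density=True`, the bar areas sum to one.

```python
import numpy as np

# Hypothetical toy data
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Raw counts: how many observations fall into each of 4 equal-width bins
counts, edges = np.histogram(values, bins=4, range=(1, 9))

# density=True rescales the heights so that the total bar area equals 1
heights, _ = np.histogram(values, bins=4, range=(1, 9), density=True)
widths = np.diff(edges)
total_area = (heights * widths).sum()

print("Bin edges:", edges)
print("Counts:   ", counts)
print("Heights:  ", heights)
print("Area:     ", total_area)
```

This normalization is what puts a histogram drawn with `density=True` on the same vertical scale as a density curve.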

Using the `population_total` variable in the `pd_cities` dataset:
- A histogram can show how many cities fall into different population ranges.
- A density plot can reveal the overall shape of the distribution (e.g., many small cities, few mega-cities).

Let's plot a histogram of `population_total` for all cities:
- Overlay a density plot on the same figure.
- Log-transform the population (base 10) to handle the skew, rather than rescaling the x-axis.
- Add appropriate axis labels and a title.

In [None]:
from scipy.stats import gaussian_kde

# Log-transform population to handle skew
log_pop = np.log10(pd_cities["population_total"])

plt.figure(figsize=(8, 6))

# Histogram
plt.hist(log_pop, bins=30, color="skyblue", alpha=0.7, density=True, label="Histogram")

# Density plot
density = gaussian_kde(log_pop)
x_vals = np.linspace(log_pop.min(), log_pop.max(), 200)
plt.plot(x_vals, density(x_vals), color="darkblue", linewidth=2, label="Density")

plt.xlabel("Log10 of Total Population")
plt.ylabel("Density")
plt.title("Distribution of City Population (Log-Scaled)")
plt.legend()
plt.grid(axis="y", linestyle="--", alpha=0.6)

### Practice Exercise 5

In this exercise, we will create a histogram overlaid with a density plot for the Total Area (in square miles), represented by `area_total_sq_mi`. 

We will use Pandas’ built-in plotting methods, specifically `.plot.hist()` for the histogram and `.plot.kde()` for the density plot.

**Step 1**

The `area_total_sq_mi` variable is highly skewed and contains zero values. Before applying the log transformation, filter out any values less than or equal to zero. Use `np.log10()` to transform the variable and save the result as a Pandas Series, `log_area_total`. 

- Cities with `area_total_sq_mi` <= 0 are dropped.
- We use `np.log10()` (common for area/population variables).
- Both histogram and density are drawn on the transformed scale.

In [None]:
# YOUR CODE HERE


**Step 2**

Create a histogram with an overlaid density plot using Pandas’ built-in plotting methods, `.plot.hist()` and `.plot.kde()`, instead of manually calling Matplotlib:

- Histogram:
    - 30 bins
    - color: "lightgreen"
    - density: True
    - xlim: (-3, 3)
    - label: "Histogram"

- Density Plot:
    - color: "darkgreen"
    - linewidth: 2
    - label: "Density"

In [None]:
# YOUR CODE HERE


## Visualizing Uncertainties

Measurements in nearly all scientific disciplines are subject to errors, such as sampling and measurement errors. It is important to clearly account for uncertainties when reporting numerical data. This is particularly relevant in social science research, where comparing average outcomes across groups and displaying the uncertainty in those estimates is crucial.

The `errorbar()` function in Matplotlib can be used to represent variability, such as standard errors or confidence intervals, in estimates.

Using `df_cps1985` again, we will show mean wages by gender with ±1 standard error ($\sigma_{\bar{X}}$) error bars.

$$\sigma_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{N}}$$
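
As a quick check of the formula, here is a minimal sketch on a handful of made-up wage values (hypothetical, not taken from CPS1985). It computes the standard error by hand, compares it with Pandas' `Series.sem()`, and forms a ±1.96 SE interval, assuming a normal approximation.

```python
import numpy as np
import pandas as pd

# Hypothetical wages for a small group
wages = pd.Series([5.0, 7.5, 9.0, 10.5, 13.0])

n = wages.count()
sd = wages.std()            # sample standard deviation (ddof=1 by default in Pandas)
se = sd / np.sqrt(n)        # standard error of the mean

# Pandas offers the same quantity directly
assert np.isclose(se, wages.sem())

# Approximate 95% interval under a normal approximation
lower = wages.mean() - 1.96 * se
upper = wages.mean() + 1.96 * se
print(f"mean = {wages.mean():.2f}, SE = {se:.3f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Note that `np.std()` defaults to `ddof=0`, so it would give a slightly smaller value than the Pandas `.std()` used here unless `ddof=1` is passed.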

In [None]:
# Load CPS1985 dataset
df_cps1985 = sm.datasets.get_rdataset("CPS1985", "AER").data

# Compute mean and standard error by gender
group_stats = df_cps1985.groupby("gender")["wage"].agg(["mean", "std", "count"])
group_stats["se"] = group_stats["std"] / group_stats["count"]**0.5

Matplotlib’s `plt.errorbar()` is used to draw plots with error bars:

```python
plt.errorbar(x, y, yerr=errors, fmt='o')
```

- `x`: x-values (groups or predictors)
- `y`: mean values (or central estimate)
- `yerr`: error bar length (can be ±SE, CI, etc.)
- `fmt='o'`: marker style for the central point

Other options: `capsize=5` to add caps, `color` for bar color, `elinewidth` for line thickness, and `ecolor` for error bar color.

In [None]:
# Create figure and axes
fig, ax = plt.subplots(figsize=(6, 5))

# Plot error bars on the axes
ax.errorbar(
    x=group_stats.index,
    y=group_stats["mean"],
    yerr=group_stats["se"],
    fmt="o",
    capsize=5,
    color="blue",
    ecolor="black"
)

# Customize axes
ax.set_title("Hourly Wages by Gender (CPS1985, ±1 SE)")
ax.set_xlabel("Gender")
ax.set_ylabel("Average Hourly Wage (USD)")
ax.set_ylim(6, 12)
ax.grid(axis="y", linestyle="--", alpha=0.6)

# Optional: adjust layout
fig.tight_layout()

### Practice Exercise 6

Let’s redo the example, but instead of reporting `wage` conditioning on `gender` (2 levels: male/female), we’ll condition on `occupation` (several levels). 

- Using `df_cps1985`, plot the mean `wage` by `occupation` with error bars showing an approximate 95% confidence interval (±1.96 standard errors).
- Use Pandas’ built-in plotting (`DataFrame.plot`) instead of calling `plt.errorbar()` directly.

**Step 1** 

Compute the mean wage by occupation, together with the standard error needed for the error bars, and save the result as a DataFrame `group_stats`:

In [None]:
# YOUR CODE HERE


**Step 2**

Note that Pandas doesn’t have an `.errorbar()` method. As a workaround, we can call `group_stats.plot()` to generate a bar chart with error bars, passing the error-bar lengths through the `yerr` argument. Use `.plot()` with `kind="bar"`, along with the following options:

- title: "Average Hourly Wages by Occupation (CPS1985, ±1.96 SE)"
- xlabel: "Occupation"
- ylabel: "Hourly Wage (USD)"

In [None]:
# YOUR CODE HERE


We’ve only scratched the surface with Matplotlib. Be sure to check out the Matplotlib [gallery](https://matplotlib.org/stable/gallery/) for some advanced techniques and inspiration!

### Wrap Up

That's all for now.
- Please complete the DC course "Introduction to Data Visualization with Seaborn" by noon on 10/13.
- Submit the in-class exercise notebook by 6:00 PM today.

BY PRINTING YOUR NAME BELOW, YOU CONFIRM THAT THE EXERCISES YOU SUBMITTED IN THIS NOTEBOOK ARE YOUR OWN AND THAT YOU DID NOT USE AI TO ASSIST WITH YOUR WORK.

In [None]:
# PRINT YOUR NAME
print("Enter Your Name Here")