# Data visualization

Python data visualization tool landscape:

  - matplotlib is powerful but unwieldy; good for basic plotting (scatter, line, bar), and pandas can use it [directly](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)
  - [seaborn](http://seaborn.pydata.org/) (built on top of matplotlib) is best for statistical visualization: summarizing data, understanding distributions, searching for patterns and trends
  - [bokeh](https://docs.bokeh.org/) is for interactive visualization to let your audience explore the data themselves

We will focus on **seaborn** in this class. It is the easiest to work with to produce meaningful and aesthetically pleasing visuals.

In [None]:
!curl -s -o pyproject.toml https://raw.githubusercontent.com/gboeing/ppd534/refs/heads/main/pyproject.toml && uv pip install -q -r pyproject.toml

In [None]:
import pandas as pd
import seaborn as sns

## 1. Load and prep the data

In [None]:
# load the california tracts census data
df = pd.read_csv(
    "https://raw.githubusercontent.com/gboeing/ppd534/main/data/census_tracts_data_ca.csv",
    dtype={"GEOID10": str},
)
df.shape

In [None]:
df.columns

In [None]:
df = df.set_index("GEOID10")
df.head()

## 2. Review: subsetting, grouping, and descriptive stats

In [None]:
# let's look only at counties in southern california
socal_counties = [
    "Imperial",
    "Kern",
    "Los Angeles",
    "Orange",
    "Riverside",
    "San Bernardino",
    "San Diego",
    "San Luis Obispo",
    "Santa Barbara",
    "Ventura",
]
mask = df["county_name"].isin(socal_counties)
df_sc = df[mask]
df_sc.shape

In [None]:
# quick descriptive stats across these counties
df_sc["med_household_income"].describe()

In [None]:
# looking across the whole thing obscures between-group heterogeneity
# let's group by county and look at descriptive stats again
df_sc.groupby("county_name")["med_household_income"].describe().astype(int)

That's better... but it's still hard to pick out patterns and trends by just staring at a table full of numbers. Let's visualize it.

## 3. Visualizing distributions

### 3a. Box plots

Box plots illustrate the data's distribution via the "5 number summary": min, max, median, and the two quartiles (plus outliers).

We will use seaborn for our visualization. In seaborn, you can control what counts as an outlier by changing the whisker length with the `whis` parameter: by convention, outliers are points more than 1.5 × IQR below the first quartile or above the third quartile. For a horizontal box plot, x = your variable's column and y = the categorical column to group by (swap them for a vertical box plot).
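
To make the outlier rule concrete, here is a minimal standard-library sketch of the five-number summary and the 1.5 × IQR rule. The income values are made up for illustration (in thousands of USD), and quartile conventions vary slightly between libraries, so the numbers seaborn draws for real data may differ a little from a hand calculation like this one.

```python
from statistics import quantiles

# made-up tract incomes (thousands of USD), for illustration only
incomes = [30, 42, 45, 50, 52, 55, 58, 61, 65, 150]

# first quartile, median, and third quartile
q1, median, q3 = quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# the conventional whisker multiplier (analogous to seaborn's whis parameter)
whis = 1.5
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# whiskers extend to the most extreme observations inside the fences;
# anything beyond the fences is drawn as an individual outlier point
inside = [v for v in incomes if lower_fence <= v <= upper_fence]
outliers = [v for v in incomes if v < lower_fence or v > upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Raising `whis` widens the fences, so fewer points are flagged as outliers.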

In [None]:
# use seaborn to make a boxplot of median household income per county
ax = sns.boxplot(x=df_sc["med_household_income"], y=df_sc["county_name"])

**What stories does this visualization tell you?**

Next, let's configure and tweak the plot to improve its aesthetics.

In [None]:
# what is this "ax" variable we created?
type(ax)

In [None]:
# every matplotlib axes is associated with a "figure" which is like a container
fig = ax.get_figure()
type(fig)

In [None]:
# manually change the plot's size/dimension by adjusting its figure's size
fig = ax.get_figure()
fig.set_size_inches(6, 6)  # inches
fig

It's usually better to let seaborn intelligently handle the figure size for you. But you can easily configure its style, plotting context, and many attributes of the plot:

In [None]:
# you can configure seaborn's style
sns.set_style("whitegrid")  # visual styles
sns.set_context("paper")  # presets for scaling figure element sizes

# fliersize changes the size of the outlier dots
# boxprops lets you set more configs with a dict, such as alpha (which means opacity)
ax = sns.boxplot(
    x=df_sc["med_household_income"], y=df_sc["county_name"], fliersize=1, boxprops={"alpha": 0.7}
)

# set the x-axis limit, the figure title, and x/y axis labels
ax.set_xlim(left=0)
ax.set_title("Box plot of tract-level median household income")
ax.set_xlabel("2017 inflation-adjusted USD")
ax.set_ylabel("")

# save figure to disk with 600 dpi and a tight bounding box
ax.get_figure().savefig("figure-income-boxplot.png", dpi=600, bbox_inches="tight")

In [None]:
# now it's your turn
# choose a different variable and visualize it as a box plot in each of 3 counties of your choice

### 3b. Histograms and KDE plots

Histograms visualize the distribution of some variable by binning it then counting observations per bin. KDE plots are similar, but continuous and smooth.
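
As a rough sketch of what "binning then counting" means, the standard-library example below bins a handful of made-up ages by hand. It also shows the density normalization (count divided by the number of observations times the bin width), which is what makes the bar areas sum to 1. The helper name `histogram_counts` and the bin edges are assumptions chosen for illustration, not seaborn's internals.

```python
from math import isclose

# made-up median ages, for illustration only
ages = [22, 25, 31, 34, 35, 38, 41, 44, 52, 67]
bin_edges = [20, 30, 40, 50, 60, 70]  # five equal-width bins


def histogram_counts(values, edges):
    """Count how many values fall in each bin [left, right)."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            left, right = edges[i], edges[i + 1]
            # the final bin also includes its right edge
            if left <= v < right or (i == last and v == right):
                counts[i] += 1
                break
    return counts


counts = histogram_counts(ages, bin_edges)

# density = count / (number of observations * bin width)
width = bin_edges[1] - bin_edges[0]
density = [c / (len(ages) * width) for c in counts]

print(counts)
print(density)
print(isclose(sum(d * width for d in density), 1.0))
```

Because density is scaled by sample size, it lets you compare the shapes of groups that contain different numbers of observations.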

In [None]:
# histplot with kde=True visualizes the variable's distribution as both histogram and kde
ax = sns.histplot(df["median_age"].dropna(), stat="density", kde=True)

In [None]:
# if you prefer, you can plot just the histogram alone
ax = sns.histplot(df["median_age"].dropna(), stat="density", kde=False)

You can compare multiple histograms to see how different groups overlap or differ by some measure.

In [None]:
# subset the dataframe into majority white and majority hispanic subsets
df_wht = df[df["pct_white"] > 50]
df_hsp = df[df["pct_hispanic"] > 50]

In [None]:
# compare their distributions to each other
ax = sns.histplot(df_wht["median_age"].dropna(), stat="density")
ax = sns.histplot(df_hsp["median_age"].dropna(), stat="density", color="orange")

In [None]:
# improve the aesthetics: label each distribution and create a legend
ax = sns.histplot(df_wht["median_age"].dropna(), stat="density", label="Majority White Tracts")

ax = sns.histplot(
    df_hsp["median_age"].dropna(), stat="density", label="Majority Hispanic Tracts", color="orange"
)
ax.legend()

# set x-limit, add x-label, then save to disk
ax.set_xlim(10, 85)
ax.set_xlabel("Median Age of Population (Years)")
ax.get_figure().savefig("figure-age-distributions.png", dpi=600, bbox_inches="tight")

**So, what does this plot tell us?**

It looks like the two groups differ... but is it a big enough difference to make meaningful claims about it? We will revisit this question when we discuss statistical significance in a few weeks.

In [None]:
# now it's your turn
# subset the dataframe in a different way (your choice), choose a new variable, and compare its distribution across the subsets
# how do the distributions differ? what does this mean in the real world?

## 4. Pairwise relationships

### 4a. Scatter plots

Histograms and box plots visualize univariate distributions: how a single variable's values are distributed. Scatter plots essentially visualize *bivariate* distributions so that we can see patterns and trends jointly between two variables.

In [None]:
# use seaborn to scatter-plot two variables
ax = sns.scatterplot(x=df["pct_bachelors_degree"], y=df["med_household_income"])

In [None]:
# scatter-plot two variables, broken out across three counties by color
counties = ["Riverside", "San Mateo", "San Francisco"]
df_counties = df[df["county_name"].isin(counties)]
ax = sns.scatterplot(
    x=df_counties["pct_bachelors_degree"],
    y=df_counties["med_household_income"],
    hue=df_counties["county_name"],
)

In [None]:
# same thing again, but styled more nicely
counties = ["Riverside", "San Mateo", "San Francisco"]
df_counties = df[df["county_name"].isin(counties)]
ax = sns.scatterplot(
    x=df_counties["pct_bachelors_degree"],
    y=df_counties["med_household_income"],
    hue=df_counties["county_name"],
    alpha=0.8,
)

# remove the column name from the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles=handles, labels=labels)

# set x/y limits, labels, and save figure
ax.set_xlim(0, 100)
ax.set_ylim(bottom=0)
ax.set_xlabel("Tract population % with bachelor's degree or higher")
ax.set_ylabel("Tract median household income (2017 USD)")
ax.get_figure().savefig("figure-income-degree.png", dpi=600, bbox_inches="tight")

In [None]:
# now it's your turn
# pick 2 new variables from the full dataset and scatter plot them against each other
# how do you interpret the pattern? what if you look at only 1 county?

### 4b. Pair plots, correlation heatmaps, and linear trends

In [None]:
# create a subset of SF county tracts, and just 4 variables
df_sf = df[df["county_name"] == "San Francisco"]
df_sf = df_sf[
    ["pct_bachelors_degree", "med_household_income", "med_home_value", "mean_commute_time"]
]
df_sf.head()

In [None]:
# show a pair plot of these SF tracts across these 4 variables

# note: pairplot returns a seaborn PairGrid (a grid of axes), not a single axes
grid = sns.pairplot(df_sf.dropna())

**Do you see patterns in these scatter plots?**

*Correlation* tells us to what extent two variables are linearly related to one another. Pearson correlation coefficients range from -1 to 1, with 0 indicating no linear relationship, -1 indicating a perfect negative linear relationship, and 1 indicating a perfect positive linear relationship.
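
The coefficient can be sketched by hand in a few lines of standard-library Python. The helper `pearson_r` and the toy data are illustrative assumptions, not part of the course dataset. The third example shows why "no linear relationship" is not the same as "no relationship".

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both variables' spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # perfect positive line
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # perfect negative line
r_curved = pearson_r(x, [4, 1, 0, 1, 4])  # symmetric U-shape, no linear trend

print(r_positive, r_negative, r_curved)
```

The U-shaped data is perfectly predictable from `x`, yet its linear correlation is zero, which is one reason to look at scatter plots alongside the coefficients.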

In [None]:
# a correlation matrix
correlations = df_sf.corr()
correlations.round(2)

In [None]:
# visual correlation matrix via seaborn heatmap
# use vmin, vmax, center to set colorbar scale properly
ax = sns.heatmap(
    correlations, vmin=-1, vmax=1, center=0, cmap="coolwarm", square=True, linewidths=1
)

In [None]:
# a linear (regression) trend line + confidence interval
ax = sns.regplot(x=df_sf["pct_bachelors_degree"], y=df_sf["med_household_income"])
ax.get_figure().set_size_inches(5, 5)  # make it square

## 5. Bar plots and count plots

Count plots let you count things across categories. Bar plots let you estimate a measure of central tendency across categories.
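
The difference between the two can be sketched with toy data and the standard library: a count plot's bar heights are the number of rows per category, while a bar plot's heights summarize a second variable within each category. The rows below are made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import fmean

# made-up (county, tract income in thousands of USD) rows, for illustration only
rows = [
    ("Kern", 48),
    ("Kern", 52),
    ("Orange", 85),
    ("Orange", 95),
    ("Orange", 90),
    ("Ventura", 80),
]

# count plot logic: how many rows fall in each category?
tract_counts = Counter(county for county, _ in rows)

# bar plot logic: what is the central tendency of a variable per category?
incomes_by_county = defaultdict(list)
for county, income in rows:
    incomes_by_county[county].append(income)
mean_income = {county: fmean(vals) for county, vals in incomes_by_county.items()}

print(tract_counts)
print(mean_income)
```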

In [None]:
# pandas value_counts() counts how many times each unique value appears in a column
counts = df_sc["county_name"].value_counts().sort_index()
counts

In [None]:
# simple count plot
# essentially a histogram counting observations across categorical data (instead of continuous data)
ax = sns.countplot(x=df_sc["county_name"])

In [None]:
# same thing again, but ordered and styled more nicely
order = df_sc["county_name"].value_counts().index
ax = sns.countplot(x=df_sc["county_name"], order=order, alpha=0.7)

# rotate the tick labels, set x and y axis labels, then save
ax.set_xticks(ax.get_xticks())
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_xlabel("Counties in Southern California")
ax.set_ylabel("Number of census tracts")
ax.get_figure().savefig("county-tracts-countplot.png", dpi=600, bbox_inches="tight")

In [None]:
# simple bar plot: estimate means of tract median household income + 95% confidence interval
ax = sns.barplot(x=df_sc["county_name"], y=df_sc["med_household_income"])
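
By default, seaborn estimates the bar plot's 95% confidence interval by bootstrapping: it repeatedly resamples the data with replacement and looks at how much the estimate varies. The sketch below illustrates that idea using only the standard library. The helper `bootstrap_ci`, the fixed seed, and the income values are assumptions for illustration, and this is not seaborn's actual implementation.

```python
import random
from statistics import fmean


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # resample with replacement, recording the mean of each resample
    means = sorted(
        fmean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    # cut off equal tails on both ends of the resampled means
    tail = (100 - level) / 2 / 100
    low = means[int(tail * n_boot)]
    high = means[int((1 - tail) * n_boot) - 1]
    return low, high


# made-up tract incomes (thousands of USD), for illustration only
incomes = [52, 61, 48, 75, 66, 58, 90, 43]
low, high = bootstrap_ci(incomes)

print(fmean(incomes), low, high)
```

Counties with few tracts or widely varying incomes produce wider intervals, which is why the error bars in the plot differ in length.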

In [None]:
# nicer bar plot sorted by median value
order = df_sc.groupby("county_name")["med_household_income"].median().sort_values().index
ax = sns.barplot(
    x=df_sc["med_household_income"],
    y=df_sc["county_name"],
    hue=df_sc["county_name"],
    estimator="median",
    errorbar=None,
    order=order,
    alpha=0.7,
    palette="plasma",
)

How does this compare to a box plot of the same variable?

## 6. Line plots

Line plots are most commonly used to visualize time series: how one or more variables change over time.

In [None]:
# load dataset of country gdp by year
df_gdp = pd.read_csv(
    "https://raw.githubusercontent.com/gboeing/ppd534/main/data/gdp.csv"
).set_index("year")
df_gdp.shape

In [None]:
df_gdp.tail()

In [None]:
# simple line plot
# seaborn uses the index as x-axis and individual lines for each column
ax = sns.lineplot(data=df_gdp)

In [None]:
# same thing, but subset to only show the years 1900-1950
ax = sns.lineplot(data=df_gdp.loc[1900:1950])

In [None]:
# same thing, but also subset to only show 2 countries
ax = sns.lineplot(data=df_gdp.loc[1900:1950, ["GBR", "USA"]])

In [None]:
# same thing again, but styled more nicely
ax = sns.lineplot(
    data=df_gdp.loc[1900:1950, ["GBR", "USA"]], dashes=False, palette=["steelblue", "chocolate"]
)

ax.set_xlim(1900, 1950)
ax.set_ylim(5000, 17000)
ax.set_xlabel("")
ax.set_ylabel("Real GDP per capita (2011 USD)")
ax.set_title("Per capita GDP, 1900-1950")
ax.get_figure().savefig("country-gdp-lineplot.png", dpi=600, bbox_inches="tight")

In [None]:
# now it's your turn
# choose any 3 countries from the GDP dataset and visualize them over any 100 year interval in the dataset

## 7. Working with color

Seaborn generally makes smart decisions about color for you. But you can usually tweak the colors in your plot by passing in a `palette` argument (the name of a colormap or a list of colors to use).

How seaborn handles color: https://seaborn.pydata.org/tutorial/color_palettes.html

Available color maps: https://matplotlib.org/tutorials/colors/colormaps.html

Available named colors: https://matplotlib.org/gallery/color/named_colors.html

In [None]:
# show the default color palette
sns.palplot(sns.color_palette())

In [None]:
# show the "Blues" color map as a palette
sns.palplot(sns.color_palette("Blues", n_colors=5))

In [None]:
# show the "plasma" color map as a palette
# notice that color map names are case sensitive
sns.palplot(sns.color_palette("plasma", n_colors=5))

In [None]:
# now it's your turn
# go back through a couple of the plots earlier in this notebook and adjust their colors
# try both colormaps and lists of color names: look up both using the links above