# Urban Informatics
# Module 05: Data visualization with matplotlib

- documentation: https://matplotlib.org/api/api_overview.html
- examples: https://matplotlib.org/gallery/index.html
- anatomy of mpl: https://matplotlib.org/_images/anatomy.png

Today we'll dissect matplotlib. In many ways, matplotlib is the "hard way" to do viz in Python... but it's powerful, flexible, ubiquitous, and helps to reinforce the manual decisions and techniques that other hand-holdy libraries obfuscate. Once you've learned how to do all of this, other Python visualization libraries are easy to pick up. There are several other visualization libraries out there, such as:

- seaborn: http://seaborn.pydata.org/
- bokeh: http://bokeh.pydata.org/

In [None]:
import matplotlib.cm as cm
import matplotlib.font_manager as fm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib notebook

## 1. matplotlib basics: figures and axes

A figure is the top-level container for everything related to your plot. An axis (an `Axes` object, in matplotlib's terminology) is attached to the figure and contains most of the plotting elements (like the x-axis, y-axis, ticks, lines, text, polygons, and the definition of the coordinate system). It's kind of confusing, but with repetition you'll get familiar with how it works.

In [None]:
# create a figure with a single axis
fig, ax = plt.subplots()

In [None]:
type(fig)

In [None]:
type(ax)

In [None]:
# create a figure with 4 axes and choose its size
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))

In [None]:
type(axes)
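
With `nrows=2, ncols=2`, `plt.subplots` returns the axes as a 2-D NumPy array, so a single axis can be reached by row and column index, or the whole array can be flattened for looping. A minimal sketch (the titles and labels are placeholders, not part of the lesson's data):

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))

# axes is a 2x2 array of Axes objects: index it by [row, column]
axes[0, 1].set_title('top right')
axes[1, 0].set_title('bottom left')

# flatten() returns a 1-D array (row by row) that is convenient to loop over
for i, ax in enumerate(axes.flatten()):
    ax.set_xlabel(f'axis {i}')
```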

The anatomy of mpl: https://matplotlib.org/_images/anatomy.png

## 2. Bar charts

In [None]:
# load some data
df = pd.read_csv('data/tracts.csv')
df.shape

In [None]:
# what variables are present?
df.columns

In [None]:
# top 10 cities by number of tracts
cities = df['place_name'].value_counts().head(10)
cities

In [None]:
# default, simple matplotlib bar chart via pandas's dataframe.plot() method
ax = cities.plot(kind='bar')

In [None]:
# change the default font size
plt.rcParams['font.size'] = 12
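
Assigning to `plt.rcParams` changes the default for every plot drawn afterwards in the session. If a setting should apply to one figure only, `plt.rc_context` can scope it; the sketch below assumes an arbitrary font size of 16 purely for illustration:

```python
import matplotlib.pyplot as plt

original_size = plt.rcParams['font.size']

# settings passed to rc_context apply only inside the with-block
with plt.rc_context({'font.size': 16}):
    fig, ax = plt.subplots()
    ax.set_title('temporary font size')
    size_inside = plt.rcParams['font.size']

# on exit, the previous value is restored
size_after = plt.rcParams['font.size']
```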

In [None]:
# style the plot to make it look nicer
ax = cities.plot(kind='bar', figsize=(8, 6), width=0.6, alpha=0.6, 
                 color='g', edgecolor='k', zorder=2)

# add a dotted-line grid for the y-axis only
ax.yaxis.grid(True, ls=':')

# rotate the x-labels 45 degrees
ax.set_xticklabels(cities.index, rotation=45, rotation_mode='anchor', ha='right')

ax.set_title('Cities with the most tracts')
ax.set_ylabel('Number of tracts')

plt.show()

In [None]:
# same plot, but built directly with the mpl object-oriented API instead of via pandas
fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(x=cities.index, height=cities, width=0.6, alpha=0.6,
       color='g', edgecolor='k', zorder=2)

ax.yaxis.grid(True, ls=':')
ax.set_xticklabels(cities.index, rotation=45, rotation_mode='anchor', ha='right')

ax.set_title('Cities with the most tracts')
ax.set_ylabel('Number of tracts')

plt.show()
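
If the tick labels are already correct and only their orientation needs to change, one option is to restyle the labels matplotlib has already created rather than pass them in again. A minimal sketch with made-up counts (the city names and numbers are placeholders, not values from `tracts.csv`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# placeholder data standing in for the top-cities series
counts = pd.Series({'City A': 30, 'City B': 20, 'City C': 10})

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(x=counts.index, height=counts, width=0.6, alpha=0.6,
       color='g', edgecolor='k', zorder=2)

# restyle the existing tick labels instead of re-specifying their text
plt.setp(ax.get_xticklabels(), rotation=45, rotation_mode='anchor', ha='right')
```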

In [None]:
# now it's your turn
# recreate the plot above, but give it an x-axis label and make the bars orange with maroon edges


In [None]:
# plot the natural log of the counts
fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(x=cities.index, height=np.log(cities), width=0.6, alpha=0.6,
       color='g', edgecolor='k', zorder=2)

ax.yaxis.grid(True, ls=':')
ax.set_ylim((0, 8))

ax.set_xticklabels(cities.index, rotation=45, rotation_mode='anchor', ha='right')
ax.set_title('Cities with the most tracts')
ax.set_ylabel('Number of tracts (log)')

plt.show()
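
Taking `np.log` of the counts puts the y-axis in natural-log units. An alternative sketch leaves the data untouched and log-scales the axis itself (base 10 by default), so the tick labels stay in the original units. The counts below are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# placeholder data spanning several orders of magnitude
counts = pd.Series({'City A': 1000, 'City B': 100, 'City C': 10})

fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(x=counts.index, height=counts, width=0.6, alpha=0.6,
       color='g', edgecolor='k', zorder=2)

# log-scale the y-axis instead of transforming the data
ax.set_yscale('log')
ax.yaxis.grid(True, ls=':')
ax.set_ylabel('Number of tracts (log scale)')
```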

In [None]:
# now it's your turn
# plot a bar chart of the top 10 cities by average tract median income (hint: use pandas's dataframe.groupby() method)


## 3. Histograms and KDE

In [None]:
# default histogram, via pandas
ax = df['median_age'].hist()

In [None]:
# you can style your plot from pandas more nicely
ax = df['median_age'].hist(bins=50, edgecolor='w', alpha=0.8, zorder=2)
ax.grid(ls=':')

# rather than setting an axis range, you can set a single limit
ax.set_xlim(left=0)
ax.set_ylim(top=1300)

ax.set_title('Tract median age histogram')
plt.show()
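
Passing only `left` (or only `top`) leaves the opposite limit at whatever matplotlib had already chosen. A small sketch with randomly generated placeholder values shows this by comparing the limits before and after:

```python
import matplotlib.pyplot as plt
import numpy as np

# placeholder values standing in for a column such as median age
rng = np.random.default_rng(0)
values = rng.normal(loc=40, scale=8, size=1000)

fig, ax = plt.subplots()
ax.hist(values, bins=50, edgecolor='w', alpha=0.8, zorder=2)

# record the automatically chosen limits, then override only the left one
auto_left, auto_right = ax.get_xlim()
ax.set_xlim(left=0)
new_left, new_right = ax.get_xlim()
```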

In [None]:
# plot a simple kde function
ax = df['median_age'].plot.kde()

In [None]:
# make the KDE look nicer
ax = df['median_age'].plot.kde(lw=4, alpha=0.6, bw_method=0.2)
ax.grid(ls=':')

ax.set_xlim((0, 80))
ax.set_ylim(bottom=0)

ax.set_title('Tract median age probability density')
plt.show()

In [None]:
# plot the histogram and KDE together
fig, ax = plt.subplots(figsize=(8, 6))
ax = df['median_age'].hist(ax=ax, bins=50, edgecolor='w', alpha=0.8, zorder=2)
ax = df['median_age'].plot.kde(ax=ax, lw=2, secondary_y=True, alpha=0.8)

ax.grid(ls=':')
ax.set_xlim((0, 75))
ax.set_ylim(bottom=0)

ax.set_title('Tract median age')
plt.show()
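
The cell above needs a secondary y-axis because the histogram shows counts while the KDE shows probability density, and the two are on very different scales. One alternative, sketched here with randomly generated placeholder values, is to normalize the histogram itself to a density:

```python
import matplotlib.pyplot as plt
import numpy as np

# placeholder values standing in for a column such as median age
rng = np.random.default_rng(0)
values = rng.normal(loc=40, scale=8, size=1000)

fig, ax = plt.subplots(figsize=(8, 6))

# density=True rescales the bar heights so the total bar area equals 1
heights, edges, patches = ax.hist(values, bins=50, density=True,
                                  edgecolor='w', alpha=0.8, zorder=2)
total_area = (heights * np.diff(edges)).sum()

ax.grid(ls=':')
ax.set_ylabel('Probability density')
```

A KDE drawn on the same `ax` without `secondary_y` would then share the y-axis with the bars.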

In [None]:
# plot histograms of 4 separate variables as subplots of a single mpl figure
cols = ['median_age', 'med_income_k', 'median_gross_rent_k', 'med_home_value_k']
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

# zip together the axes and the columns to plot them
for ax, col in zip(axes.flatten(), cols):
    df[col].hist(ax=ax, bins=50, alpha=0.8, zorder=2)
    ax.grid(ls=':')
    ax.set_xlim(left=0)
    ax.set_title(col)

# add a super title to the figure
fig.suptitle('Histograms of tract-level variables', y=0.95, fontsize=16, weight='bold')
plt.show()

In [None]:
# now it's your turn
# identify 2 additional variables in the dataframe to plot 6 histograms along with their KDEs, in a single figure


In [None]:
# compare white vs black majority tracts
# first plot kde of majority white tracts' median income
white_income = df[df['prop_white'] > 0.5]['med_income_k']
ax = white_income.plot.kde(c='k', lw=2, ls='--', alpha=0.8, bw_method=1, label='majority white')

# next plot kde of majority black tracts' median income
black_income = df[df['prop_black'] > 0.5]['med_income_k']
ax = black_income.plot.kde(c='k', lw=2, alpha=0.8, bw_method=1, label='majority black')

ax.grid(ls=':')
ax.set_xlim((-30, 200))
ax.set_ylim(bottom=0)

ax.set_title('White vs Black Census Tracts')
ax.set_xlabel('Median income (USD, thousands)')
ax.set_ylabel('Probability density')

ax.legend()
plt.show()

In [None]:
# now it's your turn
# plot contrasting KDEs comparing median home values in tracts with majority college degree or higher, vs not


## 4. Time series and line plots

In [None]:
# GPS coordinates
dt = pd.read_csv('data/gps-coords.csv', index_col='date', parse_dates=True)

In [None]:
# processing same as last week
weekend_mask = (dt.index.weekday == 5) | (dt.index.weekday == 6)
weekends = dt[weekend_mask]
weekdays = dt[~weekend_mask]
weekday_hourly_share = weekdays.groupby(weekdays.index.hour).size() / weekdays.groupby(weekdays.index.hour).size().sum()
weekend_hourly_share = weekends.groupby(weekends.index.hour).size() / weekends.groupby(weekends.index.hour).size().sum()
hourly_share = pd.DataFrame([weekday_hourly_share, weekend_hourly_share], index=['weekday', 'weekend']).T
hourly_share.index = [s + ':00' for s in hourly_share.index.astype(str)]
hourly_share.head()
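
The processing above splits the records with a boolean weekend mask, counts records per hour of day, and divides by the total so that each column sums to 1. The same idea can be sketched on synthetic timestamps; the `sample` frame and the `hourly_share_of` helper are hypothetical stand-ins, not the course's GPS data or code:

```python
import numpy as np
import pandas as pd

# synthetic timestamps: keep a random half of the hours in a 4-week span
rng = np.random.default_rng(0)
idx = pd.date_range('2014-05-01', periods=24 * 28, freq='h')
sample = pd.DataFrame({'obs': 1}, index=idx[rng.random(len(idx)) < 0.5])

def hourly_share_of(frame):
    """Return each hour's share of the records in frame."""
    counts = frame.groupby(frame.index.hour).size()
    return counts / counts.sum()

# weekday is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
is_weekend = sample.index.weekday >= 5
shares = pd.DataFrame({'weekday': hourly_share_of(sample[~is_weekend]),
                       'weekend': hourly_share_of(sample[is_weekend])})
```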

In [None]:
# weekday vs weekend hourly observations as a bar chart
ax = hourly_share.plot(figsize=(10, 6), kind='bar', alpha=0.7, 
                       title='Share of observations, by hour')

In [None]:
# stacked bar chart
ax = hourly_share.plot(figsize=(10, 6), kind='bar', stacked=True, 
                       alpha=0.7, title='Share of observations, by hour')

In [None]:
ax = hourly_share.plot(figsize=(10, 6), kind='bar', stacked=False, width=0.5,
                       alpha=0.7, color=['#336699', '#ff3366'], edgecolor='k')

ax.yaxis.grid(True, ls=':')
ax.set_xticklabels(hourly_share.index, rotation=60, rotation_mode='anchor', ha='right')
ax.set_title('Share of observations, by hour')

plt.show()

In [None]:
# get the count of records by date
countdata = dt.groupby(dt.index.date).size()
countdata.head()

In [None]:
# simple line plot via pandas
ax = countdata.plot(kind='line', figsize=(10, 6))

In [None]:
# better-styled line plot
ax = countdata.plot(kind='line', figsize=(10, 6), lw=2, c='m', alpha=0.7,
                    marker='o', markerfacecolor='w', markeredgewidth=1.5)

ax.set_ylim(bottom=0, top=70)

# only show ticks for the 1st and 15th of the month
mask = (dt.index.day == 1) | (dt.index.day == 15)
matching_dates = dt.index.date[mask]
xticks = np.unique(matching_dates)
plt.xticks(ticks=xticks, rotation=45, rotation_mode='anchor', ha='right')

ax.grid(ls=':')
plt.show()
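
Computing the tick positions by hand works, but when the x-axis holds real datetimes, matplotlib's date locators can place the ticks instead. A sketch using `matplotlib.dates.DayLocator` on placeholder daily counts (randomly generated, not the GPS data):

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# placeholder daily counts over three months
dates = pd.date_range('2014-05-01', '2014-07-31', freq='D')
rng = np.random.default_rng(0)
counts = rng.integers(10, 60, size=len(dates))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(dates.to_pydatetime(), counts, lw=2, c='m', alpha=0.7)

# let a date locator place ticks on the 1st and 15th of each month
ax.xaxis.set_major_locator(mdates.DayLocator(bymonthday=[1, 15]))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))

# rotate and right-align the date labels
fig.autofmt_xdate(rotation=45, ha='right')
ax.grid(ls=':')
```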

In [None]:
# now it's your turn
# recreate the plot above, but make it a dashed red line with xticks at every day evenly divisible by 5


## 5. Scatterplots

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x=df['med_income_k'], y=df['med_home_value_k'], s=0.5, alpha=0.5)
ax.set_xlabel('Median Income (USD, thousands)')
ax.set_ylabel('Median Home Value (USD, thousands)')
plt.show()

In [None]:
def scatter_plot(df, xcol, ycol):
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(x=df[xcol], y=df[ycol], s=0.5, alpha=0.5)
    ax.set_xlabel(xcol)
    ax.set_ylabel(ycol)
    plt.show()

In [None]:
# does distance to center co-vary with commute time?
scatter_plot(df, 'distance_to_center_km', 'mean_travel_time_work')

In [None]:
# does the student population proportion co-vary with proportion renting?
scatter_plot(df, 'prop_college_grad_student', 'prop_renting')

In [None]:
# compare home values vs income in majority white and minority white tracts
fig, ax = plt.subplots(figsize=(10, 10))

# first scatter minority white tracts, then majority white tracts
mask = df['prop_white'] > 0.5
ax.scatter(x=df[~mask]['med_income_k'], y=df[~mask]['med_home_value_k'],
           s=10, alpha=0.5, marker='o', c='none', edgecolor='r', label='minority white')
ax.scatter(x=df[mask]['med_income_k'], y=df[mask]['med_home_value_k'],
           s=10, alpha=0.5, marker='o', c='none', edgecolor='k', label='majority white')

# set axis limits
ax.set_ylim((0, 1000))
ax.set_xlim((0, 175))

# add labels
ax.set_xlabel('Median Income (USD, thousands)', fontsize=16)
ax.set_ylabel('Median Home Value (USD, thousands)', fontsize=16)

# add legend, show plot
ax.legend()
plt.show()
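
When more than two groups need to be compared, repeating the `ax.scatter` call for each one gets tedious. One possible pattern is to list each group's label, subset, and color once and loop over them. The sketch below uses synthetic data; the `demo` frame only borrows the column names used above, and its values are random:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic placeholder data, not the tracts dataset
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({'prop_white': rng.random(n),
                     'med_income_k': rng.uniform(20, 150, n)})
demo['med_home_value_k'] = 4 * demo['med_income_k'] + rng.normal(0, 50, n)

# one boolean mask defines both groups: rows where it is False, then True
mask = demo['prop_white'] > 0.5
groups = [('minority white', demo[~mask], 'r'),
          ('majority white', demo[mask], 'k')]

fig, ax = plt.subplots(figsize=(10, 10))
for label, subset, color in groups:
    # hollow markers: no face color, colored edge
    ax.scatter(x=subset['med_income_k'], y=subset['med_home_value_k'],
               s=10, alpha=0.5, marker='o', facecolors='none',
               edgecolors=color, label=label)
ax.legend()
```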

In [None]:
# now it's your turn
# scatterplot majority-hispanic tracts median income vs median rent in blue and majority-white tracts median income vs median rent in orange


## In-class exercise

Many other Python visualization libraries build on the matplotlib functionality we have learned today. For example, seaborn abstracts much of the nitty-gritty matplotlib work to create simple plots of data sets. Once you've learned the underlying matplotlib code, it's easy to play around with other visualization libraries.

1. Required: choose two topics from the Seaborn tutorial (https://seaborn.pydata.org/tutorial.html), and work through them adding your code to this notebook, below.
1. Select a data set for the assignment due next week
1. Begin working on the assignment (instructions on GitHub)

In [None]:
# work through the seaborn tutorial in this cell and below
