# Introduction to Data Visualization with Matplotlib

These are my notes for DataCamp's course [_Introduction to Data Visualization with Matplotlib_](https://www.datacamp.com/courses/introduction-to-data-visualization-with-matplotlib).

This course is presented by Ariel Rokem, Senior Data Scientist, University of Washington. Collaborators are Chester Ismay and Amy Peterson.

Prerequisite:

- [_Introduction to Python_](../Introduction%20to%20Python/Introduction%20to%20Python.ipynb)

This course is part of these tracks:

- Data Scientist with Python
- Data Scientist Professional with Python
- Data Visualization with Python

## Data Sets

The data sets were downloaded from datacamp.com into the directory containing this Jupyter notebook.

| Name | File |
| :--- | :--- |
| Seattle weather | seattle_weather.csv |
| Austin weather | austin_weather.csv |
| Climate data | climate_change.csv |
| Medals by country | medals_by_country_2016.csv |
| Medalist weights | summer2016.csv |

## Resources

Matplotlib cheat sheets:
- https://matplotlib.org/cheatsheets/
- DataCamp's "matplotlib Cheat Sheet.pdf" file in the project folder.

Matplotlib styles:
- https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

Color picker:
- https://www.colorhexa.com/

SQL to pandas converter:
- https://sql2pandas.pythonanywhere.com/

## Imports

For convenience and clarity, all imports are gathered here.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("dark_background")

## Introduction to Matplotlib

### Introduction to Data Visualization with Matplotlib

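This course uses Matplotlib's object-oriented interface: `plt.subplots()` from the `pyplot` submodule creates `Figure` and `Axes` objects, and plots are built by calling methods on those objects.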
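
As a minimal sketch of what "object-oriented" means here, the following contrasts the two ways of calling Matplotlib. The `x` and `y` values are made up for illustration and are not from the course.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]  # illustrative values only
y = [0, 1, 4, 9]

# pyplot (state-based) interface: functions act on the "current" axes.
plt.figure()
plt.plot(x, y)
plt.xlabel("x")

# Object-oriented interface: methods are called on explicit objects.
fig, ax = plt.subplots()
lines = ax.plot(x, y)
ax.set_xlabel("x")
# Call plt.show() to display the figures outside a notebook.
```

The rest of these notes use the second form, so every plot starts with `fig, ax = plt.subplots()`.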

#### Load Seattle Weather Data (Extra)

In [None]:
# Load the Seattle weather data and look at it.
seattle_weather_all = pd.read_csv("seattle_weather.csv")
print(seattle_weather_all.info())
print()
seattle_weather_all.head()

# The seattle_weather.csv file contains rows from many weather stations, and
# it lacks the "MONTH" column. We are interested in the 12 rows where
# NAME == "SEATTLE SAND PT WSFO, WA US".
# Modify the DataFrame obtained from the file to add a MONTH column and to
# keep only the 12 rows of interest.
# See "Intermediate Python" and "Data Manipulation with pandas" for simple
# examples of subsetting a DataFrame.

# Create a "MONTH" column from the "DATE" column, using this lookup
# dictionary.
months = {1: "Jan", 2: "Feb", 3: "Mar", 4: "Apr", 5: "May", 6: "Jun",
          7: "Jul", 8: "Aug", 9: "Sep", 10: "Oct", 11: "Nov", 12: "Dec"}
seattle_weather_all["MONTH"] = [months[x] for x in seattle_weather_all["DATE"]]

# Get the rows representing the station of interest.
seattle_weather = seattle_weather_all[
    seattle_weather_all["NAME"] == "SEATTLE SAND PT WSFO, WA US"].copy()

# Display the data of interest in an IPython-generated table.
seattle_weather[["MONTH", "MLY-TAVG-NORMAL", "MLY-PRCP-NORMAL"]]

#### Create Empty Subplots (Demonstration)

In [None]:
# Create empty subplots; this returns a figure and axes.
# The figure object is a container that holds everything you see on the page.
# The axes object is the part of the page that holds the data. It is the canvas
# on which we will draw our data to visualize it.
# Set the size of the figure in inches.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
print("type(fig) =", type(fig))
print("type(ax) =", type(ax))
plt.show()

#### Load Austin Weather Data (Extra)

In [None]:
# Read the Austin weather data. The data file has only 12 rows, so we don't 
# need to subset by "NAME". Add a MONTH column and display the columns of
# interest.
austin_weather = pd.read_csv("austin_weather.csv")
austin_weather["MONTH"] = [months[x] for x in austin_weather["DATE"]]
austin_weather[["MONTH", "MLY-TAVG-NORMAL", "MLY-PRCP-NORMAL"]]

#### Plot Temperatures for Seattle and Austin (Demonstration)

In [None]:
# Plot the data for Seattle and Austin.
# I enhanced this plot to add the title, labels, and legend.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
_ = ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"], label="Seattle")
_ = ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"], label="Austin")
_ = ax.set_xlabel("Month")
_ = ax.set_ylabel("Degrees F")
_ = ax.set_title("Average Monthly Temperatures")
_ = ax.legend()
plt.show()

#### Plot Seattle and Austin Monthly Precipitation (Exercise)

In [None]:
# Plot Seattle and Austin average monthly rainfall.
# I have already customized the plot with labels, a legend, and title.
fig, ax = plt.subplots()
_ = fig.set_size_inches((12, 9))
_ = ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], label="Seattle")
_ = ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], label="Austin")
_ = ax.set_xlabel("Month")
_ = ax.set_ylabel("Precipitation (inches)")
_ = ax.set_yticks(range(7))
_ = ax.set_title("Average Monthly Precipitation")
_ = ax.legend()
plt.show()

### Customizing Your Plots

Markers are documented at https://matplotlib.org/stable/api/markers_api.html.

Line properties are documented at https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html. The available line styles are:
- ":" (dotted)
- "-" (solid)
- "--" (dashed)
- "-." (dashdot)
- "", " ", "none", or "None" (no line)

Colors are documented at https://matplotlib.org/stable/tutorials/colors/colors.html#sphx-glr-tutorials-colors-colors-py.
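
The following sketch combines a marker, a line style, and a color in single `ax.plot` calls. The data values and the particular style combinations are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]  # illustrative values only

fig, ax = plt.subplots()
# Each tuple is (marker, linestyle, color); colors may be names,
# single-letter codes, or hex strings.
styles = [("o", "-", "tab:blue"), ("v", "--", "r"), ("x", ":", "#cd7f32")]
for offset, (marker, linestyle, color) in enumerate(styles):
    y = [value + offset for value in x]
    ax.plot(x, y, marker=marker, linestyle=linestyle, color=color,
            label=f"marker={marker!r}, linestyle={linestyle!r}")
ax.legend()
# Call plt.show() to display the figure outside a notebook.
```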

#### Plot Seattle and Austin Precipitation with Enhancements (Exercise)

In [None]:
# Add markers, set line styles and colors, and add axis labels and a title.
# I also added data labels and a legend.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"],
        marker="v", linestyle="--", color="b", label="Seattle")
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], 
        marker="x", linestyle="--", color="r", label="Austin")
ax.set_xlabel("Time (months)")
ax.set_yticks(range(7))
ax.set_ylabel("Precipitation (inches)")
ax.legend()
ax.set_title("Average Monthly Precipitation")
plt.show()

### Small Multiples

When a single plot becomes too busy, it is better to use "small multiples":
several small plots that show similar data for different conditions.
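
A minimal sketch of the idea, assuming three made-up conditions (`A`, `B`, `C`) with synthetic values, places one condition in each subplot and shares the y axis so the panels are directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(12)
# Hypothetical series for three made-up conditions.
conditions = {"A": x, "B": x ** 1.5, "C": 20 - x}

fig, axes = plt.subplots(1, len(conditions), sharey=True)
fig.set_size_inches((12, 4))
for ax, (name, y) in zip(axes, conditions.items()):
    ax.plot(x, y)
    ax.set_title(f"Condition {name}")
axes[0].set_ylabel("Value")
# Call plt.show() to display the figure outside a notebook.
```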

#### Make Multiple Line Plots (Demonstration)

In [None]:
# Show average precipitation in Seattle, with the 25th and 75th percentiles
# below and above, using dashed lines.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color="b", linestyle="-")
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"], color="b", linestyle="--")
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"], color="b", linestyle="--")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation (inches)")
plt.show()

# Add the data for Austin to the plot.
# The plot is too busy.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color="b", linestyle="-")
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"], color="b", linestyle="--")
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"], color="b", linestyle="--")
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], color="r", linestyle="-")
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"], color="r", linestyle="--")
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"], color="r", linestyle="--")
ax.set_xlabel("Time (months)")
ax.set_ylabel("Precipitation (inches)")
plt.show()

#### Create Multiple Subplots (Demonstration)

In [None]:
# Create "small multiples" using plt.subplots(rows, cols).
fig, ax = plt.subplots(3, 2) # 3 rows, 2 columns
fig.set_size_inches((12, 9))
print(ax.shape)
# How to plot in one of the subplots.
ax[0, 0].plot(np.arange(10), np.arange(10))
plt.show()

#### Create Multiple Subplots in a Single Column (Extra)

In [None]:
# Example subplots in a single column. A 1-D subplot array *must* be indexed
# with a single value.
fig, ax = plt.subplots(2, 1)
fig.set_size_inches((12, 9))
print(ax.shape)
ax[0].plot(np.arange(11), np.arange(10, -1, -1))
ax[1].plot(np.arange(11), np.arange(11))
plt.show()
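
If the difference between 1-D and 2-D indexing is inconvenient, `plt.subplots` accepts `squeeze=False`, which always returns a 2-D array of axes. This is a small sketch of that option, not something the course used.

```python
import matplotlib.pyplot as plt

# Default: a single column of subplots is returned as a 1-D array.
fig1, ax1 = plt.subplots(2, 1)
print(ax1.shape)  # (2,)

# squeeze=False keeps both dimensions, so ax2[row, col] always works.
fig2, ax2 = plt.subplots(2, 1, squeeze=False)
print(ax2.shape)  # (2, 1)
ax2[1, 0].plot([0, 1, 2], [2, 1, 0])
```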

#### Plot Rainfall Data Using Subplots (Demonstration)

In [None]:
# Plot the rainfall data using subplots.
# Note the following problems:
#   y axis differences
#   title of second subplot overlaps tick labels of x axis of first plot
#   label for x axis of the top subplot is not visible when the figure
#     is small.
fig, ax = plt.subplots(2, 1)
fig.set_size_inches((12, 9))
# Plot Seattle data in the upper subplot.
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], color="b", linestyle="-")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"], color="b", linestyle="--")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"], color="b", linestyle="--")
ax[0].set_xlabel("Time (months)")
ax[0].set_ylabel("Precipitation (inches)")
ax[0].set_title("Average Monthly Precipitation in Seattle")
# Plot Austin data in the lower subplot.
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"], color="r", linestyle="-")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"], color="r", linestyle="--")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"], color="r", linestyle="--")
ax[1].set_xlabel("Time (months)")
ax[1].set_ylabel("Precipitation (inches)")
ax[1].set_title("Average Monthly Precipitation in Austin")
plt.show()

#### Plot Rainfall Data Using Subplots with Enhancements (Demonstration)

In [None]:
# Plot the rainfall data using subplots.
# Make the y axis ticks the same using sharey=True.
# Use one label for the x axis since they are identical.
# Do not add titles for now.
fig, ax = plt.subplots(2, 1, sharey=True)
fig.set_size_inches((12, 9))
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], 
           color="b", linestyle="-", label="Seattle")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"],
           color="b", linestyle="--")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"],
           color="b", linestyle="--")
# ax[0].set_xlabel("Time (months)")
ax[0].set_ylabel("Precipitation (inches)")
ax[0].set_title("Average Monthly Precipitation in Seattle")
ax[0].legend()

ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"],
           color="r", linestyle="-", label="Austin")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"],
           color="r", linestyle="--")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"],
           color="r", linestyle="--")
ax[1].set_xlabel("Time (months)")
ax[1].set_ylabel("Precipitation (inches)")
ax[1].set_title("Average Monthly Precipitation in Austin")
ax[1].legend()
plt.show()

#### Plot Precipitation and Temperature Data for Seattle and Austin in Subplots (Exercise)

In [None]:
# Plot precipitation and temperature for Seattle and Austin.
fig, ax = plt.subplots(2, 2)
fig.set_size_inches((12, 9))
ax[0, 0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax[1, 0].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])
ax[1, 1].plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()

#### Plot Precipitation Data, Sharing the y Axis (Exercise)

In [None]:
# Plot precipitation, sharing the y axis.
# I have already done this exercise above, so here's a repeat.
fig, ax = plt.subplots(2, 1, sharey=True)
fig.set_size_inches((12, 9))
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"], 
           color="b", linestyle="-", label="Seattle")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-25PCTL"],
           color="b", linestyle="--")
ax[0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-75PCTL"],
           color="b", linestyle="--")
# ax[0].set_xlabel("Time (months)")
ax[0].set_ylabel("Precipitation (inches)")
# ax[0].set_title("Average Monthly Precipitation in Seattle")
ax[0].legend()

ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"],
           color="r", linestyle="-", label="Austin")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-25PCTL"],
           color="r", linestyle="--")
ax[1].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-75PCTL"],
           color="r", linestyle="--")
ax[1].set_xlabel("Time (months)")
ax[1].set_ylabel("Precipitation (inches)")
# ax[1].set_title("Average Monthly Precipitation in Austin")
ax[1].legend()
plt.show()

## Plotting Time-Series

### Plotting Time-Series Data

#### Load the Climate Change Data (Extra)

In [None]:
# Load the climate change data.
# Convert the index to datetime values.
# Note that pandas.DataFrame.info lists "DatetimeIndex: ..." for index
# information.
# The DataFrame is named climate_change_ex (climate change example) to
# keep it separate from the climate_change variable used in the exercises
# below.
climate_change_ex = pd.read_csv("climate_change.csv", index_col=0)
climate_change_ex.index = climate_change_ex.index.astype("datetime64[ns]")
print(climate_change_ex.info())

#### Load the Climate Change Data While Parsing Dates (Extra)

In [None]:
# It is possible to parse datetime columns immediately while reading the
# file using the parse_dates argument, given here as a list of column numbers
# (a list of column names also works).
# Show that the DataFrame obtained this way is equivalent to the DataFrame
# obtained above.
climate_change_ex2 = pd.read_csv("climate_change.csv", index_col=0, parse_dates=[0])
print(climate_change_ex2.info())
print()
print(climate_change_ex.equals(climate_change_ex2))

#### Look at the Climate Data (Demonstration)

In [None]:
# Look at some of the relative_temp values.
climate_change_ex["relative_temp"]

In [None]:
# Look at some of the CO2 values. There are some NaN values where the
# measurement is missing.
climate_change_ex["co2"]

In [None]:
climate_change_ex.head()

#### Plot the CO2 Data (Demonstration)

In [None]:
# Plot the CO2 data.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(climate_change_ex.index, climate_change_ex["co2"])
ax.set_xlabel("Time")
ax.set_ylabel("CO2 (ppm)")
plt.show()

#### Zoom in on One Decade of CO2 Data (Demonstration)

In [None]:
# Zoom in on a particular decade by slicing the data.
sixties = climate_change_ex["1960-01-01":"1969-12-31"]
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(sixties.index, sixties["co2"])
ax.set_xlabel("Time")
ax.set_ylabel("CO2 (ppm)")
plt.show()

#### Zoom in on One Year of CO2 Data (Demonstration)

In [None]:
# Zoom in on one year.
# Note that as we have changed scales, matplotlib has adjusted the
# ticks on the axes.
sixty_nine = climate_change_ex["1969-01-01":"1969-12-31"]
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(sixty_nine.index, sixty_nine["co2"])
ax.set_xlabel("Time")
ax.set_ylabel("CO2 (ppm)")
plt.show()
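
The slices above use full start and end dates. As a sketch of an alternative, pandas also supports partial-string indexing on a `DatetimeIndex`, so a whole year can be selected with `.loc["1969"]`. The DataFrame below is synthetic and only stands in for the climate data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the climate data.
index = pd.date_range("1968-01-01", "1970-12-01", freq="MS")
df = pd.DataFrame({"co2": np.linspace(322.0, 326.0, len(index))}, index=index)

explicit = df["1969-01-01":"1969-12-31"]  # slice with full dates, as above
partial = df.loc["1969"]                   # partial-string indexing by year

print(len(partial))              # 12
print(explicit.equals(partial))  # True
```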

#### Load the Climate Data (Exercise)

In [None]:
# Read the climate_change.csv file, parsing the dates in column 0.
# Show that the DataFrame is equivalent to what we obtained above.
climate_change = pd.read_csv("climate_change.csv", index_col=0, parse_dates=[0])
print(climate_change.info())
print()
print(climate_change.equals(climate_change_ex))

#### Plot the Time Series Data for Relative Temperature (Exercise)

In [None]:
# Plot the time-series data for relative_temp.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(climate_change.index, climate_change["relative_temp"])
ax.set_xlabel("Time")
ax.set_ylabel("Relative temperature (Celsius)")
plt.show()

#### Zoom in on One Decade of CO2 Data (Exercise)

In [None]:
# Zoom in to view the period from 1970-01-01 to 1979-12-31.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
seventies = climate_change[(climate_change.index >= "1970-01-01") & (climate_change.index <= "1979-12-31")]
ax.plot(seventies.index, seventies["co2"])
ax.set_xlabel("Time")
ax.set_ylabel("CO2 concentration (ppm)")
plt.show()

### Plotting Time-Series with Different Variables

#### Load the Climate Change Data, Parsing Dates (Extra)

In [None]:
# It is possible to specify the index column by its name and to pass a list of
# column names to the parse_dates argument. This was shown in the video.
# Show that the result is equivalent to using the other approaches for reading
# the data.
climate_change_ex3 = pd.read_csv("climate_change.csv", index_col="date", parse_dates=["date"])
print(climate_change_ex3.info())
print()
print(climate_change_ex.equals(climate_change_ex3))
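
For readers without the CSV file, the equivalence of the two `parse_dates` forms can be checked on a tiny in-memory file. The column names match the course file, but the rows below are made up.

```python
import io

import pandas as pd

# A tiny stand-in for climate_change.csv; the values are made up.
csv_text = (
    "date,co2,relative_temp\n"
    "1958-03-06,315.71,0.10\n"
    "1958-04-06,317.45,0.01\n"
)

by_number = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])
by_name = pd.read_csv(io.StringIO(csv_text), index_col="date",
                      parse_dates=["date"])

print(type(by_number.index).__name__)  # DatetimeIndex
print(by_number.equals(by_name))       # True
```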

#### Plot CO2 and Relative Temperature Data in the Same Plot (Demonstration)

In [None]:
# Plot both "co2" and "relative_temp" in the same plot.
# This does not look good.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(climate_change_ex3.index, climate_change_ex3["co2"])
ax.plot(climate_change_ex3.index, climate_change_ex3["relative_temp"])
ax.set_xlabel("Time")
ax.set_ylabel("CO2 (ppm) / Relative temperature (Celsius)")
plt.show()

#### Plot the CO2 and Relative Temperature Data Using Different y Axis Scales (Demonstration)

In [None]:
# The solution is to use two different y axis scales.
# I played around with setting alpha without much success.
# Color the y axis ticks and tick labels.
# Wow, that might be too much color.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.plot(climate_change_ex3.index, climate_change_ex3["co2"], color="blue")
ax.set_xlabel("Time")
ax.set_ylabel("CO2 (ppm)", color="blue")
ax.tick_params("y", colors="blue")
# Create a second Axes object, ax2, that shares the same x axis as ax.
ax2 = ax.twinx()
ax2.plot(climate_change_ex3.index, climate_change_ex3["relative_temp"], color="red")
ax2.set_ylabel("Relative temperature (Celsius)", color="red")
ax2.tick_params("y", colors="red")
plt.show()

#### Create a Function for Plotting Time Series Data (Demonstration)

In [None]:
# Create a function for this code.
def plot_timeseries(axes, x, y, color, xlabel, ylabel):
    axes.plot(x, y, color=color)
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.tick_params("y", colors=color)

# Call the function.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax2 = ax.twinx()
plot_timeseries(ax, climate_change.index, climate_change["co2"], 
                color="blue", xlabel="Time (years)", ylabel="CO2 (ppm)")
plot_timeseries(ax2, climate_change.index, climate_change["relative_temp"], 
                color="red", xlabel="Time (years)", ylabel="Relative temperature (Celsius)")
plt.show()

#### Repeat the Demonstration Code (Exercise)

The exercise repeated building the demonstration code above.

### Annotating Time-Series Data

#### Annotate a Time-Series Data Point in the Plot (Demonstration)

In [None]:
# The first day climate_change["relative_temp"] >= 1.0 was 2015-10-06.
# Annotate this on the plot.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
plot_timeseries(ax, climate_change.index, climate_change["co2"], 
                color="blue", xlabel="Time (years)", ylabel="CO2 (ppm)")
ax2 = ax.twinx()
plot_timeseries(ax2, climate_change.index, climate_change["relative_temp"], 
                color="red", xlabel="Time (years)", ylabel="Relative temperature (Celsius)")
# Add an annotation to the plot, where the xy parameter specifies coordinates
# in the plot. We have to set the x coordinate to the right object.
# The xytext argument moves the annotation to a better location.
# The annotation text doesn't indicate the data point it's associated with.
# Add an arrow to indicate the point.
# I moved the position of the annotation.
ax2.annotate(
    "> 1 degree",
    xy=(pd.Timestamp("2015-10-06"), 1.0), 
    xytext=(pd.Timestamp("2000-10-06"), 1.2),
    # xytext=(pd.Timestamp("2008-10-06"), -0.2),
    arrowprops = {"arrowstyle": "->", "color": "gray"})
plt.show()
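
The annotated date was typed in by hand above. One way it might be found programmatically is sketched below, using a made-up series in place of `climate_change["relative_temp"]`.

```python
import pandas as pd

# Made-up values; the real series is climate_change["relative_temp"].
temps = pd.Series(
    [0.7, 0.9, 1.1, 0.8, 1.2],
    index=pd.to_datetime(["2015-08-06", "2015-09-06", "2015-10-06",
                          "2015-11-06", "2015-12-06"]))

# Keep only the rows at or above the threshold, then take the first date.
first_date = temps[temps >= 1.0].index[0]
print(first_date.date())  # 2015-10-06
```

The resulting `Timestamp` can be passed directly as the x coordinate of the `xy` argument to `annotate`.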

#### Replicate the Demonstration Code (Exercise)

The exercise replicated the demonstration code above.

## Quantitative Comparisons and Statistical Visualizations

See also the notebook from the "Introduction to Data Science in Python" course, which presents many plotting examples.

### Bar Charts

#### Load the Olympic Medal Data (Extra)

In [None]:
# Load the Olympic medal data.
# The column containing country names has no name in the file because it is
# intended to be the index column. This code loads it as a data column and
# names it "Country".
medals = pd.read_csv("medals_by_country_2016.csv")
column_names = list(medals.columns)
column_names[0] = "Country"
# set_axis returns a new DataFrame; its inplace argument was removed in pandas 2.0.
medals = medals.set_axis(column_names, axis=1)
medals

#### Load the Data Using the Country Names as the Row Index (Extra)

In [None]:
# The course uses the country names as the index.
medals = pd.read_csv("medals_by_country_2016.csv", index_col=0)
medals

#### Create a Bar Chart of Gold Medals by Country (Demonstration)

In [None]:
# Create a bar chart of gold medals.
# Rotate the country names so they don't overlap.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.bar(medals.index, medals["Gold"], color="gold")
# Must call ax.set_xticks(medals.index) before calling ax.set_xticklabels()
# to avoid this warning:
#    UserWarning: FixedFormatter should only be used together with FixedLocator
ax.set_xticks(medals.index)
ax.set_xticklabels(medals.index, rotation=90)
ax.set_xlabel("Country")
ax.set_ylabel("Number of medals")
plt.show()
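
An alternative way to rotate the tick labels, sketched here with hypothetical country names and counts, is `ax.tick_params` with `labelrotation`. It changes only the rotation, so the ticks and labels do not need to be set again.

```python
import matplotlib.pyplot as plt

# Hypothetical counts, not the course's medal data.
countries = ["Country A", "Country B", "Country C"]
gold = [10, 7, 3]

fig, ax = plt.subplots()
bars = ax.bar(countries, gold, color="gold")
# Rotate the existing tick labels without resetting ticks or labels.
ax.tick_params(axis="x", labelrotation=90)
ax.set_ylabel("Number of medals")
# Call plt.show() to display the figure outside a notebook.
```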

#### Create a Stacked Bar Chart of Medals Won by Country (Demonstration)

In [None]:
# Create a stacked bar chart of medals.
# There is not a web color named "bronze".
# See this link for a color code for "bronze":
#   https://www.colorhexa.com/cd7f32
#   bronze: "#cd7f32"
# "xkcd:bronze": "#a87900" # see https://xkcd.com/color/rgb/
# Matplotlib can also use "xkcd:gold" and "xkcd:bronze".
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.bar(medals.index, medals["Gold"], label="Gold", color="gold")
ax.bar(medals.index, medals["Silver"], bottom=medals["Gold"],
       label="Silver", color="silver")
ax.bar(medals.index, medals["Bronze"],
       bottom=(medals["Gold"] + medals["Silver"]),
       label="Bronze", color="xkcd:bronze")
ax.set_xticks(medals.index)
ax.set_xticklabels(medals.index, rotation=90)
ax.set_xlabel("Country")
ax.set_ylabel("Number of medals")
ax.legend()
plt.show()
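
The `bottom` arguments above are written out by hand for each layer. A loop that keeps a running total is one way to generalize this; the sketch below uses a small hypothetical DataFrame rather than the course data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical medal counts for illustration only.
demo = pd.DataFrame(
    {"Gold": [5, 3], "Silver": [2, 4], "Bronze": [1, 6]},
    index=["Country A", "Country B"])
colors = {"Gold": "gold", "Silver": "silver", "Bronze": "xkcd:bronze"}

fig, ax = plt.subplots()
bottom = pd.Series(0, index=demo.index)
for column in demo.columns:
    ax.bar(demo.index, demo[column], bottom=bottom,
           label=column, color=colors[column])
    bottom = bottom + demo[column]  # the next layer starts on top of this one
ax.legend()
# Call plt.show() to display the figure outside a notebook.
```

After the loop, `bottom` holds the total bar height for each country.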

#### Replicate the Code Above (Exercise)

The exercise repeated the code above.

### Histograms

#### Read the Summer 2016 Olympics Data (Extra)

In [None]:
# Read the data from summer2016.csv.
athletes = pd.read_csv("summer2016.csv", index_col=0)
athletes.info()

#### Subset the Data for Rowing and Count the Medals by Sex (Extra)

In [None]:
# Subset the data for rowing and count the medals by sex.
rowing = athletes[athletes["Sport"] == "Rowing"].copy()
# Perform the equivalent of:
# SELECT Sex, COUNT(Sex) FROM rowing GROUP BY Sex;
print(rowing.groupby("Sex").size())

#### Create a DataFrame for Men's Rowing (Extra)

In [None]:
# Create a DataFrame for men's rowing (84 rows).
mens_rowing = athletes[(athletes["Sport"] == "Rowing") & (athletes["Sex"] == "M")]
print(mens_rowing.info())

#### Create a DataFrame for Men's Gymnastics (Extra)

In [None]:
# Create a DataFrame for men's gymnastics (36 rows).
mens_gymnastics = athletes[(athletes["Sport"] == "Gymnastics") & (athletes["Sex"] == "M")]
print(mens_gymnastics.info())

#### Create a Bar Chart of the Mean Heights for Men's Rowing and Gymnastics (Demonstration)

In [None]:
# Create a bar chart of the mean heights.
# This is deliberately the wrong approach.
# A box plot might be nice here.
# We could add error bars for the standard error of the mean.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.bar("Rowing", mens_rowing["Height"].mean())
ax.bar("Gymnastics", mens_gymnastics["Height"].mean())
plt.show()

#### Create Histograms of the Heights for Men's Rowing and Gymnastics (Demonstration)

In [None]:
# Create histograms.
# The default is 10 bins.
# Labels are needed so the legend can tell the histograms apart.
# Set alpha < 1.0 to allow seeing overlaps.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
bins = 10
ax.hist(mens_rowing["Height"], bins=bins, label="Rowing", alpha=0.7)
ax.hist(mens_gymnastics["Height"], bins=bins, label="Gymnastics", alpha=0.7)
ax.set_yticks(np.arange(0, 22, 5))
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of athletes")
ax.legend()
plt.show()

#### Create Histograms of the Same Data using Bin Boundaries (Demonstration)

In [None]:
# Create histograms using bin boundaries.
# When a sequence (a list or an array) is used as the bins argument, its
# values set the boundaries of the bins.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
bins = np.arange(148, 210, 4)
ax.hist(mens_rowing["Height"], bins=bins, label="Rowing", alpha=0.7)
ax.hist(mens_gymnastics["Height"], bins=bins, label="Gymnastics", alpha=0.7)
ax.set_yticks(np.arange(0, 22, 5))
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of athletes")
ax.legend()
plt.show()
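
To see how bin boundaries translate into counts, the same binning can be done without plotting using `np.histogram`, which `ax.hist` relies on. The heights below are made up, and the edges cover a shorter range than in the plot above.

```python
import numpy as np

heights = np.array([150, 151, 155, 160, 161, 162, 170])  # made-up heights (cm)
edges = np.arange(148, 176, 4)  # 148, 152, ..., 172

counts, returned_edges = np.histogram(heights, bins=edges)
print(edges)   # [148 152 156 160 164 168 172]
print(counts)  # [2 1 0 3 0 1]
```

Seven edges define six bins. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.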

#### Create a Stepped Histogram of the Same Data (Demonstration)

In [None]:
# Create a stepped histogram.
# This makes it easier to see overlaps of the histograms.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
bins = np.arange(148, 210, 4)
ax.hist(mens_rowing["Height"], bins=bins, label="Rowing", histtype="step")
ax.hist(mens_gymnastics["Height"], bins=bins, label="Gymnastics", histtype="step")
ax.set_yticks(np.arange(0, 22, 5))
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of athletes")
ax.legend()
plt.show()

#### Plot Histograms of the Weights for Men's Rowing and Gymnastics (Exercise)

In [None]:
# Plot histograms of the weights of participants in men's rowing
# and men's gymnastics.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
bins = np.arange(50, 115, 5)
ax.hist(mens_rowing["Weight"], bins=bins, label="Rowing", alpha=0.7)
ax.hist(mens_gymnastics["Weight"], bins=bins, label="Gymnastics", alpha=0.7)
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("Number of athletes")
ax.legend()
plt.show()

#### Plot Step Histograms of the Same Data (Exercise)

In [None]:
# Plot step histograms.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
bins = np.arange(50, 115, 5)
ax.hist(mens_rowing["Weight"], histtype="step", bins=bins, label="Rowing")
ax.hist(mens_gymnastics["Weight"], histtype="step", bins=bins, label="Gymnastics")
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
ax.legend()
plt.show()

### Statistical Plotting

#### Create a Bar Chart with Error Bars (Demonstration)

In [None]:
# Create a bar chart with error bars.
# I dislike these.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())
ax.bar("Gymnastics", mens_gymnastics["Height"].mean(), yerr=mens_gymnastics["Height"].std())
ax.set_ylabel("Height (cm)")
plt.show()
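
The error bars above show the standard deviation. An earlier comment suggested the standard error of the mean instead; a minimal sketch of that variant, using made-up heights and a hypothetical sport label, could look like this.

```python
import matplotlib.pyplot as plt
import pandas as pd

heights = pd.Series([180.0, 185.0, 190.0, 195.0])  # made-up heights (cm)

std = heights.std()  # sample standard deviation (ddof=1)
sem = heights.sem()  # standard error of the mean: std / sqrt(n)

fig, ax = plt.subplots()
ax.bar("Demo sport", heights.mean(), yerr=sem)
ax.set_ylabel("Height (cm)")
# Call plt.show() to display the figure outside a notebook.
```

The standard error shrinks as the number of athletes grows, so it describes uncertainty in the mean rather than the spread of individual heights.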

#### Create a Line Plot with Error Bars (Demonstration)

In [None]:
# Create a line plot with error bars using the errorbar method.
# Plot the weather data, which contains means and standard deviations.
# This uses the errorbar method, not the plot method.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.errorbar(
    seattle_weather["MONTH"],
    seattle_weather["MLY-TAVG-NORMAL"],
    yerr=seattle_weather["MLY-TAVG-STDDEV"],
    label="Seattle")
ax.errorbar(
    austin_weather["MONTH"],
    austin_weather["MLY-TAVG-NORMAL"],
    yerr=austin_weather["MLY-TAVG-STDDEV"],
    label="Austin")
ax.set_xlabel("Month")
ax.set_ylabel("Temperature (Fahrenheit)")
ax.legend()
plt.show()

#### Create Box Plots of Weights for Men's Rowing and Gymnastics (Demonstration)

From the documentation for the boxplot method: 
> The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box by 1.5x the inter-quartile range (IQR). Flier points are those past the end of the whiskers. See https://en.wikipedia.org/wiki/Box_plot for reference.

A legend is not necessary since the x axis ticks are labeled. In fact, calling `ax.legend()` or `plt.legend()` causes this warning:
> No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
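
The box and whisker rule quoted above can be worked through with NumPy. This is a sketch on made-up numbers, not the athlete data; the whiskers are drawn to the most extreme data points that still lie within 1.5 x IQR of the box.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])  # made-up values

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
fliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)              # 3.0 5.0 7.0
print(inside.min(), inside.max())  # 1 8
print(fliers)                      # [30]
```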

In [None]:
# Create box plots.
# Note: Matplotlib 3.9 renamed the labels parameter to tick_labels.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.boxplot(
    [mens_rowing["Height"], mens_gymnastics["Height"]],
    labels=["Rowing", "Gymnastics"])
ax.set_ylabel("Height (cm)")
ax.set_xlabel("Sport")
plt.show()

#### Replicate Creating a Bar Chart with Error Bars (Exercise)

This exercise replicates the video's demonstration using the
bar method. The code is above.

#### Replicate Creating an Error Bar Plot (Exercise)

This exercise replicates the video's demonstration using the
errorbar method. The code is above.

#### Replicate Creating Box Plots (Exercise)

This exercise replicates the video's demonstration using the
boxplot method. The code is above.

### Scatter Plots

Scatter plots are used for bivariate comparisons: they show how the values of two variables relate to each other.

#### Create a Scatter Plot of Temperature as a Function of CO2 Concentration (Demonstration)

In [None]:
# Create a scatter plot of CO2 vs. relative temperature.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.scatter(x=climate_change["co2"], y=climate_change["relative_temp"])
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (Celsius)")
plt.show()

#### Create a Scatter Plot with Enhancement for Subsets of Temperature and CO2 Data (Demonstration)

In [None]:
# Customize scatter plots by zooming in.
eighties = climate_change["1980-01-01":"1989-12-31"]
nineties = climate_change["1990-01-01":"1999-12-31"]
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.scatter(x=eighties["co2"], y=eighties["relative_temp"], color="red", label="1980s")
ax.scatter(x=nineties["co2"], y=nineties["relative_temp"], color="blue", label="1990s")
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (Celsius)")
ax.legend()
plt.show()

#### Plot Temperature and CO2 Data, Adding Color for the Time Dimension (Demonstration)

In [None]:
# Encoding a third variable by color.
# We can add a third variable, the values in the DataFrame index, which represents time.
# We add these using the c parameter.
# Time is indicated by darkness to brightness of the marker colors.
fig, ax = plt.subplots()
fig.set_size_inches((12, 9))
ax.scatter(x=climate_change["co2"], y=climate_change["relative_temp"], c=climate_change.index)
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (Celsius)")
plt.show()
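
When color encodes a third variable, a colorbar tells the reader what the colors mean. The sketch below uses synthetic data with a numeric `year` array in place of the DataFrame index; the numbers are invented.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data; "year" plays the role of the time dimension.
year = np.arange(1960, 2020)
co2 = 315 + 1.5 * (year - 1960)
temp = 0.01 * (year - 1960)

fig, ax = plt.subplots()
points = ax.scatter(co2, temp, c=year)
# fig.colorbar adds a second axes holding the color scale.
colorbar = fig.colorbar(points, ax=ax)
colorbar.set_label("Year")
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (Celsius)")
# Call plt.show() to display the figure outside a notebook.
```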

#### Replicate the Scatter Plots (Exercises)

The first exercise replicates the first demonstration from the video. See the code above.

The second exercise replicates the third demonstration from the video. See the code above.

## Sharing Visualizations with Others

### Preparing Your Figures to Share with Others

#### Change the Plot Style (Demonstration)

Styles allow setting the appearance of multiple figure elements simultaneously. The style applies to all figures created during the session.

See this page for the various styles available from Matplotlib: https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html

Return to the default style with this code:
```
plt.style.use("default")
```
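
To apply a style to a single figure without changing the session, `plt.style.context` can be used as a context manager. This is a sketch with made-up data; `plt.style.available` lists the style names installed with your Matplotlib version.

```python
import matplotlib.pyplot as plt

print("ggplot" in plt.style.available)

# The style applies only to figures created inside the with-block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8])

# Figures created here use whatever style was active before the block.
fig2, ax2 = plt.subplots()
```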

In [None]:
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (F)")
plt.show()

#### Use the `ggplot` Plot Style (Demonstration)

In [None]:
# Create the same figure using the ggplot plot style.
plt.style.use("ggplot")
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (F)")
plt.show()

#### Use the `bmh` Plot Style (Demonstration)

In [None]:
# Create the same figure using the bmh plot style.
plt.style.use("bmh")
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (F)")
plt.show()

#### Use the `seaborn-colorblind` Plot Style (Demonstration)

In [None]:
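# Create the same figure using the seaborn-colorblind plot style.
# Matplotlib 3.6 renamed the seaborn styles with a "seaborn-v0_8-" prefix.
# On older versions, use "seaborn-colorblind" instead.
plt.style.use("seaborn-v0_8-colorblind")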
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (F)")
plt.show()

#### Guidelines for Choosing Plotting Style

Consider the following guidelines when choosing a plotting style:
- Avoid dark backgrounds, which usually make the plot elements harder to see.
- If color is important, consider a colorblind-friendly style:
    - `seaborn-colorblind` (named `seaborn-v0_8-colorblind` since Matplotlib 3.6) or `tableau-colorblind10`
- If someone will print your figure, use less ink by avoiding colored backgrounds.
- If the figure will be printed in black and white, consider using the `grayscale` style.
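
Style names vary between Matplotlib versions, so it can help to check `plt.style.available` before committing to one. The sketch below picks the first colorblind-friendly style that the installed version provides. The `candidates` list and its ordering are my own choice, not something prescribed by the course.

```python
import matplotlib.pyplot as plt

# Candidate names, newest spelling first.
# Matplotlib 3.6 renamed the seaborn styles with a "seaborn-v0_8-" prefix.
candidates = [
    "seaborn-v0_8-colorblind",
    "seaborn-colorblind",
    "tableau-colorblind10",
]

# Use the first candidate that this Matplotlib installation knows about.
chosen = next(name for name in candidates if name in plt.style.available)
plt.style.use(chosen)
print(chosen)
```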

#### Set a Grayscale Style (Exercise)

In [None]:
# Create the same figure using the grayscale plot style.
plt.style.use("grayscale")
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
ax.set_xlabel("Time (months)")
ax.set_ylabel("Average temperature (F)")
plt.show()

#### Switching Between Styles (Exercise)

In [None]:
# Use the "ggplot" style.
plt.style.use("ggplot")
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
plt.show()

# Use the "Solarize_Light2" style.
plt.style.use("Solarize_Light2")
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
plt.show()

### Saving Your Visualizations

#### Save a Figure in Various Formats

For the Matplotlib documentation about saving a figure, see https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html.

See https://matplotlib.org/stable/gallery/ticks/ticklabels_rotation.html#sphx-glr-gallery-ticks-ticklabels-rotation-py for setting and rotating the x axis tick labels and for adjusting the bottom to avoid cutting off the names of the countries.

Consider saving as .jpg when file size matters (e.g., when sharing the images on a website). JPEG uses lossy compression, so the files are smaller at some cost in image quality.

For rasterized figures, use the `dpi` argument to control the figure's resolution.
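
For a raster format, the pixel dimensions of the saved file are the figure size in inches multiplied by `dpi`. The sketch below illustrates this with a small made-up bar chart saved to a temporary directory. The file names are hypothetical, and Pillow (which Matplotlib depends on) is used only to read back the image sizes.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image  # Pillow is installed alongside Matplotlib

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["A", "B", "C"], [3, 1, 2])  # made-up values

# Save the same figure at two resolutions.
out_dir = Path(tempfile.mkdtemp())
low_res = out_dir / "demo_100_dpi.png"
high_res = out_dir / "demo_300_dpi.png"
fig.savefig(low_res, dpi=100)
fig.savefig(high_res, dpi=300)

# Read back the pixel dimensions (width, height) of each file.
with Image.open(low_res) as img:
    low_size = img.size
with Image.open(high_res) as img:
    high_size = img.size
print(low_size, high_size)
```

Tripling `dpi` triples both pixel dimensions, so the high-resolution file holds nine times as many pixels.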

In [None]:
# Create a bar chart of gold medal data.
# In Jupyter Lab, the figure still appears in the notebook.
# Without adjusting the margins, the country names are cut off at the bottom.
# A single call to ax.set_xticks() will add the country names and rotate them.
plt.style.use("default")
fig, ax = plt.subplots()
fig.set_size_inches((10, 6))
ax.bar(medals.index, medals["Gold"])
ax.set_xticks(medals.index, medals.index, rotation="vertical")
ax.set_ylabel("Number of medals")
plt.subplots_adjust(bottom=0.25)
fig.savefig("gold_medals.png")
# The quality keyword was deprecated in Matplotlib 3.3 and later removed.
# JPEG options are now passed through pil_kwargs, for example:
# fig.savefig("gold_medals.jpg", pil_kwargs={"quality": 50})
fig.savefig("gold_medals.jpg")
# Save the PNG figure at high resolution.
fig.savefig("gold_medals_300_dpi.png", dpi=300)
# A vector graphics format.
fig.savefig("gold_medals.svg")
# Save a smaller figure.
fig.set_size_inches(5, 3)
fig.savefig("gold_medals_5_x_3.png")

#### Saving a Figure Several Times (Exercise)

This exercise replicates some of the code above.

#### Save a Figure with Different Sizes (Exercise)

This exercise replicates some of the code above, setting the size of the figure to 3 inches wide by 5 inches tall, then to 5 inches wide by 3 inches tall.

### Automating Figures from Data

Reasons to automate figure generation:
- Ease and speed
- Flexibility
- Robustness
- Reproducibility

#### Get the Unique Values from a Column of a DataFrame

In [None]:
# Consider the data in the summer2016.csv file, which we have loaded into
# the athletes DataFrame. Get a NumPy ndarray containing the unique names
# from the "Sport" column.
# There are 34 sports in the DataFrame.
sports = athletes["Sport"].unique().astype(str)
print(len(sports))

#### Create a Bar Chart of the Mean Height for All Sports (Demonstration)

In [None]:
# Create a bar chart of the mean height for all sports.
# At no point do we need to know how many different sports there are in
# the DataFrame.
fig, ax = plt.subplots()
for sport in sports:
    sport_df = athletes[athletes["Sport"] == sport]
    ax.bar(sport, sport_df["Height"].mean(), yerr=sport_df["Height"].std(), label=sport)
ax.set_ylabel("Height (cm)")
ax.set_xticks(sports, sports, rotation=90)
plt.show()
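
As an alternative to looping over the sports, pandas can compute the per-sport statistics in one step with `groupby`, and a single `ax.bar` call can then draw every bar. This is a sketch under stated assumptions: `athletes_demo` is a hypothetical stand-in for the `athletes` DataFrame, and its values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the athletes DataFrame (values are made up).
athletes_demo = pd.DataFrame(
    {
        "Sport": ["Rowing", "Rowing", "Gymnastics", "Gymnastics", "Judo", "Judo"],
        "Height": [190.0, 186.0, 160.0, 164.0, 175.0, 177.0],
    }
)

# One row per sport, with the mean and standard deviation of the heights.
# groupby sorts the sports alphabetically and skips rows with a missing sport.
height_stats = athletes_demo.groupby("Sport")["Height"].agg(["mean", "std"])

fig, ax = plt.subplots()
ax.bar(height_stats.index, height_stats["mean"], yerr=height_stats["std"])
ax.set_ylabel("Height (cm)")
ax.tick_params(axis="x", labelrotation=90)
```

As with the loop, the code never needs to know how many sports the data contains.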

#### Create Box Plots of the Heights for All Sports (Extra)

In [None]:
# From a lesson in _Statistical Thinking in Python (Part 1)_, use
# seaborn to create boxplots of the heights grouped by sport.
# I don't know how to sort the names on the x axis.
# Perhaps I need to sort the rows by the value of the "Sport" column.
# Sorting them ahead of time puts the names in a different order from
# that found by the sns.boxplot method.
fig, ax = plt.subplots()
sns.boxplot(x="Sport", y="Height", data=athletes, ax=ax)
ax.set_ylabel("Height (cm)")
ax.tick_params(axis='x', labelrotation=90)
plt.show()
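
One possible way to control the order of the names on the x axis is the `order` parameter of `sns.boxplot`, which takes the category names in the order they should be drawn. The sketch below sorts them alphabetically. `athletes_demo` is a hypothetical stand-in for the `athletes` DataFrame with made-up values.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the athletes DataFrame (values are made up).
# The rows are deliberately not in alphabetical order by sport.
athletes_demo = pd.DataFrame(
    {
        "Sport": ["Rowing", "Gymnastics", "Judo", "Rowing", "Judo", "Gymnastics"],
        "Height": [190.0, 160.0, 175.0, 186.0, 177.0, 164.0],
    }
)

# Alphabetical list of the sports, ignoring any missing values.
sport_order = sorted(athletes_demo["Sport"].dropna().unique())

fig, ax = plt.subplots()
sns.boxplot(x="Sport", y="Height", data=athletes_demo, order=sport_order, ax=ax)
ax.set_ylabel("Height (cm)")
ax.tick_params(axis="x", labelrotation=90)
```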

#### Create a Bar Chart of Mean Weights by Sport (Exercise)

In [None]:
# Create a bar chart of the mean weight for all sports.
fig, ax = plt.subplots()
for sport in sports:
    sport_df = athletes[athletes["Sport"] == sport]
    ax.bar(sport, sport_df["Weight"].mean(), yerr=sport_df["Weight"].std(), label=sport)
ax.set_ylabel("Weight (kg)")
ax.set_xticks(np.arange(len(sports)), sports, rotation=90)
plt.show()

#### Create Box Plots of Weights by Sports (Extra)

In [None]:
# From a lesson in _Statistical Thinking in Python (Part 1)_, use
# seaborn to create boxplots of the weights grouped by sport.
# I don't know how to sort the names on the x axis.
# Perhaps I need to sort the rows by the value of the "Sport" column.
# Sorting them ahead of time puts the names in a different order from
# that found by the sns.boxplot method.
fig, ax = plt.subplots()
sns.boxplot(x="Sport", y="Weight", data=athletes, ax=ax)
ax.set_ylabel("Weight (kg)")
ax.tick_params(axis='x', labelrotation=90)
plt.show()

#### Identify the Exceptionally Heavy Athletes (Extra)

In [None]:
# Two of the athletes were exceptionally heavy! Who were they?
print(athletes[athletes["Weight"] > 150])
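
The threshold of 150 above was chosen by reading the box plot. When the number of rows wanted is known instead, `DataFrame.nlargest` returns the heaviest rows without picking a threshold. In this sketch, `athletes_demo`, its column names, and its values are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the athletes DataFrame (values are made up).
athletes_demo = pd.DataFrame(
    {
        "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
        "Sport": ["Rowing", "Judo", "Rugby Sevens", "Weightlifting"],
        "Weight": [70.0, 160.0, 95.0, 170.0],
    }
)

# The two rows with the largest weights, heaviest first.
heaviest = athletes_demo.nlargest(2, "Weight")
print(heaviest[["Name", "Sport", "Weight"]])
```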