# Publishing with Python - Creating figures with data visualization libraries

## Setting up the Notebook

The first thing we'll do is import all the libraries and packages we need to function!

Be sure that all of these libraries and packages are installed in the Python environment in which you're working. If you are accessing this notebook via Binder, that environment will be pre-constructed for you. :)

In [None]:
import os   # This library is important if you pull the .ipynb file onto your local machine and need to change directories

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns # This is a West Wing reference apparently, but some import it as sb

import numpy as np
import pandas as pd
import geopandas as gpd

In [None]:
# This notebook has a few cells that point to data not built into the libraries (e.g. the airbnb .csv's)
# If you are using the link directly from the workshop to launch a binder, leave this cell commented out
# If you have downloaded the .ipynb file use this cell to change to your local directory
# IMPORTANT: in your local directory, download the data into a folder called 'data'

# os.chdir('C:\\Users\\fritzdi\\DataScience\\data\\pypub')  # add your own directory path to your data here!

## Diving into Data Visualization

### Working with Matplotlib

In [None]:
fig = plt.figure()  # an empty figure with no Axes
fig, ax = plt.subplots()  # a figure with a single Axes
fig, axs = plt.subplots(2, 2)  # a figure with a 2x2 grid of Axes
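
To make the difference between a Figure and its Axes more concrete, here is a minimal sketch of how the 2x2 grid from the cell above might be filled in. The four curves and their titles are arbitrary choices for illustration, not part of the workshop data.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

# plt.subplots(2, 2) returns the Figure plus a 2x2 NumPy array of Axes
fig, axs = plt.subplots(2, 2, figsize=(6, 4))

axs[0, 0].plot(x, np.sin(x))       # top-left
axs[0, 1].plot(x, np.cos(x))       # top-right
axs[1, 0].plot(x, np.sin(2 * x))   # bottom-left
axs[1, 1].plot(x, np.cos(2 * x))   # bottom-right

# axs.flat walks through all four Axes in row-major order
for ax, title in zip(axs.flat, ["sin(x)", "cos(x)", "sin(2x)", "cos(2x)"]):
    ax.set_title(title)

fig.tight_layout()  # keeps titles and tick labels from overlapping
```

Each entry of `axs` is an ordinary Axes, so everything shown later with a single `ax` works the same way on any panel of the grid.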

In [None]:
# You can easily plot equations

x = np.linspace(0, 2 * np.pi, 10) # arguments are start, stop and number of samples
y = np.sin(x)

fig, ax = plt.subplots() # Create a figure containing a single axes (area to display the data)
ax.plot(x, y)
plt.show()

In [None]:
# Quick versions of various plots are possible

cost = [2.50, 1.23, 4.02, 3.25, 5.00, 4.40]
sales_per_day = [34, 62, 49, 22, 13, 19]

plt.scatter(cost, sales_per_day)
# plt.plot(cost, sales_per_day, "o")   # This line creates the same plot more efficiently, but loses the flexibility of the plt.scatter function
plt.show()
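
The "flexibility" of `plt.scatter` mentioned in the cell above mostly comes down to per-point styling. The sketch below reuses the same two lists and adds a third, `stock`, which is a made-up variable used only to drive the marker sizes.

```python
import matplotlib.pyplot as plt

cost = [2.50, 1.23, 4.02, 3.25, 5.00, 4.40]
sales_per_day = [34, 62, 49, 22, 13, 19]

# Hypothetical third variable, invented for illustration only
stock = [40, 120, 75, 30, 15, 25]

fig, ax = plt.subplots()

# s sets one marker size per point, c maps one value per point to a color
points = ax.scatter(cost, sales_per_day, s=stock, c=sales_per_day, cmap="viridis")

fig.colorbar(points, ax=ax, label="sales per day")
ax.set_xlabel("cost")
ax.set_ylabel("sales per day")
```

With `plt.plot(cost, sales_per_day, "o")`, every marker shares one size and one color, which is why it is quicker but less expressive.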

In [None]:
# You can add labels and annotations
# Control details of the "Artists"
# Basically everything visible on a figure is an Artist that can be controlled with "setters" when a method is called

fig, ax = plt.subplots()
ax.set_title('Title')
ax.set_xlabel('numbers')
ax.set_ylabel('other numbers', fontsize=14, color='red')  # you can pass keyword arguments to text functions
ax.annotate('local min', xy=(3, 2), xytext=(2, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05))
ax.plot([1, 2, 3, 4], [1, 4, 2, 3], color='orange'); # Plot some data on the axes and pick a color for the line.
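
Since everything visible is an Artist, you can also keep a reference to an Artist and adjust it after it has been drawn. This is a small sketch of that idea. The specific styles are arbitrary choices.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# ax.plot returns a list of Line2D artists; unpack the single line
(line,) = ax.plot([1, 2, 3, 4], [1, 4, 2, 3], color="orange")

# Setters change the artist in place
line.set_linewidth(3)
line.set_linestyle("--")
line.set_marker("o")

# set_title returns the Text artist, which has setters of its own
title = ax.set_title("Title")
title.set_fontsize(16)
title.set_color("navy")

# Matching getters report the current state
print(line.get_linewidth(), line.get_linestyle(), title.get_text())
```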

In [None]:
# Create some data pulling randomly from a uniform distribution:

np.random.seed(6)
x = 0.5 + np.arange(8)
y = np.random.uniform(2, 7, len(x))

# plot
fig, ax = plt.subplots()

ax.bar(x, y, width=1, edgecolor="white", linewidth=0.7)

# Play with the two different sets of axis controls below:

# ax.set(xlim=(0, 8), xticks=np.arange(1, 8),
#        ylim=(0, 8), yticks=np.arange(1, 8))

# ax.set_xticks(np.arange(0, 8, 2)) # high value not included
# ax.set_yticks([0, 4, 8])  # note that we don't need to specify labels
# ax.set_title('Fewer ticks');

plt.show()

We've been looking at creating figures with simple numbers, equations and randomly generated numbers, but what if we want to load in our own data?

To load data, we turn to other Python packages. This is where NumPy and Pandas come in.

### NumPy

NumPy arrays are powerful for working with data.
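
Before reading the workshop files, it may help to see what `np.genfromtxt` returns on a tiny table. The sketch below reads a three-row CSV held in memory. The column names and values are made up for illustration and stand in for a file on disk.

```python
import io

import numpy as np

# Hypothetical three-row table standing in for a CSV file on disk
csv_text = "id,price,reviews\n1,120,15\n2,85,40\n3,200,3\n"

# Default behavior: one 2D array with one row per CSV row
table = np.genfromtxt(io.StringIO(csv_text), dtype=int, delimiter=",",
                      skip_header=1, usecols=(1, 2))

# unpack=True transposes the result so each column becomes its own 1D array
price, reviews = np.genfromtxt(io.StringIO(csv_text), dtype=int, delimiter=",",
                               skip_header=1, usecols=(1, 2), unpack=True)

print(table.shape)      # rows x selected columns
print(price, reviews)   # the two columns as separate arrays
print(table[:, 0])      # slicing the 2D array recovers the first selected column
```

Both calls read the same data. The choice is whether you would rather slice one 2D array or work with one named array per column.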

In [None]:
# Here we'll import a simple spreadsheet containing two columns of integers

data = np.genfromtxt('data/SomeNumbers.csv',
                     dtype=int, delimiter=',',
                     skip_header=1)
print(data)

In [None]:
fig, ax = plt.subplots()  # Create a figure containing a single axes (area to display the data)
ax.set_title('number series')
ax.plot(data);  # Plot some data on the axes.

In [None]:
airb1, airb2 = np.genfromtxt('data/AirbnbDenver_Sample.csv',   # This data is a trimmed version of the full summary listings from Inside Airbnb
                 dtype = int,
                 delimiter = ',',
                 unpack = True,   # This puts values separated into airb1 and airb2 (2 1D arrays)
                 skip_header=1,   # lets numpy know the first row is field labels
                 usecols = (9,11))  # These columns are price and number_of_reviews - always look at your source data!

print(airb1, airb2)

In [None]:
fig, ax = plt.subplots()
# ax.plot(airb1, airb2, 'o')   # You can change the marker style to be a circle
# ax.plot(airb1, airb2, 'd')   # ... a diamond
# ax.plot(airb1, airb2, 'v')   # ... a downward triangle
ax.plot(airb1, airb2, 's')   # ... or a square
ax.set_title('Airbnb Prices vs Number of Reviews')
ax.set_xlabel('Price')
ax.set_ylabel('Reviews')
plt.show()

In [None]:
# Without unpacking, the data is read in as one 2D array, "airb"

airb = np.genfromtxt('data/AirbnbDenver_Sample.csv',
                 dtype = int,
                 delimiter = ',',
                 skip_header=1,   
                 usecols = (9,11))  # These columns are price and number_of_reviews
print(airb)

In [None]:
# The default plot is of the values in the array
# The following two sets of code create the same graph

# fig, ax = plt.subplots()
# ax.plot(airb);

plt.plot(airb)
plt.show()

### Pandas

Instead of working with NumPy arrays directly, you can use Pandas to create visualizations, since its plotting methods are built on top of matplotlib.

Pandas is a package that gives us the capability of putting all of our data into a **dataframe**. It's like pulling a spreadsheet into python! 

We'll look at Pandas more closely in a moment, but here's a quick look at what it can do.
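
As a rough sketch of the "spreadsheet in python" idea, the example below builds a small dataframe by hand and plots it. The neighbourhood names and numbers are invented for illustration and are not taken from the Inside Airbnb data.

```python
import pandas as pd

# Hypothetical values, made up for illustration
listings = pd.DataFrame({
    "neighbourhood": ["Five Points", "Highland", "Baker", "Capitol Hill"],
    "price": [120, 185, 95, 110],
    "number_of_reviews": [15, 40, 3, 22],
})

# Summary statistics for the numeric columns
print(listings.describe())

# Sort by a column, keep the first three rows, and plot them with matplotlib
cheapest = listings.sort_values("price").head(3)
ax = cheapest.plot.bar(x="neighbourhood", y="price", legend=False)
ax.set_ylabel("price")
```

The `.plot.bar()` call returns a matplotlib Axes, so the setters from the Matplotlib section apply to it directly.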

In [None]:
# Here we're using Pandas (pd) to read in our data and read the column headers

airbnbData = pd.read_csv("data/AirBnB-Denver_20211229_listingssummary.csv")   # This is the full Denver dataset from Inside Airbnb

airbnbData.columns

In [None]:
# We'll create a bar chart with Pandas making use of matplotlib.
# The dataset is quite large, so we can limit it to the first 15 entries in our graphic.

airbnbData.head(15).plot.bar(x="neighbourhood", y ="price")

### Seaborn

Seaborn is a data visualization library that is also built on matplotlib and uses pandas.

This library comes with datasets built in. We can easily access one of these called *penguins* to see the graphics Seaborn can quickly create...

In [None]:
# The following simple load command is applicable to the built-in datasets

df = sns.load_dataset("penguins")
sns.pairplot(df, hue="species")

In [None]:
# Here is another style of plot Seaborn can make with one line of code after the data is loaded

df = sns.load_dataset("penguins")
sns.histplot(data=df, x="flipper_length_mm", hue="species", multiple="stack")

Seaborn depends on matplotlib, numpy and pandas (scipy is an optional dependency in recent releases). The required libraries will be installed when you install seaborn if you don't have them already.

Seaborn's functioning is tightly integrated with the pandas dataframe.
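
In practice, "tightly integrated" means you hand Seaborn the dataframe and refer to columns by name. Here is a minimal sketch with a hand-made dataframe. The values and the `room_type` categories are assumptions for illustration only.

```python
import pandas as pd
import seaborn as sns

# Hypothetical listings, made up for illustration
listings = pd.DataFrame({
    "price": [120, 185, 95, 110, 240, 75],
    "number_of_reviews": [15, 40, 3, 22, 8, 61],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Entire home/apt", "Private room"],
})

# Column names select x, y and the color grouping; axis labels come from them too
ax = sns.scatterplot(data=listings, x="price", y="number_of_reviews", hue="room_type")
```

Because the column names double as axis labels and legend entries, tidy and descriptive headers in your .csv pay off directly in the figure.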

In [None]:
# If we want to use our own dataset, we'll initially work with our .csv in pandas
# We'll create a dataframe called "airb" for the truncated dataset
# and one called "airbnbData" for the full dataset

airb = pd.read_csv("data/AirbnbDenver_Sample.csv")
airb.head(3)

In [None]:
# Let's get a summary of our column headers

list(airb.columns)

In [None]:
airbnbData = pd.read_csv("data/AirBnB-Denver_20211229_listingssummary.csv")

In [None]:
# The columns for the full dataset are the same if you want to inspect them:

# list(airbnbData.columns)

Seaborn has a number of built-in themes that let you independently control the style and scaling of the plot, so you can quickly translate your work between presentation contexts.
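
One way to see the split between style and scaling is to apply both temporarily. The sketch below uses Seaborn's `axes_style` and `plotting_context` as context managers so that the global defaults stay untouched. The "whitegrid" style and the "talk" context are arbitrary picks.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Each context scales fonts and line widths for a different medium
sizes = {}
for context in ["paper", "notebook", "talk", "poster"]:
    sizes[context] = sns.plotting_context(context)["font.size"]
    print(context, sizes[context])

# Style (the look) and context (the scaling) are chosen independently,
# and only apply inside the with-block
with sns.axes_style("whitegrid"), sns.plotting_context("talk"):
    fig, ax = plt.subplots(figsize=(6.5, 6.5))
    ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
    ax.set_title("whitegrid style, talk context")
```

Calling `sns.set_theme(style=..., context=...)` instead would make the same choices the default for every later plot in the notebook.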

In [None]:
# Leave the following default theme off for your first go through the next commands,
# then come back and re-run them with this theme on to see how the plot changes.

# sns.set_theme()

In [None]:
f, ax = plt.subplots(figsize=(6.5, 6.5))

Remember, we are interested in creating figures to insert into reports, papers, etc. We can save the figure with the `matplotlib.pyplot.savefig()` function (remember, we pulled this in as plt!). You can choose to save a .jpg or an .svg!
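
Here is a short sketch of saving one figure in two formats. The file names, the temporary output folder and the 300 dpi setting are assumptions for illustration. Check your publisher's guidelines for the resolution and format they actually want.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6.5, 4))   # figsize is in inches
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])
ax.set_xlabel("numbers")
ax.set_ylabel("other numbers")

# Hypothetical output folder; replace with your own figures directory
out_dir = tempfile.mkdtemp()
svg_path = os.path.join(out_dir, "example.svg")
png_path = os.path.join(out_dir, "example.png")

# The file extension picks the format; bbox_inches="tight" trims extra whitespace
fig.savefig(svg_path, bbox_inches="tight")            # vector: scales without blurring
fig.savefig(png_path, dpi=300, bbox_inches="tight")   # raster: dpi sets the resolution

print(os.path.getsize(svg_path), os.path.getsize(png_path))
```

Vector formats such as .svg stay sharp at any size, while raster formats such as .jpg or .png need a dpi high enough for print.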

In [None]:
sns.scatterplot(x="price", y="number_of_reviews",
                data=airb)
plt.savefig('AirbnbSubset-pricetoreviews.svg')

In [None]:
sns.scatterplot(x="price", y="number_of_reviews",
                data=airbnbData)

### Maps! - Introducing GeoPandas

Let's look at the geopandas library for creating geographic visualizations.

GeoPandas extends the Pandas package so that, as the name suggests, our dataframes can hold geospatial data. It can:

- Create a geo-enabled point data column from lat/lon information
- Directly load geospatial vector files such as shapefiles and GeoJSONs
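
The first capability in the list above can be sketched on a tiny table. The three points below are made-up coordinates near Denver, and the `crs="EPSG:4326"` argument is an assumption that the lat/lon values are plain WGS 84 degrees.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical points, made up for illustration
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "latitude": [39.7392, 39.7621, 39.7111],
    "longitude": [-104.9903, -105.0116, -104.9812],
})

# points_from_xy takes x (longitude) first, then y (latitude)
gdf_demo = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",   # assumption: coordinates are WGS 84 degrees
)

print(gdf_demo.crs)
print(gdf_demo.geometry.head())
```

Declaring the CRS when the GeoDataFrame is created makes later reprojection with `to_crs()` possible. Without it, GeoPandas has no way to know what the numbers mean.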

In [None]:
# Now, we'll use geopandas to geo-enable our .csv since it has lat & lon data!

gdf = gpd.GeoDataFrame(
    airbnbData, geometry=gpd.points_from_xy(airbnbData.longitude, airbnbData.latitude))

In [None]:
# We've added a geometry column at the end:

print(list(gdf.columns))

gdf.head(3)

In [None]:
# The .plot() method in geopandas is based on matplotlib
# there is an argument "kind" where the default is "geo" so it auto-magically makes a map

gdf.plot()

In [None]:
# Let's create a map plot with a base reference layer
# And yes, geopandas can directly read shapefiles!

denver = gpd.read_file("data/county_boundary_lines.shp")

base = denver.plot()
gdf.plot(ax=base, marker='*',color='green',markersize=3)
plt.show()

#### Adding a basemap

We can also add a basemap in geopandas, but we'll need a new package for this called contextily.

Contextily retrieves web map tiles from several sources like OpenStreetMap and Stamen.

In [None]:
import geopandas as gpd
import contextily as cx

For this section, we'll use a dataset that is built into the GeoPandas library called *naturalearth_cities*, which contains only a selection of the world's cities, not ALL of them.

In [None]:
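# Note: gpd.datasets is deprecated in GeoPandas 0.14 and removed in 1.0, so this line needs an older release
gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))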
ax = gdf.plot(figsize=(20, 20), alpha=0.5, edgecolor='k')   # alpha controls symbol transparency

In [None]:
# Checking our coordinate reference system (they are important!):
# Before adding web map tiles to this plot, we first need to make sure the coordinate
# reference systems (CRS) of the tiles and the data match. Web map tiles are typically
# provided in Web Mercator (EPSG 3857), so let's first check what CRS our cities data is in:

gdf.crs

In [None]:
gdf_wm = gdf.to_crs(epsg=3857)
gdf_wm.crs

In [None]:
ax = gdf_wm.plot(figsize=(20, 20), alpha=0.5, edgecolor='k')

In [None]:
ax = gdf_wm.plot(figsize=(20, 20), alpha=0.5, edgecolor='k')
cx.add_basemap(ax)

Older contextily releases default to the Stamen Terrain style, while newer releases default to OpenStreetMap. Either way, we can specify a different style using `cx.providers`.

You'll notice that the axis labels are no longer lat and lon. They are now Web Mercator coordinates in meters, which is the projection the web tiles use. We can turn these labels off.
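
To see where those large axis numbers come from, here is a sketch of the standard spherical Web Mercator formula written out by hand. It is only meant to illustrate what a reprojection to EPSG 3857 does to a lon/lat pair. The function name and the sample point near Denver are assumptions for illustration, and in practice `to_crs()` does this work for you.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert degrees of longitude/latitude to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Hypothetical sample point near Denver
x, y = lonlat_to_web_mercator(-104.9903, 39.7392)
print(round(x), round(y))
```

Longitude maps linearly to x, while latitude is stretched more and more toward the poles, which is why high-latitude areas look oversized on web maps.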

And again, our goal is to create a figure to insert into a publication! We can save the figure with the `matplotlib.pyplot.savefig()` function (remember, we pulled this in as plt!).

In [None]:
ax = gdf_wm.plot(figsize=(20, 20), alpha=0.5, edgecolor='k')
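# Note: newer contextily/xyzservices releases may no longer list the Stamen providers;
# if this line fails, another provider such as cx.providers.CartoDB.Positron can be swapped in
cx.add_basemap(ax, source=cx.providers.Stamen.TonerLite)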
ax.set_axis_off()
plt.savefig('naturalearthcities.jpg')

So now you have some tools to explore and create a number of static data visualizations you can export to figures.

Go back to the documentation guides for the libraries and packages above, and start building some of your own Jupyter Notebooks out of Anaconda Navigator (be sure to have all the right libraries added to your Python environment!). HAVE FUN creating useful figures for your research!