# Understanding Matplotlib

Materials adapted from the RealPython tutorial [Python Plotting with Matplotlib](https://realpython.com/python-matplotlib-guide/)

**Table of Contents**

1. [Setting up matplotlib](#sec1)
2. [Creating one plot and changing its style](#sec2)
3. [Creating subplots](#sec3)
4. [Real-data example: California Housing](#sec4)

<a id="sec1"></a>
## 1. Setting up matplotlib

**Importing what we need**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(444) # seed the random number generator so results are reproducible

**The simplest plot**

In [None]:
plt.plot([4, 5, 6])

**The top-level object: `Figure`**

In [None]:
fig, ax = plt.subplots()
print(type(fig))

**Use object hierarchy to get a nested element**

In [None]:
one_tick = fig.axes[0].yaxis.get_major_ticks()[0]
type(one_tick)
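
The chain `fig.axes[0].yaxis.get_major_ticks()[0]` walks the object hierarchy one level at a time. The following is a minimal sketch of that walk with each level checked separately; the variable name `first_tick` is only illustrative and not part of the original tutorial.

```python
import matplotlib.pyplot as plt
from matplotlib.axis import YAxis, YTick

fig, ax = plt.subplots()

# Level 1: a Figure keeps a list of the Axes it contains.
# The Axes returned by plt.subplots() is the first entry of that list.
print(fig.axes[0] is ax)

# Level 2: each Axes owns an XAxis and a YAxis object.
print(isinstance(ax.yaxis, YAxis))

# Level 3: each axis object owns its tick objects.
first_tick = ax.yaxis.get_major_ticks()[0]
print(isinstance(first_tick, YTick))
```

All three checks should print `True`, which confirms that the tick is reached through the figure, then the axes, then the y-axis.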

<a id="sec2"></a>
## 2. Creating one plot and changing its style

Creating and decorating a single plot.

In [None]:
# Generate data
import numpy as np
rng = np.arange(50)
rnd = np.random.randint(0, 10, size=(3, rng.size))
yrs = 1950 + rng

In [None]:
# Generate plot
fig, ax = plt.subplots(figsize=(5, 3))
ax.stackplot(yrs, rng + rnd, labels=['Eastasia', 'Eurasia', 'Oceania'])
ax.set_title('Combined debt growth over time')
ax.legend(loc='upper left')
ax.set_ylabel('Total debt')
ax.set_xlim(xmin=yrs[0], xmax=yrs[-1]) # range of values
fig.tight_layout() # clean whitespace

**Changing the plot style**

The default matplotlib style is fairly plain. You can change the appearance of an entire plot by setting a style. Each style defines its own color scheme, font family and size, background color, grid lines, and so on. Here are two well-known styles:

In [None]:
# The famous data & statistics website
plt.style.use('fivethirtyeight')

In [None]:
# Generate plot
def generateOnePlot():
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.stackplot(yrs, rng + rnd, labels=['Eastasia', 'Eurasia', 'Oceania'])
    ax.set_title('Combined debt growth over time')
    ax.legend(loc='upper left')
    ax.set_ylabel('Total debt')
    ax.set_xlim(xmin=yrs[0], xmax=yrs[-1]) # range of values
    fig.tight_layout() # clean whitespace

generateOnePlot()

In [None]:
# The famous R library for plotting, ggplot
plt.style.use('ggplot')

In [None]:
generateOnePlot()

Matplotlib ships with many more styles. You can list them as follows:

In [None]:
print(plt.style.available)

**Try out some of the styles below**

Replace "bmh" with any of the style names printed above, and rerun the cell.

In [None]:
plt.style.use('bmh') 
generateOnePlot()
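
Note that `plt.style.use` changes the settings for every plot created afterwards. If you only want to try a style on one plot, `plt.style.context` applies it temporarily. The sketch below assumes the `ggplot` style is available (it appears in `plt.style.available` in standard installations); the names `before`, `inside`, and `after` are only illustrative.

```python
import matplotlib.pyplot as plt

# Remember one setting so we can see whether it changes.
before = plt.rcParams['axes.facecolor']

# Inside the with-block the 'ggplot' style is active.
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.plot([4, 5, 6])
    ax.set_title('Styled only inside the with-block')

# After the block, the previous settings are restored.
after = plt.rcParams['axes.facecolor']
print(before == after)
```

The figure created inside the block keeps the temporary style, while plots created later use whatever style was active before.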

<a id="sec3"></a>
## 3. Creating subplots

In [None]:
plt.style.use('ggplot') 


# Generate the data
x = np.random.randint(low=1, high=11, size=50)
y = x + np.random.randint(1, 5, size=x.size)
data = np.column_stack((x, y))

# Create figure and axes
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,
                               figsize=(8, 4))

# Create the first subplot as a scatterplot
ax1.scatter(x=x, y=y, marker='o', c='r', edgecolor='b')
ax1.set_title('Scatter: $x$ versus $y$')
ax1.set_xlabel('$x$') # TeX style: enclosing the value in dollar signs italicizes it
ax1.set_ylabel('$y$')

# Create the second subplot as a histogram
ax2.hist(data, bins=np.arange(data.min(), data.max()),
         label=('x', 'y'))
ax2.legend(loc=(0.65, 0.8))
ax2.set_title('Frequencies of $x$ and $y$')
ax2.yaxis.tick_right()

Let's create a 2x2 grid of subplots:

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(7, 7))

What is the value of ax at the moment?

In [None]:
ax

In [None]:
type(ax)

It's a 2D NumPy array of axes objects. To access an individual axes we need two indices, such as:

In [None]:
ax[1][1]

But we can also **flatten** the array, so that it becomes a 1D array, which is easier to index and to unpack:

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(7, 7))
ax1, ax2, ax3, ax4 = ax.flatten()  # flatten a 2d NumPy array to 1d
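
A flattened array is also convenient to loop over. The following sketch draws a different curve in each of the four axes; the data (`x` raised to increasing powers) and the names `axes`, `flat_axes`, and `ax_i` are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(7, 7))

# flatten() returns a 1D NumPy array holding the four axes, row by row.
flat_axes = axes.flatten()

x = np.linspace(0, 1, 50)
for i, ax_i in enumerate(flat_axes):
    ax_i.plot(x, x ** (i + 1))       # x, x^2, x^3, x^4
    ax_i.set_title(f'$x^{i + 1}$')   # TeX-style title for each panel

fig.tight_layout()
```

The last element of the flattened array is the same object as `axes[1, 1]`, so both ways of indexing reach the same subplot.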

<a id="sec4"></a>
## 4. Real-data example: California Housing

In [None]:
from io import BytesIO
import tarfile
from urllib.request import urlopen

url = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
b = BytesIO(urlopen(url).read())
fpath = 'CaliforniaHousing/cal_housing.data'

with tarfile.open(mode='r', fileobj=b) as archive:
    housing = np.loadtxt(archive.extractfile(fpath), delimiter=',')

How big is the `housing` array?

In [None]:
housing.shape

In [None]:
housing[0]

Let's create a dataframe, just to see the data:

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(housing)
df.head()

Get the response variable (also called the dependent variable), which corresponds to the "area's average home value".
It is stored in the last column:

In [None]:
y = housing[:, -1]

Let's name two other variables:

`pop` - area's population  
`age` - the average house age

In [None]:
pop, age = housing[:, [4, 7]].T # transpose so each selected column unpacks into its own 1D array
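
The column selection and the transpose trick are easier to follow on a tiny array. This sketch uses a made-up 3x4 array named `toy` (not the housing data) to show what each indexing step returns.

```python
import numpy as np

# A small stand-in array with 3 rows and 4 columns:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
toy = np.arange(12).reshape(3, 4)

# [:, -1] keeps every row and takes only the last column.
last_col = toy[:, -1]

# [:, [0, 2]] keeps every row and takes columns 0 and 2, giving shape (3, 2).
picked = toy[:, [0, 2]]

# Transposing gives shape (2, 3), so unpacking yields one 1D array per column.
first, third = picked.T

print(last_col)   # [ 3  7 11]
print(first)      # [0 4 8]
print(third)      # [ 2  6 10]
```

Without the `.T`, unpacking would fail here because the first dimension of `picked` has three entries (the rows) rather than two.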

**Helper function for the in-plot title**

In [None]:
def add_titlebox(ax, text):
    ax.text(.55, .8, text,
        horizontalalignment='center',
        transform=ax.transAxes,
        bbox=dict(facecolor='white', alpha=0.6),
        fontsize=12.5)
    return ax

**Create a subgrid and plot the data**

In [None]:
# Create the grid
gridsize = (3, 2)
fig = plt.figure(figsize=(12, 8))
ax1 = plt.subplot2grid(gridsize, (0, 0), colspan=2, rowspan=2)
ax2 = plt.subplot2grid(gridsize, (2, 0))
ax3 = plt.subplot2grid(gridsize, (2, 1))


# Plot data in each axes of the grid
ax1.set_title('Home value as a function of home age & area population',
              fontsize=14)
sctr = ax1.scatter(x=age, y=pop, c=y, cmap='RdYlGn')
plt.colorbar(sctr, ax=ax1, format='$%d')
ax1.set_yscale('log')
ax2.hist(age, bins='auto')
ax3.hist(pop, bins='auto', log=True)

add_titlebox(ax2, 'Histogram: home age')
add_titlebox(ax3, 'Histogram: area population (log scl.)')

plt.savefig("california-housing.png") # save the figure to disk before showing it
plt.show()
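
To see how `plt.subplot2grid` divided the 3x2 grid above, you can inspect which grid rows and columns each axes occupies. This is a minimal sketch without any data; the names `big`, `left`, `right`, and `layout` are illustrative only.

```python
import matplotlib.pyplot as plt

gridsize = (3, 2)  # 3 rows, 2 columns
fig = plt.figure(figsize=(12, 8))

# The large axes starts at cell (0, 0) and spans 2 rows and 2 columns.
big = plt.subplot2grid(gridsize, (0, 0), colspan=2, rowspan=2)
# The two small axes each occupy a single cell in the bottom row.
left = plt.subplot2grid(gridsize, (2, 0))
right = plt.subplot2grid(gridsize, (2, 1))

# Collect the grid rows and columns covered by each axes.
layout = {}
for name, axes in [('big', big), ('left', left), ('right', right)]:
    spec = axes.get_subplotspec()
    layout[name] = (list(spec.rowspan), list(spec.colspan))
    print(name, 'rows:', layout[name][0], 'columns:', layout[name][1])
```

The large axes covers rows 0 and 1 across both columns, and the two small axes share row 2, which matches the layout of the housing figure.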