# Plotting with Pandas and Matplotlib

This is a brief supplement to demonstrate the replication of the plots from Lesson 6 using matplotlib together with Pandas.  

**GOALS**: 

- Use matplotlib to customize plots

- Plot data from Pandas DataFrame

- Use Seaborn to include additional Features in Visualizations

**Learning Objectives**:

- Create a ggplot object
- Set universal plot settings
- Modify an existing ggplot object
- Change the aesthetics of a plot such as colour
- Edit the axis labels
- Build complex plots using a step-by-step approach
- Create scatter plots, box plots, and time series plots
- Use the facet_wrap and facet_grid commands to create a collection of plots splitting the data by a factor variable
- Create customized plot styles to meet their needs

## Matplotlib in the Notebook

We can use Jupyter notebooks to investigate plots interactively with the `%matplotlib notebook` magic command.  It creates a figure object embedded in the notebook that supports saving, panning, and zooming.  Unlike the more traditional `%matplotlib inline` magic command, it also lets us keep interacting with a single plot across many cells.

First, we import the plotting library `pyplot` and abbreviate it as `plt`.  We then set the style to `seaborn-white`, which gives a clean, simple background.  We also import the Pandas and NumPy libraries as usual.

In [1]:
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')

import numpy as np
import pandas as pd
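Style names can differ between matplotlib releases: newer versions list the seaborn-derived styles under a `seaborn-v0_8-` prefix. The following is a minimal sketch for falling back gracefully, assuming either name may be present. `pick_style` is a hypothetical helper written for this example and is not part of matplotlib.

```python
import matplotlib.pyplot as plt


def pick_style(candidates, default="default"):
    """Return the first style in `candidates` that this matplotlib install provides."""
    for name in candidates:
        if name in plt.style.available:
            return name
    # None of the candidates exist, so fall back to matplotlib's default style
    return default


# Try the name used in this lesson first, then the newer spelling
style = pick_style(["seaborn-white", "seaborn-v0_8-white"])
plt.style.use(style)
```

Checking `plt.style.available` first avoids an error when the requested style is missing from the installed version.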

### A Simple ScatterPlot

To create a scatterplot, we can generate some random numbers using the `np.random.randn()` and `np.random.rand()` functions to represent the *x, y, color,* and *size* of the points for the plot.  Next, we create a figure and assign an optional `figsize` argument.  Finally, we apply the `BuPu` colormap available from matplotlib's built-in colormaps.

Somewhat like `ggplot2`, we can continue to add elements with additional lines.  For example, we add a main title and $x$ and $y$ axis labels.

In [2]:
x = np.random.randn(50)
y = np.random.randn(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)  # one size per point, matching x and y

plt.figure(figsize = (8, 8))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.4,cmap='BuPu')
plt.title("A Simple ScatterPlot")
plt.xlabel("A Label for X")
plt.ylabel("A Label for Y")

<matplotlib.text.Text at 0x7f3aacb00290>
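The same scatterplot can also be built with matplotlib's object-oriented interface, where each added element is a method call on the axes. This is a minimal sketch, not a replacement for the cell above. The seeded generator `rng` and the colorbar label text are choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)   # seeded so the figure is reproducible
n = 50
x = rng.standard_normal(n)
y = rng.standard_normal(n)
colors = rng.random(n)
sizes = 1000 * rng.random(n)     # one size per point

fig, ax = plt.subplots(figsize=(8, 8))
points = ax.scatter(x, y, c=colors, s=sizes, alpha=0.4, cmap="BuPu")

# Each element is added with its own call on the axes object
ax.set_title("A Simple ScatterPlot")
ax.set_xlabel("A Label for X")
ax.set_ylabel("A Label for Y")

# A colorbar shows how the values in `colors` map onto the BuPu colormap
fig.colorbar(points, ax=ax, label="color value")
```

Keeping a reference to `ax` makes it explicit which plot later cells modify.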

## Pandas and Matplotlib

We can load the `surveys.csv` dataset into the notebook as a DataFrame named `surveys_df`.  We can then call plot functions on the DataFrame itself or use a conventional matplotlib approach.

Similar to the first example, we can directly name the $x$ and $y$ values by reference to the DataFrame's column names.  By taking a quick peek at the head of the dataframe we can see the names of the columns that we are interested in plotting and call them accordingly.

In [3]:
surveys_df = pd.read_csv( '../data/surveys.csv', index_col=0)

In [4]:
surveys_df.head()

| record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
|-----------|-------|-----|------|---------|------------|-----|-----------------|--------|
| 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |


First, we'll look at a scatter plot of weight vs. hindfoot length.

In [5]:
plt.figure(figsize = (8, 8))
plt.scatter(surveys_df['weight'], surveys_df['hindfoot_length'])
plt.show()

Next, we'll add a title and labels and customize the display of the data.  The `s` parameter sets the size of the marks in a scatter plot, `alpha` controls the opacity of the points, and `c` controls the color.  

We'll plot the same data but vary the size by the weight and color by plot_id (location). 

In [6]:
plt.figure(figsize = (8,8))
plt.scatter(surveys_df['weight'], surveys_df['hindfoot_length'], 
            s = surveys_df['weight']*4, alpha = 0.15, 
            c = surveys_df['plot_id'], cmap = 'magma')
plt.title("Information About Animals Physique and Foot")
plt.xlabel("Weight")
plt.ylabel("Hindfoot Length")

<matplotlib.text.Text at 0x7f3aac8be9d0>

## Plotting From the DataFrame

We can also plot directly from the DataFrame.  Now, the `figsize` argument is a keyword in the scatter function itself.  This demonstrates the range of direct plotting methods available.  For more help with the DataFrame plotting, use the built-in help with `df.plot?`.  

Below, you see a scatterplot and a boxplot, each produced from the DataFrame with a single call.

In [7]:
surveys_df.plot.scatter('weight', 'hindfoot_length', c = 'plot_id', cmap = 'viridis', figsize = (8, 8),
                title= "Plotting from the DataFrame")

<matplotlib.axes._subplots.AxesSubplot at 0x7f3aac82df50>
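The DataFrame plotting methods return a matplotlib axes object, so a plot made with a single call can still be customized afterwards. The sketch below illustrates this with a small made-up DataFrame; `demo_df` and its values are hypothetical stand-ins for `surveys_df`, so the example runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for a few rows of surveys_df
demo_df = pd.DataFrame({
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0],
    "hindfoot_length": [37.0, 36.0, 33.0, 36.0, 35.0],
    "plot_id": [2, 7, 3, 1, 2],
})

# The plotting call returns the axes it drew on
ax = demo_df.plot.scatter(x="weight", y="hindfoot_length", c="plot_id",
                          cmap="viridis", figsize=(8, 8),
                          title="Plotting from the DataFrame")

# Further customization uses ordinary matplotlib methods
ax.set_xlabel("Weight")
ax.set_ylabel("Hindfoot Length")
```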

Coloring by plot ID doesn't appear to be very informative.  We might instead want to color by species.  Unfortunately, matplotlib expects numeric values when mapping colors through a colormap and can't use string categories like our `species_id` variable directly, so we need to convert it.  We'll add a new column with the numerical version so we can still use the string version.

Pandas can convert categorical variables (the `pandas` `category` type) to numerical codes (formally `int8`).  Our `species_id` is, statistically speaking, a categorical variable, but as we saw in lesson 1, `pandas` interprets it as `object`.  We can verify this:

In [8]:
surveys_df.dtypes

month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

We first need to convert to type `category`, then convert to numbers.  The first step is a straightforward type conversion, so we use the `astype` method.  The second uses a special `pandas` accessor for categorical variables, named `cat`.

In [9]:
# we can change the species id to a pandas category variable, which gives
# it more functionality, like a statistical categorical variable
surveys_df['species_id']= surveys_df['species_id'].astype('category')

# then we can add a column with it coded numerically
surveys_df['species_num'] = surveys_df.species_id.cat.codes

# finally check what we just changed
surveys_df.dtypes

month                 int64
day                   int64
year                  int64
plot_id               int64
species_id         category
sex                  object
hindfoot_length     float64
weight              float64
species_num            int8
dtype: object
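To see what the conversion does on a small scale, here is a minimal sketch using a made-up Series (the values in `species` are hypothetical). It also shows that missing values receive the code `-1`, and builds a lookup from codes back to the original labels.

```python
import pandas as pd

# Hypothetical species labels, including one missing value
species = pd.Series(["NL", "DM", "DM", None, "PF"])

# Step 1: plain type conversion to the pandas category type
species_cat = species.astype("category")

# Step 2: integer codes, assigned in the sorted order of the categories
codes = species_cat.cat.codes

# Mapping from each code back to its label, useful for legends
lookup = dict(enumerate(species_cat.cat.categories))

print(codes.tolist())   # [1, 0, 0, -1, 2]
print(lookup)           # {0: 'DM', 1: 'NL', 2: 'PF'}
```

Because missing values are coded as `-1`, they will be colored like any other code unless they are filtered out before plotting.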

In [10]:
# we can make the plot a bit neater by using a color map with the same number of
# levels as the number of species
cmap = plt.get_cmap('viridis', surveys_df['species_id'].nunique())


surveys_df.plot.scatter('weight', 'hindfoot_length', c = 'species_num', cmap = cmap, figsize = (8, 8),
                title= "Plotting from the DataFrame")

<matplotlib.axes._subplots.AxesSubplot at 0x7f3aac138c90>
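With a discrete colormap, the colorbar can be labeled with the category names instead of the numeric codes. The following is a sketch of one way to do that, using a small made-up DataFrame (`demo_df` is hypothetical). Centering each color band with `vmin` and `vmax` is a design choice of this example, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for surveys_df with three species
demo_df = pd.DataFrame({
    "weight": [40.0, 48.0, 29.0, 46.0, 36.0, 52.0],
    "hindfoot_length": [37.0, 36.0, 33.0, 36.0, 35.0, 38.0],
    "species_id": pd.Categorical(["DM", "DM", "NL", "PF", "NL", "PF"]),
})
codes = demo_df["species_id"].cat.codes
labels = list(demo_df["species_id"].cat.categories)
n_levels = len(labels)

# Colormap resampled to one color per species
cmap = plt.get_cmap("viridis", n_levels)

fig, ax = plt.subplots(figsize=(8, 8))
# vmin/vmax are offset by 0.5 so each integer code sits in the middle of its band
points = ax.scatter(demo_df["weight"], demo_df["hindfoot_length"],
                    c=codes, cmap=cmap, vmin=-0.5, vmax=n_levels - 0.5)

# One tick per code, relabeled with the species name
cbar = fig.colorbar(points, ax=ax, ticks=np.arange(n_levels))
cbar.ax.set_yticklabels(labels)
```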

We can also generate a box plot. 

In [11]:
surveys_df.boxplot('hindfoot_length', by = 'species_id', figsize = (9, 8))

<matplotlib.axes._subplots.AxesSubplot at 0x7f3aac138e50>

## Seaborn

The Seaborn library provides additional plotting functionality for data visualization in Python.  Here, we show how Seaborn produces a typical regression plot.  Seaborn handles much of the formatting itself, so visually appealing plots take less work.


In [12]:
import seaborn as sns
sns.set_style("whitegrid")

In [13]:
plt.figure()
sns.regplot(x=x, y=y)

<matplotlib.axes._subplots.AxesSubplot at 0x7f3aafe32990>

Seaborn can also draw a boxplot factored by `sex`.
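One way to draw such a boxplot is sketched here, assuming Seaborn's `boxplot` function with `sex` on the x axis. The DataFrame `demo_df` and its values are made up so that the example runs on its own; with the lesson data you would pass `surveys_df` instead.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for surveys_df
demo_df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M", "F"],
    "hindfoot_length": [32.0, 37.0, 33.0, 36.0, 35.0, 34.0],
})

fig, ax = plt.subplots(figsize=(6, 6))
# One box per level of `sex`, drawn on the axes created above
sns.boxplot(x="sex", y="hindfoot_length", data=demo_df, ax=ax)
ax.set_title("Hindfoot Length by Sex")
```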

A jointplot is a scatter plot with histograms on the axes.  

In [14]:
plt.show()
# `height` sets the figure size (older seaborn releases called this argument `size`)
sns.jointplot(x = 'weight', y = 'hindfoot_length', data = surveys_df, height = 9, color = 'darkorchid', alpha = 0.4)

<seaborn.axisgrid.JointGrid at 0x7f3aac840290>

Finally, we add some additional layers to the boxplot by using a violinplot that is split by the `sex` variable.  First, let's learn about the violinplot so that we can choose the right settings for our data and for what we are interested in.  We can either view the documentation (docs) for this online or use the `help()` function.  To avoid switching windows, we'll use `help()`, but the online versions are compiled with better formatting and are therefore often easier to read.

In [15]:
help(sns.violinplot)

Help on function violinplot in module seaborn.categorical:

violinplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, bw='scott', cut=2, scale='area', scale_hue=True, gridsize=100, width=0.8, inner='box', split=False, orient=None, linewidth=None, color=None, palette=None, saturation=0.75, ax=None, **kwargs)
    Draw a combination of boxplot and kernel density estimate.
    
    A violin plot plays a similar role as a box and whisker plot. It shows the
    distribution of quantitative data across several levels of one (or more)
    categorical variables such that those distributions can be compared. Unlike
    a box plot, in which all of the plot components correspond to actual
    datapoints, the violin plot features a kernel density estimation of the
    underlying distribution.
    
    This can be an effective and attractive way to show multiple distributions
    of data at once, but keep in mind that the estimation procedure is
    influenced by the sample size, a

In [16]:
plt.figure(figsize = (9,8))
sns.violinplot(x = 'species_id', y = 'hindfoot_length', hue = 'sex',
              data = surveys_df,   split = True, inner = 'quartile',
              palette = ['navy', 'fuchsia'])
plt.title("ID and Hindfoot Length by Gender")

<matplotlib.text.Text at 0x7f3a92aa6190>

This plot is difficult to read because the x axis is crowded.  Also, many of the species do not have any hindfoot length data, as we saw in lesson 2.

In [17]:
surveys_df['species_id'].nunique()

48

In [18]:
complete_hl = surveys_df.dropna(axis=0, how= 'any', subset=['hindfoot_length'])
complete_hl['species_id'].nunique()

25
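One possible way to uncrowd the x axis is to keep only the species with the most measurements before plotting. The sketch below shows the filtering step on a small made-up DataFrame. `demo_df`, its values, and the cutoff of two species are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for surveys_df; "AB" has no hindfoot measurements
demo_df = pd.DataFrame({
    "species_id": pd.Categorical(
        ["DM", "DM", "DM", "NL", "NL", "NL", "PF", "AB"]),
    "hindfoot_length": [37.0, 36.0, 35.0, 32.0, 33.0, None, 16.0, None],
})

# Drop rows without a hindfoot measurement
complete = demo_df.dropna(subset=["hindfoot_length"])

# Species with the most remaining rows (top 2 here, purely as an example)
top_species = complete["species_id"].value_counts().head(2).index

# Keep only those species and forget the category levels that are now empty
subset = complete[complete["species_id"].isin(top_species)].copy()
subset["species_id"] = subset["species_id"].cat.remove_unused_categories()

print(list(subset["species_id"].cat.categories))   # ['DM', 'NL']
```

Removing the unused categories matters because plotting functions that follow the categorical levels may otherwise reserve an empty slot for each dropped species.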