Back to [Section 1](GCP_Data_Analysis_01.ipynb)

# Plotting with Pandas and Matplotlib

This tutorial is based on the [Data Carpentry Python Ecology Lesson](https://datacarpentry.org/python-ecology-lesson).

**GOALS**: 

- Use matplotlib to customize plots

- Plot data from a Pandas DataFrame

- Use Seaborn to add features to visualizations


## Matplotlib in the Notebook

In a Jupyter notebook, a magic command controls how matplotlib figures are displayed.  In the classic notebook, `%matplotlib notebook` embeds an interactive figure that supports saving, panning, and zooming, and lets us keep working on a single plot across many cells.  The more traditional `%matplotlib inline` magic command, which the cell below uses, renders each figure as a static image directly beneath the cell that created it.

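First, we import the plotting library `pyplot` and abbreviate it as `plt`.  We then set the style to `seaborn-white`, which gives a clean, simple background.  Matplotlib 3.6 renamed this style to `seaborn-v0_8-white`, and newer releases no longer accept the old name.  Finally, we import the Pandas and NumPy libraries as usual.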

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
plt.style.use('seaborn-white')

import numpy as np
import pandas as pd
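
Because the style name depends on the installed matplotlib version, it can help to choose it at run time.  The following is a minimal sketch, not part of the original lesson: it checks `plt.style.available` for either spelling and falls back to matplotlib's built-in `default` style if neither is present.  The variable names `candidates` and `style` are illustrative.

```python
import matplotlib.pyplot as plt

# Candidate names for the same look, oldest spelling first.
candidates = ["seaborn-white", "seaborn-v0_8-white"]

# Pick the first candidate this matplotlib installation knows about;
# fall back to the built-in "default" style if neither is registered.
style = next((name for name in candidates if name in plt.style.available), "default")
plt.style.use(style)
print("Using style:", style)
```

This keeps the notebook running on both older and newer installations without editing the cell by hand.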

### A Simple ScatterPlot

To create a scatterplot, we generate some random numbers with NumPy: `np.random.randn()` draws the *x* and *y* coordinates from a standard normal distribution, and `np.random.rand()` draws the *color* and *size* values uniformly from $[0, 1)$.  All four arrays need one entry per point.  Next, we create a figure and pass the optional `figsize` argument.  Finally, we draw the points and apply `BuPu`, one of matplotlib's built-in colormaps.

Somewhat like `ggplot2`, we can keep adding elements to the plot with additional function calls.  For example, we add a main title and $x$ and $y$ axis labels.

In [None]:
x = np.random.randn(50)
y = np.random.randn(50)
colors = np.random.rand(50)
sizes = 1000 * np.random.rand(50)  # one size per point, matching x and y

plt.figure(figsize = (10, 8))
plt.scatter(x, y, c=colors, s=sizes, alpha=0.4, cmap='BuPu')
plt.title("A Simple ScatterPlot")
plt.xlabel("A Label for X")
plt.ylabel("A Label for Y")
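
The same figure can also be built with matplotlib's object-oriented interface, which makes it explicit which axes each call modifies.  The sketch below is an illustration rather than part of the original lesson: it seeds a NumPy random generator so the figure is reproducible, and it adds a colorbar so the `BuPu` mapping can be read off the plot.  The seed value and the colorbar label are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded so the figure is reproducible
n = 50
x = rng.standard_normal(n)      # x and y: standard normal draws
y = rng.standard_normal(n)
colors = rng.random(n)          # uniform values in [0, 1), mapped through the colormap
sizes = 1000 * rng.random(n)    # marker areas, one per point

fig, ax = plt.subplots(figsize=(10, 8))
points = ax.scatter(x, y, c=colors, s=sizes, alpha=0.4, cmap="BuPu")
ax.set_title("A Simple ScatterPlot")
ax.set_xlabel("A Label for X")
ax.set_ylabel("A Label for Y")

# The colorbar gets its own axes next to the scatterplot.
fig.colorbar(points, ax=ax, label="Color value")
```

Holding on to `fig` and `ax` is convenient when a notebook contains several figures, because later cells can modify a specific plot by name.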

## Pandas and Matplotlib

We can load the `surveys_complete.csv` dataset into the notebook as a DataFrame named `df`.  We can then either call plotting methods on the DataFrame itself or pass its columns to the usual matplotlib functions.  A later cell reloads `df` from `surveys.csv`, so the plots after that point use that file.

As in the first example, we supply the $x$ and $y$ values directly, this time by referring to the DataFrame's column names.  A quick peek at the head of the DataFrame shows the names of the columns we are interested in plotting.

In [None]:
df = pd.read_csv('data/surveys_complete.csv', index_col=0)

In [None]:
df.head()
#print(df.count())

In [None]:
plt.figure(figsize = (10, 8))
plt.scatter(df['weight'], df['hindfoot_length'])

In [None]:
df = pd.read_csv('data/surveys.csv', index_col=0)
df.head()
#pd.unique(df['species_id'])

In [None]:
plt.figure(figsize = (12,10))
# Marker area scales with weight; marker color encodes the plot_id
plt.scatter(df['weight'], df['hindfoot_length'], s = df['weight']*4,
            alpha = 0.15, c = df['plot_id'], cmap = 'magma')
plt.title("Animal Weight and Hindfoot Length")
plt.xlabel("Weight")
plt.ylabel("Hindfoot Length")
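
If the plotted columns contain missing values, matplotlib generally leaves those points out without warning, so it can be useful to drop them explicitly and see how many rows remain.  The sketch below uses a small made-up DataFrame named `toy` in place of the survey data; the column names follow the cells above, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the survey data, with some missing measurements.
toy = pd.DataFrame({
    "weight": [40.0, np.nan, 48.0, 29.0, 117.0],
    "hindfoot_length": [32.0, 33.0, np.nan, 19.0, 33.0],
    "plot_id": [2, 3, 2, 7, 3],
})

# Keep only rows where both plotted columns are present.
clean = toy.dropna(subset=["weight", "hindfoot_length"])
print(len(toy), "rows before,", len(clean), "rows after dropping missing values")

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(clean["weight"], clean["hindfoot_length"],
           s=clean["weight"] * 4, c=clean["plot_id"], cmap="magma", alpha=0.6)
ax.set_xlabel("Weight")
ax.set_ylabel("Hindfoot Length")
```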

## Plotting From the DataFrame

We can also plot directly from the DataFrame through its `plot` accessor.  Here, `figsize` is passed as a keyword argument to the plotting method itself, so no separate `plt.figure()` call is needed.  For more help with DataFrame plotting, use the built-in help with `df.plot?`.

Below, you see a scatterplot and a boxplot, each produced from the DataFrame with a single call.

In [None]:
df.plot.scatter('weight', 'hindfoot_length', c = 'plot_id', cmap = 'viridis', figsize = (10, 8),
                title= "Plotting from the DataFrame")

In [None]:
df.boxplot('hindfoot_length', by = 'species_id', figsize = (10, 8))
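
The DataFrame plotting methods return a matplotlib `Axes` object, so a plot created this way can still be customized afterwards.  The following is a small sketch on a made-up DataFrame named `toy` (the values are invented for illustration), showing the returned axes being relabeled.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the survey data.
toy = pd.DataFrame({
    "weight": [40.0, 48.0, 29.0, 117.0],
    "hindfoot_length": [32.0, 36.0, 19.0, 33.0],
    "plot_id": [2, 3, 7, 3],
})

# The plotting call returns the Axes it drew on.
ax = toy.plot.scatter(x="weight", y="hindfoot_length", c="plot_id",
                      cmap="viridis", figsize=(6, 4),
                      title="Plotting from the DataFrame")

# Any matplotlib Axes method can now be applied to the same plot.
ax.set_xlabel("Weight")
ax.set_ylabel("Hindfoot Length")
ax.grid(True)
```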

## Seaborn

The Seaborn library provides additional plotting functionality for data visualization in Python.  Here, we show how Seaborn produces a typical regression plot and a jointplot with histograms on the axes.  Finally, we draw a violin plot of hindfoot length for each species, split by the `sex` variable.

In [None]:
import seaborn as sns
sns.set_style("whitegrid")

In [None]:
plt.figure()
# Recent Seaborn releases require x and y to be passed as keywords
sns.regplot(x=x, y=y)

In [None]:
# The `size` argument was renamed to `height` in Seaborn 0.9
sns.jointplot(x='weight', y='hindfoot_length', data = df, height = 9, color = 'darkorchid', alpha = 0.4)

In [None]:
plt.figure(figsize = (12,10))
sns.violinplot(x = 'species_id', y = 'hindfoot_length', hue = 'sex',
              data = df,   split = True, inner = 'quartile',
              palette = ['fuchsia', 'gainsboro'])
plt.title("Hindfoot Length by Species ID and Sex")

In [None]:
# Basic descriptive statistics
df['weight'].describe()

In [None]:
print(df['weight'].min(),
      df['weight'].max(),
      df['weight'].mean(),
      df['weight'].std(),
      df['weight'].count())
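
The individual statistics above can also be collected in one call with `Series.agg`, which returns a labeled Series that can be indexed by statistic name.  The sketch below uses a short made-up Series named `weights`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weights, including one missing value.
weights = pd.Series([40.0, 48.0, np.nan, 29.0], name="weight")

# One call computes all five statistics; missing values are skipped.
summary = weights.agg(["min", "max", "mean", "std", "count"])
print(summary)

# Individual results are looked up by label.
print(summary["mean"])
```

Note that `count` reports the number of non-missing values, so it can be smaller than the length of the Series.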

In [None]:
grouped_data = df.groupby('sex')

In [None]:
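# Summary statistics for all numeric columns by sex
# (a cell displays only its last expression, so this result is not shown)
grouped_data.describe()
# Mean of each numeric column by sex; numeric_only skips text columns such as species_id
grouped_data.mean(numeric_only=True)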
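
To see what the grouped operations compute, it helps to run them on data small enough to check by hand.  The sketch below uses a made-up DataFrame named `toy` with the same column names as the survey data; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data; the last row has no recorded sex.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", np.nan],
    "species_id": ["DM", "DM", "DO", "DO", "DM"],
    "weight": [40.0, 48.0, 30.0, 52.0, 10.0],
    "hindfoot_length": [35.0, 37.0, 33.0, 35.0, 30.0],
})

by_sex = toy.groupby("sex")  # rows with a missing key are dropped by default

# Mean of every numeric column per group; species_id is text and is skipped.
means = by_sex.mean(numeric_only=True)
print(means)

# Several statistics for a single column.
weight_stats = by_sex["weight"].agg(["mean", "count"])
print(weight_stats)
```

The row without a recorded sex does not appear in either result, because `groupby` excludes missing keys unless told otherwise.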

In [None]:
df = pd.read_csv('data/surveys_complete.csv', index_col=0)
# Count the number of samples by species
species_counts = df.groupby('species_id')['record_id'].count()
print(species_counts)

In [None]:
# Count single species
df.groupby('species_id')['record_id'].count()['DO']
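
Counting rows per species can be written in more than one way.  The sketch below, on a made-up DataFrame named `toy` with invented values, compares the `groupby` approach used above with `value_counts`, a shorter alternative that is not part of the original lesson.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["DO", "DM", "DO", "DM", "DM"],
})

# Non-missing record_id values per species, as in the cells above.
counts = toy.groupby("species_id")["record_id"].count()
do_count = counts["DO"]
print(do_count)

# Rows per species_id value, sorted from most to least frequent.
by_value = toy["species_id"].value_counts()
print(by_value)
```

The two results agree whenever `record_id` has no missing values; `value_counts` sorts by frequency, while `groupby` sorts by the group key.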

In [None]:
# Calculate double weight
print(df['weight'].head())
dw = df['weight']*2
dw.head()
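
Arithmetic on a column is applied element by element, and the result can be stored alongside the original data.  The sketch below uses a made-up DataFrame named `toy` with invented values; the column name `double_weight` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Hypothetical weights, including one missing value.
toy = pd.DataFrame({"weight": [40.0, np.nan, 29.0]})

# assign returns a new DataFrame with the extra column; the original column is unchanged.
toy = toy.assign(double_weight=toy["weight"] * 2)
print(toy)
```

Missing values stay missing after the multiplication, so no separate handling is needed for them at this step.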

Go to [Section 3](GCP_Data_Analysis_03.ipynb)