# 6. Seaborn

### Objectives
* Seaborn makes beautiful plots with concise syntax
* Seaborn (mostly) requires tidy data
* Seaborn integrates directly with DataFrames
* Must use matplotlib for customized plots
* Sometimes Pandas is the correct choice for plots
* Seaborn has two broad types of plots: axes (simple) and grid (complex and composed of multiple axes plots)
* Know the difference between categorical and regression variables and plots
* Know how to add dimensionality with **hue, col, row** seaborn plotting parameters

### Resources
* Visit the [seaborn home page](http://seaborn.pydata.org/index.html) and read the introduction
* Keep the [Seaborn API page](http://seaborn.pydata.org/api.html) open throughout the notebook
* Read the [pandas visualization docs](http://pandas.pydata.org/pandas-docs/stable/visualization.html)

### Introduction

This notebook will dive into the Seaborn visualization library in Python. This notebook assumes no previous visualization knowledge.

Seaborn has a high-level, easy-to-use interface for creating powerful and beautiful visualizations. Seaborn does not actually draw anything itself; instead, it calls the primary Python visualization library, matplotlib, to do all the heavy lifting. For this reason, Seaborn is sometimes referred to as a **wrapper** for matplotlib, a library that is more difficult to use but provides more fine-grained control.

The Seaborn documentation is excellent, and you will be well served by reading all of it. The library is fairly minimal and exposes a relatively small number of functions.

## Seaborn and Tidy data
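Seaborn plotting functions work best with tidy data, where each variable is a column and each observation is a row. With data in this form, a column name can be mapped directly to a plot parameter such as `x`, `y`, or `hue`.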
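
As a small illustration of what tidy means, the sketch below reshapes a wide table into tidy form with `melt`. The `wide` DataFrame and its values are made up for this example and are not part of the employee data.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so it is not tidy
wide = pd.DataFrame({
    'dept': ['Police', 'Fire'],
    '2015': [50000, 48000],
    '2016': [52000, 49500],
})

# melt reshapes it so that each row is one observation: (dept, year, salary)
tidy = wide.melt(id_vars='dept', var_name='year', value_name='salary')
print(tidy)
```

In the tidy version, `year` and `salary` are ordinary columns, so they can be handed to a plotting function by name.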

## Seaborn integration with Pandas
Nearly all Seaborn plotting functions contain a **`data`** parameter that accepts a pandas DataFrame. This allows you to use the **strings** of the column names for the function arguments.

## The four common Seaborn plotting function parameters - `x`, `y`, `hue`, and `data`
The Seaborn API is easy to use and most of the plotting functions look very similar. The syntax will look like this:

```sns.plotting_func(x='col1', y='col2', hue='col3', data=df)```

You will always pass your DataFrame to the `data` parameter. For univariate plots, you can use exactly one of `x` or `y`. The `hue` parameter adds an extra level of dimensionality by splitting and coloring the data by a third variable. 
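
Here is a minimal sketch of that calling pattern on a tiny DataFrame. The `toy` data is hypothetical (its column names only mimic the employee dataset), and `boxplot` is used purely as one example of a plotting function.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'salary': [52000, 48000, 61000, 59000, 45000, 47000, 70000, 66000],
    'level': ['Novice', 'Novice', 'Veteran', 'Veteran', 'Novice', 'Novice', 'Veteran', 'Veteran'],
})

# Column names are passed as strings; seaborn looks them up in `data`
ax = sns.boxplot(x='gender', y='salary', hue='level', data=toy)

# The axis labels are taken from the column names
print(ax.get_xlabel(), ax.get_ylabel())
```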

## Getting started with Axes plots for Univariate visualization
Let's begin by making plots with a single dimension of data.


In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp['experience'] = 2016 - emp['hire_date'].dt.year
emp['experience_level'] = pd.cut(emp['experience'], 
                                 bins=[0, 5, 15, 100], 
                                 labels=['Novice', 'Experienced', 'Veteran'],
                                 include_lowest=True)  # keep employees with 0 years of experience
emp.head()

## Univariate numeric plots
A few of the plots work with just a single dimension of data. Pass the DataFrame to the **`data`** parameter. Let's create a boxplot. Most Seaborn functions have both an **`x`** and a **`y`** parameter, but for some plots you only need to specify one of them.

In [None]:
sns.boxplot(x='salary', data=emp)

Make the boxplot vertical by using **`y`** instead.

In [None]:
sns.boxplot(y='salary', data=emp)

### Violin plot
Violin plots help you visualize a distribution and also work with a single dimension of data.

In [None]:
sns.violinplot(x='experience', data=emp)  # pass the column by keyword

### Counting values with `countplot`
Columns with strings are usually visualized by the frequency counts of their unique values. Seaborn's `countplot` will do this for us.

In [None]:
sns.countplot(x='race', data=emp)  # pass the column by keyword

#### Comparison to Pandas

In [None]:
emp['race'].value_counts().plot(kind='bar')

Seaborn plots return the underlying matplotlib Axes object. We can assign this returned object to a variable and use it to modify our plots. It is rather unfortunate that we have to dip into matplotlib to do this. It would be much nicer if Seaborn had this functionality built in.

In [None]:
ax = sns.countplot(x='race', data=emp)
ax.tick_params(rotation=90)
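
The same pattern extends to other matplotlib customizations. The sketch below is only an illustration on a hypothetical `toy` DataFrame: it creates the Axes up front, lets `countplot` draw into it, and then sets a title, a y-axis label, and rotated x tick labels with ordinary matplotlib methods.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({'dept': ['Police', 'Fire', 'Police', 'Library', 'Fire', 'Police']})

# Create the figure and Axes with matplotlib, then let seaborn draw into it
fig, ax = plt.subplots(figsize=(6, 3))
ax = sns.countplot(x='dept', data=toy, ax=ax)

# Everything below is plain matplotlib
ax.set_title('Employees by department')
ax.set_ylabel('Number of employees')
ax.tick_params(axis='x', labelrotation=45)  # rotate only the x tick labels
```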

### KDE and Histogram
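The **`histplot`** function with `kde=True` plots the histogram with a KDE curve on top. Older tutorials use **`distplot`** for this, but it has been deprecated since Seaborn 0.11. You can pass the Series directly or use the **`data`** parameter with a column name.

In [None]:
sns.histplot(emp['experience'], kde=True)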

# Multivariate plotting
The above plots involved only a single variable. We will add another dimension to our data by using both `x` and `y` parameters.

### Mixing a categorical feature
The following plots use one categorical column and one numeric column. Seaborn groups the numeric values by category for us and, for some plots, **aggregates** them. Let's compare salary by gender using several different plots.

In [None]:
sns.boxplot(x='salary', y='gender', data=emp)

In [None]:
sns.boxplot(x='gender', y='salary', data=emp)

In [None]:
sns.violinplot(x='gender', y='salary', data=emp)

By default, the bar plot shows the average (mean) of each group.

In [None]:
sns.barplot(x='gender', y='salary', data=emp)

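You can change the aggregation with the `estimator` parameter. Older versions of Seaborn do not accept strings such as `'max'` here, so pass a function such as NumPy's `np.max` instead. Versions 0.12 and later also accept a string, so check which version you have.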

In [None]:
sns.__version__

In [None]:
sns.barplot(x='gender', y='salary', data=emp, estimator=np.max)
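
To see what the bars summarize, it can help to compute the same aggregation with pandas. The sketch below uses a hypothetical `toy` DataFrame (invented values) and compares a `groupby` result with the `barplot` call.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'salary': [50000, 45000, 60000, 55000, 70000, 80000],
})

# What the bars summarize, computed directly with pandas
summary = toy.groupby('gender')['salary'].agg(['mean', 'max'])
print(summary)

# The same aggregation drawn by seaborn; any function that reduces
# a group of values to a single number can be used as the estimator
ax = sns.barplot(x='gender', y='salary', data=toy, estimator=np.max)
```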

## Multivariate numeric
We now turn to plotting two numeric variables. Scatter and line plots are the most common.

The **`regplot`** function creates a scatterplot and draws a regression line through the points.

In [None]:
sns.regplot(x='experience', y='salary', data=emp)
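
By default the line is a straight-line least-squares fit. As a rough sketch of what that means, the example below fits a line with `np.polyfit` on hypothetical, perfectly linear data and then draws the same data with `regplot`. The `toy` values are invented, and `np.polyfit` is used here only to illustrate the idea, not as a description of Seaborn's internals.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical, perfectly linear data: salary = 40000 + 2000 * experience
toy = pd.DataFrame({'experience': [1, 2, 3, 4, 5, 6]})
toy['salary'] = 40000 + 2000 * toy['experience']

# Least-squares straight line through the points
slope, intercept = np.polyfit(toy['experience'], toy['salary'], deg=1)
print(round(slope), round(intercept))

# ci=None skips the bootstrapped confidence band to keep the sketch fast
ax = sns.regplot(x='experience', y='salary', data=toy, ci=None)
```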

# Add another dimension with `hue`
The `hue` parameter is found in most Seaborn functions and allows you to slice the data by one more dimension. The **`hue`** parameter does not create a new Axes; rather, it splits the data within the current Axes. Notice how it divides each gender by race.

In [None]:
sns.boxplot(x='salary', y='gender', hue='race', data=emp)

In [None]:
sns.barplot(x='gender', y='salary', hue='race', data=emp)

In [None]:
ax = sns.barplot(x='race', y='salary', hue='gender', data=emp)
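
One way to think about `hue` is as a second grouping key. The sketch below, on a hypothetical `toy` DataFrame with invented values, computes the group means with pandas and then confirms that the seaborn plot still lives in a single Axes.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'dept':   ['Police', 'Police', 'Police', 'Police', 'Fire', 'Fire', 'Fire', 'Fire'],
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'salary': [50000, 52000, 54000, 56000, 60000, 58000, 64000, 62000],
})

# hue acts like a second grouping key: one bar per (dept, gender) pair
means = toy.groupby(['dept', 'gender'])['salary'].mean().unstack()
print(means)

fig, ax = plt.subplots()
sns.barplot(x='dept', y='salary', hue='gender', data=toy, ax=ax)
print(len(fig.axes))  # hue adds colors within the Axes, not new Axes
```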

# Create a Grid to add even more dimensionality
There are only a few Grid plots in Seaborn. The main ones are `catplot` and `lmplot`. Both of these functions simply turn one of the plots we've already created into a grid by using the parameters `row` and `col`.

## Recreating the above plots with `catplot`
Besides `row` and `col`, the `catplot` function has the `kind` parameter, which controls the kind of plot. Let's re-create the box plot from above with `catplot`. The syntax is exactly the same except for the added `kind` parameter.

In [None]:
sns.catplot(x='salary', data=emp, kind='box')

### Recreate the bar plot from above

In [None]:
sns.catplot(x='gender', y='salary', hue='race', data=emp, kind='bar')

### By default, the `catplot` does a `stripplot`
Strip plots are another way of showing the distribution of a variable. I don't use them much, but they happen to be the default kind for `catplot`.

In [None]:
sns.stripplot(x='race', y='salary', data=emp)

Notice that the plot below is the same kind of plot as the one above, although the random jitter may place individual points slightly differently.

In [None]:
sns.catplot(x='race', y='salary', data=emp)
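
Although the two plots look alike, the functions return different objects. The sketch below, on a hypothetical `toy` DataFrame, shows that the axes-level `stripplot` returns a matplotlib Axes while the grid-level `catplot` returns a seaborn `FacetGrid`.

```python
import matplotlib.axes
import pandas as pd
import seaborn as sns

# Hypothetical toy data, invented for this sketch
toy = pd.DataFrame({
    'dept': ['Police', 'Fire', 'Police', 'Fire', 'Police', 'Fire'],
    'salary': [50000, 60000, 55000, 62000, 58000, 61000],
})

ax = sns.stripplot(x='dept', y='salary', data=toy)  # axes-level: returns an Axes
g = sns.catplot(x='dept', y='salary', data=toy)     # grid-level: returns a FacetGrid

print(isinstance(ax, matplotlib.axes.Axes), isinstance(g, sns.FacetGrid))
```

This difference matters when customizing: Axes methods such as `set_title` apply to `ax`, while a `FacetGrid` manages a whole figure of Axes.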

## Use `col` or `row` to create the Grid
The real power of `catplot` is the ability to create grids with the `row` and `col` parameters. This allows us to split the data into separate plots. For instance, let's calculate the average salary by race and gender but split into separate plots by experience level.

In [None]:
sns.catplot(x='race', y='salary', hue='gender', data=emp, kind='bar', col='experience_level')

## Use both `row` and `col` for maximum level of slicing
You can use the **`row`** parameter as well to slice the data further. The following draws bar plots for each combination of department, experience level, race, and gender.

In [None]:
sns.catplot(x='race', y='salary', hue='gender', 
            data=emp, kind='bar', col='experience_level', row='dept', sharey=False)
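
The object returned by `catplot` holds one Axes per combination of `row` and `col` levels. The sketch below builds a hypothetical `toy` DataFrame (all names and values are invented, including the `group` column) and inspects the shape of that grid.

```python
import itertools
import pandas as pd
import seaborn as sns

# Hypothetical data: two observations for every combination of the four columns
combos = itertools.product(['Police', 'Fire'], ['Novice', 'Veteran'],
                           ['A', 'B'], ['Female', 'Male'])
rows = []
for i, (dept, level, group, gender) in enumerate(combos):
    for bump in (0, 2000):
        rows.append({'dept': dept, 'experience_level': level, 'group': group,
                     'gender': gender, 'salary': 40000 + 1000 * i + bump})
toy = pd.DataFrame(rows)

g = sns.catplot(x='group', y='salary', hue='gender', data=toy, kind='bar',
                col='experience_level', row='dept')

# g.axes is a 2D array: (number of row levels, number of col levels)
print(g.axes.shape)

# FacetGrid methods apply to every Axes in the grid at once
g.set_axis_labels('Group', 'Mean salary')
```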

## Matrix plots
Let's open up the Mini Web App Finding Similar Members with the Meetup API notebook.

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Using Seaborn, plot the number of schools per state.</span>

### Problem 2
<span  style="color:green; font-size:16px">Use pandas to create the same plot from problem 1, but make it an ordered horizontal bar chart.</span>

### Problem 3
<span  style="color:green; font-size:16px">Make a boxplot per state of SAT Math.</span>

### Problem 4
<span  style="color:green; font-size:16px">Draw the relationship between SAT Math and Verbal scores with a regression line.</span>

# Ask questions that you can answer with Seaborn
Use the insurance dataset: