# DATA VISUALIZATION WITH SEABORN

Seaborn is a Python data visualization library with an emphasis on statistical plots. It is an excellent resource for common regression and distribution plots, but where Seaborn really shines is in its ability to visualize many different features at once. It is built to produce attractive visualizations while making developers’ lives easier. Seaborn is built on top of Matplotlib and provides a high-level API that makes “a well-defined set of hard things easy”, in part because most of its functions work well with only a minimal set of arguments.

The attractive visualizations come from the built-in themes, the ability to build custom color palettes, and the clever way these are used to display statistical plots (e.g. the kernel density estimate in a violin plot). Seaborn is part of the PyData stack and accepts Pandas data structures as inputs to its API.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import timeit

%matplotlib inline

sns.set_style('darkgrid')

## TYPES OF PLOTS

1. **Distribution Plot (distplot)** - The first thing you want to see when exploring your data is the distribution of your variables. 
2. **FactorPlot / FacetGrid** - To break a plot down by categories, we don't need boolean queries or groupbys; we can use FacetGrid.
3. **[PairPlot](http://seaborn.pydata.org/generated/seaborn.pairplot.html) / [PairGrid](http://seaborn.pydata.org/generated/seaborn.PairGrid.html)** - Plot pairwise relationships in a dataset. PairPlot is a high-level interface to PairGrid that is intended to make it easy to draw a few common styles. Use `PairGrid` directly if you need more flexibility.
4. **JointPlot / JointGrid** - Displays data points according to two variables, together with the distribution of each variable, kernel density estimates, and an optional regression fit. Passing `kind="reg"` requests a regression fit to the data.
5. **heatmap** - Heatmaps are ideal for plotting “rectangular data” such as matrices. They make it easy to spot where values, or calculated values such as averages and counts, are more extreme.

## PART 1: CHARACTERISTICS OF TITANIC DATA

### DISTRIBUTION PLOT (distplot)

For example, let’s look at the age distribution of the Titanic’s passengers:

In [None]:
# Load dataset
titanic = sns.load_dataset('titanic')

In [None]:
# Drop missing ages before plotting; the inline backend displays the figure
sns.distplot(titanic.age.dropna())

To see the raw number of rows in each bin instead of a density, pass `kde=False` (no kernel density estimate is drawn).
NaN values must be dropped first; otherwise distplot raises a ValueError.
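To make the two points above concrete, here is a minimal sketch that computes per-bin row counts by hand with NumPy, which is roughly the quantity a histogram drawn with `kde=False` displays. The `ages` Series and the bin edges are made-up illustration values, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values (not the real Titanic column)
ages = pd.Series([2, 15, 22, 22, 35, np.nan, 41, 58, np.nan, 63])

# Drop missing values before binning
clean_ages = ages.dropna()

# Raw number of rows in each bin; the edges are chosen arbitrarily here
bin_edges = [0, 20, 40, 60, 80]
counts, edges = np.histogram(clean_ages, bins=bin_edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>2}-{right:<2}: {n} rows")
```

The counts add up to the number of non-missing rows, so nothing is silently lost besides the NaN entries.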

### FACETGRID (FacetGrid)

In [None]:
# One panel per (survived, class) combination, each showing the age distribution
g = sns.FacetGrid(titanic, row='survived', col='class')
g.map(sns.distplot, "age")

### JOINTPLOT (jointplot)

In the Titanic dataset, the regression line shows a slight upward trend, but there is almost no correlation between the variables “age” and “fare”, as the Pearson correlation coefficient shows.

In [None]:
# kind='reg' adds a regression fit to the scatter plot and to the marginal distributions
sns.jointplot(data=titanic, x='age', y='fare', kind='reg', color='g')

### HEATMAP (heatmap)

We can build a pivot table of the median fare paid by passengers per embark_town and age_group, and turn it into a heatmap very easily. Most of the time we like our heatmaps annotated, to catch subtleties that the colors alone might hide. The “fmt” argument sets the number format of the annotations.

In [None]:
bins = [0, 12, 17, 60, np.inf]
labels = ['child', 'teenager', 'adult', 'elder']
age_groups = pd.cut(titanic.age, bins, labels=labels)
titanic['age_group'] = age_groups
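As a small sketch of how the binning above treats values on the edges, the example below applies the same `bins` and `labels` to a handful of made-up ages (`sample_ages` is hypothetical). With its default settings, `pd.cut` uses right-closed intervals, so an age of exactly 12 is still a `child`.

```python
import numpy as np
import pandas as pd

bins = [0, 12, 17, 60, np.inf]
labels = ['child', 'teenager', 'adult', 'elder']

# Hypothetical ages chosen to sit on and around the bin edges
sample_ages = pd.Series([5, 12, 13, 17, 18, 60, 61, np.nan])

# Default intervals are right-closed: (0, 12], (12, 17], (17, 60], (60, inf]
sample_groups = pd.cut(sample_ages, bins, labels=labels)
print(pd.DataFrame({'age': sample_ages, 'age_group': sample_groups}))
```

Missing ages stay missing after the cut, so they do not end up in any group.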

In [None]:
# Median fare for every (embark_town, age_group) combination
df = titanic.pivot_table(index='embark_town', columns='age_group', values='fare', aggfunc='median')
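To see what such a pivot table computes, here is a minimal sketch on a tiny made-up frame. `toy_fares` and its values are hypothetical, only the column names mirror the Titanic data.

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns used above
toy_fares = pd.DataFrame({
    'embark_town': ['Cherbourg', 'Cherbourg', 'Cherbourg', 'Southampton', 'Southampton'],
    'age_group':   ['adult', 'adult', 'child', 'adult', 'child'],
    'fare':        [30.0, 50.0, 20.0, 8.0, 12.0],
})

# Rows become towns, columns become age groups, cells hold the median fare
toy_pivot = toy_fares.pivot_table(index='embark_town', columns='age_group',
                                  values='fare', aggfunc='median')
print(toy_pivot)
```

The result is exactly the kind of rectangular table that `sns.heatmap` expects.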

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
sns.heatmap(df, annot=True, fmt=".1f")

Finally, something really cool that you can put into a heatmap is a correlation matrix. A Pandas DataFrame has a `corr` method that calculates the correlation coefficient (Pearson’s by default; other methods are available) between every pair of numeric columns.

In [None]:
plt.figure(figsize=(10, 5))
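# Keep only numeric and boolean columns; recent pandas versions raise an error on text columns
sns.heatmap(titanic.select_dtypes(include=['number', 'bool']).corr(), annot=True, fmt=".2f")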
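A short sketch of what `corr` returns, and of switching to another coefficient, using made-up numbers (`toy_num` is hypothetical). The Pearson coefficient measures linear association, while the rank-based Spearman coefficient only asks whether the relationship is monotonic.

```python
import pandas as pd

# Hypothetical numeric data: y is an exact linear function of x,
# z grows with x but not linearly
toy_num = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
toy_num['y'] = 2 * toy_num['x'] + 1
toy_num['z'] = toy_num['x'] ** 3

pearson = toy_num.corr()                     # Pearson is the default
spearman = toy_num.corr(method='spearman')   # rank-based alternative

print(pearson.round(3))
print(spearman.round(3))
```

Either matrix can be passed to `sns.heatmap` in the same way as above.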

## PART 2: CHARACTERISTICS OF AUTO MPG (Miles Per Gallon) DATA

In this section, we'll cover three of Seaborn's most useful functions: factorplot, pairplot, and jointplot. Going a step further, we'll show how we can get even more mileage out of these functions by stepping up to their even-more-powerful forms: **FacetGrid**, **PairGrid**, and **JointGrid**.

We use the [UCI "Auto MPG"](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) data set.

In [None]:
names = [
        'mpg',
        'cylinders',
        'displacement',
        'horsepower',
        'weight',
        'acceleration',
        'model_year',
        'origin',
        'car_name'
]
# The file is whitespace-delimited and has no header row
df = pd.read_csv("../data/auto-mpg.data", sep=r'\s+', names=names)
df['maker'] = df.car_name.map(lambda x: x.split()[0])
df.origin = df.origin.map({1: 'America', 2: 'Europe', 3: 'Asia'})
# Missing values are recorded as '?'; turn them into NaN and drop those rows
df = df.replace('?', np.nan).dropna()
df['horsepower'] = df.horsepower.astype(float)
df.head()
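The cleaning step can be sketched in isolation on a few made-up rows (`raw` is hypothetical). Because of the `'?'` placeholder, the horsepower column arrives as text and has to be converted explicitly once the placeholder rows are gone.

```python
import numpy as np
import pandas as pd

# Hypothetical rows imitating the raw file, where a missing horsepower is '?'
raw = pd.DataFrame({
    'mpg': [18.0, 25.0, 21.0],
    'horsepower': ['130.0', '?', '95.0'],
})

# Replace the placeholder, drop incomplete rows, then convert text to floats
cleaned = (
    raw.replace('?', np.nan)
       .dropna()
       .astype({'horsepower': float})
)
print(cleaned)
```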

## factorplot and FacetGrid

One of the most powerful features of Seaborn is the ability to easily build conditional plots; this lets us see what the data look like when segmented by one or more variables. The easiest way to do this is through factorplot. Let's say that we're interested in how cars' MPG has varied over time. Not only can we easily see this in aggregate:

In [None]:
sns.factorplot(data=df, x="model_year", y="mpg")

In [None]:
# other kind={point, bar, count, box, violin, strip}
sns.factorplot(data=df, x="model_year", y="mpg", kind='box')

But we can also segment by, say, region of origin:

In [None]:
sns.factorplot(data=df, x="model_year", y="mpg", col="origin")

What's so great about factorplot is that rather than having to segment the data ourselves and make the conditional plots individually, Seaborn provides a convenient Application Programming Interface (API) for doing it all at once.
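For comparison, here is a rough sketch of the manual segmentation that this spares us: group by the conditioning variable and by the x variable, then average y, which is the value a default point plot shows for each x. The `cars` frame is a small hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for the car data
cars = pd.DataFrame({
    'origin':     ['America', 'America', 'America', 'Asia', 'Asia', 'Asia'],
    'model_year': [70, 70, 71, 70, 71, 71],
    'mpg':        [15.0, 17.0, 14.0, 24.0, 27.0, 31.0],
})

# One group per (origin, model_year) pair, reduced to its mean MPG
mean_mpg = cars.groupby(['origin', 'model_year'])['mpg'].mean()

# One "panel" per origin: the points a per-origin plot would connect
for origin, panel in mean_mpg.groupby(level='origin'):
    print(origin, panel.droplevel('origin').to_dict())
```

Each panel would still need its own axes, labels, and shared limits, which is the bookkeeping the plotting API takes over.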

The **FacetGrid** object is a slightly more complex, but also more powerful, take on the same idea. Let's say that we wanted to see KDE plots of the MPG distributions, separated by country of origin:

In [None]:
# Number of cars per cylinder count, one panel per origin
sns.factorplot(x="cylinders", data=df, col="origin", kind='count')

In [None]:
g = sns.FacetGrid(df, col="origin")
g.map(sns.distplot, "mpg")

Or let's say that we wanted to see scatter plots of MPG against horsepower with the same origin segmentation:

In [None]:
g = sns.FacetGrid(df, col="origin")
g.map(plt.scatter, "horsepower", "mpg")

Using **FacetGrid**, we can map any plotting function onto each segment of our data. For example, above we gave `plt.scatter` to g.map, which tells Seaborn to apply the matplotlib `plt.scatter` function to each of the segments in our data. We don't need to use `plt.scatter`, though; we can use any function that understands the input data. For example, we could draw regression plots instead:

In [None]:
g = sns.FacetGrid(df, col="origin")
g.map(sns.regplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
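Conceptually, `g.map` splits the data by facet and hands the named columns to the function as positional arguments. The helper below, `map_over_facets`, is a hypothetical simplification written only for illustration: it is not Seaborn's implementation, and it collects return values instead of drawing on axes. The `cars` frame is made up as well.

```python
import numpy as np
import pandas as pd

def map_over_facets(data, col, func, *columns):
    """Call func once per level of `col`, passing the named columns positionally."""
    results = {}
    for level, subset in data.groupby(col):
        results[level] = func(*(subset[name] for name in columns))
    return results

# Hypothetical stand-in for the car data
cars = pd.DataFrame({
    'origin':     ['America', 'America', 'America', 'Europe', 'Europe', 'Europe'],
    'horsepower': [130.0, 165.0, 150.0, 46.0, 87.0, 90.0],
    'mpg':        [18.0, 15.0, 16.0, 26.0, 25.0, 24.0],
})

def pearson_r(x, y):
    """Any function that understands its inputs can be mapped; here, a correlation."""
    return float(np.corrcoef(x, y)[0, 1])

result = map_over_facets(cars, 'origin', pearson_r, 'horsepower', 'mpg')
print(result)
```

This is why functions with an `(x, y)` signature, such as `plt.scatter` or `sns.regplot`, slot into `g.map` without any adapter.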

We can even segment by multiple variables at once, spreading some along the rows and some along the columns. This is very useful for comparing conditional distributions across interacting segmentations:

In [None]:
df['tons'] = (df.weight/2000).astype(int)
g = sns.FacetGrid(df, col="origin", row="tons")
g.map(sns.kdeplot, "horsepower", "mpg")
plt.xlim(0, 250)
plt.ylim(0, 60)
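The `tons` column used for the rows relies on integer truncation. A minimal sketch with made-up weights in pounds shows which bucket each car lands in:

```python
import pandas as pd

# Hypothetical weights in pounds
weights = pd.Series([1999, 2000, 3504, 4341, 5140])

# Dividing by 2000 and casting to int truncates toward zero,
# so each car falls into a whole-ton bucket
tons = (weights / 2000).astype(int)
print(tons.tolist())  # [0, 1, 1, 2, 2]
```

Each distinct value of `tons` becomes one row of panels in the grid.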

In [None]:
g = sns.FacetGrid(df, col="origin", row="tons")
g.map(plt.hist, "mpg", bins=np.linspace(0, 50, 11))

## pairplot and PairGrid

While **factorplot** and **FacetGrid** are for drawing conditional plots of segmented data, pairplot and PairGrid are for showing the interactions between variables. For our car data set, we know that MPG, horsepower, and weight are probably going to be related; we also know that both the values of these variables and their relationships with one another might vary by country of origin. Let's visualize all of that at once:

In [None]:
g = sns.pairplot(df[["mpg", "horsepower", "weight", "origin"]], hue="origin", diag_kind="hist")
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)

In [None]:
g = sns.PairGrid(df[["mpg", "horsepower", "weight", "origin"]], hue="origin")
g.map_upper(sns.regplot)
g.map_lower(sns.residplot)
g.map_diag(plt.hist)
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
g.add_legend()
g.set(alpha=0.5)

We were able to control three regions (the diagonal, the lower-left triangle, and the upper-right triangle) separately. Again, you can pipe in any plotting function that understands the data it's given.
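The three regions can be sketched as index sets over an n-by-n grid of axes. The snippet below is only an illustration of the layout using NumPy's triangle helpers, under the assumption that the row variable is drawn on the y axis and the column variable on the x axis.

```python
import numpy as np

variables = ['mpg', 'horsepower', 'weight']
n = len(variables)

# (row, column) positions of the three regions in an n-by-n grid of axes
upper = list(zip(*np.triu_indices(n, k=1)))    # above the diagonal
lower = list(zip(*np.tril_indices(n, k=-1)))   # below the diagonal
diag = list(zip(*np.diag_indices(n)))          # the diagonal itself

for name, cells in [('upper', upper), ('lower', lower), ('diag', diag)]:
    # Each cell pairs a row variable with a column variable
    print(name, [(variables[r], variables[c]) for r, c in cells])
```

With three variables, each triangle holds three panels and the diagonal holds one panel per variable.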

## jointplot and JointGrid

The final Seaborn objects are jointplot and JointGrid; these features let you easily view both a joint distribution and its marginals at once. Let's say, for example, that aside from being interested in how MPG and horsepower are distributed individually, we're also interested in their joint distribution:

In [None]:
# Pass x and y by keyword; newer Seaborn versions no longer accept them positionally
sns.jointplot(x="mpg", y="horsepower", data=df, kind='kde')

In [None]:
sns.jointplot(x="horsepower", y="mpg", data=df, kind="reg")
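The relationship between a joint distribution and its marginals can be sketched numerically with binned counts. The horsepower and MPG values and the bin edges below are made up for illustration.

```python
import numpy as np

# Hypothetical (horsepower, mpg) pairs
hp = np.array([60, 75, 90, 95, 130, 150, 165, 200])
mpg = np.array([36, 31, 27, 25, 18, 16, 14, 11])

hp_edges = [50, 100, 150, 200]
mpg_edges = [10, 20, 30, 40]

# Joint distribution: counts per (horsepower bin, mpg bin)
joint, _, _ = np.histogram2d(hp, mpg, bins=[hp_edges, mpg_edges])

# Marginals: sum the joint counts over the other variable
hp_marginal = joint.sum(axis=1)
mpg_marginal = joint.sum(axis=0)

print(joint)
print(hp_marginal, mpg_marginal)
```

The marginal counts match what a one-dimensional histogram of each variable gives on its own, which is what the side panels of a joint plot display.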

As before, JointGrid gives you a bit more control by letting you map the marginal and joint data separately. For example:

In [None]:
g = sns.JointGrid(x="horsepower", y="mpg", data=df)
g.plot_joint(sns.regplot, order=2)
g.plot_marginals(sns.distplot)
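Passing `order=2` asks `regplot` for a second-degree polynomial fit instead of a straight line. The idea can be sketched with `np.polyfit` on made-up, noise-free data; the coefficients below are arbitrary illustration values, not estimates from the car data.

```python
import numpy as np

# Hypothetical, noise-free data following y = 40 - 0.3*x + 0.001*x^2
x = np.array([50.0, 80.0, 110.0, 140.0, 170.0, 200.0])
y = 40 - 0.3 * x + 0.001 * x**2

# Degree-2 least-squares fit; coefficients come highest power first
coeffs = np.polyfit(x, y, deg=2)
fitted = np.polyval(coeffs, x)
print(coeffs)
```

A curved fit of this kind suits the horsepower and MPG relationship better than a line when the decline in MPG flattens out at high horsepower.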

## Summary

Seaborn is a great Python visualization library, and some of its most powerful features are:

- [factorplot](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.factorplot.html) and [FacetGrid](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.FacetGrid.html#seaborn.FacetGrid),
- [pairplot](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.pairplot.html) and [PairGrid](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.PairGrid.html#seaborn.PairGrid),
- [jointplot](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.jointplot.html) and [JointGrid](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.JointGrid.html#seaborn.JointGrid)

The official Seaborn [tutorial](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html) is a great place to start learning about simpler, but also extremely useful, functions such as distplot, regplot, and the other component functions we used above. 