# Data Visualization in Python
For this workshop we are going to explore the basics of **seaborn**, a statistical graphing library for Python. Seaborn was created by [Michael Waskom](https://stanford.edu/~mwaskom/), a PhD student at Stanford. Seaborn is built on top of **matplotlib**, but provides a high-level API and aesthetically pleasing default styles. It works directly with **pandas** DataFrames and takes care of some routine chores, such as dropping missing values before plotting in many of its functions. As a result, seaborn fits in well with the typical scientific software stack: SciPy, NumPy, pandas, and matplotlib.

The documentation for seaborn is very thorough and can be found [here](https://stanford.edu/~mwaskom/software/seaborn/api.html). 

### Setup

Install seaborn with `conda install seaborn`

If you are on **Windows**, run that command from the Anaconda Command Prompt

If you are on **Mac OS X**, run that command from the terminal

# Comparing matplotlib and seaborn
### With matplotlib
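
A minimal sketch of a plain matplotlib histogram, assuming random normally distributed data. The data, seed, and labels are illustrative choices, not taken from the workshop.

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: 200 draws from a standard normal distribution
rng = np.random.default_rng(0)
data = rng.normal(size=200)

# With matplotlib, the figure, axes, and labels are all set up by hand
fig, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(data, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram with matplotlib")
```

In a notebook the figure renders inline; in a script, call `plt.show()` or `fig.savefig(...)` to see it.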

### With seaborn

### Color Palettes
Seaborn includes a few color palettes such as muted, pastel, and dark. [Documentation on color palettes](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.color_palette.html#seaborn.color_palette)
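
As a small sketch of how the named palettes can be used: `color_palette` returns a list of RGB tuples, and `set_palette` makes a palette the default for later plots.

```python
import seaborn as sns

# Each named palette is a list of (red, green, blue) tuples with values in [0, 1]
muted = sns.color_palette("muted")
pastel = sns.color_palette("pastel")
dark = sns.color_palette("dark")

print(len(muted), "colors in the muted palette")
print("first muted color:", muted[0])

# Make pastel the default color cycle for subsequent plots
sns.set_palette("pastel")
```

In a notebook, `sns.palplot(muted)` draws the palette as a row of color swatches.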

# Basic Plotting
These are just simple plots with random data. 

**KDE** stands for kernel density estimation, a non-parametric method for estimating the *probability density function* (PDF) of a variable.

Seaborn calculates the KDE by placing a normal (Gaussian) curve at each data value and summing the curves. The sum is scaled so that the area under the resulting curve is equal to 1.
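
The idea can be sketched in plain Python. This is an illustration of the technique only, not seaborn's actual implementation; the sample values and the fixed bandwidth of 0.5 are assumptions chosen for the example.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE at a point x."""
    n = len(samples)
    norm = bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one normal curve per sample, then divide by n so the area is 1
        total = 0.0
        for s in samples:
            z = (x - s) / bandwidth
            total += math.exp(-0.5 * z * z) / norm
        return total / n

    return density

# Illustrative data and bandwidth (both hypothetical)
samples = [1.0, 1.5, 2.0, 4.0, 4.2]
density = gaussian_kde(samples, bandwidth=0.5)

# Approximate the area under the curve with a Riemann sum over [-5, 10]
step = 0.01
grid = [-5.0 + i * step for i in range(1501)]
area = sum(density(x) for x in grid) * step
print(f"area under the KDE curve: {area:.4f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve.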

## Other Plots
[Factor plots](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.factorplot.html) (`factorplot` was renamed `catplot` in seaborn 0.9)
![Factor plot](https://i.imgur.com/ybqZFF3.png)

[Time Series Plots](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.tsplot.html) (`tsplot` was deprecated in seaborn 0.9 in favor of `lineplot`)
![Time Series Plot](https://i.imgur.com/XFk2TBk.png)

# Linear Regression
Seaborn can load example datasets from [https://github.com/mwaskom/seaborn-data](https://github.com/mwaskom/seaborn-data) with `load_dataset`. The dataset is returned as a pandas DataFrame.
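
A minimal regression plot, sketched with `regplot`. Because `load_dataset` needs network access, this example assumes a small synthetic DataFrame as a stand-in; the column names `x` and `y` and the generated values are hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: a noisy straight line
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)})

# Scatter plot with a fitted regression line and a confidence band
ax = sns.regplot(x="x", y="y", data=df)
ax.set_title("Linear regression with seaborn")
```

With network access, the same call works on a loaded dataset, for example `tips = sns.load_dataset("tips")` followed by `sns.regplot(x="total_bill", y="tip", data=tips)`.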

# Heatmaps
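
A minimal heatmap sketch, assuming a small matrix of random values. The data, the matrix shape, and the `viridis` colormap are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Illustrative data: a 5 x 8 matrix of uniform random values in [0, 1)
rng = np.random.default_rng(0)
values = rng.random((5, 8))

# Each cell is colored by its value; the colorbar shows the scale
fig, ax = plt.subplots()
sns.heatmap(values, ax=ax, cmap="viridis", cbar=True)
ax.set_xlabel("column")
ax.set_ylabel("row")
```

`heatmap` also accepts a pandas DataFrame, in which case the index and column names are used as tick labels.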

# Titanic Dataset

The Titanic dataset contains almost 900 rows, one per passenger, with the following features:
- survived - 1 = Survived, 0 = Died
- pclass - First class (1), Second class (2), or Third class (3).
- sex - The gender of the passenger
- age - The age of the passenger
- sibsp - Number of siblings and spouses the passenger had on board.
- parch - Number of parents and children the passenger had on board.
- fare - How much the passenger paid for the ticket.
- embarked - First letter of the city where the passenger boarded the Titanic
- who - Contains either man, woman, or child
- adult_male - Boolean value for whether or not the passenger is an adult male
- embark_town - Entire name of the city where the passenger boarded the Titanic
- alive - Same as survived, but with a yes or no rather than 0 or 1
- alone - Boolean value if the passenger was alone on the Titanic

### Linear Regression Equation

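Seaborn is purely a visualization library. It draws the regression line but does not return the fitted slope and intercept, so there is no way to get the equation from the plot itself. To get the numbers, clean the data and fit the same model with a statistics library such as SciPy.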

![Seaborn model](https://i.imgur.com/iiXN561.png)
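
One way to recover the equation of a straight-line fit outside seaborn, sketched here with NumPy's `polyfit` (an assumption for this example; `scipy.stats.linregress` is another option). The arrays are hypothetical and include one missing value to show the cleaning step.

```python
import numpy as np

# Hypothetical measurements; NaN marks a missing value
x = np.array([0.0, 1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])

# Clean the data: keep only rows where both x and y are present
mask = ~(np.isnan(x) | np.isnan(y))

# Fit a degree-1 polynomial; coefficients come back highest power first
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The remaining points lie exactly on y = 2x + 1, so the fit recovers a slope of 2 and an intercept of 1 up to floating-point rounding.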

### Robust Regression

SciPy's ordinary least-squares regression is not a robust regression, but the same process applies: clean the data, then use a library that supports the model you need.

[StatsModels](http://statsmodels.sourceforge.net) supports a huge range of statistical functions and can handle things like robust regression.

### Who paid the most?

### Who survived?

Check out the dataset on [Kaggle](https://www.kaggle.com/c/titanic) for a machine learning tutorial with the Titanic dataset.

# Examples of Visualizations

[Craigslist Missed Connections](http://www.vox.com/a/craigslist-missed-connections/i-analyzed-10-000-craigslist-missed-connections-here-s-what-i-learned)

[Who Marries Who](http://www.bloomberg.com/graphics/2016-who-marries-whom/)

[18th and 19th Century Ship Logs](https://i.imgur.com/dmaEsgO.png)