<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<CENTER>
<H1 style="color:red">
Introduction to Seaborn
</H1>
</CENTER>

# <font color="red">Reference Documents</font>

- <a href="https://seaborn.pydata.org">seaborn: statistical data visualization</a>
- <a href="https://www.datacamp.com/community/tutorials/seaborn-python-tutorial">Python Seaborn Tutorial For Beginners</a>
- <a href="https://www.c-sharpcorner.com/article/a-complete-python-seaborn-tutorial/">A Complete Python Seaborn Tutorial</a>
- <a href="https://www.journaldev.com/18583/python-seaborn-tutorial">Python Seaborn Tutorial </a>

# <font color="red"> What is Seaborn?</font>
- Python package for producing statistical graphics.
- It comes equipped with preset styles and color palettes so you can create complex, aesthetically pleasing charts with a few lines of code.
- It makes visualization a central part of exploring and understanding data.
- Its dataset-oriented plotting functions operate on Pandas DataFrames and NumPy arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.
- It is built on top of Matplotlib, but it is meant to serve as a complement, not a replacement. Behind the scenes, Seaborn uses Matplotlib to draw plots.
- **Seaborn is an important tool used in Exploratory Data Analysis.**

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Seaborn Settings

- Matplotlib is highly customizable, but that flexibility makes it hard to know which settings to tweak to get a good-looking plot.
- Seaborn comes with a number of themes and a high-level interface for controlling the look of matplotlib figures.

The default `seaborn` theme, scaling, and color palette are applied with:

```python
sns.set()
```

It relies on the Matplotlib rcParams customization.
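As a rough illustration of that dependence on rcParams, the sketch below switches the style and then reads the result back through Matplotlib. The `Agg` backend line is an assumption made only so the snippet runs without a display; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# axes_style() with no argument returns the style parameters currently in effect
current_style = sns.axes_style()

# The same setting is visible through Matplotlib's own rcParams
grid_enabled = plt.rcParams["axes.grid"]
print("axes.grid" in current_style, grid_enabled)
```

Because the style ends up in `rcParams`, any plain Matplotlib call made afterwards picks it up without further changes.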

**Consider the following Matplotlib example**

In [None]:
def plot_cosine_sine():

    X = np.linspace(-np.pi, np.pi, 256,endpoint=True)
    C = np.cos(X)
    S = np.sin(X)

    # Plot cosine using blue color with a continuous line of width 1 (points)
    plt.plot(X, C, color="blue", linewidth=1.0, linestyle="-", label="cos");

    # Plot sine using green color with a continuous line of width 1 (points)
    plt.plot(X, S, color="green", linewidth=1.0, linestyle="-", label="sin");

    plt.legend(loc='best')

    # Set x limits
    plt.xlim(-4.0,4.0);

    # Set x ticks
    plt.xticks(np.linspace(-4,4,5,endpoint=True));

    # Set y limits
    plt.ylim(-1.05,1.05);

    # Set y ticks
    plt.yticks(np.linspace(-1,1,5,endpoint=True));

In [None]:
plot_cosine_sine()

**Let us run the same example with Seaborn settings**

In [None]:
sns.set()
plot_cosine_sine()

We can also set the background through the function `set_style`. Some possible settings are:

```python
sns.set_style("dark")
sns.set_style("whitegrid")
sns.set_style("white")
```

In [None]:
sns.set_style("dark")
plot_cosine_sine()

You can use the function `despine` to remove spines; by default it removes the top and right ones.

In [None]:
sns.set_style("white")
plot_cosine_sine()
sns.despine()

Using subplot and temporarily setting figure style:

In [None]:
with sns.axes_style("darkgrid"):
   plt.subplot(211)
   plot_cosine_sine()
plt.subplot(212)
plot_cosine_sine()
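One way to check that the style is only temporary is to look at a style-controlled rcParam before, inside, and after the `with` block. This is a minimal sketch; the `Agg` backend is an assumption so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")                    # baseline style: no grid
grid_before = plt.rcParams["axes.grid"]

with sns.axes_style("darkgrid"):          # style applies only inside the block
    grid_inside = plt.rcParams["axes.grid"]

grid_after = plt.rcParams["axes.grid"]    # baseline restored on exit
print(grid_before, grid_inside, grid_after)
```

Axes created inside the block keep the style they were created with, which is why only the first subplot in the cell above gets the dark grid.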

# <font color="red">Identifying Statistical Relationships</font>

- Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. 
- When data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.

In [None]:
mpg = sns.load_dataset("mpg")
mpg

## `scatterplot`

- The relationship between `x` and `y` can be shown for different subsets of the data using the `hue`, `size`, and `style` parameters. 
- These parameters control what visual semantics are used to identify the different subsets. 

In [None]:
sns.set()  

sns.scatterplot(x="horsepower", y="mpg", 
                hue="origin", size="weight",  
                sizes=(400, 40), palette="muted",  
                data=mpg);  
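The `style` semantic is not used in the cell above. The following sketch combines `hue`, `size`, and `style` on a small synthetic table; the column names (`group`, `magnitude`) and the random values are made up for illustration, and the `Agg` backend is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rs = np.random.RandomState(0)
demo = pd.DataFrame({
    "x": rs.uniform(0, 10, size=60),
    "y": rs.uniform(0, 10, size=60),
    "group": np.repeat(["a", "b", "c"], 20),   # categorical: mapped to color and marker
    "magnitude": rs.uniform(1, 5, size=60),    # numeric: mapped to marker size
})

fig, ax = plt.subplots()
sns.scatterplot(x="x", y="y", hue="group", style="group",
                size="magnitude", data=demo, ax=ax)
```

Mapping the same column to both `hue` and `style` keeps the groups distinguishable even when the figure is printed in grayscale.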

## `lineplot`

In [None]:
sns.lineplot(x="horsepower", y="mpg", 
            hue="origin", palette="muted",  
            data=mpg)  

## `relplot`

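- Figure-level interface designed to visualize many different statistical relationships: it draws a `scatterplot` by default (or a `lineplot` with `kind="line"`) onto a `FacetGrid`.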

In [None]:
sns.set(style="white")  

# Plot miles per gallon against horsepower with other semantics  
sns.relplot(x="horsepower", y="mpg", 
            hue="origin", size="weight",  
            sizes=(400, 40), alpha=.5, palette="muted",  
            height=6, data=mpg)  

# <font color="red">Manipulating Categorical Data</font>



## Categorical scatterplots

In [None]:
iris = sns.load_dataset("iris")
iris

### `stripplot`

- Draw a scatterplot where one variable is categorical.

In [None]:
# "Melt" the dataset to "long-form" or "tidy" representation  
iris_m = pd.melt(iris, "species", var_name="measurement")  
  
# Initialize the figure  
f, ax = plt.subplots()  
sns.despine(bottom=True, left=True)  
  
# Show each observation with a scatterplot  
sns.stripplot(x="measurement", y="value", hue="species",  
              data=iris_m, dodge=True, jitter=True,  
              alpha=.25, zorder=1)  
  
# Show the conditional means  
sns.pointplot(x="measurement", y="value", hue="species",  
              data=iris_m, dodge=.532, join=False, palette="dark",  
              markers="d", scale=.75, ci=None)  
  
# Improve the legend   
handles, labels = ax.get_legend_handles_labels()  
ax.legend(handles[3:], labels[3:], title="species",  
          handletextpad=0, columnspacing=1,  
          loc="lower right", ncol=3, frameon=True)
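To see what the "melt" step does, here is a small sketch on a two-row table with hand-typed illustrative values. It uses the same `pd.melt` call as the cell above, with a reduced set of columns.

```python
import pandas as pd

# Wide form: one row per flower, one column per measurement (illustrative values)
wide = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "sepal_length": [5.1, 6.3],
    "sepal_width": [3.5, 3.3],
})

# Long form: one row per (flower, measurement) pair
tidy = pd.melt(wide, "species", var_name="measurement")
print(tidy)
```

Each measurement column becomes a value of the new `measurement` column, so a single categorical axis can hold all of them.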

### `swarmplot`

- Draw a categorical scatterplot with non-overlapping points.

In [None]:
sns.set(style="whitegrid", palette="muted") 

# "Melt" the dataset to "long-form" or "tidy" representation  
iris_m = pd.melt(iris, "species", var_name="measurement")  
  
# Draw a categorical scatterplot to show each observation  
sns.swarmplot(x="value", y="measurement", hue="species",  
              palette=["r", "c", "y"], data=iris_m);  

## Categorical Distribution Plots

In [None]:
planets = sns.load_dataset("planets")  
planets

### `boxplot`
- Draw a box plot to show distributions with respect to categories.
- A box plot shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. 
- The box shows the quartiles of the dataset.
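The quartiles that define the box can be computed directly with NumPy. This is a minimal sketch on made-up values; it only illustrates the quantities involved and does not reproduce the drawing logic.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])   # made-up sample

# Box edges (first and third quartile) and the line inside the box (median)
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range: with the default whis=1.5, whiskers reach at most
# 1.5 * IQR beyond the box edges
iqr = q3 - q1
print(q1, median, q3, iqr)
```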

In [None]:
sns.set(style="ticks")  
  
# Initialize the figure with a logarithmic x axis  
f, ax = plt.subplots(figsize=(7, 6))  
ax.set_xscale("log")   
  
# Plot the distance with horizontal boxes; whiskers span the full data range  
sns.boxplot(x="distance", y="method", data=planets,  
            whis=[0, 100], palette="vlag")  
  
# Tweak the visual presentation  
ax.xaxis.grid(True)  
ax.set(ylabel="")  
sns.despine(trim=True, left=True)  

### `violinplot`
- Draw a combination of boxplot and kernel density estimate.
- It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. 
- This can be an effective and attractive way to show multiple distributions of data at once.

In [None]:
ax = sns.violinplot(x=planets["distance"])

In [None]:
f, ax = plt.subplots(figsize=(13, 7))

sns.violinplot(x="method", y="distance", data=planets) 
# Finalize the figure  
ax.set(ylim=(-10, 9000))  
sns.despine(left=True, bottom=True)  

In [None]:
df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)
df

In [None]:
sns.set(style="whitegrid")  
  
# Load the example dataset of brain network correlations  
df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)  
  
# Pull out a specific subset of networks  
used_networks = [1, 3, 4, 5, 6, 7, 8, 11, 12, 13, 16, 17]  
used_columns = (df.columns.get_level_values("network")  
                          .astype(float)  
                          .isin(used_networks))  
df = df.loc[:, used_columns]  
  
# Compute the correlation matrix and average over networks  
corr_df = df.corr().groupby(level="network").mean()  
corr_df.index = corr_df.index.astype(int)  
corr_df = corr_df.sort_index().T  
  
# Set up the matplotlib figure  
f, ax = plt.subplots(figsize=(11, 6))  
  
# Draw a violinplot with a narrower bandwidth than the default  
sns.violinplot(data=corr_df, palette="Set3", bw=.2, cut=1, linewidth=1)  
  
# Finalize the figure  
ax.set(ylim=(-.7, 1.05))  
sns.despine(left=True, bottom=True)  

### `boxenplot`
- Draw an enhanced box plot for larger datasets.
- This style of plot was originally named a “letter value” plot because it shows a large number of quantiles that are defined as “letter values”.
- It is similar to a box plot in plotting a nonparametric representation of a distribution in which all features correspond to actual observations. 
- By plotting more quantiles, it provides more information about the shape of the distribution, particularly in the tails. 
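The idea of letter values can be sketched with plain quantiles: the median, then the fourths, then the eighths, each level halving the tail probability. The data and the three depths below are illustrative assumptions; how many levels `boxenplot` actually draws is decided by seaborn and is not reproduced here.

```python
import numpy as np

data = np.arange(17)                     # made-up sample: 0, 1, ..., 16
depths = np.array([0.5, 0.25, 0.125])    # median, fourths, eighths

# Lower and upper letter values at each depth
lower = np.quantile(data, depths)
upper = np.quantile(data, 1 - depths)
print(lower, upper)
```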

In [None]:
ax = sns.boxenplot(x=planets["distance"])

In [None]:
f, ax = plt.subplots(figsize=(13, 7))

sns.boxenplot(x="method", y="distance", data=planets) 
# Finalize the figure  
ax.set(ylim=(-10, 9000))  
sns.despine(left=True, bottom=True)  

## Categorical Estimate Plots

In [None]:
titanic = sns.load_dataset("titanic")
titanic

### `pointplot`
- Show point estimates and confidence intervals using scatter plot glyphs.
- A point plot represents an estimate of central tendency for a numeric variable by the position of scatter plot points and provides some indication of the uncertainty around that estimate using error bars.
- Point plots can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables.
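The point and its error bar correspond to an estimate and an interval around it. A minimal bootstrap sketch of that idea is shown below, on made-up 0/1 outcomes; the number of resamples and the 95% level are assumptions, and seaborn's own resampling details may differ.

```python
import numpy as np

rs = np.random.RandomState(0)
survived = rs.binomial(1, 0.4, size=200)   # made-up 0/1 outcomes

# Point estimate: the mean of the observations
estimate = survived.mean()

# Resample with replacement and recompute the mean each time
boot_means = np.array([
    rs.choice(survived, size=survived.size, replace=True).mean()
    for _ in range(1000)
])

# Percentile interval covering the central 95% of the resampled means
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(estimate, ci_low, ci_high)
```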

In [None]:
ax = sns.pointplot(x="class", y="survived", hue="who", 
                   data=titanic)

### `countplot`
- Show the counts of observations in each categorical bin using bars.
- A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
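The bar heights of a count plot are simply category counts, which can be checked with pandas. The tiny table below is hand-made for illustration and is not the Titanic data.

```python
import pandas as pd

passengers = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second", "Third", "First"],
    "who":   ["woman", "man", "man", "child", "woman", "man"],
})

# One bar per category
counts = passengers["class"].value_counts()

# One bar per (category, hue level) pair
by_who = pd.crosstab(passengers["class"], passengers["who"])
print(counts)
print(by_who)
```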

In [None]:
ax = sns.countplot(x="class", hue="who", data=titanic)

In [None]:
g = sns.catplot(x="class", hue="who", col="survived",
                data=titanic, kind="count",
                height=4, aspect=.7);

### `barplot`
- Show point estimates and confidence intervals as rectangular bars.
- A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars. 
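With the default estimator (the mean), the height of each bar is the per-category mean of the numeric variable. A sketch with a hand-made table, not the Titanic data:

```python
import pandas as pd

passengers = pd.DataFrame({
    "class":    ["First", "First", "Second", "Second",
                 "Third", "Third", "Third", "Third"],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of `survived` within each class: the quantity a bar's height encodes
bar_heights = passengers.groupby("class")["survived"].mean()
print(bar_heights)
```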

In [None]:
ax = sns.barplot(x="class", y="survived", data=titanic)

In [None]:
ax = sns.barplot(x="class", y="survived", hue="who", data=titanic)

# <font color="red">Visualizing the Distribution of a Dataset</font>



### `distplot`
- Flexibly plot a univariate distribution of observations.

In [None]:
sns.set(style="white", palette="muted", color_codes=True)  
rs = np.random.RandomState(10)  
  
# Set up the matplotlib figure  
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)  
sns.despine(left=True)  
  
# Generate a random univariate dataset  
d = rs.normal(size=100)  
  
# Plot a simple histogram with binsize determined automatically  
sns.distplot(d, kde=False, color="b", ax=axes[0, 0])  
  
# Plot a kernel density estimate and rug plot  
sns.distplot(d, hist=False, rug=True, color="r", ax=axes[0, 1])  
  
# Plot a filled kernel density estimate  
sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])  
  
# Plot a histogram and kernel density estimate  
sns.distplot(d, color="m", ax=axes[1, 1])  
  
plt.setp(axes, yticks=[])  
plt.tight_layout() 

### `kdeplot`

- Fit and plot a univariate or bivariate kernel density estimate.
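A kernel density estimate places a small bump (here a Gaussian) on every observation and averages them. The sketch below is a NumPy-only illustration of the univariate case; the helper name `gaussian_kde_1d` and the bandwidth of 0.4 are assumptions for this example, not what `kdeplot` uses internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centered on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rs = np.random.RandomState(0)
samples = rs.normal(size=200)
grid = np.linspace(-8, 8, 801)

density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density should integrate to about 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth follows the data more closely and gives a bumpier curve; a larger one smooths more.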

In [None]:
sns.set(style="dark")  
rs = np.random.RandomState(500)  
  
# Set up the matplotlib figure  
f, axes = plt.subplots(3, 3, figsize=(9, 9), sharex=True, sharey=True)  
  
# Rotate the starting point around the cubehelix hue circle  
for ax, s in zip(axes.flat, np.linspace(0, 3, 10)):  
  
    # Create a cubehelix colormap to use with kdeplot  
    cmap = sns.cubehelix_palette(start=s, light=1, as_cmap=True)  
  
    # Generate and plot a random bivariate dataset  
    x, y = rs.randn(2, 50)  
    sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=ax)  
    ax.set(xlim=(-3, 3), ylim=(-3, 3))  
  
f.tight_layout()

###  `pairplot`

- Plot pairwise relationships in a dataset.

In [None]:
sns.set(style="ticks")  
#iris = sns.load_dataset("iris") 
sns.pairplot(iris, hue="species");

# <font color="red">Visualizing Linear Relationships</font>


### `regplot`

- Plot data and a linear regression model fit.
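The fitted line is an ordinary least-squares fit of `y` on `x`. As a stand-in for that computation, the sketch below uses `np.polyfit` on noise-free made-up data, so the recovered coefficients are exact up to rounding; it does not claim to be the routine `regplot` calls internally.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0            # noise-free line: slope 2, intercept 1

# Degree-1 least-squares fit; coefficients come back highest degree first
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```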

In [None]:
g = sns.regplot(x="sepal_length", y="sepal_width",  
                 data=iris) 

### `lmplot`

- Enhance a scatterplot to include a linear regression model (and its uncertainty). 

In [None]:
iris = sns.load_dataset("iris")

# Plot sepal width as a function of sepal length for each species  
g = sns.lmplot(x="sepal_length", y="sepal_width", hue="species",  
               truncate=True, height=5, data=iris)  
  
# Use more informative axis labels than are provided by default  
g.set_axis_labels("Sepal length (cm)", "Sepal width (cm)");

###  `jointplot`
- Draw a plot of two variables with bivariate and univariate graphs.

In [None]:
g = sns.jointplot(x="sepal_length", y="sepal_width", data=iris)

In [None]:
g = sns.jointplot(x="sepal_length", y="sepal_width", data=iris, 
                  kind="reg", space=0, color="g")

# <font color="red">Useful Functions</font>

### `heatmap`

- Plot rectangular data as a color-encoded matrix.

In [None]:
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data)

In [None]:
sns.set()  
  
# Load the example flights dataset (long-form) and convert it to wide-form  
flights_long = sns.load_dataset("flights")  
flights = flights_long.pivot(index="month", columns="year", values="passengers")  
  
# Draw a heatmap with the numeric values in each cell  
f, ax = plt.subplots(figsize=(11, 8))  
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax); 
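`heatmap` expects a rectangular (wide-form) table, so long-form data has to be pivoted first. The sketch below shows that reshaping on a four-row table with illustrative values.

```python
import pandas as pd

# Long form: one row per (year, month) observation (illustrative values)
long_form = pd.DataFrame({
    "year":       [1949, 1949, 1950, 1950],
    "month":      ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide form: months as rows, years as columns, passengers as cell values
wide_form = long_form.pivot(index="month", columns="year", values="passengers")
print(wide_form)
```

Each cell of the wide table becomes one colored cell of the heatmap.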

### `clustermap`
- Plot a matrix dataset as a hierarchically-clustered heatmap.

In [None]:
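# clustermap needs numeric data only, so remove the categorical column
# (pop modifies `iris` in place and returns the removed column)
species = iris.pop("species")
g = sns.clustermap(iris)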