# Seaborn Visualization: Intermediate

Instructor: Chris Moffitt: Practical Business Python

* `matplotlib` provides the raw building blocks for Seaborn's visualizations
* Seaborn supports complex visualizations of data
* It is built on matplotlib and works best with pandas dataframes
* Seaborn makes reasonable assumptions about colors and other visual elements, producing visualizations that look more pleasing than standard matplotlib plots
* Additionally, Seaborn performs statistical analysis on the data

#### Seaborn's `distplot`
* By default, plots a histogram overlaid with a Gaussian kernel density estimate (KDE)
* The KDE curve looks somewhat like a smoothed histogram
    * `sns.distplot(df['fmr_2'])`
* **Customizing distribution plots:**
* To plot a simple histogram, disable the KDE and specify the number of bins
    * `sns.distplot(df['alcohol'], kde= False, bins=10)`
* **rug plot**: [doc here](https://en.wikipedia.org/wiki/Rug_plot)
    * kde curve and rug plot can be combined
    * `sns.distplot(df['alcohol'], hist= False, rug=True)`
* the `distplot` function uses several functions, including `kdeplot` and `rugplot`
* It is possible to further customize a plot by passing arguments to the underlying function
* `sns.distplot(df['alcohol'], hist=False, rug=True, kde_kws={'shade':True})`
* `kws` is short for "keywords": `kde_kws` is a dict of keyword arguments forwarded to `kdeplot`

#### Regression plots
* Univariate analysis: looks at one variable
* **Regression Analysis** is bivariate (looks for relationships between two variables)
* **`regplot()`**
    * `regplot()` function generates a scatterplot with a regression line
    * Usage is similar to `distplot()`
    * Must define: `data`, `x`, `y`
        * Since we're using a pandas DataFrame, the `x` and `y` variables refer to columns in the DataFrame
    * **`lmplot()`**: builds on top of regplot()
        * while regplot() is "low level," lmplot() is "high level"
        * lmplot() is much more flexible
        * lmplot() faceting:
            * organize data by colors (`hue`)
            * organize data by columns (`col`) or rows (`row`)
            * **Faceting:** plotting multiple graphs while changing a single variable
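
Faceting with `lmplot()` can be sketched as follows. The column names (`x`, `y`, `group`, `region`) and the data are hypothetical, generated only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: a noisy linear relationship plus two categorical columns
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "group": rng.choice(["a", "b"], n),
    "region": rng.choice(["north", "south", "west"], n),
})
df["y"] = 2.0 * df["x"] + rng.normal(0, 2, n)

# hue colors the points and regression lines by group;
# col draws one panel per distinct region value
grid = sns.lmplot(data=df, x="x", y="y", hue="group", col="region")

n_panels = grid.axes.size  # number of facet panels in the returned FacetGrid
```

Because `lmplot()` returns a `FacetGrid` rather than a single `Axes`, it manages its own figure, which is why it can lay out rows and columns while `regplot()` cannot.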

#### Using Seaborn Styles
* Visualization's "aesthetics": layouts, labels, colors
* **`sns.set()`** applies the default Seaborn style to subsequent plots, whether they are drawn with pandas, matplotlib, or Seaborn
* Seaborn has several default configurations that can be set with **`sns.set_style`**
    * These styles can override matplotlib and pandas styles as well
    * built-in styles (5): `white`, `whitegrid`, `dark`, `darkgrid`, `ticks`
* In general, visualizations are more impactful when excess "chart junk" is removed
    * Common use case: remove the lines along the axes, called spines, with **`despine`**
    * the default is to remove the top and right lines, but you can pass arguments specifying others
    * `sns.despine(left=True)`
* **`plt.clf()`** to clear a figure

#### Colors in Seaborn 
* Color is an extremely important component of creating effective visualizations
* Color blindness affects roughly 1 in 12 men (about 8%) but only around 1 in 200 women
* Using color palettes that are colorblind-friendly can be very important
* Seaborn has several functions for creating, viewing, and configuring color palettes
* Because Seaborn is built on top of Matplotlib, it is able to interpret and apply Matplotlib color codes
    * To use matplotlib color codes, use:
    * `sns.set(color_codes=True)`
    * `sns.distplot(df['Tuition'], color='g')`
* To assign specific palette: `sns.set_palette()`
    * cycle through colors of a palette with:
    
```
for p in sns.palettes.SEABORN_PALETTES:
    sns.set_palette(p)
    sns.distplot(df['Tuition'])
```
* Seaborn has 6 default palettes:
    * deep
    * muted
    * pastel
    * bright
    * dark
    * colorblind

**Displaying palettes:**

* `sns.palplot()` function displays a palette
* `sns.color_palette()` returns the current palette

```
for p in sns.palettes.SEABORN_PALETTES:
    sns.set_palette(p)
    sns.palplot(sns.color_palette())
    plt.show()
```
* There are three main types of color palettes:
    * **Circular color palettes:** used for categorical data that is not ordered
        * Example: `Paired`
    * **Sequential color palettes:** useful for when the data has a consistent range from high to low
        * Example: `Blues`
    * **Diverging color palettes:** for when both the low and high values are interesting
        * Example: `BrBG`
    * To display a palette with 12 colors: `sns.palplot(sns.color_palette("Paired", 12))`
        * The 12 can be any value (it sets how many "swatches" the palette has)
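
The three palette types can also be inspected without plotting, since `sns.color_palette()` returns a list of RGB tuples. A short sketch, with the swatch counts chosen arbitrarily:

```python
import seaborn as sns

# One palette of each type, using the example names from the notes
circular = sns.color_palette("Paired", 12)   # unordered categories
sequential = sns.color_palette("Blues", 6)   # low-to-high range
diverging = sns.color_palette("BrBG", 7)     # both extremes are interesting

# Each entry is an (r, g, b) tuple with values between 0 and 1
for name, palette in [("Paired", circular), ("Blues", sequential), ("BrBG", diverging)]:
    print(name, len(palette), palette.as_hex()[:3])
```

Passing any of these lists to `sns.palplot()` draws the swatches, and passing one to `sns.set_palette()` makes it the active palette.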

#### Customizing with Matplotlib
* Since Seaborn is based on Matplotlib, there is a wide variety of options for further modifying Seaborn plots
    * **Matplotlib Axes:**
        * Most customizations available through `matplotlib` `Axes` objects
        * `Axes` can be passed to seaborn functions
        * A key pattern is to create the figure and axes with matplotlib's `plt.subplots()` and pass the resulting `Axes` object to the Seaborn function:

```
fig, ax = plt.subplots()
sns.distplot(df['Tuition'], ax=ax)
ax.set(xlabel='Tuition 2013-14')
```
* The `Axes` object supports many common customizations:

```
fig, ax = plt.subplots()
sns.distplot(df['Tuition'], ax=ax)
ax.set(xlabel='Tuition 2013-14', ylabel='Distribution', xlim=(0, 50000), title='2013-14 Tuition and Fees Distribution')
```

**Combining plots:**
   
```
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(7, 4))

# Left: all tuition values; right: only rows where State is 'MN'
sns.distplot(df['Tuition'], ax=ax0)
sns.distplot(df.query("State == 'MN'")['Tuition'], ax=ax1)

ax1.set(xlabel='Tuition (MN)', xlim=(0,70000))
ax1.axvline(x=20000, label='My Budget', linestyle='--')
ax1.legend()
```

```
# Create a plot with 1 row and 2 columns that share the y axis label
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True)

# Plot the distribution of 1 bedroom apartments on ax0
sns.distplot(df['fmr_1'], ax=ax0)
ax0.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500))

# Plot the distribution of 2 bedroom apartments on ax1
sns.distplot(df['fmr_2'], ax=ax1)
ax1.set(xlabel="2 Bedroom Fair Market Rent", xlim=(100,1500))

# Display the plot
plt.show()
```

#### Categorical plot types
* Categorical data contains a limited or fixed number of values and is most useful when combined with numeric data.

* Seaborn breaks categorical data plots into three groups:
    * 1) **First group: shows all observations:** `stripplot()` and `swarmplot()` 
    * 2) **Second group: abstract representations:** `violinplot()`, `boxplot()`, and `lvplot()`
    * 3) **Third group: statistical estimates:** `barplot()`, `pointplot()`, and `countplot()`
    
**Plots of each observation:**

* **`stripplot()`:**
    * `sns.stripplot(data=df, y='DRG Definition', x='Average Covered Charges', jitter=True)`
    * Shows every observation in the dataset 
    * In some cases it can be difficult to see individual data points
    
* **`swarmplot()`:**
    * `sns.swarmplot(data=df, y='DRG Definition', x='Average Covered Charges')`
    * More sophisticated visualization than stripplot
    * Uses a complex algorithm to place observations in a manner where they will not overlap
    * Note: does not scale well to large datasets

**Abstract representations:**

* **`boxplot()`:**
    * `sns.boxplot(data=df, y='DRG Definition', x='Average Covered Charges')`
    * Used to show several measures related to the distribution of data, including the median, upper and lower quartiles, as well as outliers
    
* **`violinplot()`:**
    * `sns.violinplot(data=df, y='DRG Definition', x='Average Covered Charges', palette='husl')`
    * Combination of a kernel density plot and a boxplot
    * Because the plot uses a kernel density calculation, it does not show all points
    * This can be useful for displaying large datasets, but it can also be computationally intensive
    * Palette parameter optional

* **`lvplot()`:**
    * `sns.lvplot(data=df, y='DRG Definition', x='Average Covered Charges')`
    * Stands for "Letter Value Plot" (renamed `boxenplot()` in seaborn 0.9 and later)
    * API is the same as boxplot and violinplot, but can scale more effectively to large datasets
    * lvplot is a hybrid between a boxplot and violinplot
    * Relatively quick to render, easy to interpret
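
The measures a box plot summarizes can be computed directly with pandas. This is a sketch on made-up numbers, assuming the common 1.5 × IQR rule for flagging outliers (the default `whis=1.5` in matplotlib and seaborn box plots).

```python
import pandas as pd

# Made-up charges with one obviously extreme value
charges = pd.Series([12, 15, 15, 18, 20, 22, 25, 27, 30, 95], name="charges")

# Lower quartile, median, and upper quartile: the box and its center line
q1, median, q3 = charges.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

Violin and letter-value plots start from the same quantile idea but add density or further quantile levels, which is why they convey more about the shape of large datasets.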
    
**Statistical estimates:**

* **`barplot()`:**
    * `sns.barplot(data=df, y='DRG Definition', x='Average Covered Charges', hue='Region')`
    * Shows an estimate of the value as well as a confidence interval
    * hue parameter optional, but often very helpful
    
* **`pointplot()`:**
    * `sns.pointplot(data=df, y='DRG Definition', x='Average Covered Charges', hue='Region')`
    * similar to bar plot in that it shows summary measure and confidence interval
    * can be very useful for observing how values change across categorical values
    
* **`countplot()`:**
    * `sns.countplot(data=df, y='DRG_Code', hue='Region')`
    * displays the number of observations in each category
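
The numbers behind these plots can be reproduced with pandas: `barplot()` and `pointplot()` plot the mean of each category by default, and `countplot()` plots how many rows fall into each category. A sketch on a tiny made-up table (the values are not from the course dataset):

```python
import pandas as pd

# Made-up rows reusing two column names from the examples above
df = pd.DataFrame({
    "Region": ["West", "West", "South", "South", "South", "Northeast"],
    "Average Covered Charges": [30000.0, 34000.0, 21000.0, 25000.0, 26000.0, 40000.0],
})

# What barplot/pointplot estimate by default: the per-category mean
means = df.groupby("Region")["Average Covered Charges"].mean()

# What countplot shows: the number of rows per category
counts = df["Region"].value_counts()

print(means)
print(counts)
```

Seaborn additionally draws a confidence interval around each mean, which this sketch does not compute.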

#### Regression plots
* Seaborn has a robust API that supports sophisticated analysis of data sets

* **Plotting with `regplot()`:**
    * `sns.regplot(data=df, x='temp', y='total_rentals', marker='+')`
    
* **Evaluating regression with `residplot()`:**
    * A residual plot is useful for evaluating the fit of a model and understanding whether it is appropriate
    * Seaborn supports this through the `residplot()` function
    * `sns.residplot(data=df, x='temp', y='total_rentals')`
    * Ideally, the residuals are scattered randomly around the horizontal zero line
 
* **Polynomial regression with `regplot()`:**
    * Seaborn supports polynomial regression with the `order` parameter
    * `sns.regplot(data=df, x='temp', y='total_rentals', order=2)`
    
* `residplot()` also accepts `order`, so it can fit the second-order polynomial and plot its residuals. Ideally, the values are randomly distributed.

* **Regression plots with categorical values:**
    * `sns.regplot(data=df, x='mnth', y='total_rentals', x_jitter=0.1, order=2)`

* **Estimators:**
    * In some cases, even with the jitter, it may be difficult to see if there are any trends based on the values of the variable 
    * `sns.regplot(data=df, x='mnth', y='total_rentals', x_estimator=np.mean, order=2)`
    * Using an estimator for the x value can provide another helpful view of the data
    * This simplified view can show a trend
    
* **Binning the data:**
    * When there are continuous variables, it can be helpful to break them up into different bins
    * `x_bins` can be used to divide the data into discrete bins
    * The regression line is still fit against all the data
    * `sns.regplot(data=df, x='temp', y='total_rentals', x_bins=4)`
    * this shortcut function can help with getting a quick read on continuous data such as temperature
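
What a residual plot shows can be sketched with plain NumPy: fit a polynomial of a given order, then subtract the fitted values from the observations. The data below is synthetic with a deliberately curved relationship, and the `residuals` helper is a hypothetical name for illustration, not a seaborn function.

```python
import numpy as np

# Synthetic data: rentals rise with temperature, then level off (curved trend)
rng = np.random.default_rng(3)
temp = rng.uniform(0, 1, 200)
total_rentals = 1000 + 6000 * temp - 4000 * temp**2 + rng.normal(0, 200, 200)

def residuals(x, y, order):
    """Observed minus fitted values for a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, deg=order)
    return y - np.polyval(coeffs, x)

res1 = residuals(temp, total_rentals, order=1)  # straight line
res2 = residuals(temp, total_rentals, order=2)  # second-order polynomial

rmse1 = np.sqrt(np.mean(res1**2))
rmse2 = np.sqrt(np.mean(res2**2))
print(round(rmse1, 1), round(rmse2, 1))
```

On curved data like this, the first-order residuals keep a visible arch while the second-order residuals scatter around zero, which is the pattern to look for when comparing `order=1` and `order=2`.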
    
#### Matrix plots
* The heatmap is the most common type of matrix plot and can easily be created by Seaborn
* These can be useful to quickly see trends in a dataset
* `sns.heatmap` requires that the data be in grid format (i.e. a matrix)
* pandas `crosstab()` is frequently used to manipulate the data

`pd.crosstab(df['mnth'], df['weekday'], values=df['total_rentals'], aggfunc='mean').round(0)`

* the `crosstab()` function builds a table to summarize the data by the day and month 
* the `heatmap` translates the numerical values in the matrix into a color-coded grid

`sns.heatmap(pd.crosstab(df['mnth'], df['weekday'], values=df['total_rentals'], aggfunc='mean'))`
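
To make the reshaping step concrete, here is a small sketch of `pd.crosstab()` on a few made-up rows that reuse the column names above. The resulting grid is the kind of input `sns.heatmap()` expects.

```python
import pandas as pd

# Made-up rentals: two months, two weekdays (0 and 6)
df = pd.DataFrame({
    "mnth": [1, 1, 1, 2, 2, 2],
    "weekday": [0, 0, 6, 0, 6, 6],
    "total_rentals": [100, 200, 50, 300, 80, 120],
})

# Rows are months, columns are weekdays, cells hold the mean rentals
table = pd.crosstab(
    df["mnth"], df["weekday"],
    values=df["total_rentals"], aggfunc="mean",
).round(0)

print(table)
```

Each cell is an average over every row sharing that month and weekday, so the long table collapses into one value per grid position.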

* **Customize a heatmap:**
    * heatmaps can be customized in multiple ways to present the most information as clearly as possible
    
`sns.heatmap(df_crosstab, annot= True, fmt='d', cmap='YlGnBu', cbar=False, linewidths=0.5)`

* `annot = True` : to add annotations to the individual cells
* `fmt` : ensures that the results are displayed as integers
* `cmap` : color map
* `cbar = False` : color bar (legend) is not displayed
* `linewidths` : passing a value to linewidths puts a small space between each of the squares/cells so that the values are simpler to view (arguable)

* **Centering a heatmap:**
* Seaborn supports centering a heatmap's colors on a specific value

`sns.heatmap(df_crosstab, annot=True, fmt='d', cmap='YlGnBu', cbar=True, center=df_crosstab.loc[9,6])`
* In this example, we center the color map at the value stored for September and Saturday (9=September, 6=Saturday, 6th day of the week). 

* One common usage for a heatmap is to illustrate correlation
* pandas `corr()` function calculates correlations between columns in a dataframe
* The output can be converted to a heatmap with seaborn

`sns.heatmap(df.corr())`
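
A fuller sketch of a correlation heatmap on synthetic data. The `hum` column and all values are assumptions made for illustration. The choices of a diverging colormap, `center=0`, and a fixed -1 to 1 range are one reasonable setup for correlations, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: rentals depend on temperature, humidity is unrelated noise
rng = np.random.default_rng(4)
temp = rng.uniform(0, 1, 100)
df = pd.DataFrame({
    "temp": temp,
    "hum": rng.uniform(0, 1, 100),
    "total_rentals": 5000 * temp + rng.normal(0, 500, 100),
})

corr = df.corr()  # square matrix of pairwise correlations

fig, ax = plt.subplots()
# Diverging colormap centered on 0 so positive and negative values differ in hue
sns.heatmap(corr, annot=True, fmt=".2f", cmap="BrBG",
            center=0, vmin=-1, vmax=1, ax=ax)
plt.close(fig)
```

Here `fmt=".2f"` is used instead of `'d'` because correlation values are floats rather than integers.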