# Seaborn Visualization

* Seaborn (sns) is a Python data visualization library
    * Advantages:
        * Easy to use
        * works well with pandas
            * ...but only if the data is tidy
            * 'tidy data': each observation has its own row, and each variable has its own column
        * built on top of matplotlib
        * visually attractive
    * Why 'sns'?
        * The Seaborn library was named after "Samuel Norman Seaborn", a West Wing character, whose initials give the alias `sns`
    * Access seaborn's built-in datasets with: `sns.load_dataset()`
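    * To make the 'tidy data' idea concrete, here is a minimal pandas sketch that reshapes a made-up wide table (one column per year) into tidy form. The `wide` table and its column names are hypothetical and only for illustration:

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Austin", "Boston"],
    "2020": [10, 20],
    "2021": [12, 22],
})

# Tidy ("long") form: one row per (city, year) observation,
# one column per variable (city, year, sales).
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
print(tidy)
```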

#### Scatterplots
   * `sns.scatterplot()` [doc here](https://seaborn.pydata.org/generated/seaborn.scatterplot.html): parameters = ( x=None, y=None, hue=None, style=None, size=None, data=None, palette=None, hue_order=None, hue_norm=None, sizes=None, size_order=None, size_norm=None, markers=True, style_order=None, x_bins=None, y_bins=None, units=None, estimator=None, ci=95, n_boot=1000, alpha=None, x_jitter=None, y_jitter=None, legend='auto', ax=None, **kwargs)
       * `sns.scatterplot(x=heights, y=weights)`

#### Countplots
   * Countplots take in a categorical list and return bars that represent the number of list entries per category
   * `sns.countplot(x=dog_breeds)`
   * countplots with pandas:
       * `sns.countplot(x='how_masculine', data=df)`
       * when you use a named column of the dataframe as either x or y, seaborn automatically names that axis with the column name.
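   * A minimal sketch of the pandas case, using a small made-up DataFrame (`dogs` and its `breed` column are hypothetical). It shows the axis label being picked up from the column name:

```python
import pandas as pd
import seaborn as sns

# Made-up data, used only for illustration.
dogs = pd.DataFrame({"breed": ["pug", "lab", "pug", "corgi", "lab", "pug"]})

# Passing a column name plus data= lets seaborn label the axis for us.
ax = sns.countplot(x="breed", data=dogs)
print(ax.get_xlabel())

# The bar heights correspond to these per-category counts.
counts = dogs["breed"].value_counts()
print(counts)
```

Call `plt.show()` (after `import matplotlib.pyplot as plt`) to display the figure, as in the other snippets in these notes.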

#### Hue
   * seaborn allows you to quickly add a third variable to your plots by adding color with **`hue`**
   * set the `hue` parameter to the dataframe column you would like to add as your third variable
   * `sns.scatterplot(x='total_bill', y='tip', data=tips, hue='smoker')`
   * For hue, sns adds a legend to the plot automatically
   * if you don't want to use pandas, you can set `hue` to a list of values instead of a column name
   * You can also exert more control over the ordering and coloring of each hue value
   * the `hue_order` parameter takes in a list of values and will set the order of the values in the plot accordingly
   * also control the colors assigned to each value with the parameter `palette`
       * `palette` takes in a dictionary mapping the variable values to the colors you want to represent the value
       * `hue_colors = {'Yes': 'black', 'No': 'red'}`
       * `sns.scatterplot(x='total_bill', y='tip', data=tips, hue='smoker', palette=hue_colors)`
   * hue is available in most of seaborn's plot types
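   * The `hue`, `hue_order`, and `palette` options can be combined in one call. The sketch below assumes a small made-up stand-in for the tips data (`tips_demo` is hypothetical, not the real dataset):

```python
import pandas as pd
import seaborn as sns

# Small made-up stand-in for the tips dataset.
tips_demo = pd.DataFrame({
    "total_bill": [10.0, 20.5, 15.2, 30.1, 25.0, 12.3],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.5],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# Map each hue value to a color, and fix the order of the hue values.
hue_colors = {"Yes": "black", "No": "red"}
ax = sns.scatterplot(
    x="total_bill", y="tip", data=tips_demo,
    hue="smoker", hue_order=["No", "Yes"], palette=hue_colors,
)

# The legend is added automatically; read back its entries.
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```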

#### Relational plots and subplots
* Seaborn calls plots that visualize the relationship between two quantitative variables **relational plots**
* While looking at a relationship between two variables at a high level is often informative, sometimes we suspect that the relationship may be different within certain subgroups

    * **relplot()**
        * "relational plot": enables you to visualize the relationship between two quantitative variables using either scatterplots or lineplots
        * **Why use `relplot()` instead of `scatterplot()`?**
            * The ability to create subplots in a single figure 
        * Subplots in columns:

``` 
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', col='smoker')
plt.show()  # subplots arranged horizontally in columns
```
   * to arrange vertically, in rows, instead:
   
```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', row='smoker')
plt.show()  # subplots arranged vertically in rows
```
   * it **is** also possible to use both `col` and `row` at the same time:

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', col='smoker', row='time')
plt.show()  # subplots arranged in a grid of rows and columns
```
   * to specify how many plots you want per row (in a grid of subplots), use `col_wrap`:

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', col='day', col_wrap=2)
plt.show()  # specified: 2 plots per row
```
   * change the order of the subplots by using the __`col_order`__ and/or __`row_order`__ parameters:

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', col='day', col_wrap=2,
            col_order=['Thur', 'Fri', 'Sat', 'Sun'])
plt.show()
```
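   * To see how `col` and `row` map onto the grid, the sketch below builds a small made-up table (`demo` is hypothetical, standing in for `tips`) and inspects the array of axes that `relplot()` returns:

```python
import pandas as pd
import seaborn as sns

# Made-up data with two levels each for 'smoker' and 'time'.
demo = pd.DataFrame({
    "total_bill": [10.0, 20.5, 15.2, 30.1, 25.0, 12.3, 18.0, 22.4],
    "tip": [1.5, 3.0, 2.0, 5.0, 4.0, 2.5, 3.1, 3.6],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner",
             "Lunch", "Dinner", "Lunch", "Dinner"],
})

# One subplot per (time, smoker) combination.
g = sns.relplot(x="total_bill", y="tip", data=demo,
                kind="scatter", col="smoker", row="time")

# The grid of axes is laid out as (rows, columns).
print(g.axes.shape)
```

`relplot()` returns a grid object rather than a single axes, which is what makes the subplot layout possible.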

#### Customizing scatter plots
* Customize size, style, and transparency of points (can be used in both `scatterplot()` and `relplot()`)
   #### Subgroups with point size
    * Varying point size is best used if the variable is either a quantitative variable or a categorical variable that represents levels of something (like small, medium, large, etc)
    * Say, we want each point on the scatterplot to be sized based on the number of people in the group, with larger groups having bigger points on the plot
    * set `size` parameter equal to the variable name `size` from the dataset
    
```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', size = 'size')
plt.show()
```
   * Using the `size` parameter in combination with the `hue` parameter can make plots easier to read (than just using the `size` parameter with a single color)
   * Because size is a quantitative variable, Seaborn will automatically color the points different shades of the same color, instead of different colors per category value (as seen in other plots)

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', size = 'size', hue='size')
plt.show()
```
   #### Subgroups with point style
   * Setting the `style` parameter to a variable name will use different point styles for each value of the variable.

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', hue='smoker', style='smoker')
plt.show()
```
   #### Changing point transparency
   * Setting the `alpha` parameter to a value between 0 and 1 will vary the transparency of the points in the plot
   * 0 = completely transparent
   * 1 = completely opaque
   * can be very useful when you have many overlapping points on your scatterplot

```
sns.relplot(x='total_bill', y='tip', data=tips, kind='scatter', alpha= 0.4)
plt.show()
```

#### Line plots
* Whereas each point in a scatterplot is assumed to be an independent observation, __lineplots__ are the visualization of choice when we need to track the same thing over time.
* specify `kind='line'`
* track subgroups over time by setting `style` and/or `hue` to the subgroup column
* if you **don't** want the linestyles to vary by subgroup, set the `dashes` parameter to `False`
* line plots can also be used when you have more than one observation per x value 
    * *if a line plot is given multiple observations per x value, it will aggregate them into a single summary measure*
    * by default, it will display the mean
    * **Note** Seaborn will automatically calculate a confidence interval for the mean, displayed by a shaded region
    * **Assumes dataset is a random sample**
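    * If, instead of the confidence interval, you'd prefer to see the standard deviation: set `ci="sd"`
    * ci = confidence interval; sd = standard deviation
    * The standard deviation shows the spread of the distribution of observations at each x value
    * Also: "turn off" the confidence interval with `ci=None`
    * Note: seaborn 0.12 and later deprecate `ci` in favor of `errorbar` (for example, `errorbar="sd"` or `errorbar=None`)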

#### Categorical plots and count plots
* Visualizations involving categorical variables
* Categorical plots involve a categorical variable (commonly used to make comparisons between/across groups)
* __`sns.catplot()`__: used to create categorical plots
    * `catplot()` offers similar flexibility and advantages to `relplot()`
        * `kind='count'`
        * easily create subplots with `col= ` and `row= `
        * sometimes there is a specific ordering of categories that makes sense for these plots (example: survey responses on a scale of 0 to 10, or chronological events)
        * to change the order of the categories, create a list of category values in the order you would like them to appear, and then pass it to the `order` parameter
        * `x` = name of categorical variable to plot

```
category_order = ['No answer', 'Not at all', 'Not very', 'Somewhat', 'Very']
sns.catplot(x='how_masculine', data=masculinity_data, kind='count', order=category_order)
plt.show()
```
   * This works for all types of categorical plots, not just countplots 
   
#### Bar plots
* Display mean of quantitative variable per category 
* `kind='bar'`
* Here, too, Seaborn automatically shows 95% confidence intervals, represented with solid black lines ("whiskers")
* change the orientation of the bars on the bar plot (horizontal vs vertical and vice versa) by switching the `x` and `y` parameters
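* A minimal sketch of both orientations, assuming a small made-up table (`bills` is hypothetical). Swapping `x` and `y` is the only change between the two calls:

```python
import pandas as pd
import seaborn as sns

# Made-up data: a categorical column and a quantitative column.
bills = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "total_bill": [10.0, 14.0, 20.0, 22.0, 24.0],
})

# Vertical bars: category on x, quantitative variable on y.
g_vertical = sns.catplot(x="day", y="total_bill", data=bills, kind="bar")

# Horizontal bars: swap x and y.
g_horizontal = sns.catplot(x="total_bill", y="day", data=bills, kind="bar")

# Each bar's length is the per-category mean.
category_means = bills.groupby("day")["total_bill"].mean()
print(category_means)
```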

#### Box plots
* Type of categorical plot
* Shows distribution of quantitative data
* See Median, Spread, Skewness, and Outliers
* Facilitates comparisons between groups (of a categorical variable)
* Seaborn does have a `boxplot` function, but here we'll work on `sns.catplot(kind='box')`
    * this makes it easy to create subplots using the `col` and `row` parameters
    * Put the categorical variable on the x-axis and the quantitative variable on the y-axis
    * to omit outliers from boxplot: `sym=""`
        * `sym` can also be used to change the appearance of outliers, instead of removing them
    * By default, the whiskers extend to 1.5 * the interquartile range
        * change whiskers using `whis`
        * several options for changing the whiskers:
            * extend whiskers with, for example: `whis = 2.0` (2.0 * IQR)
            * Or, have whiskers define specific lower and upper percentiles by passing in a list of the lower and upper values 
                * `whis=[5,95]` (5th to 95th percentile)
                * `whis=[0,100]` to draw whiskers at min and max values
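* A short sketch contrasting the default whiskers with `whis=[0, 100]`, assuming made-up data in which each group contains one extreme value (`bills` is hypothetical):

```python
import pandas as pd
import seaborn as sns

# Made-up data: each day has one value far above the rest.
bills = pd.DataFrame({
    "day": ["Thur"] * 6 + ["Fri"] * 6,
    "total_bill": [8, 10, 11, 12, 13, 40, 15, 18, 20, 21, 23, 60],
})

# Default whiskers (1.5 * IQR): the extreme values are drawn as outliers.
g_default = sns.catplot(x="day", y="total_bill", data=bills, kind="box")

# Whiskers at the min and max: no point is left over as an outlier.
g_minmax = sns.catplot(x="day", y="total_bill", data=bills,
                       kind="box", whis=[0, 100])
```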

#### Point plots
* Categorical plot 
* Point plots show the mean of a quantitative variable for the observations in each category, plotted as a single point.
* The vertical bar extending above and below the mean (the "point") represents the 95% confidence interval for that mean.
* `kind = 'point'`
* to remove the lines connecting the points (perhaps because we only wish to compare within a categorical group and not between groups): `join=False` (seaborn 0.13 and later deprecate `join` in favor of `linestyle="none"`)
* to have points (and confidence intervals) represent the median, instead of the mean: 
    * first: `from numpy import median`
    * `estimator=median`
        * median is more robust to outliers
* Customize the way the confidence intervals are displayed:
    * to add caps to the end of the confidence intervals: `capsize=0.2`
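* A minimal sketch of a median-based point plot with capped intervals, assuming made-up scores where one group contains an outlier (`scores` is hypothetical). It uses `np.median` directly, which is equivalent to the `from numpy import median` form above:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up scores: group "a" contains one extreme value.
scores = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "score": [1, 2, 3, 4, 100, 5, 6, 7, 8, 9],
})

# Points mark the per-group median instead of the mean; capsize adds caps.
g = sns.catplot(x="group", y="score", data=scores, kind="point",
                estimator=np.median, capsize=0.2)

# The outlier pulls the mean of group "a" far above its median.
group_a = scores.loc[scores["group"] == "a", "score"]
print(group_a.mean(), group_a.median())
```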

#### Changing plot style and color
* Reasons to change style:
    * personal preference
    * improve readability 
    * help orient audience more quickly to the key takeaway
    * Guide interpretation
* Changing the figure style:
    * Seaborn has 5 preset figure styles, which change the background and axes of the plot 
    * Preset options:
        * "white"
        * "dark"
        * "whitegrid"
        * "darkgrid"
        * "ticks"
    * to set one of these as the global style for all of your plots, use: `sns.set_style()`
    * Change the color of the main elements of the plot with `sns.set_palette()`
        * Use preset color palettes
        * Or, create your own custom palette
            * *Note that if you append the palette name with `_r`, you can reverse the palette*
        * **diverging palettes:** great for comparing differences
        * **sequential palettes:** a single color or two colors blended, moving from light to dark values. Sequential palettes are great for emphasizing a variable on a continuous scale
        
        * Create your own custom palette by passing `sns.set_palette()` a list of the colors you would like to use (color names or hex color codes):

```
custom_palette = ['red', 'green', 'orange', 'blue', 'yellow', 'purple']
sns.set_palette(custom_palette)
```
* Changing the scale:
    * Figure "context" changes the scale of the plot elements and labels: `sns.set_context()`
    * Scale options (from smallest to largest): "paper", "notebook", "talk", "poster"
        * default context is "notebook"
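* A short sketch that sets the style, palette, and context together and then reads the settings back. The two hex codes are arbitrary example colors, not values from these notes:

```python
import seaborn as sns

sns.set_style("whitegrid")                # background and axes
sns.set_palette(["#39A7D0", "#36ADA4"])   # custom two-color palette (hex codes)
sns.set_context("talk")                   # larger elements, e.g. for slides

# Compare the base font size that each context would apply.
font_sizes = {name: sns.plotting_context(name)["font.size"]
              for name in ["paper", "notebook", "talk", "poster"]}
print(font_sizes)

# The palette currently in effect.
current_palette = sns.color_palette()
print(len(current_palette))
```

These settings are global, so they apply to every plot created afterwards in the same session.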

Matplotlib has eight single-letter color abbreviations:
* blue: 'b'
* green: 'g'
* red: 'r'
* cyan: 'c'
* magenta: 'm'
* yellow: 'y'
* black: 'k'
* white: 'w'

*You can also use HTML hex codes; just make sure you put the hex code in quotes, with a pound sign at the beginning (for example, `'#FF0000'`).*
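
As a quick check of how the abbreviations and hex codes relate, the sketch below uses matplotlib's color converters to show that both spellings resolve to the same color:

```python
from matplotlib.colors import to_hex, to_rgb

# A single-letter abbreviation and its hex code name the same color.
red_from_letter = to_rgb("r")
red_from_hex = to_rgb("#FF0000")
print(red_from_letter == red_from_hex)

# Convert abbreviations to hex codes.
black_hex = to_hex("k")
white_hex = to_hex("w")
print(black_hex, white_hex)
```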