# Overview

1. Getting started
2. `matplotlib`
3. `seaborn`
4. Conclusion

# Getting started

---

## Imports for today

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import numpy.random as nr

---

## Simulating data

Today we will create random data from randomly generated Gaussian parameters: a Gaussian mixture, a 'control' group, and a 'treatment' group.

In [None]:
# Create a random set of normal distribution parameters: loc (mean), scale (stdev), and size
def random_parameters(n = 3, size = 10000):
    """ Creates n number of random normal parameters
    
    Args:
        n (int): number of random normal parameter sets to make (default: 3)
        size (int): predefined size parameter. Useful for downstream downsampling (default: 10000)
    
    Returns:
        rand_params (list): contains 3 values per item: loc (mean), scale (stdev), and size
    """
    rand_params = []
    for i in range(n):
        rand_params.append((nr.uniform(-100, 100), nr.rand(), size))
    return rand_params

In [None]:
# Create normal distributions from the above parameters and downsample all three
def gen_sample(dist_params, downsamples = None):
    """ Create random downsampled variables from normal distributions of a given set of parameters
    
    Args:
        dist_params (list of tuples): list of parameters tuples: loc (mean), scale (stdev), and size of normal distribution
        downsamples (list): numbers of downsample sizes. Must be same length as dist_params or None (default: None)
    
    Returns:
        norms (list): length of list determined by number of parameters sets in dist_params. Each item contains a random downsampled set of random variables
    
    Raises:
        AssertionError: if downsamples is given and its length differs from that of dist_params
    """
    norms = []
    if downsamples is None:
        downsamples = [1000] * len(dist_params)
    assert len(downsamples) == len(dist_params), 'dist_params and downsamples must be the same length'
    for i, params in enumerate(dist_params):
        norms.append(nr.choice(nr.normal(*params), downsamples[i]))
    return norms
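
One detail of `gen_sample` worth knowing: `nr.choice` draws *with replacement* by default, so a downsample can contain the same value more than once. The sketch below contrasts the default with `replace=False`. The fixed seed and the small `population` array are assumptions made purely for illustration; they are not part of the notebook's data.

```python
import numpy as np
import numpy.random as nr

nr.seed(0)  # assumption: fixed seed so the illustration is repeatable

population = np.arange(10)  # illustrative stand-in for the nr.normal(...) draws

# Default behaviour: sampling WITH replacement, so repeated values are possible
with_repeats = nr.choice(population, 10)

# replace=False: every drawn value is distinct (size cannot exceed the population)
without_repeats = nr.choice(population, 10, replace=False)

# Compare how many distinct values each draw contains
print(len(np.unique(with_repeats)), len(np.unique(without_repeats)))
```

Sampling with replacement is usually fine for a quick simulation like this one; `replace=False` matters when each observation should appear at most once.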

In [None]:
# Join all three normal random samples into one array
def join_shuf_format(samples, downsample = 1000):
    """ Concatenates sample sets, shuffles them, and returns a pd.Series
    
    Args:
        samples (list): each item is a set of normal distribution random samples
        downsample (int): downsample size of concatenated arrays (default: 1000)
    
    Returns:
        (pd.Series): the shuffled version of joined samples
    """
    rs = np.array([])
    for arr in samples:
        rs = np.concatenate((rs, arr))
    rs = nr.choice(rs, downsample)
    nr.shuffle(rs)
    
    return pd.Series(rs)

In [None]:
downsample_max = 1000
rvs = gen_sample(random_parameters(5))

In [None]:
gauss_mix = join_shuf_format(rvs[:3])
control, treatment = [pd.Series(arr) for arr in rvs[3:]]

In [None]:
df = pd.concat([gauss_mix, control, treatment], axis=1, ignore_index=True)
df.columns=['gauss_mix', 'control', 'treatment']

In [None]:
df.head()

---

# Matplotlib

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Because we ran the `%matplotlib inline` notebook magic, every `matplotlib` plot will render directly in our notebook environment.

## `matplotlib` format strings

`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|

## From scratch

## Multiple plots

`matplotlib` allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.
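
The argument to `subplot` can be written as three separate integers or as a three-digit shorthand: `plt.subplot(311)` means the same as `plt.subplot(3, 1, 1)`, that is, 3 rows, 1 column, index 1. The index counts from 1, left to right and then top to bottom. A minimal sketch on illustrative data (the arrays and variable names are assumptions for this example):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)  # illustrative data

plt.figure()

# subplot(nrows, ncols, index): a 2x2 grid, filling the top row first
top_left = plt.subplot(2, 2, 1)
top_left.plot(x, x)

top_right = plt.subplot(2, 2, 2)
top_right.plot(x, x ** 2)

# 212 is shorthand for (2, 1, 2): a 2x1 grid, second slot,
# which spans the whole bottom half of the canvas
bottom = plt.subplot(212)
bottom.plot(x, np.sqrt(x))
```

After each `subplot` call, that subplot becomes the current one, so plain `plt.plot(...)` calls would also draw into it.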

In [None]:
plt.figure(1)

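# Plot all three columns from df in different subplots
# 3 rows, 1 column; the index starts at 1 (top)
plt.subplot(311)
plt.plot(df.gauss_mix)
plt.subplot(312)
plt.plot(df.control)
plt.subplot(313)
plt.plot(df.treatment)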

# Some plot configuration
plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
plt.show()

In [None]:
# Temporary styles
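with plt.style.context('ggplot'):
    plt.figure(1)

    # Plot all three columns from df in different subplots
    # 3 rows, 1 column; the index starts at 1 (top)
    plt.subplot(311)
    plt.plot(df.gauss_mix)
    plt.subplot(312)
    plt.plot(df.control)
    plt.subplot(313)
    plt.plot(df.treatment)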

    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()

## Histogram

In [None]:
n, bins, patches = plt.hist(df.gauss_mix, facecolor='#5A0BB0', alpha=0.8, rwidth=.8, align='mid')

# Add a title
plt.title('Histogram of gauss_mix')

# Add y axis label
plt.ylabel('Count')

The biggest issue with `matplotlib` isn't a lack of power; it is that it has too much. With great power comes great responsibility. When you are quickly exploring data, you don't want to fiddle with axis limits, colors, figure sizes, and so on. Yes, you *can* make good figures with `matplotlib`, but you probably won't.

## Using pandas `.plot()`

Pandas abstracts away some of those initial issues with data visualization. However, it is still `matplotlib`-esque.

Pandas is built on `numpy` for its calculations, but its plotting is built on `matplotlib`. Therefore, just as any data you get from `pandas` can be used within `numpy`, every plot returned from `pandas` is a `matplotlib` plot and can be modified with `matplotlib`.
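
To make that concrete, here is a minimal sketch showing that the object returned by a `pandas` plotting call is a `matplotlib` `Axes`, so the usual `matplotlib` methods apply to it. The `demo` frame and its `value` column are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# Illustrative data; not part of the notebook's df
demo = pd.DataFrame({'value': np.random.normal(size=500)})

# pandas draws the histogram and hands back a matplotlib Axes...
ax = demo['value'].plot(kind='hist', bins=20)

# ...which can then be refined with ordinary matplotlib methods
ax.set_title('Histogram drawn by pandas')
ax.set_xlabel('value')

print(isinstance(ax, Axes))
```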

In [None]:
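# Scatter plot (the control/treatment column pairing is an illustrative choice)
ax = df.plot(kind='scatter', x='control', y='treatment', alpha=0.3)

# title and axis_labels
ax.set_title('Control vs. treatment')
ax.set_xlabel('control')
ax.set_ylabel('treatment')

plt.show()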

In [None]:
pd.plotting.scatter_matrix(df, alpha = 0.05, figsize=(10,10), 
                                diagonal='kde')

---

# Seaborn

In [None]:
import seaborn as sns

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

However, you can always use `matplotlib`'s `plt.style` instead.
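
As a rough sketch of the `plt.style` route: `plt.style.available` lists the bundled style names, `plt.style.use(name)` applies one for the rest of the session, and `plt.style.context(name)` applies one only inside a `with` block. The example below uses the `ggplot` style and checks the `axes.grid` setting before, inside, and after the block; the variable names are illustrative.

```python
import matplotlib.pyplot as plt

# A few of the style sheets bundled with matplotlib
print(plt.style.available[:5])

grid_before = plt.rcParams['axes.grid']

# The style applies only inside the with-block
with plt.style.context('ggplot'):
    grid_inside = plt.rcParams['axes.grid']  # the ggplot style turns the grid on
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])

# Outside the block, the previous setting is restored
grid_after = plt.rcParams['axes.grid']
```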

### New data to play with

In [None]:
weather = pd.read_table('./datasets/weather.tsv')

## Violin plot

A violin plot is a fancier box plot: it draws a kernel density estimate for each group, which removes the need for 'jitter' to show how the data points are distributed.

In [None]:
sns.set(style='whitegrid', palette='muted')

# One violin per column of the simulated df.
# Plotting `weather` instead depends on its column names, which are not shown here.
f, ax = plt.subplots(figsize=(10, 6))
sns.despine(left=True)
sns.violinplot(data=df, ax=ax)

## Distplot

In [None]:
sns.set(style='whitegrid', palette='muted')

# 2 rows, 2 columns
f, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True)
sns.despine(left=True)

# Note: distplot is deprecated since seaborn 0.11.
# The column plotted (position 0, gauss_mix) is an illustrative choice.

# Regular distplot
sns.distplot(df.iloc[:, 0], ax=axes[0,0])

# Change the color
sns.distplot(df.iloc[:, 0], kde=False, ax=axes[0,1], color='orange')

# Show the Kernel density estimate
sns.distplot(df.iloc[:, 0], hist=False, kde_kws={'shade':True}, ax=axes[1,0], color='purple')

# Show the rug
sns.distplot(df.iloc[:, 0], hist=False, rug=True, ax=axes[1,1], color='green')

## Hexbin with marginal distributions

In [None]:
sns.set(style='ticks')

In [None]:
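# Columns to compare (an illustrative choice from df)
plots_to_join = ['control', 'treatment']
sns.jointplot(x=plots_to_join[0], y=plots_to_join[1], data=df, kind='hex', color='#246068')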

## FacetGrid

In [None]:
real_estate = pd.read_csv('./datasets/real_estate.csv')

In [None]:
real_estate.head()

In [None]:
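sns.set()
# Fill these in with column names seen in real_estate.head()
columns_wanted = []
important_column = None
g = sns.FacetGrid(real_estate.loc[:,columns_wanted], col=important_column, hue=important_column, col_wrap=5)
# 'hj' and 'tv' must name columns included in columns_wanted
g.map(plt.scatter, 'hj', 'tv')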