# Plots & Graphs

Time to unleash the right side of your brain!  Truthfully, we should have been working with plots far more extensively over the last few sections.  Rather than introduce too many topics at once, we intentionally held off on the visual exploration side of things so we could focus on one thing at a time.  Now that the analytical side of EDA is behind us, it's time to see how adding some visual creativity to your arsenal will pay huge dividends!

```{figure} ../images/right_brain.png
---
width: 600px
name: right-brain-fig
---
```

Plots and graphs will come into play at multiple stages during the course of your project.  First, they'll aid in your EDA journey with the basic understanding of what you're working with, investigating raw features, and helping to identify if there are any potential issues in the data.  Then once you start moving into feature engineering and transformations you'll want to begin visualizing the changes and relational effects between your variables.  After that, when you start getting into building models and iterating through versions, you'll need some way to visualize the progress, tuning, and performance of your candidate models.

Up to this point all of the visuals and graphs you've been building are purely for your benefit.  They can be as simple and utilitarian, or as pretty as you want them to be.  The main focus during the proof of concept stages is to find a viable solution that works, so they only need to help you in that pursuit.  The next layer of visualizations you'll need to concern yourself with are external facing.

During the course of a project you'll have multiple and distinct groups of people you'll need to communicate with, all at different levels of detail.  You'll have data science peers that will need to understand the low-level and technical aspects of your program.  You'll have business sponsors who just want to understand that you solved their problem in some way, and don't need all of the gory math and technical details.  And finally, you'll have executive leadership teams who want _very_ simple and high-level details of your success story so they can understand the value you're bringing to the organization.   As you'll come to find out, visual graphics are often worth a thousand words, and can go a long way in selling your story to all of these different stakeholders.

<h3>Plotting In Python</h3>

Before going too far, we need to lay out some of the most popular library options for plotting in Python.  There are many other popular ones you could choose from, but what follows is a pretty typical starting point for your foundational learning.  Expand from here and find out what works for you in practice.

Take a look at the documentation and example galleries from the following to get a taste of what's possible.

__<a href="https://matplotlib.org/">Matplotlib</a>__ is the oldest and still the most popular plotting extension.  No question it works, and you'll find lots of online support if you need it.  In my opinion, it's a bit verbose to use as your top level administration though.  There are several packages offering better quality graphics with less code (interestingly enough, probably built on top of matplotlib...).  So the only time I'll use matplotlib these days is if it's something _extremely_ simple and I just need one or two lines of code to make it happen.  Other than that, it'll be one of these others that leverage matplotlib underneath the hood.  Your mileage may vary of course.

__<a href="https://seaborn.pydata.org/tutorial.html">Seaborn</a>__ is a library package built on top of matplotlib that addresses most of my gripes with matplotlib.  It's much more concise coding with better quality results.  I will absolutely use seaborn in my workflow depending on the graph I need, but it's not my primary library.  I favor what's known as the "Grammar of Graphics" {cite}`Wilkinson_2005`, which is an entirely novel approach to plotting that we'll get to when we discuss plotnine below.  Having said all of that, seaborn is currently being updated from its core library to a grammar of graphics interface known as <a href="https://seaborn.pydata.org/tutorial/objects_interface.html">seaborn.objects</a>.  I guess the word is out that it's a superior methodology.  I won't cover the new seaborn.objects interface here because, frankly, I haven't taken the time to learn it yet; my needs are met elsewhere.  I have a sneaking suspicion that this library is more actively being developed so I will likely check it out again after it's fully built out.

__<a href="https://pandas.pydata.org/">Pandas</a>__ should sound familiar to you if you've been beefing up your coding skills.  The extremely popular pandas library has plotting functionality that chains and works extremely well in any pandas workflow.  I will often use pandas plotting when I'm visualizing a single series or something very simple.  If it's more than that, I'll tend to default to my package of choice up next.

__<a href="https://plotnine.readthedocs.io/en/v0.12.4/#">Plotnine</a>__ is the current standard for implementing the grammar of graphics for Python.  It's modeled after the go-to R programming plotting library by Hadley Wickham, ggplot2 {cite}`ggplot2_2016`.  Since I came to Python from R, this was a no-brainer for me.  The value Hadley and team introduced and gave to the analytics community by incorporating this standardized interface cannot be overstated.  It's a super logical way to build up a plot layer by layer with exactly what you need.  It won't always be the most concise code, especially for simple plots, but once you learn the syntax you won't really mind.  Fair enough to say I'm a ggplot fan, and by extension, a plotnine fan.  The only downsides I've run into with the plotnine implementation are that it's missing a few of the more advanced things that ggplot in R has, and if you're looking for interactivity you may want to consider another option such as <a href="https://plotly.com/python/">Plotly</a>.

What follows is not intended to be a tutorial on Python plotting code or techniques, but more generally about the application of plotting methodologies in various stages of your projects.  We will mostly use the plotnine design in this book, but feel free to use whatever works for you.  That's what's important.

Ok, now that we have all of that out of the way, let's talk about how we need to start thinking about plots and graphs.

<h3>Types of Plots</h3>

First off, you'll often see the words "plots" and "graphs" used interchangeably, and they're usually intended to mean the same thing.  These are simply visual graphics of some kind.  "Plots" more commonly refers to the technical side of what we're doing in our application environments, and serves as a catch-all to encompass all kinds of visuals, from proper graphs to images.  "Graphs" will mostly refer to the visualization of numeric or categorical data in some way.

{numref}`plot-types-fig` below, borrowed and gently updated from the <a href="https://seaborn.pydata.org/tutorial.html">Seaborn User Guide and Tutorial</a>, is about as straightforward an illustration as I've ever seen for laying out the most common graphs and their uses.  There are of course other, more exotic plot types for specific purposes, such as heatmaps, area plots, autocorrelation plots, 3D graphs, graph networks, images, word clouds, animated/interactive plots, and so on, but what we'll review will cover the overwhelming majority of our needs for now.

```{figure} ../images/plot_types.png
---
width: 500px
name: plot-types-fig
---
Source: https://seaborn.pydata.org/tutorial/function_overview.html
```

We'll discuss each pillar of plot types in more detail when we get to its section, but for now hopefully you can see there are really only a few different reasons you might want to display data graphically, i.e. for relational, distribution, and categorical data.  That's pretty much it.

You either want to compare some values in some way __(relational)__, investigate the dispersion of a data sample under study __(distributions)__, or you want to visualize __categorical__ (non-numeric) data somehow.  And all of the plot types under each pillar simply give you options and flexibility to respond to the exact peculiarities of your actual data.  Honestly, you could probably learn 3 or 4 different types of graphs and be all set for 95% of the analytics we need to perform.  Of course we love having the flexibility and options to really dial in our graphics, but you could get by nonetheless.

We'll use {numref}`plot-types-fig` as the reference to motivate our examples below.

A couple of final thoughts before jumping in.

1. Begin by thinking about what data you have before worrying about graphs and types of plots
   - What are your numeric variables?  Are there categorical ones (if so, are they the main feature, or secondary data that you want to use to highlight or separate your numeric values in some way)?  Etc.
2. Simple is _always_ better
   - It's generally accepted that going overboard and adding too many dimensions to your graphics is not a good idea (data points, grouping, faceting, color, size, and shape aesthetics all at once)
   - Speaking from personal experience, I can tell you it will overwhelm your non-technical business people and do more harm than good
3. There's a difference between working (exploratory) graphics and presentation-quality graphics
   - For exploratory, if you tried something and it reveals interesting information, great; if not, that's ok too, just move on
   - For presentation, do not leave anything in your graph that does not add value
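
One way to take that inventory before picking a plot is to split your columns by type.  The sketch below uses a small made-up frame (the column names mimic the tips data we'll meet shortly, but the values are invented for illustration) and pandas' `select_dtypes`.

```python
import pandas as pd

# Toy frame with invented values; the column names only mimic the tips data
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "size": [2, 3, 3, 2],
    "sex": ["Female", "Male", "Male", "Male"],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

# Numeric columns are candidates for the x and y axes
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Non-numeric columns are candidates for color, shape, or faceting
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

It's only a starting point.  A numeric column like party size can still behave like a category, so treat the split as a first pass rather than a rule.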

Enough words.  It'll be easier if we get going with some examples.

<h3>Relational Plots</h3>

These are probably the most common types of graphs we work with.  Relational means you have some variable of interest you'd like to compare with some other phenomenon, be it another variable(s) or time.  Sounds complicated, but it's really pretty straightforward.

At the risk of unoriginality, we'll use the "tips" dataset to work through our examples below.  This popular dataset holds the record of tips received by a waiter over the course of several months.  Features present are the tip amount, bill total, sex/gender of the payer, if any smokers were present in the party, the day of the week, the time of day, and the size of the party.

Let's take a look.

In [None]:
# Import PydataSet Library
from pydataset import data

# Read in the "tips" dataset
dat = data("tips")
dat.head(10)

__<h5>Scatter Plots</h5>__

Scatter plots are excellent places to get started because they're pretty simple, and usually involve only a couple of variables.  While these types of plots could be used with categorical data such as Week Day on the x axis, it makes much more sense to view the relationships between numeric features.  Here we have three numeric variables in "total_bill", "tip", and "size".

You'll come to start thinking about your data in terms of primary and secondary, and/or aesthetic variables, meaning information you could use to segment or further subclassify your data to give further contextual information.

In [None]:
# Load library
import warnings
import plotnine as gg
warnings.filterwarnings('ignore')

# Plot scatterplot
gg.ggplot(dat, gg.aes(x = 'total_bill', y = 'tip')) + gg.geom_point()

Cool.  Our first exploratory plot!  We're relating the size of each individual _tip_ amount to the overall size of their _total bill_.  So what can we see?  Maybe you noticed a nice linear trend as the total bill gets larger, generally so does the size of the tip?  Maybe you noticed a few points that do _not_ follow those same general patterns?  Like the anomaly of someone tipping over \$5 on a \$7 bill?

Good start, but what else could we do here?  Maybe you thought about adding an additional segmentation _aesthetic_ using color in some way?  How about if we highlight each point by whether the payer was male or female?

In [None]:
# Plot simple scatterplot with gender color aesthetic
gg.ggplot(dat, gg.aes(x = 'total_bill', y = 'tip', color = 'sex')) + gg.geom_point()

Hmm... unfortunately looks like pretty uniform dispersion so it doesn't appear to be too informational.  What if we add linear fit trendlines for each gender?

In [None]:
# Plot with lm fit
(gg.ggplot(dat, gg.aes(x = 'total_bill', y = 'tip', color = 'sex')) +
     gg.geom_point() +
     gg.geom_smooth(method = "lm", se = False))

As we first suspected.  Not much difference based on the sexes; however, the up sloping trendlines do confirm what we first observed in a general up trend of tipping more as the total bill increases.
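
If you're curious what those trendlines represent, the "lm" method fits an ordinary least squares line to each color group.  Here's a minimal sketch of the same idea using `np.polyfit` on invented data (the `toy` frame and its group labels are made up so the true slopes are known in advance; it is not the tips data).

```python
import numpy as np
import pandas as pd

# Invented data: tip = 1 + 0.10 * bill for group A, tip = 0.5 + 0.15 * bill for group B
toy = pd.DataFrame({
    "bill":  [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    "tip":   [2.0, 3.0, 4.0, 5.0, 2.0, 3.5, 5.0, 6.5],
    "group": ["A"] * 4 + ["B"] * 4,
})

# Fit a degree-1 polynomial (a straight line) per group by least squares
fits = {}
for name, grp in toy.groupby("group"):
    slope, intercept = np.polyfit(grp["bill"], grp["tip"], deg=1)
    fits[name] = (slope, intercept)

for name, (slope, intercept) in fits.items():
    print(f"{name}: tip = {intercept:.2f} + {slope:.2f} * bill")
```

Comparing the fitted slopes numerically is a handy cross-check when two trendlines look nearly parallel on the plot.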

Let's move on to segment, or facet, by whether or not each group was a "smoking" group or not, and also we'll use a coloring aesthetic to highlight the size of each group.

In [None]:
# Plot with lm fit and standard errors
(gg.ggplot(dat, gg.aes(x = 'total_bill', y = 'tip', color = 'size')) +
     gg.geom_point() +
     gg.geom_smooth(method = "lm") +
     gg.facet_wrap('smoker'))

Anything jump out at you?  There's a ton of information in this one, but I'll leave it to you to pick it apart.

Ok, we can't spend all day on scatterplots, but one more just for fun.  Maybe you had the thought to create a new variable to investigate the tip as a percentage of the total bill instead of just the raw dollar amount?  If so, nicely done.  Let's try it and see what happens.

In [None]:
# Create tip % variable
dat['tip_pct'] = dat.tip / dat.total_bill

# Plot with "smoker" facet
(gg.ggplot(dat, gg.aes(x = 'total_bill', y = 'tip_pct', color = 'tip')) +
     gg.geom_point() +
     gg.geom_smooth(method = "lm") +
     gg.facet_wrap('smoker'))

Whoa... the sign of the slopes flipped to negative?!?!  I thought we said as the total bill goes up, so does the tip amount???  Well that's true, it does, but as we can see the tip as a percentage of the total bill actually decreases as the bill gets larger.  So while yes it's more money, it's not keeping up on a linear scale.  Sad to see as I'm sure our waiter deserved better.
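
A little arithmetic shows how both things can be true at once.  Suppose, purely as an assumption for illustration, that people tip a flat amount plus a fixed share of the bill, so tip = a + b * bill.  Dividing by the bill gives tip_pct = a / bill + b, which shrinks toward b as the bill grows.  The coefficients below are invented, not estimated from the tips data.

```python
import numpy as np

# Invented linear tipping rule: a flat $1 plus 10% of the bill
a, b = 1.0, 0.10
bill = np.array([10.0, 20.0, 40.0, 80.0])

tip = a + b * bill      # the dollar tip rises with the bill
tip_pct = tip / bill    # equals a / bill + b, which falls as the bill grows

for x, t, p in zip(bill, tip, tip_pct):
    print(f"bill={x:5.0f}  tip={t:4.1f}  tip_pct={p:.3f}")
```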

Alright we need to move to some other types.  Next up, line plots.

__<h5>Line Plots</h5>__

Line plots are pretty straightforward.  These are the stuff of _time series_ data that we discussed in our data types section.  It doesn't really make sense to connect data point by point if they're independent observations that have no dependency across time.  Let's see what we're talking about by reading in a dataset from the St. Louis Federal Reserve with the following features:

- DATE - Month
- CPI - Consumer Price Index
- PCE - Personal Consumption Expenditures
- AHW - Average Hourly Wage
- CSI - Consumer Sentiment Index

Recall from our discussion about time series data types, what you should notice straight away is that this data carries a _time_ component.  This is what allows you to connect the points in time over the x-axis.

In [None]:
import pandas as pd

# Read in FRED data from Github
url = 'https://github.com/bradybr/practical-data-science-and-ml/blob/main/datasets/FRED.csv?raw=true'
dat = pd.read_csv(url)
dat.head(5)

In [None]:
# Update DATE to datetime
dat['DATE'] = pd.to_datetime(dat['DATE'])

# Print statistical summaries
dat.describe().T

We can see that some of these variables are indexes, and others are on very different scales, so let's start with just one series before getting too complicated.

In [None]:
# CPI Time Series line plot
(gg.ggplot(dat, gg.aes(x = 'DATE', y = 'CPI')) +
    gg.geom_line(color = '#4E68EC') + 
    gg.ggtitle("CPI - Consumer Price Index") +  
    gg.theme(axis_text_x = gg.element_text(rotation = 45, hjust = 1)))

Now what if we wanted to plot all 4 of the series together?  Could we do that?  Unfortunately no, not without normalizing or standardizing each individual series to be on the same scale.  Otherwise, the large values of the PCE series would overwhelm the smaller ones and you would not be able to see any variance.  There is another way we can get them all on one plot _without_ adjusting the values though, and it's called faceting.  Faceting essentially breaks out each series into its own plot, and then we set the "scales" parameter to "free_y", which allows each plot to set its own y-axis range instead of sharing.

See below.

In [None]:
# Melt data from wide to long
melted_dat = pd.melt(dat, id_vars = 'DATE')

# Facet Plot
(gg.ggplot(melted_dat, gg.aes(y = 'value', x = 'DATE', color = 'variable')) +
    gg.geom_line() +
    gg.theme(axis_text_x = gg.element_text(rotation = 45, hjust = 1)) +
    gg.facet_wrap('variable', scales = 'free_y'))
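
If the melt step felt like magic, here's a tiny sketch of what it does.  Faceting and color mappings want "long" data with one row per date and series.  The frame below is invented (two dates, two series) just to make the reshaping visible.

```python
import pandas as pd

# Invented wide frame: one row per date, one column per series
wide = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "CPI": [258.0, 259.0],
    "CSI": [99.8, 101.0],
})

# Long format: one row per (date, series) pair
long = pd.melt(wide, id_vars="DATE")
print(long)
```

The series names end up in a column called `variable` and the numbers in `value`, which is why those two names appear in the plotting code above.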

One more for good measure.  We're jumping ahead a bit here to feature engineering and transformations, but let's see how we'd normalize each series so we could get them all on one graph.

There are several ways we could do this, but we're going to keep it simple and create a normalized index out of each variable.

In [None]:
# Set the variables to normalize (named to avoid shadowing the built-in vars())
norm_vars = ['CPI','PCE','AHW','CSI']

# Normalize by dividing by the first value of each series
dat[norm_vars] = dat[norm_vars].div(dat[norm_vars].iloc[0])

# Melt data from wide to long
melted_dat = pd.melt(dat, id_vars = 'DATE')

In [None]:
# Normalized Series Plot
(gg.ggplot(melted_dat, gg.aes(y = 'value', x = 'DATE', color = 'variable')) +
    gg.geom_line() +
    gg.ggtitle("Normalized Series") +  
    gg.theme(axis_text_x = gg.element_text(rotation = 45, hjust = 1)))
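
Indexing to the first value is only one of the options.  As a hedged sketch of the alternative mentioned earlier, the block below compares it with z-score standardization on an invented two-column frame (the column names and values are made up).

```python
import pandas as pd

# Invented series on very different scales
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [1000.0, 3000.0, 2000.0, 4000.0],
})

# Option 1: index each series to its first observation
indexed = toy.div(toy.iloc[0])

# Option 2: z-score standardization (subtract the mean, divide by the sample std)
standardized = (toy - toy.mean()) / toy.std()

print(indexed)
print(standardized.round(3))
```

The index keeps an intuitive "multiple of the starting value" reading, while z-scores put every series on a mean of 0 and a standard deviation of 1.  Which one you want depends on the story you're telling.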

<h3>Distribution Plots</h3>

Plots that show the distribution of the data, i.e. its center, spread, and shape, can be extremely helpful in understanding the behavior of our sample under study.  We'll start with the most common and cover a few lesser-used ones as well.

Continuing with the tips dataset.


In [None]:
# "dat" currently holds the FRED series, so preview the tips data directly
data("tips").head(10)

__<h5>Histograms</h5>__

Histograms are probably the most familiar, and the most common to visualize the frequency counts of data series.  These graphs show the frequency of numerical data in the form of vertical rectangles illustrating how many times that value range was observed in the data (higher = more frequent).

Let's see what the range of values looks like in our "total_bill" variable.

In [None]:
# Re-read in the tips data & create tip % variable again
dat = data("tips")
dat['tip_pct'] = dat.tip / dat.total_bill

# Print summary statistics
dat[['total_bill']].describe().T

We can see that the average value is around \$20, with an interquartile range of roughly \$13-24.  Now let's see what this looks like plotted in a histogram.

In [None]:
# Plot histogram
(gg.ggplot(dat, gg.aes('total_bill')) +
     gg.geom_histogram(binwidth = 3) +
     gg.geom_vline(xintercept = 20, color = "blue", size = 1.5) + 
     gg.annotate("rect", xmin = 13, xmax = 24, ymin = 0, ymax = 45, alpha = 0.4, fill = "#4E68EC"))

Nice!  That's pretty much exactly what we see.  Our data is right (positive) skewed, shown in that sweeping tail out to the right of our histogram.  It's a little hard to see the nuances of the full distribution though because of the "coarseness".  This is due to the width of the bins chosen in the plotting function.  Histograms have parameters called "bins" and "binwidth" that you can adjust to change the number or the width of the discrete ranges on the x-axis.  Let's try making it more granular by shrinking the "binwidth" parameter.

In [None]:
# Plot histogram with more granular "binwidth"
(gg.ggplot(dat, gg.aes('total_bill')) +
     gg.geom_histogram(binwidth = 1) +
     gg.geom_vline(xintercept = 20, color = "blue", size = 1.5) + 
     gg.annotate("rect", xmin = 13, xmax = 24, ymin = 0, ymax = 25, alpha = 0.4, fill = "#4E68EC"))

Interesting.  See how we lost all of the aggregated groupings?  Histograms are sensitive to the number of bins, and it can take some trial and error to really see what's going on.  When you start wanting to see this kind of granularity, maybe you want the next kind of plot we'll look at, called density plots.
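
To see the bin width effect in raw numbers, here's a small sketch using `np.histogram` on invented bill amounts.  The `bin_counts` helper is hypothetical and written just for this illustration, and the bin edges it builds are not necessarily the ones plotnine would pick.

```python
import numpy as np

# Invented bill amounts
bills = np.array([8, 11, 12, 14, 15, 15, 17, 18, 21, 22, 24, 29, 35, 48], dtype=float)

def bin_counts(values, binwidth):
    """Count observations in fixed-width bins that cover the whole sample."""
    start = np.floor(values.min() / binwidth) * binwidth
    stop = np.ceil(values.max() / binwidth) * binwidth + binwidth
    edges = np.arange(start, stop, binwidth)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

coarse, coarse_edges = bin_counts(bills, binwidth=10)
fine, fine_edges = bin_counts(bills, binwidth=2)

print(coarse)   # a few tall bars
print(fine)     # many short bars, several of them empty
```

Both versions count the same 14 observations.  The only thing that changes is how finely we slice the x-axis.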

__<h5>Density Plots</h5>__

If the aggregated coarseness of the histogram was too much for you, maybe you're looking for a smoothed version that will nicely show the overall distribution of data.  Enter the density plot.  Here we lose the frequency counts in favor of the probability density function of a variable.  Sounds complicated, however the same idea applies as with the histogram where the larger more peaked areas occur more frequently.  Simple as that.

In [None]:
# Density plot
(gg.ggplot(dat) +
     gg.geom_density(gg.aes('total_bill'), fill = "#4E68EC", alpha = .4))
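
Under the hood, a density curve like this is typically a kernel density estimate: a small bump is placed on every observation and the bumps are averaged.  Below is a minimal sketch with Gaussian bumps.  The sample, the grid, and the bandwidth of 3.0 are all assumptions chosen for illustration, and plotting libraries usually pick the bandwidth for you.

```python
import numpy as np

def gaussian_kde_curve(sample, grid, bandwidth):
    """Average a Gaussian bump centered on every observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Invented sample and a hand-picked bandwidth
sample = np.array([8.0, 12.0, 13.0, 15.0, 16.0, 18.0, 21.0, 24.0, 30.0, 45.0])
grid = np.linspace(-20, 80, 2001)
density = gaussian_kde_curve(sample, grid, bandwidth=3.0)

# A density should integrate to about 1 over its range (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 4))
```

That area of roughly 1 is the reason the y-axis on a density plot shows small decimals instead of counts.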

Let's overlay our density curve on top of our histogram and see how they complement each other.

In [None]:
# Plot density curve over the histogram
(gg.ggplot(dat, gg.aes(x = "total_bill", y = gg.after_stat("density"))) +
     gg.geom_histogram() +
     gg.geom_density(fill = "#4E68EC", alpha = .4) +
     gg.scale_fill_discrete(guide = False))

Very nice.  We can see the obvious differences between the two representations, but also where they share the same distributional shape.  Sometimes you don't need or want to see the frequency steps and jagged jumps, and would prefer to see the smooth curve.

Now, let's take one last look by visualizing tips and total bill amounts on the same graph overlaying their individual density curves.  

In [None]:
import pandas as pd

# Melt dataset from wide to long
melted_dat = pd.melt(dat, id_vars = ['sex','smoker','day','time'])

# Plot multiple density plots
(gg.ggplot(melted_dat[~melted_dat.variable.isin(['tip_pct','size'])], gg.aes(x = 'value', color = 'variable')) +
     gg.geom_density())

__<h5>Empirical Cumulative Distribution Function (ECDF)</h5>__

On to another distribution function, one that calculates the fraction of observations that are less than or equal to a specific point in the distribution of the sample.  Empirical distribution functions are extremely beneficial when you need to model or simulate from an observed (empirical) sample because a theoretical distribution will not fit or work.
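
That definition translates almost word for word into code.  Here's a minimal sketch on an invented sample, where `ecdf_at` is a hypothetical helper written for illustration rather than a library function.

```python
import numpy as np

def ecdf_at(sample, x):
    """Fraction of observations in `sample` that are <= x."""
    sample = np.sort(np.asarray(sample))
    # side="right" counts every value less than or equal to x
    return np.searchsorted(sample, x, side="right") / len(sample)

# Invented sample
sample = [3, 7, 7, 10, 12]

print(ecdf_at(sample, 7))    # 3 of the 5 values are <= 7
print(ecdf_at(sample, 2))    # nothing is <= 2
print(ecdf_at(sample, 12))   # everything is <= 12
```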

This one is easier to show than to explain so we'll illustrate by simulating playing a fair coin flipping game that either wins 1 token or loses 1 token with each flip, governed by a 50/50 Bernoulli distribution.

Each game we'll start with 10 tokens and will consist of flipping a coin 1,000 times.  We'll simulate playing the game 100 times.  Each game will terminate either by running out of tokens if we hit zero, or by hitting the maximum number of flips set to 1,000.  For each game, we'll run a cumulative sum over all of the outcomes, then find the maximum total tokens we held at any one point in each game, and finally we'll calculate the ECDF on the maximums to understand the probability curve of the most tokens you could have expected to hold at your highest point.

Sounds complicated, but it's actually pretty straightforward in practice.  Check it out below.

In [411]:
import numpy as np

# Game function that wins +1 or costs us -1 token with each flip of a coin
def play_fun(start_amt = 10, n_flips = 1000, win_prob = .5):
    flips = np.random.choice([1,-1], size = n_flips, p = [win_prob, (1 - win_prob)])
    flips = np.insert(np.cumsum(flips) + start_amt, 0, start_amt)

    # Truncate the series if we hit 0 and ran out of tokens
    lost = np.flatnonzero(flips == 0)

    if len(lost) > 0:
        flips = flips[:lost[0] + 1]

    return flips

# Play the game n times and find the maximum cumulative total from each time we played the game
n = 100
tokens_seq = [play_fun() for i in range(0, n)]
max_tokens = [max(x) for x in tokens_seq]

In [None]:
from statsmodels.distributions.empirical_distribution import ECDF

# Calculate the empirical cumulative distribution function & create a dataframe
ecdf = ECDF(max_tokens)
ecdf_dat = pd.DataFrame({'x': ecdf.x,
                         'y': ecdf.y})

In [None]:
# Plot CDF
(gg.ggplot(ecdf_dat, gg.aes(x = 'x', y = 'y')) +
    gg.stat_ecdf(color = "#4E68EC") +
    gg.labs(x = "Tokens", y = "Cumulative Probability") +
    gg.ggtitle("ECDF for the Maximum Tokens Held from " + str(n) + " Simulations"))

Beautiful shape.  What's so cool about this distribution is that now we can begin to ask some questions about expected outcomes.  Want to know where you should have walked away if you wanted to make a profit?  Given some loss tolerance percentage what you could have won?  Etc.  

For example, what if I'm pretty risk averse and want to know how often we never got above where we started with 10 tokens?  Easy enough.  Just call our `ecdf` object with the value you want to know the cumulative probability for.

In [None]:
# Retrieve the cumulative probability of the ecdf at the value of 10 tokens
ecdf(10) 

In [None]:
from IPython.display import Markdown as md
md("Ouch.  Looks like {}% of the time we never held more than our starting 10 tokens, meaning we never sat on a profit.  I don't know about you, but that's not a game I want to play.".format(round(ecdf(10) * 100, 2)))

How do we interpret the graph then?  It's actually pretty simple.  The cumulative probability is the share of observations that were less than or equal to the value on the x-axis, and one minus that share is how often the observations were greater than that value.

Say you're actually pretty risk tolerant and you would like to know where you might want to start thinking about cashing out given a 50% chance of reaching that amount? 

In [None]:
# Set the cumulative probability we want, and find the first token value whose ECDF reaches it
y_prob = .5
x_val = float(ecdf_dat.loc[ecdf_dat.y >= y_prob, 'x'].iloc[0])

# Plot CDF with 50% line indicators
(gg.ggplot(ecdf_dat, gg.aes(x = 'x', y = 'y')) +
    gg.stat_ecdf(color = "#4E68EC") +
    gg.geom_segment(gg.aes(x = 0,
                           xend = x_val,
                           y = y_prob,
                           yend = y_prob),
                    color = 'maroon') +
    gg.geom_segment(gg.aes(x = x_val,
                           xend = x_val,
                           y = 0,
                           yend = y_prob),
                    color = 'maroon') +
    gg.labs(x = "Max Tokens", y = "Cumulative Probability") +
    gg.ggtitle("ECDF for the Maximum Tokens Held from " + str(n) + " Simulations"))

In [None]:
from IPython.display import Markdown as md
x = round(float(ecdf_dat.loc[ecdf_dat.y >= .5, 'x'].iloc[0]))
md("Above we can see that 50% of the time you would fail to get past {} tokens, and 50% of the time you could expect to exceed it.  Definitely a game of chance, and not one of skill.".format(x))

Which makes sense if you think about it, because we started with 10 tokens and we're flipping a fair coin, so we should have approximately the same chance of losing 10 tokens as we do gaining 10!
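
That intuition lines up with the classic gambler's ruin result: with a fair coin, the chance of reaching a target before going broke is start / target, so doubling 10 tokens to 20 should happen about 10 / 20 = 50% of the time.  Here's a quick sketch to check it by simulation.  The `reaches_target_first` helper, the seed, and the number of games are all assumptions for illustration, and unlike the game above there is no 1,000 flip cap.

```python
import numpy as np

def reaches_target_first(start, target, rng):
    """Play a fair +1/-1 game until we hit 0 or `target`; True if target came first."""
    tokens = start
    while 0 < tokens < target:
        tokens += 1 if rng.random() < 0.5 else -1
    return tokens == target

rng = np.random.default_rng(42)   # seed chosen arbitrarily for reproducibility
n_games = 2000
wins = sum(reaches_target_first(10, 20, rng) for _ in range(n_games))

simulated = wins / n_games
theoretical = 10 / 20             # gambler's ruin: start / target for a fair coin
print(simulated, theoretical)
```

The simulated share should land close to the theoretical 0.5, give or take some sampling noise.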

<h3>Categorical Plots</h3>

And finally, our last grouping of plot types:  Categorical Plots.  

These are exactly as they sound, plots for visualizing categorical data in some way.  Some of these will look very similar to what we've seen already, while others may be a little new.  What they'll all have in common though is categorical aggregation or faceting in some way that allows us to segment by these non-numeric groups.

I see no reason to break from using the tips data at this point, so we'll carry on as we have been.

In [None]:
dat.head(10)

__<h5>Bar Plots</h5>__

Bar plots look very similar to histograms, the difference being here we're not spreading the frequencies of a numeric variable out by its range of values across the x-axis, but rather showing the categorical levels of a feature on the x-axis and the counts on the y-axis.

In [None]:
# Counts of categorical levels contained in the "day" feature
(gg.ggplot(dat, gg.aes(x = 'day')) +
    gg.geom_bar() +
    gg.ggtitle("Frequency Counts by Day"))

Of course we can also use the aggregation functionality within the bar plot to show numeric values _by_ these same levels contained within a categorical feature.

In [None]:
# Total summed values for "total_bill" by "day"
(gg.ggplot(dat, gg.aes(x = 'day', y = 'total_bill')) +
    gg.geom_bar(stat = 'identity') +
    gg.ggtitle("Total Bill Amounts by Day"))
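
If you'd like to sanity check the bar heights, the same numbers are easy to get straight from pandas.  The sketch below uses an invented frame (not the tips data) to show the counts a default bar plot displays, and the per-level sums you get when individual rows are stacked on top of each other.

```python
import pandas as pd

# Invented bills
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [20.0, 30.0, 10.0, 15.0, 25.0],
})

# Rows per level: what a default count bar plot shows
counts = toy["day"].value_counts().sort_index()

# Sum per level: the bar heights when each row's value is stacked
totals = toy.groupby("day")["total_bill"].sum()

print(counts)
print(totals)
```

Notice that Sunday has more parties but the same total here.  Counts and sums can tell different stories, so label your y-axis clearly.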

__<h5>Boxplots</h5>__

Boxplots are again similar to some others we've seen, namely bar plots and density plots.  We're essentially using a bar plot layout to view the statistical distribution of the values contained within a feature.  Recall the numeric quantile summaries we've been working with?  Well, these are graphical representations of those summaries laid out like a bar chart.

How might this work?

Curious if there's a statistical difference in tipping by day of the week?  We can use boxplots to nicely visualize the differences!  However, before we do, let's print out the numeric summaries for reference.

In [None]:
# Print summary statistics for tips by day 
dat.groupby('day')['tip'].describe()

In [None]:
# Boxplot for tips by day
(gg.ggplot(dat, gg.aes(x = 'day', y = 'tip')) +
    gg.geom_boxplot())

Nice.  The minimums and maximums are easily identifiable.  Now, remember the discussion about the interquartile range (25th-75th percentiles)?  Those are observable in the plot as the bottom and top edges of the boxes.  The thickest line in the middle of the box is the median value.  And see those vertical lines sticking out above and below each box?  Those whiskers reach out to the most extreme observations within 1.5x the IQR of the box, the same range we've been discussing as the upper and lower limits for outliers.  Which leaves the dots beyond those limits.  Can you guess what those are?  Outliers!  Good guess.
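
Here's a minimal sketch of the numbers behind a single box, computed by hand with numpy on invented tip amounts.  The exact quantile method a plotting library uses may differ slightly, so treat this as the general recipe rather than plotnine's exact calculation.

```python
import numpy as np

# Invented tips with one unusually large value
tips = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 10.0])

q1, median, q3 = np.percentile(tips, [25, 50, 75])
iqr = q3 - q1

# Fences: anything more than 1.5 * IQR beyond the box is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences
inside = tips[(tips >= lower_fence) & (tips <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = tips[(tips < lower_fence) | (tips > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers)
```

In this toy sample the \$10 tip sits beyond the upper fence, so it would be drawn as a lone dot above the whisker.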

We could also do some coloring and additional grouping and faceting if we'd like to get a little more information in one graphic, just don't forget that _simple is always better_, as things can get overly complicated super quickly.

In [None]:
# Plot boxplot with size facet and color aesthetic by gender
(gg.ggplot(dat, gg.aes(x = 'day', y = 'tip', color = 'sex')) +
    gg.geom_boxplot() +
    gg.ggtitle("Tips by Day, Party Size, and Gender") +
    gg.facet_wrap('size'))

__<h5>Violin Plots</h5>__

Next up is one of my favorite types of plots, but also unfortunately one of the least used by others.  Sad, because it's really an informative one.

Let's start with throwing boxplots under the bus first.  Did you happen to notice a potential issue with boxplots?  It's ok if you didn't catch it.  Yes, these plots show a graphical representation of the numeric distribution of a variable, but they can easily obfuscate the real underlying spread.  Let's show an example to make this clear.

First, we'll simulate a bimodal distribution and see what it looks like in a histogram and density curve.

In [None]:
# Create a bimodal distribution
x1 = np.random.beta(5, 1.5, n)
x2 = np.random.beta(1.5, 5, n)
X = pd.DataFrame({ 'x' : np.concatenate([x1, x2]) }) 

# Plot density curve over histogram
(gg.ggplot(X, gg.aes(x = 'x', y = gg.after_stat("density"))) +
     gg.geom_histogram(binwidth = .1) +
     gg.geom_density(fill = "#4E68EC", alpha = .4) +
     gg.scale_fill_discrete(guide = False))

Now, let's use the same data and print the numeric summary statistics, and then plot a boxplot and violin plot side by side to visualize.

In [None]:
# Print summary statistics
X[['x']].describe().T

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Plot a boxplot and violin plot sharing the same y-axis
fig, axs = plt.subplots(ncols = 2, sharey = True)
sns.boxplot(y = 'x', data = X, ax = axs[0])
sns.violinplot(y = 'x', split = True, inner = "quart", cut = 0, data = X, ax = axs[1])

Very, very interesting.  Hopefully you can see the clear differences.  Would you really have been able to understand the distribution of data from the boxplot by itself?  The violin plot clearly shows the underlying distribution much better than the boxplot does.  It's essentially a density plot turned on its side with the quartiles overlaid.  Super informative.

This example is just a simple caution.  While boxplots are awesome and super helpful, they can also easily hide distributional information if the data is non-normal.  And this isn't to say violin plots are infallible either.  Simply understand every plot type has its strengths and weaknesses, and try not to fall in love with just one type and you'll be fine.
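
To put a number on how a box can mislead, here's a tiny sketch with an invented two-cluster sample.  The median lands right in the middle of the box, in a region where there are no observations at all.

```python
import numpy as np

# Invented bimodal sample: one cluster near 0.2, another near 0.8
low = np.array([0.15, 0.18, 0.20, 0.22, 0.25])
high = np.array([0.75, 0.78, 0.80, 0.82, 0.85])
sample = np.concatenate([low, high])

q1, median, q3 = np.percentile(sample, [25, 50, 75])

# How many observations actually sit within 0.2 of the median?
near_median = np.sum(np.abs(sample - median) < 0.2)

print(q1, median, q3)
print(near_median)
```

A box drawn from these quartiles would look perfectly ordinary, which is exactly why it's worth glancing at a density or violin plot too.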

Back to violin plots.  We can do some very cool things with these by splitting, coloring, and faceting.  Here's just a taste below if you wanted to see the difference in tipping percentage by gender, and whether or not they were in a party with a smoker.  It would be nearly impossible to see this same amount of information in such a succinct way in a boxplot.

In [None]:
# Plot split violin plot
sns.violinplot(data = dat, x = "smoker", y = "tip_pct", hue = "sex", inner = "quart", split = True)

__<h5>Mosaic Plots</h5>__

And now our final plot type we're going to show: Mosaic Plots.

Mosaic plots are awesome and simple ways to plot multiple categorical variables against each other.  Under the hood, mosaic plots leverage what's known as _crosstabs_, which calculate the overlapping counts (or percentages) amongst the categorical groups.

Let's see how it works using the sex and day variables in the tips data.  First, we show a crosstab table to illustrate the actual numbers behind the graph.

In [None]:
# Print the crosstab table for sex and day
pd.crosstab(dat['sex'], dat['day'])

And now the mosaic plot, which is just a graphical representation of the crosstab counts above.

In [None]:
from statsmodels.graphics.mosaicplot import mosaic

# Plot 2 feature mosaic plot
mosaic(dat, ['sex', 'day'])
plt.show()
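
The size of each tile in a mosaic plot reflects that cell's share of all observations.  As a small sketch, `pd.crosstab` can return those shares directly with its `normalize` argument.  The frame below is invented, not the tips data.

```python
import pandas as pd

# Invented categorical data
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male", "Male"],
    "day": ["Sat", "Sun", "Sat", "Sat", "Sun", "Sun"],
})

# Raw counts per combination
counts = pd.crosstab(toy["sex"], toy["day"])

# normalize="all" divides every cell by the grand total
shares = pd.crosstab(toy["sex"], toy["day"], normalize="all")

print(counts)
print(shares.round(3))
```

Printing the shares next to the plot is an easy way to double check that the biggest tile really is the biggest group.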

<h3>What Did We Learn?</h3>

Another long one, I know.  But there are so many different ways to visualize your data and we wanted to try to expose you to as many useful methodologies as possible.  You'll start to get comfortable with a plotting library and syntax you prefer over time, and with practice you'll be able to quickly associate different plot types with the specifics of your data at hand.  And as you begin trying these out for yourself, you'll run across problems and new details you'd like to dial in and tweak.  Google, Stack Overflow, and the library documentation will again come to your rescue.

The last thought to leave you with is that you'll quickly come to realize there are many dimensions we can force into a graph, but just because we can doesn't mean we should.  Graphs should not be complicated or confusing.  The point you're trying to make should jump off the image at the viewer, and rarely should require any outside explanation.

Remember, _simple is always better_.

Happy plotting.