# Data Visualization
*Author: [Douglas Strodtman](http://linkedin.com/in/dstrodtman/)*



Today's lesson will focus on visualizations using Pandas, Matplotlib, and Seaborn. By the end of the lesson you will feel empowered to make quick plots to visually explore trends in your data as well as feel comfortable customizing plots to make them ready for professional reports.

## Lesson Overview

1. Basic Pandas plotting
    - histogram
    - bar plot
    - scatter plot
    - line plot
1. Seaborn
    - heat map
    - pair plot
    - dist plot
    - box plot
1. Matplotlib
    - basic plotting functionality
    - plotting multiple plots to the same axes
    - customization

This lesson continues to build on the data that we've previously explored. We'll largely be rehashing previous investigations from lessons and labs, but now focusing on representing insights in visual format.

## Module Import
In addition to Pandas, we'll be importing Matplotlib and Seaborn.

Note that `pyplot` is a submodule of Matplotlib, and that both of these plotting modules have standard aliases.
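A minimal sketch of the imports, using the conventional aliases (`pd`, `plt`, `sns`). The commented-out `%matplotlib inline` magic is an assumption about working in a Jupyter notebook; skip it anywhere else.

```python
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the Matplotlib submodule used for plotting
import seaborn as sns

# In a Jupyter notebook, this magic renders plots directly below each cell:
# %matplotlib inline
```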

## Data Import
Load the data we cleaned in the last lesson.

And, as always, check that your data loaded as expected.
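Loading and checking might look something like the sketch below. The file from the last lesson isn't named here, so this uses a tiny in-memory CSV with made-up rows as a stand-in; the column names come from this lesson, but the values are purely illustrative. With your own data, pass the path of your cleaned file to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned file: made-up rows, real column names.
# In practice, replace io.StringIO(csv_text) with the path to your own CSV.
csv_text = """budget_fiscal_year,department_name,account_group_name,total_budget,total_expenditures
2017,Police,Salaries,120000.0,118500.0
2017,Library,Equipment,8000.0,7600.0
2018,Police,Salaries,125000.0,121000.0
2018,Library,,9000.0,9400.0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())           # first few rows
print(df.shape)            # (rows, columns)
print(df.dtypes)           # did numeric columns load as numbers?
print(df.isnull().sum())   # missing values per column
```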

## A soapbox on plotting

First, you should read [this blog post](https://towardsdatascience.com/storytelling-with-data-a-data-visualization-guide-for-business-professionals-97d50512b407), which gives a brief synopsis of some of the key points presented in [_Storytelling with Data: A Data Visualization Guide for Business Professionals_](https://www.amazon.com/gp/product/1119002257/).

Creating good plots isn't easy, but it's worth the time and effort. Here's another fun [post that highlights some of the best and worst visualizations from 2018](https://www.kdnuggets.com/2019/02/best-worst-data-visualization-2018.html).

![](https://imgs.xkcd.com/comics/convincing.png)

Sometimes we're seeking to just create a quick and dirty visualization to get a snapshot into our data. I think of these as _images_ rather than _plots_. My basic criteria for a visualization to be considered a plot are:

1. It has a title.
1. Axes are labeled.
1. A legend is provided (if more than one color is used).
1. The correct _type_ of plot was selected...
1. ... to demonstrate a valid relationship.
1. Interpretation and context are provided.

These are the absolute bare minimum criteria. We should always try to:

1. Give consideration to scale.
1. Make all text legible for the intended presentation format.
1. Eliminate redundant information.
1. Where appropriate, order data to aid interpretability.
1. Choose colors that are color-blind friendly AND clearly distinguishable.

Without going too deep, here are some things to always keep in mind:

- Visual literacy, data literacy, technical literacy, and business acumen will vary drastically by audience. Make sure that you choose plotting methods that are appropriate to the consumers of your visualizations.
- Plots for written reports can be dense, and oftentimes may be provided to allow the audience to visually explore relationships of interest to them. Providing your own commentary on ALL plots will ensure that even if their interests in the data diverge from your own, they can quickly align themselves to why YOU feel the plot is important.
- For presentations, your audience should be able to extract the core message from your plot in **3 seconds**. This allows your audience to focus on your stated message, rather than expending mental energy trying to figure out what you're trying to convey. Reducing clutter, using clear colors with high contrast, and formatting text to be visible from every seat in the meeting room are essential.
- Colors vary _drastically_ on different monitors, TVs, and projectors. Additionally, colorblindness is fairly common, especially in men of European descent. You should do your best to choose colors that will be easily differentiable regardless of technology-based rendering problems or genetic differences.

## Basic plots with Pandas

All that being said, we're going to start off doing some quick and dirty plots with Pandas.

#### What's good about Pandas plotting?

- It's quick
- It's easy
- It ~~generally~~ **often** makes smart choices
- You can always customize it after (more on this later)

#### What's bad about Pandas plotting?

- It makes assumptions
- Customization can be cumbersome
- Generating incorrect plots is easy
- Some default settings are **_bad_**

### Pandas histogram

Histograms are great to show the distribution of a numeric variable in your dataset.

Let's look at the distribution of our `total_budget` column.

This is a bad histogram. We can see that almost all of our values are between 0 and (it looks like) 175k dollars. Let's try using a mask to have a more reasonable histogram.
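Both histograms, the unmasked one and the masked one, can be sketched as follows. The data here is randomly generated and right-skewed to mimic the problem described; the 175k cutoff is just the value eyeballed above, not a recommended threshold.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up, right-skewed budgets standing in for the lesson's `total_budget`
rng = np.random.default_rng(42)
df = pd.DataFrame({"total_budget": rng.lognormal(mean=9, sigma=1.5, size=1000)})

# Full column: a handful of huge values stretch the x-axis
ax_all = df["total_budget"].plot(kind="hist", bins=10)

# Boolean mask that keeps only line items under the illustrative cutoff
mask = df["total_budget"] < 175_000

plt.figure()  # start a fresh figure so the two histograms don't overlap
ax_masked = df.loc[mask, "total_budget"].plot(kind="hist", bins=10)
```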

We still see a majority of our line items are in our first bin, so less than roughly \$20k. We'll leave this as is for now.

### Pandas bar plot

Bar plots are a great way to represent numeric data when you're most interested in the count or total of some value, often grouped by a condition.

Let's make a bar plot of our `account_group_name`.

Even for a quick glimpse into our data, we might prefer to be able to read our group names. It's easy to make a horizontal bar plot instead.
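A sketch of both versions, using a few made-up category values. Filling nulls with a `"Not specified"` label is my own choice here so that missing entries show up as their own bar; the original may have handled them differently.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up categories standing in for the lesson's `account_group_name`
df = pd.DataFrame({
    "account_group_name": ["Salaries", "Equipment", None, None,
                           "Salaries", None, "Services"]
})

# Count entries per group, giving missing values an explicit label
counts = df["account_group_name"].fillna("Not specified").value_counts()

# Vertical bars: long group names end up rotated and hard to read
ax_bar = counts.plot(kind="bar")

# Horizontal bars: group names read left to right
plt.figure()
ax_barh = counts.sort_values().plot(kind="barh")
```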

We can see that most of our entries don't have an `account_group_name` specified. This plot isn't great (how many equipment entries are there?), but we'll move on for now.

### Pandas scatter plot

A scatter plot is useful for demonstrating a relationship *between* two numeric features. Generally, you'll want these to be continuous (think `float`s), although sometimes discrete values (think `int`s) will work.

Here, we'll explore the relationship between `total_budget` and `total_expenditures`.
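As a rough sketch, with randomly generated values standing in for the two columns (the `alpha` setting is just one way to cope with overlapping points):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: expenditures loosely track budgets, plus some noise
rng = np.random.default_rng(1)
budget = rng.uniform(1_000, 100_000, size=200)
df = pd.DataFrame({
    "total_budget": budget,
    "total_expenditures": budget * 0.9 + rng.normal(0, 5_000, size=200),
})

# x and y are column names; alpha makes overlapping points easier to see
ax = df.plot(kind="scatter", x="total_budget", y="total_expenditures", alpha=0.5)
```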

Note that scatter plots imply that there is a relationship between the two variables being explored, and that many people will want to read them as the `y` variable being dependent on the `x` variable.

### Pandas line plot

A line plot is **only** appropriate when there is a sequential/ordinal relationship between points on your X-axis, such as:
- time
- qualitative ratings
- discrete measurements

Things that would be inappropriate on the X-axis (these are real things I've seen):
- state/country names
- zip codes
- monotonically increasing ID

Unfortunately, we only have two years in our currently loaded dataset. For demonstration purposes, we'll do a simple line graph on the grand total of expenditures each year.
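One way to sketch this is to group by year, sum the expenditures, and plot the resulting Series. The four rows below are made up; only the column names come from our data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up line items covering two fiscal years
df = pd.DataFrame({
    "budget_fiscal_year": [2017, 2017, 2018, 2018],
    "total_expenditures": [118500.0, 7600.0, 121000.0, 9400.0],
})

# Grand total per year; the year becomes the index, which Pandas puts on the x-axis
yearly = df.groupby("budget_fiscal_year")["total_expenditures"].sum()
ax = yearly.plot(kind="line")
```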

Definitely not a good plot.

We could quickly download and load some [stock data from Yahoo](https://finance.yahoo.com/quote/AAPL/history/) though:
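Here is a sketch of what that could look like. It assumes the downloaded file has a `Date` column and a `Close` column, as Yahoo's historical-data CSV exports have had; the rows below are invented placeholders, not real prices. With a real download you would pass the file path to `pd.read_csv`.

```python
import io
import pandas as pd
import matplotlib.pyplot as plt

# Invented placeholder rows in the assumed column layout (not real prices)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-01-02,100.0,102.0,99.0,101.0,101.0,1000
2019-01-03,101.0,103.0,100.0,102.5,102.5,1200
2019-01-04,102.5,104.0,101.5,103.0,103.0,900
2019-01-07,103.0,103.5,100.5,101.5,101.5,1100
2019-01-08,101.5,105.0,101.0,104.5,104.5,1500
"""

# Parse dates and use them as the index so the x-axis is a true time axis
stock = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
ax = stock["Close"].plot(kind="line")
```

With dates on the X-axis there is a genuine sequential relationship between points, which is what makes the line appropriate here.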

## Advanced plots with Seaborn

Before proceeding, I want you to note that in the 0.9.0 release, Seaborn added a number of new plotting functions. Many of these are actually the most basic plotting functions, as Seaborn was really designed as a package to provide highly-specialized visualizations difficult to configure in Matplotlib.

#### What's good about Seaborn?
- Generally great default settings
- Provides easy access to difficult plots
- A pretty friendly API built to work with Pandas

#### What's bad about Seaborn?
- Easy to generate overly complicated plots that will crash your kernel
- Some plots are inappropriate for certain audiences and/or presentation slides
- Technical implementation of some plots prevents further customization (or at least creates a high barrier to access)

Here I'll be focusing on a number of plots that I feel Seaborn does exceptionally.

### Seaborn heat map

A heat map applies color to tabular data corresponding to the numbers in each cell. I really like this plot paired with Pandas `.corr` method, which calculates the correlation between variables.

[Pearson's correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) is the covariance of two variables divided by the product of their standard deviations, so it always falls between -1 and 1. Positive values mean that, at the aggregate level, as one variable increases, so does the other. Negative values indicate that the variables trend in opposite directions.
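A minimal sketch of the pairing, on generated data. The diverging `coolwarm` colormap and the fixed `vmin`/`vmax` are my own choices so that zero correlation lands on the neutral color; they aren't the only reasonable settings.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up numeric columns: two strongly related, one unrelated
rng = np.random.default_rng(0)
budget = rng.uniform(1_000, 100_000, size=200)
df = pd.DataFrame({
    "total_budget": budget,
    "total_expenditures": budget * 0.9 + rng.normal(0, 5_000, size=200),
    "unrelated": rng.normal(size=200),
})

# Pairwise Pearson correlations between the numeric columns
corr = df.corr()

# annot=True prints each coefficient in its cell; vmin/vmax pin the color scale to [-1, 1]
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```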

### Seaborn pair plot

The `pairplot` function can be useful when you want to explore all of the pairwise relationships between numeric columns.

The diagonal defaults to a histogram of each variable. Note that paired with the heatmap above, we can now see which relationships resulted in positive and negative correlations.

As you can see, with only 11 numeric fields, this is already almost impossible to interpret. Also, by default our year and the numeric representation of `department` are included.

### Seaborn dist plot

`distplot` can be thought of as basically just a fancy histogram. 

Note that the line being drawn is the kernel density estimate, a smoothed estimate of the probability density function. This line isn't always appropriate to draw, especially as it can visually represent likelihoods for values that fall outside of the bounded expectations.

From the output of our pair plot above, we can see that we don't really have any great histograms to make. I'm going to use `department` for demonstration purposes.

Of course, this isn't a good plot, as we have no reason to think that the `department` column represents a meaningful ordinal relationship.

I would argue that the kde line would be inappropriate if this were plotting a valid relationship, as it suggests discrepancies between discrete bins may actually be smoother than they are.

### Seaborn box plots

Box plots are a great way to visually communicate central tendency and spread for discretely-binned numeric data.

Here, we'll look at the spread of total expenditures on salaries year-to-year.
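Roughly, with generated values standing in for the salary line items (in the real data you would first filter down to the salary rows, and how to do that depends on how they are labeled):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up salary expenditures: 50 line items per fiscal year
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "budget_fiscal_year": np.repeat([2017, 2018], 50),
    "total_expenditures": rng.lognormal(mean=10, sigma=0.5, size=100),
})

# One box per year: median line, interquartile box, whiskers, and outlier points
ax = sns.boxplot(x="budget_fiscal_year", y="total_expenditures", data=df)
```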

### Seaborn dog plot

Make sure you shift-tab to read the doc string.

## Matplotlib Plotting

Now we're going to build out a single plot to visualize a question from our lab:

**Which department had the most line item entries each year?**

We'll start by creating a pivot table, getting rid of any null values (which appear when a department has no entries for a given year), and sorting our counts by their values in 2017.

    dept_by_year = df.pivot_table(values='fund_name', index='department_name', columns='budget_fiscal_year', aggfunc='count')
    dept_by_year.fillna(0, inplace=True)
    dept_by_year.sort_values(2017, inplace=True)

We will now iteratively build up a plot in matplotlib, doing the following:

- Creating a single bar plot from a Series
- Adding a second bar plot
- Changing the figure size
- Adjusting the width of bars
- Adjusting the alignment of bars
- Setting the x and y labels
- Setting the title
- Creating a legend
- Getting the current axis
- Rotating x ticks
- Aligning x ticks
- Changing all font sizes
- Saving figure

Let's get started.
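The finished product might look something like the sketch below, which touches each step in the list. The three departments and their counts are made up, the labels and title are placeholder wording, and the output filename is hypothetical; the pivot table follows the same recipe as above.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up line items standing in for the lesson's DataFrame
df = pd.DataFrame({
    "department_name": ["Police"] * 5 + ["Library"] * 3 + ["Parks"] * 2,
    "budget_fiscal_year": [2017, 2017, 2017, 2018, 2018,
                           2017, 2018, 2018, 2017, 2017],
    "fund_name": ["General"] * 10,
})
dept_by_year = df.pivot_table(values="fund_name", index="department_name",
                              columns="budget_fiscal_year", aggfunc="count")
dept_by_year = dept_by_year.fillna(0).sort_values(2017)

plt.rcParams.update({"font.size": 14})   # change all font sizes at once
fig = plt.figure(figsize=(10, 6))        # figure size in inches

positions = np.arange(len(dept_by_year)) # one x position per department
width = 0.4                              # bar width

# align="edge" with a negative width puts the bar left of the tick,
# a positive width puts it right, so the pair sits side by side
plt.bar(positions, dept_by_year[2017], width=-width, align="edge", label="2017")
plt.bar(positions, dept_by_year[2018], width=width, align="edge", label="2018")

plt.xlabel("Department")
plt.ylabel("Number of line items")
plt.title("Line item entries by department and year")
plt.legend()

ax = plt.gca()                           # get the current axis
ax.set_xticks(positions)
# Rotate the tick labels and right-align them so each ends under its bars
ax.set_xticklabels(dept_by_year.index, rotation=45, ha="right")

plt.tight_layout()
# Hypothetical output location; point this wherever your report needs it
out_path = os.path.join(tempfile.gettempdir(), "dept_by_year.png")
plt.savefig(out_path)
```

Calling `plt.bar` twice before anything is shown is what places both years on the same axes.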