> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:


> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`


> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View --> Cell Toolbar --> None`.

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Principles of Data Visualization With Python

_Author: Dave Yerrington (San Francisco)_

---


### Learning Objectives
*After this lesson, you will be able to:*
- Describe why data visualization is important.
- Identify the characteristics of a great data visualization.
- Describe when you would use a bar chart, pie chart, scatter plot, and histogram.


### Lesson Guide

- [Why Use Data Visualization?](#why_data_viz)
- [Anscombe's Quartet](#anscombe)
- [Attributes of Good Visualization](#viz_attr)
- [Choosing the Right Chart](#chart_choice)
- [Visualization Programming Libraries](#visualization_libraries)
- [Independent Research](#independent_research)
- [Conclusion](#conclusion)


<a id='why_data_viz'></a>

### Discussion: Why Use Data Visualization?

---

In a small group, discuss some of the ways you have used or enjoyed data visualization. Why do you think data visualization is useful? Why is it important?


### Why Use Data Visualization?

---

Because of the way the human brain processes information, charts or graphs that visualize large amounts of complex data are easier to understand than spreadsheets or reports. 

Data visualization is a quick, easy way to convey concepts in a universal manner, and you can experiment with different scenarios by making slight adjustments.

Here's a helpful overview of the importance of data visualization:

[SAS: Data Visualization](http://www.sas.com/en_us/insights/big-data/data-visualization.html)

In [1]:
# Check out my awesome text-based data viz below.
import pandas as pd

df = pd.read_csv("./datasets/sales_info.csv")
df.head()

| Unnamed: 0 | volume_sold | 2015_margin | 2015_q1_sales | 2016_q1_sales |
|---|---|---|---|---|
| 0 | 18.42076 | 93.802281 | 337166.53 | 337804.05 |
| 1 | 4.77651 | 21.082425 | 22351.86 | 21736.63 |
| 2 | 16.602401 | 93.612494 | 277764.46 | 306942.27 |
| 3 | 4.296111 | 16.824704 | 16805.11 | 9307.75 |
| 4 | 8.156023 | 35.011457 | 54411.42 | 58939.9 |


How useful are our basic summary metrics (mean, median, and mode)? What happens if two different data sets share the same values for all of them?

<a id='anscombe'></a>

### Anscombe's Quartet

---

Below are the summary statistics for four data sets. What do you think a plot of each one would look like?

![summary statistics for four different plots](./assets/images/anscombs%20quartet.png)

You can probably already guess the answer: although the four data sets share nearly identical summary statistics,
they look completely different. This becomes obvious when we plot them side by side.

![Anscombe's quartet](./assets/images/anscombs%20quartert%20visualization.png)

These descriptive statistics come from a data set constructed in 1973 by the statistician Francis Anscombe. It is a classic demonstration of the importance of data visualization.

- It highlights the failures of summary statistics.
- It shows the effect of outliers on statistical properties.
- Anscombe's intention was to attack the impression among statisticians that "numerical calculations are exact, but graphs are rough."
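
To check the claim about matching summary statistics yourself, here is a minimal sketch that uses only the Python standard library. The values are the published quartet from Anscombe's 1973 paper. The `pearson_r` helper and the `summary` dictionary are illustrative names, not part of the lesson's data files.

```python
from statistics import mean, variance

# Anscombe's quartet: four x/y data sets of 11 points each.
# Data sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


# Collect the same summary statistics for every data set.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),   # sample variance
        "mean_y": round(mean(ys), 2),
        "var_y": round(variance(ys), 2),
        "corr": round(pearson_r(xs, ys), 2),
    }
    print(name, summary[name])
```

Every data set reports a mean of 9 for x, a mean of about 7.5 for y, and a correlation of about 0.82, so the numbers alone give no hint of how different the four shapes are.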

<a id='viz_attr'></a>

### Attributes of Good Visualization

---

What are some attributes you think are important for data visualizations to have? 

Let's take a look at what Jeffrey Shaffer, who teaches data visualization at the University of Cincinnati, thinks:

![](./assets/images/data%20attributes.png)

Interestingly, some attributes have more of an effect on our brains than others. The ones we tend to focus on most are position, then color, then size.

Let's take a look at three visualizations. Which one catches your attention most? Why?

![](./assets/images/mixed%20shapes.png)

![](./assets/images/squares%20and%20circles.png)

![](./assets/images/color.png)


Let's focus on color for a moment. Generally, in data visualizations, you’re going to use color in one of three ways: sequential, divergent, or categorical. 

Sequential colors are used to show values ordered from low to high.

![sequential](./assets/images/sequential.png)

Divergent colors are used to show ordered values that have a critical midpoint, like an average or zero.

![divergent](./assets/images/divergent.png)

Categorical colors are used to distinguish data that falls into distinct groups.

![categorical](./assets/images/categorical.png)

[Images via MediaShift](http://mediashift.org/2016/02/checklist-does-your-data-visualization-say-what-you-think-it-says/)
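
As a rough illustration of the three color schemes, the sketch below samples colors from Matplotlib's built-in colormaps. The choice of `Blues`, `RdBu`, and `tab10` is an assumption made for this example (any sequential, diverging, or qualitative colormap would do), and `sample_colors` is a hypothetical helper.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def sample_colors(cmap_name, n):
    """Return n RGBA tuples spread evenly across a named Matplotlib colormap."""
    cmap = plt.get_cmap(cmap_name)
    return [cmap(i / (n - 1)) for i in range(n)]


# Sequential: light to dark, for values ordered from low to high.
sequential = sample_colors("Blues", 5)

# Divergent: two hues that meet at a neutral midpoint (e.g., zero or an average).
divergent = sample_colors("RdBu", 5)

# Categorical: distinct hues with no implied order; integer lookups
# pick individual entries from the qualitative "tab10" palette.
categorical = [plt.get_cmap("tab10")(i) for i in range(5)]

schemes = {"sequential": sequential, "divergent": divergent, "categorical": categorical}

# Draw each scheme as a row of color swatches.
fig, axes = plt.subplots(len(schemes), 1, figsize=(6, 3))
for ax, (label, colors) in zip(axes, schemes.items()):
    ax.bar(range(len(colors)), [1] * len(colors), color=colors, width=1.0)
    ax.set_title(label, loc="left", fontsize=9)
    ax.set_axis_off()
fig.tight_layout()
```

In a notebook, the figure displays when the cell ends with `fig` or `plt.show()` under an interactive backend. Outside a notebook, `fig.savefig("color_schemes.png")` writes it to disk.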

<a id='chart_choice'></a>

### Choosing the Right Chart

---


In addition to considering data visualization attributes, you should also carefully choose the type of chart or graph you'll use. Let's look at a few commonly used charts and graphs.

![](http://www.comicsenglish.com/wp-content/uploads/2013/06/xkcd-stove_ownership.png)


### Bar Charts

Bar charts are one of the most common ways of visualizing data. Why? Because they make it easy to compare information, revealing highs and lows quickly. Bar charts are most effective when you have numerical data that splits neatly into different categories.

![](./assets/images/bar%20chart.png)

### Pie Charts

Pie charts are the most commonly misused chart type. They should only be used to show relative proportions or percentages of a whole.

If you want to compare data, leave it to bars or stacked bars. If your viewer has to work to translate pie wedges into relevant data or compare pie charts to one another, the key points you're trying to convey might go unnoticed. 

![](./assets/images/pie%20chart.jpg)
[Pie chart via TV.com](http://www.tv.com/news/learning-about-the-2013-pilot-season-through-pie-charts-136243394841/)

### The Best Use of a Pie Chart

![](http://i.imgur.com/uhTf6Ek.jpg)

### Scatter Plots

Scatter plots are a great way to get a sense of trends, concentrations, and outliers in your data, which helps you decide what to investigate further.

![](./assets/images/scatter%20plot.png)
[Scatter plot via Wikibooks](https://en.wikibooks.org/wiki/Statistics/Displaying_Data/Scatter_Graphs)

### Histograms 

Histograms are useful when you want to see how your data are distributed across ranges of values (bins).

![](./assets/images/histogram%20chart.png)
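
To make the idea of a histogram concrete, here is a small text-based sketch of the counting it performs. The `scores` list is invented for illustration and the bin width of 10 is an arbitrary choice. A plotting library would do the same counting before drawing the bars.

```python
from collections import Counter

# Hypothetical sample: exam scores invented for this illustration.
scores = [52, 58, 61, 64, 67, 68, 70, 71, 73, 74, 75, 77, 78, 81, 83, 85, 88, 92, 95]
bin_width = 10

# Assign each value to the lower edge of its bin, e.g. 67 falls in the 60-69 bin.
counts = Counter((score // bin_width) * bin_width for score in scores)

# Print one row per bin, with one '#' per value that landed in it.
for lower in range(min(counts), max(counts) + bin_width, bin_width):
    upper = lower + bin_width - 1
    print(f"{lower:>3}-{upper:<3} {'#' * counts[lower]}")
```

Changing `bin_width` changes the shape you see, which is why it is worth trying a few bin sizes before drawing conclusions from a histogram.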

This is not an all-inclusive list of chart and graph types, but the point is to remember that you have options. You should consider which one is most appropriate for representing a particular data set. 

[Charts and graphs via Tableau](https://drive.google.com/file/d/0Bx2SHQGVqWasT1l4NWtLclJJcWM/view)
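
The four chart types above can be sketched side by side with Matplotlib. All of the data here (`regions`, `sales`, `ad_spend`, `revenue`, `order_sizes`) is made up for this example and is not taken from the lesson's `sales_info.csv` file.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data, invented purely to illustrate each chart type.
regions = ["North", "South", "East", "West"]
sales = [120, 95, 140, 80]
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
revenue = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
order_sizes = [3, 5, 5, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11, 12, 12, 14, 15, 18, 22, 30]

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
bar_ax, pie_ax, scatter_ax, hist_ax = axes.flat

# Bar chart: compare a numeric value across distinct categories.
bar_ax.bar(regions, sales)
bar_ax.set_title("Bar: compare categories")
bar_ax.set_ylabel("Sales (units)")

# Pie chart: show each category's share of a whole.
pie_ax.pie(sales, labels=regions, autopct="%1.0f%%")
pie_ax.set_title("Pie: parts of a whole")

# Scatter plot: look for trends and outliers between two numeric variables.
scatter_ax.scatter(ad_spend, revenue)
scatter_ax.set_title("Scatter: relationship between variables")
scatter_ax.set_xlabel("Ad spend")
scatter_ax.set_ylabel("Revenue")

# Histogram: see how one numeric variable is distributed across bins.
hist_ax.hist(order_sizes, bins=5)
hist_ax.set_title("Histogram: distribution of values")
hist_ax.set_xlabel("Order size")
hist_ax.set_ylabel("Count")

fig.tight_layout()
```

The bar and pie panels use the same `sales` values, which makes it easy to judge for yourself which one makes the regional comparison clearer.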

<a id='visualization_libraries'></a>

### Visualization Programming Libraries

In this lesson, we will use the Python libraries [Matplotlib](https://matplotlib.org/) (Python plotting) and [Seaborn](https://seaborn.pydata.org/) (statistical data visualization).

Many other Python libraries exist for making visualizations. Some of the most popular include:

- **[Bokeh](http://bokeh.pydata.org/en/latest/):** Python visualization library that targets the web browser (e.g., in Jupyter). Makes interactive plots, dashboards, data applications, etc.

- **[Graphviz](http://graphviz.readthedocs.io/en/stable/manual.html):** Popular visualization library for graph data structures (i.e., structures made of vertices and edges). Has Python extensions.

- **[Basemap](http://matplotlib.org/basemap/):** Python Matplotlib extension for drawing static maps. There are many other Python libraries for plotting geographic data, including ones that might be easier to use, but many are not actively developed.

One of the most popular libraries for interactive visualizations in the web browser is D3. Because web browsers only natively run JavaScript, D3 requires knowledge of JavaScript:

- **[D3.js](https://d3js.org/):** JavaScript library for interactive web visualizations. See the [examples gallery](https://github.com/mbostock/d3/wiki/Gallery).


### Other Visualization Tools

Although this course emphasizes a Python approach to data science, a variety of non-programming tools are also used in industry. Often, these tools can be applied much more quickly than creating a custom Python solution. For example:

- **Excel:** For quick data cleaning and simple graphs
- **Power BI:** A suite of business analytics tools
- **Tableau:** Business intelligence and analytics software
- **Periscope Data:** Data analysis platform
- **Plotly:** Create charts and dashboards


<a id='independent_research'></a>

### Independent Research: Python Plotting With Pandas and Seaborn

---

Open up the [independent research notebook](./python-data-viz-lab.ipynb) to explore plotting the sales data with Python. 


<a id='conclusion'></a>

### Things to Consider

---

- Why is data visualization so important? 
- What are some considerations to keep in mind when creating a visualization? 
- Describe when you would use the following types of charts or graphs:
    - Bar chart
    - Pie chart
    - Scatter plot
    - Histogram 