<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Principles of Data Visualization

---

## Learning Objectives
- Identify what type of visualization may be appropriate for a given scenario.
- Generate bar graphs, histograms, scatter plots, and line plots in `matplotlib`.
- Generate heat maps in `seaborn`.
- Critique data visualizations and implement changes.

<details><summary>Let's get started with a question: what is the point of creating a visualization for data?</summary>
    
- "Use a picture. It's worth a thousand words." - Tess Flanders, 1911
- We generally use visualizations in order to efficiently and/or effectively communicate information.
</details>

## First, what *not* to do...

<img src="./images/bad2.jpg" style="height: 400px">

*Above image from [this source](https://teachdatascience.com/ethicaldataviz/).*

<details><summary>What is wrong with the above visualization?</summary>
    
- The x-axis is not arranged according to dates, but according to cases. This shows a misleading decline in cases that does not actually exist.
</details>

<img src="./images/bad3.png" style="height: 350px">

*Above image from [this source](https://www.datarevelations.com/resources/hey-your-tableau-public-viz-sucks-revisited/).*

<details><summary>What is wrong with the above visualization?</summary>
    
- In general, this graph is unclear: the reader cannot tell what is being conveyed.
- The main problem with this visual is the color scale! As a general rule, never use red and green together in a visualization. Stick with accessible, color-blind-friendly palettes. Read more about this [here](https://venngage.com/blog/color-blind-friendly-palette/).
</details>

<img src="./images/bad4.png" style="height: 400px">

<details><summary>What is wrong with the above visualization?</summary>
    
- Multiple data series are compared on the same graph, with different scales!  
- The visual claims to show a temporal correlation between thyroid cancer and exposure to the herbicide glyphosate, also known as **Roundup**. Does the visual support that claim?  
</details>

You can see more bad visualizations [here](https://viz.wtf/) and [here](https://www.callingbullshit.org/tools/tools_misleading_axes.html).



## What to do...

0. Tell a story. The point of presenting visual information in data science is to make a technical point in a digestible way.
> Ask yourself what story you are trying to tell before creating the visualization.
> - When you share a deliverable with a client (report, slide deck, etc.), the first or second thing they will do is look over the visualizations. You can lose their attention or confuse them if the visualizations are hard to understand.
1. Less is more. Get rid of everything you don’t need and only focus on what you are trying to communicate.
> “Above all else show the data.” “Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.”  
> - Tufte in [Visual Display of Quantitative Information](https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130)
2. Use color to help communicate.
> Color should be used to emphasize, not to distract the audience. We can use color to draw the audience's eyes where we want them to look. If you are concerned that your image is too "cluttered," then consider toning down your use of color.
3. Design for accessibility.
> Color blindness, also called color vision deficiency (CVD), can take many forms. Red-green accounts for 99% of cases.  
> - "Red–green color blindness is the most common form, followed by blue–yellow color blindness and total color blindness. Red–green color blindness affects up to 8% of males and 0.5% of females of Northern European descent." - [Wikipedia](https://en.wikipedia.org/wiki/Color_blindness)
> - This is an educational article that explains color theory in more depth: [color theory](https://edtechbooks.org/ux/color_theory)
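
As a minimal sketch of the accessibility point above, seaborn ships a built-in `"colorblind"` palette whose colors can be passed straight to matplotlib. The group names and values below are made up for illustration only.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical categories and values, for illustration only
groups = ["A", "B", "C", "D"]
values = [3, 7, 5, 2]

# seaborn's built-in "colorblind" palette: one RGB tuple per bar
palette = sns.color_palette("colorblind", n_colors=len(groups))

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(groups, values, color=palette)
ax.set_title("Bars drawn with a color-blind-friendly palette");
```

If color is the only thing separating categories, consider also adding direct labels so the chart still reads in grayscale.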

### Let's do it!

First, let's download our data. This data is from [Kaggle](https://www.kaggle.com/crawford/80-cereals?select=cereal.csv) and is part of a [Makeover Monday challenge](https://www.makeovermonday.co.uk/data/) to improve a visualization.

<img src="./images/cereal.jpeg" style="height: 200px">

In [22]:
# Imports


In [None]:
# Import and view cereal data


In [None]:
# Preliminary checks
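
A minimal sketch of what these first three cells might contain. The import aliases are the ones the later cells rely on (`plt`, `np`, `pd`, `sns`). The file path in the comment is an assumption, and the tiny `cereal` DataFrame holds made-up stand-in rows so the sketch runs without the CSV.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# With the Kaggle file downloaded, loading would look something like:
# cereal = pd.read_csv("./data/cereal.csv")   # hypothetical path
# Stand-in rows (made-up values) so the checks below run on their own
cereal = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "calories": [70, 120, 110],
    "sugars": [6, 8, 14],
    "rating": [68.4, 33.9, 29.5],
})

# Preliminary checks: size, column types, and missing values
print(cereal.shape)
print(cereal.dtypes)
missing = cereal.isnull().sum()
print(missing)
```

Checking types and missing values first helps catch columns that were read as text or that contain placeholder values before anything is plotted.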


In [None]:
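def digital_root(num):
    """Repeatedly sum the digits of a non-negative integer until one digit remains."""
    total = sum(int(digit) for digit in str(num))
    if total < 10:
        return total
    # More than one digit left: recurse on the digit sum
    return digital_root(total)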

## Bar Charts

> "A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent." -[Wikipedia](https://en.wikipedia.org/wiki/Bar_chart)

In [None]:
# Get top 10 most sugary cereals, save as a variable
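
One possible way to fill in this cell, sketched on made-up stand-in data. The column names `name` and `sugars` match the ones used by the plotting cells that follow; the rows themselves are invented.

```python
import pandas as pd

# Made-up stand-in for the cereal data
cereal = pd.DataFrame({
    "name": [f"Cereal {i}" for i in range(12)],
    "sugars": [3, 15, 7, 0, 12, 9, 14, 6, 11, 13, 1, 10],
})

# Sort from most to least sugar and keep the first 10 rows
sugary = cereal.sort_values("sugars", ascending=False).head(10)
print(sugary)
```

Sorting before plotting means the bars appear in rank order, which makes the chart easier to read than the file's original order.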


In [None]:
plt.figure(figsize = (15, 5))
# bar chart
plt.bar(x = sugary["name"], height = sugary["sugars"])
###########################
# Styling your plot
###########################
# Title
plt.title("Top 10 most sugary cereals")
# Axis labels
plt.xlabel("Cereal name") # categories of cereal
plt.ylabel("Grams of Sugar (per serving)")
plt.xticks(rotation = 10);


In [None]:
plt.figure(figsize = (12,5))
plt.barh(y = sugary["name"][::-1], width = sugary["sugars"][::-1])
plt.title("Top 10 cereals with highest sugar content (horizontal)")
plt.xlabel("Grams of sugar (per serving)");

In [None]:
# plt.barh() will create a horizontal bar chart. (Note the h at the end of bar!)


## Histograms
> Histograms are used to display the distribution of numerical data.

In [None]:
# Let's plot a histogram of cereal rating

# Size

# Plot it
# colors: https://matplotlib.org/3.1.0/gallery/color/named_colors.html

# Create a descriptive title
# Do we need axis labels here?
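
A minimal sketch of how the histogram cell could be completed. The ratings here are randomly generated stand-ins; with the real data you would pass the rating column of the cereal DataFrame instead (the column name `rating` is an assumption).

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up ratings standing in for cereal["rating"]
rng = np.random.default_rng(42)
ratings = rng.normal(loc=45, scale=12, size=77)

# Size and plot
fig, ax = plt.subplots(figsize=(10, 5))
counts, bin_edges, _ = ax.hist(ratings, bins=15, color="teal")

# Descriptive title and axis labels
ax.set_title("Distribution of cereal ratings")
ax.set_xlabel("Rating")
ax.set_ylabel("Number of cereals");
```

The number of bins changes the story a histogram tells, so it is worth trying a few values before settling on one.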

## Scatter Plots
> Scatter plots are used to display the relationship between two variables.

> We can encode additional information in a scatter plot by using color, size, and shape.

In [None]:
# Let's create a scatter plot of calories vs. sugar

# Size

# Scatter plot

# Create a descriptive title

# Add axis labels
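
A minimal sketch of the scatter plot cell, also showing how color can carry a third variable. The rows are made up, and the column names `calories`, `sugars`, and `rating` are assumptions about the cereal data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; the column names are assumptions about the data
cereal = pd.DataFrame({
    "calories": [70, 120, 110, 90, 150, 100],
    "sugars": [6, 8, 14, 0, 11, 3],
    "rating": [68.4, 33.9, 29.5, 74.5, 37.0, 58.3],
})

fig, ax = plt.subplots(figsize=(8, 6))
# Position encodes sugar and calories; color encodes a third variable, rating
points = ax.scatter(cereal["sugars"], cereal["calories"],
                    c=cereal["rating"], cmap="viridis")
fig.colorbar(points, ax=ax, label="Rating")

ax.set_title("Calories vs. sugar per serving")
ax.set_xlabel("Grams of sugar (per serving)")
ax.set_ylabel("Calories (per serving)");
```

The `viridis` colormap is used here because it stays readable for most forms of color vision deficiency.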


## Line Graphs
> Line graphs (also known as line plots or line charts) use lines to connect data points to show the changes in numerical values over time.

In [None]:
# Generate data over time
dates = pd.date_range('3-1-20', '7-30-20')   # one entry per day (152 days)
bowls_eaten = np.random.poisson(1, size = len(dates))   # one simulated count per day

In [None]:
# Let's create a line plot of the number of bowls of cereal I've consumed

# Size

# Line plot

# Create a descriptive title

# Add axis labels
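
A minimal sketch of the line plot cell. It regenerates the same kind of simulated data as the cell above, with a fixed seed (an addition for repeatability), so it runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated data: one Poisson-distributed count per day
dates = pd.date_range("2020-03-01", "2020-07-30")
rng = np.random.default_rng(0)
bowls_eaten = rng.poisson(1, size=len(dates))

# Size and line plot
fig, ax = plt.subplots(figsize=(12, 5))
(line,) = ax.plot(dates, bowls_eaten)

# Descriptive title and axis labels
ax.set_title("Bowls of cereal eaten per day")
ax.set_xlabel("Date")
ax.set_ylabel("Bowls eaten")
fig.autofmt_xdate();  # tilt the date labels so they do not overlap
```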


## Heat Maps
> Heat maps use color to show the strength of a relationship between two or more variables.

A commonly seen implementation of a heatmap uses the [Seaborn library](https://seaborn.pydata.org).

In [None]:
# This code is taken with minor modifications from https://seaborn.pydata.org/generated/seaborn.heatmap.html

# Establish size of figure.

# Get correlation of variables.

# Set up mask to be "True" in the upper triangle.

# Plot our correlation heatmap, while masking the upper triangle to be white.


# Establish size of figure.
plt.figure(figsize= (16, 9))
# Get correlation of variables.
corr = cereal.corr(numeric_only = True)   # skip text columns such as the cereal name
# Set up mask to be "True" in the upper triangle.
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
# Plot our correlation heatmap, while masking the upper triangle to be white.
sns.heatmap(data = corr,     # data to plot
            mask = mask,
            square = True,   # boolean for a square plot or not
            cmap = "mako",   # color palette
            annot = True);   # annot = True: display correlations on plot as text

## Advanced: Subplots
> We can use subplots to easily compare multiple visualizations by sharing an axis.

In [None]:
# Establish figure size.
plt.figure(figsize = (16,9))

# We can create subplots, which allow us to have multiple plots in the same figure.
# plt.subplot(3, 1, 1) means we have 3 rows, 1 column, and are referencing plot 1.
ax1 = plt.subplot(3, 1, 1)


# plt.subplot(3, 1, 2) means we have 3 rows, 1 column, and are referencing plot 2.
ax2 = plt.subplot(3, 1, 2, sharex=ax1)


# plt.subplot(3, 1, 3) means we have 3 rows, 1 column, and are referencing plot 3.
ax3 = plt.subplot(3, 1, 3, sharex=ax2)


plt.tight_layout(); # adjusts spacing so titles and labels do not overlap





# Establish figure size.
plt.figure(figsize = (16,9))
# We can create subplots, which allow us to have multiple plots in the same figure.
# plt.subplot(3, 1, 1) means we have 3 rows, 1 column, and are referencing plot 1.
ax1 = plt.subplot(3, 1, 1)
ax1.set_title("Grams of protein")
ax1.hist(cereal["protein"])
# plt.subplot(3, 1, 2) means we have 3 rows, 1 column, and are referencing plot 2.
ax2 = plt.subplot(3, 1, 2, sharex=ax1)
ax2.set_title("Grams of fat")
ax2.hist(cereal["fat"])
# plt.subplot(3, 1, 3) means we have 3 rows, 1 column, and are referencing plot 3.
ax3 = plt.subplot(3, 1, 3, sharex=ax2)
ax3.set_title("Grams of carbs")
ax3.hist(cereal["carbo"])
# plt method to control space
plt.tight_layout(); 

---

## More Visualization Libraries in Python

In this lesson, you saw the vanilla `matplotlib` API and [Seaborn](https://seaborn.pydata.org/) (which uses Matplotlib under the hood). Other popular Python visualization libraries include the following, which can be used for more advanced plots (like maps) or for interactive plotting:
- [Bokeh](http://bokeh.pydata.org/en/latest/)
- [Altair](https://altair-viz.github.io/)
- [Plotly](https://plot.ly/python/getting-started/)

## Other Visualization Tools

A variety of non-programming tools are also used in industry. However, not all of them are free, customizable, or well suited to repeatable analysis! For example:
- Excel
- Power BI
- Tableau

---

## Matplotlib gallery

See the possibilities of matplotlib [here](https://matplotlib.org/3.2.1/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py).
- These examples are to show what's possible. Don't feel you need to memorize any of them. 

---

## Choosing the Right Chart Type

**If-This-Then-That Guidelines**

### Just a number or two
If showing just one or two numbers, then:
   - simply report the numbers.
   - compare numbers using a bar chart.
   
### One Variable
If visualizing the distribution of one variable, then:
   - If the variable is qualitative (categorical: nominal or ordinal): use a bar chart.
   - If the variable is quantitative (numeric: ratio or interval): use a histogram.
    
### Two Variables
If visualizing the relationship between two variables, then:
   - If both variables are quantitative, then:
       - If one variable is time, then use a line plot.
       - Otherwise, use a scatter plot, or a box plot if you want to show summary statistics.
   - If one variable is quantitative and another is qualitative, then use multiple histograms.
   - If both variables are qualitative, then:
       - Use a table or a heat map.

### Three+ Variables
- Use a heat map, box plot, or multiple scatter plots

That should cover 95%+ of cases you'll see. 😀
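
The if-this-then-that guidelines above can be sketched as a small lookup function. The function name `suggest_chart` and its arguments are hypothetical, invented here only to make the decision rules concrete.

```python
def suggest_chart(kinds, involves_time=False):
    """Suggest a chart type from a list of variable kinds.

    kinds: list of "quantitative" / "qualitative", one entry per variable.
    involves_time: True if one of the quantitative variables is time.
    """
    if not kinds:
        raise ValueError("need at least one variable")
    n_quant = kinds.count("quantitative")
    if len(kinds) == 1:
        return "histogram" if n_quant == 1 else "bar chart"
    if len(kinds) == 2:
        if n_quant == 2:
            return "line plot" if involves_time else "scatter plot"
        if n_quant == 1:
            return "multiple histograms"
        return "table or heat map"
    # Three or more variables
    return "heat map, box plot, or multiple scatter plots"


print(suggest_chart(["quantitative"]))
print(suggest_chart(["quantitative", "quantitative"], involves_time=True))
print(suggest_chart(["qualitative", "qualitative"]))
```

Treat the result as a starting point: the story you want to tell may still call for a different chart.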


### Avoid...
- Pie Charts
- 3-D Charts

### Resources
- [Gapminder](https://www.gapminder.org/tools/) is a resource for experimenting with dataset visualizations.
- There is also a `gapminder` Python package: [pypi gapminder](https://pypi.org/project/gapminder/)

### Talks on Data Visualization

In [23]:
# Function to embed YouTube videos into a Jupyter notebook
from IPython.display import YouTubeVideo

def display_yotube_video(url, **kwargs):
    """
    Displays a YouTube video in a Jupyter notebook.
    
    Args:
        url (string): a link to a YouTube video, ending in "v=<video id>".
        **kwargs: further arguments for IPython.display.YouTubeVideo
    
    Returns:
        YouTubeVideo: a video that is displayed in your notebook.
    """
    # The video ID is whatever follows the last "=" in the URL
    id_ = url.split("=")[-1]
    return YouTubeVideo(id_, **kwargs)

In [24]:
# Video about data journalism
display_yotube_video("https://www.youtube.com/watch?v=5Zg-C8AAIGg", width=800, height=600)

In [25]:
# Great video for understanding the story-telling aspect of data visualization
display_yotube_video("https://www.youtube.com/watch?v=jbkSRLYSojo", width=800, height=600)