# Principles of Data Visualization

_Authors:_ Noelle Brown (mostly), Tim Book

---

## Learning Objectives
- Identify what type of visualization may be appropriate for a given scenario.
- Generate bar graphs, histograms, scatter plots, and line plots in `matplotlib` and `seaborn`.
- Generate heat maps in `seaborn`.
- Critique data visualizations and implement changes.

<details><summary>Let's get started with a question: what is the point of creating a visualization for data?</summary>
    
- "Use a picture. It's worth a thousand words." - Tess Flanders, 1911
- We generally use visualizations in order to efficiently and/or effectively communicate information.
</details>

## First, what *not* to do...

<img src="./images/bad1.jpg" style="height: 500px">

<details><summary>What is wrong with the above visualization?</summary>
    
- The y-axis is flipped. Most people expect 0 to be at the bottom of the graph, so anyone who just glances at this visualization will be misled.
</details>

<img src="./images/bad2.jpg" style="height: 400px">

*Above images from [this source](https://teachdatascience.com/ethicaldataviz/).*

<details><summary>What is wrong with the above visualization?</summary>
    
- The x-axis is ordered by case count rather than by date. This suggests a decline in cases that does not actually exist.
</details>

<img src="./images/bad3.png" style="height: 350px">

*Above image from [this source](https://www.datarevelations.com/resources/hey-your-tableau-public-viz-sucks-revisited/).*

<details><summary>What is wrong with the above visualization?</summary>
    
- In general, this graph is unclear: the reader cannot tell what it is meant to convey.
- The main problem with this visual is the color scale! As a general rule, never use red and green together in a visualization. Stick with accessible, colorblind-friendly color palettes. Read more about this [here](https://venngage.com/blog/color-blind-friendly-palette/).
</details>

You can see more bad visualizations [here](https://viz.wtf/)!

## What to do...

1. Less is more. Get rid of everything you don’t need and only focus on what you are trying to communicate.
> “Above all else show the data.” “Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.”  
> - Tufte in [Visual Display of Quantitative Information](https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130)
2. Use color to help communicate
> Color should be used to emphasize, not distract the audience. We can use color to draw the audience's eyes where we want them to look. If you are concerned that your image is too "cluttered," then consider toning down your use of color.
3. Design for accessibility
> Color blindness, also called color vision deficiency (CVD), can take many forms. Red-green accounts for about 99% of cases.  
> - "Red–green color blindness is the most common form, followed by blue–yellow color blindness and total color blindness. Red–green color blindness affects up to 8% of males and 0.5% of females of Northern European descent." - [Wikipedia](https://en.wikipedia.org/wiki/Color_blindness)
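Points 2 and 3 can be combined in a few lines. The sketch below is only an illustration: the categories and values are made up, and it assumes seaborn's built-in `"colorblind"` palette is an acceptable choice for your audience. Everything is drawn in a muted gray except the one bar we want readers to notice.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical values, used only to illustrate the styling.
categories = ["A", "B", "C", "D"]
values = [3, 7, 5, 2]

# seaborn ships a palette chosen with color vision deficiency in mind.
palette = sns.color_palette("colorblind", n_colors=len(categories))

# Mute every bar, then use one accessible color for the bar to emphasize.
muted = (0.8, 0.8, 0.8)
highlight = "B"
colors = [palette[0] if c == highlight else muted for c in categories]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, values, color=colors)
ax.set_title("Category B has the highest value")
ax.set_ylabel("Value")
```

Using a single accent color keeps the chart readable even when it is printed in grayscale.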

### Let's do it!

First, let's download our data. This data is from [Kaggle](https://www.kaggle.com/crawford/80-cereals?select=cereal.csv) and is part of a [Makeover Monday challenge](https://www.makeovermonday.co.uk/data/) to improve a visualization.

<img src="./images/cereal.jpeg" style="height: 200px">

In [None]:
# Imports


In [None]:
# Import and view cereal data
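
One possible way to fill in the two cells above is sketched here. The file location is not given in this lesson, so the sketch reads a few made-up rows from a string instead of the real file; the column names (`name`, `calories`, `sugars`, `rating`) are assumed to match the Kaggle dataset. With the real data you would pass the path to your copy of `cereal.csv` to `pd.read_csv`.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in for cereal.csv: made-up rows, not real cereals.
csv_text = """name,calories,sugars,rating
Cereal A,110,12,30.5
Cereal B,70,3,60.2
Cereal C,120,15,25.1
Cereal D,100,9,41.7
Cereal E,50,0,72.4
"""

# With the real file: cereal = pd.read_csv("<path to your cereal.csv>")
cereal = pd.read_csv(io.StringIO(csv_text))

# View the first rows and the column types.
cereal.head()
cereal.dtypes
```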


## Bar Charts

> "A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent." -[Wikipedia](https://en.wikipedia.org/wiki/Bar_chart)

In [None]:
# Get top 10 most sugary cereals, save as a variable


In [None]:
# MATPLOTLIB - Plot Bar Chart

In [None]:
# SEABORN - Plot Bar Chart
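
As a rough sketch of the three cells above: pick the most sugary rows, then draw the same bar chart with each library. The data is made up (six hypothetical cereals), so the sketch keeps the top 3 rather than the top 10; the `sugars` column name and the gram unit are assumptions about the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data standing in for the cereal DataFrame.
cereal = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D", "Cereal E", "Cereal F"],
    "sugars": [12, 3, 15, 9, 0, 14],
})

# Keep the most sugary rows (use 10 instead of 3 on the full dataset).
top_sugar = cereal.nlargest(3, "sugars")

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(12, 4))

# MATPLOTLIB: pass the categories and the bar heights.
ax_mpl.bar(top_sugar["name"], top_sugar["sugars"])
ax_mpl.set_title("Most sugary cereals (matplotlib)")
ax_mpl.set_ylabel("Sugars per serving (g)")

# SEABORN: pass the DataFrame and name the columns.
sns.barplot(data=top_sugar, x="name", y="sugars", ax=ax_sns)
ax_sns.set_title("Most sugary cereals (seaborn)")
```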

In [None]:
# MATPLOTLIB - Plot Horizontal Bar Chart
#plt.barh() will create a horizontal bar chart. (Note the h at the end of bar!)

# Size

# plt.barh()

# Create a descriptive title

# Add axis labels


In [None]:
# SEABORN - Plot Horizontal Bar Chart
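
Horizontal bars are often easier to read when the category labels are long. Here is a minimal sketch with made-up data: `barh` draws the first row at the bottom, so the y-axis is inverted to put the largest bar on top, while seaborn already draws the first category at the top.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data, already sorted from most to least sugary.
top_sugar = pd.DataFrame({
    "name": ["Cereal C", "Cereal F", "Cereal A", "Cereal D"],
    "sugars": [15, 14, 12, 9],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(12, 4))

# MATPLOTLIB: barh takes the categories first, then the bar lengths.
ax_mpl.barh(top_sugar["name"], top_sugar["sugars"])
ax_mpl.invert_yaxis()  # largest bar at the top
ax_mpl.set_title("Most sugary cereals (matplotlib)")
ax_mpl.set_xlabel("Sugars per serving (g)")

# SEABORN: swap x and y so the numeric column is on the x-axis.
sns.barplot(data=top_sugar, x="sugars", y="name", ax=ax_sns)
ax_sns.set_title("Most sugary cereals (seaborn)")
```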


## Histograms
> Histograms are used to display the distribution of numerical data.

In [None]:
# MATPLOTLIB - Plot a histogram of cereal rating

# Plot it

# Create a descriptive title


In [None]:
# SEABORN - Plot a histogram of cereal rating


## Boxplots
> Boxplots are used to display the distribution of numerical data, summarized by its median, quartiles, and outliers.

In [None]:
# MATPLOTLIB - Plot a boxplot of cereal rating

# Plot it

# Create a descriptive title


In [None]:
# SEABORN - Plot a boxplot of cereal rating


In [None]:
#BONUS - What cereal has that super high rating?!
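
The boxplot cells and the bonus question could look something like this. The six cereals and their ratings are made up, with one deliberately extreme value so an outlier shows up; on the real data the same `idxmax` lookup would answer the bonus question.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data with one very high rating.
cereal = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D", "Cereal E", "Cereal F"],
    "rating": [30, 35, 40, 42, 45, 93],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# MATPLOTLIB: points beyond the whiskers are drawn as "fliers".
box = ax_mpl.boxplot(cereal["rating"])
ax_mpl.set_title("Cereal ratings (matplotlib)")
ax_mpl.set_ylabel("Rating")

# SEABORN
sns.boxplot(data=cereal, y="rating", ax=ax_sns)
ax_sns.set_title("Cereal ratings (seaborn)")

# BONUS: look up the row with the highest rating.
best = cereal.loc[cereal["rating"].idxmax(), "name"]
print(best)
```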


## Scatter Plots
> Scatter plots are used to display the relationship between two variables.

In [None]:
# MATPLOTLIB - Create a scatter plot of calories vs. sugar
    
# Size

# Plot it

# Create a descriptive title

# Add axis labels


In [None]:
# SEABORN - Create a scatter plot of calories vs. sugar
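
Here is one way the scatter plot cells might be filled in, using made-up numbers in place of the real `sugars` and `calories` columns (both column names are assumptions about the dataset).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data standing in for the cereal DataFrame.
cereal = pd.DataFrame({
    "sugars": [0, 3, 6, 9, 12, 15],
    "calories": [50, 70, 100, 100, 110, 120],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(12, 4))

# MATPLOTLIB: x values first, then y values.
ax_mpl.scatter(cereal["sugars"], cereal["calories"])
ax_mpl.set_title("Calories vs. sugar (matplotlib)")
ax_mpl.set_xlabel("Sugars per serving (g)")
ax_mpl.set_ylabel("Calories per serving")

# SEABORN: axis labels default to the column names.
sns.scatterplot(data=cereal, x="sugars", y="calories", ax=ax_sns)
ax_sns.set_title("Calories vs. sugar (seaborn)")
```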


## Line Graphs
> Line graphs (also known as line plots or line charts) use lines to connect data points to show the changes in numerical values over time.

In [None]:
# Generate data over time

# Generate dates (one per day from March 1 to July 30, 2022)
dates = pd.date_range('2022-03-01', '2022-07-30')

In [None]:
# Generate bowls eaten (one Poisson draw per date)
bowls_eaten = np.random.poisson(1, size = len(dates))

In [None]:
# MATPLOTLIB - Create a line plot of the number of bowls of cereal I've consumed

# Size

# Plot it 

# Create a descriptive title

# Add axis labels


In [None]:
# SEABORN - Create a line plot of the number of bowls of cereal I've consumed


## Heat Maps
> Heat maps use color to show the strength of a relationship between two or more variables.

Heat maps are commonly drawn with the [Seaborn library](https://seaborn.pydata.org).

In [None]:
# This code is taken with minor modifications from https://seaborn.pydata.org/generated/seaborn.heatmap.html

# Establish size of figure.

# Get correlation of variables.

# Set up mask to be "True" in the upper triangle.
# mask = np.zeros_like(corr)
# mask[np.triu_indices_from(mask)] = True

# Plot our correlation heatmap


In [None]:
# Let's look at the correlations to rating
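
The two heat map cells can be sketched as follows. The data is synthetic (calories are built to rise with sugar and ratings to fall with it), so the numbers mean nothing beyond illustrating the code. The mask hides the upper triangle and the diagonal, which only repeat information shown in the lower triangle.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data standing in for the numeric cereal columns.
rng = np.random.default_rng(1)
sugars = rng.integers(0, 16, size=50)
df = pd.DataFrame({
    "sugars": sugars,
    "calories": 80 + 4 * sugars + rng.normal(0, 10, size=50),
    "rating": 60 - 2 * sugars + rng.normal(0, 5, size=50),
})

# Get correlation of variables.
corr = df.corr()

# Set up mask to be True in the upper triangle (including the diagonal).
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot the correlation heat map on a fixed -1 to 1 color scale.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# Look at the correlations to rating only, sorted from lowest to highest.
rating_corr = corr["rating"].drop("rating").sort_values()
print(rating_corr)
```

Fixing `vmin` and `vmax` keeps the colors comparable between heat maps, since a correlation always falls between -1 and 1.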


Correlation refers to a statistical relationship between two variables: when two variables are correlated, changes in one are associated with changes in the other. A correlation can be positive (the two variables move in the same direction) or negative (the two variables move in opposite directions).

Correlation is a useful tool for identifying potential relationships between variables, but correlation does not imply causation: on its own, it is not sufficient to establish that one variable causes the other. For some memorable examples, see [Spurious Correlations](https://www.tylervigen.com/spurious-correlations).
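
To see how two unrelated quantities can still be strongly correlated, here is a small simulation. Both series are made up: each one simply grows over time with its own independent noise, so neither influences the other. Time acts as a hidden common cause.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(50)

# Two made-up series that both grow over time but never affect each other.
series_a = t + rng.normal(0, 3, size=t.size)
series_b = 2 * t + rng.normal(0, 3, size=t.size)
r_raw = np.corrcoef(series_a, series_b)[0, 1]

# Remove the shared time trend and correlate what is left.
resid_a = series_a - t
resid_b = series_b - 2 * t
r_detrended = np.corrcoef(resid_a, resid_b)[0, 1]

print(f"with shared trend: r = {r_raw:.2f}")
print(f"trend removed:     r = {r_detrended:.2f}")
```

The first correlation comes out close to 1 even though the only thing the series share is the passage of time.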

## Advanced: Subplots
> We can use subplots to easily compare multiple visualizations by sharing an axis.

In [None]:
stocks = pd.read_csv('../data/food-stocks.csv')
stocks['Date'] = pd.to_datetime(stocks['Date'])
stocks.head()

In [None]:
# Create the figure and list of axes.
# We can also set the figsize here.
# Additionally, set sharex=True to keep x-axes aligned
fig, axs = plt.subplots(2, 2, figsize=(16, 10), sharex=True)

# First determine an axis to operate on and then create the plot!
axs[0][0].plot(stocks['Date'], stocks['DPZ'], color='dodgerblue', linewidth=3)
axs[0][0].set_title('Domino\'s', size=16, loc='left')

# We can do this for all the other stocks!
axs[0][1].plot(stocks['Date'], stocks['MCD'], color='red', linewidth=3)
axs[0][1].set_title('McDonald\'s', size=16, loc='left')

axs[1][0].plot(stocks['Date'], stocks['WEN'], color='darkred', linewidth=3)
axs[1][0].set_title('Wendy\'s', size=16, loc='left')

axs[1][1].plot(stocks['Date'], stocks['YUM'], color='purple', linewidth=3)
axs[1][1].set_title('Yum! Brands', size=16, loc='left');

---

## More Visualization Libraries in Python

In this lesson, you saw the vanilla Matplotlib API and [Seaborn](https://seaborn.pydata.org/) (which uses Matplotlib under the hood). Other popular Python visualization libraries include the following, which can be used for more advanced plots (like maps) or for interactive plotting:
- [Bokeh](http://bokeh.pydata.org/en/latest/)
- [Altair](https://altair-viz.github.io/)
- [Plotly](https://plot.ly/python/getting-started/)

## Other Visualization Tools

A variety of non-programming tools are also used in industry. However, not all of them are well suited to repeated analysis, customizable, or free! For example:
- Excel
- Power BI
- Tableau

---

## Matplotlib gallery

See the possibilities of matplotlib [here](https://matplotlib.org/3.2.1/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py).
- These examples are to show what's possible. Don't feel you need to memorize any of them. 

---

## Choosing the Right Chart Type

**If-This-Then-That Guidelines**

### Just a number or two
If showing just one or two numbers, then:
   - Simply report the numbers.
   - To compare the numbers, use a bar chart.
   
### One Variable
If visualizing the distribution of one variable, then:
   - If the variable is qualitative (categorical: nominal or ordinal): use a bar chart.
   - If the variable is quantitative (numeric: ratio or interval): use a histogram.
    
### Two Variables
If visualizing the relationship between two variables, then:
   - If both variables are quantitative, then:
       - If one variable is time, then use a line plot.
       - Otherwise, use a scatter plot, or a box plot if you want to show summary statistics.
   - If one variable is quantitative and another is qualitative, then use multiple histograms.
   - If both variables are qualitative, then:
       - Use a table or a heat map.

### Three+ Variables
- Use a heat map, box plot, or multiple scatter plots

That should cover 95%+ of cases you'll see. 😀
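
The guidelines above for one, two, and three or more variables can also be written down as a small lookup function. This is only a sketch: `suggest_chart` and its arguments are hypothetical names invented here, and the function encodes nothing beyond the rules listed in this section.

```python
def suggest_chart(kinds, has_time=False):
    """Suggest a chart type from the kinds of variables being plotted.

    kinds: list with one entry per variable, each either
           "quantitative" or "qualitative".
    has_time: True if one of two quantitative variables is time.
    """
    allowed = {"quantitative", "qualitative"}
    if not kinds or any(kind not in allowed for kind in kinds):
        raise ValueError("kinds must be a non-empty list of 'quantitative'/'qualitative'")

    # One variable: show its distribution.
    if len(kinds) == 1:
        return "bar chart" if kinds[0] == "qualitative" else "histogram"

    # Two variables: show their relationship.
    if len(kinds) == 2:
        n_quantitative = kinds.count("quantitative")
        if n_quantitative == 2:
            return "line plot" if has_time else "scatter plot"
        if n_quantitative == 1:
            return "multiple histograms"
        return "table or heat map"

    # Three or more variables.
    return "heat map, box plot, or multiple scatter plots"


print(suggest_chart(["quantitative", "quantitative"], has_time=True))
print(suggest_chart(["qualitative", "quantitative"]))
```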


### Avoid...
- Pie Charts
- 3-D Charts