# Synopsis

Visualizing your data is key to both understanding its basic properties and effectively communicating your results to an outside audience. In this notebook, we will cover some basic principles for creating clean and informative visualizations.






# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from pathlib import Path
from sys import path

path.append('../My_libraries')
path

In [None]:
my_fontsize = 15

In [None]:
import random

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import scipy.stats as stats

from My_libraries.my_stats import half_frame

# Data visualization

Visualizing your data is a **key** component of data analysis, no matter how big or small your data is.  Our minds are hardwired to process visual information. In fact, **one-third** of our brain is dedicated to image processing and **40% of all** nerve fibers connected to the brain come from the retina. Visualization is essentially a high-speed link to your cognitive systems.

Let me show you a quick example.

## Traffic accidents

Let's look at this image together. **It plots the number of accidents per month against the time of day**. 

Can you tell me when accidents are most likely to occur?

<img src='Images/visualization_raw_chart.png' width = '600px'>

Now let's try something different. This is the same chart, but now I've added a color scale. The more accidents that occur during a time period, the more saturated the shade of green-blue behind that number.

Now tell me, when are accidents most likely to occur? How long did it take you to figure it out?

<img src = 'Images/visualization_heatmap.png' width = '600px'>
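A heatmap like the one above can be sketched with `imshow`. This is only a minimal sketch: the counts are made-up random numbers (not the data behind the images), and the months, hours, and the `GnBu` colormap are choices made for illustration.

```python
import random

import matplotlib.pyplot as plt

# Made-up counts, NOT the data behind the images above:
# rows are times of day, columns are months.
random.seed(0)
hours = list(range(0, 24, 3))
months = ["Jan", "Apr", "Jul", "Oct"]
counts = [[random.randint(0, 50) for _ in months] for _ in hours]

fig = plt.figure(figsize=(4, 5))
ax = fig.add_subplot(1, 1, 1)

# imshow maps each number to a color; larger counts get darker cells
image = ax.imshow(counts, cmap="GnBu", aspect="auto")

# Keep the numbers themselves visible on top of the colors
for row, row_counts in enumerate(counts):
    for col, value in enumerate(row_counts):
        ax.text(col, row, str(value), ha="center", va="center")

ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)
ax.set_yticks(range(len(hours)))
ax.set_yticklabels([f"{h:02d}:00" for h in hours])
ax.set_xlabel("Month")
ax.set_ylabel("Time of day")
fig.colorbar(image, ax=ax, label="Accidents")

plt.tight_layout()
```

The numbers are the same with or without the colors; the color scale only makes the large values jump out.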

## Effective visualizations make a difference

Effective visualizations allow us to make accurate decisions more quickly. More importantly, they help us make **correct** decisions. When you make a visualization, you're actually trying to make a point. The entire point is to persuade your audience of some fact that you believe to be true. 

Whether you accept this statement or not, it is what happens. If you don't construct your visualization in a way that informs your audience and allows them to make a correct decision, then they can easily reach a different conclusion. 

### Creating a poor visualization can be disastrous.

You might remember or have read about the 1986 Challenger accident. The Challenger was a NASA Space Shuttle that broke apart 73 seconds after liftoff, killing the entire crew, because the O-ring seals in a joint of one of its solid rocket boosters failed and allowed hot gas to escape and burn through to the external fuel tank.

**Why did the O-rings fail?**

There were rumblings at NASA prior to the Challenger launch that defects in the O-rings occurred at cold temperatures. These data were examined by a large number of people, but this is how they were presented to senior management (the people actually tasked with making the decision of **whether it was safe or not** to launch the shuttle).

<img src = 'Images/challenger_original.png' width = '600px'>

The predicted temperature for the launch time of the Challenger was 26-29$^\circ$F.  Do you think that was safe based on the data above?

**Would you have cancelled the launch based on these data?**




Really hard to say, no?


Let's use a graph to present the data as a function of launch temperature instead of launch number.

<img src='Images/challenger_remade.png' width = '600px'>

Would you have cancelled the launch based on these data?

## Visualizations are about making patterns visible!

However, deceptive methods can also make non-existent patterns appear real. The book "How to Lie with Statistics" provides many examples of deceptive techniques. You should learn about them so that you can recognize them and avoid being duped by them.

**Dishonest axes**

Courtesy of the Biden White House.

<img src = 'Images/wh_economic_growth.png' width = '600px'>

**Fake scale**

Courtesy of National Law Enforcement Memorial and Museum's *2021 End-of-Year Preliminary Law Enforcement Officers Fatalities Report*.

<img src = 'Images/police-fatalities.png' width = '400px'>

**Cropped bar plots**

Courtesy of Fox News.

<center>
<table>
    <tr>
        <td><img src = 'Images/aca_enrollment_fox.webp' width = '300px'></td>
        <td><img src = 'Images/aca_enrollment_corrected.jpg' width = '300px'></td>
    </tr>
</table>
</center>
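The effect of a cropped axis is easy to reproduce. The sketch below draws the same two bars twice, once with the y axis starting just below the smaller value and once starting at zero. The two values are made up for illustration and are not the numbers in the images above.

```python
import matplotlib.pyplot as plt

# Made-up numbers for illustration only; they are NOT the values
# shown in the images above.
labels = ["Current", "Goal"]
values = [6.0, 7.0]

fig = plt.figure(figsize=(6, 3))

# Left: the y axis starts just below the smaller value
ax_cropped = fig.add_subplot(1, 2, 1)
ax_cropped.bar(labels, values, color="steelblue")
ax_cropped.set_ylim(5.8, 7.2)
ax_cropped.set_title("Cropped axis")

# Right: the y axis starts at zero
ax_full = fig.add_subplot(1, 2, 2)
ax_full.bar(labels, values, color="steelblue")
ax_full.set_ylim(0, 7.2)
ax_full.set_title("Axis starting at zero")

# Compare the ratio of the visible bar heights with the true ratio
cropped_bottom = ax_cropped.get_ylim()[0]
visible_ratio = (values[1] - cropped_bottom) / (values[0] - cropped_bottom)
true_ratio = values[1] / values[0]
print(f"Visible ratio with cropped axis: {visible_ratio:.1f}")
print(f"True ratio: {true_ratio:.2f}")

plt.tight_layout()
```

In the cropped panel the second bar looks several times taller than the first, even though the underlying values differ by less than 20%.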


**Making a good graph is not trivial. It requires thinking and the ability to summarize information.** 

In fact, **it is no different from writing a good essay or writing good code.**  

You need to know enough to have a point of view.  You need to choose the right argument that is solid and valid. You need to know how to present that point of view clearly. 


Consider this example, courtesy of *Americans United for Life* and *The Washington Post*.

<img src = 'Images/abortion.png' width = '600px'>

*Americans United for Life* wants to argue that the number of abortions performed by *Planned Parenthood* is going up enormously, while other medical procedures (such as screening for cancer) have been going down dramatically.

*The Washington Post* shows that, when placed on the same scale, *Planned Parenthood* performs nearly three times as many cancer screens as abortions. That is, while the trends presented by *Americans United for Life* are correct, *Planned Parenthood* provides many other medical services besides abortions.

Unfortunately, both representations lack context.  As the following plot shows, the number of abortions in the US has decreased by more than half from the late 1970s peak. 

<img src = 'Images/us_abortions.png' width = '600px'>

But even these data do not tell the whole story.  First, the number of people who could conceivably get an abortion has not remained constant.  So, more important than the total number is the rate.  The following graph shows that the abortion rate has been below even the 1973 level since 2012.

<img src = 'Images/us_abortion_rate.png' width = '600px'>

Second, because of intimidation and state and local legislation, the number of hospitals performing abortions has decreased significantly.  This change has pushed procedures to clinics such as those run by *Planned Parenthood*.

<img src = 'Images/us_abortion_providers.webp' width = '600px'>

From here on we will focus on the actual nuts and bolts of crafting a graph in Python, but you should keep these principles in mind. 


# The free `Matlab`


**`numpy` - Numerical Python**
[Package documentation](http://docs.scipy.org/doc/numpy/)


**`scipy` - Scientific Python**
[Package documentation](http://docs.scipy.org/doc/scipy/reference/)

These two packages enable us to reproduce many of the capabilities of software such as *Matlab*.  They contain functions enabling one to do linear algebra, solve differential equations, generate pseudorandom numbers, and conduct statistical analysis.
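As a small taste, here is a minimal sketch that touches three of those capabilities. The matrix, the seed, and the sample size are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.stats as stats

# Linear algebra: solve the system A @ v = b
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
v = np.linalg.solve(A, b)
print("Solution:", v)

# Pseudorandom numbers: a seeded generator makes the draws reproducible
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Statistical analysis: summary statistics and a one-sample t-test
print("Mean:", sample.mean(), "Std:", sample.std())
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print("t =", t_stat, "p =", p_value)
```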



# Plotting with `matplotlib`

[Package documentation]()

A number of plotting packages for Python have been released in the last few years. Currently, we like and recommend `matplotlib`. `Matplotlib` was created in 2003 and is the oldest Python plotting library that has remained under active development.

However, that doesn't mean that it's always the best for all purposes or that it will remain our recommendation forever. Among biologists, the `seaborn` library has become popular. If you are able to make your data publicly available, there is a service called `plot.ly` with a Python library.

A problem with `matplotlib` is that its documentation is not particularly good. Typically, programmers copy the source code of visualizations they like and modify it. There are also lots and lots of **stackoverflow** answers concerning `matplotlib`. 

**`matplotlib` is completely customizable.** 


**To learn more, browse the documentation.**


## Create a figure object

We start by creating a `figure` object.

We can pass arguments to the figure object when we create it. For example, we can change its size.

In [None]:
fig = plt.figure(figsize = (4, 4.5))
print(type(fig))

Those dimensions are in inches. The first one is the width and the second is the height.
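If you want to check the size of a figure, you can ask the figure itself. The sketch below reads the size back, converts it to pixels using the figure's `dpi` (dots per inch), and then resizes the figure; the new size of 6 by 3 inches is an arbitrary choice.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(4, 4.5))

# Read back the size that was passed as figsize
width, height = fig.get_size_inches()
print(f"Size in inches: {width} x {height}")

# Pixels = inches * dots per inch (dpi)
print(f"Size in pixels at {fig.dpi} dpi: {width * fig.dpi} x {height * fig.dpi}")

# The size can also be changed after the figure has been created
fig.set_size_inches(6, 3)
print(fig.get_size_inches())
```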

## Creating `axes` within a figure

Now we have to create something to plot on. That something is a set of `axes` inside a `subplot`. Subplots let us have multiple graphs inside a single figure.

In [None]:
fig = plt.figure(figsize = (4, 6.5))

ax = fig.add_subplot(3, 1, 1)

ax.plot([1, 2, 3, 4], [1, 2, 3, 4], color = 'steelblue', 
        marker = 'o', lw = 2)
ax.plot([7, 8 , 9], [7, 8 , 9], color= 'orange', marker='^')

ax = fig.add_subplot(3, 1, 2)
ax.plot([7, 8 , 9], [7, 8 , 9], color= 'orange', marker='^')

ax = fig.add_subplot(3, 1, 3)
ax.plot([11, 12, 13], [11, 12, 13], color ='red')

plt.tight_layout()
plt.show()

When we specify subplots, the first number is the number of rows of plots, the second is the number of columns, and the third is the specific plot that you wish to populate. This last number goes from 1 to the maximum plot number (`num_columns * num_rows`), counting along each row from left to right before moving down to the next row. Hopefully this image will make it clearer.

<img src='Images/matplotlib_subplots.png' width = '400px'></img>
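To see the numbering for yourself, the following sketch fills a grid of subplots and writes each subplot's index inside it. The 2 by 3 grid is an arbitrary choice, and the `row` and `col` values are computed here only for labeling.

```python
import matplotlib.pyplot as plt

n_rows, n_cols = 2, 3
fig = plt.figure(figsize=(6, 3))

positions = []
for index in range(1, n_rows * n_cols + 1):
    ax = fig.add_subplot(n_rows, n_cols, index)

    # The index counts along each row before moving to the next one
    row, col = divmod(index - 1, n_cols)
    positions.append((index, row, col))

    # Write the index in the middle of the subplot and hide the ticks
    ax.text(0.5, 0.5, f"{index}\n(row {row}, col {col})",
            ha="center", va="center")
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
print(positions)
```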


## Create some data

Let's create some fake data that we can use to understand how to customize plots.

In [None]:
x = range(5,100,5)

In [None]:
# This is a LIST COMPREHENSION
#
y = [i**2 for i in x]

# which is equivalent to the code below
#
y1 = []
for i in x:
    y1.append(i**2)
    
print(y)
print(y1)

In [None]:
z = [100*i for i in x]

rv1 = [random.random() for i in range(1000)]
rv2 = [random.random() for i in range(10000)]

Let's create a simple plot using the variables `x` and `y` from above.

In [None]:
fig = plt.figure( figsize = (5, 5/1.6) )
ax = fig.add_subplot(1, 1, 1)

ax.plot(x, y)

plt.show()

## Fixing the x and y axes

Of course, the graph above is shit.  

If you showed it to me, I would be rather disappointed.

First of all, you have no axis labels! 

**How am I to know what you are plotting?**


In [None]:
fig = plt.figure( figsize = (5, 5/1.6) )
ax = fig.add_subplot(1,1,1)

# We should add a label to our dataset that will go into a legend
ax.plot(x, y, label = "Parabola")

# Now we can label the axes. 
# Always label your axes! Who knows what is in the graph otherwise
ax.set_xlabel("x")
ax.set_ylabel("f(x)")

# Display legend
ax.legend()

plt.tight_layout()

# To save the figure, call plt.savefig('file_name.png') here, before plt.show()
plt.show()

The fonts used in a graph should be easy to read. We can improve readability by playing with font style and font size.  

**Sans-serif fonts (such as Helvetica and Arial) are better for screen and poster reading.**

**Font size can help us see what is important.**  The font used for the axis labels should be larger than the one used for the tick labels.

I like to define a default size and then adjust other sizes in relation to that one.
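Here is a minimal sketch of that idea. It repeats the default size of 15 defined at the top of the notebook so that it runs on its own; the factor of 1.6 and the four data points are just illustrative choices.

```python
import matplotlib.pyplot as plt

my_fontsize = 15   # same default as defined at the top of the notebook

fig = plt.figure(figsize=(5, 5 / 1.6))
ax = fig.add_subplot(1, 1, 1)
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="Parabola")

# Axis labels are scaled up relative to the default size
ax.set_xlabel("x", fontsize=1.6 * my_fontsize)
ax.set_ylabel("f(x)", fontsize=1.6 * my_fontsize)

# Tick labels and legend use the default size
ax.tick_params(labelsize=my_fontsize)
ax.legend(fontsize=my_fontsize)

plt.tight_layout()
```

Changing `my_fontsize` in one place now rescales all of the text together.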

## Changing font sizes and adding text to an `axis`

In [None]:
# Create the figure
fig = plt.figure( figsize = (5, 5/1.6) )
ax = fig.add_subplot(1,1,1)

# We should add a label to our dataset that will go into a legend
ax.plot(x, y, label = "Parabola", color = 'steelblue', linewidth = 3, 
        marker = 'o', markersize = 5)

# Now we can label the axes. 
# Always label your axes! Who knows what is in the graph otherwise
#
# ax.set_xlabel(r"$\mu$", loc = 'right', fontsize = 1.6*my_fontsize)
# ax.set_ylabel(r"$f(\mu)$", loc = 'top', fontsize = 1.6*my_fontsize)

# Raw strings (r"...") keep Python from treating "\m" as an escape sequence
ax.set_xlabel(r"$\mu$", fontsize = 1.6*my_fontsize)
ax.set_ylabel(r"$f(\mu)$", fontsize = 1.6*my_fontsize)

# Display legend
ax.legend(loc='best', fontsize = my_fontsize, 
          markerscale = 1.2)

#Adding a panel label
ax.text(90, 500, "(a)", fontsize = 1.2 * my_fontsize)

plt.tight_layout()
plt.show()

## Changing background color and customizing the frame of the figure

You can also change the background of your image.  This is useful when preparing presentations or images for the web. However, note that a non-white background is not great for printing because it wastes ink. 


In [None]:
# Create the figure
fig = plt.figure( figsize = (5, 5/1.6) )
ax = fig.add_subplot(1,1,1, facecolor = '0.9')

# We should add a label to our dataset that will go into a legend
ax.plot(x, y, label = "Parabola", color = 'steelblue', linewidth = 3, 
        marker = 'o', markersize = 5)

# Now we can label the axes. 
ax.set_xlabel(r"$\mu$", fontsize = 1.6*my_fontsize)
ax.set_ylabel(r"$f(\mu)$", fontsize = 1.6*my_fontsize)

# Display legend (I don't like box around legend)
ax.legend(loc='best', frameon=False, fontsize = my_fontsize, 
          markerscale = 1.2)

#Adding a panel label 
ax.text(85, 500, "(a)", fontsize = 1.2 * my_fontsize)

# Thicken the bottom and left spines and move them slightly outward
for axis in ['bottom','left']:
    ax.spines[axis].set_linewidth(1.5)
    ax.spines[axis].set_position(("axes", -0.02))
# Hide the top and right spines
for axis in ['top','right']:
    ax.spines[axis].set_visible(False)

# Keep ticks only on the visible spines
# and adjust their length and thickness
ax.tick_params(width = 1.5, length = 6)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')

# Set axes limits
ax.set_xlim(0, 100)
ax.set_ylim(0, 10000)

plt.tight_layout()
# Save before plt.show(); afterwards the figure may already be cleared
# plt.savefig('quadratic_half_frame.png')
plt.show()

You see now that the number of options that you can configure to make a graph look **the way you want** is **enormous!** 

There are far too many to go over in this course, so we will stop here. 

If you want to learn more about the intricacies of `matplotlib` I think that this is a good [tutorial](http://www.labri.fr/perso/nrougier/teaching/matplotlib/).



# Taking advantage of pre-defined styles

A way to get around the default style of `matplotlib` is to use its `styles`. `matplotlib` ships with a number of pre-canned styles that look pretty okay. These by no means generate *publishable* figures, but they look decent enough to show a colleague.

You can actually change the default style in the entire notebook if you execute

    plt.style.use('style_name')
    
But I don't want to change the style of every plot in the notebook. When you don't want to change the style globally, you can just write a graph like this:

    with plt.style.context('style_name'):
        #Your graph code here
        
That `with` statement basically says that all of the code inside that block should use that style. Once we leave the `with` block, the style no longer applies.
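You can check both claims directly. The sketch below lists the style names installed with your copy of `matplotlib` and then watches one setting, `axes.grid`, before, inside, and after a `with` block. Picking `axes.grid` is just a convenient choice, since the `ggplot` style turns the grid on.

```python
import matplotlib.pyplot as plt

# Names that can be passed to plt.style.use or plt.style.context
print(plt.style.available)

grid_before = plt.rcParams["axes.grid"]

with plt.style.context("ggplot"):
    # Inside the block the ggplot settings are active
    grid_inside = plt.rcParams["axes.grid"]

# Outside the block the previous settings are back
grid_after = plt.rcParams["axes.grid"]

print(grid_before, grid_inside, grid_after)
```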


## The `ggplot` style

In [None]:
with plt.style.context('ggplot'):
    fig = plt.figure( figsize = (5, 5/1.6) )
    ax = fig.add_subplot(1,1,1)
    
    # We should add a label to our dataset that will go into a legend
    ax.plot(x, y, label = "Parabola")
    
    # Label the axes.
    ax.set_xlabel("$x$", fontsize = 1.6*my_fontsize)
    ax.set_ylabel("$f(x)$", fontsize = 1.6*my_fontsize)
    
    # Set axes limits
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 10000)
  
    # Display legend
    ax.legend(loc='best', frameon=False, fontsize = my_fontsize)

## The `bmh` style

In [None]:
with plt.style.context('bmh'):
    fig = plt.figure( figsize = (5, 5/1.6) )
    ax = fig.add_subplot(1,1,1)
    
    # We should add a label to our dataset that will go into a legend
    ax.plot(x, y, label = "Parabola")
    
    # Label the axes. 
    ax.set_xlabel("$x$", fontsize = 1.6*my_fontsize)
    ax.set_ylabel("$f(x)$", fontsize = 1.6*my_fontsize)
    
    # Set axes limits
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 10000)

    # Display legend
    ax.legend(loc='best', frameon=False, fontsize = my_fontsize)

## The `xkcd` style

In [None]:
with plt.xkcd():
    fig = plt.figure( figsize = (5, 5/1.6) )
    ax = fig.add_subplot(1,1,1)

    # We should add a label to our dataset that will go into a legend
    ax.plot(x, y, label = "Parabola", linewidth = 3)
    
    # Now we can label the axes. 
    ax.set_xlabel("x", fontsize = 1.6*my_fontsize)
    ax.set_ylabel("f(x)", fontsize = 1.6*my_fontsize)
    
    # Set axes limits
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 10000)

    # Display legend
    ax.legend(loc = 'best', frameon=False, fontsize = my_fontsize)

# Seaborn

[`Seaborn`](https://seaborn.pydata.org/index.html) has become extremely popular among plotting packages built on `matplotlib`. The reason is that it has templates for many types of plots that look quite good without much tweaking.

<img src = "Images/seaborn.png" >

`seaborn` also includes datasets that can be used to explore and test the templates.

In [None]:
import seaborn as sns

In [None]:
# Load the example planets dataset
planets = sns.load_dataset("planets")
planets


In [None]:
sns.set_theme(style="ticks")

# Initialize the figure with a logarithmic x axis
f, ax = plt.subplots(figsize=(7, 6))
ax.set_xscale("log")

# Plot the distance with horizontal boxes
sns.boxplot(x="distance", y="method", data=planets,
            whis=[5, 95], width=.6, palette="vlag")

# Add in points to show each observation
sns.stripplot(x = "distance", y = "method", data = planets,
              size = 4, color = ".3", alpha = 0.5, linewidth = 0)

# Tweak the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

In [None]:
tips = sns.load_dataset("tips")
tips


In [None]:
sns.set_theme(style="darkgrid")

g = sns.jointplot(x="total_bill", y="tip", data=tips,
                  kind="reg", truncate=False,
                  xlim=(0, 60), ylim=(0, 12),
                  color="m", height=7)

You can read more about the different seaborn styles [here](https://seaborn.pydata.org/tutorial/aesthetics.html).
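Note that `sns.set_theme` changes the look of every plot made afterwards. If you only want a seaborn style for a single plot, one option is to use `sns.axes_style` in a `with` block, much like `plt.style.context`. This is a minimal sketch with made-up points; the `whitegrid` style is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up points, just to have something to draw
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 6]

# The style applies only to axes created inside the with block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots(figsize=(5, 5 / 1.6))
    sns.scatterplot(x=x_values, y=y_values, ax=ax)
    ax.set_xlabel("x")
    ax.set_ylabel("y")

    # Remove the top and right spines
    sns.despine(ax=ax)
```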

# This is just a taste of what `matplotlib` offers

You can see simple examples of many types of plots at the Matplotlib Gallery [page](http://matplotlib.org/gallery.html).

**Find time to explore!!**