In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib notebook

<h1 align="center"> Building Salience in Scientific Figures </h1>

<br>

<div align="center">
<font size="+10"> Adam A Miller </font>
<br>
(CIERA/Northwestern/Adler)
<br>
<br>
LSSTC DSFP Session 12
<br>
<br> 
8 Feb 2020</div>

## Preamble

During Session 11, which focused on image processing, implicit in all our discussions was the importance of data processing pipelines. And with good reason: the Vera C. Rubin Observatory will perform $\sim$1000 separate observations of $\sim$37 billion sources over the course of a decade.

Amassing the largest throng of astronomers and volunteers ever assembled would not be enough to inspect every single observation made by the Rubin Observatory. Fortunately, we have pipelines.
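To get a feel for the scale, here is a rough back-of-the-envelope sketch. The observation and source counts come from the text above; the number of inspectors, the time per inspection, and the working hours are purely hypothetical assumptions chosen for illustration.

```python
# Back-of-the-envelope estimate; only the first two numbers come from the text.
n_obs_per_source = 1_000          # ~1000 observations per source (from the text)
n_sources = 37_000_000_000        # ~37 billion sources (from the text)

# Hypothetical assumptions, chosen only for illustration
n_inspectors = 100_000            # astronomers and volunteers
seconds_per_inspection = 10       # time spent looking at one observation
hours_per_day = 8                 # working hours per inspector, every day

total_observations = n_obs_per_source * n_sources
observations_per_inspector = total_observations / n_inspectors
working_seconds_per_year = hours_per_day * 3600 * 365
years_needed = (observations_per_inspector * seconds_per_inspection
                / working_seconds_per_year)

print(f"total observations: {total_observations:.1e}")
print(f"years needed: {years_needed:.0f}")
```

Under these (assumed) numbers the job would take a few hundred years, far longer than the decade-long survey itself.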

As good data scientists, we should all strive to develop software that produces fully reproducible analysis results.

This is why we spend time learning about version control, continuous integration, software containers, etc.

And yet, it is important to remember that pipelines can only execute the tasks they are asked to perform.

In late 2007, a stripped-envelope star exploded as the Type Ib supernova SN 2007uy in the nearby galaxy NGC 2770. These SNe sometimes emit in the X-rays, and so an observation was obtained with [Swift](https://www.nasa.gov/mission_pages/swift/main):

<img style="display: block; margin-left: auto; margin-right: auto" src="images/sn2007uy.jpg" width="1000" align="middle">

<div align="right"> <font size="-3">(X-ray on the left, UV/optical on the right. credit: NASA/Swift) </font></div>

NASA has developed an easy-to-use pipeline for the analysis of Swift X-ray images as part of its high energy research archive:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/Swift_pipeline.png" width="600" align="middle">

<div align="right"> <font size="-3">(credit: NASA/HEASARC) </font></div>

It would have been easy to continually extract the X-ray flux at the position of SN 2007uy. If all you did was hit "go" on such a pipeline, you'd get a nice X-ray light curve, but you'd miss this:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/sn2008D.jpg" width="1000" align="middle">

<div align="right"> <font size="-3">(X-ray on the left, UV/optical on the right. credit: NASA/Swift) </font></div>

SN 2008D was discovered literally seconds after the star exploded, and this never would have happened had Edo Berger and Alicia Soderberg not looked at the Swift X-ray images themselves.

In the field of SN science, this discovery is one of the most important from the past two decades.

Pipelines are wonderful, necessary, and often life savers, but...

**there is no replacement for looking at the data**.

or, in other words,

those who worry about the data look at the data.

The most vexing problem of my career was solved by looking at the data:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/SDSS_LRGs.png" width="800" align="middle">

<div align="right"> <font size="-3">(credit: SDSS) </font></div>

## Introduction

<br>
Session 12 is focused on data visualization and informally attempts to answer the question: what is the best process for communicating (or discovering) the most important features within a data set?
<br>

(The remainder of this talk will primarily focus on visualization as a tool for communication, but, as the preamble shows, visualization is also a very powerful tool for discovery, which will be discussed in more detail later this week.)

As scientists we primarily communicate via three mediums: 
  -  text (e.g., papers)
  -  speaking (e.g., talks)
  -  visualization (e.g., figures and slides)

As human beings we communicate via stories. 

When writing a paper or giving a talk you are certainly telling a story (with a beginning, middle, and end). It stands to reason that if you are creating a visualization you should do the same.

Telling stories with data is an idea we will repeatedly visit this week, so I won't dwell on this topic here. 

**Break Out Problem 1**

What is the story of this figure?

<img style="display: block; margin-left: auto; margin-right: auto" src="images/badColCol.jpg" width="600" align="middle">

<div align="right"> <font size="-3">(credit: Miller et al. 2015, ApJ, 798, 122) </font></div>

(*the* story is that anyone who made that is not qualified to give this talk...)

My eyes! They burn!

There are many problems with that visualization. Aside from the terrible construction (`jet` colormap, overlapping symbols, a lack of visual boundaries...), it completely fails to tell any story whatsoever.

seriously, what was I thinking...

Another lesson – captions matter. 

Captions, though they are composed of text, are a critical part of telling the story.

Nevertheless, I issue a **challenge** to each of you today: build visualizations that do not require captions. It is difficult but not impossible. 

Captions also tell stories! We will discuss more later this week, but think about whether the "default" method of writing captions – red line = this, grey dots = that – is truly in service of your story.

But what if my figure is "just" a histogram?

If it does not tell a story, then it should not be a figure. A mean/median/variance can be summarized in a table; distributions can easily be described in text (multimodal, long tails, etc.).
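As a minimal sketch of that alternative, the snippet below reduces a sample to a few summary numbers using Python's standard `statistics` module. The `sample` values are made-up placeholders, not real data.

```python
import statistics

# Made-up placeholder values standing in for whatever would be histogrammed
sample = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6]

summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "variance": statistics.variance(sample),  # sample variance (n - 1 denominator)
}

# Print one row per statistic, ready to drop into a table
for name, value in summary.items():
    print(f"{name:>8s}: {value:.3f}")
```

Three numbers in a table take far less space on the page (and far less of the reader's attention) than a figure that carries no story.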

## Effective Communication

To build effective visualizations, ask yourself the following questions:

$~~~~$ What is the fundamental purpose of this figure?

$~~~~$ What newspaper headline would accompany this figure?

$~~~~$ Will other people "steal" this figure and put it in their talks?

Consider the following:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/badFeatures.png" width="600" align="middle">

<div align="right"> <font size="-3">(credit: Miller et al. 2015, ApJ, 798, 122) </font></div>

*Purpose* – Rorschach test?

*Headline* – "Local Astronomer Fails to Effectively Communicate Anything?"

*Steal it?* – yes... if it was printed on paper and the person needed to start a fire

Now consider the following: 

<img style="display: block; margin-left: auto; margin-right: auto" src="images/MadauPlot.png" width="600" align="middle">

<div align="right"> <font size="-3">(credit: Madau 1997 AIPC, 393, 481 ) </font></div>

*Purpose* – Show the rate of cosmic star formation as a function of time.

*Headline* – "Universe Forms Fewer and Fewer Stars With Every Passing Year"

*Steal it?* – Yes!!! Literally, 10,000 times yes.

(I could not go more than 2 weeks in grad school without seeing some version of this figure.)

This figure is now known as the infamous "Madau plot." 

I find myself marveling at its simplicity. (I cannot think of any way to make this more "effective.")

<img style="display: block; margin-left: auto; margin-right: auto" src="images/MadauPlot.png" width="600" align="middle">

<div align="right"> <font size="-3">(credit: Madau 1997 AIPC, 393, 481 ) </font></div>

**Challenge #2** that I issue to each of you today: build visualizations that are so effective other astronomers must steal them from you for their talks. 

(If we're being honest, none of us will win a Nobel prize, but one of us might, one day, have a figure named after us #DSFPsquadgoals)

## Why Visualization?

Data and analysis are worthless if the results of the analysis cannot be communicated. 

To borrow a (tired?) trope - "a picture is worth a thousand words."

Relative to other topics in the DSFP, visualization proves exceptionally difficult to teach because there is no "right answer."

Every person brings their own personal history and perspective to every visualization. 

**Break Out Problem 2** 

<img style="display: block; margin-left: auto; margin-right: auto" src="images/upside_down.png" width="600" align="middle">

<div align="right"> <font size="-3">(data credit: Latest Supernovae + Transient Name Server, courtesy of G. Hosseinzadeh) </font></div>

Is the number of transients increasing or decreasing with time?

You probably said decreasing (but maybe not). Either way, this question should not require a lot of thought.

Here's a slightly different version:

<img style="display: block; margin-left: auto; margin-right: auto" src="images/better_labels.png" width="600" align="middle">

<div align="right"> <font size="-3">(data credit: Latest Supernovae + Transient Name Server, courtesy of G. Hosseinzadeh) </font></div>

As you can see, I have intentionally subverted expectations. With axes labeled properly, it is not too difficult to see what is happening here *but we all expect bar charts to point "up" for positive numbers.* Use that expectation to your advantage to reduce the cognitive load on the viewer.

(In all things, subverting expectations has its place, but for visualization you must be absolutely sure that you have your audience's attention long enough for them to understand that you are subverting expectations.)
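For comparison, here is a minimal sketch of the conventional orientation, with bars pointing "up" from a baseline at the bottom. The year and count values are invented placeholders for illustration only; they are not the Latest Supernovae + Transient Name Server data shown in the figures.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only; NOT the real discovery counts
years = [2015, 2016, 2017, 2018, 2019]
n_transients = [100, 200, 400, 800, 1600]

fig, ax = plt.subplots()
# Positive heights: bars grow upward, matching what viewers expect
ax.bar(years, n_transients, width=1, label='total reported transients')

# Drop the top and right spines to reduce clutter
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('year', fontsize=16)
ax.set_ylabel(r'$N_\mathrm{transient}$', fontsize=16)
ax.legend(loc='upper left')
fig.tight_layout()
```

With the bars pointing up, the viewer can read the trend without first having to decode the axis.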

## Appendix

Make an "upside down" histogram to mess with viewers' expectations.

In [2]:
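# whitespace-delimited table with `year` and `discoveries` columns
# (sep=r'\s+' replaces the deprecated delim_whitespace=True)
sndat = pd.read_csv('supernova_discoveries.txt', sep=r'\s+')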

In [33]:
fig, ax = plt.subplots()
ax.bar(sndat.year, -1.*sndat.discoveries, 
       width=1, label='total reported transients')
ax.set_xlim(1995.6,2020.5)

ax.spines['top'].set_linewidth(0)
ax.spines['right'].set_linewidth(0)
# ax.spines['left'].set_linewidth(0)
ax.set_xlabel('year', fontsize=16)
ax.set_ylabel(r'$N_\mathrm{transient}$', fontsize=16)  # raw string avoids an invalid '\m' escape


# misleading version
ax.set_yticklabels([])
fig.tight_layout()
fig.savefig("./images/upside_down.png")


ax.set_yticks([-25000,-20000,-15000,-10000,-5000, 0])
ax.set_yticklabels([25000,20000,15000,10000,5000, 0])
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top') 
ax.spines['top'].set_linewidth(0.8)
ax.spines['bottom'].set_linewidth(0)
ax.legend(loc=3, bbox_to_anchor=(0.05,0.8))
fig.tight_layout()
fig.savefig("./images/better_labels.png")


Make an (unnecessary) 3d histogram.

In [40]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_openml

train_samples = 5000

# Load data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)


In [86]:
fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111, projection='3d')


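# Grid of (x, y) bar positions for the 22x22 cropped image
_xx, _yy = np.meshgrid(np.arange(22), np.arange(22))
xpos, ypos = _xx.ravel(), _yy.ravel()  # separate names so the MNIST labels `y` are not overwritten

# Bar heights: pixel values of the cropped digit, scaled down by 100
top = X[41].reshape(28,28)[3:-3,3:-3].flatten()/100
bottom = np.zeros_like(top)
width = depth = 1

ax.bar3d(xpos, ypos, bottom, width, depth, top, shade=True)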
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])

fig.savefig('./images/bad_3d.png')


In [85]:
fig, ax = plt.subplots(figsize=(4,4))

ax.imshow(X[41].reshape(28,28)[3:-3,3:-3], 
          cmap='binary')
ax.set_xticks([])
ax.set_yticks([])
fig.tight_layout()
fig.savefig('./images/number8.png')
