<div style="border-bottom: 2px solid #aaaaaa; border-right: 2px solid #aaaaaa; box-shadow: 5px 5px 3px #eeeeee;">
<h1>01 &#9658; Introduction</h1>
</div>

## Data Visualization - Why?

### Data Analysis

- We're asking questions about the data
- Trying to make sense of some underlying stories
- Better decision making through access to data
- Greater understanding

### Communication

- Present the data, the analysis, the stories
- Inform, educate, entertain
- Monitor, validate

## Data Visualization - What?

### Visual and External Representation

- Turning data into a picture

- Visual representation helps people to carry out tasks
- It is used to augment human capabilities
- It exploits powerful human visual pattern detection


- External representation (external memory) augments our capacity
- Surpass limitations of internal cognition and memory
- Diagrams that organise by spatial location enhance search and recognition
- Can be physical objects as well as computer displays


### What we need to think about

- Consider design principles
- Work with human visual perception

### Do you know where these locations are from their latitude and longitude?

- 53.47063, -2.23603
- 53.47439, -2.25214
- 53.47732, -2.23710

In [None]:
from IPython.display import IFrame

In [None]:
IFrame(("https://www.google.com/maps/d/embed?mid=1ZZjynLY-Wq0Z00e2aaEyc5Fh2Z4&hl=en"), width=640, height=480)

## Human-in-the-loop

- Why not use some automated system?
- Use statistical analysis to derive some overall meaning?
- Use machine learning for complex datasets?


- Useful when the questions are well known and structured
- And the statistical analysis or machine learning algorithm is known to deliver the results


- However, lots of data is poorly understood
- The reliability and suitability of the analysis is unknown or insufficient
- There are potentially many (changing) questions to ask


- Sometimes there are no suitable automated mechanisms available
- No systems that can yet compete with the human visual system


- Sometimes a human needs to be in the loop

### Anscombe’s Quartet

In 1973, Francis Anscombe devised these four datasets to demonstrate the benefit of using visualization in addition to statistical methods for analysis.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image, HTML

In [None]:
data = np.genfromtxt('data/anscombe.csv', delimiter=',', names=True, filling_values=0.0, dtype=None)

Let's initially take a look in tabular form.

In [None]:
# deliberately messy/nasty code or just ran out of time?

print("  Set I     ||   Set II    ||   Set III   ||   Set IV")
print("  x1 |   y1 ||   x2 |   y2 ||   x3 |   y3 ||   x4 |  y4")
print("-----+------++------+------++------+------++------+-----")
for row in data:
    print("%4.1f | %4.1f || %4.1f | %4.1f || %4.1f | %4.1f || %4.1f | %4.1f" % (row['x1'], row['y1'], row['x2'], row['y2'], row['x3'], row['y3'], row['x4'], row['y4']))

Perhaps you can start to visualize in your mind what is going on with these four datasets, but it takes some *"thinking"*. On inspection, **Set IV** has some unusual *x values*. It doesn't help that the other sets don't even have their x values in order.

Some statistical analysis produces the following. All four sets have *nearly identical* statistical properties:

- Mean of x = 9
- Variance of x = 11
- Mean of y = ~7.50
- Variance of y = [4.122,4.127]
- Correlation between x and y = 0.816
- Linear regression of y = 3 + 0.5x
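These figures can be checked with a few lines of Python. The following is a minimal sketch, not part of the original analysis: it assumes the standard published values of the quartet (which `data/anscombe.csv` is presumed to contain), and the names `quartet` and `summarise` are illustrative only. It uses the sample variance and an ordinary least-squares fit.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet
# (assumed here to match the contents of data/anscombe.csv).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarise(x, y):
    """Return the summary statistics for one set of x,y points."""
    mx, my = mean(x), mean(y)
    # Sample covariance (n - 1 in the denominator, to match variance()).
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / variance(x)  # least-squares gradient
    return {
        'mean_x': mx, 'var_x': variance(x),
        'mean_y': my, 'var_y': variance(y),
        'corr': cov / (variance(x) * variance(y)) ** 0.5,
        'slope': slope, 'intercept': my - slope * mx,
    }

for name, (x, y) in quartet.items():
    s = summarise(x, y)
    print(f"Set {name:<3} mean_x={s['mean_x']:.2f} var_x={s['var_x']:.2f} "
          f"mean_y={s['mean_y']:.2f} var_y={s['var_y']:.3f} "
          f"corr={s['corr']:.3f} y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

The four printed rows agree to two or three significant figures, which is exactly why the summary statistics alone cannot tell the sets apart.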

Yet, when plotted, they look very different.

In [None]:
best_fit = {'x': [0, 20], 'y': [3, 13]}  # end points of the shared regression line y = 3 + 0.5x
fig, ax = plt.subplots(2, 2, figsize=(10, 10))

# One panel per dataset: the points in red, then the regression line on top.
for panel, (number, name) in zip(ax.flat, enumerate(['I', 'II', 'III', 'IV'], start=1)):
    panel.plot(data['x%d' % number], data['y%d' % number], 'ro')
    panel.set_title('Set ' + name)
    panel.set_xlim([3, 20])
    panel.set_ylim([2, 14])
    panel.plot(best_fit['x'], best_fit['y'])

## Types of Visualization

While all visualization is an attempt to plot data in one form or another, the academic field of visualization
is commonly broken down into areas such as Scientific Visualization, Information Visualization, Illustrative Visualization
and Visual Analytics. There are also specialist areas, e.g., molecular visualization.

### Scientific Visualization

- Usually means physical/scientific raw data
- Visualized with a spatial/geometric representation
- The representation matches that of the data and the physical world

Examples:

- X-ray (medical) volumes
- Engineering models and simulations
- Fluid flow through valves

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Visualization of the Young's Modulus on a Random Finite Element Model of a Nuclear Graphite Brick</h3>
![Visualization of the Young's Modulus on a Random Finite Element Model of a Nuclear Graphite Brick](images/rfem_brick.png)
<a href="http://parafem.org"><small>[Source]</small></a></div>

In real life, a (similar) brick looks like this, and it's easy to see the direct mapping between the real world and the data/model representation.

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Nuclear Graphite Brick</h3>
![Nuclear Graphite Brick](images/graphite_brick.jpg)
<a href="http://parafem.org"><small>[Source]</small></a></div>

### Information Visualization (the main focus of this workshop)

- Has a greater focus on abstract, possibly non-numerical, data
- Requires some **convention** to be applied to represent it spatially

Examples:

- Population chart
- Social network graphs
- Calendar

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">One representation of a calendar</h3>
![One representation of a calendar](images/calendar.png)
<a href="https://bl.ocks.org/mbostock/4063318"><small>[Source]</small></a></div>

Are days of the week actually little squares? Are months of the year physically positioned like this? ;-)

## Representation and Convention

- Lots of data is abstract in nature
- It does not have a direct *physical* representation
- We must apply some *conventions* and *abstractions* to translate from data to picture


- The table of Anscombe's Quartet data allows individual values to be looked up precisely
- However, as we saw, it is not the ideal format for identifying patterns, observing trends or spotting outliers


### Abstract Visualization Objects

- To achieve the translation from data to picture we must map the data to one or more Abstract Visualization Objects (AVOs)
- An AVO is some visual representation that can be parameterised

In [None]:
line_fig,line_ax = plt.subplots(1,1,figsize=(8,4))

line_ax.set_title('Line AVO and some example styles')
line_ax.set_xlim([0,12])
line_ax.set_ylim([0,6])

line_ax.plot([1,5], [1,5],'k-')
line_ax.plot([2,6], [1,5],'r-')
line_ax.plot([3,7], [1,5],'y--')
line_ax.plot([4,8], [1,5],'co--')
line_ax.plot([4,8], [1,5],'b*:')
line_ax.plot([5,9], [1,5],'m-', linewidth=3)
line_ax.plot([6,10], [1,5],'gd-.', linewidth=2, markersize=15)
line_ax.arrow(7, 1, 4, 4, head_width=0.5, head_length=0.5, fc='r', ec='k');

- e.g., a single straight line segment
 - has two x,y coordinates \*
 - has a thickness
 - has a line style (solid, dotted, dashed, etc)
 - has a colour (perhaps just greyscale intensity)
 - has an opacity
 - has a head and/or tail marker/glyph (arrows etc)
 - has a head/tail marker colour
 - annotations
- And that's before any consideration for its relationship and connectivity with other AVOs, or its temporal nature

\* Note: Actually the x's and y's are independent ordinates themselves.
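One way to make "a visual representation that can be parameterised" concrete is to write the parameters down as a data structure. The sketch below is illustrative only: `LineAVO` and its field names are hypothetical, not part of any library. The keyword names returned by `plot_kwargs` (`linewidth`, `linestyle`, `color`, `alpha`, `marker`, `markerfacecolor`) are standard matplotlib line properties.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LineAVO:
    """A hypothetical container for the parameters of one straight line segment."""
    start: Tuple[float, float]
    end: Tuple[float, float]
    thickness: float = 1.5
    style: str = '-'                      # '-', '--', ':' or '-.'
    colour: str = 'k'
    opacity: float = 1.0
    marker: Optional[str] = None          # glyph drawn at both end points
    marker_colour: Optional[str] = None   # falls back to the line colour

    def xs_ys(self):
        """Split the two points into the separate x and y sequences that plot() expects."""
        return [self.start[0], self.end[0]], [self.start[1], self.end[1]]

    def plot_kwargs(self):
        """Translate the parameters into keyword arguments for matplotlib's Axes.plot."""
        kwargs = {'linewidth': self.thickness, 'linestyle': self.style,
                  'color': self.colour, 'alpha': self.opacity}
        if self.marker is not None:
            kwargs['marker'] = self.marker
            kwargs['markerfacecolor'] = self.marker_colour or self.colour
        return kwargs

# The thick magenta line from the figure above, expressed as parameters.
thick_magenta = LineAVO(start=(5, 1), end=(9, 5), thickness=3, colour='m')
xs, ys = thick_magenta.xs_ys()
# In the notebook this could be drawn with:
#   line_ax.plot(xs, ys, **thick_magenta.plot_kwargs())
```

Note how `xs_ys` reflects the footnote: the plotting call takes the x values and the y values as two independent sequences rather than as pairs.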

- From the dataset, one or more, or a combination of the data variables can be used to *parameterise* the AVOs
- Assuming the data variable is appropriate, or can be prepared by *mapping* and *filtering*
- e.g., for the Anscombe datasets (which are relatively simple)
 - each x,y point maps directly to x,y positions on the plot
 - each x,y point is represented as a small circle glyph
 - the colour is statically chosen
 - the size of the glyph has been chosen

### Effective Data Mapping

- There are two principal **Data Classifications** that we can use to help with the abstraction and mapping of data:
 - *Quantitative data* - essentially magnitude values and ordered data
 - *Categorical data* - identity data and unordered data


- There are several geometric **Data Primitives** (AKA marks) that can be used to represent data:
 - Points
 - Lines
 - Areas
 - Volumes
 

- Several **Data Channels**\* for controlling their appearance:
 - Quantitative Data:
   - Position
     - On dependent scales
     - On independent/unaligned scales
   - Length
   - Angle
   - Area
   - Depth
   - Luminance
   - Saturation
   - Curvature
   - Volume
 - Categorical Data:
   - Spatial location
   - Hue
   - Motion
   - Shape (Glyph)


- And several **Gestalt Principles of Perception** covering how we see patterns and forms:
  - Proximity
  - Similarity
  - Enclosure
  - Closure
  - Continuity
  - Connection
  
  
- The primitives and channels are used to build all data visualizations, no matter how complex, and the principles of perception can be used to take advantage of our visual system.


\* Note: The data channels above are ordered by their overall effectiveness, from best to worst.
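The ranking can be read as a simple rule of thumb: give the most important variable the most effective channel for its classification. The sketch below encodes the two lists from above and applies that rule greedily. The function `assign_channels` and the example variable names are hypothetical, and the approach is deliberately naive: it ignores interactions between channels, such as position and spatial location competing for the same space.

```python
# Channel lists copied from the text above, most effective first.
CHANNELS = {
    'quantitative': ['position', 'length', 'angle', 'area', 'depth',
                     'luminance', 'saturation', 'curvature', 'volume'],
    'categorical': ['spatial location', 'hue', 'motion', 'shape'],
}

def assign_channels(variables):
    """Give each variable the most effective unused channel of its classification.

    `variables` is a list of (name, classification) pairs, most important first.
    """
    used = set()
    assignment = {}
    for name, classification in variables:
        for channel in CHANNELS[classification]:
            if channel not in used:
                assignment[name] = channel
                used.add(channel)
                break
        else:
            # The inner loop finished without finding a free channel.
            raise ValueError(f'no free channel left for {name!r}')
    return assignment

example = assign_channels([
    ('temperature', 'quantitative'),
    ('rainfall', 'quantitative'),
    ('region', 'categorical'),
])
print(example)
```

In practice the choice also depends on the task and the idiom, so treat the output as a starting point rather than a design.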

### Human Visual Perception

The reason for the data classification used above is that it maps onto how our visual system works. The visual cortex of the brain - the "seeing" part - operates in two modalities: identity (what or where something is) and magnitude (how much there is).


The brain's cerebral cortex - the "thinking" part - is responsible for asking and answering the questions we have, such as *how long is that line*, and *is it longer than that other line?*


The visual cortex is fast and efficient; the cerebral cortex is slow and inefficient. Therefore analysing a basic data table requires a great deal more cognition than a representative visualization does, because the external visualization shifts the hard work to the visual cortex and frees our mind to pose the important questions.

### Visualization Pipeline

- The whole process of visualization is effectively a pipeline
- The Haber & McNabb model shows how the various forms of the data flow from source to display

![Haber & McNabb Visualization Pipeline](images/Dossantos04vis_pipeline.png)

<a href="http://www.infovis-wiki.net/index.php/Visualization_Pipeline"><small>[InfoVis Wiki]</small></a>
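As a toy illustration of the pipeline idea, each stage can be written as a function that takes one form of the data and returns the next. The stage names below follow how the model is usually described (data analysis, filtering, mapping, rendering); the function names, the sample rows and the text-based "rendering" are assumptions made purely for this sketch.

```python
def analyse(raw):
    """Data analysis: parse raw text rows into numeric (x, y) pairs."""
    return [tuple(float(v) for v in row.split(',')) for row in raw]

def filter_focus(prepared, x_max):
    """Filtering: keep only the part of the data we want to look at."""
    return [(x, y) for x, y in prepared if x <= x_max]

def map_to_geometry(focus):
    """Mapping: turn each data point into a bar whose length encodes y."""
    return [{'label': f'x={x:g}', 'length': round(y)} for x, y in focus]

def render(geometry):
    """Rendering: draw each bar as a row of characters."""
    return [f"{g['label']:>5} | {'#' * g['length']}" for g in geometry]

# Raw data -> prepared data -> focus data -> geometric data -> image data.
raw = ['4,4.26', '8,6.95', '14,9.96', '19,12.50']
image = render(map_to_geometry(filter_focus(analyse(raw), x_max=14)))
print('\n'.join(image))
```

Swapping any one stage, for example a different `map_to_geometry` that produces points instead of bars, changes the picture without touching the rest of the pipeline.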

### Visualization Idioms and Techniques

While all of the primitives and channels are used to create visualizations, we generally don't produce an entire visualization from scratch from the low-level components. Instead, we're more likely to use high-level techniques and idioms.

- Lots of visualization idioms
- Many ways to encode data as a picture
- Visit d3js.org for many examples of information visualization

- Not always applicable to data or task
- Use multiple appropriate techniques to compare and contrast

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Start of the Examples page on d3js.org</h3>
![Start of the Examples page on d3js.org](images/d3js_visual_index_top.png)
<a href="http://d3js.org"><small>[Source]</small></a></div>

## Design and Ethics

### Effectiveness

- How effective is your visualization?
- How effective is the choice of idiom (AVO)?


- Relationship
- Quantities
- Comparison
- Rank


- Obvious?
- "Seeing" vs "thinking"


### Use of Visualization

How are you going to use the visualization? Who is the audience and the delivery medium?

- Static or dynamic and interactive?
 - Static visualization for presentation (especially print)
 - Dynamic web-app for interacting with data
- Exploration or presentation?
 - Exploring the unknown
 - Confirming models and predictions
 - Presentation of the known
- Short-term or long-term requirement?
 - Building towards the next automated tool
 - Validation and checking an existing tool
 - Teaching students or the public about how something works
 - Or will it be a "permanent" tool?

### Summary

- Careful choice about design, about the idioms used
- Careful choice about the tool used, its appropriateness for the task
- Careful choice about the effectiveness
- Concern for the correctness, accuracy and truth
- Careful choice about the emphasis

> “It's not just about making pretty pictures”

### Some Examples (on how not to do it)

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Fox News and a rather large pie</h3>
![Fox News and a rather large pie](images/Fox-News-pie-chart.png)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Pie charts show proportions and the total should be 100%.</h3>
<h3>&#x2717; Most likely, respondents were asked whether they would vote for said candidate "if they were the choice", not "who would you choose".</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">In-pie-ception</h3>
![In-pie-ception](images/biginsights.jpg)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Too many categories for a pie chart.</h3>
<h3>&#x2717; Too many pie charts!</h3>
<h3>&#x2717; What does a pie chart in a cluster mean?</h3>
<h3>&#x2717; What does the overall size of the pie charts mean?</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Web of Policies</h3>
![Web of Policies](images/Cc6YayEWoAIcq8R.jpg)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Nodes of graph are not clear - obstructed by overly large labels.</h3>
<h3>&#x2717; Lots of connections but nothing to denote what the connection is or why it's important.</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">There's some data but just what is the question?</h3>
![There's some data but just what is the question?](images/tumblr_o3f706ys4l1sgh0voo1_1280.jpg)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Chart title should be used to set the context and describe what's being shown.</h3>
<h3>&#x2717; While the data might reflect "favourable", the title has made it unclear.</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Categorically wrong</h3>
![Categorically wrong](images/trends.png)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Colour mapping used is for sequential/quantitative data.</h3>
<h3>&#x2717; Choice of similar colours makes it hard to discern the candidates.</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Funneldamental Error</h3>
![Funneldamental Error](images/funnel.png)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Chart is full of junk and uninterpretable parts.</h3>
<h3>&#x2717; The funnel is segmented and has data placed alongside, but the scale of the segments bears no relationship to the data.</h3>
<h3>&#x2717; Colour of the segments bears no relation to the data.</h3>
<h3>&#x2717; Does a great job of obfuscating the data.</h3>
</div>

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">Need for Quarantine</h3>
![Need for Quarantine](images/effects.png)
<a href="http://viz.wtf"><small>[Source]</small></a>
<h3>&#x2717; Pointless illustration to accompany what is ostensibly a list.</h3>
<h3>&#x2717; Positioned annotations imply location is important.</h3>
</div>

> ## “The purpose of [scientific] computing is insight, not numbers”
> R W Hamming, Numerical Methods for Scientists & Engineers, 1962

> ## “The purpose of visualization is insight, not pictures”
> Ben Shneiderman, Readings in Information Visualization: Using Vision to Think, 1999