<div style="border-bottom: 2px solid #aaaaaa; border-right: 2px solid #aaaaaa; box-shadow: 5px 5px 3px #eeeeee;">
<h1>02 &#9658; Data Representation</h1>
</div>

Each visualization technique (or idiom) is suited to showing particular types of data. Sometimes the right technique is obvious; sometimes it is a design choice. Starting with the familiar pie chart,
we can see how well *proportions* are represented.

We'll start with a test. There's some code to run first... don't look too closely just yet!

In [None]:
%matplotlib inline
import json  # used below to load the example datasets

import matplotlib.pyplot as plt

from bqplot import *
import numpy as np
import pandas as pd

from ipywidgets import *
from IPython.display import display

In [None]:
# some chart and figure templates

no_margin = dict(left=0, right=0, top=0, bottom=0)
sm_margin = dict(left=5, right=5, top=5, bottom=5)
bar_margin = dict(left=20, right=0, top=0, bottom=20)

eg_pie_margin = dict(left=0, right=0, top=0, bottom=0)

pie_tmpl = dict(radius=100, labels = ['A','B','C','D','E'],
                label_color = 'white', font_size = '20px', font_weight = 'normal')
fig_tmpl = dict(animation_duration=1000, fig_margin=no_margin, min_width=250, min_height=250)

bar_chart_tmpl = dict(padding=0.2, type='grouped', colors=['#dd7777','#7777dd'])
bar_fig_tmpl = dict(min_width=250, min_height=250, fig_margin=bar_margin)
hbox_tmpl = dict(width="250px", border="solid 1px #dddddd")
eg_box_tmpl = dict(width="500px", height="500px", border="solid 1px #dddddd")

In [None]:
# bqplot bar chart, figure and legend helper functions

def mk_eg_pie(eg_pie):
    pie = Pie(**eg_pie,radius = 180)
    pie.label_color = 'white'
    pie.font_size = '20px'
    pie.font_weight = 'normal'
    fig = Figure(marks=[pie], fig_margin=eg_pie_margin, min_width=400, min_height=400)
    fig.layout.width = '400px'
    fig.layout.height = '400px'
    box = VBox([fig])
    return box

def mk_eg_bar(eg_bar):
    x_scale = OrdinalScale()
    y_scale = LinearScale()
    bar = Bars(**eg_bar, scales={"x":x_scale, "y":y_scale}, padding=0.2)
    ax_x = Axis(scale=x_scale)
    ax_y = Axis(scale=y_scale, orientation='vertical', tick_format='0.0f')
    fig = Figure(marks=[bar], axes=[ax_x, ax_y], min_width=400, min_height=400, fig_margin=bar_margin)
    fig.layout.width = '400px'
    fig.layout.height = '400px'
    box = VBox([fig])
    return box

def mk_bar_chart(x, y, xy_scales, b_tmpl):
    return Bars(x=x, y=y, scales=xy_scales, **b_tmpl)

def mk_bar_figure(bar,f_tmpl):
    ax_x = Axis(scale=bar.scales['x'])
    ax_y = Axis(scale=bar.scales['y'], orientation='vertical', tick_format='0.0f')
    new_fig = Figure(marks=[bar], axes=[ax_x, ax_y], **f_tmpl)
    return new_fig

def mk_legend(labels):
    leg_x_scale = OrdinalScale()
    leg_y_scale = LinearScale()
    leg_chart = Bars(x=[0,1], y=[[0,0],[0,0]],
                     scales={'x': leg_x_scale, 'y': leg_y_scale},
                     **bar_chart_tmpl, 
                     labels=labels,
                     display_legend=True)
    leg_ax_x = Axis(scale=leg_x_scale)
    leg_ax_y = Axis(scale=leg_y_scale, orientation='vertical', tick_format='0.0f')
    leg_fig = Figure(marks=[leg_chart],legend_location='top-left',min_width=250,min_height=75,fig_margin=bar_margin)
    
    return leg_fig

# ipywidget observer update and submit functions

def upd_estimate_observe(update):
    update['owner'].data_ref[update['owner'].pie_index]['estimated'][update['owner'].label_index] = update['new']
    
def submit_guesses(b):
    for i, chart in enumerate(b.dashboard['bar_charts']):
        chart.y = b.dashboard['bar_data'][i]
    if not b.dashboard['displayed']:
        display(b.dashboard['ui_bars'])
        display(b.dashboard['legend'])
        b.description='Update'
        b.dashboard['displayed'] = True

In [None]:
def mk_pie_test_dashboard(dataset):
    
    dashboard = {}
    dashboard['dataset'] = dataset
    dashboard['displayed'] = False
    
    dashboard['scales'] = {'x':OrdinalScale(), 'y':LinearScale()}
    dashboard['x_data'] = ['A','B','C','D','E']
    
    # load the specified dataset
    with open(dataset, 'r') as fp:
        dashboard['y_data'] = json.load(fp)
        
    # add some initial data for user estimates
    for d in dashboard['y_data']:
        d['estimated'] = [0,0,0,0,0]
        
    # use bqplot to create some Pie and Figure objects
    dashboard['pies'] = [Pie(**d,**pie_tmpl) for d in dashboard['y_data']]
    dashboard['figs'] = [Figure(marks=[p],**fig_tmpl) for p in dashboard['pies']]
    dashboard['fig_boxes'] = [HBox([f],**hbox_tmpl) for f in dashboard['figs']]
    
    # build up corresponding bar chart data and the bar charts
    dashboard['bar_data'] = [[d['sizes'],d['estimated']] for d in dashboard['y_data']]
    dashboard['bar_charts'] = [mk_bar_chart(dashboard['x_data'], y, dashboard['scales'], bar_chart_tmpl)
                               for y in dashboard['bar_data']]
    dashboard['bar_figs'] = [mk_bar_figure(b, bar_fig_tmpl) for b in dashboard['bar_charts']]
    dashboard['bar_boxes'] = [HBox([bf],**hbox_tmpl) for bf in dashboard['bar_figs']]
    
    dashboard['legend'] = mk_legend(['Actual','Estimated'])
    
    # build up all the Text widgets for estimates
    dashboard['widgets'] = []
    for i_pie,x in enumerate(dashboard['y_data']):
        dashboard['widgets'].append([])
        for i_seg,seg in enumerate(dashboard['y_data'][i_pie]['estimated']):
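            tmp_w = IntText(value=seg, description=pie_tmpl['labels'][i_seg])
            # attach bookkeeping as plain attributes; recent traitlets versions
            # do not set non-trait keyword arguments on the widget
            tmp_w.data_ref = dashboard['y_data']
            tmp_w.pie_index = i_pie
            tmp_w.label_index = i_seg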
            tmp_w.layout.width = '150px'
            tmp_w.observe(upd_estimate_observe, names="value")
            dashboard['widgets'][i_pie].append(tmp_w)
    
    dashboard['button'] = Button(description="Submit")
    dashboard['button'].dashboard = dashboard  # read back in submit_guesses
    dashboard['button'].on_click(submit_guesses)
    
    dashboard['ui_pies'] = HBox(dashboard['fig_boxes'])
    dashboard['ui_bars'] = HBox(dashboard['bar_boxes'])    
    dashboard['ui_widgets'] = [HBox([VBox([w for w in g])],**hbox_tmpl) for g in dashboard['widgets']]
    dashboard['ui_guess'] = HBox([u for u in dashboard['ui_widgets']])
                                 
    dashboard['ui'] = VBox([dashboard['ui_pies'],dashboard['ui_guess'], dashboard['button']])    

    return dashboard

## Exercise

Almost there:

- The next cell will display *FOUR* pie charts with *FIVE* segments each
- In the boxes below each one, enter your estimate for what percentage share each one represents
- When ready press the **Submit button** to see how the _actual_ values compare with your _estimates_

In [None]:
dash1 = mk_pie_test_dashboard('data/multi_pie_data1.json')
dash1['ui']
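
Once the bar charts appear, a single number can summarise how close the guesses were. The sketch below is not part of the dashboard code: `mean_abs_error` is a hypothetical helper, and the values are made up for illustration rather than taken from the exercise data.

```python
def mean_abs_error(actual, estimated):
    """Average absolute gap, in percentage points, between actual and estimated shares."""
    if len(actual) != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

# made-up example values, not taken from the exercise data files
actual = [22, 18, 20, 21, 19]
estimated = [25, 15, 20, 20, 20]

print(mean_abs_error(actual, estimated))
```

A lower value means the estimates were closer to the actual shares.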

Let's try some new data and the same test.

In [None]:
dash2 = mk_pie_test_dashboard('data/multi_pie_data2.json')
dash2['ui']

## Some explanation

Load some more data...

In [None]:
with open('data/eg_pies.json', 'r') as fp:
    eg_pies = json.load(fp)

In [None]:
with open('data/eg_bars.json', 'r') as fp:
    eg_bars = json.load(fp)

This shows five equal sized pie segments. **Do they look equal?**

In [None]:
mk_eg_pie(eg_pies[0])

And as observed in the test, **can you confidently select the largest segment out of these? Or put them in order?**

In [None]:
mk_eg_pie(eg_pies[1])

**Is it a little easier with a bar chart?**

In [None]:
HBox([mk_eg_pie(eg_pies[1]),VBox([mk_eg_bar(eg_bars[1])])])

The test included pies where the segments had been rotated but the values remained the same. **Did the segments look the same size when in different positions and orientations?**

In [None]:
#HBox([mk_eg_pie(eg_pies[2]), mk_eg_pie(eg_pies[3])])
HBox([VBox([mk_eg_pie(eg_pies[2])]),VBox([mk_eg_pie(eg_pies[3])])])

**Did the colour have any effect on perception of proportion?**

If this was a poll on political candidates, is B doing better than C or D?

Pie charts are commonly used to show share of votes and unless there is an obvious "lead" for someone, the data shown is effectively useless. **Which is why the raw numbers typically appear next to the pie chart - so why bother?**

> You can't compare angles and slice areas to each other. Human perception and cognition is poor when viewing angles and areas and trying to make a mental comparison. Pie charts overload the working memory, forcing the person to make complicated calculations, and at the same time make a decision based on those comparisons.

> What's the point of showing a pie chart when you want to compare data, except to say, "well, the slices are almost the same, but I'm not really sure which one is bigger, or by how much, or what order they are from largest to smallest. But the colors sure are pretty. Plus, I like round things. Oh, was I suppose to make some important business decision? Sorry."

> &mdash; <a href="https://blogs.oracle.com/experience/entry/pie_charts_just_dont_work_when_comparing_data_-_number_10_of_top_10_reasons_to_never_ever_use_a_pie">[Oracle Blogs]</a>
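
To put the quoted point in numbers: a pie slice's angle is its share of the full 360 degrees, so slices that differ by a couple of percentage points differ by only a few degrees. A minimal sketch, using a hypothetical helper name (`slice_angles`) and made-up shares:

```python
def slice_angles(shares):
    """Convert values to pie-slice angles in degrees (the full circle is 360)."""
    total = sum(shares)
    return [360 * s / total for s in shares]

shares = [21, 19, 20, 22, 18]  # made-up percentages, for illustration only
angles = slice_angles(shares)

for label, share, angle in zip("ABCDE", shares, angles):
    print(f"{label}: {share}% -> {angle:.1f} degrees")
```

In this made-up case, A and B differ by two percentage points, which is only about 7 degrees of arc.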

In [None]:
HBox([VBox([mk_eg_pie(eg_pies[4])]),VBox([mk_eg_pie(eg_pies[5])])])

A lot of these examples have shown the drawbacks of pie charts while only using five segments. The problem is much worse as the
number of categories is increased. **Making comparisons between every combination of slices is a little taxing.**
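
As a rough illustration of how quickly the comparison burden grows, the number of distinct slice pairs for `n` slices is `n * (n - 1) / 2`. The helper name below is hypothetical:

```python
from math import comb

def pairwise_comparisons(n_slices):
    """Number of distinct pairs of slices that could be compared."""
    return comb(n_slices, 2)

for n in (5, 11):
    print(n, "slices ->", pairwise_comparisons(n), "pairwise comparisons")
```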

With 11 categories, other problems have cropped up too:

1. The default set of 10 colours has started to be re-used (see A and Q).
2. The labelling is cluttered and overlapping.
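
The colour re-use in point 1 follows from cycling through a fixed palette: once the categories outnumber the colours, the assignment wraps around. A sketch, assuming a generic 10-colour palette (the colour names and category labels are placeholders, not the library's actual defaults):

```python
from itertools import cycle

palette = [f"colour{i}" for i in range(10)]   # placeholder 10-colour palette
categories = [f"cat{i}" for i in range(11)]   # 11 placeholder categories

# assign colours in order, wrapping around when the palette runs out
assignment = dict(zip(categories, cycle(palette)))

# group categories by colour and keep only the colours used more than once
by_colour = {}
for cat, col in assignment.items():
    by_colour.setdefault(col, []).append(cat)
clashes = {col: cats for col, cats in by_colour.items() if len(cats) > 1}

print(clashes)
```

With 11 categories and 10 colours, exactly one colour ends up shared by the first and last category.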

In [None]:
mk_eg_pie(eg_pies[6])

Without numbers, exactly how many people have X?

In [None]:
b7 = mk_eg_pie(eg_pies[7])
f7 = b7.children[0]
p7 = f7.marks[0]
p7.radius = 160
f7.title = "Latest studies showing how many people have X"
f7.display_legend = True
display(b7)

Alternatively you could simply write: **21% of people have X**

## Representation of Data - Changing Data

With the pie chart examples, the proportional data shown is a percentage share of the absolute data values. When comparing pie charts side-by-side, both show a total of 100% of *something*, but is the absolute total the same for each one, or does it change too? If the absolute total changes over time, what does a change in the size of a share actually show?

### Use pandas to load CSV as a dataframe

In [None]:
df = pd.read_csv('data/kramerica_industries.csv', index_col=0)
df

### Some Pandas access code for reference/interest

In [None]:
# get the column names for labels
df.columns

In [None]:
# get a row from dataframe
df.loc[1996].values

In [None]:
# get row keys
df.index

### Let's plot some sales figures

In [None]:
# Some supporting data
labels = df.columns
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'pink', 'lightgrey']

In [None]:
# Create two plots with a pie each for 1996 and 1997 data rows

kfig, kax = plt.subplots(1, 2, figsize=(14,6))
kax[0].pie(df.loc[1996].values, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140);
kax[1].pie(df.loc[1997].values, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140);
kfig.suptitle('Kramerica Industries - Sales by Sector - 1996 to 1997', fontsize=14, fontweight='bold');

This representation could be misleading. At a quick glance, the 1997 pie chart on the right makes it look as if the "Oil Bladders" sector of Kramerica Industries a) has grown (from 9.0% to 16.8%), along with other changes, and b) contributes more to the overall sales figures.

However, as the pie chart is showing percentage share, there is no indication of what the absolute data values are.
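
A small worked example shows how a share can rise while the underlying figure falls. The numbers are made up for illustration and are not taken from the Kramerica dataset; `share` is a hypothetical helper:

```python
def share(part, total):
    """Percentage share of `part` within `total`."""
    return 100 * part / total

# made-up figures in $ millions, for illustration only
year1 = {"sector": 20, "total": 200}
year2 = {"sector": 15, "total": 100}

s1 = share(year1["sector"], year1["total"])
s2 = share(year2["sector"], year2["total"])

print(f"Year 1: {year1['sector']} of {year1['total']} = {s1:.1f}%")
print(f"Year 2: {year2['sector']} of {year2['total']} = {s2:.1f}%")
```

The sector's sales fall from 20 to 15, yet its slice of the pie grows, because the total shrank faster than the sector did.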

In [None]:
# create two plots with a bar chart each for 1996 and 1997 data rows

kfig, kax = plt.subplots(1,2, figsize=(14,6))
index = np.arange(6)
bar_width = 0.8
x_offset = 0.1

# two bar charts of "sector" v "sales"
kax[0].bar(index+x_offset, df.loc[1996].values, bar_width, color=colors)
kax[1].bar(index+x_offset, df.loc[1997].values, bar_width, color=colors)

# set x ticks and labels - we'll go for funky sloped labels this time
kax[0].set_xticks(index+x_offset)
xtickNames0 = kax[0].set_xticklabels(labels)
plt.setp(xtickNames0, rotation=30, fontsize=10)
kax[1].set_xticks(index+x_offset)
xtickNames1 = kax[1].set_xticklabels(labels)
plt.setp(xtickNames1, rotation=30, fontsize=10)

# set labels, plot titles and limits (to fix size in both - should be calculated!)
kax[0].set_ylabel('$ Millions')
kax[0].set_title('1996 Sales')
kax[0].set_ylim([0,250])

kax[1].set_ylabel('$ Millions')
kax[1].set_title('1997 Sales')
kax[1].set_ylim([0,250]);

# add overall title and render
kfig.suptitle('Kramerica Industries - Sales by Sector - 1996 to 1997', fontsize=14, fontweight='bold');

With bar charts instead, a side-by-side comparison of the absolute sales figures per sector makes it easy to see that sales have dropped in all sectors. Furthermore, for each chart you can a) determine the leading sector and b) with some trivial effort determine the order of all sectors.
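
This comparison only works because both charts use the same y-axis range, which the cell above hard-codes as 250. A sketch of calculating a shared upper limit from the data instead, using a hypothetical helper (`shared_ylim`) and made-up values:

```python
import math

def shared_ylim(series, step=50):
    """Upper y-limit that covers every series, rounded up to a multiple of `step`."""
    peak = max(max(values) for values in series)
    return math.ceil(peak / step) * step

# made-up values, for illustration only
sales_a = [210, 150, 90, 60, 40, 20]
sales_b = [120, 100, 70, 50, 35, 15]

top = shared_ylim([sales_a, sales_b])
print(top)
```

The result could then be passed to `set_ylim([0, top])` on each axis. Creating the figure with `plt.subplots(..., sharey=True)` is another way to keep the axes aligned.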

**Illusory images:** Are you drawing lines between the two charts in your mind, to show the trend?

### More Data

So how well do bar charts handle a larger series of data? This time we show eight charts, showing sales figures from 1994 - 2001.

In [None]:
# plot 8 bar charts in 2 rows of 4 to show whole dataset

kfig2, kax2 = plt.subplots(2, 4, figsize=(14,6))
index = np.arange(6)
bar_width = 0.8
x_offset = 0.1

# plot each bar chart; limit y labels to left-hand charts only and x labels to bottom row only
for ind,year in enumerate(df.index):
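    i, j = divmod(ind, 4)  # row and column in the 2x4 grid
    kax2[i][j].bar(index+x_offset, df.loc[year].values, bar_width, color=colors)
    kax2[i][j].set_xticks(index+x_offset)  # ticks at bar centres (bars are centre-aligned by default in matplotlib 2.0+)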
    kax2[i][j].set_title('%d Sales' % year)
    kax2[i][j].set_ylim([0,250])
    if j==0:
        kax2[i][j].set_ylabel('$ Millions')
    if i==1:
        xtickNames = kax2[i][j].set_xticklabels(labels)
        plt.setp(xtickNames, rotation=90, fontsize=10) # vertical labels to tidy-up
    else:
        xtickNames = kax2[i][j].set_xticklabels([])

# add overall title and render
kfig2.suptitle('Kramerica Industries - Sales by Sector - 1994 to 2001', fontsize=14, fontweight='bold');

While it's still possible to see the overall trend, it's getting harder to track individual trends and make comparisons. In particular: a) any illusory lines drawn across the charts break down with the introduction of a second row, b) more data means more charts and rows, making the whole thing unmanageable, c) the small sales figures are harder to track, and d) this is especially so when the net change is small too.

### The line chart

For all your trending needs.

In [None]:
# plot one line chart to show all sectors over whole dataset

kfig3, kax3 = plt.subplots(1, 1, figsize=(14,6))

# plot a line for each sector (column)
for i, col in enumerate(df):
    kax3.plot(df.index.values, df[col].values, 'o-', color=colors[i], label=col, linewidth=2)
    
kax3.set_xticks(df.index.values)  # one tick per year; the labels default to the year values
kax3.set_ylabel('$ Millions')
kax3.set_ylim([0,250])

# really need a legend this time
kax3.legend()

# add overall title and render
kfig3.suptitle('Kramerica Industries - Sales by Sector - 1994 to 2001', fontsize=14, fontweight='bold');

With the line chart, we can now easily see the overall trend of every sector of Kramerica Industries. Subtle changes are more readily observed. While we can no longer determine each sector's share *as a percentage* we can still at least see the leading sector and the order of the rest.

However, some of the values overlap (pink/yellow lines for 1998-1999), which could lead to confusion. This would become worse as more and more sectors (categories) are added. Lots of lines might be difficult to follow - we will also run out of distinguishable colours.

<div style="background-color: #eef5f5; border-left: 2px solid black; border-right: 2px solid black; padding: 10px">
<h3 style="font-variant: small-caps;">For a few lines more</h3>
<img src="images/lots_of_lines.png" alt="For a few lines more">
<a href="http://stats.stackexchange.com/questions/46334/line-graph-has-too-many-lines-is-there-a-better-solution"><small>[Source]</small></a></div>

## Further Reading

> http://www.businessinsider.com/pie-charts-are-the-worst-2013-6?IR=T

> http://junkcharts.typepad.com/junk_charts/