# Generating your own atlas-like interactive visualizations

This guide will show you how to generate your own interactive visualizations. The idea is that you should be able to copy this notebook, remove the parts you don't need and modify it slightly to get the graphs you need.

First you must follow the "setup" guide here to install what's necessary:
https://github.com/cid-harvard/visualization-notebook-templates#setup

### How to use this notebook

This notebook is really a program - as you read through it, you can run a code snippet by clicking on a cell and hitting the little play button at the top. An easier way is to keep pressing `Shift+Enter` as you read, which runs the cells one by one - you should see the cell run and then the selection box advance to the next cell!

A quick note: there is a small bug, specific to viewing charts in a notebook, where drawing a new graph makes the tooltips of the older ones stop working. You can get them working again by simply re-running the cell that contains the visualization!

### Getting started

To start, we load the necessary libraries that do most of the heavy lifting:

In [1]:
import sys
sys.path.append("./modules")
import d3plus2 as d3plus
import pandas as pd

### A quick teaser

Before we start, here is a tiny self-contained example to show you how quick and easy it is. Then we'll dive in depth with a more complete example and many different variations. Here is all the code you need to draw a treemap:

In [2]:
viz = d3plus.Treemap(
    id=["groups", "name"],
    value="share",
    name="description",
    color="groups",
    tooltip=["name", "groups", "description"]
)
viz.draw(pd.read_table("./sourcedata/list_for_mali.csv", sep=";"))

<IPython.core.display.Javascript object>

## A full example and explanation

### Loading data

OK, now let's start at the beginning. First we must read in some data. Pandas is a data manipulation library that has many features for munging data in the way you want.

Here, we mainly use it to read the data in quickly and easily - pandas supports many different formats, including CSVs, more generic delimited files and STATA .dta files (via `read_csv`, `read_table` and `read_stata`). For Excel files, you need to install an additional package with pip (called `xlrd`), but that works just as well. Here is the [documentation](http://pandas.pydata.org/pandas-docs/stable/io.html) for all the read functions.

Pandas also includes many different options while reading for skipping rows, including the header, filling null values, converting data types, etc.
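As a small illustration of those reading options, the sketch below parses a made-up semicolon-delimited snippet from an in-memory string instead of the real `list_for_mali.csv`. The contents and values are invented for the example; `skiprows`, `na_values` and `dtype` are standard `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical file contents: one junk line, then a header and three rows
raw = """exported by some tool
name;share;groups
111;0.14;1
112;n/a;1
520;0.01;5
"""

sample = pd.read_csv(
    io.StringIO(raw),
    sep=";",              # semicolon-delimited, like the file used in this guide
    skiprows=1,           # skip the junk line before the header
    na_values=["n/a"],    # treat "n/a" as a missing value
    dtype={"name": str},  # keep the codes as text instead of integers
)

# Fill the missing share with 0 so the column stays usable as a number
sample["share"] = sample["share"].fillna(0.0)
print(sample)
```

The same keyword arguments work with `read_table`, which only differs in its default delimiter.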

Let's go ahead and try:

In [3]:
df = pd.read_table("./sourcedata/list_for_mali.csv", sep=";")

Pandas takes our file and reads it in, converting it into a table-like structure called a dataframe (hence the variable name `df`). Now let's take a peek at the beginning of the dataframe:

In [4]:
df.head()

Unnamed: 0,name,share,groups,presence,description,color
0,111,0.143092,1,0,"Growing of cereals (except rice), leguminous c...",#F0411A
1,112,0.143092,1,0,Growing of rice,#1FA454
2,113,0.190602,1,0,"Growing of vegetables and melons, roots and tu...",#F0411A
3,114,0.143092,1,0,Growing of sugar cane,#1FA454
4,115,0.143092,1,0,Growing of tobacco,#1FA454


We can also view information on the data we got.

In [5]:
df.describe()

Unnamed: 0,name,share,groups,presence
count,196.0,196.0,196.0,196.0
mean,2389.408163,0.510204,8.862245,0.193878
std,1922.542317,0.945409,3.468008,0.396346
min,111.0,8.4e-05,1.0,0.0
25%,1073.75,0.071942,7.0,0.0
50%,2391.5,0.185198,10.0,0.0
75%,2912.5,0.445706,11.0,0.0
max,9602.0,7.436335,13.0,1.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 6 columns):
name           196 non-null int64
share          196 non-null float64
groups         196 non-null int64
presence       196 non-null int64
description    196 non-null object
color          196 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.3+ KB


### Manipulating data

The "name" column is supposed to have 4-digit HS codes in it, but if we look we notice that it was read in as a number instead (``info()`` says int64, for "integer"), so a code such as `0111` appears as `111`. This could cause issues in some visualizations like the network, where this column has to match the codes.

Ideally you would fix most issues in whatever other tool you're comfortable with and then import it here, but for the sake of providing example code, let's fix a few problems that come up often:

In [7]:
# Convert the column to string, left-pad it with zeros (zfill) up to 4 digits
df.name = df.name.astype(str).str.zfill(4)

In [8]:
# Change the column names to something we're more comfortable with
df.columns = ["code", "percent_world_trade", "group", "M", "description", "color"]

In [9]:
# Filter to only the rows where M == 1
df = df[df.M == 1]

In [10]:
# see what we did
df.head()

Unnamed: 0,code,percent_world_trade,group,M,description,color
5,116,0.143092,1,1,Growing of fibre crops,#1FA454
26,520,0.00758,5,1,Mining of lignite,#F0411A
30,721,0.001966,5,1,Mining of uranium and thorium ores,#1FA454
37,910,3.56956,6,1,Support activities for petroleum and natural g...,#1FA454
38,990,1.165681,6,1,Support activities for other mining and quarrying,#1FA454


There, done! Pandas is capable of much more complex data munging than this of course. If you want to learn more, consult this [quick crash course](http://pandas.pydata.org/pandas-docs/stable/10min.html).
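One caveat with the cells above: assigning a list to `df.columns` depends on the column order, so it would quietly mislabel the data if the file layout ever changed. A slightly more defensive sketch of the same three steps renames columns by name instead. It runs on a tiny made-up dataframe (the values are invented) so it can be tried on its own:

```python
import pandas as pd

# Made-up rows using the same column names as the source file
toy = pd.DataFrame({
    "name": [111, 520, 9602],
    "share": [0.14, 0.01, 1.2],
    "groups": [1, 5, 13],
    "presence": [0, 1, 1],
})

cleaned = (
    toy
    # Convert the codes to text and left-pad them with zeros up to 4 digits
    .assign(name=toy["name"].astype(str).str.zfill(4))
    # Rename by name, so the order of the columns does not matter
    .rename(columns={
        "name": "code",
        "share": "percent_world_trade",
        "groups": "group",
        "presence": "M",
    })
)

# Keep only the rows where M equals 1
cleaned = cleaned[cleaned["M"] == 1]
print(cleaned)
```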

### Drawing things

Now, let's do a simple treemap. First, we create a treemap object that takes some parameters:

In [11]:
tm1 = d3plus.Treemap(
    id="code",
    value="percent_world_trade"
)

We state that we want to use the `code` column as the unique identifier, meaning we'll draw a square of the treemap for each unique code. Then we set the `value` to the `percent_world_trade` column as the thing to size the squares by. This object in itself does nothing of significance, but just provides a definition for a specific kind of treemap. 
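Because each unique `id` value becomes one square, it can be worth checking the data before drawing. The helper below is a hypothetical convenience and not part of the d3plus wrapper; it assumes that duplicate ids, missing sizes and negative sizes are the things you want to catch early.

```python
import pandas as pd


def check_treemap_input(data, id_col, value_col):
    """Return a list of problems found in the data (hypothetical helper)."""
    problems = []
    if data[id_col].duplicated().any():
        problems.append("duplicate ids in column " + id_col)
    if data[value_col].isnull().any():
        problems.append("missing values in column " + value_col)
    if (data[value_col] < 0).any():
        problems.append("negative values in column " + value_col)
    return problems


# Made-up data with one repeated code and one negative value
toy = pd.DataFrame({
    "code": ["0116", "0520", "0520"],
    "percent_world_trade": [0.14, 0.01, -1.0],
})

problems = check_treemap_input(toy, "code", "percent_world_trade")
print(problems)
```

An empty list means none of these three checks found anything.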

Now that we have this visualization definition object, we can tell it to draw according to the data in our dataframe, which will use the columns that we just specified:

In [12]:
tm1.draw(df)

<IPython.core.display.Javascript object>

Not quite there yet, but we still got pretty far with only two columns. Since we defined no colors, it automatically assigned some random ones. We also have no text in there, and no notion of categories, let alone grouping by category.

Let's try to add the names in first to see if it even makes sense:

In [13]:
tm2 = d3plus.Treemap(
    id="code",
    value="percent_world_trade",
    name="description"
)
tm2.draw(df)

<IPython.core.display.Javascript object>

Much better.

### Grouped treemaps

Now we can add in groupings. The way we do this is that we specify that the `id` is not just a simple column, but a combination of columns - `group` being the higher level one and `code` being the lower level one. We pass those in together to the `id` parameter by turning them into a list:

In [14]:
tm3 = d3plus.Treemap(
    id=["group", "code"],
    value="percent_world_trade",
    name="description"
)
tm3.draw(df)

<IPython.core.display.Javascript object>

Now the items are grouped according to categories.
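To get a feel for what the grouping does, you can compute the group totals yourself. In a treemap the area of a group is normally the sum of the values of the items inside it; the sketch below assumes that is how the sizes are combined here too, and it uses made-up numbers.

```python
import pandas as pd

# Made-up items: two in group 1, one each in groups 5 and 6
toy = pd.DataFrame({
    "group": [1, 1, 5, 6],
    "code": ["0111", "0116", "0520", "0910"],
    "percent_world_trade": [0.5, 0.25, 0.25, 1.0],
})

# Total value per group, largest first
group_totals = (
    toy.groupby("group")["percent_world_trade"]
    .sum()
    .sort_values(ascending=False)
)
print(group_totals)
```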

### Colors

We notice that the automatic colors are now gone - the visualization library doesn't know what to do by default. We can give it a cue as to what to color by, by passing in a `color=` parameter.

One way is to specify that we want to color by group. In this case, the visualization library will assign a categorical color to each `group` value we have automatically, like so:

In [15]:
tm4 = d3plus.Treemap(
    id=["group", "code"],
    value="percent_world_trade",
    name="description",
    color="group"
)
tm4.draw(df)

<IPython.core.display.Javascript object>

Not bad! Another idea is to pass in our own custom colors, from our `color` column earlier:

In [16]:
tm4 = d3plus.Treemap(
    id=["group", "code"],
    value="percent_world_trade",
    name="description",
    color="color"
)
tm4.draw(df)

<IPython.core.display.Javascript object>

This `color` column is a column that contains what we call "hex color codes". This is a common way of describing colors that's used across many kinds of software. Here's an example of what it looks like:

In [17]:
df.color.head()

5     #1FA454
26    #F0411A
30    #1FA454
37    #1FA454
38    #1FA454
Name: color, dtype: object

You don't have to understand how this works, but you do need to be able to get your colors into this format. There are many tools and websites that do this, for example:

http://www.color-hex.com/ or 
http://htmlcolorcodes.com/
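If your colors come as red, green and blue numbers (0 to 255 each), you can also do the conversion in Python. The function name below is made up for this sketch:

```python
def rgb_to_hex(red, green, blue):
    """Convert 0-255 red, green and blue values to a '#RRGGBB' string."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be between 0 and 255")
    # Two uppercase hexadecimal digits per channel
    return "#{:02X}{:02X}{:02X}".format(red, green, blue)


print(rgb_to_hex(31, 164, 84))   # the green used in the color column above
print(rgb_to_hex(240, 65, 26))   # the red used in the color column above
```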

Picking the right kind of colors for the kind of data you're trying to show also matters. For that, take a look at:

http://earthobservatory.nasa.gov/blogs/elegantfigures/2013/08/12/subtleties-of-color-part-3-of-6/
and
http://colorbrewer2.org/#
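Once you have picked a palette, one way to apply it is to map each group to a hex code and store the result in a column that you then pass as `color=`. The palette below is an arbitrary placeholder, and the grey fallback for groups missing from the palette is an assumption made for this sketch.

```python
import pandas as pd

# Made-up items in four different groups
toy = pd.DataFrame({
    "group": [1, 5, 6, 13],
    "code": ["0116", "0520", "0910", "9602"],
})

# Placeholder palette: one hex color per group
palette = {1: "#1FA454", 5: "#F0411A", 6: "#377EB8"}

# Look up each group's color; groups not in the palette get a neutral grey
toy["color"] = toy["group"].map(palette).fillna("#CCCCCC")
print(toy)
```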

### Tooltips

Finally, we can change what we include in the tooltips by adding a `tooltip=` parameter and passing it a list with all the column names we want to include:

In [18]:
tm5 = d3plus.Treemap(
    id=["group", "code"],
    value="percent_world_trade",
    name="description",
    color="color",
    tooltip=["code", "group", "description"]
)
tm5.draw(df)

<IPython.core.display.Javascript object>

The visualization library also adds in fields that it thinks are relevant, in this case the `percent_world_trade` variable that we're sizing the nodes by, as well as the share that that variable constitutes.
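The share in the tooltip is presumably each item's value divided by the total of all values. That is an assumption about the library's behaviour, not something this guide confirms. If you want a similar number as a column of your own, here is a sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "code": ["0116", "0520", "0910"],
    "percent_world_trade": [1.0, 2.0, 1.0],
})

# Each row's value as a percentage of the column total
total = toy["percent_world_trade"].sum()
toy["share_of_total"] = toy["percent_world_trade"] / total * 100
print(toy)
```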

### Exporting an embeddable visualization 

One quick way to export the visualization is to take a screenshot. For more complicated situations, like when you need a large screenshot or when you want to embed the visualization in a website, you can export a .html file like so:



In [19]:
# Generate HTML
visualization_html = tm5.dump_html(df)
# Dump it out in my_treemap.html; the with block closes the file when done
with open("./my_treemap.html", "w") as f:
    f.write(visualization_html)

Now you should be able to open this file in a browser and it should contain the visualization above.
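If you export often, it may help to wrap the writing step in a small function. The sketch below uses a placeholder string in place of the output of `dump_html` and writes to a temporary folder so it can run on its own; the function name and the choice of UTF-8 are assumptions for the example.

```python
import os
import tempfile


def save_html(html, path):
    """Write an HTML string to path as UTF-8 and return the number of characters written."""
    with open(path, "w", encoding="utf-8") as handle:
        return handle.write(html)


# Stand-in for the string returned by dump_html()
placeholder_html = "<html><body><p>chart goes here</p></body></html>"

out_path = os.path.join(tempfile.mkdtemp(), "my_treemap.html")
written = save_html(placeholder_html, out_path)
print(written, "characters written to", out_path)
```

To look at the result straight away, the standard library call `webbrowser.open(out_path)` asks your default browser to open the file.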

### Under the hood

Under the hood, this library is just a thin wrapper around d3plus (http://d3plus.org/), an interactive charting library. You can browse through the documentation and the examples to see what's possible.

The wrapper doesn't support every feature that d3plus has, just the most common ones. But if you want more, one strategy could be to generate the output html file (as shown above) to give you a basic template, and then hand-edit the code in there.

If you need help with this, talk to Mali!

### Challenge!

The file "./sourcedata/Complexity and opportunity What industries have the most potential for this state.csv" contains industry data for a state in Mexico.

Draw a treemap of employment for each industry, grouped and colored by category!

OR

Try loading your own data!

### The end!

This is very much a work in progress and in need of suggestions - I want to make the most common tasks as easy as possible, so let me know!