<img src="figs/eh_logo.png" style="width: 200px;">

# EmptyHeaded Graph Tutorial

This is a brief example of how to run EmptyHeaded on graph data.
We first give a system overview, then walk through running a sample query in EmptyHeaded.

This example assumes that you have resolved all dependencies listed on our [GitHub](https://github.com/craberger/EmptyHeaded) page and were able to run `setup.sh` successfully.

## System Overview

The EmptyHeaded engine works in three phases:
(1) the query compiler translates a high-level datalog-like query
language into a logical query plan represented as a generalized
hypertree decomposition (GHD), which replaces the traditional role of
relational algebra; (2) the query compiler generates code for the
execution engine by translating the GHD into a series of set
intersections and loops; and (3) the execution engine makes automatic
algorithmic and representation decisions based on skew in the data.

<br/>
<div style="text-align:left">
<img src="figs/systemOverview.png" style="width: 600px">
</div>
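
To make phase (3) more concrete, here is a minimal sketch of how an engine could choose a set-intersection algorithm based on skew between the sizes of its two inputs. It is an illustration only, not EmptyHeaded's actual implementation: the function names and the `skew_threshold` value are hypothetical.

```python
from bisect import bisect_left


def intersect_merge(xs, ys):
    """Intersect two sorted lists with a linear merge (good for similar sizes)."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_search(small, large):
    """Intersect by binary-searching each element of the small sorted list."""
    out = []
    lo = 0
    for x in small:
        # Resume the search where the previous one stopped; both lists are sorted.
        lo = bisect_left(large, x, lo)
        if lo == len(large):
            break
        if large[lo] == x:
            out.append(x)
    return out


def intersect(xs, ys, skew_threshold=32):
    """Pick an algorithm from the size ratio (hypothetical threshold)."""
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    if len(small) * skew_threshold < len(large):
        return intersect_search(small, large)
    return intersect_merge(xs, ys)


skewed = intersect([3, 500], list(range(0, 1000, 2)))   # uses binary search
balanced = intersect([1, 2, 3, 4], [2, 4, 6])           # uses linear merge
```

When one input is far smaller than the other, probing the large set is cheaper than scanning it, which is the kind of decision that depends on skew in the data.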

## Importing EmptyHeaded

We begin with the command to rule them all. Let's first import the EmptyHeaded runtime module.

In [1]:
import emptyheaded

## Defining a Schema

EmptyHeaded expects a JSON file from the user that defines the schema of the relations to be queried. An example schema can be found in `data/facebook/config_pruned.json`. This file defines the configuration settings for EmptyHeaded (such as the number of threads to run with), the schemas for all the relations, and the files to load the relations from. EmptyHeaded currently supports loading from `tsv` and `csv` files. This configuration points to `duplicate_pruned.tsv`, which contains an edge list that represents a graph. In fact, this edge list is a (small) real dataset from our [Snap](https://snap.stanford.edu/data/egonets-Facebook.html) friends (feel free to call us your EmptyHeaded friends). Also note the `database` path in this file: this is where our database gets placed on disk in the next step!
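
If you want to peek at an edge list before loading it, a small sketch like the following can help. It assumes each line holds two integer node IDs separated by a tab; the sample data and the helper name `read_edge_list` are made up for illustration and are not part of EmptyHeaded.

```python
import csv
import io

# Hypothetical three-edge sample in tab-separated form.
SAMPLE_TSV = "7\t0\n8\t0\n8\t7\n"


def read_edge_list(handle, delimiter="\t"):
    """Parse (source, destination) integer pairs from a delimited text stream."""
    edges = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        edges.append((int(row[0]), int(row[1])))
    return edges


edges = read_edge_list(io.StringIO(SAMPLE_TSV))
print(edges)
```

Passing `delimiter=","` would handle a `csv` edge list in the same way.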

## Creating a Database

Now that we have defined a schema and imported EmptyHeaded, we can create the database. To do so, we execute the following command, which points to the JSON file that contains our schema. Note: this step takes a little while (~10-20 seconds) because it compiles the whole EmptyHeaded library, reads the files from disk, dictionary encodes the relations, builds tries, and spills binary files to disk. The figure below visualizes what this step does.

<br/>
<img src="figs/table_transform.png" style="width: 600px">

In [2]:
emptyheaded.createDB("$EMPTYHEADED_HOME/examples/graph/data/facebook/config_pruned.json")

Created database with the following relations: 
	Edge(node:long,node:long)
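
To give a feel for two of the steps mentioned above, dictionary encoding and trie building, here is a toy sketch in plain Python. It is not EmptyHeaded's code: assigning IDs in sorted order and storing the trie as a dict of sorted lists are assumptions made for illustration.

```python
def dictionary_encode(edges):
    """Map arbitrary node values to dense integer IDs (assumed: sorted order)."""
    values = sorted({v for edge in edges for v in edge})
    encode = {v: i for i, v in enumerate(values)}
    encoded = [(encode[src], encode[dst]) for src, dst in edges]
    return encoded, values  # `values` doubles as the decode table


def build_trie(encoded_edges):
    """Build a two-level trie: first attribute -> sorted list of second attribute."""
    trie = {}
    for src, dst in encoded_edges:
        trie.setdefault(src, set()).add(dst)
    return {src: sorted(dsts) for src, dsts in sorted(trie.items())}


raw_edges = [(1000, 42), (70000, 42), (70000, 1000)]
encoded_edges, decode_table = dictionary_encode(raw_edges)
trie = build_trie(encoded_edges)
```

Here `decode_table[i]` recovers the original value for ID `i`, and each trie level is a sorted set, which is what makes the later set intersections cheap.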


## Loading a Previously Created Database

The creation process should only need to be run once per dataset. If you build a database and want to come back to it in another Python session, you can load it with the following command. Note: the path in the `loadDB` command is the database path that you specified in your input schema file.

In [2]:
emptyheaded.loadDB("$EMPTYHEADED_HOME/examples/graph/data/facebook/db_pruned")

## Running a Query

Whew! We are finally able to run a query. Using our datalog-like syntax, we can express the triangle query over this dataset as follows. Here we express joins over our `Edge` relation with a rule that defines a new relation called `Triangle`.

In [3]:
emptyheaded.query("Triangle(a,b,c) :- Edge(a,b),Edge(b,c),Edge(a,c).")
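
To see what this rule computes, the sketch below evaluates the same join on a tiny hand-made graph using loops and set intersections in plain Python. It mirrors the semantics of the rule only; it is not the code EmptyHeaded generates, and the sample edges (a 4-clique with each edge stored once, larger ID first) are an assumption for illustration.

```python
from collections import defaultdict


def triangles(edges):
    """Evaluate Triangle(a,b,c) :- Edge(a,b),Edge(b,c),Edge(a,c)."""
    nbrs = defaultdict(set)  # source -> set of destinations
    for src, dst in edges:
        nbrs[src].add(dst)

    result = []
    for a in sorted(nbrs):
        for b in sorted(nbrs[a]):                    # Edge(a,b)
            # c must satisfy both Edge(b,c) and Edge(a,c): a set intersection.
            for c in sorted(nbrs.get(b, set()) & nbrs[a]):
                result.append((a, b, c))
    return result


# Each undirected edge is stored once, as (larger, smaller).
sample_edges = [(7, 0), (8, 0), (9, 0), (8, 7), (9, 7), (9, 8)]
sample_triangles = triangles(sample_edges)
print(len(sample_triangles))
```

Because every edge appears in only one direction in this sample, each triangle is reported exactly once.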

## Debugging Output

A quick sanity check that EmptyHeaded did the right thing is to check the cardinality of the output of the query (we discuss how you can check each value below). You can print the number of rows in your output result as follows. All the user needs to specify is the name of the table.

In [4]:
print(emptyheaded.numRows("Triangle"))

1612010


## Performance Times

If you want a breakdown of the performance times, the terminal output separates the time spent loading the data into memory from the time spent running the query. You can view this in the shell from which you launched the IPython notebook.

## View the Result 

That's great, but what if you want to actually do something with the result (like view it in Matplotlib or run a computation using SciPy)? Fear not, EmptyHeaded can help you. We use [Pandas](http://pandas.pydata.org/) data frames to return relations. Below, we fetch the data corresponding to the relation `Triangle` and return it in a Pandas [data frame](http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.html).

In [4]:
TriangleTable = emptyheaded.fetchData("Triangle")

To see the first ten rows:

In [5]:
TriangleTable[0:10]

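| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 6 | 5 | 2 |
| 1 | 8 | 7 | 0 |
| 2 | 9 | 7 | 0 |
| 3 | 9 | 8 | 0 |
| 4 | 9 | 8 | 7 |
| 5 | 10 | 5 | 2 |
| 6 | 10 | 6 | 2 |
| 7 | 10 | 6 | 5 |
| 8 | 11 | 7 | 0 |
| 9 | 11 | 8 | 0 |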
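
As one example of a follow-up computation, the sketch below counts how many triangles each node participates in, using only the ten rows displayed for `Triangle`. It works on plain tuples so that it runs without EmptyHeaded; with a real data frame you could presumably obtain such rows via `TriangleTable.values.tolist()`, which is an assumption about how you would wire it up, not something this tutorial prescribes.

```python
from collections import Counter

# The first ten Triangle rows displayed in this tutorial, as (a, b, c) tuples.
first_ten_triangles = [
    (6, 5, 2), (8, 7, 0), (9, 7, 0), (9, 8, 0), (9, 8, 7),
    (10, 5, 2), (10, 6, 2), (10, 6, 5), (11, 7, 0), (11, 8, 0),
]


def triangles_per_node(rows):
    """Count, for every node, the number of triangle rows it appears in."""
    counts = Counter()
    for row in rows:
        counts.update(row)
    return counts


counts = triangles_per_node(first_ten_triangles)
print(counts.most_common(3))
```

The counts here cover only the ten displayed rows, not the full result.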


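The same datalog-like syntax extends to larger patterns. For example, the following rule finds 4-cliques:

In [6]: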
emptyheaded.query("Flique(a,b,c,d) :- Edge(a,b),Edge(b,c),Edge(a,c),Edge(a,d),Edge(b,d),Edge(c,d).")

In [8]:
emptyheaded.fetchData("Flique")

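| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 9 | 8 | 7 | 0 |
| 1 | 10 | 6 | 5 | 2 |
| 2 | 11 | 8 | 7 | 0 |
| 3 | 11 | 9 | 7 | 0 |
| 4 | 11 | 9 | 8 | 0 |
| 5 | 11 | 9 | 8 | 7 |
| 6 | 14 | 8 | 7 | 0 |
| 7 | 14 | 9 | 7 | 0 |
| 8 | 14 | 9 | 8 | 0 |
| 9 | 14 | 9 | 8 | 7 |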
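
One way to sanity-check a 4-clique result is to confirm that every three-node subset of a clique is itself a triangle. The sketch below does this for the first two `Flique` rows against the ten `Triangle` rows displayed earlier. It uses plain tuples copied from those outputs; the helper name `missing_triangles` is hypothetical.

```python
from itertools import combinations

# Rows copied from the displayed Triangle output (first ten rows).
triangle_rows = {
    (6, 5, 2), (8, 7, 0), (9, 7, 0), (9, 8, 0), (9, 8, 7),
    (10, 5, 2), (10, 6, 2), (10, 6, 5), (11, 7, 0), (11, 8, 0),
}

# The first two rows of the displayed Flique output.
clique_rows = [(9, 8, 7, 0), (10, 6, 5, 2)]


def missing_triangles(clique, triangles):
    """Return the 3-node subsets of a clique that are absent from `triangles`."""
    # combinations() preserves the row's order, matching the Triangle rows.
    return [t for t in combinations(clique, 3) if t not in triangles]


for clique in clique_rows:
    print(clique, missing_triangles(clique, triangle_rows))
```

An empty list means all four triangles of the clique were found. Later `Flique` rows involve triangles beyond the first ten displayed, so checking them would need the full `Triangle` relation.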
