# Introduction: build a graph from an edge list


* Dataset: [Open Tree of Life](https://tree.opentreeoflife.org)
* Tools: [pandas](https://pandas.pydata.org), [numpy](http://www.numpy.org), [networkx](https://networkx.github.io)

## Importing packages

By convention, the first lines of code are always about importing the packages we'll use.

In [None]:
import pandas as pd
import numpy as np
import networkx as nx

Tutorials on pandas can be found at:
* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
* https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

Tutorials on numpy can be found at:
* https://numpy.org/doc/stable/user/quickstart.html
* <http://www.scipy-lectures.org/intro/numpy/index.html>
* <http://www.scipy-lectures.org/advanced/advanced_numpy/index.html>

A tutorial on networkx can be found at:
* https://networkx.org/documentation/stable/tutorial.html

## Import the data

We will work with an excerpt of the Tree of Life, which is provided together with this notebook. This dataset is reduced to the first 1000 taxa (starting from the root node). The full version is available here: [Open Tree of Life](https://tree.opentreeoflife.org/about/taxonomy-version/ott3.0).

![Public domain, https://en.wikipedia.org/wiki/File:Phylogenetic_tree.svg](https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Phylogenetic_tree.svg/800px-Phylogenetic_tree.svg.png)

In [None]:
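# The separator is a regular expression (tab, pipe, optional tab), written as a raw string
# so that the backslashes reach the regex engine unchanged.
tree_of_life = pd.read_csv('data/taxonomy_small.tsv', sep=r'\t\|\t?', encoding='utf-8', engine='python')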

If you do not remember the details of a function:

In [None]:
pd.read_csv?

For more info on the separator, see [regex](https://docs.python.org/3.6/library/re.html).
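
To see what the separator pattern does, here is a minimal sketch that applies the same regular expression to a single line with `re.split`. The sample line is made up for illustration: it only mimics a layout where fields are separated by a tab, a pipe, and a tab, and it is an assumption that each line of the file ends with a trailing separator.

```python
import re

# Hypothetical sample line (not taken from the dataset): four fields separated
# by "\t|\t", the second one empty, followed by a trailing separator.
sample_line = "1\t|\t\t|\troot\t|\tno rank\t|\t"

# Same pattern as the `sep` argument of `read_csv`, written as a raw string.
fields = re.split(r'\t\|\t?', sample_line)
print(fields)  # ['1', '', 'root', 'no rank', '']
```

The empty string in second position corresponds to a missing value. The trailing separator produces one extra empty field at the end of the line, which would explain an unnamed last column in the resulting table.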

Now, what is the object `tree_of_life`? It is a Pandas DataFrame.

In [None]:
tree_of_life

The description of the entries is given here:
https://github.com/OpenTreeOfLife/reference-taxonomy/wiki/Interim-taxonomy-file-format

## Explore the table

In [None]:
tree_of_life.columns

Let us drop some columns.

In [None]:
tree_of_life = tree_of_life.drop(columns=['sourceinfo', 'uniqname', 'flags','Unnamed: 7'])

In [None]:
tree_of_life.head()

Pandas inferred the type of the values in each column (`int`, `float`, `string` and `string`). The `parent_uid` column holds floating-point values because it contains a missing value, which pandas stores as `NaN`, and `NaN` is a floating-point value.

In [None]:
print(tree_of_life['uid'].dtype, tree_of_life.parent_uid.dtype)
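
The effect of a missing value on the column type can be reproduced on a toy example. The sketch below uses made-up series, not the dataset. It also shows the nullable `Int64` extension type, which is one possible way to keep integers when values are missing.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])                     # no missing value: integer dtype
with_gap = pd.Series([1, None, 3])                  # None is stored as NaN: float dtype
nullable = pd.Series([1, None, 3], dtype="Int64")   # nullable integer extension dtype

print(complete.dtype, with_gap.dtype, nullable.dtype)
```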

Individual values can be accessed by position with `iloc` or by label with `loc`.

In [None]:
tree_of_life.iloc[0, 2]

In [None]:
tree_of_life.loc[0, 'name']
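
The difference between the two accessors only becomes visible when the index labels differ from the row positions. Here is a minimal sketch on a made-up table whose index is 10, 20, 30:

```python
import pandas as pd

# Toy table with index labels that differ from the row positions.
toy = pd.DataFrame(
    {"name": ["a", "b", "c"], "rank": ["genus", "species", "species"]},
    index=[10, 20, 30],
)

by_position = toy.iloc[0, 1]     # first row, second column
by_label = toy.loc[10, "rank"]   # row labelled 10, column labelled "rank"
print(by_position, by_label)     # genus genus
```

With this index, `toy.loc[0, "rank"]` would raise a `KeyError`, since no row carries the label 0.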

**Exercise**: Guess the output of the following line:

In [None]:
# tree_of_life.uid[0] == tree_of_life.parent_uid[1]

Ordering the data.

In [None]:
tree_of_life.sort_values(by='name').head()

*Remark:* Some functions do not change the dataframe but return a modified copy (option `inplace=False` by default).

In [None]:
tree_of_life.head()
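
The same behaviour can be checked on a small made-up table: `sort_values` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["b", "c", "a"]})

sorted_copy = toy.sort_values(by="name")   # returns a new, sorted DataFrame
original_order = toy["name"].tolist()      # the original is unchanged
sorted_order = sorted_copy["name"].tolist()
print(original_order, sorted_order)        # ['b', 'c', 'a'] ['a', 'b', 'c']

# Reassign the name to keep the sorted version.
toy = toy.sort_values(by="name")
```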

## Operations on the columns

Unique values, useful for categories:

In [None]:
tree_of_life['rank'].unique()

Selecting only one category.

In [None]:
tree_of_life[tree_of_life['rank'] == 'species'].head()

How many species do we have?

In [None]:
len(tree_of_life[tree_of_life['rank'] == 'species'])

In [None]:
tree_of_life['rank'].value_counts()
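
Both counts rely on the same mechanism: the comparison produces a boolean Series, which can be used as a filter or summed directly. A minimal sketch on a made-up column:

```python
import pandas as pd

toy = pd.DataFrame({"rank": ["species", "genus", "species", "family"]})

mask = toy["rank"] == "species"       # boolean Series, True where the rank matches
n_species = int(mask.sum())           # True counts as 1, False as 0
counts = toy["rank"].value_counts()   # occurrences of every distinct value

print(n_species, counts["species"])   # 2 2
```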

**Exercise:** Display the entry with name 'Archaea', then display the entry of its parent.

In [None]:
# Your code here.

## Preparing the data

Before building the graph, we need to reorganize the data. First we separate the nodes and their properties from the edges.

In [None]:
nodes = tree_of_life[['uid', 'name','rank']]
edges = tree_of_life[['uid', 'parent_uid']]

Second step, some more data pre-processing for the edges and nodes data.

In [None]:
edges.head()

In [None]:
# Drop the first row as it is not encoding an edge (no parent for the first node)
edges = edges.drop(0)
edges.head()
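
Dropping row 0 works because the root happens to be the first row. A possible alternative, sketched below on a made-up edge table, is to drop every row whose parent is missing, which does not depend on the row order. Once the `NaN` is gone, the column can be converted back to integers.

```python
import pandas as pd

# Toy edge table: node 1 is the root and has no parent.
toy_edges = pd.DataFrame({"uid": [1, 2, 3], "parent_uid": [None, 1.0, 1.0]})

# Remove rows without a parent, then restore an integer dtype.
clean = toy_edges.dropna(subset=["parent_uid"]).astype({"parent_uid": int})
print(clean)
```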

For the node data, we shall index them with the node id.

In [None]:
nodes.head()

In [None]:
# Reassign instead of modifying in place, since `nodes` was selected from `tree_of_life`.
nodes = nodes.set_index('uid')
nodes.head()

## The graph
Now that the data has the appropriate shape, we can build the graph using `networkx`. It is a simple iteration over the rows of the dataframe, calling the graph's `add_edge` method. Alternatively, you may use `add_edges_from` with a list of edges as input.

In [None]:
# A simple command to create the graph from the edge list.
graph = nx.DiGraph() # DiGraph class is for directed graph
_ = [graph.add_edge(source, target) for source, target in zip(edges['parent_uid'], edges['uid'])]

We can also use the `add_edges_from` function instead of a list comprehension. Note that the columns must be ordered as (parent, child) because the graph is directed.

In [None]:
graph = nx.DiGraph()
graph.add_edges_from(edges[['parent_uid', 'uid']].itertuples(name=None, index=False))

And finally, the dataframe can be used directly to create the graph thanks to the `from_pandas_edgelist` function.

In [None]:
graph = nx.from_pandas_edgelist(edges, source='parent_uid', target='uid', create_using=nx.DiGraph())
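
To check that the three approaches are equivalent, here is a minimal sketch on a made-up edge table with three edges. It builds the graph in each of the three ways and compares the resulting edge sets.

```python
import networkx as nx
import pandas as pd

# Toy edge table: each row links a node to its parent.
toy_edges = pd.DataFrame({"uid": [2, 3, 4], "parent_uid": [1, 1, 2]})

# 1. One call to add_edge per row.
g_loop = nx.DiGraph()
for parent, child in zip(toy_edges["parent_uid"], toy_edges["uid"]):
    g_loop.add_edge(parent, child)

# 2. All edges at once, as (parent, child) tuples.
g_bulk = nx.DiGraph()
g_bulk.add_edges_from(
    toy_edges[["parent_uid", "uid"]].itertuples(index=False, name=None)
)

# 3. Directly from the dataframe.
g_pandas = nx.from_pandas_edgelist(
    toy_edges, source="parent_uid", target="uid", create_using=nx.DiGraph()
)

print(sorted(g_loop.edges()))   # [(1, 2), (1, 3), (2, 4)]
print(set(g_loop.edges()) == set(g_bulk.edges()) == set(g_pandas.edges()))
```

Since the graph is directed, `g_loop.has_edge(1, 2)` is true while `g_loop.has_edge(2, 1)` is false.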

In addition, let us add some attributes to the nodes:

In [None]:
node_props = nodes.to_dict()

In [None]:
for key in node_props:
    nx.set_node_attributes(graph, node_props[key], key)

Let us check if it is correctly recorded:

In [None]:
print(graph.nodes[805080], graph.nodes[102415])
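
The structure returned by `to_dict` explains why the loop works: it is a dictionary with one entry per column, each mapping the index (here the node id) to a value. The sketch below uses a made-up two-node table. It also shows `to_dict(orient="index")`, a possible alternative that yields one dictionary per node and can be passed to `set_node_attributes` in a single call.

```python
import networkx as nx
import pandas as pd

# Toy node table indexed by node id.
toy_nodes = pd.DataFrame(
    {"name": ["root", "child"], "rank": ["no rank", "genus"]},
    index=pd.Index([1, 2], name="uid"),
)

# Column-oriented: {"name": {1: "root", 2: "child"}, "rank": {1: ..., 2: ...}}
props = toy_nodes.to_dict()
g_columns = nx.DiGraph([(1, 2)])
for key, values in props.items():
    nx.set_node_attributes(g_columns, values, key)

# Row-oriented: {1: {"name": "root", "rank": "no rank"}, 2: {...}}
g_rows = nx.DiGraph([(1, 2)])
nx.set_node_attributes(g_rows, toy_nodes.to_dict(orient="index"))

print(g_columns.nodes[2])   # {'name': 'child', 'rank': 'genus'}
```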

## Graph visualization

To conclude, let us visualize the graph. We will use the drawing functions of the Python module networkx.

The following line is a [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html). It enables plotting inside the notebook.

In [None]:
%matplotlib inline

You may also try `%matplotlib notebook` for a zoomable version of plots.

Let us draw the graph with two different [layout algorithms](https://en.wikipedia.org/wiki/Graph_drawing#Layout_methods). As you will see, networkx and matplotlib are not very convenient for plotting graphs. We will see other visualization tools later on.

In [None]:
nx.draw_spectral(graph)

In [None]:
nx.draw_spring(graph)
# You may also visualize names with the following command,
# but in our case the graph is too big and labels overlap:
#
# nx.draw_spring(graph, labels=node_props['name'])
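
A layout is simply a dictionary that maps each node to a pair of coordinates, and the drawing functions compute one internally. As a minimal sketch on a made-up four-node graph, the positions can be computed separately with `nx.spring_layout` and inspected before drawing. The `seed` value is arbitrary and only makes the result reproducible.

```python
import networkx as nx

toy_graph = nx.DiGraph([(1, 2), (1, 3), (2, 4)])

# Compute the positions once: a dict mapping node -> array([x, y]).
pos = nx.spring_layout(toy_graph, seed=42)

for node, (x, y) in pos.items():
    print(node, round(float(x), 2), round(float(y), 2))

# The same positions can then be passed to a drawing function, for instance:
# nx.draw(toy_graph, pos=pos, node_size=20)
```

Computing the positions separately makes it possible to reuse the same layout across several plots.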

## Saving the graph
Save the graph to disk in the `gexf` format, which is readable by Gephi and other tools that manipulate graphs. You may now explore the graph using [Gephi](https://gephi.org/) and compare the visualizations.

In [None]:
nx.write_gexf(graph, 'data/tree_of_life.gexf')

Note: the `gexf` format allows one to save node and edge properties, except if the properties have a complex structure such as Python lists or dictionaries. In that case, these structures must be converted to strings (for example with the `json` module) before saving the graph.
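
The conversion mentioned in the note could look like the following sketch. The graph, the `synonyms` attribute, and the file name are made up for illustration, and the file is written to a temporary directory. List and dictionary attributes are serialized with `json.dumps` before saving and decoded with `json.loads` after loading.

```python
import json
import os
import tempfile

import networkx as nx

# Toy graph with a list-valued node attribute (hypothetical attribute name).
toy_graph = nx.DiGraph()
toy_graph.add_node("a", synonyms=["x", "y"])
toy_graph.add_edge("a", "b")

# Convert list or dict attributes to JSON strings before saving.
for _, attrs in toy_graph.nodes(data=True):
    for key, value in list(attrs.items()):
        if isinstance(value, (list, dict)):
            attrs[key] = json.dumps(value)

path = os.path.join(tempfile.mkdtemp(), "toy.gexf")
nx.write_gexf(toy_graph, path)

# After loading, the attribute is a string and must be decoded again.
restored = nx.read_gexf(path)
synonyms = json.loads(restored.nodes["a"]["synonyms"])
print(synonyms)
```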