## Script for converting the Tree of Life Project's XML tree to a NetworkX graph and to JSON and GraphML files

This script converts the Tree of Life from XML to a NetworkX graph, then saves the graph in several file formats (JSON, GraphML).
The original data can be found at http://tolweb.org/tree/home.pages/downloadtree.html
The XML file available on that website is licensed under the Creative Commons Attribution 3.0 license (https://creativecommons.org/licenses/by/3.0/); the copyright is owned by the Tree of Life Project.


Copyright 2017 Benjamin Ricaud

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

In [1]:
# Tools for parsing xml
import xml.etree.ElementTree as ET

In [3]:
# Load the XML file
# If there are errors during loading, make sure the file is encoded in UTF-8
# You may have to open it in a text editor and save it again with UTF-8 encoding
xml_file_to_load = 'tolskeletaldumpUTF8.xml'
tree = ET.parse(xml_file_to_load)
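The re-encoding step mentioned in the comments can also be scripted instead of done by hand in a text editor. The following is a minimal sketch, not part of the original script: the helper name `reencode_to_utf8` and the source encoding (`latin-1`) are assumptions, so adjust the encoding to whatever the downloaded file actually uses. If the file starts with an XML declaration that names another encoding, that declaration would also need to be updated.

```python
import xml.etree.ElementTree as ET


def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Read src_path using src_encoding and write the same text as UTF-8.

    Hypothetical helper: src_encoding is a guess and must match the real file.
    """
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Example usage (file names are placeholders):
# reencode_to_utf8("tolskeletaldump.xml", "tolskeletaldumpUTF8.xml")
# tree = ET.parse("tolskeletaldumpUTF8.xml")
```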

In [7]:
# The data will be loaded in a networkx graph
# The networkx module can be installed using 'pip install networkx'
import networkx as nx

In [87]:
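# Build the tree: one graph node per NODE element, one edge per parent-child link
# i counts the nodes processed, starting at 1 for the root
i=1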
G = nx.DiGraph()
root = tree.getroot()
for livingElement in root.iter('NODE'):
    name = livingElement.find('NAME').text
    data_dic = livingElement.attrib
    node_id = data_dic['ID']
    if name is None:
        name = 'None'
    data_dic['name'] = name
    if not G.has_node(node_id):
        # Pass the attributes as keyword arguments (a positional dict fails in NetworkX 2.x)
        G.add_node(node_id,**data_dic)
    if data_dic['CHILDCOUNT']!='0':
        for child in livingElement[1]:
            child_name = child.find('NAME').text
            child_data_dic = child.attrib
            child_id = child_data_dic['ID']
            if child_name is None:
                child_name = 'None'
            child_data_dic['name'] = child_name
            #print(child_name,child_data_dic)
            if not G.has_node(child_id):
                # Pass the attributes as keyword arguments (a positional dict fails in NetworkX 2.x)
                G.add_node(child_id,**child_data_dic)
            if G.has_edge(node_id,child_id):
                print('found existing edge',name,child_name)
                print('data: ',data_dic,child_data_dic)
            G.add_edge(node_id,child_id,weight=1)
            i+=1
print('Number of nodes processed:',i)
print('Number of nodes in the graph:',G.number_of_nodes())
print('Number of edges in the graph:',G.number_of_edges())
print('The graph is a tree?',nx.is_tree(G))

Number of nodes processed: 35960
Number of nodes in the graph: 35960
Number of edges in the graph: 35959
The graph is a tree? True
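Before running the construction on the full dump, the same idea can be tried on a small hand-written document. The sketch below is a condensed variant of the loop above, not the original code. It assumes the layout implied by that loop: each `NODE` holds a `NAME` element followed by a container of child `NODE` elements. The container tag `NODES`, the helper name `build_graph` and the toy data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

import networkx as nx

# Toy document (invented data); the NODES container tag is an assumption
TOY_XML = """
<TREE>
  <NODE ID="1" CHILDCOUNT="2">
    <NAME>Life on Earth</NAME>
    <NODES>
      <NODE ID="2" CHILDCOUNT="0"><NAME>Eubacteria</NAME></NODE>
      <NODE ID="3" CHILDCOUNT="1">
        <NAME>Eukaryotes</NAME>
        <NODES>
          <NODE ID="4" CHILDCOUNT="0"><NAME/></NODE>
        </NODES>
      </NODE>
    </NODES>
  </NODE>
</TREE>
"""


def build_graph(xml_root):
    """Return a DiGraph with one node per NODE element and parent -> child edges."""
    graph = nx.DiGraph()
    for element in xml_root.iter("NODE"):
        attributes = dict(element.attrib)  # copy, so the XML tree is not modified
        # An empty or missing NAME becomes the string 'None', as in the loop above
        attributes["name"] = element.findtext("NAME") or "None"
        graph.add_node(attributes["ID"], **attributes)
        children = element.find("NODES")
        if children is not None:
            for child in children.findall("NODE"):
                graph.add_edge(attributes["ID"], child.get("ID"), weight=1)
    return graph


toy_graph = build_graph(ET.fromstring(TOY_XML.strip()))
print("Nodes:", toy_graph.number_of_nodes())
print("Edges:", toy_graph.number_of_edges())
print("Is a tree?", nx.is_tree(toy_graph))
```

Looking up the child container by tag name rather than by position is a design choice of this sketch; it only works if the container really is called `NODES` in the file being parsed.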


In [81]:
# Find the root node, the only one with in-degree 0
# dict(...) accepts both the dict returned by NetworkX 1.x and the degree view returned by 2.x
root_node_list = [n for n,d in dict(G.in_degree()).items() if d==0]
root_node_id = root_node_list[0]
print('Root node id:',root_node_id)

Root node id: 1


In [82]:
# Details about the root node
# G.nodes[...] is the NetworkX 2.x accessor; recent releases no longer provide G.node
print(G.nodes[root_node_id])
print('Degree:',G.degree(root_node_id))
print('Successors: ',[G.nodes[node]['name'] for node in G.successors(root_node_id)])

{'HASPAGE': '2', 'CHILDCOUNT': '4', 'CONFIDENCE': '0', 'EXTINCT': '0', 'LEAF': '0', 'name': 'Life on Earth', 'ID': '1', 'PHYLESIS': '0'}
Degree: 4
Successors:  ['Eubacteria', 'Eukaryotes', 'Viruses', 'Archaea']
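Once the root is known, the same graph can answer lineage questions, such as the chain of names from the root down to a given node. The following is a minimal sketch on an invented toy graph, assuming NetworkX 2.x or later and a `name` attribute on every node as set during construction; the helper name `lineage` and the toy node `Mammals` are hypothetical.

```python
import networkx as nx


def lineage(graph, root_id, node_id):
    """Return the node names along the path from root_id down to node_id."""
    # In a tree there is exactly one directed path from the root to any node
    path = nx.shortest_path(graph, source=root_id, target=node_id)
    return [graph.nodes[n]["name"] for n in path]


# Invented toy data standing in for the full graph G
toy = nx.DiGraph()
toy.add_node("1", name="Life on Earth")
toy.add_node("3", name="Eukaryotes")
toy.add_node("4", name="Mammals")
toy.add_edge("1", "3", weight=1)
toy.add_edge("3", "4", weight=1)

print(lineage(toy, "1", "4"))
```

On the full graph the call would take the form `lineage(G, root_node_id, some_node_id)`, where `some_node_id` is the `ID` string of the node of interest.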


In [83]:
# Saving the graph in json format
from networkx.readwrite import json_graph
import json
with open('treeoflife.json', 'w') as outfile1:
    outfile1.write(json.dumps(json_graph.node_link_data(G)))

In [88]:
# Saving the graph in graphML format
nx.write_graphml(G, "treeoflife.graphml")
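A quick way to check an export is to read the file back and compare it with the graph in memory. The sketch below does this for GraphML on an invented toy graph and a temporary file; the helper name `graphml_round_trip` is an assumption, and the check relies on node IDs being strings, as they are in the graph built above.

```python
import os
import tempfile

import networkx as nx


def graphml_round_trip(graph):
    """Write graph to a temporary GraphML file and return the graph read back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "check.graphml")
        nx.write_graphml(graph, path)
        return nx.read_graphml(path)


# Invented toy data standing in for the full graph G
toy = nx.DiGraph()
toy.add_node("1", name="Life on Earth", LEAF="0")
toy.add_node("2", name="Eubacteria", LEAF="0")
toy.add_edge("1", "2", weight=1)

reloaded = graphml_round_trip(toy)
print("Same nodes:", set(reloaded.nodes) == set(toy.nodes))
print("Same edges:", set(reloaded.edges) == set(toy.edges))
```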

See https://networkx.github.io/ for more file formats and additional details on the handling of the graph.