#GSoC 2014 el-grafo d3-based Bio4j data model visualization
Student: Carmen Torrecillas / Organization: Bio4j / Mentors: @eparejatobes, @laughedelic, @evdokim
##PROJECT DESCRIPTION
The GSoC 2014 el-grafo project is the first development of an interactive web-based tool that lets users intuitively explore the abstract domain model of Bio4j, an open source bioinformatics data platform. Bio4j integrates the data available in the most representative open data sources around protein information: UniProtKB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI taxonomy and ExPASy ENZYME.
el-grafo shows, module by module, how the vertex and edge components that shape the logical structure of the network are connected, and it provides specific typing information for retrieving useful data. Its purpose is to help users gain a better understanding of the Bio4j domain model structure, lowering the entry barrier to the database platform so that they can use and query it more efficiently.
The project consists of an HTML page accompanied by several scripts, data files from the model database, and an external CSS file. The directory layout is presented as follows:
All scripts could be loaded from the HTML page, but since the library APIs may change over time, a copy of the libraries is included in the scripts folder. They are loaded as follows:
##ABOUT THE MODEL DATA
Once retrieved, the service data was adapted until it suited the input expected by graphlib/dagre. Once the graph structure and layout were set, they were used as input for dagre-d3, which renders the visualization on top of d3. For the purposes of this project, it was necessary to link all the information generated by the layout with the rendered graph.
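The adaptation step described above can be sketched roughly as follows. This is a library-free illustration, not the project's actual code: the shape of `modelData`, its field names (`vertexTypes`, `edgeTypes`, `source`, `target`, `arity`), the placeholder type names and the helper `toGraphInput` are all assumptions made for the example.

```javascript
// Hypothetical model description, as it might arrive from the service.
// Field names and type names are placeholders, not the real Bio4j model.
const modelData = {
  module: "exampleModule",
  vertexTypes: [
    { name: "VertexA", properties: ["id", "name"] },
    { name: "VertexB", properties: ["name"] }
  ],
  edgeTypes: [
    { name: "EdgeAA", source: "VertexA", target: "VertexA", arity: "many/many" },
    { name: "EdgeAB", source: "VertexA", target: "VertexB", arity: "many/one" },
    // This edge points at a vertex type that is not part of the module.
    { name: "EdgeAZ", source: "VertexA", target: "VertexZ", arity: "one/one" }
  ]
};

// Turn the model description into flat node and edge lists, the kind of
// input a graph library typically expects. Edges whose endpoints are not
// known vertex types are dropped so the graph stays consistent.
function toGraphInput(model) {
  const nodes = model.vertexTypes.map(v => ({
    id: v.name,
    label: v.name,
    properties: v.properties
  }));
  const knownIds = new Set(nodes.map(n => n.id));
  const edges = model.edgeTypes
    .filter(e => knownIds.has(e.source) && knownIds.has(e.target))
    .map(e => ({
      id: e.name,
      source: e.source,
      target: e.target,
      label: e.name,
      arity: e.arity
    }));
  return { nodes, edges };
}

const graphInput = toGraphInput(modelData);
console.log(graphInput.nodes.length, graphInput.edges.length);
```

Keeping the type information (`properties`, `arity`) on each node and edge object is one way to link what the layout produces back to the model, since the same objects can be looked up again when a rendered element is selected.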
It is important to note that the data shown in the el-grafo visualization tool describes the Bio4j abstract model, not the actual data stored in the database, although that is an interesting feature that could be integrated in further development of this tool.
##ABOUT THE NETWORK LAYOUT AND REPRESENTATION
Network structures are usually represented either by plotting x, y coordinates stored as attributes of the nodes, or by applying a "force-based" algorithm that relies on repulsive/attractive forces and a canvas gravity.
A set of tools and libraries for graph data manipulation and representation was explored in order to find the most suitable solution for the Bio4j database model visualization.
For the layout component of the project, the final choice was to integrate the graph layout library dagre.js as the most orderly and simple way to lay out the network. It bundles the graphlib library, which provides data structures for undirected and directed multigraphs along with algorithms. Our case corresponds to Digraph, since the graphs are directed multigraphs.
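To make the notion of a directed multigraph concrete, here is a small self-contained sketch. It does not use graphlib: `SimpleDigraph` is a hypothetical class written for this example, and its methods only mimic the general idea of identifying each edge by its own id, so that several edges can connect the same ordered pair of nodes.

```javascript
// Minimal directed multigraph. Edges are stored under their own ids,
// so parallel edges between the same two nodes are allowed.
class SimpleDigraph {
  constructor() {
    this.nodes = new Map(); // node id -> node value
    this.edges = new Map(); // edge id -> { source, target, value }
  }

  addNode(id, value) {
    this.nodes.set(id, value);
  }

  addEdge(id, source, target, value) {
    if (!this.nodes.has(source) || !this.nodes.has(target)) {
      throw new Error("Both endpoints must be added before the edge");
    }
    if (this.edges.has(id)) {
      throw new Error("Duplicate edge id: " + id);
    }
    this.edges.set(id, { source, target, value });
  }

  // Ids of the edges that point at the given node.
  inEdges(nodeId) {
    return [...this.edges.entries()]
      .filter(([, e]) => e.target === nodeId)
      .map(([id]) => id);
  }

  // Ids of the edges that leave the given node.
  outEdges(nodeId) {
    return [...this.edges.entries()]
      .filter(([, e]) => e.source === nodeId)
      .map(([id]) => id);
  }
}

const g = new SimpleDigraph();
g.addNode("A", { label: "A" });
g.addNode("B", { label: "B" });
g.addEdge("e1", "A", "B", { arity: "many/one" });
g.addEdge("e2", "A", "B", { arity: "one/one" }); // parallel to e1
g.addEdge("e3", "B", "A", { arity: "many/many" });

console.log(g.outEdges("A")); // ids of the two parallel edges A -> B
```

A domain model often relates the same two vertex types through more than one edge type, which is why a plain adjacency matrix or a "one edge per pair" structure would not be enough here.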
On the rendering side, I decided to use dagre-d3.js, the d3-based renderer for dagre.js on the client side, as the core of the project.
As previously explained, this approach to the Bio4j database representation consists of an HTML page that initially works as an introduction to the project, its purpose and context. It includes:
- General information about the project.
- Schematic representation of Modules and Dependencies as colored areas and the lines connecting them. Sizes are relative to the number of types (vertices and edges) in each module.
- On-click selection of the Modules and Dependencies, which can be LOADED and explored independently.
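The sizing of the module areas mentioned in the list above could be computed along these lines. This is only a sketch under stated assumptions: the module names and counts are invented, `moduleRadii` is a hypothetical helper, and scaling the radius by a square root (so that the circle area, rather than the radius, grows with the number of types) is a design choice of this example, not something the project is known to do.

```javascript
// Invented per-module counts of vertex and edge types.
const moduleCounts = [
  { name: "moduleA", vertexTypes: 6, edgeTypes: 10 }, // 16 types
  { name: "moduleB", vertexTypes: 1, edgeTypes: 3 },  // 4 types
  { name: "moduleC", vertexTypes: 4, edgeTypes: 5 }   // 9 types
];

// Give the largest module the radius maxRadius and scale the others so
// that circle AREA is proportional to the number of types.
function moduleRadii(modules, maxRadius) {
  const totals = modules.map(m => m.vertexTypes + m.edgeTypes);
  const largest = Math.max(...totals);
  return modules.map((m, i) => ({
    name: m.name,
    types: totals[i],
    radius: maxRadius * Math.sqrt(totals[i] / largest)
  }));
}

const radii = moduleRadii(moduleCounts, 60);
radii.forEach(r => console.log(r.name, r.types, r.radius));
```

Scaling by area tends to keep small modules visible next to large ones, whereas scaling the radius linearly exaggerates the differences.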
Once a specific Module or Dependency is loaded, the web interface switches to the main graph representation view, which offers:
- Graph representation of how all vertex and edge components are connected and how they shape each module of the network.
- Graphical representation of ARITY possibilities of the edges: many/many, many/one, one/many, one/one.
- Graph filtering actions via a contextual menu, which let users perform specific graph actions (filtering by successors, predecessors, neighbors, etc.). Some of the graphlib Digraph functions used: inEdges, outEdges, filterNodes...
- Specific information of each vertex/edge of the graphs (id and properties).
- Dependency collapsing/expanding features, which show the particular vertices that connect modules and let users continue exploring other elements of those modules.
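The filtering actions listed above (successors, predecessors, neighbors) can be illustrated with plain arrays. The sketch below is not the graphlib API and not the project's code: the edge list is invented, and the helpers `successors`, `predecessors`, `neighbors` and `filterToNeighborhood` are hypothetical functions written for this example.

```javascript
// Invented edge list: A -> B, B -> C, D -> B.
const edgeList = [
  { id: "e1", source: "A", target: "B" },
  { id: "e2", source: "B", target: "C" },
  { id: "e3", source: "D", target: "B" }
];

// Nodes reachable from nodeId by following one outgoing edge.
function successors(edges, nodeId) {
  return [...new Set(edges.filter(e => e.source === nodeId).map(e => e.target))];
}

// Nodes that reach nodeId through one incoming edge.
function predecessors(edges, nodeId) {
  return [...new Set(edges.filter(e => e.target === nodeId).map(e => e.source))];
}

// Predecessors and successors together, without duplicates.
function neighbors(edges, nodeId) {
  return [...new Set([...predecessors(edges, nodeId), ...successors(edges, nodeId)])];
}

// Keep only the edges whose two endpoints are the selected node or one
// of its neighbors, which is what a "filter by neighbors" action shows.
function filterToNeighborhood(edges, nodeId) {
  const keep = new Set([nodeId, ...neighbors(edges, nodeId)]);
  return edges.filter(e => keep.has(e.source) && keep.has(e.target));
}

console.log(successors(edgeList, "B"));
console.log(predecessors(edgeList, "B"));
console.log(filterToNeighborhood(edgeList, "A").map(e => e.id));
```

In a contextual menu, each entry would call one of these functions with the id of the clicked vertex and redraw the graph from the filtered result.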
The findings above are a starting point for the el-grafo project, a first approach to the complex data structure of the Bio4j model. Further development of the tool and its guide is necessary to make full use of the program. In particular, the links between modules and dependencies can be examined more closely in order to elaborate on their uses and functions within the Bio4j model.
I plan to take this project further, improving its current functionality through additional research and testing of the network representation and user interaction, in order to make el-grafo a more useful and efficient tool.