GSoC 2014 project - D3-based Bio4j data model visualization

# GSoC 2014 el-grafo d3-based Bio4j data model visualization

Student: Carmen Torrecillas / Organization: Bio4j / Mentors: @eparejatobes, @laughedelic, @evdokim

## PROJECT DESCRIPTION

The GSoC 2014 el-grafo project is the first development of an interactive web-based tool that lets users intuitively explore the abstract domain model of Bio4j, an open source bioinformatics data platform. Bio4j integrates the data available in the most representative open data sources on protein information: UniProtKB (SwissProt + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, NCBI taxonomy and Expasy Enzyme.

el-grafo details, module by module, the connections between all the vertex and edge components that shape the logical structure of the network, and provides specific typing information for retrieving useful data. Its purpose is to help users gain a better understanding of the Bio4j domain model structure, lowering their entry barrier to the platform so that they can use and query it more efficiently.

## TECHNOLOGY

The open-source JavaScript libraries d3.js and dagre/dagre-d3 form the core of the project.

The project consists of an HTML page accompanied by several scripts, data files from the model database, and an external CSS file. The directory layout is as follows:

  • 20140818_el-grafo.html
    • scripts
      • lib
      • alg
    • data
    • css

The latest browser-ready builds of the external libraries can be found in their respective GitHub repositories: d3, dagre and dagre-d3.

In addition, the scripts folder contains several algorithms from graphlib.js and the contextmenu jQuery plugin.

All scripts could be loaded directly from their online sources, but as their APIs may change over time, a copy of the libraries is included in the scripts folder. They are loaded as follows:

<script src="scripts/lib/dagre-d3.js"></script>

ABOUT THE MODEL DATA

An HTTP service request for exploring the Bio4j domain model was used to return up-to-date GraphSON/JSON objects for the different integrated modules. More information about this feature here

Once retrieved, the service data was adapted to the input format expected by graphlib/dagre. Once the graph structure and layout were set, they were used as the input for dagre-d3, which renders the visualization on top of d3. For the purposes of this project, it was necessary to link all the information generated by the layout with the rendered graph.

It is important to note that the data shown in the el-grafo visualization tool relates to the Bio4j abstract model, not to the actual data stored in the database, although that is an interesting feature that could be integrated in further development of this tool.
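The adaptation step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the payload shape follows the common GraphSON convention (`_id`, `_outV`, `_inV`, `_label`), and the sample vertex and edge names, `samplePayload` and `graphsonToLists` are all hypothetical. The real service response may be structured differently.

```javascript
// Hypothetical GraphSON-like payload; the real Bio4j service response may differ.
const samplePayload = {
  graph: {
    vertices: [
      { _id: "Protein", _type: "vertex" },
      { _id: "Organism", _type: "vertex" }
    ],
    edges: [
      { _id: "e1", _type: "edge", _label: "proteinOrganism",
        _outV: "Protein", _inV: "Organism" },
      // Dangling edge: its target is not among the vertices above.
      { _id: "e2", _type: "edge", _label: "broken",
        _outV: "Protein", _inV: "Missing" }
    ]
  }
};

// Turn the payload into plain node and edge lists that a graph library
// could consume. Edges pointing at unknown vertices are skipped.
function graphsonToLists(payload) {
  const graph = payload.graph || payload;
  const nodes = (graph.vertices || []).map(function (v) {
    const id = String(v._id);
    return { id: id, label: id, properties: v };
  });
  const knownIds = new Set(nodes.map(function (n) { return n.id; }));
  const edges = [];
  (graph.edges || []).forEach(function (e) {
    const source = String(e._outV);
    const target = String(e._inV);
    if (!knownIds.has(source) || !knownIds.has(target)) { return; }
    edges.push({
      id: String(e._id),
      source: source,
      target: target,
      label: e._label || ""
    });
  });
  return { nodes: nodes, edges: edges };
}

const lists = graphsonToLists(samplePayload);
console.log(lists.nodes.length + " nodes, " + lists.edges.length + " edges");
```

Each entry of the resulting lists would then be added to the graph object that dagre lays out and dagre-d3 renders.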

ABOUT THE NETWORK LAYOUT AND REPRESENTATION

Network structures are usually drawn either by storing x, y coordinates as attributes of the nodes, or by applying a "force-based" algorithm, which relies on repulsive/attractive forces and a canvas gravity.

A set of tools and libraries for graph data manipulation and representation was explored in order to find the most suitable solution for the Bio4j database model visualization.
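To make the "force-based" idea concrete, here is a toy sketch of a single iteration of such an algorithm. It is a generic illustration, not the implementation used by d3 or any other library; `forceStep` and its option names (`repulsion`, `attraction`, `gravity`, `centerX`, `centerY`) are made up for this example.

```javascript
// One iteration of a toy force-directed layout (not d3's implementation).
// nodes: [{ x, y }]; links: [{ source, target }] using array indices.
function forceStep(nodes, links, options) {
  const fx = nodes.map(function () { return 0; });
  const fy = nodes.map(function () { return 0; });

  // Repulsion: every pair of nodes pushes apart, more strongly when close.
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const dx = nodes[i].x - nodes[j].x;
      const dy = nodes[i].y - nodes[j].y;
      const dist2 = Math.max(dx * dx + dy * dy, 1e-6); // avoid division by zero
      const dist = Math.sqrt(dist2);
      const push = options.repulsion / dist2;
      fx[i] += push * dx / dist;
      fy[i] += push * dy / dist;
      fx[j] -= push * dx / dist;
      fy[j] -= push * dy / dist;
    }
  }

  // Attraction: linked nodes pull towards each other like a spring.
  links.forEach(function (link) {
    const dx = nodes[link.target].x - nodes[link.source].x;
    const dy = nodes[link.target].y - nodes[link.source].y;
    fx[link.source] += options.attraction * dx;
    fy[link.source] += options.attraction * dy;
    fx[link.target] -= options.attraction * dx;
    fy[link.target] -= options.attraction * dy;
  });

  // Gravity: every node is pulled towards the centre of the canvas.
  nodes.forEach(function (node, i) {
    fx[i] += options.gravity * (options.centerX - node.x);
    fy[i] += options.gravity * (options.centerY - node.y);
  });

  // Return new positions without mutating the input.
  return nodes.map(function (node, i) {
    return { x: node.x + fx[i], y: node.y + fy[i] };
  });
}
```

Because positions emerge from repeating this step until the forces balance, the final arrangement depends on the starting positions and keeps moving while the simulation runs.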

On the first trial, I found that the d3.js JavaScript library has its own Force Layout for visualizing networks. It uses physical "forces" to arrange the elements and assigns attributes to the nodes (x and y coordinates, weight, ...) and links (source and target nodes). Although it is a flexible and interactive way to visualize these networks, it was judged unsuitable for this project because it jumbles the layout and the representation. The dynamic, physics-driven nature of the force layout would also make the tool harder to use.

Other very interesting open-source JavaScript libraries consulted were cola.js, a constraint-based layout in the browser that works with d3.js, and joint, for interactive diagramming.

The final choice for the layout component of the project was the graph layout library dagre.js, as the most orderly and simple way to lay out the network. It bundles the graphlib library, which provides data structures for undirected and directed multigraphs along with algorithms. Our case corresponds to Digraph, as the graphs are directed multigraphs.

On the rendering side, I decided to use dagre-d3.js, the d3-based client-side renderer for dagre.js, as the core of the project.
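For readers unfamiliar with the term, a directed multigraph allows several distinct edges between the same pair of nodes, which is why each edge needs its own id. The sketch below is a minimal stand-in written for this README, not graphlib itself; `MultiDigraph` and the sample ids are hypothetical, and the method names only loosely echo the graphlib functions mentioned in this document.

```javascript
// Minimal directed multigraph: edges carry their own ids,
// so parallel edges between the same two nodes can coexist.
function MultiDigraph() {
  this.nodeValues = new Map();  // node id -> value
  this.edgeRecords = new Map(); // edge id -> { source, target, value }
}

MultiDigraph.prototype.addNode = function (id, value) {
  this.nodeValues.set(id, value);
};

MultiDigraph.prototype.addEdge = function (edgeId, source, target, value) {
  if (!this.nodeValues.has(source) || !this.nodeValues.has(target)) {
    throw new Error("Both endpoints must be added before the edge");
  }
  if (this.edgeRecords.has(edgeId)) {
    throw new Error("Duplicate edge id: " + edgeId);
  }
  this.edgeRecords.set(edgeId, { source: source, target: target, value: value });
};

// Ids of the edges leaving the given node.
MultiDigraph.prototype.outEdges = function (nodeId) {
  const ids = [];
  this.edgeRecords.forEach(function (edge, edgeId) {
    if (edge.source === nodeId) { ids.push(edgeId); }
  });
  return ids;
};

// Ids of the edges arriving at the given node.
MultiDigraph.prototype.inEdges = function (nodeId) {
  const ids = [];
  this.edgeRecords.forEach(function (edge, edgeId) {
    if (edge.target === nodeId) { ids.push(edgeId); }
  });
  return ids;
};

// Usage with made-up ids: two parallel edges A -> B and one edge B -> A.
const model = new MultiDigraph();
model.addNode("A", { label: "A" });
model.addNode("B", { label: "B" });
model.addEdge("e1", "A", "B", { label: "first" });
model.addEdge("e2", "A", "B", { label: "second" });
model.addEdge("e3", "B", "A", { label: "back" });
console.log(model.outEdges("A")); // ids of both parallel edges
```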

DESCRIPTION

As previously explained, this approach to representing the Bio4j database consists of an HTML page that initially works as an introduction to the project, its purpose and its context.

Features:

  • General information about the project.
  • Schematic representation of Modules and Dependencies as colored areas and lines connecting them. Sizes are relative to the number of types (vertices and edges) in each module.
  • On-click selection of the Modules and Dependencies, which can then be LOADED and explored independently.

Once a specific Module or Dependency is loaded, the web interface switches to the main graph representation view.

Features:

  • Graph representation of how all vertex and edge components are connected and how they shape each module of the network.
  • Graphical representation of the possible ARITY of the edges: many/many, many/one, one/many, one/one.
  • Graph filtering actions via a contextual menu that lets users perform specific graph actions (filtering by successors, predecessors, neighbors, etc.). Some of the graphlib Digraph functions used: inEdges, outEdges, filterNodes...
  • Specific information about each vertex/edge of the graphs (id and properties).
  • Dependency collapsing/expanding features, which show the particular vertices that connect modules and let users continue exploring other elements of those modules.
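The filtering actions listed above (successors, predecessors, neighbors) can be sketched on plain node and edge lists as follows. This is a simplified illustration and does not use graphlib; every function name and the sample graph are hypothetical.

```javascript
// Remove duplicates while keeping first-seen order.
function unique(list) {
  return Array.from(new Set(list));
}

// Nodes reachable from nodeId through one outgoing edge.
function successors(edges, nodeId) {
  return unique(edges
    .filter(function (e) { return e.source === nodeId; })
    .map(function (e) { return e.target; }));
}

// Nodes that reach nodeId through one incoming edge.
function predecessors(edges, nodeId) {
  return unique(edges
    .filter(function (e) { return e.target === nodeId; })
    .map(function (e) { return e.source; }));
}

// Predecessors and successors together.
function neighbors(edges, nodeId) {
  return unique(predecessors(edges, nodeId).concat(successors(edges, nodeId)));
}

// Keep only the given node ids and the edges whose two endpoints both survive.
function filterGraph(graph, keepIds) {
  const keep = new Set(keepIds);
  return {
    nodes: graph.nodes.filter(function (n) { return keep.has(n.id); }),
    edges: graph.edges.filter(function (e) {
      return keep.has(e.source) && keep.has(e.target);
    })
  };
}

// Made-up sample graph with a parallel edge A -> B.
const sampleGraph = {
  nodes: [{ id: "A" }, { id: "B" }, { id: "C" }, { id: "D" }],
  edges: [
    { id: "e1", source: "A", target: "B" },
    { id: "e2", source: "A", target: "B" },
    { id: "e3", source: "B", target: "C" },
    { id: "e4", source: "D", target: "B" }
  ]
};

// "Filter by successors" of B: keep B itself plus its successors.
const successorView = filterGraph(
  sampleGraph, ["B"].concat(successors(sampleGraph.edges, "B")));
console.log(successorView.nodes.map(function (n) { return n.id; }));
```

A contextual menu entry would call one of these helpers for the clicked vertex and re-render the filtered graph.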

WHAT'S NEXT

The findings above are a starting point for the el-grafo project, a first approach to a complex data structure. Further development of the tool and its guide is needed before it can be used to its full extent. In particular, the links between modules and dependencies can be examined more closely to elaborate on their uses and functions within the Bio4j model.

I plan to take this project further, improving its current functionality through additional research and testing of the network representation and user interaction, in order to make el-grafo a more useful and efficient tool.