Skip to content

Latest commit

 

History

History
194 lines (138 loc) · 8.15 KB

datatransform.rst

File metadata and controls

194 lines (138 loc) · 8.15 KB

DataTransform Module

A key problem when merging data from multiple data sources is finding a common identifier. To ameliorate this problem, we have written a DataTransform module to convert identifiers from one type to another. Frequently, this conversion process has multiple steps, where an identifier is converted to one or more intermediates before having its final value. To describe these steps, the user defines a graph where each node represents an identifier type and each edge represents a conversion. The module processes documents using the network to convert their identifiers to their final form.

A graph is a mathematical model describing how different things are connected. Using our model, our module is connecting different identifiers together. Each connection is an identifier conversion or lookup process. For example, a simple graph could describe how pubchem identifiers could be converted to drugbank identifiers using MyChem.info.

Graph Definition

The following graph facilitates conversion from inchi to inchikey using pubchem as an intermediate:

from biothings.hub.datatransform import MongoDBEdge
import networkx as nx

graph_mychem = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('inchi')
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')

graph_mychem.add_edge('inchi', 'pubchem',
                      object=MongoDBEdge('pubchem', 'pubchem.inchi', 'pubchem.cid'))

graph_mychem.add_edge('pubchem', 'inchikey',
                      object=MongoDBEdge('pubchem', 'pubchem.cid', 'pubchem.inchi_key'))

To setup a graph, one must define nodes and edges. There should be a node for each type of identifier and an edge which describes how to convert from one identifier to another. Node names can be arbitrary; the user is allowed to chose what an identifier should be called. Edge classes, however, must be defined precisely for conversion to be successful.

Edge Classes

The following edge classes are supported by the DataTransform module. One of these edge classes must be selected when defining an edge connecting two nodes in a graph.

MongoDBEdge

biothings.hub.datatransform.MongoDBEdge

The example above uses the MongoDBEdge class to convert from inchi to inchikey.

MyChemInfoEdge

biothings.hub.datatransform.MyChemInfoEdge

This example graph uses the MyChemInfoEdge class to convert from pubchem to inchikey. The pubchem.cid and pubchem.inchi_key fields are returned by MyChem.info and are listed by /metadata/fields.

from biothings.hub.datatransform import MyChemInfoEdge
import networkx as nx

graph_mychem = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')

graph_mychem.add_edge('pubchem', 'inchikey',
                      object=MyChemInfoEdge('pubchem.cid', 'pubchem.inchi_key'))

MyGeneInfoEdge

biothings.hub.datatransform.MyGeneInfoEdge

RegExEdge

biothings.hub.datatransform.RegExEdge

This example graph uses the RegExEdge class to convert from pubchem to a shorter form. The CID: prefix is removed by the regular expression substitution:

from biothings.hub.datatransform import RegExEdge
import networkx as nx

graph = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph.add_node('pubchem')
graph.add_node('pubchem-short')

graph.add_edge('pubchem', 'pubchem-short',
               object=RegExEdge('CID:', ''))

Example Usage

A complex graph developed for use with MyChem.info is shown here. This file includes a definition of the MyChemKeyLookup class which is used to call the module on the data source. In general, the graph and class should be supplied to the user by the BioThings.api maintainers.

To call the DataTransform module on the Biothings Uploader, the following definition is used:

keylookup = MyChemKeyLookup(
        [('inchi', 'pharmgkb.inchi'),
         ('pubchem', 'pharmgkb.xrefs.pubchem.cid'),
         ('drugbank', 'pharmgkb.xrefs.drugbank'),
         ('chebi', 'pharmgkb.xrefs.chebi')])

def load_data(self,data_folder):
    input_file = os.path.join(data_folder,"drugs.tsv")
    return self.keylookup(load_data)(input_file)

The parameters passed to MyChemKeyLookup are a list of input types. The first element in an input type is the node name that must match the graph. The second element is the field in dotstring notation which should describe where the identifier should be read from in a document.

The following report was reported when using the DataTransform module with PharmGKB. Reports have a section for document conversion and a section describing conversion along each edge. The document section shows which inputs were used to produce which outputs. The edge section is useful in debugging graphs, ensuring that different conversion edges are working properly.

{
     'doc_report': {
          "('inchi', 'pharmgkb.inchi')-->inchikey": 1637,
          "('pubchem', 'pharmgkb.xrefs.pubchem.cid')-->inchikey": 46
          "('drugbank', 'pharmgkb.xrefs.drugbank')-->inchikey": 41,
          "('drugbank', 'pharmgkb.xrefs.drugbank')-->drugbank": 25,
     }
     'edge_report': {
          'inchi-->chembl': 1109,
          'inchi-->drugbank': 319,
          'inchi-->pubchem': 209,
          'chembl-->inchikey': 1109,
          'drugbank-->inchikey': 360,
          'pubchem-->inchikey': 255
          'drugbank-->drugbank': 25,
     },
}

As an example, the number identifiers converted from inchi to inchikey is 1637. However, these conversions are done via intermediates. One of these intermediates is chembl and the number of identifiers converted from inchi to chembl is 319. Some identifiers are converted directly from pubchem and drugbank. The inchi field is used to lookup several intermediates (chembl, drugbank, and pubchem). Eventually, most of these intermediates are converted to inchikey.

Advanced Usage - DataTransform MDB

The DataTransformMDB module was written as a decorator class which is intended to be applied to the load_data function of a Biothings Uploader. This class can be sub-classed to simplify applification within a Biothings service.

biothings.hub.datatransform.DataTransformMDB

An example of how to apply this class is shown below:

keylookup = DataTransformMDB(graph, input_types, output_types,
                             skip_on_failure=False, skip_w_regex=None,
                             idstruct_class=IDStruct, copy_from_doc=False)
def load_data(self,data_folder):
     input_file = os.path.join(data_folder,"drugs.tsv")
     return self.keylookup(load_data)(input_file)

It is possible to extend the DataTransformEdge type and define custom edges. This could be useful for example if the user wanted to define a computation that transforms one identifier to another. For example inchikey may be computed directly by performing a hash on the inchi identifier.

Document Maintainers

  • Greg Taylor (@gregtaylor)
  • Chunlei Wu (@chunleiwu)