# Tagstorm as GA4GH Metadata

The GA4GH dataset message groups data about a topic of concern and is a natural place to provide top-level metadata for analysis.

In this notebook we show how the GA4GH schemas can be used to interchange tagStorm metadata in a dataset message.

## Using GA4GH Schemas

In [1]:
from ga4gh.schemas.ga4gh import metadata_pb2 as metadata

Protocol buffers provide a language-neutral representation that can be serialized as binary or JSON. We begin by importing the schemas and initializing an empty dataset message.

In [2]:
dataset = metadata.Dataset()
print(dataset)




To demonstrate basic features of the message, we first set its name. Datasets are also expected to have a unique identifier for any server instance.

In [3]:
dataset.name = "Tag Storm Test"
print(dataset)

name: "Tag Storm Test"



### Setting attributes

The info field was designed to allow arbitrary data to be interchanged and is organized as a map from strings to lists of values. A proposal has been made to improve type representation in the field [here](https://github.com/ga4gh/schemas/pull/700). However, for string-to-string mappings the info field should suffice. The cells below work with the `attributes` field of the dataset.

In [4]:
print(dataset.attributes)
attributes = dataset.attributes.attr




Currently this field isn't set. Let's add a test attribute key that maps a string to a single string. Remember that we are acting on language-neutral protobuf objects here: although the syntax borrows from the Python API, they are not Python dictionaries.

In [5]:
dataset.attributes.attr['tagStorm'].values.add().string_value = "version 0.0"
tagStorm = dataset.attributes.attr['tagStorm']

In [6]:
print(dataset)

name: "Tag Storm Test"
attributes {
  attr {
    key: "tagStorm"
    value {
      values {
        string_value: "version 0.0"
      }
    }
  }
}
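The assignment above never created the `tagStorm` key explicitly; indexing the map was enough. As a rough analogy only, and not the protobuf API itself, `collections.defaultdict` shows the same create-on-access behaviour in plain Python. The name `attr` and the one-key dict used as a value here are illustrative stand-ins.

```python
from collections import defaultdict

# Plain-Python analogy only: protobuf maps are not dicts, but indexing a
# missing key behaves much like it does on a defaultdict.
attr = defaultdict(list)

# No explicit "create the key" step is needed before appending a value.
attr["tagStorm"].append({"string_value": "version 0.0"})

# Merely reading a missing key also creates an (empty) entry.
_ = attr["untouched"]

print(sorted(attr))
```

One practical consequence is that reading a key to check whether it exists can add it, so membership tests are the safer way to probe.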



For the simplest use case, the resulting structure is a bit overwrought. The `value` and `values` fields are the indirection that allows us to define our own type system, which protobuf serializers and deserializers handle by default.

We can add more values to this key using the same method as above.

In [7]:
tagStorm.values.add().string_value = "test"
print(tagStorm)

values {
  string_value: "version 0.0"
}
values {
  string_value: "test"
}



Note that types under a key do not have to be the same.

In [8]:
tagStorm.values.add().int64_value = 31412412549823242
print(tagStorm)

values {
  string_value: "version 0.0"
}
values {
  string_value: "test"
}
values {
  int64_value: 31412412549823242
}
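To illustrate how one key can hold differently typed values, here is a minimal plain-Python sketch that tags each value with the name of the field it would occupy. The helper `to_typed_value` is hypothetical and only covers the two field names seen above, `string_value` and `int64_value`.

```python
def to_typed_value(value):
    """Wrap a Python value in a one-key dict naming the field it would use.

    Hypothetical helper for illustration; it only knows the two field
    names that appear in this notebook.
    """
    # bool is a subclass of int, so reject it before the int check
    if isinstance(value, bool):
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, int):
        return {"int64_value": value}
    if isinstance(value, str):
        return {"string_value": value}
    raise TypeError("unsupported type: {}".format(type(value).__name__))


# The same three values that were added to the tagStorm key above.
tag_values = [to_typed_value(v)
              for v in ["version 0.0", "test", 31412412549823242]]
for entry in tag_values:
    print(entry)
```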



## Importing tagStorm data

tagStorm is a no-markup language for describing genomics metadata. First, we'll get the file.

In [9]:
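try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3
#response = urlopen("http://hgwdev.cse.ucsc.edu/~kent/tagStorm/testTagStorm.txt")
response = urlopen("https://gist.githubusercontent.com/david4096/93483758c6519ed9d5e0e6888af1b566/raw/df11db57fac290585dd0c2efe22de8fc698fadcf/tagStorm.txt")
# decode so the text is a string on both Python 2 and Python 3
tagStorm = response.read().decode("utf-8")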

We'll print the first 500 characters to observe the file structure.

In [10]:
print(tagStorm[:500])

lab smith
assay RNA-seq
access all
organ brain

    donor 002
    age 5
    age_units weeks
    life_stage embryo
    sex male
    biosample_date 2015-01-05

        organ brain
        lab_smith_disassociation_protocol fetal_brain_digest.pdf
        lab_smith_quality 0

            part 1
            format fastq
            file ex00do00or00fa00.fq.gz

            file ex00do00or00fa01.fq.gz
            format fastq
            part 2

            file ex00do00or00fa02.fq.gz
            format


We note that this is a hierarchical data structure that uses indentation. Order seems to be important, and there seems to be no restriction on the types of nodes that can have children. 

Unfortunately, protobuf maps require the keys to be strings, integers, or booleans. The tagStorm structure allows arbitrary ordered lists of key-value pairs to act as keys to nested nodes. To interchange this data structure with the proper fidelity, we'll have to create our own key: `children`.
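To make the target shape concrete, the following sketch parses indented tagStorm text into nested plain-Python dictionaries, each with an ordered `tags` list and a `children` list. The function name `parse_tagstorm`, the dict layout, the four-space indent, and the sample text are all assumptions made for illustration; none of this is part of the GA4GH schemas.

```python
def parse_tagstorm(text, indent_width=4):
    """Parse tagStorm text into a list of nested stanza dicts (a sketch)."""
    roots = []
    stack = []     # stack[d] is the most recent stanza seen at depth d
    stanza = None  # stanza currently being filled, or None between stanzas
    for line in text.split("\n"):
        if not line.strip():
            stanza = None  # a blank line closes the current stanza
            continue
        if stanza is None:
            # the first line of a stanza decides its depth
            depth = (len(line) - len(line.lstrip(" "))) // indent_width
            stanza = {"tags": [], "children": []}
            del stack[depth:]  # forget stanzas at this depth or deeper
            siblings = stack[-1]["children"] if stack else roots
            siblings.append(stanza)
            stack.append(stanza)
        # the first space separates the key from the value
        key, _, value = line.strip().partition(" ")
        stanza["tags"].append((key, value))
    return roots


# Hand-written sample in the style of the file above.
sample = (
    "lab smith\n"
    "assay RNA-seq\n"
    "\n"
    "    donor 002\n"
    "    age 5\n"
    "\n"
    "        organ brain\n"
    "\n"
    "    donor 003\n"
)
tree = parse_tagstorm(sample)
print(tree[0]["tags"])
```

Keeping `tags` as a list of pairs rather than a dict preserves the order of tags within a stanza, which the file format appears to treat as significant.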

## Representing tagStorm

To interchange tagStorm, we create a hierarchical structure of attributes, using a `children` key. Let's create a simple tagStorm structure with two root stanzas of two tags each, one of which has a nested stanza.

In [84]:
dataset = metadata.Dataset()
# Make a list of attributes that will be the root of the tag tree
tagStorm = dataset.attributes.attr['tagStorm'].values.add().attribute_list
# Make a node that will have some attributes
node = tagStorm.values.add().attributes
node.attr["lab"].values.add().string_value = "smith"
node.attr["assay"].values.add().string_value = "RNA-Seq"
node2 = tagStorm.values.add().attributes
node2.attr["lab"].values.add().string_value = "doe"
node2.attr["assay"].values.add().string_value = "RNA-Seq"
children = node.attr["children"].values.add().attribute_list.values
childnode = children.add().attributes
childnode.attr['donor'].values.add().string_value = "002"
print(tagStorm)

values {
  attributes {
    attr {
      key: "assay"
      value {
        values {
          string_value: "RNA-Seq"
        }
      }
    }
    attr {
      key: "children"
      value {
        values {
          attribute_list {
            values {
              attributes {
                attr {
                  key: "donor"
                  value {
                    values {
                      string_value: "002"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
    attr {
      key: "lab"
      value {
        values {
          string_value: "smith"
        }
      }
    }
  }
}
values {
  attributes {
    attr {
      key: "assay"
      value {
        values {
          string_value: "RNA-Seq"
        }
      }
    }
    attr {
      key: "lab"
      value {
        values {
          string_value: "doe"
        }
      }
    }
  }
}



By combining attributes and attribute value lists, the GA4GH schemas can interchange tagStorm data, with the added benefit of type safety on values. The function below converts such a message back into tagStorm lines.

In [95]:
def proto_to_tagstorm(message, indent=0):
    """Render an attribute value list as indented tagStorm lines."""
    lines = []
    for v in message.values:
        for node in v.attribute_list.values:
            children = []
            for key in node.attributes.attr:
                if key != 'children':
                    # tagStorm carries no type information, so only the
                    # first string value of each tag is written out
                    value = node.attributes.attr[key].values[0].string_value
                    lines.append("{indent}{key} {value}".format(
                            indent=indent * "    ", key=key, value=value))
                else:
                    # nested stanzas are indented one level deeper
                    children = proto_to_tagstorm(
                        node.attributes.attr[key], indent=indent + 1)
            # a blank line ends the stanza, then its children follow
            lines.append("")
            lines = lines + children
    return lines

for line in proto_to_tagstorm(dataset.attributes.attr['tagStorm']):
    print(line)


assay RNA-Seq
lab smith

    donor 002

assay RNA-Seq
lab doe



Since tagStorm files do not appear to maintain type information, the above code should be able to create tagStorm files from properly formatted attribute lists.
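The same indentation logic can be tried without the GA4GH package. The sketch below serializes nested plain-Python stanzas, where each stanza is a dict with an ordered `tags` list and an optional `children` list. That dict layout and the name `stanzas_to_lines` are assumptions for illustration, not part of the schemas.

```python
def stanzas_to_lines(stanzas, depth=0, indent="    "):
    """Render nested stanza dicts as tagStorm lines (a sketch)."""
    lines = []
    for stanza in stanzas:
        for key, value in stanza["tags"]:
            lines.append("{}{} {}".format(indent * depth, key, value))
        # a blank line ends the stanza, then its children follow
        lines.append("")
        lines.extend(
            stanzas_to_lines(stanza.get("children", []), depth + 1, indent))
    return lines


# Hand-written three-level example.
tree = [{
    "tags": [("lab", "smith"), ("assay", "RNA-seq")],
    "children": [{
        "tags": [("donor", "002")],
        "children": [{"tags": [("organ", "brain")]}],
    }],
}]
lines = stanzas_to_lines(tree)
print("\n".join(lines))
```

Passing `depth + 1` down the recursion is what keeps the indentation correct beyond the first nested level.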

## Parsing tagStorm

We can now create a simple parser that reads the tagStorm file line by line. By counting the indentation level and splitting each line, we can build a message that represents the tagStorm. We assume that indentation uses four spaces per level, that the first line has no indentation, and that the first space on a line is always the key-value separator. Extra blank lines are optional.

In [144]:
def tagstorm_to_proto(tagstorm):
    """Parse a list of tagStorm lines into a Dataset message."""
    dataset = metadata.Dataset()
    root = dataset.attributes.attr['tagStorm'].values.add().attribute_list
    # stack[d] is [node, children list or None] for the latest stanza at depth d
    stack = []
    i = 0
    while i < len(tagstorm):
        line = tagstorm[i]
        if len(line.strip()) == 0:
            # blank lines only separate stanzas
            i += 1
            continue
        indentation = (len(line) - len(line.lstrip())) // 4 # spaces
        # forget stanzas at this depth or deeper; they are now closed
        del stack[indentation:]
        if indentation == 0:
            siblings = root
        else:
            # assumes indentation grows by one level at a time
            parent = stack[-1]
            if parent[1] is None:
                parent[1] = parent[0].attr['children'].values.add().attribute_list
            siblings = parent[1]
        node = siblings.values.add().attributes
        stack.append([node, None])
        # read a stanza
        while i < len(tagstorm) and len(tagstorm[i].strip()) != 0:
            # the first space separates the key from the value
            key, _, value = tagstorm[i].strip().partition(' ')
            node.attr[key].values.add().string_value = value
            i += 1
    return dataset

# Round trip: message -> tagStorm lines -> message -> tagStorm lines
for line in proto_to_tagstorm(
    tagstorm_to_proto(
        proto_to_tagstorm(
            dataset.attributes.attr['tagStorm'])).attributes.attr['tagStorm']):
    print(line)

In [61]:

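dataset = metadata.Dataset()
top = dataset.attributes.attr['tagStorm'].values.add().attribute_list
# roots[d] is the most recent stanza at depth d; deeper stanzas attach to it
roots = []
# tagStorm must hold the raw text downloaded earlier
split = tagStorm.split("\n")
i = 0
while i < len(split):
    line = split[i]
    if len(line.strip()) == 0:
        i += 1
        continue
    indentation = (len(line) - len(line.lstrip())) // 4 # spaces
    del roots[indentation:]
    if indentation == 0:
        current = top
    else:
        current = roots[-1].attr['children'].values.add().attribute_list
    node = current.values.add().attributes
    roots.append(node)
    # add a stanza
    while i < len(split) and len(split[i].strip()) != 0:
        key, _, value = split[i].strip().partition(' ')
        node.attr[key].values.add().string_value = value
        i += 1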

[attributes {
  attr {
    key: "hi"
    value {
      values {
        string_value: "there"
      }
    }
  }
}
]
