<a href="https://colab.research.google.com/github/pradh/api-python/blob/svg/notebooks/Custom_Hierarchy_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Use this notebook to generate a custom StatisticalVariable hierarchy for Data Commons.

Given an outline in the format described below, the notebook generates the hierarchy as [MCF output](https://github.com/datacommonsorg/data/blob/master/docs/mcf_format.md), ready for import into Data Commons.
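
To make the target format concrete, here is a minimal sketch of how a single MCF node can be rendered: a `Node:` line followed by one `property: value` line per property. The helper name `mcf_block` and the sample dcids are illustrative assumptions, not part of the notebook.

```python
def mcf_block(dcid, properties):
    """Render one MCF node: a 'Node:' line followed by 'property: value' lines."""
    lines = [f'Node: dcid:{dcid}']
    lines.extend(f'{prop}: {value}' for prop, value in properties.items())
    return '\n'.join(lines)


# Hypothetical group node, attached to a hypothetical root.
print(mcf_block('custom/g/Group_1A', {
    'typeOf': 'dcs:StatVarGroup',
    'name': '"Group 1A"',
    'specializationOf': 'dcid:dc/g/Root',
}))
```

Blocks like this one are separated by blank lines in the final output.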


### Input Format

The outline below specifies the example hierarchy from the [custom hierarchy documentation page](https://docs.datacommons.org/custom_dc/upload_data.html):

```
Root: dc/g/Custom_Root
Prefix: anything/g
- Group 1A
--* Variable_X
--- Group 2A
---* Variable_Y
- Group 1B
--* Variable_Z: Median Age
```

Use `Root` to identify the StatVarGroup node to attach the hierarchy under, and `Prefix` to set the prefix for the generated StatVarGroup dcids. Both are optional: when omitted, the root defaults to `dc/g/Root` and the prefix to `custom/g/`.
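
As a rough illustration of how the two optional header lines can be read, the sketch below scans the outline for `Root:` and `Prefix:` and falls back to the defaults used by the generator cell further down (`dc/g/Root` and `custom/g/`). The function name `parse_header` is hypothetical.

```python
def parse_header(text, default_root='dc/g/Root', default_prefix='custom/g/'):
    """Return (root, prefix) from the optional 'Root:' and 'Prefix:' lines."""
    root, prefix = default_root, default_prefix
    for line in text.splitlines():
        key, sep, value = line.strip().partition(':')
        if not sep:
            continue  # No colon, so this cannot be a header line.
        key, value = key.strip().lower(), value.strip()
        if key == 'root' and value:
            root = value
        elif key == 'prefix' and value:
            # Ensure a trailing slash so a group id can be appended directly.
            prefix = value if value.endswith('/') else value + '/'
    return root, prefix


print(parse_header("Root: dc/g/Custom_Root\nPrefix: anything/g"))
print(parse_header("- Group 1A"))  # Neither header given: defaults apply.
```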

Every node in the hierarchy starts with a run of dashes `-` whose length reflects the node's level. For a variable, the run of dashes ends with an asterisk `*` (for example, `--*`).
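
The marker rule can be sketched with a small regular expression. This is a simplified illustration: the name `parse_entry` is hypothetical, and the depth here is taken to be the number of dashes only, whereas the generator cell further down also counts the trailing `*` when it computes the level.

```python
import re

# One or more dashes, an optional '*', whitespace, then the rest of the line.
ENTRY_RE = re.compile(r'^(-+)(\*?)\s+(.+)$')


def parse_entry(line):
    """Split an outline line into (depth, is_variable, text), or None if malformed."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    dashes, star, text = match.groups()
    return len(dashes), star == '*', text.strip()


print(parse_entry('- Group 1A'))
print(parse_entry('--* Variable_Z: Median Age'))
print(parse_entry('Group without dashes'))
```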

For StatisticalVariables, give the `dcid`, optionally followed by a colon and a `name` (in the example above, only `Variable_Z` sets a name). When no name is given, the `dcid` doubles as the name. For StatVarGroups, the `dcid` is generated automatically from the group titles.
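
The two naming rules can be sketched as follows. Both helper names (`split_variable`, `group_dcid`) are hypothetical, and the title-to-id step is simplified to replacing spaces with underscores; the generator cell below also rewrites or strips several punctuation characters.

```python
def split_variable(text):
    """Split 'dcid: name' into (dcid, name); the name falls back to the dcid."""
    dcid, sep, name = text.partition(':')
    dcid = dcid.strip()
    name = name.strip() if sep else ''
    return dcid, name or dcid


def group_dcid(title, parent=None, prefix='custom/g/'):
    """Build a group dcid from its title, extending the parent group's dcid if any."""
    token = title.strip().replace(' ', '_')
    return f'{parent}_{token}' if parent else f'{prefix}{token}'


print(split_variable('Variable_Z: Median Age'))
print(split_variable('Variable_X'))

top = group_dcid('Group 1A', prefix='anything/g/')
print(top)
print(group_dcid('Group 2A', parent=top))
```

Deriving a nested group's dcid from its parent's dcid keeps ids unique even when two branches reuse the same group title.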

In [None]:
# @title First, paste the hierarchy here

HIERARCHY = """
Root: dc/g/Custom_Root
Prefix: anything/g
- Group 1A
--* Var X: Median Income
--- Group 2A
---* Var Y: Mean Income of Rats
- Group 1B
--* Var Z: Mean Rainfall
"""

In [None]:
# @title Next, run this cell and copy the generated output

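def get_id(name):
  # Turn a group title into a dcid-friendly token.
  # Names that already look like a dc/g dcid pass through unchanged.
  if name.startswith('dc/g'): return name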
  replace_pairs = [
      (' ', '_'),
      ('%', 'Pct'),
      ('&', 'And'),
      ('-', ''),
      ('+', ''),
      ('(', ''),
      (')', ''),
      ('[', ''),
      (']', ''),
      ('{', ''),
      ('}', ''),
      ('=', ''),
      (':', ''),
  ]
  for old, new in replace_pairs:
      name = name.replace(old, new)
  return name


def get_mcf(lines):
  root = None
  prefix = None
  stack = []
  sv_blocks = []
  svg_blocks = []
  for line in lines.splitlines():
      line = line.strip()
      if not line:
          continue
      if line.lower().startswith("root:"):
          root = line.split(':', 1)[1].strip()
          continue
      if line.lower().startswith("prefix:"):
          prefix = line.split(':', 1)[1].strip()
          if not prefix.endswith('/'):
            prefix = prefix + '/'
          continue

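      # Split the leading marker (dashes, optionally ending in '*') from the text.
      # partition() never raises, so malformed lines reach the check below.
      indents, _, name = line.partition(' ')
      name = name.strip()
      if not indents.startswith('-') or not name:
        print('# ERROR: Bogus line:', line)
        continue
      # The level counts every marker character, including a trailing '*'.
      nlevel = len(indents)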

      if not root:
        # Use default root
        root = 'dc/g/Root'
      if not prefix:
        # Use default prefix
        prefix = 'custom/g/'

      is_sv = False
      if indents.endswith('*'):
        is_sv = True
        parts = name.split(':', 1)
        var_id = parts[0].strip()
        if len(parts) > 1:
          var_name = parts[1].strip()
        else:
          var_name = var_id

      while len(stack) >= nlevel:
        stack.pop()

      if not stack:
        # Top-level node: attach to the root and use the configured prefix.
        parent = root
        id_prefix = prefix
        separator = ''
      else:
        # Nested node: attach to the enclosing group and extend its dcid.
        # A separate variable keeps the configured prefix intact for later
        # top-level groups.
        parent = stack[-1]
        id_prefix = parent
        separator = '_'

      if is_sv:
        sv_parts = [
          f'Node: dcid:{var_id}',
          'typeOf: dcs:StatisticalVariable',
          'populationType: schema:Thing',
          'statType: dcs:measuredValue',
          f'name: "{var_name}"',
          f'measuredProperty: dcid:{var_id}',
          f'memberOf: dcid:{parent}',
        ]
        sv_blocks.append('\n'.join(sv_parts))
      else:
        gid = f'{id_prefix}{separator}{get_id(name)}'
        svg_parts = [
            f'Node: dcid:{gid}',
            'typeOf: dcs:StatVarGroup',
            f'name: "{name}"',
            f'specializationOf: dcid:{parent}',
        ]
        svg_blocks.append('\n'.join(svg_parts))
        stack.append(gid)

  return svg_blocks, sv_blocks


def run(hierarchy):
  svg_mcf, sv_mcf = get_mcf(hierarchy)

  print('# StatVarGroups\n')
  print('\n\n'.join(svg_mcf))

  print('\n# StatVars\n')
  print('\n\n'.join(sv_mcf))

run(HIERARCHY)

# StatVarGroups

Node: dcid:anything/g/Group_1A
typeOf: dcs:StatVarGroup
name: "Group 1A"
specializationOf: dcid:dc/g/Custom_Root

Node: dcid:anything/g/Group_1A_Group_2A
typeOf: dcs:StatVarGroup
name: "Group 2A"
specializationOf: dcid:anything/g/Group_1A

Node: dcid:anything/g/Group_1B
typeOf: dcs:StatVarGroup
name: "Group 1B"
specializationOf: dcid:dc/g/Custom_Root

# StatVars

Node: dcid:Var X
typeOf: dcs:StatisticalVariable
populationType: schema:Thing
statType: dcs:measuredValue
name: "Median Income"
measuredProperty: dcid:Var X
memberOf: dcid:anything/g/Group_1A

Node: dcid:Var Y
typeOf: dcs:StatisticalVariable
populationType: schema:Thing
statType: dcs:measuredValue
name: "Mean Income of Rats"
measuredProperty: dcid:Var Y
memberOf: dcid:anything/g/Group_1A_Group_2A

Node: dcid:Var Z
typeOf: dcs:StatisticalVariable
populationType: schema:Thing
statType: dcs:measuredValue
name: "Mean Rainfall"
measuredProperty: dcid:Var Z
memberOf: dcid:anything/g/Group_1B