# Background

## XTbML

XTbML is an XML standard for representing mortality tables. It was developed jointly by the Technology Section of the SOA and the standards-setting body ACORD. For more information, see [this SOA newsletter article](https://www.soa.org/news-and-publications/newsletters/compact/2013/april/com-2013-iss47/tables-database-goes-xtbml/).

## Low-level access

Not all XTbML files follow the select-and-ultimate structure found in some mortality tables. For example, table `2682` contains three tables, each with two axes. Because we cannot make strong assumptions about the structure of the tables, a general, low-level access layer is provided that is guaranteed to work across all tables. A simplified, high-level access layer is planned for tables that are known to follow the select-and-ultimate structure.
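
To illustrate why structure cannot be assumed, the following sketch counts the `AxisDef` elements under each `Table` of a document. The inline XML is a made-up, heavily trimmed stand-in (not the real contents of table `2682`), the `id` values are invented, and `axes_per_table` is a hypothetical helper name. Only the `./Table` and `MetaData/AxisDef` paths come from the analysis later in this notebook.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed stand-in for an XTbML document with three tables of two axes each.
TOY_XML = """
<XTbML>
  <Table><MetaData><AxisDef id="Age"/><AxisDef id="Duration"/></MetaData></Table>
  <Table><MetaData><AxisDef id="Age"/><AxisDef id="Duration"/></MetaData></Table>
  <Table><MetaData><AxisDef id="Age"/><AxisDef id="Duration"/></MetaData></Table>
</XTbML>
"""

def axes_per_table(root: ET.Element) -> list[int]:
    """Return the number of AxisDef elements for each Table under root."""
    return [len(table.findall("MetaData/AxisDef")) for table in root.findall("./Table")]

root = ET.fromstring(TOY_XML)
print(axes_per_table(root))  # [2, 2, 2]
```

A caller that expects exactly one table with one or two axes would need to check this list first, which is the kind of assumption the low-level access avoids.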

## An Object-Oriented API

We propose to construct an object-oriented API for accessing XTbML files. The API is designed to be easy to use and to be flexible enough to handle all the different types of XTbML files that appear in MORT. At the bottom of the following code block you will see an `XTbML` class that is used as the main entry point to the API.

In [50]:
from dataclasses import dataclass
from typing import Union

@dataclass
class AxisDef:
    ScaleType: str
    AxisName: str
    MinScaleValue: int
    MaxScaleValue: int
    Increment: int

@dataclass
class MetaData:
    ScalingFactor: float
    DataType: str
    Nation: str
    TableDescription: str
    AxisDefs: list[AxisDef]

@dataclass
class Table:
    MetaData: MetaData
    Values: Union[dict[int, float], dict[int, dict[int, float]]]

@dataclass
class ContentClassification:
    TableIdentity: str
    ProviderDomain: str
    ProviderName: str
    TableReference: str
    ContentType: str
    TableName: str
    TableDescription: str
    Comments: str
    KeyWords: list[str]

@dataclass
class XTbML:
    ContentClassification: ContentClassification
    Tables: list[Table]

# Designing the API

The [About](https://mort.soa.org/About.aspx) page for MORT links to an [XML Schema](https://mort.soa.org/XTbML2.7.01.xsd) that describes the structure of the XTbML files. In theory the XML schema is the specification for what XTbML files should look like, but in practice several of the features it describes are never used in the XTbML files. Because we want a minimal API that is optimized for the access patterns in Pymort, we take a more empirical approach and provide only the features that are actually used.

In [51]:
# imports
from collections import Counter
from glob import glob
import xml.etree.ElementTree as ET
from typing import Union
from pprint import pprint


First we get the root element of each XTbML file, which is an `<XTbML>` element. Parsing every file is relatively slow, so we store the roots and reuse them in later computations.

In [52]:
tablePaths = glob('archive-2021-Oct-17-051924/*')
roots = [ET.parse(tablePath).getroot() for tablePath in tablePaths]

These are some utility functions used in the analysis.

In [53]:
def getTagCounts(roots: list[ET.Element]) -> list[Counter[str]]:
    '''
    Return, for each root XTbML element, a Counter mapping tag -> number of elements with that tag
    (counted recursively over the whole tree, including the root itself)
    '''
    return [Counter(element.tag for element in root.iter()) for root in roots]

def getChildrenCounters(roots: list[ET.Element], path: str) -> list[Counter[str]]:
    '''
    Return list of dict[tag, count of children with tag] for all elements at path from root
    This is similar to getTagCounts, but not recursive (children only) and starting from a particular path.
    '''
    allElements = [element for root in roots for element in root.findall(path)]
    childrenTags = [[child.tag for child in el] for el in allElements]
    childrenTagCounts = [Counter(tags) for tags in childrenTags]
    return childrenTagCounts

@dataclass
class CounterStats:
    exists: int
    counts: int
    uniqueCounts: set[int]
    
def getChildrenCounterStats(roots: list[ET.Element], path: str) -> dict[str, CounterStats]:
    '''
    Return dict[tag, CounterStats] for all elements at path from root
    CounterStats contains summary statistics. We can see if the elements at path always have a tag for example, or what the multiplicities of tag are.
    '''
    counterStats: dict[str, CounterStats] = {}
    childrenCounters = getChildrenCounters(roots, path)
    for counter in childrenCounters:
        for key in counter.keys():
            if key not in counterStats:
                counterStats[key] = CounterStats(exists=0, counts=0, uniqueCounts=set())
            counterStats[key].exists += 1
            counterStats[key].counts += counter[key]
            counterStats[key].uniqueCounts.add(counter[key])
    return counterStats

def childrenStatsPrint(path: str):
    '''
    Print summary statistics for all elements at path from root
    '''
    print(path)
    pprint(getChildrenCounterStats(roots, path))
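
To make the three fields of `CounterStats` concrete, here is a condensed, self-contained re-implementation of the same counting logic, applied to two made-up documents. The XML snippets and the name `children_stats` are illustrative assumptions, and the sketch returns plain tuples rather than `CounterStats` objects.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two made-up documents: the first has two <B> children, the second has none.
docs = [
    "<R><A/><B/><B/></R>",
    "<R><A/></R>",
]
toy_roots = [ET.fromstring(doc) for doc in docs]

def children_stats(roots: list[ET.Element], path: str) -> dict[str, tuple[int, int, set[int]]]:
    """Map each child tag to (exists, counts, uniqueCounts) over all elements at path."""
    stats: dict[str, tuple[int, int, set[int]]] = {}
    for root in roots:
        for element in root.findall(path):
            # Count the direct children of this element by tag.
            for tag, n in Counter(child.tag for child in element).items():
                exists, counts, unique = stats.get(tag, (0, 0, set()))
                stats[tag] = (exists + 1, counts + n, unique | {n})
    return stats

toy_stats = children_stats(toy_roots, ".")
print(toy_stats["A"])  # (2, 2, {1}): present under both parents, once each
print(toy_stats["B"])  # (1, 2, {2}): present under one parent, twice there
```

An `exists` value lower than the number of parent elements signals an optional child, while a `uniqueCounts` other than `{1}` signals a child that can repeat.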

## An unused part of the XML schema

The `AxisDef` element provides a concrete example of a feature of the XML schema that is not implemented in the XTbML files. Here is the definition of `AxisDef`:

```xml
<xsd:element name="AxisDef">
    <xsd:annotation>
        <xsd:documentation>Definition of each Axis</xsd:documentation>
    </xsd:annotation>
    <xsd:complexType>
        <xsd:sequence>
            <xsd:element ref="ScaleType" minOccurs="0"/>
            <xsd:element ref="ScaleSubType" minOccurs="0"/>
            <xsd:element ref="AxisName" minOccurs="0"/>
            <xsd:choice>
                <xsd:sequence>
                    <xsd:element ref="MinScaleDate"/>
                    <xsd:element ref="MaxScaleDate" minOccurs="0"/>
                </xsd:sequence>
                <xsd:sequence>
                    <xsd:element ref="MinScaleValue"/>
                    <xsd:element ref="MaxScaleValue" minOccurs="0"/>
                </xsd:sequence>
            </xsd:choice>
            <xsd:element ref="Increment" minOccurs="0"/>
            <xsd:element ref="Mode" minOccurs="0"/>
            <xsd:element ref="Continuous" minOccurs="0"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:string"/>
    </xsd:complexType>
</xsd:element>
```

An investigation of the child elements of `AxisDef` shows that only 5 distinct child tags ever occur, and that each of them appears exactly once in every `AxisDef`.

In [54]:
childrenStatsPrint("./Table/MetaData/AxisDef")

./Table/MetaData/AxisDef
{'AxisName': CounterStats(exists=5364, counts=5364, uniqueCounts={1}),
 'Increment': CounterStats(exists=5364, counts=5364, uniqueCounts={1}),
 'MaxScaleValue': CounterStats(exists=5364, counts=5364, uniqueCounts={1}),
 'MinScaleValue': CounterStats(exists=5364, counts=5364, uniqueCounts={1}),
 'ScaleType': CounterStats(exists=5364, counts=5364, uniqueCounts={1})}


The XML schema could therefore be stricter than it currently is: it could require each of these child tags of `AxisDef` exactly once and remove the unused tags.

```xml
<xsd:element name="AxisDef">
    <xsd:annotation>
        <xsd:documentation>Definition of each Axis</xsd:documentation>
    </xsd:annotation>
    <xsd:complexType>
        <xsd:sequence>
            <xsd:element ref="ScaleType"/>
            <xsd:element ref="AxisName"/>
            <xsd:element ref="MinScaleValue"/>
            <xsd:element ref="MaxScaleValue"/>
            <xsd:element ref="Increment"/>
        </xsd:sequence>
        <xsd:attribute name="id" type="xsd:string"/>
    </xsd:complexType>
</xsd:element>
```
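
With exactly five required children, an `<AxisDef>` element maps directly onto the `AxisDef` dataclass. The following is a minimal sketch of that mapping, not the Pymort implementation: `parse_axis_def` is a hypothetical helper, the sample element and its values are made up, and the sketch assumes the three scale values are always integers, as the dataclass field types suggest.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class AxisDef:
    ScaleType: str
    AxisName: str
    MinScaleValue: int
    MaxScaleValue: int
    Increment: int

def parse_axis_def(element: ET.Element) -> AxisDef:
    """Build an AxisDef from an <AxisDef> element that has all five children."""
    def text(tag: str) -> str:
        child = element.find(tag)
        if child is None or child.text is None:
            raise ValueError(f"<AxisDef> is missing <{tag}>")
        return child.text.strip()

    return AxisDef(
        ScaleType=text("ScaleType"),
        AxisName=text("AxisName"),
        MinScaleValue=int(text("MinScaleValue")),  # assumption: integer-valued
        MaxScaleValue=int(text("MaxScaleValue")),  # assumption: integer-valued
        Increment=int(text("Increment")),          # assumption: integer-valued
    )

# Made-up example element; the values are illustrative only.
SAMPLE = """
<AxisDef id="Age">
  <ScaleType>Age</ScaleType>
  <AxisName>Age</AxisName>
  <MinScaleValue>0</MinScaleValue>
  <MaxScaleValue>120</MaxScaleValue>
  <Increment>1</Increment>
</AxisDef>
"""
axis = parse_axis_def(ET.fromstring(SAMPLE))
print(axis)
```

Raising on a missing child, rather than substituting a default, keeps the parser honest about the "exactly once" observation: a file that breaks it fails loudly.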

## Unused children of ContentClassification

The following child elements of `ContentClassification` were found to be unused:

* `Extension`
* `ContentClassificationKey`
* `EffDate`
* `TableURL`
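
A check of this kind can be sketched as a set difference between the child names the schema allows and the child tags that actually occur. In the sketch below, `unused_children` is a hypothetical helper, the XML is made up, and `DECLARED` is an assumption pieced together from the tags named in this notebook rather than read from the XSD.

```python
import xml.etree.ElementTree as ET

# Assumption: the nine tags observed in this notebook plus the four unused ones.
DECLARED = {
    "TableIdentity", "ProviderDomain", "ProviderName", "TableReference",
    "ContentType", "TableName", "TableDescription", "Comments", "KeyWord",
    "Extension", "ContentClassificationKey", "EffDate", "TableURL",
}

# Made-up document containing each of the nine observed tags.
TOY_XML = """
<XTbML>
  <ContentClassification>
    <TableIdentity>1</TableIdentity>
    <ProviderDomain>example.org</ProviderDomain>
    <ProviderName>Example</ProviderName>
    <TableReference>None</TableReference>
    <ContentType>Toy</ContentType>
    <TableName>Toy table</TableName>
    <TableDescription>Made-up data</TableDescription>
    <Comments>None</Comments>
    <KeyWord>toy</KeyWord>
    <KeyWord>example</KeyWord>
  </ContentClassification>
</XTbML>
"""

def unused_children(roots: list[ET.Element], path: str, declared: set[str]) -> set[str]:
    """Return the declared child names that never occur under any element at path."""
    observed = {
        child.tag
        for root in roots
        for element in root.findall(path)
        for child in element
    }
    return declared - observed

unused = unused_children([ET.fromstring(TOY_XML)], "./ContentClassification", DECLARED)
print(sorted(unused))
# ['ContentClassificationKey', 'EffDate', 'Extension', 'TableURL']
```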

## Wrapping up

The approach is to inspect the structure of all the XTbML files, recording which children each tag has and whether those children always appear exactly once. The object-oriented API is designed from these observations.
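
One way to turn these statistics into a field type is sketched below. `field_kind` is a hypothetical helper and its three labels are illustrative; the two `CounterStats` instances copy figures from the `./ContentClassification` output in this notebook, where 3012 parent elements were inspected.

```python
from dataclasses import dataclass

@dataclass
class CounterStats:
    exists: int
    counts: int
    uniqueCounts: set[int]

def field_kind(stats: CounterStats, parents: int) -> str:
    """Suggest how a child tag could be modeled, given how often it appears.

    parents is the number of parent elements that were inspected.
    """
    if stats.uniqueCounts != {1}:
        return "list"      # repeated under at least one parent
    if stats.exists < parents:
        return "optional"  # at most once, but missing under some parents
    return "required"      # exactly once under every parent

# Figures copied from the ./ContentClassification output in this notebook.
comments = CounterStats(exists=3012, counts=3012, uniqueCounts={1})
keyword = CounterStats(exists=3012, counts=9086, uniqueCounts={2, 3, 4, 5})

print(field_kind(comments, parents=3012))  # required
print(field_kind(keyword, parents=3012))   # list
```

This matches the dataclasses above, where `Comments` is a plain `str` and `KeyWords` is a `list[str]`.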

In [55]:
childrenStatsPrint(".")
childrenStatsPrint("./ContentClassification")
childrenStatsPrint("./Table")
childrenStatsPrint("./Table/MetaData")
childrenStatsPrint("./Table/MetaData/AxisDef")
childrenStatsPrint("./Table/Values/Axis/Axis")

.
{'ContentClassification': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'Table': CounterStats(exists=3012, counts=4483, uniqueCounts={1, 2, 3, 4, 5, 6, 7, 44, 55, 28, 29})}
./ContentClassification
{'Comments': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'ContentType': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'KeyWord': CounterStats(exists=3012, counts=9086, uniqueCounts={2, 3, 4, 5}),
 'ProviderDomain': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'ProviderName': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'TableDescription': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'TableIdentity': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'TableName': CounterStats(exists=3012, counts=3012, uniqueCounts={1}),
 'TableReference': CounterStats(exists=3012, counts=3012, uniqueCounts={1})}
./Table
{'MetaData': CounterStats(exists=4483, counts=4483, uniqueCounts={1}),
 'Values': CounterStats(exist