
Grounders

In Eidos, all grounding is done against an ontology. Internally, each ontology is represented as a DomainOntology. The DomainOntology still holds the original YAML ontology content; that is, it still contains the String examples, etc. However, this format isn't particularly efficient to work with computationally, so instead of interacting with it directly we use an OntologyGrounder (note that in practice all grounders extend the EidosOntologyGrounder subclass). The EidosOntologyGrounder is initialized with a DomainOntology and derives more efficient representations of the ontology node examples and patterns:

  • conceptEmbeddings: the internal representations of the ontology nodes in terms of their examples. Each node has a ConceptEmbedding, which essentially pairs a name with a vector representation that is more or less the average of the example/definition terms. See the class definition (and the sketch below) for more detail.
  • conceptPatterns: similarly, each node in the ontology is represented in terms of its (optional) regexes. Each ConceptPattern essentially pairs a name with an optional array of Regex.
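
For orientation, here is a minimal sketch of what these derived representations look like conceptually. The names ConceptEmbedding and ConceptPattern come from the description above, but the field names and the averaging helper below are illustrative assumptions, not the exact Eidos definitions, so check the actual class definitions for details.

import scala.util.matching.Regex

// Illustrative sketch only: field names and types are assumptions,
// not the exact Eidos class definitions.
case class ConceptEmbedding(
  name: String,            // the ontology node name
  embedding: Array[Float]  // roughly the average of the example/definition word vectors
)

case class ConceptPattern(
  name: String,                  // the ontology node name
  patterns: Option[Array[Regex]] // the node's optional regexes
)

// Hypothetical helper showing how a node embedding could be derived:
// average the vectors of the (non-stop) example/definition tokens.
def averageVectors(vectors: Seq[Array[Float]]): Array[Float] = {
  require(vectors.nonEmpty, "need at least one vector")
  val dim = vectors.head.length
  val sums = new Array[Float](dim)
  for (v <- vectors; i <- 0 until dim) sums(i) += v(i)
  sums.map(_ / vectors.length)
}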

There are essentially two different grounders in active use, described below:

Flat grounding

Flat grounding is much simpler than compositional grounding, but it pushes the complexity into the ontology. That is, more complex concepts must each have their own node in the ontology, and a given mention is aligned to one node at a time. (In practice we return the top k, since there is always some uncertainty in automated grounding.)

The WM flat ontology has two main "branches" -- the interventions and the main ontology. Since the interventions were so numerous and at such a different level of granularity, the grounding algorithm first considers the main branch of the ontology, and only looks at the intervention branch if certain conditions are met.

The grounding essentially proceeds as follows (a sketch of this flow appears after the list):

  1. First, it tries to match a main branch ontology node regex. If one matches, then it is considered a perfect match and that node is selected.
  2. If there are no main branch regex matches, then the word embedding representation of the mention (the average of the embeddings for the tokens in the canonical name) is compared to all main branch nodes (their embedding representation is similarly the average of all non-stop words in the examples and definition).
  3. Then, the grounder checks whether the mention should also be grounded against the intervention branch. This is currently done through pattern matching, using a predefined set of intervention trigger patterns plus any patterns included in that branch of the ontology.
  4. If allowed to match against the intervention branch, then again first the branch is checked for node regex matches and then embedding matches.
  5. If the algorithm didn't yet return a grounding, then finally, all groundings are combined, sorted, and the top k are returned.
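
Putting the steps together, here is a self-contained Scala sketch of that control flow. The Node class, the helper functions, and the intervention trigger check are placeholders for illustration; they are not the actual Eidos classes or methods.

import scala.util.matching.Regex

// Toy node representation for this sketch; not the Eidos ConceptEmbedding/ConceptPattern classes.
case class Node(name: String, pattern: Option[Regex], embedding: Array[Float])

// Cosine similarity between two vectors.
def cosine(a: Array[Float], b: Array[Float]): Float = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0f else (dot / norms).toFloat
}

// Sketch of the control flow in steps 1-5 above; the helper names and the
// intervention trigger check are illustrative assumptions, not Eidos API.
def flatGround(
  mentionText: String,
  mentionEmbedding: Array[Float],
  mainNodes: Seq[Node],
  interventionNodes: Seq[Node],
  interventionTriggers: Seq[Regex],
  k: Int
): Seq[(String, Float)] = {
  def regexMatches(nodes: Seq[Node]): Seq[(String, Float)] =
    nodes.collect { case n if n.pattern.exists(_.findFirstIn(mentionText).isDefined) => (n.name, 1f) }
  def embeddingMatches(nodes: Seq[Node]): Seq[(String, Float)] =
    nodes.map(n => (n.name, cosine(mentionEmbedding, n.embedding)))

  // 1. A regex match against a main-branch node counts as a perfect match.
  val mainRegex = regexMatches(mainNodes)
  if (mainRegex.nonEmpty) return mainRegex.take(k)

  // 2. Otherwise, score the mention embedding against every main-branch node.
  val mainEmbedding = embeddingMatches(mainNodes)

  // 3-4. Consult the interventions branch only if a trigger pattern fires,
  //      again checking regexes first and then embeddings.
  val interventions =
    if (interventionTriggers.exists(_.findFirstIn(mentionText).isDefined))
      regexMatches(interventionNodes) ++ embeddingMatches(interventionNodes)
    else Seq.empty[(String, Float)]

  // 5. Combine, sort by score, and return the top k.
  (mainEmbedding ++ interventions).sortBy(_._2).reverse.take(k)
}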

Compositional grounding

Compositional grounding is a much more complicated algorithm, as the complexity burden has shifted from the ontology to the grounding process. To understand the approach, you first need to understand the compositional grounding representation. Each grounding is a 4-tuple, whose slots represent (in order):

  1. the theme of the concept
  2. any property of that theme
  3. the process that applies to or acts on the theme, if any
  4. any property of that process

The case class that stores this representation is a PredicateTuple. This is the primary target of the compositional grounder: given an EidosMention, return one or more PredicateTuples (which are wrapped into a PredicateGrounding for consistency; note that you can simplify this if desired!). Currently, in addition to each of the individual groundings for the 4 slots having a score, the overall tuple also has an aggregated score.
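
As a rough illustration (not the actual Eidos PredicateTuple or PredicateGrounding definitions), the tuple can be pictured like this; the field names, the aggregate-score calculation, and the worked example at the bottom are all assumptions:

// Illustrative sketch only; not the actual Eidos case classes.
case class SlotGrounding(nodeName: String, score: Float)

case class PredicateTupleSketch(
  theme: Option[SlotGrounding],           // slot 1: the theme of the concept
  themeProperty: Option[SlotGrounding],   // slot 2: a property of that theme
  process: Option[SlotGrounding],         // slot 3: the process acting on the theme
  processProperty: Option[SlotGrounding]  // slot 4: a property of that process
) {
  // One plausible aggregate over the filled slots; the actual Eidos
  // aggregation may be computed differently.
  def aggregateScore: Float = {
    val filled = Seq(theme, themeProperty, process, processProperty).flatten
    if (filled.isEmpty) 0f else filled.map(_.score).sum / filled.size
  }
}

// Hypothetical example: for a mention like "the price of oil increased",
// a plausible tuple would fill theme = oil, themeProperty = price,
// process = increase, and leave processProperty empty, with per-slot
// scores coming from the grounder.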

In order to generate these compositional groundings, we rely heavily on the Semantic Role Labeling (SRL) provided through the CluLab processors library. The approach works as follows:

if no predicatesAndArguments:
    // First try grounding to Property branch
    groundToBranches(text, Property)
    // If we use up everything, just return a Property grounding
    if (nothing left):
        return groundings
        stop
    // Otherwise, continue trying to ground to Concept and Process branches w/ remaining text
    else:
        groundToBranches(remainingText, Seq(Concept, Process))
        return groundings
        stop

if predicatesAndArguments:
    // Try to ground the entire mention text before separating into preds/args individually
    // Match the mention text against the node names in the ontology
    // For example, try to ground "climate change" as a whole before splitting it into "climate" and "change"
    // This will match the existing ontology node "climate_change", which is not the top grounding for either pred/arg separately
    // If the entire mention text matches multiple nodes, keep the match that uses most material from the mention
    exactMatchGrounding = getExactMatches(text)
    // If we use up everything, just return the exact match grounding
    if (nothing left):
        return exactMatchGrounding
        stop
    // Otherwise, continue trying to ground the remaining text
    else:
        allGroundings = [exactMatchGrounding]
        for (item <- remainingText):
            isArgument = true if (incoming SRL edges AND no outgoing SRL edges)
            isPredicate = true if (no incoming SRL edges OR there are outgoing SRL edges)
            // try to ground to Property branch first
            groundToBranches(item, Property)
            if properties nonEmpty:
                allGroundings.append(grounding)
                continue to next item
            // if it's not a Property, move on
            else:
                // If it's an argument, ground to Concept
                if isArgument:
                    groundToBranches(item, Concept)
                    allGroundings.append(grounding)
                // If it's a predicate, ground to Process
                elif isPredicate:
                    groundToBranches(item, Process)
                    allGroundings.append(grounding)
        return allGroundings


def groundToBranches(mentionText, branches):
    // get exact matches based on node name
    val exactMatchGroundings = getExactMatches(mentionText)
    // get pattern matches based on regex patterns
    val patternMatchGroundings = getPatternMatches(mentionText)
    // proceed to w2v
    // get the min edit distance to the examples for each node
    val exampleScores = getExampleScores(mentionText)
    // get the w2v similarity score for each node
    val embeddingScores = getEmbeddingScores(mentionText)
    // combine the edit-distance and cosine-similarity scores into one score per node
    val w2vGroundings = comboScore(exampleScores, embeddingScores)
    // return all matched groundings; these will be filtered and sorted later
    return exactMatchGroundings ++ patternMatchGroundings ++ w2vGroundings
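
The comboScore step combines a string-overlap signal (edit distance to the node examples) with the w2v cosine similarity. The sketch below shows one simple way such a per-node combination could look; the edit-distance implementation, the normalization, and the equal weighting are assumptions for illustration, not the actual Eidos formula.

// Sketch only: the exact comboScore weighting used by Eidos may differ.

// Levenshtein edit distance between two strings.
def editDistance(a: String, b: String): Int = {
  val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dist(i)(j) = math.min(
      math.min(dist(i - 1)(j) + 1, dist(i)(j - 1) + 1),
      dist(i - 1)(j - 1) + cost
    )
  }
  dist(a.length)(b.length)
}

// Turn the minimum edit distance to any node example into a 0-1 similarity.
def exampleScore(mentionText: String, examples: Seq[String]): Float = {
  require(examples.nonEmpty, "a node needs at least one example")
  val minDist = examples.map(editDistance(mentionText, _)).min
  val maxLen = math.max(mentionText.length, examples.map(_.length).max)
  if (maxLen == 0) 1f else 1f - minDist.toFloat / maxLen
}

// Combine the edit-distance similarity with the w2v cosine similarity for one node.
// The equal weighting is an arbitrary assumption for illustration.
def comboScore(exampleSim: Float, embeddingSim: Float): Float =
  0.5f * exampleSim + 0.5f * embeddingSim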

Everything below is an older implementation of the compositional grounding algorithm, kept for historical reasons

  1. Given a mention (which corresponds to a span of the sentence), the valid predicates are found. These are the tokens that are (a) identified by the SRL as predicates and (b) located within the span of the mention.
  2. If no valid predicates are found, we consider the mention span to be essentially a theme, though we check for any properties which may be there and attach them if found. This results in a 4-tuple that has either only one slot filled (the theme) or two (when the theme has a property).
  3. However, if there are predicates, we consider each of them in order of shortest graph distance to the syntactic root of the sentence. For each we:
  • Generate distinct paths through the SRL graph from that predicate to each of its leaf themes.
  • Iterate through each of these paths and look for properties. If any are found, we convert the property into an attachment on its theme (in SRL, properties tend to surface as predicates) and pop it from the path list.
  • Iterate through the remaining nodes in the SRL graph path, again starting from the deepest node, and create the tuple. The deepest node is taken as the theme, and if there is a second-deepest node it is taken as the process. Since properties were previously converted to attachments on the nodes, any that exist are included in the 4-tuple at this point. If there was only one node in the list, then it is by default the theme.