# Elyra NLP Pipeline Editor Integration

This notebook shows how to translate the JSON version of the Elyra pipeline editor's NLP nodes into Python rules that use Text Extensions for Pandas' implementation of [spanner algebra](https://text-extensions-for-pandas.readthedocs.io/en/latest/spanner.html).

In [1]:
import json
import os
import pathlib
import sys

from abc import ABC, abstractmethod

from typing import Union

import pandas as pd
from memoized_property import memoized_property

try:
    import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
    # If we're running from within the project source tree and the parent Python
    # environment doesn't have the text_extensions_for_pandas package, use the
    # version in the local source tree.
    if not os.getcwd().endswith("notebooks"):
        raise e
    if ".." not in sys.path:
        sys.path.insert(0, "..")
    import text_extensions_for_pandas as tp

In [2]:
# Start by parsing the Elyra JSON data file
INPUT_JSON_FILE = "../test-sequence-nlp.json"

with open(INPUT_JSON_FILE, "r") as f:
    graph_json = json.load(f)
graph_json.keys()

dict_keys(['flow', 'nodes'])

In [3]:
# Nodes of the graph are under the "nodes" key.
graph_json["nodes"]

[{'label': 'Metric',
  'nodeId': 'd88d7167-3b2c-427a-9b50-27049feb904d',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['Revenues'],
  'caseSensitivity': 'ignore',
  'lemmaMatch': False,
  'externalResourceChecked': False},
 {'label': 'Preposition',
  'nodeId': 'dfe40d07-c154-467c-aab1-f17b13ac6122',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['from', 'to'],
  'caseSensitivity': 'ignore',
  'lemmaMatch': False,
  'externalResourceChecked': False},
 {'label': 'Division',
  'nodeId': '4cc4d6f5-a5ff-4c80-891e-f9c48285977e',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['Software',
   'Hardware',
   'Global Business Services',
   'Global Technology Services'],
  'caseSensitivity': 'match',
  'lemmaMatch': False,
  'externalResourceChecked': False},
 {'label': 'RevenueOfDivision',
  'nodeId': 'eba98361-462f-4332-9dc6-cf6e10c80bb3',
  't

In [4]:
# Edges of the graph can be pulled out of the contents of
# the "flow" key.
# Later we'll massage this data into a DataFrame for easier
# manipulation.
[{k: (n[k] if k in n else None) 
  for k in ("id", "inputs", "outputs")}
    for n in graph_json["flow"]["pipelines"][0]["nodes"]]

[{'id': 'd88b8da4-b941-45d5-8ff9-bb6630d82f44',
  'inputs': None,
  'outputs': [{'id': 'outPort',
    'app_data': {'ui_data': {'cardinality': {'min': 1, 'max': -1},
      'label': 'Output Port'}}}]},
 {'id': 'd88d7167-3b2c-427a-9b50-27049feb904d',
  'inputs': [{'id': 'inPort',
    'app_data': {'ui_data': {'cardinality': {'min': 1, 'max': 1},
      'label': 'Input Port'}},
    'links': [{'id': '16712335-7e47-48cf-954b-ec2864ef726f',
      'node_id_ref': 'd88b8da4-b941-45d5-8ff9-bb6630d82f44',
      'port_id_ref': 'outPort'}]}],
  'outputs': [{'id': 'outPort',
    'app_data': {'ui_data': {'cardinality': {'min': 1, 'max': -1},
      'label': 'Output Port'}}}]},
 {'id': 'dfe40d07-c154-467c-aab1-f17b13ac6122',
  'inputs': [{'id': 'inPort',
    'app_data': {'ui_data': {'cardinality': {'min': 1, 'max': 1},
      'label': 'Input Port'}},
    'links': [{'id': 'd657e397-ee22-4559-b8ff-d12e376a195f',
      'node_id_ref': 'd88b8da4-b941-45d5-8ff9-bb6630d82f44',
      'port_id_ref': 'outPort'}]}],
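
As a rough sketch of the DataFrame mentioned in the comment above, the `links` entries under each node's `inputs` can be flattened into one row per edge. The `flow_nodes` list below is a hand-trimmed copy of the first two entries of the output (the `app_data` fields are dropped). The helper name `edges_to_dataframe` and the column names are illustrative assumptions, not part of Elyra or Text Extensions for Pandas.

```python
import pandas as pd

# Hand-trimmed copy of the first two flow nodes shown above (app_data omitted).
flow_nodes = [
    {"id": "d88b8da4-b941-45d5-8ff9-bb6630d82f44",
     "outputs": [{"id": "outPort"}]},
    {"id": "d88d7167-3b2c-427a-9b50-27049feb904d",
     "inputs": [{"id": "inPort",
                 "links": [{"id": "16712335-7e47-48cf-954b-ec2864ef726f",
                            "node_id_ref": "d88b8da4-b941-45d5-8ff9-bb6630d82f44",
                            "port_id_ref": "outPort"}]}],
     "outputs": [{"id": "outPort"}]},
]


def edges_to_dataframe(nodes: list) -> pd.DataFrame:
    """Return one row per link, pointing from the upstream node to the
    node that declares the link on one of its input ports."""
    rows = [
        {"from_node": link["node_id_ref"],
         "from_port": link["port_id_ref"],
         "to_node": node["id"],
         "to_port": port["id"]}
        for node in nodes
        for port in node.get("inputs") or []   # source nodes have no inputs
        for link in port.get("links") or []
    ]
    return pd.DataFrame(
        rows, columns=["from_node", "from_port", "to_node", "to_port"])


edges = edges_to_dataframe(flow_nodes)
print(edges)
```

Nodes without an `inputs` key, such as the first one, contribute no rows, so the two nodes above yield a single edge.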


## Manual Translation of Individual Nodes

Let's start with a manual translation of the nodes of our example graph into
Python code. Here we define a function for each node.

### Metric

The `Metric` node is a dictionary node with this JSON definition:

```json
{'label': 'Metric',
  'nodeId': 'd88d7167-3b2c-427a-9b50-27049feb904d',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['Revenues'],
  'caseSensitivity': 'ignore',
  'lemmaMatch': False,
  'externalResourceChecked': False}
```

We translate this JSON into a Text Extensions for Pandas inline dictionary and a function that evaluates this dictionary.

In [5]:
metric_dict = tp.spanner.create_dict(["Revenues"])


def metric(tokens: pd.Series):
    return tp.spanner.extract_dict(tokens, metric_dict, "metric")

In [6]:
EXAMPLE_DOC = "../resources/financialStatements/4Q2007.txt"
doc_text = pathlib.Path(EXAMPLE_DOC).read_text()
doc_tokens = tp.io.spacy.make_tokens(doc_text)

metric(doc_tokens).head()

Unnamed: 0,metric
0,"[330, 338): 'Revenues'"
1,"[400, 408): 'revenues'"
2,"[480, 488): 'revenues'"
3,"[614, 622): 'revenues'"
4,"[683, 691): 'revenues'"


In [7]:
# We can view results in context
metric(doc_tokens)["metric"].array

Unnamed: 0,begin,end,begin token,end token,context
0,330,338,139,140,Revenues
1,400,408,155,156,revenues
2,480,488,172,173,revenues
3,614,622,205,206,revenues
4,683,691,222,223,revenues
5,722,730,236,237,revenues
6,1167,1175,332,333,revenues
7,2001,2009,502,503,revenues
8,2125,2133,530,531,Revenues
9,2249,2257,560,561,revenues


### Preposition

The `Preposition` node also uses a dictionary. The JSON for this node is:

```json
{'label': 'Preposition',
  'nodeId': 'dfe40d07-c154-467c-aab1-f17b13ac6122',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['from', 'to'],
  'caseSensitivity': 'ignore',
  'lemmaMatch': False,
  'externalResourceChecked': False}
```

The translation is similar to what we do for `Metric` above.

In [8]:
preposition_dict = tp.spanner.create_dict(["from", "to"])


def preposition(tokens: pd.Series):
    return tp.spanner.extract_dict(tokens, preposition_dict,
                                   "preposition")


preposition(doc_tokens)

Unnamed: 0,preposition
0,"[692, 696): 'from'"
1,"[863, 867): 'from'"
2,"[1032, 1036): 'from'"
3,"[1281, 1285): 'from'"
35,"[1867, 1869): 'to'"
...,...
71,"[12214, 12216): 'to'"
72,"[12503, 12505): 'to'"
73,"[12567, 12569): 'to'"
74,"[13031, 13033): 'to'"
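
Because both dictionary nodes follow the same template, this translation could be mechanized. The sketch below renders Python source text for a dictionary node from its JSON definition. The helper names `to_identifier` and `dictionary_node_to_source` are hypothetical, and the emitted text simply mirrors the hand-written cells above; it is not the code generator of any library.

```python
import re

# Source template that mirrors the hand-written Metric and Preposition cells.
TEMPLATE = '''{name}_dict = tp.spanner.create_dict({items!r})


def {name}(tokens: pd.Series):
    return tp.spanner.extract_dict(tokens, {name}_dict, "{name}")
'''


def to_identifier(label: str) -> str:
    """Convert a node label such as 'RevenueOfDivision' to snake_case."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label).lower()


def dictionary_node_to_source(node: dict) -> str:
    """Render Python source text for one node of type 'dictionary'."""
    if node["type"] != "dictionary":
        raise ValueError(f"Not a dictionary node: {node['label']}")
    return TEMPLATE.format(name=to_identifier(node["label"]),
                           items=node["items"])


# Only the fields the template needs are copied from the JSON above.
preposition_node = {"label": "Preposition", "type": "dictionary",
                    "items": ["from", "to"]}
source = dictionary_node_to_source(preposition_node)
print(source)
```

The sketch ignores fields such as `caseSensitivity` and `lemmaMatch`; a fuller translator would need to decide how to handle them.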


### Division

The `Division` node is another dictionary node with the following JSON:

```json
{'label': 'Division',
  'nodeId': '4cc4d6f5-a5ff-4c80-891e-f9c48285977e',
  'type': 'dictionary',
  'description': 'Phrases to be matched.',
  'isValid': True,
  'items': ['Software',
   'Hardware',
   'Global Business Services',
   'Global Technology Services'],
  'caseSensitivity': 'match',
  'lemmaMatch': False,
  'externalResourceChecked': False}
```

Text Extensions for Pandas doesn't currently implement case-sensitive dictionary matching, so
we translate to a case-insensitive match for now. As a result, the output also includes lowercase matches such as `'software'`. Here's the Python code.

In [9]:
division_dict = tp.spanner.create_dict(
    ["Software",
     "Hardware",
     "Global Business Services",
     "Global Technology Services"])


def division(tokens: pd.Series):
    return tp.spanner.extract_dict(tokens, division_dict,
                                   "division")


division(doc_tokens).head()

Unnamed: 0,division
12,"[373, 399): 'Global Technology Services'"
13,"[455, 479): 'Global Business Services'"
0,"[605, 613): 'Software'"
1,"[1570, 1578): 'software'"
14,"[2546, 2572): 'Global Technology Services'"
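
One way to approximate the node's `'caseSensitivity': 'match'` setting on top of a case-insensitive match is to post-filter on the matched text. The sketch below shows the idea on plain strings copied from the output above. The helper name `keep_exact_case` is hypothetical, and applying it to real results assumes you first obtain each span's covered text as a string.

```python
DIVISION_ENTRIES = ["Software", "Hardware",
                    "Global Business Services", "Global Technology Services"]


def keep_exact_case(matched_texts, entries):
    """Return a boolean mask that is True where the matched text equals
    a dictionary entry exactly, including case."""
    allowed = set(entries)
    return [text in allowed for text in matched_texts]


# Matched text of the five rows displayed above.
matched = ["Global Technology Services", "Global Business Services",
           "Software", "software", "Global Technology Services"]
mask = keep_exact_case(matched, DIVISION_ENTRIES)
print(mask)  # [True, True, True, False, True]
```

An exact string comparison like this would also reject a multi-word match whose words are separated by something other than a single space, so treat it as a starting point rather than a faithful implementation of the editor's semantics.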


### RevenueOfDivision

The `RevenueOfDivision` node is a sequence pattern node with the following definition.

```json
{'label': 'RevenueOfDivision',
  'nodeId': 'eba98361-462f-4332-9dc6-cf6e10c80bb3',
  'type': 'sequence',
  'description': 'Connect your extractors for execution',
  'isValid': True,
  'pattern': '(<Metric.Metric>)<Token>{0,0}(<Preposition.Preposition>)<Token>{0,2}(<Division.Division>)',
  'upstreamNodes': [{'label': 'Metric',
    'nodeId': 'd88d7167-3b2c-427a-9b50-27049feb904d'},
   {'label': 'Preposition', 'nodeId': 'dfe40d07-c154-467c-aab1-f17b13ac6122'},
   {'label': 'Division', 'nodeId': '4cc4d6f5-a5ff-4c80-891e-f9c48285977e'}],
  'tokens': [{'min': '0', 'max': '0'}, {'min': '0', 'max': '2'}]}
```
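
The `pattern` string interleaves references to upstream nodes with token gaps. As a minimal sketch, assuming every pattern has the linear form `(<A.B>)<Token>{min,max}(<C.D>)...` seen in this example, a regular expression can split it into node labels and gap bounds. The function name `parse_linear_pattern` is hypothetical, the first name in each reference is assumed to be the node label, and any other pattern syntax the editor may produce is not handled.

```python
import re

ATOM_RE = re.compile(r"\(<(\w+)\.(\w+)>\)")
GAP_RE = re.compile(r"<Token>\{(\d+),(\d+)\}")


def parse_linear_pattern(pattern: str):
    """Split a linear sequence pattern into node labels and (min, max) gaps."""
    atoms, gaps, pos = [], [], 0
    while pos < len(pattern):
        # Atoms and gaps alternate, starting with an atom.
        expect_atom = len(atoms) == len(gaps)
        match = (ATOM_RE if expect_atom else GAP_RE).match(pattern, pos)
        if match is None:
            raise ValueError(f"Unsupported pattern syntax at offset {pos}")
        if expect_atom:
            atoms.append(match.group(1))
        else:
            gaps.append((int(match.group(1)), int(match.group(2))))
        pos = match.end()
    if len(atoms) != len(gaps) + 1:
        raise ValueError("Pattern must end with a node reference")
    return atoms, gaps


atoms, gaps = parse_linear_pattern(
    "(<Metric.Metric>)<Token>{0,0}(<Preposition.Preposition>)"
    "<Token>{0,2}(<Division.Division>)")
print(atoms)  # ['Metric', 'Preposition', 'Division']
print(gaps)   # [(0, 0), (0, 2)]
```

The parsed gaps agree with the node's `tokens` field, which lists the same bounds as strings.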

We can translate linear sequence patterns to Python with method chaining. Here are some helper classes for doing so.

In [10]:
class SequencePattern(ABC):
    """
    Base class for sequence patterns
    """

    @abstractmethod
    def result(self) -> tp.SpanArray:
        pass

    def token(self, min_: int, max_: int):
        return TokenExpr(self, min_, max_)


class Atom(SequencePattern):
    """
    Atomic (single-element) sequence pattern with support for right-chaining 
    to produce linear sequence patterns.
    """

    def __init__(self, inputs: Union[pd.Series, tp.SpanArray]):
        """
        :param inputs: Input span data.
        """
        self._spans = tp.SpanArray.make_array(inputs)

    def result(self) -> tp.SpanArray:
        return self._spans


class Binary(SequencePattern):
    """
    Binary sequence with two parts separated by a token distance.
    """

    def __init__(self, lhs: SequencePattern, rhs: SequencePattern,
                 min_: int, max_: int):
        """
        :param lhs: Left-hand input
        :param rhs: Right-hand input
        :param min_: Minimum token distance
        :param max_: Maximum token distance
        """
        self._lhs = lhs
        self._rhs = rhs
        self._min = min_
        self._max = max_

    def result(self) -> tp.SpanArray:
        lhs_spans = self._lhs.result()
        rhs_spans = self._rhs.result()
        pairs = tp.spanner.adjacent_join(pd.Series(lhs_spans),
                                         pd.Series(rhs_spans),
                                         min_gap=self._min,
                                         max_gap=self._max)
        result_series = pairs["first"] + pairs["second"]
        return result_series.array


class TokenExpr:
    """
    A token distance expression with only a left-hand argument, as returned by 
    :func:`SequencePattern.token()`.

    For internal use only.
    """

    def __init__(self, lhs: SequencePattern, min_: int, max_: int):
        self._lhs = lhs
        self._min = min_
        self._max = max_

    def __call__(self, rhs: SequencePattern) -> SequencePattern:
        return Binary(self._lhs, rhs, self._min, self._max)

And here's a manual translation of the sequence pattern using the above helper classes.

In [11]:
metric_atom = Atom(metric(doc_tokens)["metric"])
preposition_atom = Atom(preposition(doc_tokens)["preposition"])
division_atom = Atom(division(doc_tokens)["division"])

pattern = metric_atom.token(0, 0)(preposition_atom).token(0, 2)(division_atom)
pattern.result()

Unnamed: 0,begin,end,begin token,end token,context
0,4208,4234,922,926,Revenues from the Software
1,4951,4996,1060,1065,Revenues from Information Management software
2,5077,5106,1079,1083,Revenues from Tivoli software
3,5251,5279,1104,1108,revenues from Lotus software
4,5423,5454,1132,1136,Revenues from Rational software
5,8278,8322,1751,1757,Revenues from the Global Technology Services
6,8436,8478,1782,1788,Revenues from the Global Business Services
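
To make the gap arithmetic concrete, here is a pure-Python sketch of what each `Binary` step computes, with spans reduced to `(begin_token, end_token)` pairs. The function `join_with_gap` is a hypothetical stand-in for the role that `tp.spanner.adjacent_join` plus span addition play above, not the library's implementation. The per-word token offsets are inferred from the first result row on the assumption that each word is a single token.

```python
def join_with_gap(lhs, rhs, min_gap, max_gap):
    """Pair each left span with every right span that starts between
    min_gap and max_gap tokens after the left span ends, and return
    the merged (begin_token, end_token) ranges."""
    return [
        (l_begin, r_end)
        for l_begin, l_end in lhs
        for r_begin, r_end in rhs
        if min_gap <= r_begin - l_end <= max_gap
    ]


# "Revenues from the Software" covers tokens [922, 926).
metric_spans = [(922, 923)]       # Revenues
preposition_spans = [(923, 924)]  # from
division_spans = [(925, 926)]     # Software ("the" is token 924)

step_1 = join_with_gap(metric_spans, preposition_spans, 0, 0)
step_2 = join_with_gap(step_1, division_spans, 0, 2)
print(step_1)  # [(922, 924)]
print(step_2)  # [(922, 926)]
```

The chained call `metric_atom.token(0, 0)(preposition_atom).token(0, 2)(division_atom)` nests in the same left-to-right order: the second join consumes the merged spans produced by the first.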


## Manual Translation of Entire Graph into a Class

The manual translations so far have produced standalone Python functions.
To tie the entire rule set together and make it easy to evaluate on multiple
documents, we create a Python class whose instances each wrap a single document
and compute the results for that document on demand. Here's a manually created version of such a class.

In [12]:
class RevenueDoc:
    # Singletons for dictionaries, attached to the class object
    metric_dict = tp.spanner.create_dict(["Revenues"])
    preposition_dict = tp.spanner.create_dict(["from", "to"])
    division_dict = tp.spanner.create_dict(
        ["Software",
         "Hardware",
         "Global Business Services",
         "Global Technology Services"])

    def __init__(self, doc_text: str):
        self._doc_text = doc_text
        self._doc_tokens = tp.io.spacy.make_tokens(doc_text)

    @memoized_property
    def metric(self) -> pd.DataFrame:
        """
        Compiled version of the graph node `Metric`.
        """
        return tp.spanner.extract_dict(self._doc_tokens,
                                       self.metric_dict, "metric")

    @memoized_property
    def preposition(self) -> pd.DataFrame:
        """
        Compiled version of the graph node `Preposition`.
        """
        return tp.spanner.extract_dict(self._doc_tokens,
                                       self.preposition_dict,
                                       "preposition")

    @memoized_property
    def division(self) -> pd.DataFrame:
        """
        Compiled version of the graph node `Division`.
        """
        return tp.spanner.extract_dict(self._doc_tokens,
                                       self.division_dict,
                                       "division")

    @memoized_property
    def revenue_of_division(self) -> pd.DataFrame:
        """
        Compiled version of the graph node `RevenueOfDivision`.
        """
        metric_atom = Atom(self.metric["metric"])
        preposition_atom = Atom(self.preposition["preposition"])
        division_atom = Atom(self.division["division"])

        pattern = metric_atom.token(0, 0)(preposition_atom).token(0, 2)(division_atom)
        return pd.DataFrame({
            "revenue_of_division": pattern.result()
        })


To use this class, we create an instance and pass the document text to the constructor.

In [13]:
doc_results = RevenueDoc(doc_text)
doc_results.revenue_of_division

Unnamed: 0,revenue_of_division
0,"[4208, 4234): 'Revenues from the Software'"
1,"[4951, 4996): 'Revenues from Information Manag..."
2,"[5077, 5106): 'Revenues from Tivoli software'"
3,"[5251, 5279): 'revenues from Lotus software'"
4,"[5423, 5454): 'Revenues from Rational software'"
5,"[8278, 8322): 'Revenues from the Global Techno..."
6,"[8436, 8478): 'Revenues from the Global Busine..."


The `revenue_of_division` column of this DataFrame is backed by a `SpanArray`, so we can ask it to visualize itself.

In [14]:
doc_results.revenue_of_division["revenue_of_division"].array

Unnamed: 0,begin,end,begin token,end token,context
0,4208,4234,922,926,Revenues from the Software
1,4951,4996,1060,1065,Revenues from Information Management software
2,5077,5106,1079,1083,Revenues from Tivoli software
3,5251,5279,1104,1108,revenues from Lotus software
4,5423,5454,1132,1136,Revenues from Rational software
5,8278,8322,1751,1757,Revenues from the Global Technology Services
6,8436,8478,1782,1788,Revenues from the Global Business Services


Intermediate results are available as DataFrames through memoized properties named after the corresponding graph nodes.

In [15]:
doc_results.division.head()

Unnamed: 0,division
12,"[373, 399): 'Global Technology Services'"
13,"[455, 479): 'Global Business Services'"
0,"[605, 613): 'Software'"
1,"[1570, 1578): 'software'"
14,"[2546, 2572): 'Global Technology Services'"
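
Memoization is what makes repeated access cheap: each view is computed on first access and then cached on the instance. The sketch below shows that behavior using the standard library's `functools.cached_property` (Python 3.8+), which behaves similarly to `memoized_property` for this purpose. The `CountingDoc` class and its call counter are illustrative only.

```python
from functools import cached_property


class CountingDoc:
    """Toy document wrapper that counts how often a view is computed."""

    def __init__(self, doc_text: str):
        self._doc_text = doc_text
        self.num_computations = 0

    @cached_property
    def words(self) -> list:
        # Runs only on first access; later accesses reuse the stored list.
        self.num_computations += 1
        return self._doc_text.split()


doc = CountingDoc("Revenues from the Software segment")
first = doc.words
second = doc.words
print(doc.num_computations)  # 1
print(first is second)       # True
```

Because the cache lives on the instance, each new document object starts with empty caches, which fits the one-object-per-document design used here.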


We can join the views to show relationships between spans.

In [16]:
def add_column(df: pd.DataFrame, context: pd.DataFrame,
               df_col_name: str, context_col_name: str) -> pd.DataFrame:
    """
    Add a new column containing context joined with a target output span
    using the containment relationship to do the join.
    """
    pairs = tp.spanner.contain_join(df[df_col_name], context[context_col_name],
                                    df_col_name, context_col_name)
    result = df.copy()
    result = result.merge(pairs)
    return result

results = doc_results.revenue_of_division
results = add_column(results, doc_results.metric,
                     "revenue_of_division", "metric")
results = add_column(results, doc_results.preposition,
                     "revenue_of_division", "preposition")
results = add_column(results, doc_results.division,
                     "revenue_of_division", "division")

results

Unnamed: 0,revenue_of_division,metric,preposition,division
0,"[4208, 4234): 'Revenues from the Software'","[4208, 4216): 'Revenues'","[4217, 4221): 'from'","[4226, 4234): 'Software'"
1,"[4951, 4996): 'Revenues from Information Manag...","[4951, 4959): 'Revenues'","[4960, 4964): 'from'","[4988, 4996): 'software'"
2,"[5077, 5106): 'Revenues from Tivoli software'","[5077, 5085): 'Revenues'","[5086, 5090): 'from'","[5098, 5106): 'software'"
3,"[5251, 5279): 'revenues from Lotus software'","[5251, 5259): 'revenues'","[5260, 5264): 'from'","[5271, 5279): 'software'"
4,"[5423, 5454): 'Revenues from Rational software'","[5423, 5431): 'Revenues'","[5432, 5436): 'from'","[5446, 5454): 'software'"
5,"[8278, 8322): 'Revenues from the Global Techno...","[8278, 8286): 'Revenues'","[8287, 8291): 'from'","[8296, 8322): 'Global Technology Services'"
6,"[8436, 8478): 'Revenues from the Global Busine...","[8436, 8444): 'Revenues'","[8445, 8449): 'from'","[8454, 8478): 'Global Business Services'"
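
For intuition about the containment join, the following sketch reproduces the idea with plain pandas on character offsets copied from the tables above. The function `contain_join_offsets` and the column names are hypothetical; the real `tp.spanner.contain_join` works on span columns rather than on separate begin and end integers. The `how="cross"` option requires pandas 1.2 or later.

```python
import pandas as pd

# Two revenue_of_division spans and three metric spans, as character offsets.
outer = pd.DataFrame({"outer_begin": [4208, 8278],
                      "outer_end": [4234, 8322]})
inner = pd.DataFrame({"inner_begin": [330, 4208, 8278],
                      "inner_end": [338, 4216, 8286]})


def contain_join_offsets(outer: pd.DataFrame,
                         inner: pd.DataFrame) -> pd.DataFrame:
    """Return (outer, inner) pairs where the inner range lies entirely
    within the outer range."""
    pairs = outer.merge(inner, how="cross")
    contained = ((pairs["inner_begin"] >= pairs["outer_begin"])
                 & (pairs["inner_end"] <= pairs["outer_end"]))
    return pairs[contained].reset_index(drop=True)


joined = contain_join_offsets(outer, inner)
print(joined)
```

The cross join compares every pair and is only meant for illustration. In `add_column`, the follow-up `result.merge(pairs)` needs no `on` argument because `DataFrame.merge` defaults to joining on the columns the two frames share.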


Alternatively, we can concatenate all the spans into a single view and use the Text Extensions for Pandas DataFrame visualizer to show color-coded spans in context.

In [17]:
all_spans = pd.concat([
    pd.DataFrame({
        "span": getattr(doc_results, name).iloc[:, 0],
        "view_name": name
    })
    for name in ("metric", "preposition", "division", "revenue_of_division")
]).reset_index(drop=True)
all_spans

Unnamed: 0,span,view_name
0,"[330, 338): 'Revenues'",metric
1,"[400, 408): 'revenues'",metric
2,"[480, 488): 'revenues'",metric
3,"[614, 622): 'revenues'",metric
4,"[683, 691): 'revenues'",metric
...,...,...
146,"[5077, 5106): 'Revenues from Tivoli software'",revenue_of_division
147,"[5251, 5279): 'revenues from Lotus software'",revenue_of_division
148,"[5423, 5454): 'Revenues from Rational software'",revenue_of_division
149,"[8278, 8322): 'Revenues from the Global Techno...",revenue_of_division


In [18]:
viz = tp.jupyter.DataFrameWidget(all_spans)

# Simulate selecting tag column and color mode in the UI
viz._update_tag({"new": ('column_data', 'view_name')})
viz._update_color_mode({"new": 'TAG'})
viz.display()

Output(_dom_classes=('tep--dfwidget--output',))