# Watchful Data Enrichment API

Now that you have gone through the API introduction notebook, you are ready to explore the data enrichment API.

Everything here is in Python; you can use the code segments outside a Jupyter notebook as well.

By now, your ops team should have spun up your hosted Watchful application instance, and it should be running.
In this notebook, you will connect to that instance to experiment with the API.

We recommend Python 3.8.12, as it is the version used to build the SDK and run this notebook.
Generally, Python >=3.7 and <=3.10.9 should work.
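
If you want to confirm that your interpreter falls inside that range before installing anything, a small check like the one below can help. This is only a sketch: the bounds are copied from the text above, and the helper name `is_supported_python` is hypothetical, not part of the Watchful SDK.

```python
import sys

# Version bounds quoted from the text above.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 10, 9)


def is_supported_python(version=None):
    """Return True if `version` (major, minor, micro) is within the stated range."""
    if version is None:
        version = sys.version_info[:3]
    version = tuple(version)
    return MIN_VERSION <= version[:2] and version <= MAX_VERSION


if not is_supported_python():
    print("Warning: this Python version is outside the range mentioned above.")
```

The check compares tuples rather than version strings, so `3.10.9` correctly sorts after `3.8.12`.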

## Installing dependencies and Watchful SDK

In [1]:
# Install the dependencies
import sys
!{sys.executable} -m pip install -r requirements_enrichment_intro.txt

In [2]:
# Import Watchful SDK
import watchful as w
w.__version__

'3.5.0'

## Connecting to your already-running Watchful application instance

Here, you can connect to your already-running Watchful application instance and its currently active project.

Here's what that would look like:

In [3]:
# Connect to your hosted Watchful application instance
host = "your.watchful.application.host"  # change this string to your actual host
port = "9001"
w.external(host, port)
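
If you would rather not hard-code the host and port in the notebook, one option is to read them from environment variables. The sketch below assumes two hypothetical variable names, `WATCHFUL_HOST` and `WATCHFUL_PORT`; they are not defined by the SDK, and the fallbacks are simply the placeholder values used above.

```python
import os


def watchful_endpoint(env=None):
    """Return (host, port) as strings, falling back to the placeholders above."""
    env = os.environ if env is None else env
    # WATCHFUL_HOST and WATCHFUL_PORT are hypothetical names chosen for this sketch.
    host = env.get("WATCHFUL_HOST", "your.watchful.application.host")
    port = str(env.get("WATCHFUL_PORT", "9001"))
    return host, port


host, port = watchful_endpoint()
# w.external(host, port)  # run this once `import watchful as w` has been executed
```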

We can do a sanity check here by calling `w.get()`.
After you've connected to your hosted Watchful application instance, this function can be called anytime you like to check on its status.

As you've seen in the API introduction notebook, `w.get()` returns a response containing information such as your currently active project, dataset examples (candidates) and classes, the hinters created, hand labels and label distribution, confidences and error rate, recall and precision, and much more.

If your running Watchful application instance has an opened project, it should look like this:

In [4]:
import pprint
pp = pprint.PrettyPrinter(indent=4).pprint
pp(w.get())

{   'auto_complete': {   'end': 4,
                         'start': 0,
                         'values': ['CELLS', 'ROWS', 'SENTS', 'TOKS']},
    'cand_seq_full': 1,
    'cand_seq_prefix': 1,
    'candidates': [],
    'classes': {},
    'datasets': ['7846c6da-c2c7-40cc-9380-912666007680'],
    'disagreements': [],
    'error_msg': None,
    'error_verb': None,
    'export_preview': None,
    'exports': [],
    'field_names': ['Review1', 'Review2'],
    'hand_labels': None,
    'hinters': [],
    'is_shared': False,
    'messages': [],
    'n_candidates': 3,
    'n_handlabels': 0,
    'ner_hl_text': None,
    'notifications': [],
    'precision_candidate': {'candidate': [], 'mode': 'ner'},
    'project_config': {},
    'project_id': '2022-08-04.hints',
    'published_title': None,
    'pull_actions': [],
    'push_actions': [],
    'query': 'TOKS: [entity MISC]',
    'query_breakdown': {   'depths': [0, 1],
                           'hits': [0, 0],
                           'offsets
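
Because the full response is long, it can be handy to pull out just a few fields. The sketch below assumes the response behaves like the dictionary printed above; the helper `summarize` and the choice of keys are illustrative only.

```python
def summarize(summary, keys=("project_id", "n_candidates", "n_handlabels", "field_names")):
    """Pick a few keys out of a summary dict; missing keys map to None."""
    return {key: summary.get(key) for key in keys}


# Trimmed copy of the response printed above, used as stand-in data.
sample_summary = {
    "project_id": "2022-08-04.hints",
    "n_candidates": 3,
    "n_handlabels": 0,
    "field_names": ["Review1", "Review2"],
    "error_msg": None,
}
print(summarize(sample_summary))
# With a live connection, you could try: summarize(w.get())
```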

Now, you can create a new project with `w.create_project()`, then give it the title "My Project" by calling `w.title("My Project")`.

The summary below is empty because we don't have any data yet, but it shows the fields that are always there.

In [5]:
w.create_project()

b'"OK"\n'

In [6]:
w.title("My Project")

{'error_msg': None,
 'error_verb': None,
 'status': 'current',
 'title': 'My Project',
 'project_id': '2022-11-16 (8).hints',
 'watchful_home': '/root/watchful',
 'published_title': None,
 'is_shared': False,
 'pull_actions': [],
 'push_actions': [],
 'state_seq': 8800,
 'cand_seq_prefix': 0,
 'cand_seq_full': 0,
 'n_candidates': 0,
 'n_handlabels': 0,
 'datasets': [''],
 'field_names': [],
 'candidates': [],
 'unlabeled_candidate': [],
 'precision_candidate': {'mode': 'ner', 'candidate': []},
 'classes': {},
 'hinters': [],
 'ner_hl_text': None,
 'hand_labels': None,
 'disagreements': [],
 'query': '',
 'query_page': 0,
 'query_end': True,
 'query_hit_count': 0,
 'query_examined': 0,
 'query_breakdown': {'values': [], 'offsets': [], 'hits': [], 'depths': []},
 'query_completed': True,
 'query_full_rows': False,
 'auto_complete': {'values': [], 'start': 0, 'end': 0},
 'query_history': {'values': [], 'hits': []},
 'selections': [],
 'selected_class': '',
 'suggestions': {'positive': [],

Next, we create a toy dataset using data of some companies from Wikipedia (https://en.wikipedia.org/wiki), and add it to the project you have just created. The returned `dataset_id` indicates that the dataset has been created for use in the Watchful application instance.

You can see that the dataset has 2 columns, `Company_info` and `Company_founders`, and 4 rows of data; the last two rows are near-identical entries for Amazon. `Company_info` gives some background on the company, while `Company_founders` says who the founders are. This is the data that will be enriched later in this notebook.

After enriching this data, you can refer back here to sanity-check the enriched attributes against this dataset.

In [7]:
header = "Company_info,Company_founders"
data = '''\
"Microsoft Corporation is an American multinational technology corporation \
producing computer software, consumer electronics, personal computers, and \
related services headquartered at the Microsoft Redmond campus located in \
Redmond, Washington, United States.","Microsoft was founded by Bill Gates and \
Paul Allen on April 4, 1975, to develop and sell BASIC interpreters for the \
Altair 8800."\n\
"Alphabet Inc. is an American multinational technology conglomerate holding \
company headquartered in Mountain View, California.","Founders Larry Page and \
Sergey Brin announced their resignation from their executive posts in December \
2019, with the CEO role to be filled by Sundar Pichai, also the CEO of Google."\n\
"Amazon.com, Inc. is an American multinational technology company focusing on \
e-commerce, cloud computing, online advertising, digital streaming, and \
artificial intelligence.","Amazon was founded by Jeff Bezos from his garage in \
Bellevue, Washington on July 5, 1994."\n\
"Amazon.com, Inc. is an American multinational technology company focusing on \
e-commerce, cloud computing, online advertising, digital streaming, and \
artificial intelligence.","Amazon was founded by Jeff Bezos from his garage in \
Bellevue, Washington, on July 5, 1994."
'''
dataset_id = w.create_dataset(("{}\n{}").format(header, data).encode("utf-8"), header.split(","))
dataset_id

'd41bf7fa-25b5-438f-921e-23b7673849ab'
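
Writing CSV by hand works for a toy dataset, but quoting becomes error-prone once cells contain commas or quotes. As an alternative, the standard-library `csv` module can produce the same kind of payload. The helper `rows_to_csv_bytes` below is a hypothetical name for this sketch, and the commented-out call assumes `w.create_dataset` takes the CSV bytes and the column names exactly as in the cell above.

```python
import csv
import io


def rows_to_csv_bytes(header, rows):
    """Serialize a header and rows to UTF-8 CSV bytes, quoting fields only when needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


example_header = ["Company_info", "Company_founders"]
example_rows = [
    ["A company based in Redmond, Washington.", "Founded by two people in 1975."],
]
csv_bytes = rows_to_csv_bytes(example_header, example_rows)
# dataset_id = w.create_dataset(csv_bytes, example_header)
```

Fields that contain a comma or a double quote are quoted automatically, and embedded quotes are doubled, so the rows survive a round trip through a CSV parser.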

# A walk-through of the data enrichment API

Now that you're connected to your Watchful application, let's take a look at the data enrichment API using an example.

## Creating your enricher class

To enrich the data loaded into your Watchful application, you need to create an enricher class that inherits from Watchful's `Enricher` abstract class. Your enricher class should also implement the `__init__` and `enrich_fn` methods. In the example below, the enricher class is `NEREnricher`, and it implements both the `__init__` and `enrich_fn` methods.

The `__init__` method initializes your data enrichment objects, such as the model used for enrichment and the functions that, put together, perform the enrichment. You can also include other variables that matter to your enrichment logic, such as parameters. In the example below, `tagger` is a model and `predict` is a function that uses the `tagger` model to make a prediction on the input `sent`. On the last line of `__init__`, the enrichment objects (here, just `predict`) are saved as a tuple into `enrichment_args`. Finally, we use the `set_enrich_fn_order` decorator to set the `order` of the enrichment function `enrich_fn` to either "row" or "col", which tells the Watchful enrichment algorithm the intended enrichment order.

The `enrich_fn` method is applied to every input `row` or `col` of your data to perform the enrichment. Every row or column of your data comprises a number of cells, so this function should return a list of enriched cells, `List[EnrichedCell]`. Every `EnrichedCell` conforms to the Watchful format for enriched data in a cell of the dataset, so that your Watchful application can parse it. On the first line of `enrich_fn`, the enrichment objects are unpacked. Next, we create `enriched_row` as an empty list (of to-be enriched cells). Then we iterate over every cell of the input `row` or `col`; in each iteration we apply the `predict` function to the cell, create an `enriched_cell` as an empty list (of to-be attributes), and extract and insert all the available attributes, i.e. the NER entities (spans (offsets), entity and score), into the `enriched_cell`. We also perform an optional post-processing step that adjusts the character offsets of the spans to byte offsets. Finally, the completed `enriched_row` or `enriched_col` is returned.

It is worth noting that the input `row` is a dictionary mapping column names to column values for that row, and an input `col` is a tuple of the column name and the values for that column. You can therefore use the column names to apply different enrichment logic to different columns.
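
To make the column-name idea concrete, here is a small, dependency-free sketch of a row-wise function that only tags 4-digit years in the `Company_founders` column. It is a plain function rather than an `Enricher` subclass, and the attribute name `year` and group name `YEARS` are made up for illustration. The tuple layout (spans, attribute values, name) mirrors the one used in the NER example; treat it as an assumption about the format rather than a specification.

```python
import re
from typing import Dict, List, Optional

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def enrich_row_by_column(row: Dict[Optional[str], Optional[str]]) -> List[list]:
    """Tag 4-digit years, but only in the `Company_founders` column."""
    enriched_row = []
    for column_name, cell in row.items():
        spans, values = [], []
        # Column-specific logic: other columns get an empty set of spans.
        if column_name == "Company_founders" and cell:
            for match in YEAR.finditer(cell):
                spans.append((match.start(), match.end()))
                values.append(match.group())
        enriched_cell = [(spans, {"year": values}, "YEARS")]
        enriched_row.append(enriched_cell)
    return enriched_row


example_row = {
    "Company_info": "Founded in 1975.",
    "Company_founders": "Founded on April 4, 1975.",
}
print(enrich_row_by_column(example_row))
```

Note that the spans produced here are character offsets; the NER example converts them to byte offsets as a post-processing step.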

In [8]:
################################################################################
"""
Your enricher should inherit from Watchful's `Enricher` interface and
implement the methods `__init__` and `enrich_fn` with the same 
signatures.
"""

import os
import pprint
import sys
from typing import Dict, List, Optional, Tuple
from watchful.enricher import EnrichedCell, Enricher, set_enrich_fn_order
from watchful.attributes import adjust_span_offsets_from_char_to_byte

pprint._sorted = lambda x: x
pprint = pprint.PrettyPrinter(indent=4).pprint


class NEREnricher(Enricher):
    """
    This is an example of a customized enricher class that inherits from the
    `Enricher` interface, with subsequent implementation of the methods
    `__init__` and `enrich_fn` with the same signatures.
    """

    def __init__(self) -> None:
        """
        In this function, we create variables that we will later use in
        `enrich_fn` to enrich our data row-wise or column-wise.
        """

        global Sentence
        from flair.data import Sentence
        from flair.models import SequenceTagger
        import logging
        import warnings

        logging.getLogger("flair").setLevel(logging.ERROR)
        warnings.filterwarnings("ignore", module="huggingface_hub")

        tagger = SequenceTagger.load("ner")

        def predict(sent: Sentence) -> None:
            tagger.predict(sent)

        self.enrichment_args = (predict,)

    @set_enrich_fn_order(order="row")
    def enrich_fn(
        self,
        row: Dict[Optional[str], Optional[str]],
    ) -> List[EnrichedCell]:
        """
        In this function, we use our variables from `self.enrichment_args` to
        enrich every row or column of your data. The return value is our
        enriched row or column. In this example, we enrich by rows instead of
        columns.
        """

        predict, = self.enrichment_args

        enriched_row = []

        for cell in row.values():
            sent = Sentence(cell)
            predict(sent)

            enriched_cell = []

            ent_spans = []
            ent_values = []
            ent_scores = []
            for ent in sent.get_spans("ner"):
                spans = (ent.start_position, ent.end_position)
                ent_spans.append(spans)
                ent_value = ent.get_label("ner").value
                ent_values.append(ent_value)
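                # Scale to a percentage first, then round, so that floating-point
                # error (e.g. 0.29 * 100 == 28.999...) is not truncated by int().
                ent_score = str(round(ent.get_label("ner").score * 100))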
                ent_scores.append(ent_score)
                
            enriched_cell.append(
                (ent_spans, {"entity": ent_values, "score": ent_scores}, "ENTS")
            )

            adjust_span_offsets_from_char_to_byte(cell, enriched_cell)

            enriched_row.append(enriched_cell)

        # Print the enriched row so you can see the intermediate output in
        # this notebook later on. Comment this out if you are enriching a
        # large dataset.
        print("Enriched row:")
        pprint(enriched_row)
        print("*" * 80)

        return enriched_row
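
The post-processing step matters because character offsets and byte offsets differ as soon as the text contains non-ASCII characters. The sketch below is not the SDK's `adjust_span_offsets_from_char_to_byte`; it is a hypothetical stand-in, `char_span_to_byte_span`, that illustrates the conversion for a single span, assuming UTF-8 encoding.

```python
def char_span_to_byte_span(text, start, end, encoding="utf-8"):
    """Convert a character-offset span into the matching byte-offset span."""
    byte_start = len(text[:start].encode(encoding))
    byte_end = len(text[:end].encode(encoding))
    return byte_start, byte_end


text = "Café Zürich"
char_span = (5, 11)  # character offsets of "Zürich"
byte_span = char_span_to_byte_span(text, *char_span)
print(char_span, "->", byte_span)
print(text.encode("utf-8")[byte_span[0]:byte_span[1]].decode("utf-8"))
```

For pure-ASCII text, such as the toy dataset in this notebook, the two kinds of offsets coincide.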

## Enriching your data

Having specified your enricher class, you're ready to perform data enrichment.

It is straightforward: simply pass your enricher class to the `enrich_dataset` function. The Watchful SDK and your Watchful application will take care of all the data processing work for you.

In [None]:
from watchful.enrich import enrich_dataset
enrich_dataset(NEREnricher, ["--host", host, "--port", port, "--dataset_id", dataset_id])

Using your custom enricher ...
Enriching ~/watchful/working/28104ef2-8eca-4015-a92f-c137a322d9a2 ...
Enriched row:
[   [   (   [(0, 21), (28, 36), (187, 204), (223, 230), (232, 242), (244, 257)],
            {   'entity': ['ORG', 'MISC', 'ORG', 'LOC', 'LOC', 'LOC'],
                'score': ['99', '97', '73', '100', '99', '100']},
            'ENTS')],
    [   (   [(0, 9), (25, 35), (40, 50), (116, 127)],
            {   'entity': ['ORG', 'PER', 'PER', 'MISC'],
                'score': ['100', '100', '100', '88']},
            'ENTS')]]
********************************************************************************
Enriched row:
[   [   (   [(0, 13), (20, 28), (100, 113), (115, 125)],
            {   'entity': ['ORG', 'MISC', 'LOC', 'LOC'],
                'score': ['100', '99', '100', '100']},
            'ENTS')],
    [   (   [(9, 19), (24, 35), (143, 156), (174, 180)],
            {   'entity': ['PER', 'PER', 'PER', 'ORG'],
                'score': ['100', '100', '99', '100']},
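
One way to sanity-check output like this against the dataset is to slice the original cell text with the reported spans. The sketch below uses a hypothetical helper, `spans_to_text`, and treats the spans as UTF-8 byte offsets, which is an assumption based on the conversion step in the enricher. The sample cell and spans are copied from the first enriched row and the dataset above.

```python
def spans_to_text(cell_text, enriched_cell):
    """List (text, attributes) pairs for every span in an enriched cell."""
    raw = cell_text.encode("utf-8")  # spans are treated as byte offsets
    results = []
    for spans, attributes, *_rest in enriched_cell:
        for i, (start, end) in enumerate(spans):
            text = raw[start:end].decode("utf-8")
            attrs = {key: values[i] for key, values in attributes.items()}
            results.append((text, attrs))
    return results


founders_cell = (
    "Microsoft was founded by Bill Gates and Paul Allen on April 4, 1975, "
    "to develop and sell BASIC interpreters for the Altair 8800."
)
enriched_founders_cell = [
    (
        [(0, 9), (25, 35), (40, 50), (116, 127)],
        {"entity": ["ORG", "PER", "PER", "MISC"], "score": ["100", "100", "100", "88"]},
        "ENTS",
    )
]
for text, attrs in spans_to_text(founders_cell, enriched_founders_cell):
    print(text, attrs)
```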
   

## Querying your enriched data 

Now that you've enriched your data in your Watchful application, you can query your data to verify that enrichment has indeed been successfully performed.

We do this by calling the `query` function to query for the `LOC`, `ORG`, `PER` and `MISC` attributes that were created by your enricher class.

We can see that the query results indeed belong to the respective query attributes. You could also compare the attributes produced here with the dataset loaded earlier in the notebook, or go back to your Watchful application and run the queries to see the same results in the UI.

In [10]:
def query_to_fields(query_str):
    query_res = w.query(query_str)
    return [attr_data["fields"] for attr_data in query_res["candidates"]]

for entity_value in ["LOC", "ORG", "PER", "MISC"]:
    query = f"TOKS: [entity {entity_value}]"
    fields = query_to_fields(query)
    print(f"{entity_value}:\n{fields}\n")

LOC:
[['Redmond', ''], ['Washington', ''], ['United', ''], ['States', ''], ['Mountain', ''], ['View', ''], ['California', ''], ['', 'Bellevue']]

ORG:
[['Microsoft', ''], ['Corporation', ''], ['Microsoft', ''], ['Redmond', ''], ['', 'Microsoft'], ['Alphabet', ''], ['Inc', ''], ['.', '']]

PER:
[['', 'Bill'], ['', 'Gates'], ['', 'Paul'], ['', 'Allen'], ['', 'Larry'], ['', 'Page'], ['', 'Sergey'], ['', 'Brin']]

MISC:
[['American', ''], ['', 'Altair'], ['', '8800'], ['American', ''], ['American', ''], ['American', '']]
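
The results are easier to read when grouped by column. The sketch below assumes that each entry in `fields` lists values in the same order as the dataset columns, which is what the output above suggests (person names only appear in the second position, i.e. `Company_founders`). The helper name `tokens_by_column` is hypothetical, and the sample data is a subset of the `MISC` output above.

```python
from collections import defaultdict


def tokens_by_column(fields, column_names):
    """Group non-empty field values under the column they appeared in."""
    grouped = defaultdict(list)
    for row in fields:
        for column, value in zip(column_names, row):
            if value:
                grouped[column].append(value)
    return dict(grouped)


misc_fields = [["American", ""], ["", "Altair"], ["", "8800"], ["American", ""]]
columns = ["Company_info", "Company_founders"]
print(tokens_by_column(misc_fields, columns))
# With a live connection: tokens_by_column(query_to_fields("TOKS: [entity MISC]"), columns)
```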



# Exploring the data enrichment API further

You can also look at the `attributes.py`, `enricher.py` and `enrich.py` modules in the Watchful package for the data enrichment functions and classes, some of which are already covered above.

A way to quickly explore them and their current documentation is to use the built-in `help` function, as shown below.

You could also explore the Watchful package documentation hosted at https://watchful.readthedocs.io/en/stable/.

In [11]:
help(w)

Help on package watchful:

NAME
    watchful - Initializes ``watchful`` as a module.

PACKAGE CONTENTS
    attributes
    client
    enrich
    enricher

DATA
    API_SUMMARY_HOOK_CALLBACK = None
    API_TIMEOUT_SEC = 600
    ATTR_WRITER = None
    BASE = 64
    COMPRESSED = {0: '#', 1: '$', 2: '%', 3: '&', 4: "'", 5: '(', 6: ')', ...
    COMPRESSED_LEN = 8
    Callable = typing.Callable
    Dict = typing.Dict
    ENRICHMENT_ARGS = None
    EnrichedCell = typing.List[typing.Tuple[typing.Union[typing.Lis...ing....
    Generator = typing.Generator
    HOST = 'localhost'
    IS_MULTIPROC = False
    List = typing.List
    Literal = typing.Literal
    MULTIPROC_CHUNKSIZE = None
    NUMERALS = {0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5', 6: '6', 7:...
    Optional = typing.Optional
    PORT = '9002'
    SCHEME = 'http'
    Tuple = typing.Tuple
    Union = typing.Union

VERSION
    3.5.0
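
When the `help` output is long, a quick listing of a module's public functions and classes can be a useful first pass. The sketch below uses the standard-library `inspect` module; `public_callables` is a hypothetical helper, and `json` is only a stand-in so that the example runs without the SDK installed.

```python
import inspect
import json  # stand-in module so the sketch runs without the SDK


def public_callables(module):
    """Return sorted names of public functions and classes exposed by a module."""
    return sorted(
        name
        for name, member in inspect.getmembers(module)
        if not name.startswith("_")
        and (inspect.isfunction(member) or inspect.isclass(member))
    )


print(public_callables(json))
# With the SDK installed, you could try: public_callables(w.enricher)
```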




To explore the `enricher.py` module:

In [12]:
help(w.enricher)

Help on module watchful.enricher in watchful:

NAME
    watchful.enricher

DESCRIPTION
    This script provides the abstract :class:`Enricher` class interface to be
    inherited in your custom enricher class, where you can then implement your
    custom data enrichment functions and models within :meth:`enrich_row`. Refer to
    https://github.com/Watchfulio/watchful-py/blob/main/examples/enrichment_intro.ipynb
    for a tutorial on how to implement your custom enricher class.

CLASSES
    builtins.object
        Enricher
    
    class Enricher(builtins.object)
     |  Enricher() -> None
     |  
     |  This is the abstract class that customized enricher classes should inherit,
     |  and then implement the abstract methods :meth:`__init__` and
     |  :meth:`enrich_row`.
     |  
     |  Methods defined here:
     |  
     |  __init__(self) -> None
     |      In this method, we create variables that we will store in
     |      :attr:`self.enrichment_args`. We then later use them