# Watchful Python API

This notebook introduces the Watchful Python API with some examples.
Be sure to check the documentation as there's more you can do.

Of course, since everything here is just Python, you can use it outside a Jupyter notebook just as well.

First, we need to get Watchful running, so we have two options:

  - run an ephemeral backend, which will forget your data when you exit the backend
  - connect to a backend that you run from a terminal or from the macOS bundled application

For the purposes of experimenting with the API, the ephemeral option is much more convenient, so that's what we'll do here.

The server output is saved in a file, called watchful_ephemeral_output.txt, probably in the same directory you ran the Jupyter notebook command from.
This can be useful to look at if anything goes wrong.

It's good to be aware that this file contains a complete record of your session, so if you import a lot of data, the file will be correspondingly big, and you might want to delete it to save space.

This notebook was run in a Python 3.8.12 environment.

In [1]:
# Install the dependencies
import sys
!{sys.executable} -m pip install -r requirements_api_intro.txt

In [2]:
# Install and import Watchful SDK 🙂
import sys
!{sys.executable} -m pip install watchful --upgrade
import watchful as w

In [3]:
# Run the ephemeral backend. It will listen on port 9002 by default.
# If you get timeout errors here, make sure the path to your watchful binary is in your PATH by
# uncommenting the code below, setting `path_to_your_watchful_directory` and then running this cell.
#import os
#path_to_your_watchful_directory = ""
#os.environ["PATH"] += os.pathsep + path_to_your_watchful_directory
w.ephemeral()
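
If `w.ephemeral()` times out, it can help to confirm that the binary is actually discoverable before changing anything else. The sketch below uses only the standard library; the executable name `watchful` is an assumption on our part, so substitute whatever name your installation provides.

```python
import shutil


def find_on_path(executable):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


# "watchful" is an assumed executable name; adjust it for your installation.
location = find_on_path("watchful")
if location is None:
    print("binary not found on PATH; extend PATH as described in the cell above")
else:
    print("found binary at", location)
```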

## Connecting to an already-running Watchful instance

If you're creating hinters in an existing project, you may want to connect to a backend that is already running: either the macOS application, or a backend that you started directly on any supported OS.
Here's what that would look like:

In [4]:
# first run the app or run the backend yourself, and then call `external()`.

#w.external()

# You can also specify a non-default port.
# Use the `--port / -p` option to the backend.

#w.external("localhost", "9001")

When not running in ephemeral mode, you need to open or create a project before you can do anything else.

If you already opened the project in the app or through the web frontend (usually http://localhost:9001/), you only need to call `w.external()` to connect to the running app, which already has the project open. If you are using the API only, you will need to open the project through the API:

In [5]:
#w.list_projects()
#w.open_project(...)

The projects list is a list of all the projects in your Watchful directory.

`w.list_projects()` includes every `*.hints` file in your watchful directory, but you can open a hints file outside of this directory as well by passing it to `w.open_project()`.

If you are running the app yourself and connecting to it with `external()`, then calling `open_project` will also open that project in the app.

If you're not using the frontend to manage your projects, and want to do everything with the API, you can create a new project with `w.new_project()`.
It will be opened automatically, so you don't need to call `w.open_project()`.
You can give it a title with `w.title("My Project")`.

We recommend using the frontend and the API at the same time, so you can connect to the API in a notebook or from plain Python, and at the same time also see a nice overview in the UI.

# A walk through the API

Now that we're connected, let's take a look at the key parts of the API that are most commonly used.

# `get()`

You can use `get()` at any time to get the current status.
The structure returned is called the `summary`, and it is fairly complete.
It's worth noting that the frontend of the app gets everything that it displays from this same `summary` object, so if you see it in the frontend, you can find it in the `summary`.
As always, the API docs have more details if you need them.

The `summary` object is returned from every API call, not just `get()`.

The `summary` below is empty because we don't have any data yet, but it shows the fields that are always there.

In [6]:
w.get()

{'cand_seq_full': 0,
 'cand_seq_prefix': 0,
 'candidates': [],
 'classes': {},
 'error_msg': None,
 'error_verb': None,
 'exports': [],
 'field_names': [],
 'hinters': [],
 'n_candidates': 0,
 'n_handlabels': 0,
 'query': '',
 'query_end': True,
 'query_examined': 0,
 'query_hit_count': 0,
 'query_page': 0,
 'selected_class': '',
 'selections': [],
 'state_seq': 2,
 'status': 'current',
 'suggestions': {'negative': [], 'positive': []},
 'title': ''}

Or we can do this:

In [7]:
w.get().keys()

dict_keys(['cand_seq_full', 'cand_seq_prefix', 'candidates', 'classes', 'datasets', 'error_msg', 'error_verb', 'exports', 'field_names', 'hinters', 'messages', 'mode', 'n_candidates', 'n_handlabels', 'published_title', 'pull_actions', 'push_actions', 'query', 'query_end', 'query_examined', 'query_hit_count', 'query_page', 'selected_class', 'selections', 'state_seq', 'status', 'suggestions', 'title', 'unlabeled_candidate'])

It's worth mentioning a couple of these fields that are especially important.
The `status` field tells you whether the backend is doing work or not, and as we can see here it is "current", which is usually what you want.
If it is "working", then the backend is still doing some work, and you can expect that some things may change.
Creating a hinter, as we'll do below, is one example: the call returns immediately with a status of "working" while the hinter is applied to all the candidates in the background, and the status goes back to "current" once that work is finished.
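
If you need to block until background work has finished, a small polling loop is one option. This is a minimal sketch rather than part of the Watchful API: `wait_until_current` is a hypothetical helper that accepts any callable returning a summary dict, and `fake_get` stands in for `w.get` so the example runs without a backend.

```python
import time


def wait_until_current(get_summary, timeout=30.0, interval=0.5):
    """Poll `get_summary()` until its 'status' is 'current' or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        summary = get_summary()
        if summary.get("status") == "current":
            return summary
        if time.monotonic() >= deadline:
            raise TimeoutError("backend still working after %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for `w.get`: reports "working" twice, then "current".
_fake_statuses = iter(["working", "working", "current"])


def fake_get():
    return {"status": next(_fake_statuses), "error_msg": None}


final = wait_until_current(fake_get, timeout=5.0, interval=0.01)
print(final["status"])
```

With a live backend you would pass `w.get` in place of `fake_get`.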

The second field is `error_msg`, which is how errors are reported.
If there is a value in this field, it means the API request did not succeed, so check this field when appropriate.
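
One way to avoid forgetting that check is to wrap calls in a tiny guard. The helper below is a sketch of our own, not something the SDK provides; it assumes that `error_verb` names the request that failed, which the field name suggests but this notebook does not confirm.

```python
def ensure_ok(summary):
    """Return `summary` unchanged, or raise if it reports an error."""
    if summary.get("error_msg"):
        verb = summary.get("error_verb")  # assumed to name the failing request
        raise RuntimeError("API call %r failed: %s" % (verb, summary["error_msg"]))
    return summary


# A summary shaped like the empty one shown above.
ok = ensure_ok({"error_msg": None, "error_verb": None, "status": "current"})
print(ok["status"])
```

With a live backend this could be used as `ensure_ok(w.get())`.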

# Loading and querying data

If you want to edit the notebook here and import a CSV from your own computer, here is how:

In [7]:
# `open()` does not expand "~", so expand it explicitly.
#import os
#csv = open(os.path.expanduser("~/path/to/data.csv")).read()

We just want to focus on the API here without getting distracted by real data, so we'll use the integers from 0 to 999 as a toy example.
Separating them with newlines gives a minimal CSV file containing a single column.
When we import it, the system will give the first (and only) column a field name of "F1".

In [8]:
csv = ""

for i in range(1000):
    csv += str(i) + "\n"
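
The same string can also be built in a single expression with `str.join`, which avoids repeated string concatenation. This is only an equivalent sketch of the loop above; the variable is named `csv_text` here so that it does not shadow Python's standard `csv` module.

```python
# Same single-column CSV as the loop above: "0\n1\n...\n999\n".
csv_text = "".join(f"{i}\n" for i in range(1000))

print(len(csv_text.splitlines()))
```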

Now we import the CSV data into watchful with `records()`.

Since the `summary` object is always returned from any API call, we just directly index the `n_candidates` field to make sure that importing the data worked, and we expect to see 1000 candidates loaded.

In [9]:
w.records(csv)['n_candidates']

1000

Now we can do a query.

Since our data is just numbers, we'll look for occurrences of the regex `/88/`, that is, two consecutive 8 digits.
If you like these kinds of puzzles, take a moment to guess how many of the numbers from 0 to 999 have the digit pattern "88" somewhere in them.

We'll print the whole summary object here, so we can see what it looks like when there is some data.

In [10]:
w.query("/88/")

{'cand_seq_full': 0,
 'cand_seq_prefix': 0,
 'candidates': [{'fields': ['688'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[1, 3]]]}},
  {'fields': ['884'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[0, 2]]]}},
  {'fields': ['881'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[0, 2]]]}},
  {'fields': ['888'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[0, 2]]]}},
  {'fields': ['883'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[0, 2]]]}},
  {'fields': ['988'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[1, 3]]]}},
  {'fields': ['388'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[1, 3]]]}},
  {'fields': ['885'],
   'hints': [],
   'labels': {},
   'matches': {'by_hinter': [], 'query': [[[0, 2]]]}}],
 'classes': {},
 'error_msg': None,
 'error_verb': None,
 'ex

We can see that `query_examined` is 1000, which means the query finished searching all the data before returning.
With larger data sets the query will typically continue to run in the background, but this data set is small enough that the search completes before the call returns.
We also see `query_hit_count` is 19, which is the answer to our puzzle above (ten for 88, 188, ... 988, and ten from 880-889, but we counted 888 twice, so only 19 not 20).
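
The count can be cross-checked locally with Python's `re` module. This sketch assumes that, for a pattern as simple as `88`, Watchful's regex matching agrees with Python's; it does not call the backend at all.

```python
import re

pattern = re.compile("88")
numbers = [str(i) for i in range(1000)]  # the same values we imported
hits = [n for n in numbers if pattern.search(n)]

print(len(hits))
print(pattern.search("688").span())
```

This prints `19` and `(1, 3)`; the span is the same pair of offsets that appears under `matches` for the candidate `688` in the output above.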

Finally we see that eight candidates are returned that match the query.

# Now set a base rate, create a hinter, and check for matches

Let's look at the API for creating a class and a hinter for that class.

Either setting a base rate or creating a hinter will create the class for us.

We'll create a class of "interesting numbers", with a base rate of 10%.

Numbers ending in "7" can be interesting, so we'll create a hinter for those with a weight of 80%.

Anything this hinter matches on our data set should come back with a probability for the class of 80%, at least until we start adding other hinters to the mix.

So we'll do a query, print a matching candidate, and check that the hinter was applied and also that the probability for the class is what it should be.

In [11]:
w.base_rate("Interesting", 10)
w.hinter("Interesting", "/7$/", 80)
w.query("/7$/")['candidates'][0]

{'fields': ['677'],
 'hints': [1],
 'labels': {'Interesting': 80},
 'matches': {'by_hinter': [[1, [[[2, 3]]]]], 'query': [[[2, 3]]]}}

So we can see that hinter 1 was applied, and the candidate has 80% for the "Interesting" class.

We can see all the hinters we have by looking at "hinters" on the `summary`:

In [12]:
w.get()['hinters']

[{'hit_ratio': [100, 1000],
  'id': 1,
  'label': 'Interesting',
  'name': '',
  'query': '/7$/',
  'weight': 80}]

# We can create a hinter from external data

Sometimes you can't extract all the data you want using a regex, or you need to query an external data source for a hinter.
"External hinters" let you provide the hint values yourself for each candidate.

Since we are looking for interesting numbers, prime numbers might be interesting.
Perhaps it is possible to write a regex to recognize prime numbers, but it doesn't sound like a good idea, so let's use an external hinter!

In [13]:
# get the hinters
w.get()["hinters"]

[{'hit_ratio': [100, 1000],
  'id': 1,
  'label': 'Interesting',
  'name': '',
  'query': '/7$/',
  'weight': 80}]

In [14]:
w.external_hinter("Interesting", "prime", 90)["hinters"]

[{'hit_ratio': [100, 1000],
  'id': 1,
  'label': 'Interesting',
  'name': '',
  'query': '/7$/',
  'weight': 80},
 {'hit_ratio': [0, 0],
  'id': 2,
  'label': 'Interesting',
  'name': 'prime',
  'query': '[external]',
  'weight': 90}]

We called `external_hinter` which takes the class name, the name of the hinter, and the expectation, here 90.
We are saying that prime numbers are 90% likely to be interesting.
Why not 100%?
Remember that we set our base rate to 10%, which means we are looking specifically for the 10% of numbers that are the *most* interesting, so some of the prime numbers might not make the cut.
A number that is both prime *and* ending in a 7 would match both of our hinters and would be boosted much more, so we'll definitely expect to see those ones if we take the top 10% of labels for the class.

In the output after creating the external hinter, you can see that there is now a hinter in the list with name "prime" and a query "\[external\]".
Compare this to the hinter we created before which has the query `/7$/`.
This is the difference between a regular hinter, which is defined by what matches a query, and an external hinter.

This is also why it's important that we give our external hinters names.
If we look at the first hinter, we can see what it's doing because there's a query there, but for the second hinter, if we had several external hinters, we would have no idea what this one is doing if we hadn't given it a name.

The `external_hinter()` API call provides "\[external\]" as the query value, which is not a real query but a special value that Watchful recognizes.
Unlike regular hinters, which are applied immediately, an external hinter will not be applied until we send the hint values, either true or false, for each candidate.

Note that the `hit_ratio` in a hinter, seen above, indicates both the number of positive matches (the first number) and the number of candidates that have been examined so far (the second number).
In our new external hinter, both of these numbers will be 0, because the system is waiting for us to give it these values.
With a regular hinter that is defined by a query, you can see both these numbers go up as the entire dataset is queried, just like with the `query_examined` and `query_hit_count` values that we saw before.
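
When comparing hinters it is often handier to read `hit_ratio` as a fraction. The helper below is a small sketch of our own (`hit_fraction` is not an SDK function); it returns `None` while nothing has been examined, so that a fresh external hinter does not cause a division by zero.

```python
def hit_fraction(hinter):
    """Return hits / examined for a hinter dict, or None if nothing was examined."""
    hits, examined = hinter["hit_ratio"]
    if examined == 0:
        return None
    return hits / examined


# Hinter entries trimmed from the summary output shown earlier.
regex_hinter = {"hit_ratio": [100, 1000], "id": 1, "query": "/7$/"}
external_hinter = {"hit_ratio": [0, 0], "id": 2, "query": "[external]"}

print(hit_fraction(regex_hinter))
print(hit_fraction(external_hinter))
```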

Now we'll provide the values for this hinter.

In [15]:
# we will iterate over all candidates, and store the values as we go, 
# then provide all the candidates' hint values at once with `hint_all`
values = []

for candidate in w.dump():
    n = int(candidate[0])
    is_prime = not any(map(lambda x: n % x == 0, range(2,32)))
    values.append(is_prime)
    
# Using `hint_all` because we already have all the values in memory,
# but if the dataset is large, you can stream your hints back in chunks.
summary = w.hint_all("prime", values)
print(summary["hinters"])
print(summary["status"])

[{'hit_ratio': [100, 1000], 'id': 1, 'label': 'Interesting', 'name': '', 'query': '/7$/', 'weight': 80}, {'hit_ratio': [0, 0], 'id': 2, 'label': 'Interesting', 'name': 'prime', 'query': '[external]', 'weight': 90}]
working


`dump()` is a special API call used for external hinting.
Behind the scenes, it is making multiple calls to the API to get candidates in chunks.
If your dataset is very large, you may want to stream the results back to the system as well, rather than creating a single array of values as we are doing here.
If you need to do this, check out the implementation of `hint_all()` and the other `hint` functions for details.

While we have a table of numbers here, candidate data is stored as strings, so we'll need to convert from string to number.

When you run `hint_all()` the new hint values will be applied in the background, which means the summary returned from the `hint_all()` call itself may not show that much has changed. Note that the `hit_ratio` is still 0/0 in the immediate return from `hint_all()`. The `status` property on the summary object also shows "working" to show that background work is continuing. You can call `get` a moment later to see the results:

In [16]:
summary = w.get()
print(summary["hinters"])
print(summary["status"])

[{'hit_ratio': [100, 1000], 'id': 1, 'label': 'Interesting', 'name': '', 'query': '/7$/', 'weight': 80}, {'hit_ratio': [158, 1000], 'id': 2, 'label': 'Interesting', 'name': 'prime', 'query': '[external]', 'weight': 90}]
current


Now we can see the hit ratio for our external hinter.

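So our hinter flagged 158 numbers. That is not quite the number of primes below 1000, which is 168: the quick divisibility test in the cell above rejects the 11 primes below 32 (each one divides itself) and accepts 1, giving 168 - 11 + 1 = 158.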
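
For reference, here is a minimal sketch of a stricter primality test that could be used to produce the hint values. It handles numbers below 2 explicitly and only tries divisors up to the square root of `n`, so small primes are not rejected for being divisible by themselves. The name `is_prime` is our own, and `math.isqrt` requires Python 3.8 or later.

```python
import math


def is_prime(n):
    """Trial division up to the integer square root of n."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))


# One hint value per number, in the same order as the imported data.
values = [is_prime(n) for n in range(1000)]
print(sum(values))
```

This prints `168`, the number of primes below 1000.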

Finally we can look at some candidates and see which ones matched which hinters, and what probability was assigned for our class:

In [17]:
list(map(lambda c: (c["fields"][0], c["hints"], c["labels"]["Interesting"]), w.query("")["candidates"]))

[('677', [1, 2], 100),
 ('351', [], 0),
 ('153', [], 0),
 ('598', [], 0),
 ('952', [], 0),
 ('258', [], 0),
 ('507', [1], 100),
 ('860', [], 0)]

Note that we're just doing `w.query("")` with an empty query string to get everything.
We could have done `w.get()` again, but that would give the results of the current query, which was `/7$/` from before.
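
The one-liner above indexes `c["labels"]["Interesting"]` directly, which would raise a `KeyError` for a candidate whose `labels` dict is empty, as in the `/88/` output earlier. A slightly more defensive version might look like the sketch below; `candidate_rows` is a hypothetical helper, and the sample data is hand-written in the shape of the outputs above so that it runs without a backend.

```python
def candidate_rows(summary, label):
    """Yield (first field, hinter ids, label value or None) per candidate."""
    for cand in summary["candidates"]:
        yield cand["fields"][0], cand["hints"], cand["labels"].get(label)


# Two candidates shaped like the ones in the outputs above.
sample = {
    "candidates": [
        {"fields": ["688"], "hints": [], "labels": {}},
        {"fields": ["677"], "hints": [1], "labels": {"Interesting": 80}},
    ]
}

for row in candidate_rows(sample, "Interesting"):
    print(row)
```

With a live backend you could pass `w.query("")` instead of `sample`.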

## Deleting hinters and classes

Now that we've created some hinters, we can also delete them.

You can delete a class as well, but you have to delete all the hinters in it first.

Hinters are deleted by id and classes by name.

In [18]:
w.get()['hinters']

[{'hit_ratio': [100, 1000],
  'id': 1,
  'label': 'Interesting',
  'name': '',
  'query': '/7$/',
  'weight': 80},
 {'hit_ratio': [158, 1000],
  'id': 2,
  'label': 'Interesting',
  'name': 'prime',
  'query': '[external]',
  'weight': 90}]

In [19]:
w.delete(1)
w.delete(2)

summary = w.get()
list(summary['classes'].keys())

['Interesting']

In [20]:
summary = w.delete_class("Interesting")
list(summary['classes'].keys())

[]
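
Because a class can only be deleted once its hinters are gone, it can be convenient to look up the hinter ids for a class rather than hard-coding them. The helper below is a sketch of our own that works on the `hinters` list of a summary; it is not part of the SDK, and the sample data is trimmed from the output shown earlier.

```python
def hinter_ids_for_class(summary, class_name):
    """Collect the ids of all hinters whose 'label' is `class_name`."""
    return [h["id"] for h in summary["hinters"] if h["label"] == class_name]


# Hinters list trimmed from the output of `w.get()['hinters']` above.
sample = {
    "hinters": [
        {"id": 1, "label": "Interesting", "query": "/7$/"},
        {"id": 2, "label": "Interesting", "query": "[external]"},
    ]
}

print(hinter_ids_for_class(sample, "Interesting"))
```

With a live backend, you would call `w.delete()` on each id returned for `w.get()` and then call `w.delete_class("Interesting")`.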

# When you're done, you can shut down the backend

This terminates the backend process that we started with `ephemeral()`.

If you run the app yourself and connect with `external()`, this would also exit the app.

In [21]:
w.exit_backend()

# More

The entire backend API is documented in watchful/web/api.md, and you can access any API features with the generic `api()` call taking a verb as the first argument, and named parameters for the verb's arguments.

You can also look at `watchful.py` for other API methods we didn't cover here.