# Token classifier model 

In this notebook, we will load and use a token classifier language model that assigns labels to some tokens in a text. Learn more about the Token Classification task <a href="https://huggingface.co/tasks/token-classification" target="_blank" rel="noopener">here</a>. Please also refer to the Transformer Extension <a href="https://github.com/exasol/transformers-extension/blob/main/doc/user_guide/user_guide.md" target="_blank" rel="noopener">User Guide</a> to find more information about the UDF used in this notebook.

We will be running SQL queries using <a href="https://jupysql.ploomber.io/en/latest/quick-start.html" target="_blank" rel="noopener">JupySQL</a> SQL Magic.

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the sandbox](../sandbox_config.ipynb).
2. [Initialize the Transformer Extension](te_init.ipynb).

## Set up

### Access configuration

In [None]:
%run ../access_store_ui.ipynb
display(get_access_store_ui('../'))

In [None]:
EXTERNAL_HOST = f"{sb_config.EXTERNAL_HOST_NAME}:{sb_config.HOST_PORT}"

WEBSOCKET_URL = f"exa+websocket://{sb_config.USER}:{sb_config.PASSWORD}" \
    f"@{EXTERNAL_HOST}/{sb_config.SCHEMA}?SSLCertificate=SSL_VERIFY_NONE"

Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation of <a href="https://github.com/exasol/sqlalchemy-exasol" target="_blank" rel="noopener">sqlalchemy-exasol</a> for details on how to connect to the database using the Exasol SQLAlchemy driver.

In [None]:
from sqlalchemy import create_engine

engine = create_engine(WEBSOCKET_URL)

%load_ext sql
%sql engine

## Get language model

To demonstrate the token classification task we will use an [English Named Entity Recognition model](https://huggingface.co/sschet/biomedical-ner-all), trained on the Maccrobat dataset to recognize 107 bio-medical entities in a given text corpus (case reports, etc.).

We need to load the model from the Huggingface Hub into the BucketFS. This can take a long time, and unfortunately we cannot tell exactly when it has finished. The notebook's hourglass is not a reliable indicator, because BucketFS may still be doing some work when the call issued by the notebook returns. Please wait a few moments after that before querying the model.

In [None]:
# This is the name of the model at the Huggingface Hub
MODEL_NAME = 'sschet/biomedical-ner-all'

In [None]:
%run ./model_retrieval.ipynb
load_huggingface_model(MODEL_NAME, sb_config)

## Use language model

Below is a medical report on a patient examination. In this report, we will be looking for occurrences of recognized entities, such as patient data (e.g. age), symptoms, clinical events, laboratory test results, etc.

In [None]:
# We will display all model output
%config SqlMagic.displaylimit = 0

In [None]:
MY_TEXT = """
A 63-year-old woman with no known cardiac history presented with a sudden onset of dyspnea requiring
intubation and ventilatory support out of hospital. She denied preceding symptoms of chest discomfort,
palpitations, syncope or infection. The patient was afebrile and normotensive, with a sinus tachycardia
of 140 beats/min.
"""

# Make sure our texts can be used in an SQL statement.
MY_TEXT = MY_TEXT.replace("'", "''")
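
The quote doubling in the cell above follows the standard SQL rule that a single quote inside a string literal is written as two single quotes. As a minimal sketch of the same idea, the hypothetical helper `to_sql_literal` below (not part of the notebook or the extension) wraps a Python string into a quoted literal. It only handles single quotes and is no substitute for parameterized queries when the text comes from untrusted sources.

```python
def to_sql_literal(text: str) -> str:
    """Return `text` as a single-quoted SQL string literal.

    Hypothetical helper for illustration: embedded single quotes
    are doubled, which is the standard SQL escaping rule.
    """
    return "'" + text.replace("'", "''") + "'"


# An apostrophe would otherwise terminate the literal too early.
literal = to_sql_literal("The patient's heart rate was 140 beats/min.")
print(literal)
```

Texts without single quotes, such as the report above, pass through unchanged apart from the enclosing quotes.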

In [None]:
%%sql
WITH MODEL_OUTPUT AS
(
    SELECT TE_TOKEN_CLASSIFICATION_UDF(
        NULL,
        '{{sb_config.TE_BFS_CONN}}',
        '{{sb_config.TE_TOKEN_CONN}}',
        '{{sb_config.TE_MODELS_BFS_DIR}}',
        '{{MODEL_NAME}}',
        '{{MY_TEXT}}',
        NULL
    )
)
SELECT start_pos, end_pos, word, entity, error_message FROM MODEL_OUTPUT ORDER BY start_pos, end_pos
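
To make the `start_pos` and `end_pos` columns easier to read, the spans can be projected back onto the input text. The sketch below assumes that these columns are character offsets into the input (start inclusive, end exclusive), as in the Hugging Face token classification pipeline. This is an assumption about the UDF output, and the rows used here are made up for illustration rather than actual model output. The function `annotate` is hypothetical.

```python
def annotate(text, rows):
    """Wrap each (start_pos, end_pos, entity) span of `text` in [word|entity].

    Spans are applied from right to left so that the offsets of the
    spans not yet processed stay valid while the text grows.
    """
    for start, end, entity in sorted(rows, reverse=True):
        text = f"{text[:start]}[{text[start:end]}|{entity}]{text[end:]}"
    return text


sample = "A 63-year-old woman presented with dyspnea."


def span(word, entity):
    # Compute the offsets from the sample itself instead of hard-coding them.
    start = sample.index(word)
    return (start, start + len(word), entity)


# Made-up rows in the assumed (start_pos, end_pos, entity) shape.
rows = [
    span("63-year-old", "Age"),
    span("woman", "Sex"),
    span("dyspnea", "Sign_symptom"),
]

print(annotate(sample, rows))
```

The sketch assumes that the spans do not overlap. Overlapping spans would need to be merged or filtered first.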

The code above shows how the model works on a toy example. However, the main purpose of having a model deployed in the database is to get a quick response for a batch input. The performance gain comes from two factors: localization and parallelization. Localization means that the input data never crosses machine boundaries. Parallelization means that multiple instances of the model process the data on all available nodes in parallel.

Another advantage of making predictions within the database is enhanced data security. Safeguarding privacy becomes simpler because the source data never leaves the database machine.

In a more practical application, we might want to classify tokens in text stored in a column of a database table. One possible use case would be collecting statistics of entity occurrence in, say, customer reviews, where each review is stored in a separate row. For example, if the text to be processed is stored in the column `MY_TEXT_COLUMN` of the table `MY_TEXT_TABLE` and we want to get the counts of the 10 most frequent entities, the SQL would look similar to this:
```sql
SELECT entity, COUNT(*) AS occurrence
FROM (
    SELECT TE_TOKEN_CLASSIFICATION_UDF(..., MY_TEXT_COLUMN, NULL)
    FROM MY_TEXT_TABLE
) tokenized 
GROUP BY entity
ORDER BY 2 DESC
LIMIT 10;
```
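
As a rough sketch, the same query could be assembled in Python before being passed to JupySQL. The function `build_entity_count_query` and the connection names in the usage example are hypothetical. The argument order of the UDF call simply mirrors the single-input call earlier in this notebook. The table and column names are inserted verbatim, so they must be trusted identifiers.

```python
def build_entity_count_query(table, column, bfs_conn, token_conn,
                             models_dir, model_name, top_n=10):
    """Build a batch entity-count query (illustrative sketch only).

    The UDF argument order is copied from the single-input example
    in this notebook; string arguments are quoted as SQL literals.
    """
    def lit(value):
        # Double single quotes, the standard SQL literal escaping.
        return "'" + value.replace("'", "''") + "'"

    return (
        "SELECT entity, COUNT(*) AS occurrence\n"
        "FROM (\n"
        "    SELECT TE_TOKEN_CLASSIFICATION_UDF(\n"
        f"        NULL, {lit(bfs_conn)}, {lit(token_conn)}, {lit(models_dir)},\n"
        f"        {lit(model_name)}, {column}, NULL\n"
        "    )\n"
        f"    FROM {table}\n"
        ") tokenized\n"
        "GROUP BY entity\n"
        "ORDER BY 2 DESC\n"
        f"LIMIT {int(top_n)};"
    )


# Placeholder connection names and directory, for illustration only.
query = build_entity_count_query(
    table="MY_TEXT_TABLE",
    column="MY_TEXT_COLUMN",
    bfs_conn="MY_BFS_CONN",
    token_conn="MY_TOKEN_CONN",
    models_dir="my_models_dir",
    model_name="sschet/biomedical-ner-all",
)
print(query)
```

In this notebook the real values would come from `sb_config` and `MODEL_NAME`, as in the single-input query.
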
Please note that the response time observed in the example above, with a single input, will not scale linearly with the number of inputs. Much of the latency comes from loading the model from BucketFS into CPU memory, which needs to be done only once regardless of the number of inputs.
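
The effect of the one-off loading cost can be illustrated with a simple cost model: a fixed loading time plus a per-input inference time. The numbers below are hypothetical and chosen only to show the shape of the relationship. They are not measurements, and the model ignores parallel execution across nodes.

```python
def estimated_latency_ms(n_inputs, load_ms, per_input_ms):
    """Fixed model-loading cost plus a per-input inference cost."""
    return load_ms + n_inputs * per_input_ms


# Hypothetical figures: 20 s to load the model, 50 ms per input.
LOAD_MS = 20_000
PER_INPUT_MS = 50

single = estimated_latency_ms(1, LOAD_MS, PER_INPUT_MS)
batch = estimated_latency_ms(1_000, LOAD_MS, PER_INPUT_MS)
naive = 1_000 * single  # what linear scaling of the single-input time would predict

print(f"single input:        {single} ms")
print(f"1000 inputs:         {batch} ms")
print(f"naive extrapolation: {naive} ms")
```

Under these assumed figures the batch of 1000 inputs takes 70 seconds, compared with more than five hours predicted by multiplying the single-input time by 1000.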