# Zero-shot classification model

In this notebook, we will load and use a zero-shot classification language model. You can learn about the zero-shot classification task <a href="https://huggingface.co/tasks/zero-shot-classification" target="_blank" rel="noopener">here</a>. Please also refer to the Transformer Extension <a href="https://github.com/exasol/transformers-extension/blob/main/doc/user_guide/user_guide.md" target="_blank" rel="noopener">User Guide</a> for more information about the UDF used in this notebook.

We will be running SQL queries using <a href="https://jupysql.ploomber.io/en/latest/quick-start.html" target="_blank" rel="noopener">JupySQL</a> SQL Magic.

## Prerequisites

Prior to using this notebook the following steps need to be completed:
1. [Configure the sandbox](../sandbox_config.ipynb).
2. [Initialize the Transformer Extension](te_init.ipynb).

## Set up

### Access configuration

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation of <a href="https://github.com/exasol/sqlalchemy-exasol" target="_blank" rel="noopener">sqlalchemy-exasol</a> for details on how to connect to the database using the Exasol SQLAlchemy driver.

In [None]:
%run ../utils/jupysql_init.ipynb

## Get language model

To demonstrate the zero-shot classification task we will use the [Cross-Encoder for Natural Language Inference](https://huggingface.co/cross-encoder/nli-deberta-base).

We need to load the model from the Hugging Face Hub into BucketFS. This can take a long time, and unfortunately we cannot tell exactly when it has finished. The notebook's hourglass is not a reliable indicator, because BucketFS may still be doing some work when the call issued by the notebook returns. Please wait a few moments after that before querying the model.

In [None]:
# This is the name of the model at the Huggingface Hub
MODEL_NAME = 'cross-encoder/nli-deberta-base'

In [None]:
%run utils/model_retrieval.ipynb
load_huggingface_model(MODEL_NAME, sb_config)

## Use language model

Below is a chunk of text that we will try to classify using labels that were not used during model training. Of the five suggested labels, the first two are much more relevant than the others, so we expect the model to give them significantly higher scores.

In [None]:
# Text to be classified.
MY_TEXT = """
A new model offers an explanation for how the Galilean satellites formed around the solar system’s largest world. 
Konstantin Batygin did not set out to solve one of the solar system’s most puzzling mysteries when he went for a
run up a hill in Nice, France. Dr. Batygin, a Caltech researcher, best known for his contributions to the search
for the solar system’s missing “Planet Nine,” spotted a beer bottle. At a steep, 20 degree grade, he wondered why
it wasn’t rolling down the hill. He realized there was a breeze at his back holding the bottle in place. Then he
had a thought that would only pop into the mind of a theoretical astrophysicist: “Oh! This is how Europa formed.”
Europa is one of Jupiter’s four large Galilean moons. And in a paper published Monday in the Astrophysical Journal,
Dr. Batygin and a co-author, Alessandro Morbidelli, a planetary scientist at the Côte d’Azur Observatory in France,
present a theory explaining how some moons form around gas giants like Jupiter and Saturn, suggesting that
millimeter-sized grains of hail produced during the solar system’s formation became trapped around these massive
worlds, taking shape one at a time into the potentially habitable moons we know today.
"""

# Make sure our texts can be used in an SQL statement.
MY_TEXT = MY_TEXT.replace("'", "''")

# Classes, not seen during model training.
MY_LABELS = 'space & cosmos, scientific discovery, microbiology, robots, archeology'
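
The quote handling in the cell above can be sketched as a pair of small helpers. This is only an illustrative sketch: the function names `escape_sql_quotes` and `join_labels` are hypothetical and not part of the Transformer Extension, and the comma-separated label format is assumed from the `MY_LABELS` value used in this notebook. Note that only the straight single quote (`'`) needs doubling; the typographic apostrophes (`’`) in the sample text are left as they are.

```python
def escape_sql_quotes(text: str) -> str:
    """Double every straight single quote so the text can sit inside '...' in SQL."""
    return text.replace("'", "''")


def join_labels(labels: list[str]) -> str:
    """Build a comma-separated label string (format assumed from MY_LABELS)."""
    return ", ".join(label.strip() for label in labels)


# Hypothetical sample values, used only for illustration.
sample_text = "It's Jupiter's moon"
escaped_text = escape_sql_quotes(sample_text)
label_string = join_labels(["space & cosmos", "scientific discovery", "microbiology"])
```

If a label could itself contain a single quote, the same escaping would presumably need to be applied to the label string before it is placed in the query.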

In [None]:
%%sql
WITH MODEL_OUTPUT AS
(
    SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
        NULL,
        '{{sb_config.TE_BFS_CONN}}',
        '{{sb_config.TE_TOKEN_CONN}}',
        '{{sb_config.TE_MODELS_BFS_DIR}}',
        '{{MODEL_NAME}}',
        '{{MY_TEXT}}',
        '{{MY_LABELS}}'
    )
)
SELECT label, score, error_message FROM MODEL_OUTPUT ORDER BY SCORE DESC

The code above shows how the model works on a toy example. However, the main purpose of having a model deployed in the database is to get a quick response for a batch input. The performance gain comes from two factors: localization and parallelization. Localization means that the input data never crosses machine boundaries. Parallelization means that multiple instances of the model process the data on all available nodes in parallel.

Another advantage of making predictions within the database is enhanced data security. Safeguarding privacy becomes simpler given that the source data never leaves the database machine.

In a more practical application the text to be classified would be stored in a column of a database table. For example, if we wanted to get a label with the highest score for each row of the input table `MY_TEXT_TABLE`, where the text in question is in the column `MY_TEXT_COLUMN`, the SQL would look similar to this:
```
WITH MODEL_OUTPUT AS
(
    SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(..., MY_TEXT_COLUMN, <MY_LABELS>) FROM MY_TEXT_TABLE
)
SELECT text_data, label FROM MODEL_OUTPUT WHERE rank=1;
```
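
To make the effect of the `rank=1` filter concrete, here is a plain-Python sketch of the same selection: for each input text, keep only the label with the highest score. The rows and scores below are made up for illustration and are not real model output, and the helper name `top_label_per_text` is hypothetical.

```python
from collections import defaultdict

# Hypothetical output rows: (text, label, score). Scores are invented.
rows = [
    ("Europa formed from icy grains.", "space & cosmos", 0.71),
    ("Europa formed from icy grains.", "microbiology", 0.04),
    ("Europa formed from icy grains.", "scientific discovery", 0.25),
    ("A new gut bacterium was found.", "microbiology", 0.80),
    ("A new gut bacterium was found.", "space & cosmos", 0.05),
    ("A new gut bacterium was found.", "scientific discovery", 0.15),
]


def top_label_per_text(rows):
    """Return {text: label}, keeping the highest-scoring label for each text."""
    by_text = defaultdict(list)
    for text, label, score in rows:
        by_text[text].append((score, label))
    # max() compares the score first, so it picks the best-scoring label.
    return {text: max(scored)[1] for text, scored in by_text.items()}


top = top_label_per_text(rows)
```

In the database this grouping is not needed, since the ranking is already available in the UDF output and a simple filter is enough.
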
Please note that the response time observed in the example above with a single input will not scale linearly with the number of inputs. Much of the latency comes from loading the model from BucketFS into CPU memory, which needs to be done only once regardless of the number of inputs.
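
A back-of-the-envelope sketch of this point is shown below. It assumes a simple cost model with a fixed one-off model loading time plus a constant per-input inference time; the numbers are invented for illustration, not measurements, and the model ignores any further speed-up from parallel execution.

```python
def estimated_total_seconds(n_inputs: int, load_seconds: float, per_input_seconds: float) -> float:
    """One-off model load plus a constant cost per input (simplified model)."""
    return load_seconds + n_inputs * per_input_seconds


# Hypothetical timings: 30 s to load the model, 0.5 s per classified text.
single = estimated_total_seconds(1, load_seconds=30.0, per_input_seconds=0.5)
batch = estimated_total_seconds(100, load_seconds=30.0, per_input_seconds=0.5)

print(f"1 input:    {single:.1f} s")
print(f"100 inputs: {batch:.1f} s")
```

Under these assumed numbers, a hundred inputs take well under three times as long as a single one, because the loading cost is paid only once.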