<div style="text-align: right;">
  <img src="https://raw.githubusercontent.com/exasol/ai-lab/refs/heads/main/assets/Exasol_Logo_2025_Dark.svg" style="width:200px; margin: 10px;" />
</div>

# Text classification model

In this notebook, we will load and use a text classification language model that can assign a label to a given text. Learn more about the Text Classification task <a href="https://huggingface.co/tasks/text-classification" target="_blank" rel="noopener">here</a>. Please also refer to the Transformer Extension <a href="https://github.com/exasol/transformers-extension/blob/main/doc/user_guide/user_guide.md" target="_blank" rel="noopener">User Guide</a> to find more information about the UDFs used in this notebook.

We will be running SQL queries using <a href="https://jupysql.ploomber.io/en/latest/quick-start.html" target="_blank" rel="noopener">JupySQL</a> SQL Magic.

## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the AI Lab](../main_config.ipynb).
2. [Initialize the Transformer Extension](te_init.ipynb).

## Setup

### Open Secure Configuration Storage

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

Let's bring up JupySQL and connect to the database via SQLAlchemy. 
Please refer to the documentation of [sqlalchemy-exasol](https://github.com/exasol/sqlalchemy-exasol) for details on how to connect to the database using the Exasol SQLAlchemy driver.

In [None]:
%run ../utils/jupysql_init.ipynb

## Get a language model

To demonstrate the text classification task we will use the [Ekman emotions classifier](https://huggingface.co/arpanghoshal/EkmanClassifier) model.

We need to load the model from the Hugging Face Hub into the [BucketFS](https://docs.exasol.com/db/latest/database_concepts/bucketfs/bucketfs.htm). Depending on the database's network connection, this can take a long time. Unfortunately, we cannot tell exactly when the process has finished: the notebook's hourglass may not be a reliable indicator, because BucketFS may still be doing some work when the call issued by the notebook returns. Please wait a few moments after the call returns before querying the model.
You might see a warning that some weights are newly initialized and that the model should be trained on a downstream task. Please ignore this warning. It is not important for this demonstration; the model should still be able to produce meaningful output.

In [None]:
from exasol.nb_connector.model_installation import install_model, TransformerModel
from transformers import AutoModelForSequenceClassification

# This is the name of the model at the Hugging Face Hub
MODEL_NAME = 'arpanghoshal/EkmanClassifier'
install_model(ai_lab_config, TransformerModel(MODEL_NAME, 'sequence_classification', AutoModelForSequenceClassification))
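
Since the upload may still be settling when `install_model` returns, one option is to poll a readiness check instead of waiting for a fixed time. The following is only a minimal sketch of such a polling loop. The helper `wait_until` and the callable passed to it are hypothetical and not part of the Transformers Extension; you would have to supply your own check, for example one that runs a small test query and reports whether it succeeded.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` has elapsed.

    `check` is a user-supplied (hypothetical) readiness test.
    Returns True if the check succeeded in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause between attempts to avoid hammering the database.
        time.sleep(interval_s)
```

A call such as `wait_until(my_model_check, timeout_s=120)` would then return `True` as soon as the hypothetical `my_model_check` succeeds, or `False` once the timeout has passed.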

## Use the language model

In the Transformers Extension, there are two UDFs for sequence classification: `TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF`, which takes a single input text, and `TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF`, which takes two input texts and compares them.

Both sequence classification UDFs additionally take the following input parameters:

* device_id: To run on a GPU, specify a valid CUDA device ID.
* bucketfs_conn: The BucketFS connection name.
* sub_dir: The directory where the model is stored in the BucketFS.
* model_name: The name of the model to use for prediction.
* return_ranks: Either "ALL" or "HIGHEST".

You need to supply these parameters in the correct order. Note that the input text is passed between `model_name` and `return_ranks`, as the queries below show. Further information can be found in the <a href="https://github.com/exasol/transformers-extension/blob/main/doc/user_guide/user_guide.md" target="_blank" rel="noopener">User Guide</a>.
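
Because the arguments are positional, it is easy to mix them up. As an illustration only, the sketch below assembles the call text for the single-text UDF from named values, following the argument order used by the queries in this notebook. The helpers `sql_literal` and `build_single_text_call` are hypothetical, the parameter name `text_data` is an assumption, and the connection name and sub-directory are placeholder values.

```python
from typing import Optional

# Positional order as used by the queries in this notebook.
# `text_data` is an assumed name for the input text argument.
PARAMETER_ORDER = (
    "device_id", "bucketfs_conn", "sub_dir", "model_name", "text_data", "return_ranks",
)


def sql_literal(value: Optional[object]) -> str:
    """Render a Python value as a SQL literal: NULL, a number, or a quoted string."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape single quotes by doubling them.
    return "'" + str(value).replace("'", "''") + "'"


def build_single_text_call(**params) -> str:
    """Build the UDF call text with the arguments placed in the required order."""
    missing = [name for name in PARAMETER_ORDER if name not in params]
    unknown = [name for name in params if name not in PARAMETER_ORDER]
    if missing or unknown:
        raise ValueError(f"missing: {missing}, unknown: {unknown}")
    args = ", ".join(sql_literal(params[name]) for name in PARAMETER_ORDER)
    return f"TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF({args})"


# Keyword arguments can be given in any order; the helper reorders them.
call = build_single_text_call(
    return_ranks="HIGHEST",
    text_data="Oh my God!",
    model_name="arpanghoshal/EkmanClassifier",
    sub_dir="my_model_dir",       # placeholder value
    bucketfs_conn="MY_BFS_CONN",  # placeholder value
    device_id=None,               # rendered as NULL, as in the queries in this notebook
)
print(call)
```

In the notebook itself the JupySQL `{{...}}` templates fill in the configuration values, so a helper like this is mainly useful as a reminder of the order.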

### Single Text Classification

Let's try to classify a single phrase that clearly carries emotion but is also somewhat ambiguous: "Oh my God!".
We will save the result in the variable `udf_output` to support automatic testing of this notebook.

In [None]:
%%sql --save udf_output
WITH MODEL_OUTPUT AS
(
    SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
        NULL,
        '{{ai_lab_config.bfs_connection_name}}',
        '{{ai_lab_config.bfs_model_subdir}}',
        '{{MODEL_NAME}}',
        'Oh my God!',
        'HIGHEST'
    )
)
SELECT label, score, rank, error_message FROM MODEL_OUTPUT

In this instance, we let the model return only the highest-ranking result by setting `return_ranks = "HIGHEST"`. To get all available results, we can run it with `return_ranks = "ALL"`:

In [None]:
%%sql --save udf_output
WITH MODEL_OUTPUT AS
(
    SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
        NULL,
        '{{ai_lab_config.bfs_connection_name}}',
        '{{ai_lab_config.bfs_model_subdir}}',
        '{{MODEL_NAME}}',
        'Oh my God!',
        'ALL'
    )
)
SELECT label, score, rank, error_message FROM MODEL_OUTPUT ORDER BY SCORE DESC

As you can see, we select only some of the UDF's output columns in these examples. If you need more details about the output, you can find information on all output columns in the <a href="https://github.com/exasol/transformers-extension/blob/main/doc/user_guide/user_guide.md" target="_blank" rel="noopener">User Guide</a>.

The UDF returns the model output in the following columns:

* label: the label the model assigned to the input
* score: the confidence with which the label was assigned
* rank: the rank of the label. All predictions/labels for one input are ranked by their score; rank=1 means the best result (highest score).
* error_message: any error that occurs while executing the UDF is saved here
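
The relationship between `score`, `rank`, and the two `return_ranks` settings can be sketched in plain Python. This only mimics the behaviour described above and is not the UDF's actual implementation. The function `rank_labels` is hypothetical, and the labels and scores are made up for illustration.

```python
from typing import Dict, List

# Made-up scores for illustration only; they are not real model output.
scores = {"joy": 0.12, "surprise": 0.61, "fear": 0.20, "anger": 0.07}


def rank_labels(scores: Dict[str, float]) -> List[dict]:
    """Sort labels by descending score and number them starting at rank 1."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"label": label, "score": score, "rank": rank}
        for rank, (label, score) in enumerate(ordered, start=1)
    ]


# Comparable to return_ranks = 'ALL': one row per label.
all_rows = rank_labels(scores)

# Comparable to return_ranks = 'HIGHEST': only the row with rank 1.
highest_rows = [row for row in all_rows if row["rank"] == 1]

for row in all_rows:
    print(row)
```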

### Text Pair Classification

Now we are going to add some context to our exclamation in the form of an additional input text, and use the `TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF` to analyze our text again. Let's see how this changes the model output.

In [None]:
%%sql --save udf_output
WITH MODEL_OUTPUT AS
(
    SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
        NULL,
        '{{ai_lab_config.bfs_connection_name}}',
        '{{ai_lab_config.bfs_model_subdir}}',
        '{{MODEL_NAME}}',
        'Oh my God!',
        'I lost my purse.',
        'ALL'
    )
)
SELECT label, score, rank, error_message FROM MODEL_OUTPUT ORDER BY SCORE DESC

The code above shows how the model works on a toy example. However, the main purpose of deploying a model in the database is to get a quick response for a batch input. The performance gain comes from two factors: localization and parallelization. Localization means that the input data never crosses machine boundaries. Parallelization means that multiple instances of the model process the data on all available nodes in parallel.

Another advantage of making predictions within the database is enhanced data security. Safeguarding privacy becomes simpler because the source data never leaves the database machine.

In a more practical application, the text to be classified would be stored in a column of a database table. For example, if we wanted to get a label with the highest score for each row of the input table `MY_TEXT_TABLE`, where the text in question is in the column `MY_TEXT_COLUMN`, the SQL would look similar to this:
```
SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(..., MY_TEXT_COLUMN, 'HIGHEST') FROM MY_TEXT_TABLE;
```
Please note that the response time observed in the example above with a single input will not scale linearly with the number of inputs. Much of the latency comes from loading the model from BucketFS into CPU memory, which needs to be done only once regardless of the number of inputs.
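
To make this amortization argument concrete, here is a back-of-the-envelope sketch. It assumes a fixed one-time load cost and a constant per-input cost on a single model instance, and it ignores parallelization. The function `estimated_total_seconds` is hypothetical and the timings are made up; measure your own setup to get real numbers.

```python
def estimated_total_seconds(n_inputs: int, load_s: float, per_input_s: float) -> float:
    """One-time model load cost plus a constant inference cost per input."""
    return load_s + n_inputs * per_input_s


# Made-up timings for illustration only.
LOAD_S = 10.0        # assumed time to load the model from BucketFS
PER_INPUT_S = 0.05   # assumed inference time per input text

for n in (1, 100, 10_000):
    total = estimated_total_seconds(n, LOAD_S, PER_INPUT_S)
    print(f"{n:>6} inputs: {total:8.2f} s total, {total / n:.4f} s per input")
```

Under these assumptions the average cost per input drops sharply as the batch grows, because the load time is spread over more rows.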