# Text AI Extension preprocessing


Here we demonstrate how the Text AI Extension can be used to build a data-preprocessing pipeline. We use a dataset of customer support tickets, which contains unstructured data in the form of ticket descriptions. We will sort these tickets into "urgent" and "not urgent" cases, and find important named entities and keywords within the text. This will be achieved using a Text AI Extension pipeline. The extracted information will be used for data analysis in a follow-up notebook.
We will also demonstrate the Text AI Extension's ability to determine whether data has already been processed, and to skip it if applicable.

## Prerequisites

Before using this notebook, complete the following steps:

1. [Configure the AI-Lab](../main_config.ipynb).
2. [Initialize the Text AI Extension](./txaie_init.ipynb).

**Note**: To be able to store the models used in this demo, make sure you set the Disk Size of the database to at least 10 GiB in the AI-Lab configuration.


## Natural Language Processing introduction

This section contains a short introduction to Natural Language Processing and the processes we will use in this notebook.

### NLP

Natural Language Processing is the processing (e.g. classification or information retrieval) of unannotated language, also called "unstructured data" or "free text".

There are tasks in Natural Language Processing (NLP) which seem easy to us humans, but are very hard for a machine to do. For example, inferring the opinion the speaker has about a topic (Opinion Extraction/Mining). Doing these tasks on un-annotated text is even harder. Therefore, multiple ways to annotate a natural language text with additional information were developed. These annotated texts are then better suited for higher-level NLP tasks.
                                                                                                                                                   
Depending on the amount of data/text which should be processed, annotating by hand is mostly not an option these days, since with increasing dataset sizes the resources needed quickly become unrealistic. Therefore, Exasol's Text AI provides you with tools you can use for annotating your data in various ways.
                                                                                                                     
In this Notebook, we will show you our three default preprocessing pipeline steps. Of course, it is possible for you to define your own pipeline later on.
Let's explain these three steps before we dive into how to run the preprocessing.
                                                                                                                     
### Topic Classification
                                                                                                                     
Topic Classification is the task of assigning topics to text/documents/datapoints. In Topic Classification, a given set of topics is used, and each data point is assigned the best matching topic based on the probability the classification model calculates.
Given that a document is about a particular topic, it is expected for particular words to appear in the document more or less frequently. However, it is not required for the exact words to describe the topic to be found in a text. This means that topics can be inferred, even if their name/description/topic synonyms are not found in the data.

![Diagram: a document text with added topics](./images/topics.drawio.png)

Topic Classification takes a given set of topics as input and assigns each topic a probability that the text is about it. It is usually trained using supervised learning. It can also be used with Zero-Shot Classification models, which can assign classes/topics that were not seen during training. This is in contrast to other approaches like topic extraction, which is often unsupervised and does not need a list of topics as input, instead extracting them from the data itself.
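
To make the selection step concrete, here is a minimal sketch, assuming a model has already produced one raw score per topic. The function name `classify` and the example scores are hypothetical, and the softmax normalization is a common choice rather than a statement about what the Text AI Extension's model does internally.

```python
import math


def classify(scores: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Turn raw per-topic scores into probabilities and pick the best topic."""
    # Softmax: exponentiate and normalize so the probabilities sum to 1.
    max_score = max(scores.values())  # subtracted for numerical stability
    exps = {topic: math.exp(score - max_score) for topic, score in scores.items()}
    total = sum(exps.values())
    probabilities = {topic: value / total for topic, value in exps.items()}
    best_topic = max(probabilities, key=probabilities.get)
    return best_topic, probabilities


# Hypothetical raw scores for a single ticket.
best, probs = classify({"urgent": 0.2, "not urgent": 1.1})
print(best, probs)
```

Every topic from the input set receives a probability; the best match is simply the one with the highest value.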
                                                                                                                     
### Keyword Search
                                                                               
Keyword Search is about identifying the most relevant words or phrases (keywords/keyphrases) in a given text.
These can then help in further steps, e.g. summarizing the content of texts and recognizing the main topics discussed.
Keywords or phrases need to be present in the text.
For example:

![Diagram: a document text with highlighted keywords](./images/keywords.drawio.png)
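
The requirement that keywords occur verbatim in the text can be sketched as follows. This is only an illustration of that constraint, not of how a keyword model ranks relevance; the helper `keyword_spans` and the candidate list are hypothetical.

```python
def keyword_spans(text: str, candidates: list[str]) -> dict[str, tuple[int, int]]:
    """Return the (begin, end) position of each candidate that occurs in the text."""
    found = {}
    lowered = text.lower()
    for candidate in candidates:
        begin = lowered.find(candidate.lower())
        if begin != -1:  # keep only candidates that literally occur in the text
            found[candidate] = (begin, begin + len(candidate))
    return found


text = "I'm having an issue with the GoPro Hero. It's affecting my productivity."
spans = keyword_spans(text, ["GoPro Hero", "productivity", "refund"])
print(spans)
```

The candidate "refund" is dropped because it does not appear in the text, whereas a topic classifier could still assign a topic whose name never occurs in the document.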


### Named Entity Recognition

Named Entity Recognition (NER) is about locating and classifying so-called "named entities" mentioned in a text document. Depending on the model, entities are e.g. person names, organizations, locations, or vehicles, so "things that have names". The model seeks out those entities, returning their positions in the document as well as their class.
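
The shape of such a result (entity text, class, and position) can be illustrated with a toy recognizer. The sketch below uses a fixed dictionary lookup instead of a trained model, so it only shows the output structure; the `Entity` class, the `find_entities` helper and the class name "product" are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    text: str
    entity_type: str
    begin: int  # index of the first character of the entity
    end: int    # index one past the last character of the entity


def find_entities(text: str, known_entities: dict[str, str]) -> list[Entity]:
    """Locate every occurrence of each known entity name and attach its class."""
    entities = []
    for name, entity_type in known_entities.items():
        for match in re.finditer(re.escape(name), text):
            entities.append(Entity(name, entity_type, match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity.begin)


text = ("I'm having an issue with the LG Smart TV. Please assist. "
        "I'm having an issue with the LG Smart TV.")
entities = find_entities(text, {"LG Smart TV": "product"})
for entity in entities:
    print(entity)
```

A real NER model does not need such a dictionary; it infers entities and their classes from context.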

### Example Result of 3 Steps

Let's look at an example of what the combined output of these three steps might look like. For a given document, consider the document content to be "I'm having an issue with the GoPro Hero. It's affecting my productivity.". We may use a topic classifier with the input topic set of "urgent, not urgent" for inferring urgency from ticket content. NER and Keyword Search do not need additional input; they just work with the document itself. The output of a preprocessing pipeline containing all three steps could then look something like this:

![Diagram: a document text with the found entity, keyword and topic](./images/document_annotated.drawio.png)



## General Setup

As a first step, we need to get access to the AI-Lab secret store:

In [1]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))


Then we can get the activation SQL for our previously installed Script Language Containers. This will be used to activate those SLCs in order to use their UDFs.

We also want to import some of the Python functions of the text-ai and notebook-connector modules.

In [2]:
from exasol.nb_connector.connections import open_pyexasol_connection
from exasol.nb_connector.language_container_activation import get_activation_sql

activation_sql = get_activation_sql(ai_lab_config)

In [3]:
from exasol.nb_connector.ai_lab_config import AILabConfig

from exasol.ai.text.extraction.abstract_extraction import Defaults, Output
from exasol.ai.text.extractors.standard_extractor import StandardExtractor
from exasol.ai.text.extractors.extractor import PipelineExtractor
from exasol.ai.text.extractors.source_table_extractor import SourceTableExtractor, SchemaSource, TableSource, NameSelector
from exasol.ai.text.extractors.bucketfs_model_repository import BucketFSRepository

from exasol.nb_connector.text_ai_extension_wrapper import LANGUAGE_ALIAS

The next call makes it possible to run SQL directly in this notebook, which makes it easier to display the results of our preprocessing. The call after it sets the maximum number of rows our SQL statements display in the notebook.

In [4]:
%run ../utils/jupysql_init.ipynb

In [5]:
%config SqlMagic.displaylimit = 20

## Get an example dataset

We will be using a dataset which holds information on customer support tickets. We will split this data into two sets, in order to demonstrate how the preprocessing tasks handle new data being added to a dataset.
Before that, we want to make sure the tables we are going to use don't already exist, for example from a previous run of this notebook. Therefore, we are going to drop them.
To do so, we first set a few names and define the list of tables to drop:


In [6]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"
table="CUSTOMER_SUPPORT_TICKETS"
schema=ai_lab_config.db_schema

In [7]:
# A list of tables which the steps below create automatically. If you run the notebook multiple times they need to be dropped in between.
table_list = [
    "TXAIE_AUDIT_LOG",
    "DOCUMENTS",
    f"DOCUMENTS_{schema}_MY_VIEW",
    "NAMED_ENTITY",
    "NAMED_ENTITY_LOOKUP_ENTITY_TYPE",
    "NAMED_ENTITY_LOOKUP_SETUP",
    "KEYWORD_SEARCH",
    "KEYWORD_SEARCH_LOOKUP_KEYWORD",
    "KEYWORD_SEARCH_LOOKUP_SETUP",
    "TOPIC_CLASSIFIER",
    "TOPIC_CLASSIFIER_LOOKUP_TOPIC",
    "TOPIC_CLASSIFIER_LOOKUP_SETUP"
]


If you are curious about which tables are generated and what they look like, you can find that information in the Results section below.
Next, we define a function which drops these tables, and then call it.

**Note:** If you run into technical issues while running this notebook, you might want to run the "delete_text_ai_preprocessing_tables" function again, in order to re-run the pipeline from scratch. This will ensure all data gets processed again.

In [37]:
def delete_text_ai_preprocessing_tables():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        for drop_table in table_list:
            conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{drop_table}" """)

In [42]:
delete_text_ai_preprocessing_tables()

You can then load the data using **[this notebook](../data/data_customer_support.ipynb)**. This loads the data into a table called "CUSTOMER_SUPPORT_TICKETS" found in the schema defined in the ai_lab_config variable db_schema. Please go to that notebook and run it.
You can verify the import is done with the call below. It should return "8469".

In [8]:
%%sql
SELECT COUNT(*) FROM {{schema}}.{{table}};

COUNT(*)
8469


### Create a View on the data

This dataset has 8469 entries. You could run the preprocessing for the whole dataset, but it would take quite some time. Instead, we will create a view containing only part of the dataset, and use this view as the base for our preprocessing.
We set the size of this view here. If you want to see how the AI-Lab handles bigger datasets on your Exasol instance, you can set the "view_size" higher.

In [9]:
view="MY_VIEW"
view_size = 100 # <= 4234

In [48]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""DROP VIEW IF EXISTS "{schema}"."{view}"; """)
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}; """)


Let's check the size of our created view:

In [49]:
%%sql
SELECT COUNT(*) FROM {{schema}}.{{view}};

COUNT(*)
100


As you can see, we now have only our defined 100 data points to contend with.

Let's now see what our data contains:

In [50]:
%%sql
DESC {{schema}}.{{view}}

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TICKET_ID,"DECIMAL(18,0)",,,,
CUSTOMER_NAME,VARCHAR(2000000) UTF8,,,,
CUSTOMER_EMAIL,VARCHAR(2000000) UTF8,,,,
CUSTOMER_AGE,"DECIMAL(18,0)",,,,
CUSTOMER_GENDER,VARCHAR(2000000) UTF8,,,,
PRODUCT_PURCHASED,VARCHAR(2000000) UTF8,,,,
DATE_OF_PURCHASE,VARCHAR(2000000) UTF8,,,,
TICKET_TYPE,VARCHAR(2000000) UTF8,,,,
TICKET_SUBJECT,VARCHAR(2000000) UTF8,,,,
TICKET_DESCRIPTION,VARCHAR(2000000) UTF8,,,,


We can see a ticket ID column, as well as some columns containing information about the customer, like name and e-mail address. There is also a column containing the product the ticket is about, and then some metadata columns for the ticket itself. The "TICKET_DESCRIPTION" column contains the actual text of the ticket. The "RESOLUTION" column contains the resolution if there is one; otherwise it is empty.

In [51]:
%%sql
SELECT TICKET_ID,
    CUSTOMER_NAME,
    PRODUCT_PURCHASED,
    TICKET_SUBJECT, 
    TICKET_DESCRIPTION,
    RESOLUTION,
    CUSTOMER_SATISFACTION_RATING  
    FROM {{schema}}.{{view}} WHERE TICKET_ID < 6

ticket_id,customer_name,product_purchased,ticket_subject,ticket_description,resolution,customer_satisfaction_rating
1,Marisa Obrien,GoPro Hero,Product setup,"I'm having an issue with the GoPro Hero. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.",,
2,Jessica Rios,LG Smart TV,Peripheral compatibility,"I'm having an issue with the LG Smart TV. Please assist. If you need to change an existing product. I'm having an issue with the LG Smart TV. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.",,
3,Christopher Robbins,Dell XPS,Network problem,"I'm facing a problem with my Dell XPS. The Dell XPS is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my Dell XPS, but it's not charging properly.",Case maybe show recently my computer follow.,3.0
4,Christina Dillon,Microsoft Office,Account access,"I'm having an issue with the Microsoft Office. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved.",Try capital clearly never color toward story.,3.0
5,Alexander Carroll,Autodesk AutoCAD,Data loss,I'm having an issue with the Autodesk AutoCAD. Please assist. Note: The seller is not responsible for any damages arising out of the delivery of the battleground game. Please have the game in good condition and shipped to you I've noticed a sudden decrease in battery life on my Autodesk AutoCAD. It used to last much longer.,West decision evidence bit.,1.0


## Download NLP Models

We will use several different transformer models to run our preprocessing. We will use [this notebook](./utils/txaie_default_models.ipynb) to download them from HuggingFace.

Simply run the next cell.
**This call will take some time to complete, depending on your internet connection. You will see some printed output once it is done.**

**Note**: If this operation fails with an error indicating a lost connection, please increase the size of your database and try again.

In [10]:
%run ./utils/txaie_default_models.ipynb


Model download done.


## Configure the Text-AI Pipeline

In the Text AI Extension, you define steps to run, and then place them in a pipeline which orchestrates the data flow for you. In this notebook we will use a basic example built on the default steps defined in the "StandardExtractor".
                                                                                                                    
### Configure defaults

Here, we will configure how our pipeline should run. In general, each NLP extractor has its own configuration parameters. The "Defaults" object is a helper object allowing us to set these parameters once and apply these settings to all extractors.

How these defaults are set will depend on your database. We are using a rather small Docker-DB. Therefore, we set the "batch_size" to only 10, so each process handles only 10 rows at once, and we also set "parallelism_per_node" to the low value of 2. "parallelism_per_node" determines how many parallel processes are run on each node of your database. If you have a bigger database to run this notebook on, you can experiment with setting both values higher than we have here.
The model repository is a data object pointing to the location of the model files we downloaded earlier.


In [11]:
defaults = Defaults(
    parallelism_per_node=2,
    batch_size=10,
    model_repository=BucketFSRepository(
        connection_name = ai_lab_config.txaie_bfs_connection,
        sub_dir = ai_lab_config.txaie_models_bfs_dir
    )
)
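
As a rough illustration of what "batch_size" and "parallelism_per_node" mean for the workload, the sketch below splits row keys into batches and distributes them round-robin over worker processes. This is a plain-Python model of the idea under the assumption of a single database node; how the Text AI Extension actually schedules its batches is not shown in this notebook, and both helper names are hypothetical.

```python
def make_batches(keys: list[int], batch_size: int) -> list[list[int]]:
    """Split the row keys into consecutive batches of at most batch_size rows."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def assign_round_robin(batches: list[list[int]], workers: int) -> dict[int, list[list[int]]]:
    """Distribute the batches evenly over the given number of worker processes."""
    assignment = {worker: [] for worker in range(workers)}
    for index, batch in enumerate(batches):
        assignment[index % workers].append(batch)
    return assignment


keys = list(range(1, 101))                    # 100 rows, matching a view size of 100
batches = make_batches(keys, batch_size=10)   # batch_size=10 as in the defaults above
assignment = assign_round_robin(batches, workers=2)  # parallelism_per_node=2, one node
print(len(batches), [len(worker_batches) for worker_batches in assignment.values()])
```

Under these assumptions, 100 rows yield 10 batches, and each of the 2 processes handles 5 of them. Larger values reduce the number of batches per process, but they need more memory per process.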

### Define the extractor

Now we need to define an extractor to run our extraction/preprocessing. We will use a StandardExtractor, which has three standard preprocessing steps built in, namely topic classification, keyword search and named entity recognition. It is possible to disable each of these steps in the StandardExtractor by setting its model to "None". You can also use a different model instead of the built-in one by setting the step's model to a specific HuggingFace model. Here, however, we will use the StandardExtractor as is.

For the topic classification model we will use the topics "urgent" and "not urgent".

In [56]:
topics={"urgent", "not urgent"}

std_extractor =  StandardExtractor(
                        # If you want to disable a step, set it to None:
                        # named_entity_recognition_model = None,
                        # topic_classification_model = None,
                        
                        # If you want to use a different(not default) model, set its name:
                        # keyword_search_model = HuggingFaceModel(name="MY_KEYWORD_SEARCH_MODEL"),
                        topics=topics
                    )

We will also need a SourceTableExtractor, which holds information on which data we want to use as a source for our preprocessing, and feeds that data to the StandardExtractor.
We give it our schema and view as a data source, and tell it to run the preprocessing on the column "TICKET_DESCRIPTION", since that is where the natural-language text of our data is. We also tell it to use the "TICKET_ID" column as an id/key.

In [57]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"

src_extractor = SourceTableExtractor(
                        sources=[
                            SchemaSource(
                                db_schema=NameSelector(pattern=schema),
                                tables=[
                                    TableSource(
                                        table=NameSelector(pattern=view),
                                        columns=[NameSelector(pattern=text_column)],
                                        keys=[NameSelector(pattern=key_column)]
                                    )
                                ]
                            )
                        ]
                    )

Now, we can give these two extractors as steps to a PipelineExtractor, which will build a Pipeline out of them:

In [58]:
p_extractor = PipelineExtractor(
                steps=[
                    src_extractor,
                    std_extractor
                ]
            )

Next, we will wrap our PipelineExtractor in an extraction wrapper. This allows us to simply pass our Secret Store "ai_lab_config" as an input; the wrapper then builds the necessary database connection and "run" function for us.

We feed it our PipelineExtractor as the extractor, tell it to put the Output into our schema, and also give it our run defaults.

In [59]:
%run ./utils/txaie_extraction_wrapper.ipynb

In [60]:
# todo: do we want this UI in a separate file?
%run utils/txaie_init_ui.ipynb
# todo: CKey.language_alias does not exist yet; use it once it is available in NC
# todo: this should get the input "PYTHON3_TXAIE"
display(get_txaie_SLC_name_ui(ai_lab_config))

AttributeError: language_alias

In [61]:
extraction = ExtractionWrapper(extractor=p_extractor,
                               output=Output(db_schema=schema),
                               defaults=defaults)

Then the only step left is to define a convenience function which calls our preprocessing, and then run it in the next section.

In [62]:
def run_text_ai_preprocessing():
    extraction.run(ai_lab_config)

## Run the preprocessing

Time to run our preprocessing. First, let's verify how many entries our view has:

In [63]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
100


Then we call our preprocessing function. This will use our view as input, and produce new tables and views using the models we downloaded. 

Also, take note of the time this operation takes on your setup.

In [64]:
%%time
run_text_ai_preprocessing()

CPU times: user 254 ms, sys: 41.1 ms, total: 295 ms
Wall time: 10min 22s


## Results

Now, we will take a look at some of the tables and views our preprocessing has created for us. 
First, let's look at the tables created by our preprocessing:


In [65]:
%%sql
SELECT TABLE_SCHEMA, TABLE_NAME FROM EXA_ALL_TABLES WHERE TABLE_SCHEMA='{{schema}}'

table_schema,table_name
AI_LAB,CUSTOMER_SUPPORT_TICKETS
AI_LAB,TXAIE_AUDIT_LOG
AI_LAB,DOCUMENTS
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW
AI_LAB,KEYWORD_SEARCH
AI_LAB,KEYWORD_SEARCH_LOOKUP_KEYWORD
AI_LAB,KEYWORD_SEARCH_LOOKUP_SETUP
AI_LAB,TOPIC_CLASSIFIER
AI_LAB,TOPIC_CLASSIFIER_LOOKUP_TOPIC
AI_LAB,TOPIC_CLASSIFIER_LOOKUP_SETUP


As you can see, there are a number of new tables related to our preprocessing. There is our original data table "CUSTOMER_SUPPORT_TICKETS", and a new log table "TXAIE_AUDIT_LOG". The "DOCUMENTS" table contains our input texts together with an identifying Span; we will take a closer look at it below. There is also a "DOCUMENTS_AI_LAB_MY_VIEW" table, which contains IDs of the input text and documents, as well as the column the input text originated from.
This enables you to trace documents (and their associated results) back to the original input data point.

Then there are three tables per step of our preprocessing: a "<<step-name>>" table, a "lookup" table and a "setup" table. We won't look at them in detail, but there are also some generated views which contain a denormalized version of this information. If you are curious, feel free to look at the contents of these tables on your own.

If we want to find out how these new tables are structured, we can get a description from the Exasol database. For example, let's see what the resulting "DOCUMENTS" table looks like.

### DOCUMENTS Table


In [66]:
%%sql
DESC DOCUMENTS

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",True,True,False,False
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",True,False,False,False
TEXT_CHAR_END,"DECIMAL(18,0)",True,False,False,False
TEXT,VARCHAR(2000000) UTF8,True,False,False,False


It looks like this table contains a "TEXT_DOC_ID", "TEXT_CHAR_BEGIN", "TEXT_CHAR_END" and a "TEXT" column.
The "TEXT" column holds the text of the document.
If the content of one of our input data points does not fit within the VARCHAR limit of the text column, it gets split into multiple entries in the documents table. These will have the same "TEXT_DOC_ID", indicating they came from the same document. "TEXT_CHAR_BEGIN" and "TEXT_CHAR_END" indicate which part of the original document each specific row contains. This triplet of "TEXT_DOC_ID", "TEXT_CHAR_BEGIN" and "TEXT_CHAR_END" is called a "Span"; together, the three values form an identifier for a section of text. You will encounter Spans for many kinds of text subsections. For example, keywords found in a text are also identified by a Span in our result tables (see below).
                                                                                                                                                        
The usage of these Spans allows you to do various operations on top of these results, such as joining results on the document-id, or checking the order in which keywords appear in a document.
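
A Span can be modelled as a simple triple. The following sketch, with a hypothetical `Span` class and one made-up keyword position, shows the two operations mentioned above: checking that a sub-span belongs to a document span, and ordering spans by their position in the document.

```python
from typing import NamedTuple


class Span(NamedTuple):
    doc_id: int
    char_begin: int
    char_end: int

    def contains(self, other: "Span") -> bool:
        """True if the other span lies inside this span of the same document."""
        return (
            self.doc_id == other.doc_id
            and self.char_begin <= other.char_begin
            and other.char_end <= self.char_end
        )


document = Span(doc_id=1, char_begin=0, char_end=275)
keyword_a = Span(doc_id=1, char_begin=29, char_end=39)  # "GoPro Hero" in ticket 1
keyword_b = Span(doc_id=1, char_begin=4, char_end=10)   # made-up position for illustration

# Tuples compare element by element, so sorting orders by document, then by position.
ordered = sorted([keyword_a, keyword_b])
print(document.contains(keyword_a), ordered)
```

Because the three values are plain numbers, the same comparisons can be expressed directly in SQL join and filter conditions.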

We can also check the number of TEXT_DOC_ID entries in our table:



In [67]:
%%sql
SELECT COUNT(ALL text_doc_id) FROM {{schema}}.DOCUMENTS;

Count(DOCUMENTS.TEXT_DOC_ID)
100


It's identical to the number of rows in our input view. So all the data was converted successfully.

Now, let's look at what the content of our table looks like:

In [68]:
%%sql
SELECT * FROM DOCUMENTS WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,TEXT
1,0,275,"I'm having an issue with the GoPro Hero. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists."
2,0,266,"I'm having an issue with the LG Smart TV. Please assist. If you need to change an existing product. I'm having an issue with the LG Smart TV. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly."
3,0,242,"I'm facing a problem with my Dell XPS. The Dell XPS is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my Dell XPS, but it's not charging properly."
4,0,259,"I'm having an issue with the Microsoft Office. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved."


## Resulting Views

There are also some new views:

In [69]:
%%sql
SELECT VIEW_SCHEMA, VIEW_NAME FROM EXA_ALL_VIEWS

view_schema,view_name
AI_LAB,MY_VIEW
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW_VIEW
AI_LAB,KEYWORD_SEARCH_VIEW
AI_LAB,TOPIC_CLASSIFIER_VIEW
AI_LAB,NAMED_ENTITY_VIEW


These views contain the results of our three preprocessing steps respectively. They are built on top of the result tables, which hold the data in a normalized form. So, for instance, instead of the topic name you will see a number in the table; the names are collected in a supporting table, named something like XYZ_LOOKUP.
The views denormalize these tables by joining them back together into human-readable information.
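
The relationship between a normalized result table, its lookup table and the view can be sketched in a few lines of Python. The row layout, the numeric ids and the helper `denormalize` are assumptions for illustration only; the real tables contain more columns.

```python
# Hypothetical normalized result rows: (text_doc_id, topic_id, topic_score)
topic_rows = [(1, 0, 0.505), (1, 1, 0.495)]

# Hypothetical lookup table mapping the numeric id to the topic name.
topic_lookup = {0: "not urgent", 1: "urgent"}


def denormalize(rows, lookup):
    """Replace each numeric topic id with its readable name, as a view would."""
    return [(doc_id, lookup[topic_id], score) for doc_id, topic_id, score in rows]


readable_rows = denormalize(topic_rows, topic_lookup)
print(readable_rows)
```

Storing only the id in the result table avoids repeating the same topic string for every document, while the view restores the readable form on demand.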

The "DOCUMENTS_AI_LAB_MY_VIEW_VIEW" is a view on top of our input data, with the addition of the Span identifier ("TEXT_DOC_ID", "TEXT_CHAR_BEGIN", "TEXT_CHAR_END") for the text column of each row. This can be used to join the original data with the preprocessing results.

Let's take a closer look at the results of the topic classification step in our preprocessing now. These can be found in the view "TOPIC_CLASSIFIER_VIEW".

### Topic Classifier View


In [70]:
%%sql
DESC TOPIC_CLASSIFIER_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
TOPIC,VARCHAR(2000000) UTF8,,,,
TOPIC_SCORE,DOUBLE,,,,
TOPIC_RANK,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


This view contains a Span identifying the classified document, the topic it was assigned, and a topic score. The score is the probability the classifier assigned to this topic for this text input, i.e. "how sure" the classifier is about the assigned topic.
The topic_rank ranks the topics for each source document by their topic_score. For our example, we had only two topics, so each document was assigned each of the topics, with different scores. The one with the higher score for a given document will have rank 1, the one with the lower score will have rank 2.
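
The ranking rule can be written down in a few lines. This sketch, with a hypothetical helper `rank_topics`, reproduces the rule described above for the scores of a single document; the two scores are those listed for document 3 in this notebook's output.

```python
def rank_topics(scores: dict[str, float]) -> dict[str, int]:
    """Assign rank 1 to the highest-scoring topic, rank 2 to the next, and so on."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {topic: rank for rank, topic in enumerate(ordered, start=1)}


ranks = rank_topics({"urgent": 0.1996518075466156, "not urgent": 0.800348162651062})
print(ranks)
```

To obtain only the winning topic per document, it is therefore enough to filter the view for rows whose rank is 1.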

There is also a column for error messages encountered during classification, as well as a "SETUP" column documenting which setup (i.e. model and model settings) was used to obtain this result.

As you remember, we wanted to use the classifier to sort our user tickets into "urgent" and "not urgent" issues. So those are the topics we expect to see in the results. Let's check how these results look:

In [71]:
%%sql
SELECT * FROM TOPIC_CLASSIFIER_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,topic,topic_score,topic_rank,error_message,setup
2,0,266,not urgent,0.5791531801223755,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
2,0,266,urgent,0.4208468496799469,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,259,not urgent,0.528149425983429,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,259,urgent,0.4718504846096039,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,275,not urgent,0.5050211548805237,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,275,urgent,0.4949788153171539,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,242,not urgent,0.800348162651062,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,242,urgent,0.1996518075466156,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"


Next, we look at the identified named entities for our input documents. These can be found in the "NAMED_ENTITY_VIEW".

### Named Entity View


In [72]:
%%sql
DESC NAMED_ENTITY_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
ENTITY_TYPE,VARCHAR(2000000) UTF8,,,,
ENTITY_SCORE,DOUBLE,,,,
ENTITY,VARCHAR(2000000) UTF8,,,,
ENTITY_DOC_ID,"DECIMAL(18,0)",,,,
ENTITY_CHAR_BEGIN,"DECIMAL(18,0)",,,,
ENTITY_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,


Similar to the "TOPIC_CLASSIFIER_VIEW", the "NAMED_ENTITY_VIEW" also has the Span ("TEXT_DOC_ID", "TEXT_CHAR_BEGIN", "TEXT_CHAR_END") identifying the input document the entity was found in. Then there is the named entity itself in the "ENTITY" column, as well as an entity type and an entity score. The entity type and entity score are assigned to the entity by the model. Additionally, we have an identifying Span for the entity itself: "ENTITY_DOC_ID", "ENTITY_CHAR_BEGIN", "ENTITY_CHAR_END". This Span represents exactly where in our input data this entity was found.

![A text with an ID number. The text contains the named entity subtext "GoPro Hero". From the ID, subtext begin and subtext end, arrows point to the ID, begin and end of the entity span.](./images/entity_span.drawio.png)

Since the named entity was found in the text identified by "TEXT_DOC_ID", "TEXT_CHAR_BEGIN" and "TEXT_CHAR_END", it follows that "TEXT_DOC_ID" = "ENTITY_DOC_ID" for a given row. Similarly, both "ENTITY_CHAR_BEGIN" and "ENTITY_CHAR_END" lie between "TEXT_CHAR_BEGIN" and "TEXT_CHAR_END". You can use these spans for further processing down the line. For example, when joined with the input data, especially in a case where an input document was split into multiple rows, the spans let you determine where an entity was found in relation to the whole document. You could also check how close together the named entities of one document are, and then check whether certain named entity clusters are indicative of different topics. However, this post-processing is not part of this tutorial.
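
As a rough illustration of the span-based post-processing described above, here is a minimal sketch in plain Python. The rows are copied from the "NAMED_ENTITY_VIEW" preview shown in this notebook. The helper names `entity_inside_text` and `gaps_between_entities` are hypothetical and not part of the Text AI Extension.

```python
from collections import defaultdict

# Sample rows copied from the NAMED_ENTITY_VIEW preview in this notebook:
# (text_doc_id, text_char_begin, text_char_end, entity,
#  entity_doc_id, entity_char_begin, entity_char_end)
rows = [
    (56, 0, 380, "Nintendo Switch Pro Controller", 56, 29, 59),
    (56, 0, 380, "Nintendo Switch Pro Controller", 56, 304, 334),
    (38, 0, 362, "Amazon Kindle", 38, 34, 47),
    (38, 0, 362, "Amazon", 38, 270, 276),
    (38, 0, 362, "Kindle", 38, 277, 283),
]


def entity_inside_text(row):
    """Check that the entity span lies inside the text span of the same document."""
    text_doc, text_begin, text_end, _, ent_doc, ent_begin, ent_end = row
    return text_doc == ent_doc and text_begin <= ent_begin <= ent_end <= text_end


def gaps_between_entities(rows):
    """Per document, return the character gap between neighbouring entities."""
    by_doc = defaultdict(list)
    for row in rows:
        by_doc[row[4]].append((row[5], row[6], row[3]))  # (begin, end, entity)
    gaps = {}
    for doc_id, spans in by_doc.items():
        spans.sort()  # order entities by their position in the document
        gaps[doc_id] = [
            (left[2], right[2], right[0] - left[1])
            for left, right in zip(spans, spans[1:])
        ]
    return gaps


print(all(entity_inside_text(row) for row in rows))
print(gaps_between_entities(rows))
```

A small gap, such as the one between "Amazon" and "Kindle" in document 38, hints that two entities might belong together, which is the kind of signal a clustering step could build on.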

The "NAMED_ENTITY_VIEW" also includes an error message column and a setup column like the "TOPIC_CLASSIFIER_VIEW" above.


In [73]:
%config SqlMagic.displaylimit = 10 # we set this lower so that only a preview of the views is shown

In [74]:
%%sql
SELECT TEXT_DOC_ID, 
    TEXT_CHAR_BEGIN, 
    TEXT_CHAR_END,
    ENTITY, 
    ENTITY_TYPE, 
    ENTITY_SCORE, 
    ENTITY_DOC_ID, 
    ENTITY_CHAR_BEGIN, 
    ENTITY_CHAR_END FROM NAMED_ENTITY_VIEW

text_doc_id,text_char_begin,text_char_end,entity,entity_type,entity_score,entity_doc_id,entity_char_begin,entity_char_end
56,0,380,Nintendo Switch Pro Controller,product_other,0.9391166567802428,56,29,59
56,0,380,Nintendo Switch Pro Controller,product_other,0.9385005831718444,56,304,334
38,0,362,Amazon Kindle,product_software,0.5349116325378418,38,34,47
38,0,362,Amazon,product_software,0.4779260158538818,38,270,276
38,0,362,Kindle,product_other,0.6612153649330139,38,277,283
82,0,349,Apple AirPods,product_other,0.9593544602394104,82,39,52
82,0,349,Apple AirPods,product_other,0.9604870080947876,82,290,303
16,0,336,GoPro Action Camera,product_other,0.9601884484291076,16,29,48
16,0,336,GoPro Action Camera,product_other,0.961948812007904,16,287,306
52,0,335,LG Smart TV,product_other,0.9153316020965576,52,24,35


### Keyword-Search View

Lastly, our preprocessing created a view containing the results of the keyword search step, the "KEYWORD_SEARCH_VIEW". It is structured similarly to the "NAMED_ENTITY_VIEW":

In [75]:
%%sql
DESC KEYWORD_SEARCH_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
KEYWORD,VARCHAR(2000000) UTF8,,,,
KEYWORD_SCORE,DOUBLE,,,,
KEYWORD_DOC_ID,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_BEGIN,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


The "TEXT_DOC_ID", "TEXT_CHAR_BEGIN" and "TEXT_CHAR_END" columns are again the input document span. But instead of an entity with an entity score and an entity span, we now have a keyword column, a keyword score and a span ("KEYWORD_DOC_ID", "KEYWORD_CHAR_BEGIN", "KEYWORD_CHAR_END") identifying where the keyword was found in the text. The view also has the "ERROR_MESSAGE" and "SETUP" columns.

In [76]:
%%sql
SELECT TEXT_DOC_ID, 
    TEXT_CHAR_BEGIN, 
    TEXT_CHAR_END,
    KEYWORD, 
    KEYWORD_SCORE, 
    KEYWORD_DOC_ID, 
    KEYWORD_CHAR_BEGIN, 
    KEYWORD_CHAR_END FROM KEYWORD_SEARCH_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,keyword,keyword_score,keyword_doc_id,keyword_char_begin,keyword_char_end
1,0,275,gopro hero,0.7937,1,29,39
1,0,275,billing zip code,0.7466,1,62,78
1,0,275,user manual,0.6997,1,239,250
1,0,275,email address,0.691,1,174,187
1,0,275,website address,0.6738,1,131,146
3,0,242,dell xps,0.7552,3,29,37
3,0,242,dell xps,0.7552,3,43,51
3,0,242,dell xps,0.7552,3,201,209
3,0,242,original charger,0.748,3,166,182
3,0,242,yesterday,0.7178,3,97,106


You might notice some seemingly duplicated keywords for a given document. But take a look at the keyword spans of those "duplicates". They are different. This means the same keyword was found multiple times in the same document.
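
To make the distinction between true duplicates and repeated occurrences concrete, the following sketch groups keyword rows by document and keyword and collects their spans. The rows are copied from the preview above. The helper name `occurrences_per_keyword` is hypothetical.

```python
from collections import defaultdict

# Sample rows copied from the KEYWORD_SEARCH_VIEW preview in this notebook:
# (text_doc_id, keyword, keyword_char_begin, keyword_char_end)
keyword_rows = [
    (3, "dell xps", 29, 37),
    (3, "dell xps", 43, 51),
    (3, "dell xps", 201, 209),
    (3, "original charger", 166, 182),
    (3, "yesterday", 97, 106),
]


def occurrences_per_keyword(rows):
    """Map (doc_id, keyword) to the sorted list of spans where it was found."""
    occurrences = defaultdict(list)
    for doc_id, keyword, begin, end in rows:
        occurrences[(doc_id, keyword)].append((begin, end))
    return {key: sorted(spans) for key, spans in occurrences.items()}


occurrences = occurrences_per_keyword(keyword_rows)
for (doc_id, keyword), spans in occurrences.items():
    print(doc_id, keyword, len(spans), spans)
```

Here "dell xps" appears three times in document 3, each time at a different position, so the three rows are distinct findings rather than duplicated results.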

### Result Summary

Here is an overview of the data model our preprocessing created.
    
![A diagram showing multiple table names with their respective columns. Starting at "MY_VIEW", flowing to "DOCUMENTS" and then to the three result views. The columns containing the text document span are highlighted.](./images/data_model.drawio.png)


## Adding data to source view

Now, let's try and run the preprocessing again, using the exact same input.

In [77]:
%%time
run_text_ai_preprocessing()

CPU times: user 90.8 ms, sys: 17.6 ms, total: 108 ms
Wall time: 1.62 s


See how quickly it runs this time? This is because the Text AI Extension does not recompute results that were already computed in previous runs. We can test this behaviour further. Let's add more entries to our dataset and see how long the preprocessing takes then.

So, in the next call let's double the data in our input view:

In [78]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}*2; """)


In [79]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
200


When we run the preprocessing again, you might expect this run to take twice as long as the first run we did. However, because the Text AI Extension skips results it has already computed, it should be much faster than that. For us, it takes slightly longer than the first run, but nowhere near twice the time.

In [80]:
%%time
run_text_ai_preprocessing()

CPU times: user 217 ms, sys: 32.1 ms, total: 249 ms
Wall time: 9min 58s


In [81]:
%%sql
SELECT COUNT (*) FROM DOCUMENTS;

COUNT(*)
200


Remember, the processing time is dependent on a lot of factors, such as the actual size of the data points, the batch size, parallelism per node, as well as available memory and number of nodes of the used Exasol Database. So the actual speedup you experience will differ from case to case.

If you want to experiment with this further, feel free to add even more data, for example. We do not demonstrate this in this notebook, because the calls take too long for demonstration purposes.
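
The general idea of skipping already processed data can be sketched in a few lines of Python. This is only a conceptual illustration under the assumption that documents are tracked by an ID. It is not the Text AI Extension's actual implementation, and `process_incrementally` is a hypothetical helper.

```python
def process_incrementally(source_ids, processed_ids, process):
    """Run `process` only for IDs that have no stored result yet."""
    new_ids = sorted(set(source_ids) - set(processed_ids))
    for doc_id in new_ids:
        process(doc_id)
    return new_ids


# Hypothetical stand-in for the stored results: a set of processed ticket IDs.
processed = set()

first = process_incrementally(range(1, 101), processed, processed.add)
repeat = process_incrementally(range(1, 101), processed, processed.add)
doubled = process_incrementally(range(1, 201), processed, processed.add)

print(len(first), len(repeat), len(doubled))  # 100 0 100
```

In this toy model, rerunning on the same 100 tickets does no work, and doubling the input to 200 tickets processes only the 100 new ones, which mirrors the behaviour observed in the runs above.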

## Audit Log

Lastly, let's look at the audit log table the Text AI Extension has generated for us. This table documents each run the extension performs on our Exasol Database. It contains information on runtime, how many data entries were used or created, and error messages. This can be very helpful if you suspect a problem with one of your pipelines and want to know where it is coming from, or if you are interested in seeing how much data came from a specific step or which of the pipeline steps is taking too long.


In [85]:
%config SqlMagic.displaylimit = 20

In [86]:
%%sql
DESC TXAIE_AUDIT_LOG

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
LOG_TIMESTAMP,TIMESTAMP(3),True,False,False,False
SESSION_ID,"DECIMAL(20,0)",True,False,False,False
RUN_ID,HASHTYPE(16 BYTE),True,False,False,False
ROW_COUNT,"DECIMAL(36,0)",True,False,False,False
LOG_SPAN_NAME,VARCHAR(2000000) UTF8,True,False,False,False
LOG_SPAN_ID,HASHTYPE(16 BYTE),True,False,False,False
PARENT_LOG_SPAN_ID,HASHTYPE(16 BYTE),True,False,False,False
EVENT_NAME,VARCHAR(128) UTF8,True,False,False,False
EVENT_ATTRIBUTES,VARCHAR(2000000) UTF8,True,False,False,False
DB_OBJECT_SCHEMA,VARCHAR(128) UTF8,True,False,False,False


In [87]:
from pandas import option_context
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    audit_log = conn.export_to_pandas(f"""
        SELECT RUN_ID,DB_OBJECT_NAME,EVENT_NAME,ROW_COUNT,LOG_TIMESTAMP FROM {schema}.TXAIE_AUDIT_LOG
    """)
    with option_context('display.max_rows', 20, 'display.max_colwidth', 1000):
        display(audit_log)

Unnamed: 0,RUN_ID,DB_OBJECT_NAME,EVENT_NAME,ROW_COUNT,LOG_TIMESTAMP
0,,,SourceTableQueryHandler_Start,,2025-06-26 12:19:44.244000
1,76e961e490474418b564bced6dca6e54,DOCUMENTS_AI_LAB_MY_VIEW,Begin,0.0,2025-06-26 12:19:44.326000
2,76e961e490474418b564bced6dca6e54,DOCUMENTS_AI_LAB_MY_VIEW,End,100.0,2025-06-26 12:19:44.375000
3,76e961e490474418b564bced6dca6e54,DOCUMENTS,Begin,0.0,2025-06-26 12:19:44.394000
4,76e961e490474418b564bced6dca6e54,DOCUMENTS,End,100.0,2025-06-26 12:19:44.446000
...,...,...,...,...,...
91,16d772ca76ef4b6eae14fb9f244f5acc,NAMED_ENTITY_LOOKUP_SETUP,End,1.0,2025-06-26 12:41:10.360000
92,16d772ca76ef4b6eae14fb9f244f5acc,NAMED_ENTITY,Begin,197.0,2025-06-26 12:41:10.365000
93,16d772ca76ef4b6eae14fb9f244f5acc,NAMED_ENTITY,End,384.0,2025-06-26 12:41:10.394000
94,,,UDFAlgo_Error,,2025-06-26 12:41:10.413000
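
To find out which step takes how long, one option is to pair the "Begin" and "End" events of each database object within a run. The sketch below does this in plain Python on rows copied from the output above. It assumes that each (RUN_ID, DB_OBJECT_NAME) pair has matching "Begin" and "End" events, which is inferred from the sample output and not a documented guarantee. The helper name `step_durations` is hypothetical.

```python
from datetime import datetime

# Sample rows copied from the audit log output in this notebook:
# (RUN_ID, DB_OBJECT_NAME, EVENT_NAME, LOG_TIMESTAMP)
log_rows = [
    ("76e961e490474418b564bced6dca6e54", "DOCUMENTS_AI_LAB_MY_VIEW", "Begin", "2025-06-26 12:19:44.326000"),
    ("76e961e490474418b564bced6dca6e54", "DOCUMENTS_AI_LAB_MY_VIEW", "End", "2025-06-26 12:19:44.375000"),
    ("76e961e490474418b564bced6dca6e54", "DOCUMENTS", "Begin", "2025-06-26 12:19:44.394000"),
    ("76e961e490474418b564bced6dca6e54", "DOCUMENTS", "End", "2025-06-26 12:19:44.446000"),
]


def step_durations(rows):
    """Return seconds between Begin and End per (run_id, db_object_name)."""
    begins, durations = {}, {}
    for run_id, db_object, event, timestamp in rows:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
        key = (run_id, db_object)
        if event == "Begin":
            begins[key] = moment
        elif event == "End" and key in begins:
            # Assumption: an End event closes the most recent Begin for this key.
            durations[key] = (moment - begins.pop(key)).total_seconds()
    return durations


for (run_id, db_object), seconds in step_durations(log_rows).items():
    print(run_id, db_object, seconds)
```

The same pairing could be applied to the full `audit_log` DataFrame, for example by iterating over its rows, to spot the slowest step of a run.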


## Addendum

Also keep in mind that the free text in a dataset may contain spelling errors, incomplete mentions or other quality issues. It may therefore need further processing steps to be at its most useful. However, we will not be demonstrating those here.