# Text AI Extension preprocessing

Here we demonstrate how to use the data preprocessing of the Text AI Extension. We take a dataset of customer support tickets, which contain unstructured data in the form of a ticket description. We then run our preprocessing to sort these tickets into "urgent" and "not urgent" cases, and to find important named entities and keywords within the text. The extracted information can then be used for data analysis in the [following notebook]().

We will also demonstrate the Text AI Extension's ability to determine whether data has already been processed, and to skip it if applicable.

## Prerequisites

Prior to using this notebook one needs to complete the following steps:

**Note**: To be able to store the models used in this demo, make sure you set the Disk Size of the database to at least 10 GiB in the AI-Lab configuration.

1. [Configure the AI-Lab](../main_config.ipynb).
2. [Initialize the Transformers Extension](../transformers/te_init.ipynb)
3. [Initialize the Text AI Extension](./txaie_init.ipynb)

## General Setup

As a first step, we need to get access to the AI-Lab secret store:

In [2]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))


Then we can get the activation SQL for our previously installed Script Language Containers (SLCs). It will be used to activate those SLCs in order to use their UDFs.

We also want to import some of the Python functions of the Text AI and notebook-connector modules.

In [67]:
from exasol.nb_connector.connections import open_pyexasol_connection
from exasol.nb_connector.language_container_activation import get_activation_sql

activation_sql = get_activation_sql(ai_lab_config)

In [68]:
from exasol.nb_connector.ai_lab_config import AILabConfig
from exasol.ai.text.extractors import *
from exasol.nb_connector.text_ai_extension_wrapper import LANGUAGE_ALIAS
from exasol.ai.text.extraction.extraction import *
from exasol.ai.text.extraction.abstract_extraction import *

The next call makes it possible to run SQL directly in this notebook, which makes it easier to display the results of our preprocessing. The cell after it sets the maximum number of rows our SQL statements display in the notebook.

In [69]:
%run ../utils/jupysql_init.ipynb

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [70]:
%config SqlMagic.displaylimit = 20

## Rationale and Motivation

Natural Language Processing (NLP) is the processing (e.g. classifying, or retrieving information) of so-called "unstructured data", that is, free and unannotated text.

Some NLP tasks seem easy to us humans but are very hard for a machine, for example inferring the opinion a speaker has about a topic (opinion extraction/mining). Doing these tasks on unannotated text is even harder. Therefore, multiple ways to annotate natural language text with additional information were developed. These annotated texts are then better suited for higher-level NLP tasks.

Depending on the amount of text to be processed, annotating by hand is usually not an option these days, since the resources needed quickly become unrealistic as datasets grow. Therefore, Exasol's Text AI provides preprocessing steps you can use to annotate your data in various ways.

In this notebook, we will show you our three default preprocessing pipeline steps. Of course, it is possible to define your own pipeline later on.
Let's explain these three steps before we dive into how to run the preprocessing.

### Topic Classification
                                                                                                                     
Topic Classification is the task of assigning topics to texts/documents/datapoints. A given set of topics is used, and each datapoint is assigned the best matching topic based on the probability the classification model calculates.
A topic in this context is an abstract category of text. Given that a document is about a particular topic, particular words are expected to appear in the document more or less frequently. However, the exact words describing the topic do not have to occur in the text. This means that topics can be inferred even if their name, description or synonyms are not found in the data.

![diagram a document text added topics](./images/topics.drawio.png)

Topic Classification assigns topics from a given set, and is usually trained using supervised learning. It can also be used with zero-shot classification models, which can assign classes/topics that have not been seen during training. This is opposed to other approaches like topic extraction, which is often unsupervised and does not need a list of topics as input, instead extracting them from the data itself.
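
To make "assigning the best matching topic" more concrete, here is a minimal sketch in plain Python. It assumes a classifier has already produced one raw score per candidate topic and turns these into probabilities with a softmax, which is one common way to do it. The scores and the `rank_topics` helper are hypothetical and not part of the Text AI Extension API.

```python
import math


def rank_topics(raw_scores):
    """Convert raw per-topic scores into probabilities and rank them.

    Hypothetical helper: returns (topic, probability, rank) tuples,
    best topic first, with ranks starting at 1.
    """
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(raw_scores.values())
    exp_scores = {topic: math.exp(score - max_score) for topic, score in raw_scores.items()}
    total = sum(exp_scores.values())
    probabilities = {topic: value / total for topic, value in exp_scores.items()}
    # Sort by probability, highest first.
    ordered = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(topic, prob, rank) for rank, (topic, prob) in enumerate(ordered, start=1)]


# Made-up raw scores for a single ticket.
ranking = rank_topics({"urgent": 0.2, "not urgent": 1.1})
best_topic = ranking[0][0]
print(ranking)
```

The topic with rank 1 is the one the document would be assigned to; the lower-ranked topics remain available if you want to see how close the decision was.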
                                                                                                                     
### Keyword Search/Extraction

Keyword Search is about identifying the most relevant words or phrases (keywords/keyphrases) in a given text.
These can then help in further steps, e.g. summarizing the content of texts and recognizing the main topics discussed.
Unlike topics, keywords and keyphrases need to be present in the text.

For example:

![diagram a document text with highlighted keywords](./images/keywords.drawio.png)
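
As a rough illustration of the idea, the following sketch picks keywords purely by word frequency after removing a few stop words. This is a deliberately naive stand-in and not the transformer model the extension uses; the `naive_keywords` helper, the stop word list and the sample text are all made up for this example. What it does share with the real step is that every keyword is taken from the text itself.

```python
import re
from collections import Counter

# Tiny, hand-picked stop word list (assumption, for illustration only).
STOPWORDS = {"the", "i", "m", "a", "an", "but", "and", "it", "s", "my", "with", "is"}


def naive_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-words of the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


sample = "The battery drains fast. I replaced the battery, but the battery still drains."
keywords = naive_keywords(sample, top_n=2)
print(keywords)
```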


### Named Entity Recognition

Named Entity Recognition (NER) is about locating and classifying so-called "named entities" mentioned in a text document. Depending on the model, entities are e.g. person names, organizations, locations, or vehicles, so "things that have names". The model seeks out those entities, returning their positions in the document as well as their class.

#### Example

As an example of what the output of these three steps might look like for a given document, consider the document "I'm having an issue with the GoPro Hero. It's affecting my productivity.". We may use a topic classifier with the input topic set "urgent, not urgent" to infer urgency from the ticket content. Then our output could look something like this:

![diagram showing document text with found entity and keyword and topic](./images/document_annotated.drawio.png)
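
The "position" of an entity in the example document can be expressed as a pair of character offsets. The sketch below locates the phrase "GoPro Hero" with a plain string search. The `Span` class and `find_span` helper are hypothetical names used only for illustration, and the end offset is assumed to be exclusive; a real NER model finds entities without being told the phrase in advance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    doc_id: int
    char_begin: int  # offset of the first character
    char_end: int    # offset one past the last character (assumption: exclusive)


def find_span(doc_id: int, text: str, phrase: str) -> Optional[Span]:
    """Locate the first occurrence of phrase in text, or return None."""
    begin = text.find(phrase)
    if begin == -1:
        return None
    return Span(doc_id, begin, begin + len(phrase))


document = "I'm having an issue with the GoPro Hero. It's affecting my productivity."
entity_span = find_span(1, document, "GoPro Hero")
print(entity_span, document[entity_span.char_begin:entity_span.char_end])
```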



## Get an example dataset

We will be using a dataset which holds information on customer support tickets. We will split this data into two sets, in order to demonstrate how the preprocessing tasks handle new data being added to a dataset.
But first we want to make sure the tables we want to use don't already exist, for example from a previous run of this notebook. Therefore, we are going to drop them.
First, we define a list of tables to drop:

In [71]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"
table="CUSTOMER_SUPPORT_TICKETS"
schema=ai_lab_config.db_schema

In [52]:
table_list = [
    "TXAIE_AUDIT_LOG",
    "DOCUMENTS",
    f"DOCUMENTS_{schema}_MY_VIEW",
    "NAMED_ENTITY",
    "NAMED_ENTITY_LOOKUP_ENTITY_TYPE",
    "NAMED_ENTITY_LOOKUP_SETUP",
    "KEYWORD_SEARCH",
    "KEYWORD_SEARCH_LOOKUP_KEYWORD",
    "KEYWORD_SEARCH_LOOKUP_SETUP",
    "TOPIC_CLASSIFIER",
    "TOPIC_CLASSIFIER_LOOKUP_TOPIC",
    "TOPIC_CLASSIFIER_LOOKUP_SETUP"
]


Next, define a function which drops these tables, as well as our main table. Then we call the function.

In [54]:
def delete_text_ai_preprocessing_tables():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        for drop_table in table_list:
            conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{drop_table}" """)
        conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{table}" """)

In [55]:
delete_text_ai_preprocessing_tables()

You can then load the data using [this notebook](../data/data_customer_support.ipynb). This loads the data into a table called "CUSTOMER_SUPPORT_TICKETS" found in the schema defined in the ai_lab_config variable db_schema.

In [56]:
%run ../data/data_customer_support.ipynb



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Path to dataset files: /home/jupyter/.cache/kagglehub/datasets/suraj520/customer-support-ticket-dataset/versions/1


In [57]:
%%sql
SELECT COUNT(*) FROM {{schema}}.{{table}};

COUNT(*)
8469


### Create a View on the data
This dataset has ~8000 entries. You could run the preprocessing for the whole dataset, but it would take quite some time. Instead, we will create a view containing only part of the dataset, and use this view as the base for our preprocessing.
We set the size of this view here. If you want to see how the AI-Lab handles bigger datasets on your Exasol instance, you can set "view_size" higher.

In [72]:
view="MY_VIEW"
view_size = 100 # <= 4234

In [59]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""DROP VIEW IF EXISTS "{schema}"."{view}"; """)
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}; """)


Let's check the size of our created view:

In [60]:
%%sql
SELECT COUNT(*) FROM {{schema}}.{{view}};

COUNT(*)
100


As you can see, we now have only our defined 100 datapoints to contend with.

Let's now see what our data contains:

In [73]:
%%sql
DESC {{schema}}.{{view}}

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TICKET_ID,"DECIMAL(18,0)",,,,
CUSTOMER_NAME,VARCHAR(2000000) UTF8,,,,
CUSTOMER_EMAIL,VARCHAR(2000000) UTF8,,,,
CUSTOMER_AGE,"DECIMAL(18,0)",,,,
CUSTOMER_GENDER,VARCHAR(2000000) UTF8,,,,
PRODUCT_PURCHASED,VARCHAR(2000000) UTF8,,,,
DATE_OF_PURCHASE,VARCHAR(2000000) UTF8,,,,
TICKET_TYPE,VARCHAR(2000000) UTF8,,,,
TICKET_SUBJECT,VARCHAR(2000000) UTF8,,,,
TICKET_DESCRIPTION,VARCHAR(2000000) UTF8,,,,


We can see a ticket ID column, as well as some columns containing information about the customer, like name and e-mail address. There is also a column containing the product the ticket is about, and then some metadata columns for the ticket itself.
TICKET_DESCRIPTION contains the actual text of the ticket, and RESOLUTION contains the resolution if there is one, and is otherwise empty.

In [23]:
%%sql
SELECT TICKET_ID,
    CUSTOMER_NAME,
    PRODUCT_PURCHASED,
    TICKET_SUBJECT, 
    TICKET_DESCRIPTION,
    RESOLUTION,
    CUSTOMER_SATISFACTION_RATING  
    FROM {{schema}}.{{view}} WHERE TICKET_ID < 6

ticket_id,customer_name,product_purchased,ticket_subject,ticket_description,resolution,customer_satisfaction_rating
1,Marisa Obrien,GoPro Hero,Product setup,"I'm having an issue with the {product_purchased}. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.",,
2,Jessica Rios,LG Smart TV,Peripheral compatibility,"I'm having an issue with the {product_purchased}. Please assist. If you need to change an existing product. I'm having an issue with the {product_purchased}. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.",,
3,Christopher Robbins,Dell XPS,Network problem,"I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly.",Case maybe show recently my computer follow.,3.0
4,Christina Dillon,Microsoft Office,Account access,"I'm having an issue with the {product_purchased}. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved.",Try capital clearly never color toward story.,3.0
5,Alexander Carroll,Autodesk AutoCAD,Data loss,I'm having an issue with the {product_purchased}. Please assist. Note: The seller is not responsible for any damages arising out of the delivery of the battleground game. Please have the game in good condition and shipped to you I've noticed a sudden decrease in battery life on my {product_purchased}. It used to last much longer.,West decision evidence bit.,1.0


## Download NLP Models

We will use several different transformer models to run our preprocessing. We will use [this notebook](./utils/txaie_default_models.ipynb) to download them from HuggingFace.

Simply run the next cell.
**This call will take some time to complete, depending on your internet connection. You will see some printed output once it is done.**

**Note**: If this operation fails with an Error indicating a lost connection, please increase the size of your Database and try again.

In [21]:
%run ./utils/txaie_default_models.ipynb

## Configure the Text-AI Pipeline

In the Text-AI-Extension, you define steps to run, and then place them in a Pipeline which orchestrates the data flow for you. In this Notebook we will be using a basic example using the default steps defined in the "StandardExtractor".
                                                                                                                    
### Configure defaults
                                                                                                                    
Here, we configure how our pipeline should be run. The right values depend on your database. We are using a rather small Docker-DB. Therefore we set the batch size to only 10, so only 10 rows will be processed at once per process, and our parallelism_per_node is also set low, at 2. parallelism_per_node determines how many parallel processes are run on each node of your database. If you have a bigger database to run this notebook on, you can experiment with setting both values higher than we have here.
The model repository is an object that describes where in the [BucketFS](https://docs.exasol.com/db/latest/database_concepts/bucketfs/bucketfs.htm) (Exasol's file system) the model files we downloaded earlier can be found. Here in AI-Lab, the Text AI Extension uses the same directory for the models as the Transformers Extension, because both use HuggingFace models. Therefore we will be using the same configuration here as in the Transformers Extension notebooks.

In [45]:
defaults = Defaults(
    parallelism_per_node=2,
    batch_size=10,
    model_repository=BucketFSRepository(
        connection_name = ai_lab_config.te_bfs_connection,
        sub_dir = ai_lab_config.te_models_bfs_dir
    )
)
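
To get a feeling for what these two settings mean for our 100-row view, here is a back-of-the-envelope calculation. It assumes that rows are simply cut into batches of `batch_size` and that the batches are spread evenly over `parallelism_per_node` processes on each node. The actual scheduling inside the extension may differ, and `estimate_batches` is a hypothetical helper, not part of the API.

```python
import math


def estimate_batches(row_count, batch_size, parallelism_per_node, node_count=1):
    """Estimate the number of batches and of sequential rounds (simplified model)."""
    total_batches = math.ceil(row_count / batch_size)
    # Number of processes that can work on batches at the same time.
    workers = parallelism_per_node * node_count
    rounds = math.ceil(total_batches / workers)
    return total_batches, rounds


total_batches, rounds = estimate_batches(row_count=100, batch_size=10, parallelism_per_node=2)
print(f"{total_batches} batches, processed in about {rounds} rounds")
```

Under this simplified model, raising either value reduces the number of rounds, at the cost of more memory per process or more processes per node.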

### Define the extractor

Now we need to define an extractor to run our extraction/preprocessing. We will use a StandardExtractor, which has three standard preprocessing steps built in: topic classification, keyword search and named entity recognition. It is possible to disable each of these steps in the StandardExtractor by setting its model to `None`, or to use a different model instead of the built-in one. But here we will use the StandardExtractor as is.

For the topics we give into our topic classification model we will use "urgent", and "not urgent".

In [46]:
topics={"urgent", "not urgent"}

std_extractor =  StandardExtractor(
                        # If you want to disable a step, set it to None:
                        # named_entity_recognition_model = None,
                        # topic_classification_model = None,
                        
                        # If you want to use a different(not default) model, set its name:
                        # keyword_search_model = HuggingFaceModel(name="MY_KEYWORD_SEARCH_MODEL"),
                        topics=topics
                    )

We will also need a SourceTableExtractor, which holds information on which data we want to use as the source for our preprocessing, and feeds it to the StandardExtractor.
We give it our schema and view as the data source, and tell it to run the preprocessing on the column TICKET_DESCRIPTION, since that is where the natural language text of our data is. We also tell it to use the TICKET_ID column as an id/key.

In [None]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"

sc_extractor = SourceTableExtractor(
                        sources=[
                            SchemaSource(
                                db_schema=NameSelector(pattern=schema),
                                tables=[
                                    TableSource(
                                        table=NameSelector(pattern=view),
                                        columns=[NameSelector(pattern=text_column)],
                                        keys=[NameSelector(pattern=key_column)]
                                    )
                                ]
                            )
                        ]
                    )

Now, we can give these two extractors as steps to a PipelineExtractor, which will build a Pipeline out of them:

In [None]:
p_extractor = PipelineExtractor(
                steps=[
                    sc_extractor,
                    std_extractor
                ]
            )

Next, we will wrap our PipelineExtractor in an extraction wrapper. This allows us to simply pass our secret store "ai_lab_config" as input; the wrapper then builds the necessary database connection and run function for us.

We feed it our PipelineExtractor as the extractor, tell it to put the Output into our schema, and also give it our run defaults.

In [43]:
%run ./utils/txaie_extraction_wrapper.ipynb

In [91]:
%run utils/txaie_init_ui.ipynb #todo do we want this ui in a seperate file?
display(get_txaie_SLC_name_ui(ai_lab_config)) #todo CKey.language_alias does not yet exist. use once made in NC
#todo this should get input "PYTHON3_TXAIE"

AttributeError: language_alias

In [47]:
extraction = ExtractionWrapper(extractor=p_extractor,
                               output=Output(db_schema=schema),
                               defaults=defaults)

Then the only step left is to define a convenience function which calls our preprocessing, and then run it in the next section.

In [49]:
def run_text_ai_preprocessing():
    extraction.run(ai_lab_config)

## Run the preprocessing

Time to run our preprocessing. First, let's verify how many entries our view has:

In [26]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
100


Then we call our preprocessing function. This will use our view as input, and produce new tables and views using the models we downloaded. 

Also take note of the time this operation takes on your setup.

In [62]:
%%time
run_text_ai_preprocessing()

CPU times: user 275 ms, sys: 46.1 ms, total: 321 ms
Wall time: 17min 4s


## Results

Now, we will take a look at some of the tables and views our preprocessing has created for us.
First, let's look at which tables were created by our preprocessing:


In [63]:
%%sql
SELECT TABLE_SCHEMA, TABLE_NAME FROM EXA_ALL_TABLES WHERE TABLE_SCHEMA='{{schema}}'

table_schema,table_name
AI_LAB,CUSTOMER_SUPPORT_TICKETS
AI_LAB,TXAIE_AUDIT_LOG
AI_LAB,DOCUMENTS
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW
AI_LAB,KEYWORD_SEARCH
AI_LAB,KEYWORD_SEARCH_LOOKUP_KEYWORD
AI_LAB,KEYWORD_SEARCH_LOOKUP_SETUP
AI_LAB,NAMED_ENTITY
AI_LAB,NAMED_ENTITY_LOOKUP_ENTITY_TYPE
AI_LAB,NAMED_ENTITY_LOOKUP_SETUP


As you can see, there are a number of new tables related to our preprocessing. There is our original data table CUSTOMER_SUPPORT_TICKETS, and a new log table TXAIE_AUDIT_LOG, which we will take a closer look at below. The DOCUMENTS table contains our input texts together with an identifying span; we will take a look at that as well. There is also a DOCUMENTS_AI_LAB_MY_VIEW table, which contains the IDs of the input texts and documents, as well as the column the input text originated from. This enables you to trace back documents (and their associated results) to the exact column and row of the input data view/table they originated from.

And then there are three tables per step of our preprocessing: a "step" table, a "lookup" table and a "setup" table. We won't look at them in detail, but there are also some generated views which contain a condensed version of the contained information. If you are curious, feel free to look at the contents of these tables on your own.

If we want to find out how these new tables are structured, we can get a description from the Exasol database. For example, let's see what the resulting DOCUMENTS table looks like.

### DOCUMENTS Table


In [125]:
%%sql
DESC DOCUMENTS

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",True,True,False,False
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",True,False,False,False
TEXT_CHAR_END,"DECIMAL(18,0)",True,False,False,False
TEXT,VARCHAR(2000000) UTF8,True,False,False,False


It looks like this table contains a text-document-id, a text-char-begin, a text-char-end and a text column.
The text column holds the text of the document. In case the content of one of our input datapoints does not fit within the VARCHAR limit of the text column, it gets split into multiple entries in the DOCUMENTS table. These will have the same text-doc-id, indicating they came from the same document. text-char-begin and text-char-end indicate which part of the original document each specific row contains. This trio of text-document-id, text-char-begin and text-char-end is called a "Span"; together they form an identifier for a section of text. You will encounter spans for a lot of text subsections. For example, keywords found in a text are also identified by a span in our result tables (see below).

The usage of these spans allows you to do various operations on top of these results, such as joining results on the document-id, or checking the order in which keywords appear in a document.
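
To illustrate how several rows with the same document id can describe one long document, here is a small sketch that cuts a text into fixed-size pieces and records a span for each piece. How the extension actually splits long texts is not described here; fixed-size splitting, the tiny `max_chars` value and the `split_into_spans` helper are assumptions made purely for this example.

```python
def split_into_spans(doc_id, text, max_chars):
    """Split text into rows of (doc_id, char_begin, char_end, chunk).

    char_end is exclusive, so chunk == text[char_begin:char_end].
    """
    rows = []
    for begin in range(0, len(text), max_chars):
        end = min(begin + max_chars, len(text))
        rows.append((doc_id, begin, end, text[begin:end]))
    return rows


# A made-up ten-character "document" with a limit of four characters per row.
rows = split_into_spans(7, "abcdefghij", max_chars=4)
for row in rows:
    print(row)

# Sorting by char_begin and concatenating the chunks restores the document.
restored = "".join(chunk for _, _, _, chunk in sorted(rows, key=lambda r: r[1]))
```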
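
We can also check the number of doc-ids in our table. Note that `COUNT(ALL ...)` counts every row with a non-null id, so a document that was split into several rows would be counted once per row: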

In [28]:
%%sql
SELECT COUNT(ALL text_doc_id) FROM {{schema}}.DOCUMENTS;

Count(DOCUMENTS.TEXT_DOC_ID)
100


It's identical to the number of rows in our input view, so all the data was converted successfully.

Now let's look at what the content of our table looks like:

In [126]:
%%sql
SELECT * FROM DOCUMENTS WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,TEXT
1,0,284,"I'm having an issue with the {product_purchased}. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists."
2,0,282,"I'm having an issue with the {product_purchased}. Please assist. If you need to change an existing product. I'm having an issue with the {product_purchased}. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly."
3,0,275,"I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly."
4,0,262,"I'm having an issue with the {product_purchased}. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved."


## Resulting Views

There are also some new views:

In [30]:
%%sql
SELECT VIEW_SCHEMA, VIEW_NAME FROM EXA_ALL_VIEWS

view_schema,view_name
AI_LAB,MY_VIEW
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW_VIEW
AI_LAB,NAMED_ENTITY_VIEW
AI_LAB,TOPIC_CLASSIFIER_VIEW
AI_LAB,KEYWORD_SEARCH_VIEW


These views contain the results of our three preprocessing steps, respectively. They are built on top of the resulting tables and collect useful information for your convenience.
The DOCUMENTS_AI_LAB_MY_VIEW_VIEW is a view on top of our input data, with the addition of the span identifier (TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END) for the text column of each row. This can be used to join the original data with the preprocessing results.

Let's now take a closer look at the results of the topic classification step in our preprocessing. These can be found in the view TOPIC_CLASSIFIER_VIEW.

### Topic Classifier View


In [49]:
%%sql
DESC TOPIC_CLASSIFIER_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
TOPIC,VARCHAR(2000000) UTF8,,,,
TOPIC_SCORE,DOUBLE,,,,
TOPIC_RANK,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


This view contains a span identifying the classified document, the topic it was assigned, and a topic score. The score is the probability the classifier assigned to this topic for this text input, so "how sure" the classifier is about the assigned topic.
The topic_rank ranks the topics for each source document by their topic_score. For our example we had only two topics, so each document was assigned each of the topics, with different scores. The one with the higher score for a given document has rank 1, the one with the lower score has rank 2.

There is also a column for error messages encountered during classification, as well as a "setup" column documenting which setup (i.e. model, model settings) was used to obtain this result.

As you remember, we wanted to use the classifier to sort our user tickets into "urgent" and "not urgent" cases. So those are the topics we expect to see in the results. Let's check how these results look:

In [31]:
%%sql
SELECT * FROM TOPIC_CLASSIFIER_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,topic,topic_score,topic_rank,error_message,setup
2,0,282,not urgent,0.5622634887695312,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
2,0,282,urgent,0.4377365112304687,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,262,not urgent,0.5839278101921082,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,262,urgent,0.4160721898078918,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,284,not urgent,0.5433771014213562,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,284,urgent,0.4566228985786438,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,275,not urgent,0.8373355865478516,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,275,urgent,0.1626643985509872,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""not urgent"", ""urgent""], ""hypothesis_template"": null, ""multi_label"": false}}"
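
If you fetch such rows into Python, picking the winning topic per document is a matter of keeping the rows with rank 1. The sketch below uses the values from the output above, with scores rounded to four decimals and the remaining columns left out. The 0.8 threshold for a "confident" decision is an arbitrary choice for illustration.

```python
# (text_doc_id, topic, topic_score, topic_rank), taken from the output above.
rows = [
    (2, "not urgent", 0.5623, 1), (2, "urgent", 0.4377, 2),
    (4, "not urgent", 0.5839, 1), (4, "urgent", 0.4161, 2),
    (1, "not urgent", 0.5434, 1), (1, "urgent", 0.4566, 2),
    (3, "not urgent", 0.8373, 1), (3, "urgent", 0.1627, 2),
]

# Keep only the best-ranked topic for every document.
top_topic = {doc_id: (topic, score) for doc_id, topic, score, rank in rows if rank == 1}

# Documents where the classifier is fairly sure (threshold is an assumption).
confident_docs = sorted(doc_id for doc_id, (_, score) in top_topic.items() if score >= 0.8)

print(top_topic)
print(confident_docs)
```

For these four tickets, only document 3 passes the threshold; the other three decisions are close to a coin flip, which is worth keeping in mind when using the topic for further analysis.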


Next, we look at the identified named entities for our input documents. These can be found in the NAMED_ENTITY_VIEW.

### Named Entity View


In [32]:
%%sql
DESC NAMED_ENTITY_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
ENTITY_TYPE,VARCHAR(2000000) UTF8,,,,
ENTITY_SCORE,DOUBLE,,,,
ENTITY,VARCHAR(2000000) UTF8,,,,
ENTITY_DOC_ID,"DECIMAL(18,0)",,,,
ENTITY_CHAR_BEGIN,"DECIMAL(18,0)",,,,
ENTITY_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,


Similar to the TOPIC_CLASSIFIER_VIEW, the NAMED_ENTITY_VIEW also has the span (TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END) identifying the input document the entity was found in. Then there is the found named entity itself in the "ENTITY" column, as well as the entity type and the entity score the model assigned to the entity. Additionally, we have an identifying span for the entity itself: ENTITY_DOC_ID, ENTITY_CHAR_BEGIN, ENTITY_CHAR_END. This span represents exactly where in our input data the entity was found.

![a text with an id number. the text containings the named entity subtext "GoPro Hero". from the id, subtext begin and subtext end arrows are pointing to the id,begin,end of the entity span.](./images/entity_span.drawio.png)

Since the entity was found in the text identified by "TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END", it follows that TEXT_DOC_ID=ENTITY_DOC_ID for a given row. Similarly, both ENTITY_CHAR_BEGIN and ENTITY_CHAR_END lie between TEXT_CHAR_BEGIN and TEXT_CHAR_END. You can use these spans for further processing down the line. For example, if joined with the input data, especially in a case where an input document was split into multiple rows, this lets you determine where an entity was found in relation to the whole document. Or you could check how close together named entities of the same document were found, and then check whether certain named entity clusters correlate with different topics. However, this post-processing is not part of this tutorial.

The NAMED_ENTITY_VIEW also includes an error message column and a setup column like the TOPIC_CLASSIFIER_VIEW above. These should however be empty.

In [34]:
%config SqlMagic.displaylimit = 10 # we set this lower so that only a preview of the views is shown

In [36]:
%%sql
SELECT TEXT_DOC_ID, 
    TEXT_CHAR_BEGIN, 
    TEXT_CHAR_END,
    ENTITY, 
    ENTITY_TYPE, 
    ENTITY_SCORE, 
    ENTITY_DOC_ID, 
    ENTITY_CHAR_BEGIN, 
    ENTITY_CHAR_END FROM NAMED_ENTITY_VIEW

text_doc_id,text_char_begin,text_char_end,entity,entity_type,entity_score,entity_doc_id,entity_char_begin,entity_char_end
27,0,338,Windows Vista,product_software,0.8014147877693176,27,176,189
29,0,334,Microsoft,organization_company,0.7323066592216492,29,135,144
55,0,326,U.S,location_GPE,0.9387378692626952,55,162,165
45,0,324,3DS,product_other,0.6588281989097595,45,92,95
17,0,315,Dan,person_other,0.9543948769569396,17,95,98
39,0,303,ZEROHITS,organization_company,0.3515172004699707,39,143,151
13,0,281,CQW,person_other,0.4729739725589752,13,66,69
13,0,281,,person_other,0.4319415986537933,13,120,121
21,0,233,Microsoft Surface Pro,product_other,0.7595992684364319,21,75,96
19,0,229,YouTube,product_software,0.716133177280426,19,115,122


### Keyword-Search View

Lastly, our preprocessing created a view containing the results of the keyword search step, the KEYWORD_SEARCH_VIEW. It is structured similarly to the NAMED_ENTITY_VIEW:

In [53]:
%%sql
DESC KEYWORD_SEARCH_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
KEYWORD,VARCHAR(2000000) UTF8,,,,
KEYWORD_SCORE,DOUBLE,,,,
KEYWORD_DOC_ID,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_BEGIN,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


TEXT_DOC_ID, TEXT_CHAR_BEGIN and TEXT_CHAR_END are again the input document span. But instead of an entity with an entity score and an entity span, we now have a keyword column, a keyword score and a span (KEYWORD_DOC_ID, KEYWORD_CHAR_BEGIN, KEYWORD_CHAR_END) identifying the found keyword in the text. Finally, there are again the error message and setup columns.

In [37]:
%%sql
SELECT TEXT_DOC_ID, 
    TEXT_CHAR_BEGIN, 
    TEXT_CHAR_END,
    KEYWORD, 
    KEYWORD_SCORE, 
    KEYWORD_DOC_ID, 
    KEYWORD_CHAR_BEGIN, 
    KEYWORD_CHAR_END FROM KEYWORD_SEARCH_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,keyword,keyword_score,keyword_doc_id,keyword_char_begin,keyword_char_end
2,0,282,product_purchased,0.8376,2,30,47
2,0,282,product_purchased,0.8376,2,140,157
2,0,282,product,0.7319,2,100,107
2,0,282,other times,0.7032,2,246,257
2,0,282,issue,0.4067,2,14,19
2,0,282,issue,0.4067,2,124,129
2,0,282,issue,0.4067,2,183,188
4,0,262,product_purchased,0.8166,4,30,47
4,0,262,feedback,0.7097,4,163,171
4,0,262,multiple times,0.6962,4,213,227


You might notice some seemingly duplicate keywords for a given document. But take a look at the keyword spans of those "duplicates". They are different. This means the same keyword was found multiple times in the same document.
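
To make these repeated occurrences visible, one option is to group the keyword spans per document and keyword. The following is a minimal sketch in plain Python that uses the ten rows from the output above (only the document id, keyword and keyword span columns); the variable names are chosen for illustration only.

```python
from collections import defaultdict

# (text_doc_id, keyword, keyword_char_begin, keyword_char_end), copied from the output above
rows = [
    (2, "product_purchased", 30, 47),
    (2, "product_purchased", 140, 157),
    (2, "product", 100, 107),
    (2, "other times", 246, 257),
    (2, "issue", 14, 19),
    (2, "issue", 124, 129),
    (2, "issue", 183, 188),
    (4, "product_purchased", 30, 47),
    (4, "feedback", 163, 171),
    (4, "multiple times", 213, 227),
]

# Collect all spans per (document, keyword) pair.
positions = defaultdict(list)
for doc_id, keyword, begin, end in rows:
    positions[(doc_id, keyword)].append((begin, end))

# Keep only keywords that occur more than once in the same document.
repeated = {key: spans for key, spans in positions.items() if len(spans) > 1}
for (doc_id, keyword), spans in repeated.items():
    print(doc_id, keyword, spans)
```

Each listed span is a separate occurrence of the keyword in the document, which is why the rows are not true duplicates.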

### Result Summary

Here is an overview of the data model our preprocessing created.
    
![A diagram showing multiple table names with their respective columns. Starting at "MY_VIEW", flowing to "DOCUMENTS" and then to the three result views. The columns containing the text document span are highlighted.](./images/data_model.drawio.png)


## Adding data to source view

Now, let's try and run the preprocessing again, using the exact same input.

In [36]:
%%time
run_text_ai_preprocessing()

CPU times: user 89.5 ms, sys: 4.6 ms, total: 94.1 ms
Wall time: 1.54 s


See how quickly it runs this time? This is because the Text AI Extension does not recompute results that were already computed in previous runs. We can test this behaviour further. Let's add more entries to our dataset and see how long the preprocessing takes then.

So, in the next call, let's double the data in our input view:

In [37]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}*2; """)


In [38]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
200


When we run the preprocessing again, you might expect this run to take twice as long as the first run we did. However, thanks to the way the Text AI Extension is implemented, you should see that it is actually much faster than that. For us, it takes slightly longer than the first run, but nowhere near twice the time.

In [40]:
%%time
run_text_ai_preprocessing()

CPU times: user 316 ms, sys: 51.2 ms, total: 368 ms
Wall time: 19min 43s


In [42]:
%%sql
SELECT COUNT (*) FROM DOCUMENTS;

COUNT(*)
200


Remember, the processing time depends on a lot of factors, such as the actual size of the data points, the batch size, the parallelism per node, as well as the available memory and the number of nodes of your Exasol database. So the actual speedup you experience will differ from case to case.

If you want to experiment with this further, feel free to, for example, add even more data. We did not demonstrate this in this notebook because the calls take too long for demonstration purposes.
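
To picture why a run over unchanged input is cheap, here is a purely conceptual sketch in plain Python. It is not how the Text AI Extension is implemented: the function `process_new_documents`, the `fake_model` stand-in and the id-based bookkeeping are assumptions made for illustration only. The idea shown is simply that results are stored per document, and only documents without a stored result are sent to the expensive step.

```python
from typing import Callable, Dict


def process_new_documents(
    documents: Dict[int, str],
    results: Dict[int, str],
    model: Callable[[str], str],
) -> int:
    """Run `model` only on documents without a stored result; return how many were processed."""
    processed = 0
    for doc_id, text in documents.items():
        if doc_id in results:
            continue  # already computed in an earlier run, so skip it
        results[doc_id] = model(text)
        processed += 1
    return processed


def fake_model(text: str) -> str:
    # Hypothetical stand-in for an expensive model call.
    return text.upper()


results: Dict[int, str] = {}
docs = {i: f"ticket {i}" for i in range(1, 101)}

first_run = process_new_documents(docs, results, fake_model)   # all 100 documents are new
second_run = process_new_documents(docs, results, fake_model)  # same input, nothing to do

docs.update({i: f"ticket {i}" for i in range(101, 201)})       # double the input
third_run = process_new_documents(docs, results, fake_model)   # only the 100 new documents

print(first_run, second_run, third_run)
```

In this toy model the third run costs about as much as the first one, even though the input is twice as large, because the first half is skipped.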

## Audit Log

Lastly, let's look at the audit log table the Text AI Extension has generated for us. This table documents each run the Text AI Extension performs on your Exasol database. It contains information on runtime, how many data entries were used or created, and error messages. This can be very helpful if you suspect a problem with one of your pipelines and want to know where it is coming from, or if you are interested in seeing how much data came from a specific step, or which of your steps is taking so long.

In [25]:
%%sql
DESC TXAIE_AUDIT_LOG

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
LOG_TIMESTAMP,TIMESTAMP(3),True,False,False,False
SESSION_ID,"DECIMAL(20,0)",True,False,False,False
RUN_ID,HASHTYPE(16 BYTE),True,False,False,False
ROW_COUNT,"DECIMAL(36,0)",True,False,False,False
LOG_SPAN_NAME,VARCHAR(2000000) UTF8,True,False,False,False
LOG_SPAN_ID,HASHTYPE(16 BYTE),True,False,False,False
PARENT_LOG_SPAN_ID,HASHTYPE(16 BYTE),True,False,False,False
EVENT_NAME,VARCHAR(128) UTF8,True,False,False,False
EVENT_ATTRIBUTES,VARCHAR(2000000) UTF8,True,False,False,False
DB_OBJECT_SCHEMA,VARCHAR(128) UTF8,True,False,False,False


In [None]:
# %%sql
# SELECT * FROM TXAIE_AUDIT_LOG  # TODO: this query currently raises an error because of the HASHTYPE columns

In [28]:
from pandas import option_context
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    audit_log = conn.export_to_pandas(f"""
        SELECT RUN_ID,DB_OBJECT_NAME,EVENT_NAME,ROW_COUNT,LOG_TIMESTAMP FROM {schema}.TXAIE_AUDIT_LOG
    """)
    with option_context('display.max_rows', 20, 'display.max_colwidth', 1000):
        display(audit_log)

Unnamed: 0,RUN_ID,DB_OBJECT_NAME,EVENT_NAME,ROW_COUNT,LOG_TIMESTAMP
0,,,SourceTableQueryHandler_Start,,2025-06-13 12:06:40.034000
1,dfe6e45b94544526ab65e4e756aa1ed9,DOCUMENTS_AI_LAB_MY_VIEW,Begin,0.0,2025-06-13 12:06:40.105000
2,dfe6e45b94544526ab65e4e756aa1ed9,DOCUMENTS_AI_LAB_MY_VIEW,End,100.0,2025-06-13 12:06:40.173000
3,dfe6e45b94544526ab65e4e756aa1ed9,DOCUMENTS,Begin,0.0,2025-06-13 12:06:40.177000
4,dfe6e45b94544526ab65e4e756aa1ed9,DOCUMENTS,End,100.0,2025-06-13 12:06:40.235000
...,...,...,...,...,...
130,c0b9ffbb601c4418a8f2881af3d0e188,KEYWORD_SEARCH_LOOKUP_SETUP,End,1.0,2025-06-13 13:09:47.342000
131,c0b9ffbb601c4418a8f2881af3d0e188,KEYWORD_SEARCH,Begin,1196.0,2025-06-13 13:09:47.348000
132,c0b9ffbb601c4418a8f2881af3d0e188,KEYWORD_SEARCH,End,1799.0,2025-06-13 13:09:47.394000
133,,,UDFAlgo_Error,,2025-06-13 13:09:47.412000
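
One way to use the audit log is to pair each `Begin` event with its `End` event and look at the elapsed time and the change in `ROW_COUNT`. The following is a minimal sketch in plain Python using four rows copied from the output above. The helper `summarize_steps` is made up for illustration, and reading the `ROW_COUNT` difference as "rows added by the step" is an assumption about the log's semantics, not something this notebook confirms.

```python
from datetime import datetime

# (RUN_ID, DB_OBJECT_NAME, EVENT_NAME, ROW_COUNT, LOG_TIMESTAMP), copied from the output above
rows = [
    ("dfe6e45b94544526ab65e4e756aa1ed9", "DOCUMENTS", "Begin", 0.0, "2025-06-13 12:06:40.177000"),
    ("dfe6e45b94544526ab65e4e756aa1ed9", "DOCUMENTS", "End", 100.0, "2025-06-13 12:06:40.235000"),
    ("c0b9ffbb601c4418a8f2881af3d0e188", "KEYWORD_SEARCH", "Begin", 1196.0, "2025-06-13 13:09:47.348000"),
    ("c0b9ffbb601c4418a8f2881af3d0e188", "KEYWORD_SEARCH", "End", 1799.0, "2025-06-13 13:09:47.394000"),
]


def summarize_steps(rows):
    """Pair Begin/End events per (run, object); return duration in ms and row-count change."""
    begins = {}
    summary = {}
    for run_id, obj, event, row_count, ts in rows:
        stamp = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
        key = (run_id, obj)
        if event == "Begin":
            begins[key] = (stamp, row_count)
        elif event == "End" and key in begins:
            start, start_rows = begins.pop(key)
            duration_ms = round((stamp - start).total_seconds() * 1000)
            summary[key] = (duration_ms, row_count - start_rows)
    return summary


summary = summarize_steps(rows)
for (run_id, obj), (duration_ms, changed) in summary.items():
    print(f"{obj}: {duration_ms} ms, row count changed by {changed:.0f}")
```

Applied to the full `audit_log` data frame, the same pairing could help spot which step of a run takes the longest.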


## Addendum

The text may contain spelling errors or incomplete mentions, so the results might need postprocessing.