# Text AI Extension preprocessing

Here we will demonstrate how the Text AI Extension data-preprocessing can be used.

        # todo short summary

## Prerequisites

Prior to using this notebook, complete the following steps:

1. [Configure the AI-Lab](../main_config.ipynb).
2. [Initialize the Text AI Extension](./txaie_init.ipynb).
3. [Initialize the Transformers Extension](../transformers/te_init.ipynb).

**Note**: To be able to store the models used in this demo, make sure you set the Disk Size of the database to at least 10 GiB in the AI-Lab configuration.

## Activate the Text AI Extension SLC

In [1]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))


In [90]:
from exasol.nb_connector.connections import open_pyexasol_connection
from exasol.nb_connector.language_container_activation import get_activation_sql

activation_sql = get_activation_sql(ai_lab_config)

triggers preprocessing to create the text annotations and text extraction.



# Rationale/Explanation/Motivation

    Explain wich options and stuff we have

Some tasks in Natural Language Processing (NLP) seem easy to us humans but are very hard for a machine. One example is inferring the opinion a speaker has about a topic (opinion extraction/mining). Doing these tasks on un-annotated text is even harder. Therefore, multiple ways to annotate natural-language text with additional information were developed. These annotated texts are then better suited for higher-level NLP tasks.

Depending on the amount of text to be processed, annotating by hand is usually not an option these days, since the resources needed quickly become unrealistic as dataset sizes grow. Therefore, Exasol's Text AI provides preprocessing steps you can use to annotate your data in various ways.

In this notebook, we will show you our three default preprocessing pipeline steps. Of course, you can define your own pipeline later on.
Let's explain these three steps before we dive into how to run the preprocessing.

### Topic Classification
                                                                                                                     
Topic Classification is the task of assigning topics to texts/documents/datapoints. In Topic Classification, a given set of topics is used, and each datapoint is assigned the best-matching topic based on the probability the classification model calculates.
A topic in this context is an abstract category of text. Given that a document is about a particular topic, particular words are expected to appear in the document more or less frequently. However, the exact words describing the topic do not have to appear in the text. This means that topics can be inferred even if their names, descriptions or synonyms are not found in the data.

For example:
    TextDocument: Elon Musk has shared a photo of the spacesuit designed by SpaceX. This is the second image shared of the new design and the first to feature the spacesuit’s full-body look.
    Possible Topics: space flight, celebrity

Topic Classification assigns topics from a given set and is usually trained using supervised learning. It can also be used with Zero-Shot Classification models, which can assign classes/topics that have not been seen during training. This is in contrast to other approaches like topic extraction, which is often unsupervised and does not need a list of topics as input, instead extracting them from the data itself.
                                                                                                                     
### Keyword Search/Extraction
                                                                               
Keyword Search is about identifying the most relevant words or phrases (keywords/keyphrases) in a given text.
These can then help in further steps, e.g. summarizing the content of texts and recognizing the main topics discussed.
Unlike topics, the keywords or phrases need to be present in the text.

For example:
	TextDocument: Elon Musk has shared a photo of the spacesuit designed by SpaceX. This is the second image shared of the new design and the first to feature the spacesuit’s full-body look.
	Keywords: elon musk, second image, spacesuit, body look, new design, photo, spacex
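
Because keywords have to occur in the text, each one can be mapped back to a character position. The following is a minimal sketch of that idea, not the extension's implementation: the helper `find_keyword_spans` is hypothetical, and the 0-based, end-exclusive offsets are an assumption made for this illustration.

```python
import re

def find_keyword_spans(text, keywords):
    """Return (keyword, begin, end) for every case-insensitive occurrence.

    Offsets are 0-based and end-exclusive, so text[begin:end] is the match.
    Illustrative helper only, not part of the Text AI Extension.
    """
    spans = []
    for keyword in keywords:
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            spans.append((keyword, match.start(), match.end()))
    # Sort by position so the keywords appear in reading order.
    return sorted(spans, key=lambda span: span[1])

text = (
    "Elon Musk has shared a photo of the spacesuit designed by SpaceX. "
    "This is the second image shared of the new design and the first to "
    "feature the spacesuit's full-body look."
)
keywords = ["elon musk", "second image", "spacesuit", "body look",
            "new design", "photo", "spacex"]

for keyword, begin, end in find_keyword_spans(text, keywords):
    print(f"{begin:>3}-{end:<3} {keyword!r} -> {text[begin:end]!r}")
```

Note that "spacesuit" is reported twice, because it occurs twice in the text.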

### Named Entity Recognition

Named entity recognition (NER) is about locating and classifying so-called "named entities" mentioned in a text document. Depending on the model, entities are e.g. person names, organizations, locations or vehicles, so "things that have names". The model seeks out those entities, returning their positions in the document as well as their class.

    

As an example of what the output of these three steps might look like for a given document, consider the document "I'm having an issue with the GoPro Hero. It's affecting my productivity.". We may use a topic classifier with the input topic set "Low, Normal, Urgent, Critical" to infer urgency from the ticket content. Our output could then be:

Document "I'm having an issue with the GoPro Hero. It's affecting my productivity."

Topic: "Urgent"

Entities: "GoPro Hero"

Keywords: "productivity"

# todo desc how we do these steps, what output looks like
 # todo show grafic about the spans, see präsi

# Further setup


In [17]:
pip uninstall -y exasol-text-ai-extension

Note: you may need to restart the kernel to use updated packages.


In [3]:
from exasol.nb_connector.ai_lab_config import AILabConfig
from exasol.ai.text.extractors import *
from exasol.nb_connector.text_ai_extension_wrapper import LANGUAGE_ALIAS
from exasol.ai.text.extraction.extraction import *
from exasol.ai.text.extraction.abstract_extraction import *

In [4]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"


The next call makes it possible to run SQL directly in this notebook, which makes it easier to display the results of our preprocessing.

In [106]:
%run ../utils/jupysql_init.ipynb

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [107]:
%config SqlMagic.displaylimit = 15

## Get an example dataset

We will be using a dataset which holds information on customer support tickets. We will split this data into two sets in order to demonstrate how the preprocessing tasks handle new data being added to a dataset.
But first we want to make sure the tables we want to use don't already exist, for example from a previous run of this notebook. Therefore, we are going to drop them.
First, we define a list of tables to drop:

In [98]:
table="CUSTOMER_SUPPORT_TICKETS"
schema=ai_lab_config.db_schema

In [99]:
table_list = [
    "TXAIE_AUDIT_LOG",
    "DOCUMENTS",
    f"DOCUMENTS_{schema}_MY_VIEW",
    "NAMED_ENTITY",
    "NAMED_ENTITY_LOOKUP_ENTITY_TYPE",
    "NAMED_ENTITY_LOOKUP_SETUP",
    "KEYWORD_SEARCH",
    "KEYWORD_SEARCH_LOOKUP_KEYWORD",
    "KEYWORD_SEARCH_LOOKUP_SETUP",
    "TOPIC_CLASSIFIER",
    "TOPIC_CLASSIFIER_LOOKUP_TOPIC",
    "TOPIC_CLASSIFIER_LOOKUP_SETUP"
]


Next, we define a function which drops these tables as well as our main table. Then we call the function.

In [100]:

def delete_text_ai_preprocessing_tables():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        for drop_table in table_list:
            conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{drop_table}" """)
        conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{table}" """)

In [101]:
delete_text_ai_preprocessing_tables()

You can then load the data using [this notebook](../data/data_customer_support.ipynb). This loads the data into a table called "CUSTOMER_SUPPORT_TICKETS" found in the schema defined in the ai_lab_config variable db_schema.

In [102]:
%run ../data/data_customer_support.ipynb

Path to dataset files: /home/jupyter/.cache/kagglehub/datasets/suraj520/customer-support-ticket-dataset/versions/1


In [108]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{table}};

Count(CUSTOMER_SUPPORT_TICKETS.TICKET_ID)
8469


This dataset has ~8000 entries. You could run the preprocessing for the whole dataset, but it would take quite some time. Instead, we will create a view containing only part of the dataset and use this view as the base for our preprocessing.
We set the size of this view here. If you want to see how the AI-Lab handles bigger datasets on your Exasol instance, you can set `view_size` higher.

In [109]:
view="MY_VIEW"
view_size = 100 # <= 4234

In [110]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""DROP VIEW IF EXISTS "{schema}"."{view}"; """)
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}; """)


Let's check the size of our created view:

In [117]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
100


And now, let's see what our data contains.

In [19]:
%%sql
SELECT * FROM {{schema}}.{{view}} WHERE TICKET_ID < 5

ticket_id,customer_name,customer_email,customer_age,customer_gender,product_purchased,date_of_purchase,ticket_type,ticket_subject,ticket_description,ticket_status,resolution,ticket_priority,ticket_channel,first_response_time,time_to_resolution,customer_satisfaction_rating
1,Marisa Obrien,carrollallison@example.com,32,Other,GoPro Hero,2021-03-22,Technical issue,Product setup,"I'm having an issue with the {product_purchased}. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.",Pending Customer Response,,Critical,Social media,2023-06-01 12:15:36,,
2,Jessica Rios,clarkeashley@example.com,42,Female,LG Smart TV,2021-05-22,Technical issue,Peripheral compatibility,"I'm having an issue with the {product_purchased}. Please assist. If you need to change an existing product. I'm having an issue with the {product_purchased}. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.",Pending Customer Response,,Critical,Chat,2023-06-01 16:45:38,,
3,Christopher Robbins,gonzalestracy@example.com,48,Other,Dell XPS,2020-07-14,Technical issue,Network problem,"I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly.",Closed,Case maybe show recently my computer follow.,Low,Social media,2023-06-01 11:14:38,2023-06-01 18:05:38,3.0
4,Christina Dillon,bradleyolson@example.org,27,Female,Microsoft Office,2020-11-13,Billing inquiry,Account access,"I'm having an issue with the {product_purchased}. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved.",Closed,Try capital clearly never color toward story.,Low,Social media,2023-06-01 07:29:40,2023-06-01 01:57:40,3.0


    ### todo some words about the data?

# Get Models

We will use several different transformer models to run our preprocessing. We will use [this notebook](./utils/txaie_default_models.ipynb) to download them from Hugging Face.

Simply run the next cell.
This call will take some time to complete, depending on your internet connection.

**Note**: If this operation fails with an error indicating a lost connection, please increase the size of your database and try again.

In [112]:
%run ./utils/txaie_default_models.ipynb

## Define which steps to run



## Configure defaults

In [118]:
defaults = Defaults(
    parallelism_per_node=2,
    batch_size=10,
    model_repository=BucketFSRepository(
        connection_name = ai_lab_config.te_bfs_connection,
        sub_dir = ai_lab_config.te_models_bfs_dir
    )
)

## Define extractor


In [119]:
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"
topics={"urgent", "not urgent"}

extractor = PipelineExtractor(
                steps=[
                    SourceTableExtractor(
                        sources=[
                            SchemaSource(
                                db_schema=NameSelector(pattern=schema),
                                tables=[
                                    TableSource(
                                        table=NameSelector(pattern=view),
                                        columns=[NameSelector(pattern=text_column)],
                                        keys=[NameSelector(pattern=key_column)]
                                    )
                                ]
                            )
                        ]
                    ),
                    StandardExtractor(
                        # named_entity_recognition_model = None, # None means disabled
                        # topic_classification_model = None,
                        # Use a different model
                        # keyword_search_model = HuggingFaceModel(name="MY_KEYWORD_SEARCH_MODEL"),
                        topics=topics
                    )
                ]
            )


In [90]:
#%run ./utils/txaie_extraction_wrapper.ipynb #todo still use

 here we explain what the extraction wrapper and default extraction do and where to find them

In [91]:
#%run utils/txaie_init_ui.ipynb #todo do we want this ui in a seperate file?
#display(get_txaie_SLC_name_ui(ai_lab_config)) #todo CKey.language_alias does not yet exist. use once made in NC
#todo this should get input "PYTHON3_TXAIE"

AttributeError: language_alias

In [92]:
#todo put stuff into secret store?
#extraction = ExtractionWrapper(ai_lab_config)


In [120]:
from exasol.nb_connector.connections import open_pyexasol_connection

def run_text_ai_preprocessing():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        conn.execute(query=activation_sql)
        Extraction(
            extractor=extractor,
            output=Output(db_schema=schema),
            defaults=defaults
        ).run(
            conn,
            schema,
            LANGUAGE_ALIAS,
        )

    # todo explain why preprocessing, what preprocessing, show image:


    3 parts, shows tables: then point to docments

## Run the preprocessing

Time to run our preprocessing. First, let's verify how many entries our view has:

In [121]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
100


Then we call our preprocessing function. This will use our view as input, and produce new tables and views using the models we downloaded. Also take note of the time this operation takes on your setup.

In [None]:
%%time
run_text_ai_preprocessing()

### Results
First, let's look at which tables were created by our preprocessing:

    talk about what they contain? view: what are they? what do they contain?
    # todo delete texai_test prefix

In [45]:
%%sql
SELECT TABLE_SCHEMA, TABLE_NAME FROM EXA_ALL_TABLES WHERE TABLE_SCHEMA='{{schema}}'

table_schema,table_name
AI_LAB,CUSTOMER_SUPPORT_TICKETS
AI_LAB,TXAIE_AUDIT_LOG
AI_LAB,DOCUMENTS
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW
AI_LAB,NAMED_ENTITY
AI_LAB,NAMED_ENTITY_LOOKUP_ENTITY_TYPE
AI_LAB,NAMED_ENTITY_LOOKUP_SETUP
AI_LAB,TOPIC_CLASSIFIER
AI_LAB,TOPIC_CLASSIFIER_LOOKUP_TOPIC
AI_LAB,TOPIC_CLASSIFIER_LOOKUP_SETUP


    Show Tables counts for  Extractions and Audit Log

If we want to find out how these new tables are structured, we can get a description from the Exasol database. For example, let's see what the resulting documents table looks like.

### DOCUMENTS Table


In [46]:
%%sql
DESC DOCUMENTS

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",True,True,False,False
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",True,False,False,False
TEXT_CHAR_END,"DECIMAL(18,0)",True,False,False,False
TEXT,VARCHAR(2000000) UTF8,True,False,False,False


This table contains four columns: TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END and TEXT.
The TEXT column holds the text of the document. If the content of one of our input datapoints does not fit within the VARCHAR limit of the TEXT column, it gets split into multiple entries in the documents table. These entries have the same TEXT_DOC_ID, indicating they came from the same document. TEXT_CHAR_BEGIN and TEXT_CHAR_END indicate which part of the original document each row contains. This trio of TEXT_DOC_ID, TEXT_CHAR_BEGIN and TEXT_CHAR_END is called a "span"; together they form an identifier for a section of text. You will encounter spans for many kinds of text subsections. For example, keywords found in a text are also identified by a span in our result tables (see below).

The use of these spans allows you to do various operations on top of these results, such as joining results on the document ID, or checking the order in which keywords appear in a document.
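
To make the idea of split documents more concrete, here is a small sketch of how one text could be cut into several span rows and put back together. It is only an illustration under stated assumptions: `SpanRow` and `split_document` are hypothetical names, the offsets are assumed to be 0-based and end-exclusive, and the extension's actual splitting strategy is not shown in this notebook and may differ.

```python
from typing import NamedTuple

class SpanRow(NamedTuple):
    text_doc_id: int
    text_char_begin: int
    text_char_end: int
    text: str

def split_document(doc_id, text, max_chars):
    """Split a text into rows of at most max_chars characters each.

    Every row keeps the same doc id; begin/end are 0-based, end-exclusive
    offsets into the original text (an assumption for this sketch).
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [
        SpanRow(doc_id, begin, min(begin + max_chars, len(text)),
                text[begin:begin + max_chars])
        for begin in range(0, len(text), max_chars)
    ]

document = "I'm having an issue with the GoPro Hero. It's affecting my productivity."
rows = split_document(doc_id=1, text=document, max_chars=30)
for row in rows:
    print(row)

# Sorting by begin offset and concatenating restores the original text.
restored = "".join(r.text for r in sorted(rows, key=lambda r: r.text_char_begin))
assert restored == document
```

The tiny limit of 30 characters is chosen only so that the split is visible; the real TEXT column is far larger.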

We can also count the doc-id entries in our table. As long as no document had to be split, this equals the number of unique doc-ids:

In [27]:
%%sql
SELECT COUNT(ALL text_doc_id) FROM {{schema}}.DOCUMENTS;

Count(DOCUMENTS.TEXT_DOC_ID)
100


It's identical to the number of rows in our input view. So all the data was converted successfully.

Now let's look at what the content of our table looks like:

In [28]:
%%sql
SELECT * FROM DOCUMENTS WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,TEXT
1,0,284,"I'm having an issue with the {product_purchased}. Please assist. Your billing zip code is: 71701. We appreciate that you have requested a website address. Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists."
2,0,282,"I'm having an issue with the {product_purchased}. Please assist. If you need to change an existing product. I'm having an issue with the {product_purchased}. Please assist. If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly."
3,0,275,"I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond. 1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly."
4,0,262,"I'm having an issue with the {product_purchased}. Please assist. If you have a problem you're interested in and I'd love to see this happen, please check out the Feedback. I've already contacted customer support multiple times, but the issue remains unresolved."


In [69]:
##%%sql
#SELECT * FROM TXAIE_AUDIT_LOG #todo error cause of hashtype

KeyError: 'HASHTYPE'

In [47]:
# show audit logs. todo where?
from pandas import option_context
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    audit_log = conn.export_to_pandas(f"""
        SELECT * FROM {schema}.TXAIE_AUDIT_LOG
    """)
    with option_context('display.max_rows', 10, 'display.max_colwidth', 1000):
        display(audit_log)

Unnamed: 0,LOG_TIMESTAMP,SESSION_ID,RUN_ID,ROW_COUNT,LOG_SPAN_NAME,LOG_SPAN_ID,PARENT_LOG_SPAN_ID,EVENT_NAME,EVENT_ATTRIBUTES,DB_OBJECT_SCHEMA,DB_OBJECT_NAME,DB_OBJECT_TYPE,ERROR_MESSAGE
0,2025-06-12 11:43:45.976000,1834715893865512960,,,,,,SourceTableQueryHandler_Start,,,,,
1,2025-06-12 11:43:46.101000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,0.0,INSERT,1a235629ea2d41e0b045583396d2083b,,Begin,,AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW,TABLE,
2,2025-06-12 11:43:46.157000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,100.0,INSERT,1a235629ea2d41e0b045583396d2083b,,End,,AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW,TABLE,
3,2025-06-12 11:43:46.165000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,0.0,INSERT,f06a94af6258455d86bfc8a9ffbeb9a4,,Begin,,AI_LAB,DOCUMENTS,TABLE,
4,2025-06-12 11:43:46.261000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,100.0,INSERT,f06a94af6258455d86bfc8a9ffbeb9a4,,End,,AI_LAB,DOCUMENTS,TABLE,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
28,2025-06-12 11:55:25.651000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,1.0,INSERT,93f531824d744eedad5ae8ef5e70ad99,,End,,AI_LAB,KEYWORD_SEARCH_LOOKUP_SETUP,TABLE,
29,2025-06-12 11:55:25.656000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,0.0,INSERT,4e8e95876bbe483d9c5a24b2e70a94d1,,Begin,,AI_LAB,KEYWORD_SEARCH,TABLE,
30,2025-06-12 11:55:25.685000,1834715893865512960,9f28162aad5947f0bb33c520bcfd3072,605.0,INSERT,4e8e95876bbe483d9c5a24b2e70a94d1,,End,,AI_LAB,KEYWORD_SEARCH,TABLE,
31,2025-06-12 11:55:25.689000,1834715893865512960,,,,,,UDFAlgo_Error,,,,,


## Resulting Views

There are also some new views:

In [48]:
%%sql
SELECT VIEW_SCHEMA, VIEW_NAME FROM EXA_ALL_VIEWS

view_schema,view_name
AI_LAB,MY_VIEW
AI_LAB,DOCUMENTS_AI_LAB_MY_VIEW_VIEW
AI_LAB,NAMED_ENTITY_VIEW
AI_LAB,TOPIC_CLASSIFIER_VIEW
AI_LAB,KEYWORD_SEARCH_VIEW


Three of these views (NAMED_ENTITY_VIEW, TOPIC_CLASSIFIER_VIEW and KEYWORD_SEARCH_VIEW) contain the results of our three preprocessing steps. They are built on top of the resulting tables and collect useful information for your convenience.
The DOCUMENTS_AI_LAB_MY_VIEW_VIEW is a view on top of our input data, with the addition of the span identifier (TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END) for the text column of each row. This can be used to join the original data with the preprocessing results.
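
Such a join could look roughly like the following sketch. To keep it runnable without a database, it uses an in-memory SQLite database as a stand-in for Exasol. The table names `DOCS_VIEW` and `TOPIC_VIEW`, their reduced column sets, and all values are toy assumptions; only the span columns used in the join condition are taken from the views described in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOCS_VIEW (
        TICKET_ID INTEGER, TICKET_PRIORITY TEXT,
        TEXT_DOC_ID INTEGER, TEXT_CHAR_BEGIN INTEGER, TEXT_CHAR_END INTEGER
    );
    CREATE TABLE TOPIC_VIEW (
        TEXT_DOC_ID INTEGER, TEXT_CHAR_BEGIN INTEGER, TEXT_CHAR_END INTEGER,
        TOPIC TEXT, TOPIC_SCORE REAL, TOPIC_RANK INTEGER
    );
""")
# Toy rows: input data with its span, and two ranked topics per document.
conn.executemany("INSERT INTO DOCS_VIEW VALUES (?, ?, ?, ?, ?)", [
    (1, "Critical", 1, 0, 284),
    (3, "Low", 3, 0, 275),
])
conn.executemany("INSERT INTO TOPIC_VIEW VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 0, 284, "software issue", 0.554, 1),
    (1, 0, 284, "hardware issue", 0.446, 2),
    (3, 0, 275, "hardware issue", 0.557, 1),
    (3, 0, 275, "software issue", 0.443, 2),
])

# Join on the full span and keep only the best-ranked topic per document.
query = """
    SELECT d.TICKET_ID, d.TICKET_PRIORITY, t.TOPIC, t.TOPIC_SCORE
    FROM DOCS_VIEW d
    JOIN TOPIC_VIEW t
      ON  d.TEXT_DOC_ID = t.TEXT_DOC_ID
      AND d.TEXT_CHAR_BEGIN = t.TEXT_CHAR_BEGIN
      AND d.TEXT_CHAR_END = t.TEXT_CHAR_END
    WHERE t.TOPIC_RANK = 1
    ORDER BY d.TICKET_ID
"""
joined = conn.execute(query).fetchall()
print(joined)
```

Joining on all three span columns rather than on the document ID alone keeps the match exact even when a document has been split into several rows.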

Let's take a closer look at the results of the topic classification step in our preprocessing now. These can be found in the view TOPIC_CLASSIFIER_VIEW.

#### Topic Classifier View


In [49]:
%%sql
DESC TOPIC_CLASSIFIER_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
TOPIC,VARCHAR(2000000) UTF8,,,,
TOPIC_SCORE,DOUBLE,,,,
TOPIC_RANK,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


This view contains a span identifying the classified document, the topic it was assigned, and a topic score: the probability the classifier assigned to this topic for this text input, i.e. "how sure" the classifier is about the assigned topic.
The TOPIC_RANK column ranks the topics for each source document by their TOPIC_SCORE. For our example we had only two topics, so each document was assigned each of the topics, with different scores. The one with the higher score for a given document has rank 1, the one with the lower score has rank 2.
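
The relationship between score and rank can be sketched in a few lines of Python. The function `rank_topics` is a hypothetical helper and the scores are made up; this only illustrates the ranking rule described above, not how the extension computes it internally.

```python
def rank_topics(scores):
    """Return (topic, score, rank) tuples, where rank 1 has the highest score."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (topic, score, rank)
        for rank, (topic, score) in enumerate(ordered, start=1)
    ]

# Made-up scores for a single document and a two-topic set.
example_scores = {"not urgent": 0.38, "urgent": 0.62}
for topic, score, rank in rank_topics(example_scores):
    print(rank, topic, score)
```

Filtering on rank 1 then gives exactly one topic per document, which is usually what you want when only the best match matters.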

There is also a column for error messages encountered during classification, as well as a SETUP column documenting which setup (i.e. model and model settings) was used to obtain this result.

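Note that the results shown below come from a run in which the classifier was given the topics "hardware issue" and "software issue", in order to separate our user tickets into hardware issues and software issues. So those are the topics we expect to see in these results; with a different topic set, you will see your own topics instead. Let's check how these results look: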

In [50]:
%%sql
SELECT * FROM TOPIC_CLASSIFIER_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,topic,topic_score,topic_rank,error_message,setup
2,0,282,software issue,0.5114253759384155,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
2,0,282,hardware issue,0.4885745644569397,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,262,software issue,0.6153695583343506,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
4,0,262,hardware issue,0.3846305012702942,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,284,software issue,0.5539488196372986,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
1,0,284,hardware issue,0.4460511207580566,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,275,hardware issue,0.5567596554756165,1,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"
3,0,275,software issue,0.4432403445243835,2,,"{""HftTopicClassification"": {""model_name"": ""tasksource/ModernBERT-base-nli"", ""topics"": [""hardware issue"", ""software issue""], ""hypothesis_template"": null, ""multi_label"": false}}"


#### Named Entity View

Next, we look at the identified named entities for our input documents. These can be found in the NAMED_ENTITY_VIEW.


In [51]:
%%sql
DESC NAMED_ENTITY_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
ENTITY_TYPE,VARCHAR(2000000) UTF8,,,,
ENTITY_SCORE,DOUBLE,,,,
ENTITY,VARCHAR(2000000) UTF8,,,,
ENTITY_DOC_ID,"DECIMAL(18,0)",,,,
ENTITY_CHAR_BEGIN,"DECIMAL(18,0)",,,,
ENTITY_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,


Similar to the TOPIC_CLASSIFIER_VIEW, the NAMED_ENTITY_VIEW has the span (TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END) identifying the input document the entity was found in. Then there is the named entity itself in the ENTITY column, as well as the entity type and the entity score the model assigned to the entity. Additionally, there is an identifying span for the entity itself: ENTITY_DOC_ID, ENTITY_CHAR_BEGIN, ENTITY_CHAR_END. This span represents exactly where in our input data the entity was found.

Since the entity was found in the text identified by (TEXT_DOC_ID, TEXT_CHAR_BEGIN, TEXT_CHAR_END), it follows that TEXT_DOC_ID = ENTITY_DOC_ID for a given row. Similarly, both ENTITY_CHAR_BEGIN and ENTITY_CHAR_END lie between TEXT_CHAR_BEGIN and TEXT_CHAR_END. You can use these spans for further processing down the line. For example, if joined with the input data, especially in a case where an input document was split into multiple rows, this lets you determine where an entity was found in relation to the whole document. Or you could check how close together named entities of the same document were found, and then check whether certain clusters of named entities indicate different topics. However, this post-processing is not part of this tutorial.
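
As a rough idea of what such post-processing could look like, the sketch below checks the span relationship described above and measures how far apart neighbouring entities of one document are. `EntityRow`, `is_consistent` and `gaps_between_entities` are hypothetical names, and the rows are invented for illustration.

```python
from typing import NamedTuple

class EntityRow(NamedTuple):
    text_doc_id: int
    text_char_begin: int
    text_char_end: int
    entity: str
    entity_doc_id: int
    entity_char_begin: int
    entity_char_end: int

def is_consistent(row):
    """Check that the entity span lies inside the text span of the same document."""
    return (
        row.entity_doc_id == row.text_doc_id
        and row.text_char_begin <= row.entity_char_begin
        and row.entity_char_begin <= row.entity_char_end
        and row.entity_char_end <= row.text_char_end
    )

def gaps_between_entities(rows):
    """Return (left entity, right entity, gap in characters) for neighbours.

    Assumes all rows belong to the same document; overlapping entities
    are reported with a gap of 0.
    """
    ordered = sorted(rows, key=lambda r: r.entity_char_begin)
    return [
        (left.entity, right.entity,
         max(0, right.entity_char_begin - left.entity_char_end))
        for left, right in zip(ordered, ordered[1:])
    ]

# Invented rows for a single document.
entity_rows = [
    EntityRow(7, 0, 120, "Microsoft", 7, 80, 89),
    EntityRow(7, 0, 120, "Windows Vista", 7, 40, 53),
]
assert all(is_consistent(r) for r in entity_rows)
print(gaps_between_entities(entity_rows))
```

A small gap between two entities can serve as a simple signal that they are mentioned in the same sentence or context.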

The NAMED_ENTITY_VIEW also includes an error message column and a setup column, like the TOPIC_CLASSIFIER_VIEW above. The error message column should, however, be empty.

In [68]:
%config SqlMagic.displaylimit = 10  # set lower so that only a preview of the views is shown

In [78]:
%%sql
SELECT * FROM NAMED_ENTITY_VIEW

text_doc_id,text_char_begin,text_char_end,entity_type,entity_score,entity,entity_doc_id,entity_char_begin,entity_char_end,error_message,setup
44,0,343,organization_company,0.850927472114563,Amazon,44,103,109,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
98,0,312,product_software,0.5798302888870239,DNS,98,189,192,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
12,0,296,person_other,0.6915323138237,Mr. Brown,12,165,174,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
20,0,292,organization_company,0.941503405570984,Apple,20,191,196,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
64,0,262,product_software,0.4990056157112121,App Manager,64,119,130,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
27,0,338,product_software,0.8014147877693176,Windows Vista,27,176,189,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
29,0,334,organization_company,0.7323066592216492,Microsoft,29,135,144,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
55,0,326,location_GPE,0.9387378692626952,U.S,55,162,165,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
45,0,324,product_other,0.6588281989097595,3DS,45,92,95,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"
17,0,315,person_other,0.9543948769569396,Dan,17,95,98,,"{""HftNamedEntityRecognition"": {""model_name"": ""guishe/nuner-v2_fewnerd_fine_super"", ""ignore_labels"": null, ""aggregation_strategy"": ""simple""}}"


#### Keyword-Search View

Lastly, our preprocessing created a view containing the results of the keyword search step, the KEYWORD_SEARCH_VIEW. This one is structured similarly to the NAMED_ENTITY_VIEW:

In [53]:
%%sql
DESC KEYWORD_SEARCH_VIEW

column_name,sql_type,nullable,distribution_key,partition_key,zonemapped
TEXT_DOC_ID,"DECIMAL(18,0)",,,,
TEXT_CHAR_BEGIN,"DECIMAL(18,0)",,,,
TEXT_CHAR_END,"DECIMAL(18,0)",,,,
KEYWORD,VARCHAR(2000000) UTF8,,,,
KEYWORD_SCORE,DOUBLE,,,,
KEYWORD_DOC_ID,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_BEGIN,"DECIMAL(18,0)",,,,
KEYWORD_CHAR_END,"DECIMAL(18,0)",,,,
ERROR_MESSAGE,VARCHAR(2000000) UTF8,,,,
SETUP,VARCHAR(2000000) UTF8,,,,


The TEXT_DOC_ID, TEXT_CHAR_BEGIN and TEXT_CHAR_END columns again describe the input document span. But instead of an entity with an entity score and an entity span, we now have a keyword column, a keyword score and a span (KEYWORD_DOC_ID, KEYWORD_CHAR_BEGIN, KEYWORD_CHAR_END) identifying the found keyword in the text. As before, the view also contains the error message and setup columns.
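To make the span columns more concrete, here is a minimal sketch of how a span could be mapped back to the text it refers to. It assumes that the begin offset is inclusive and the end offset is exclusive, which is consistent with the results in this notebook (for example, the entity `Apple` has the span 191 to 196, which is 5 characters long). The document text and the helper `extract_span` are made up for illustration and are not part of the Text AI Extension.

```python
def extract_span(text: str, char_begin: int, char_end: int) -> str:
    """Return the part of `text` covered by a span.

    Assumption: `char_begin` is inclusive and `char_end` is exclusive.
    """
    return text[char_begin:char_end]


# Hypothetical document text, not taken from the dataset.
document = "My charger stopped working yesterday."

# Hypothetical span, in the shape of KEYWORD_CHAR_BEGIN / KEYWORD_CHAR_END.
keyword_begin, keyword_end = 27, 36

print(extract_span(document, keyword_begin, keyword_end))  # yesterday
```

The same slicing would apply to the entity spans of the NAMED_ENTITY_VIEW, as long as the offset assumption holds.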

In [70]:
%%sql
SELECT * FROM KEYWORD_SEARCH_VIEW WHERE TEXT_DOC_ID < 5

text_doc_id,text_char_begin,text_char_end,keyword,keyword_score,keyword_doc_id,keyword_char_begin,keyword_char_end,error_message,setup
1,0,284,troubleshooting steps,0.8495,1,209,230,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
1,0,284,product_purchased,0.839,1,30,47,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
1,0,284,billing zip code,0.7402,1,71,87,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
1,0,284,user manual,0.6912,1,248,259,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
1,0,284,email address,0.6837,1,183,196,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
3,0,275,product_purchased,0.8289,3,30,47,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
3,0,275,product_purchased,0.8289,3,55,72,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
3,0,275,product_purchased,0.8289,3,224,241,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
3,0,275,original charger,0.7676,3,188,204,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
3,0,275,yesterday,0.7337,3,119,128,,"{""KeywordSearch"": [{""PatternRankKeywordExtractor"": {""model_name"": ""answerdotai/ModernBERT-base"", ""vec_kwargs"": {""max_df"": null, ""min_df"": null}, ""kbx_kwargs"": {""top_n"": 5, ""use_maxsum"": false, ""use_mmr"": false, ""diversity"": 0.5, ""nr_candidates"": 20}}}, {""TokenizedTextSearch"": {""fuzziness"": 0}}]}"
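The SETUP column in the result above looks like a JSON document describing the steps that produced a row. As a small sketch, assuming the value is always valid JSON with the shape shown above, it could be parsed with Python's standard `json` module. The string below is copied from the query result, with the CSV quoting removed.

```python
import json

# SETUP value copied from the query result above (CSV quoting removed).
setup = (
    '{"KeywordSearch": [{"PatternRankKeywordExtractor": {'
    '"model_name": "answerdotai/ModernBERT-base", '
    '"vec_kwargs": {"max_df": null, "min_df": null}, '
    '"kbx_kwargs": {"top_n": 5, "use_maxsum": false, "use_mmr": false, '
    '"diversity": 0.5, "nr_candidates": 20}}}, '
    '{"TokenizedTextSearch": {"fuzziness": 0}}]}'
)

config = json.loads(setup)

# Each list entry is a dict with a single key: the name of the step.
for step in config["KeywordSearch"]:
    for step_name, parameters in step.items():
        print(step_name, parameters)

extractor = config["KeywordSearch"][0]["PatternRankKeywordExtractor"]
print(extractor["model_name"])           # answerdotai/ModernBERT-base
print(extractor["kbx_kwargs"]["top_n"])  # 5
```

Parsing the column this way makes it possible to group or filter results by the model or parameters that produced them.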


## Adding data to source

Now, let's try adding more data to our view and running the preprocessing again.

But first, let's see what happens if we run the preprocessing again without any changes to the data.

In [56]:
%%time
run_text_ai_preprocessing()

CPU times: user 111 ms, sys: 12.5 ms, total: 123 ms
Wall time: 1.72 s


See how quickly it runs this time? This is because the Text AI Extension does not recompute results that were already computed in previous runs. We can test this behaviour further. Let's add more entries to our dataset and see how long the preprocessing takes then:
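The idea of skipping already computed results can be illustrated with a toy model. This is only a sketch of the general technique, not the implementation used by the Text AI Extension. The class `IncrementalPreprocessor` and the function `expensive_annotation` are hypothetical names.

```python
def expensive_annotation(text: str) -> str:
    # Stand-in for a costly model call; here it only upper-cases the text.
    return text.upper()


class IncrementalPreprocessor:
    """Toy model of incremental preprocessing (not the extension's code)."""

    def __init__(self):
        self.results = {}  # doc_id -> annotation computed in an earlier run

    def run(self, documents: dict) -> int:
        """Process only unseen documents and return how many were processed."""
        new_ids = [doc_id for doc_id in documents if doc_id not in self.results]
        for doc_id in new_ids:
            self.results[doc_id] = expensive_annotation(documents[doc_id])
        return len(new_ids)


preprocessor = IncrementalPreprocessor()
source = {1: "first ticket", 2: "second ticket"}

print(preprocessor.run(source))  # 2: both documents are new
print(preprocessor.run(source))  # 0: nothing new, so nothing is recomputed

source[3] = "third ticket"
print(preprocessor.run(source))  # 1: only the added document is processed
```

In this model, the cost of a run depends on the number of new documents rather than on the total size of the source.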

In [65]:
# Widen the view so that it covers twice as many tickets as before.
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}*2; """)
    # Inserting rows into the view directly does not work, it fails with "Cannot insert into a view":
    #conn.execute(f"""INSERT INTO "{schema}"."{view}" VALUES (SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}*2 and "TICKET_ID" > {view_size}); """)


ExaQueryError: 
(
    message     =>  illegal INSERT statement: Cannot insert into a view. [line 1, column 13] (Session: 1834723082564665344)
    dsn         =>  172.19.0.2:8563
    user        =>  sys
    schema      =>  
    session_id  =>  1834723082564665344
    code        =>  42000
    query       =>  INSERT INTO "AI_LAB"."MY_VIEW" VALUES (SELECT * FROM "AI_LAB"."CUSTOMER_SUPPORT_TICKETS" WHERE "TICKET_ID" <= 100*2 and "TICKET_ID" > 100)
)


In [60]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
200


In [61]:
%%time
run_text_ai_preprocessing()

CPU times: user 340 ms, sys: 38.1 ms, total: 378 ms
Wall time: 18min 33s


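Compare the wall times of the runs. The previous run, which had no new data to process, finished in under two seconds. This run took more than 18 minutes, since the view now contains 100 additional documents and the preprocessing is only run on this new data. You can also compare this with the wall time of the initial run on the first 100 documents.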
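As a rough back-of-the-envelope calculation, the wall times reported above can be turned into an average time per new document. The sketch assumes that the run processed exactly the 100 documents added to the view (200 instead of 100 tickets). The helper `to_seconds` is introduced only for this example.

```python
def to_seconds(minutes: int = 0, seconds: float = 0.0) -> float:
    """Convert a wall time given in minutes and seconds to seconds."""
    return minutes * 60 + seconds


rerun_without_new_data = to_seconds(seconds=1.72)       # wall time of the rerun
run_with_new_data = to_seconds(minutes=18, seconds=33)  # wall time after widening the view

new_documents = 200 - 100  # assumption: the view grew from 100 to 200 tickets

print(run_with_new_data)                  # 1113
print(run_with_new_data / new_documents)  # 11.13
```

The result is only an average: the wall time also includes fixed costs such as loading the models, so the time per document will likely differ on other data sizes and hardware.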

Next, we look at the row count of the DOCUMENTS table and at the audit log. The three extraction views can be inspected in the same way.
                                            

In [81]:
%%sql
SELECT COUNT (*) FROM DOCUMENTS;

COUNT(*)
200


In [84]:
# Show the audit log written by the preprocessing runs.
from pandas import option_context
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    audit_log = conn.export_to_pandas(f"""
        SELECT DB_OBJECT_NAME,EVENT_NAME,RUN_ID, ROW_COUNT FROM {schema}.TXAIE_AUDIT_LOG
    """)
    with option_context('display.max_rows', 1000, 'display.max_colwidth', 1000):
        display(audit_log)

Unnamed: 0,DB_OBJECT_NAME,EVENT_NAME,RUN_ID,ROW_COUNT
0,,SourceTableQueryHandler_Start,,
1,DOCUMENTS_AI_LAB_MY_VIEW,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
2,DOCUMENTS_AI_LAB_MY_VIEW,End,9f28162aad5947f0bb33c520bcfd3072,100.0
3,DOCUMENTS,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
4,DOCUMENTS,End,9f28162aad5947f0bb33c520bcfd3072,100.0
5,,SourceTableQueryHandler_End,,
6,,UDFAlgoQueryHandler_Start,,
7,NAMED_ENTITY_LOOKUP_ENTITY_TYPE,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
8,NAMED_ENTITY_LOOKUP_ENTITY_TYPE,End,9f28162aad5947f0bb33c520bcfd3072,5.0
9,NAMED_ENTITY_LOOKUP_SETUP,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0


In [85]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" <= {view_size}*3; """)

In [86]:
%%sql
SELECT COUNT(ALL TICKET_ID) FROM {{schema}}.{{view}};

Count(TICKET_ID)
300
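Since the view is redefined several times with a growing upper bound, the statement could be generated by a small helper. This is a sketch with a hypothetical function `build_view_sql`. The schema, view and table names as well as the view size of 100 are taken from the error message shown earlier in this notebook.

```python
def build_view_sql(schema: str, view: str, table: str, max_ticket_id: int) -> str:
    """Build the CREATE OR REPLACE VIEW statement used in this notebook."""
    return (
        f'CREATE OR REPLACE VIEW "{schema}"."{view}" AS '
        f'SELECT * FROM "{schema}"."{table}" '
        f'WHERE "TICKET_ID" <= {max_ticket_id}'
    )


view_size = 100  # value visible in the error message earlier in this notebook
for factor in (1, 2, 3):
    print(build_view_sql("AI_LAB", "MY_VIEW", "CUSTOMER_SUPPORT_TICKETS", view_size * factor))
```

The resulting string can be passed to `conn.execute(...)` in the same way as the statements in the cells above.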


In [87]:
%%time
run_text_ai_preprocessing()

Catched exeception during cleanup after an exception.
Traceback (most recent call last):
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/pyexasol/connection.py", line 539, in req
    recv_data = self._ws_recv()
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/pyexasol/connection.py", line 626, in <lambda>
    self._ws_recv = lambda: zlib.decompress(self._ws.recv())
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/websocket/_core.py", line 388, in recv
    opcode, data = self.recv_data()
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/websocket/_core.py", line 416, in recv_data
    opcode, frame = self.recv_data_frame(control_frame)
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/websocket/_core.py", line 437, in recv_data_frame
    frame = self.recv_frame()
  File "/home/jupyter/jupyterenv/lib/python3.10/site-packages/websocket/_core.py", line 478, in recv_frame
    return self.frame_buffer.recv_frame()
  File "/home/j

ExaCommunicationError: 
(
    message     =>  socket is already closed.
    dsn         =>  172.19.0.2:8563
    user        =>  sys
    schema      =>  
    session_id  =>  1834729278802952192
)


In [88]:
%%sql
SELECT COUNT (*) FROM DOCUMENTS;

COUNT(*)
300


In [89]:
# Show the audit log again and save a copy as a CSV file.
from pandas import option_context
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    audit_log = conn.export_to_pandas(f"""
        SELECT DB_OBJECT_NAME,EVENT_NAME,RUN_ID, ROW_COUNT FROM {schema}.TXAIE_AUDIT_LOG
    """)
    audit_log.to_csv("audit_log_after_3_runs_a_100_documents.csv")
    with option_context('display.max_rows', 1000, 'display.max_colwidth', 1000):
        display(audit_log)

Unnamed: 0,DB_OBJECT_NAME,EVENT_NAME,RUN_ID,ROW_COUNT
0,,SourceTableQueryHandler_Start,,
1,DOCUMENTS_AI_LAB_MY_VIEW,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
2,DOCUMENTS_AI_LAB_MY_VIEW,End,9f28162aad5947f0bb33c520bcfd3072,100.0
3,DOCUMENTS,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
4,DOCUMENTS,End,9f28162aad5947f0bb33c520bcfd3072,100.0
5,,SourceTableQueryHandler_End,,
6,,UDFAlgoQueryHandler_Start,,
7,NAMED_ENTITY_LOOKUP_ENTITY_TYPE,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0
8,NAMED_ENTITY_LOOKUP_ENTITY_TYPE,End,9f28162aad5947f0bb33c520bcfd3072,5.0
9,NAMED_ENTITY_LOOKUP_SETUP,Begin,9f28162aad5947f0bb33c520bcfd3072,0.0


## Addendum

The source text may contain spelling errors or incomplete mentions, so the extracted results might need postprocessing.