# Text AI Extension preprocessing

This notebook demonstrates how to use the data preprocessing of the Text AI Extension.

    Explain which options we have. Say something about why preprocessing is needed, or is that too basic or out of scope?


## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the AI-Lab](../main_config.ipynb).
2. [Initialize the Text AI Extension](./txaie_init.ipynb).
3. [Initialize the Transformers Extension](../transformers/te_init.ipynb).

## Activate the Text AI Extension SLC

In [24]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))


In [25]:
from exasol.nb_connector.connections import open_pyexasol_connection
from exasol.nb_connector.language_container_activation import get_activation_sql

activation_sql = get_activation_sql(ai_lab_config)

Running the preprocessing later in this notebook creates the text annotations and text extractions.



## Get an example dataset

We will use a dataset that holds information on customer support tickets. We will split this data into two sets in order to demonstrate how the preprocessing tasks handle new data being added to a dataset.
But first we want to make sure that the tables we are going to use don't already exist, for example from a previous run of this notebook. Therefore, we are going to drop them.
First, we define a list of tables to drop:

In [27]:
table_list = [
    "TOPIC_CLASSIFIER",
    "TOPIC_CLASSIFIER_LOOKUP_TOPIC",
    "TOPIC_CLASSIFIER_LOOKUP_SETUP",
    "NAMED_ENTITY",
    "NAMED_ENTITY_LOOKUP_ENTITY_NAME",
    "NAMED_ENTITY_LOOKUP_SETUP",
    "DOCUMENTS",
    "DOCUMENTS_AI_LAB_CUSTOMER_SUPPORT_TICKETS",
    "KEYWORD_SEARCH",
    "KEYWORD_SEARCH_LOOKUP_KEYWORD",
    "KEYWORD_SEARCH_LOOKUP_SETUP"
]

Next, we define a function that drops these tables as well as our main table, and then we call it.

In [28]:
table = "CUSTOMER_SUPPORT_TICKETS"
# In this notebook the preprocessing output and the source table share one schema.
OUTPUT_SCHEMA = ai_lab_config.db_schema
schema = ai_lab_config.db_schema

def delete_text_ai_preprocessing_tables():
    """Drop the preprocessing result tables and the source table, if they exist."""
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        for drop_table in table_list:
            conn.execute(f"""DROP TABLE IF EXISTS "{OUTPUT_SCHEMA}"."{drop_table}" """)
        conn.execute(f"""DROP TABLE IF EXISTS "{schema}"."{table}" """)

In [30]:
delete_text_ai_preprocessing_tables()
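
Before dropping anything, it can help to preview the statements that will be sent. The sketch below rebuilds the same kind of `DROP TABLE IF EXISTS` strings without opening a connection. The helper name `build_drop_statements` and the example schema `AI_LAB` are assumptions made for illustration; they are not part of the notebook utilities.

```python
def build_drop_statements(schema: str, tables: list[str]) -> list[str]:
    """Return one DROP TABLE IF EXISTS statement per table (hypothetical helper)."""
    return [f'DROP TABLE IF EXISTS "{schema}"."{name}"' for name in tables]


# Example values; in the notebook these would come from
# ai_lab_config.db_schema and table_list.
example_schema = "AI_LAB"
example_tables = ["DOCUMENTS", "NAMED_ENTITY", "CUSTOMER_SUPPORT_TICKETS"]

for statement in build_drop_statements(example_schema, example_tables):
    print(statement)
```

Printing the statements first makes it easy to check that only the intended tables are affected.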

You can then load the data using [this notebook](../data/data_customer_support.ipynb). It loads the data into a table called "CUSTOMER_SUPPORT_TICKETS" in the schema given by the db_schema setting of ai_lab_config.
For the purpose of this notebook, we want to split this data into two parts, so we need to load it into a pandas dataframe.

In [31]:
%run ../data/data_customer_support.ipynb

Path to dataset files: /home/jupyter/.cache/kagglehub/datasets/suraj520/customer-support-ticket-dataset/versions/1
   Ticket ID        Customer Name              Customer Email  Customer Age  \
0          1        Marisa Obrien  carrollallison@example.com            32   
1          2         Jessica Rios    clarkeashley@example.com            42   
2          3  Christopher Robbins   gonzalestracy@example.com            48   
3          4     Christina Dillon    bradleyolson@example.org            27   
4          5    Alexander Carroll     bradleymark@example.com            67   

  Customer Gender Product Purchased Date of Purchase      Ticket Type  \
0           Other        GoPro Hero       2021-03-22  Technical issue   
1          

In [11]:
#with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
#        whole_data_df = conn.export_to_pandas(f"""SELECT * FROM "{schema}"."CUSTOMER_SUPPORT_TICKETS" """)

In [32]:
view="MY_VIEW"

In [63]:
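with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    #conn.execute(f"""CREATE OR REPLACE VIEW "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" < 100; """)
    # Drop a view left over from an earlier run; IF EXISTS avoids an error when there is none.
    conn.execute(f"""DROP VIEW IF EXISTS "{schema}"."{view}"; """)
    # Despite its name, MY_VIEW is created as a table holding the tickets with TICKET_ID < 100.
    conn.execute(f"""CREATE OR REPLACE TABLE "{schema}"."{view}" AS SELECT * FROM "{schema}"."{table}" WHERE "TICKET_ID" < 100; """)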


Then we split the dataframe randomly into two separate dataframes. We will upload the first one into our data table ???.

In [34]:
#shuffled = df.sample(frac=1)
#split_df_list = np.array_split(shuffled, 2)
#todo create a view with limit (100) instead, add to view for step 3 below
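
The random split mentioned above can be sketched in plain Python on a list of ticket IDs. This only illustrates the idea behind the commented-out pandas code; the function name `split_in_two` and the fixed seed are assumptions.

```python
import random


def split_in_two(ids, seed=0):
    """Shuffle a copy of ids and return two halves (hypothetical helper)."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    middle = (len(shuffled) + 1) // 2  # the first half gets the extra element
    return shuffled[:middle], shuffled[middle:]


first_half, second_half = split_in_two(range(1, 11))
print(len(first_half), len(second_half))
```

Using a seeded `random.Random` instance keeps the split reproducible between notebook runs.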

## Get Models

We will use several different transformer models to run our preprocessing. We need to download these from Hugging Face.
First, we define which models we want to use. You can browse for your preferred model [here](https://huggingface.co/models).

In [35]:
NAMED_ENTITY_MODEL="guishe/nuner-v2_fewnerd_fine_super"
NLI_MODEL="tasksource/ModernBERT-large-nli"
FEATURE_EXTRACTION_MODEL="answerdotai/ModernBERT-large"

Then we import the function "load_huggingface_model", which is defined in another notebook and helps us download the models.

In [36]:
%run ../transformers/utils/model_retrieval.ipynb

And now we are ready to download our models. Each of these calls will take some time, depending on your internet connection.

In [38]:
load_huggingface_model(ai_lab_config, NAMED_ENTITY_MODEL, 'token-classification')

In [39]:
load_huggingface_model(ai_lab_config, NLI_MODEL, 'zero-shot-classification')

ExaCommunicationError: 
(
    message     =>  Connection to remote host was lost.
    dsn         =>  172.19.0.2:8563
    user        =>  sys
    schema      =>  
    session_id  =>  1832635162249330688
)


In [40]:
load_huggingface_model(ai_lab_config, FEATURE_EXTRACTION_MODEL, 'feature-extraction')

ExaCommunicationError: 
(
    message     =>  Connection to remote host was lost.
    dsn         =>  172.19.0.2:8563
    user        =>  sys
    schema      =>  
    session_id  =>  1832635476158971904
)
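
The three download calls follow the same pattern, so they could also be driven from a single mapping of model name to task type. The sketch below assumes a loader with the same call shape as the `load_huggingface_model(config, model_name, task_type)` calls above. The function `download_all` and the stand-in loader are hypothetical and only illustrate the loop.

```python
MODELS_AND_TASKS = {
    "guishe/nuner-v2_fewnerd_fine_super": "token-classification",
    "tasksource/ModernBERT-large-nli": "zero-shot-classification",
    "answerdotai/ModernBERT-large": "feature-extraction",
}


def download_all(loader, config, models_and_tasks):
    """Call loader(config, model, task) for every entry and collect failures."""
    failures = {}
    for model_name, task_type in models_and_tasks.items():
        try:
            loader(config, model_name, task_type)
        except Exception as error:  # for example a lost database connection
            failures[model_name] = error
    return failures


# Stand-in loader for illustration only; in the notebook one would pass
# load_huggingface_model and ai_lab_config instead.
def fake_loader(config, model_name, task_type):
    print(f"would download {model_name} for {task_type}")


failed = download_all(fake_loader, config=None, models_and_tasks=MODELS_AND_TASKS)
print("failed downloads:", list(failed))
```

Collecting the failures instead of stopping at the first one shows at a glance which models still need to be downloaded again.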


## Further setup


In [41]:
from exasol.ai.text.extraction import *
from exasol.ai.text.extraction.extraction import Extraction
from exasol.ai.text.extraction.abstract_extraction import Output

In [42]:
schema=ai_lab_config.db_schema
table="CUSTOMER_SUPPORT_TICKETS"
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"


## Define which steps to run



In [125]:
%run ./utils/txaie_default_extractor.ipynb

In [126]:
%run ./utils/txaie_extraction_wrapper.ipynb

 Here we explain what the extraction wrapper and the default extractor do and where to find them.

In [127]:
# todo: do we want this UI in a separate file?
%run utils/txaie_init_ui.ipynb
# todo: CKey.language_alias does not yet exist; use it once it is available in NC
display(get_txaie_SLC_name_ui(ai_lab_config))
# todo: this should get the input "PYTHON3_TXAIE"

AttributeError: language_alias

In [128]:
#todo put stuff into secret store?
extraction = ExtractionWrapper(ai_lab_config)


In [129]:
from exasol.nb_connector.connections import open_pyexasol_connection

def run_text_ai_preprocessing():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        #conn.execute(query=activation_sql)
        # Show the script languages of this session to check that the SLC alias is registered.
        df = conn.export_to_pandas("SELECT session_value FROM EXA_PARAMETERS WHERE parameter_name='SCRIPT_LANGUAGES'")
        display(df.to_string())
        extraction.run(ai_lab_config)
        #extraction.run(conn, schema, "PYTHON3_TXAIE")

## Run the preprocessing

    with half data

    first, show number of rows in data table

In [130]:
run_text_ai_preprocessing()

'SESSION_VALUE\n0  R=builtin_r JAVA=builtin_java PYTHON3=builtin_python3 PYTHON3_TE=localzmq+protobuf:///bfsdefault/default/TE/exasol_transformers_extension_container_release?lang=python#/buckets/bfsdefault/default/TE/exasol_transformers_extension_container_release/exaudf/exaudfclient_py3 PYTHON3_TXAIE=localzmq+protobuf:///bfsdefault/default/TXAIE/exasol_text_ai_extension_container_release?lang=python#/buckets/bfsdefault/default/TXAIE/exasol_text_ai_extension_container_release/exaudf/exaudfclient_py3'
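
The session value printed above is a space-separated list of `ALIAS=definition` entries. A small parser makes it easy to check whether an alias such as `PYTHON3_TXAIE` is registered. This is a minimal sketch that assumes no entry contains spaces; the function name `parse_script_languages` is hypothetical.

```python
def parse_script_languages(session_value: str) -> dict[str, str]:
    """Split 'ALIAS=definition ALIAS2=definition2' into a dict keyed by alias."""
    aliases = {}
    for entry in session_value.split():
        # Split on the first '=' only; definitions may contain '=' themselves.
        alias, _, definition = entry.partition("=")
        aliases[alias] = definition
    return aliases


# Shortened example value; the container definition is elided on purpose.
example_value = (
    "R=builtin_r JAVA=builtin_java PYTHON3=builtin_python3 "
    "PYTHON3_TXAIE=localzmq+protobuf:///..."
)
languages = parse_script_languages(example_value)
print("PYTHON3_TXAIE" in languages)
```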

The next call makes it possible to run SQL directly in this notebook, which makes it easier to display the results of our preprocessing.

In [132]:
%run ../utils/jupysql_init.ipynb

In [133]:
%config SqlMagic.displaylimit = 10

First, let's look at which tables were created by our preprocessing:

    talk about what they contain?

In [134]:
%%sql
SELECT TABLE_SCHEMA, TABLE_NAME FROM EXA_ALL_TABLES

table_schema,table_name
AI_LAB,CUSTOMER_SUPPORT_TICKETS
AI_LAB,MY_VIEW
AI_LAB,TEST_TXAI_DOCUMENTS
AI_LAB,TEST_TXAI_DOCUMENTS_AI_LAB_MY_VIEW
AI_LAB,TEST_TXAI_TOPIC_CLASSIFIER
AI_LAB,tmp_1832643658291609600_4_7_1
AI_LAB,TEST_TXAI_TOPIC_CLASSIFIER_LOOKUP_TOPIC
AI_LAB,TEST_TXAI_TOPIC_CLASSIFIER_LOOKUP_SETUP
AI_LAB,tmp_1832645095668514816_4_7_1
AI_LAB,tmp_1832645143647092736_4_7_1


There are also some new views:

In [135]:
%%sql
SELECT VIEW_SCHEMA, VIEW_NAME FROM EXA_ALL_VIEWS

view_schema,view_name
AI_LAB,tmp_1832637382593806336_6_1
AI_LAB,tmp_1832637777663623168_6_1
AI_LAB,tmp_1832637844057554944_6_1
AI_LAB,tmp_1832638228912799744_6_1
AI_LAB,tmp_1832642054621822976_6_1
AI_LAB,tmp_1832643000402444288_6_1
AI_LAB,tmp_1832643198880251904_6_1
AI_LAB,TEST_TXAI_TOPIC_CLASSIFIER_VIEW
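
Both listings contain objects whose names start with `tmp_`. Assuming, based on the naming alone, that these are temporary objects that are not part of the result, they can be filtered out to make the listing easier to read. The helper name `without_temporary` is hypothetical.

```python
def without_temporary(names):
    """Keep only names that do not start with the 'tmp_' prefix."""
    return [name for name in names if not name.lower().startswith("tmp_")]


# A few names copied from the listings above.
listed_names = [
    "CUSTOMER_SUPPORT_TICKETS",
    "MY_VIEW",
    "TEST_TXAI_DOCUMENTS",
    "tmp_1832643658291609600_4_7_1",
    "TEST_TXAI_TOPIC_CLASSIFIER",
    "tmp_1832637382593806336_6_1",
]

for name in without_temporary(listed_names):
    print(name)
```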


    show example result rows,

    Show Tables counts for Documents, Extractions and Audit Log

In [137]:
%%sql
SELECT COUNT(ALL text_doc_id) FROM {{schema}}.TEST_TXAI_DOCUMENTS;


Count(TEST_TXAI_DOCUMENTS.TEXT_DOC_ID)
99
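
To compare row counts across several result tables, the count queries could be generated from a list of table names. The sketch below only builds the SQL strings; the table names are taken from the listing above, while the helper `build_count_queries` and the schema value are assumptions.

```python
def build_count_queries(schema, tables):
    """Return a dict mapping each table name to a COUNT(*) query (hypothetical helper)."""
    return {name: f'SELECT COUNT(*) FROM "{schema}"."{name}"' for name in tables}


result_tables = [
    "TEST_TXAI_DOCUMENTS",
    "TEST_TXAI_DOCUMENTS_AI_LAB_MY_VIEW",
    "TEST_TXAI_TOPIC_CLASSIFIER",
]

for table_name, query in build_count_queries("AI_LAB", result_tables).items():
    print(table_name, "->", query)
```

Running the same queries after each preprocessing run gives comparable numbers for the later steps.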


### Change config and a second run

    # todo how to change config?


In [None]:
run_text_ai_preprocessing()

    Show Tables counts for Documents, Extractions and Audit Log


### Adding data to source and a third run

Add the second half of the data to the first data table, then run the preprocessing again.
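
One possible way to add the remaining data, assuming the first part was created with `"TICKET_ID" < 100` as in the cell near the top of this notebook, is to insert all rows at or above that threshold from the full table. The notebook does not define this step yet, so the helper name `build_insert_remaining` and the choice of threshold are assumptions.

```python
def build_insert_remaining(schema, source_table, target_table, threshold=100):
    """Build an INSERT that copies rows with TICKET_ID >= threshold into the target."""
    return (
        f'INSERT INTO "{schema}"."{target_table}" '
        f'SELECT * FROM "{schema}"."{source_table}" '
        f'WHERE "TICKET_ID" >= {threshold}'
    )


# Example values matching the names used earlier in this notebook.
print(build_insert_remaining("AI_LAB", "CUSTOMER_SUPPORT_TICKETS", "MY_VIEW"))
```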

In [None]:
run_text_ai_preprocessing()

    Talk about the time the preprocessing takes in step 1 and step 3, compare them, and discuss how it is only run on new data.

    Show Tables counts for Documents, Extractions and Audit Log
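
To compare how long the individual runs take, each call can be wrapped in a small timing helper. This is a minimal sketch using the standard library; the helper name `timed` is hypothetical, and `sum` stands in for `run_text_ai_preprocessing`.

```python
import time


def timed(function, *args, **kwargs):
    """Run function and return a tuple (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = function(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this would be timed(run_text_ai_preprocessing).
_, seconds = timed(sum, range(1000))
print(f"took {seconds:.4f} s")
```

Storing the elapsed time of the first and the third run side by side would make the effect of processing only new data visible.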