# Text AI Extension preprocessing

Here we will demonstrate how the Text AI Extension data-preprocessing can be used.

    TODO: Explain which options are available. Should we also cover why preprocessing is needed, or is that too basic or out of scope?


## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the AI-Lab](../main_config.ipynb).
2. [Initialize the Text AI Extension](./txaie_init.ipynb).
3. [Initialize the Transformers Extension](../transformers/te_init.ipynb).

## Activate the Text AI Extension SLC

In [None]:
from exasol.nb_connector.connections import open_pyexasol_connection
from exasol.nb_connector.language_container_activation import get_activation_sql

activation_sql = get_activation_sql(ai_lab_config)

TODO: Do we need to do this with each new connection, and therefore not here?

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    conn.execute(query=activation_sql)
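
If the activation only lasts for the current session (an assumption, not confirmed in this notebook), it has to be repeated on every new connection. The following is a minimal sketch of a wrapper that does this. `activated_connection` and its `open_fn` parameter are hypothetical names introduced for illustration; in this notebook `open_fn` would be `open_pyexasol_connection`.

```python
from contextlib import contextmanager


@contextmanager
def activated_connection(open_fn, config, activation_sql, **kwargs):
    """Open a connection via open_fn and run the activation SQL on it first.

    open_fn is a placeholder for a connection factory such as
    open_pyexasol_connection; it must return a context manager.
    """
    with open_fn(config, **kwargs) as conn:
        # Assumption: activation is session-scoped, so it is
        # repeated for every newly opened connection.
        conn.execute(activation_sql)
        yield conn
```

With such a wrapper, a cell could use `with activated_connection(open_pyexasol_connection, ai_lab_config, activation_sql, compression=True) as conn:` and would not need a separate activation call.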

triggers preprocessing to create the text annotations and text extraction.



## Get an example dataset

We will be using a dataset which holds information on customer support tickets. We will split this data into two sets, in order to demonstrate how the preprocessing tasks handle new data being added to a dataset.
But first we want to make sure the tables we want to use don't already exist, for example from a previous run of this notebook. Therefore, we are going to drop them.
First, we define a list of tables to drop:

In [None]:
table_list = [
    "TOPIC_CLASSIFIER",
    "TOPIC_CLASSIFIER_LOOKUP_TOPIC",
    "TOPIC_CLASSIFIER_LOOKUP_SETUP",
    "NAMED_ENTITY",
    "NAMED_ENTITY_LOOKUP_ENTITY_NAME",
    "NAMED_ENTITY_LOOKUP_SETUP",
    "DOCUMENTS",
    "DOCUMENTS_AI_LAB_CUSTOMER_SUPPORT_TICKETS",
    "KEYWORD_SEARCH",
    "KEYWORD_SEARCH_LOOKUP_KEYWORD",
    "KEYWORD_SEARCH_LOOKUP_SETUP"
]

Next, we define a function that drops these tables as well as our main table, and then call it.

In [None]:
table = "MY_TABLE"
OUTPUT_SCHEMA = ai_lab_config.db_schema

def delete_text_ai_preprocessing_tables():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        # Drop the tables created by the preprocessing
        for drop_table in table_list:
            conn.execute(f"""DROP TABLE IF EXISTS "{OUTPUT_SCHEMA}"."{drop_table}" """)
        # Drop the main table, which lives in the same schema
        conn.execute(f"""DROP TABLE IF EXISTS "{OUTPUT_SCHEMA}"."{table}" """)

In [None]:
delete_text_ai_preprocessing_tables()
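
The drop statements above place schema and table names directly inside double-quoted identifiers. A minimal sketch of building such statements with explicit identifier quoting is shown below. `quote_identifier` and `build_drop_statements` are hypothetical helpers, not part of any Exasol library, and `MY_SCHEMA` is a made-up schema name.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def build_drop_statements(schema: str, tables: list[str]) -> list[str]:
    """Return one DROP TABLE IF EXISTS statement per table name."""
    return [
        f"DROP TABLE IF EXISTS {quote_identifier(schema)}.{quote_identifier(t)}"
        for t in tables
    ]


# Illustration with a made-up schema name and two of the tables listed above
statements = build_drop_statements("MY_SCHEMA", ["DOCUMENTS", "KEYWORD_SEARCH"])
for statement in statements:
    print(statement)
```

Building the statements separately from executing them also makes it easy to review what will be dropped before running anything against the database.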

You can then load the data using [this notebook](../data/data_customer_support.ipynb). This loads the data into a table called "CUSTOMER_SUPPORT_TICKETS" found in the schema defined in the ai_lab_config variable db_schema.
For the purpose of this notebook, we want to split this data into two parts. So we need to load it into a pandas dataframe.

In [None]:
with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
    whole_data_df = conn.export_to_pandas(f"""SELECT * FROM "{ai_lab_config.db_schema}"."CUSTOMER_SUPPORT_TICKETS" """)

Then, we randomly split the dataframe into two separate dataframes. We will upload the first one into our data table ???.

In [None]:
import numpy as np

# Shuffle all rows, then cut the result into two halves
shuffled = whole_data_df.sample(frac=1)
split_df_list = np.array_split(shuffled, 2)
#todo create a view with limit (100) instead, add to view for step 3 below
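
A random split differs on every run unless the shuffle is seeded. The sketch below shows a reproducible split into two halves using only the standard library and a made-up list of ticket IDs instead of the actual dataframe. `split_in_two` is a hypothetical helper. With pandas, the analogous step would be passing `random_state` to `DataFrame.sample`.

```python
import random


def split_in_two(keys, seed=42):
    """Shuffle a copy of keys with a fixed seed and cut it into two halves."""
    shuffled = list(keys)
    random.Random(seed).shuffle(shuffled)
    middle = (len(shuffled) + 1) // 2  # the first half gets the extra element
    return shuffled[:middle], shuffled[middle:]


ticket_ids = list(range(1, 12))  # made-up IDs, for illustration only
first_half, second_half = split_in_two(ticket_ids)
print(len(first_half), len(second_half))
```

Because the seed is fixed, rerunning the notebook yields the same two halves, which keeps the row counts of later runs comparable.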

## Get Models

We will run our preprocessing with several different transformer models, which we need to download from Hugging Face.
First, we define which models we want to use. You can browse for your preferred model [here](https://huggingface.co/models).

In [None]:
NAMED_ENTITY_MODEL="guishe/nuner-v2_fewnerd_fine_super"
NLI_MODEL="tasksource/ModernBERT-large-nli"
FEATURE_EXTRACTION_MODEL="answerdotai/ModernBERT-large"

Then we import the "load_huggingface_model" function, defined in another notebook, which will help us download the models.

In [None]:
%run ../transformers/utils/model_retrieval.ipynb

And now we are ready to download our models. Each of these calls will take some time, depending on your internet connection.

In [None]:
load_huggingface_model(ai_lab_config, NAMED_ENTITY_MODEL, 'token-classification')

In [None]:
load_huggingface_model(ai_lab_config, NLI_MODEL, 'zero-shot-classification')

In [None]:
load_huggingface_model(ai_lab_config, FEATURE_EXTRACTION_MODEL, 'feature-extraction')
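
The three download calls above follow the same pattern, so they could also be driven from a single mapping. This is a sketch that assumes `load_huggingface_model` keeps the `(config, model_name, task)` call shape used above. `download_models`, its `loader` parameter and `MODEL_TASKS` are hypothetical names introduced here.

```python
def download_models(loader, config, model_tasks):
    """Call loader(config, model_name, task) per entry; return the names loaded."""
    loaded = []
    for model_name, task in model_tasks.items():
        loader(config, model_name, task)
        loaded.append(model_name)
    return loaded


# Model names and tasks as used in the cells above
MODEL_TASKS = {
    "guishe/nuner-v2_fewnerd_fine_super": "token-classification",
    "tasksource/ModernBERT-large-nli": "zero-shot-classification",
    "answerdotai/ModernBERT-large": "feature-extraction",
}
```

In this notebook the call would be `download_models(load_huggingface_model, ai_lab_config, MODEL_TASKS)`. Passing the loader in as an argument keeps the helper independent of the notebook that defines it.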

## Further setup


In [None]:
from exasol.ai.text.extraction import *
from exasol.ai.text.extraction.extraction import Extraction
from exasol.ai.text.extraction.abstract_extraction import Output

In [None]:
schema=ai_lab_config.db_schema
table="CUSTOMER_SUPPORT_TICKETS"
text_column="TICKET_DESCRIPTION"
key_column="TICKET_ID"
topics=["hardware issue", "software issue"]

## Define which steps to run



In [None]:
%config SqlMagic.displaylimit = 20

In [None]:
%run ./utils/txaie_default_extractor.ipynb

In [None]:
%run ./utils/txaie_extaction_wrapper.ipynb

TODO: Explain what the extraction wrapper and the default extraction do, and where to find them.

In [None]:
#todo put stuff into secret store?
extraction = ExtractionWrapper()


In [None]:
def run_text_ai_preprocessing():
    with open_pyexasol_connection(ai_lab_config, compression=True) as conn:
        conn.execute(query=activation_sql)
        extraction.run(conn, schema, "PYTHON3_TXAIE")

## Run the preprocessing

    with half data

    first, show number of rows in data table

In [None]:
run_text_ai_preprocessing()

The next call makes it possible to run SQL directly in this notebook, so that we can display the results of our preprocessing more easily.

In [None]:
%run ../../utils/jupysql_init.ipynb

In [None]:
%config SqlMagic.displaylimit = 10

First, let's look at which tables were created by our preprocessing:

    talk about what they contain?

In [None]:
%%sql
SELECT TABLE_SCHEMA, TABLE_NAME FROM EXA_ALL_TABLES

There are also some new views:

In [None]:
%%sql
SELECT VIEW_SCHEMA, VIEW_NAME FROM EXA_ALL_VIEWS

    show example result rows,

    Show Tables counts for Documents, Extractions and Audit Log

In [None]:
%%sql
SELECT COUNT(ALL text_doc_id) FROM {{schema}}.DOCUMENTS; -- or TEST_TXAI_DOCUMENTS?
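
The same kind of count can be collected for several tables at once from Python. This is a sketch that assumes the connection's `execute` returns a statement object offering `fetchval()`, as pyexasol statements do. `count_rows` is a hypothetical helper, and the table names passed to it need to match the tables that were actually created.

```python
def count_rows(conn, schema, tables):
    """Return {table_name: row_count} for the given tables in schema."""
    counts = {}
    for table in tables:
        # One COUNT(*) query per table; identifiers are double-quoted
        stmt = conn.execute(f'SELECT COUNT(*) FROM "{schema}"."{table}"')
        counts[table] = stmt.fetchval()
    return counts
```

Calling it after each preprocessing run, for example with `count_rows(conn, schema, ["DOCUMENTS"])`, gives a dictionary that can be compared between runs.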


### Change config and a second run

    # todo how to change config?


In [None]:
run_text_ai_preprocessing()

    Show Tables counts for Documents, Extractions and Audit Log


### Adding data to the source and a third run

Add the second half of the data to the first data table and run the preprocessing again.

In [None]:
run_text_ai_preprocessing()

    TODO: Talk about the time the preprocessing takes in step 1 and step 3, compare them, and discuss how it only runs on new data.

    Show Tables counts for Documents, Extractions and Audit Log
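
To compare how long the individual runs take, each call can be wrapped in a small timer. This is a minimal sketch using only the standard library. `timed` is a hypothetical helper, and the built-in `sum` stands in for the preprocessing function so that the example runs on its own.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


durations = {}
# Stand-in workload; in the notebook this would be run_text_ai_preprocessing
_, durations["example"] = timed(sum, range(1000))
print(f"example took {durations['example']:.6f} s")
```

In the notebook, `_, durations["first run"] = timed(run_text_ai_preprocessing)` and a matching entry for the third run would allow a direct comparison of the two durations.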