## Advanced UDF with GPU

This notebook performs an in-depth analysis of news articles using the data provided in the S3 virtual schema tutorial. The dataset is derived from the well-known Reuters-21578 collection, a repository of news articles, each categorized according to its subject matter.

Our first step will be to derive two distinct columns from the input data:
1. **Topics:** the unique topics assigned to the news articles, which let us systematically analyze the subject areas the articles cover.
2. **Titles:** the headline or title of each news article, which we will later use for semantic matching.

Once the input data is in place, we will use a natural language processing approach to link news article titles to their most relevant topics. Specifically, we will employ `sentence-transformers/all-mpnet-base-v2`, a model from the sentence-transformers family designed to produce high-quality sentence embeddings. The model will run in a UDF that uses the GPU to efficiently generate vector representations for both the titles and the topics.

By comparing these embeddings, the UDF scores how closely each article title matches each topic. Finally, we will output the best-matching topics for each title, which shows how the subjects are distributed across our set of news articles.

### Open Secure Configuration Storage

In [None]:
%run ../utils/access_store_ui.ipynb
display(get_access_store_ui('../'))

### Prerequisites
- Run the [S3 VS tutorial](../cloud/02_s3_vs_reuters.ipynb) to populate the input data (the `NEWS` table).

### Instantiate ScriptLanguageContainer

The following cell creates an instance of the class `ScriptLanguageContainer` from the notebook-connector,
which provides a convenient way to use `exaslct` in the AI Lab.

In [None]:
from exasol.nb_connector.slc import ScriptLanguageContainer
slc = ScriptLanguageContainer(secrets=ai_lab_config, name="gpu_slc")

### Connect to the database and activate the container

In [None]:
from exasol.nb_connector.language_container_activation import open_pyexasol_connection_with_lang_definitions

conn = open_pyexasol_connection_with_lang_definitions(ai_lab_config, schema=ai_lab_config.db_schema, compression=True)

### Prepare the models

In [None]:
from exasol.nb_connector.model_installation import install_model, TransformerModel
from transformers import AutoModel

In [None]:
install_model(ai_lab_config, TransformerModel(
                "sentence-transformers/all-mpnet-base-v2", "feature-extraction", AutoModel
            ),)

### Create the UDF

Now create the UDF, which implements a "semantic join" between pairs of text data (`text1`, `text2`) by computing their semantic similarity with a transformer neural network model (from the sentence-transformers family). It runs on a GPU and is designed to scale to large amounts of data by processing the input in batches. For each pair (`text1`, `text2`), the UDF computes the cosine similarity between the two embeddings in a row-wise fashion on the GPU. It then emits `text1` and `text2` together with the `similarity_score`, a value between -1 and 1, where 1.0 means the embeddings point in the same direction.
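
The row-wise, batched similarity computation described above can be sketched without a GPU. The following is a minimal illustration in plain Python, not the UDF's implementation: the helper names `batches` and `row_wise_cosine` and the two-dimensional toy vectors are assumptions made up for this example, whereas the UDF operates on PyTorch tensors produced by the model.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Vector = Sequence[float]


def batches(rows: List[Tuple[Vector, Vector]], size: int) -> Iterator[List[Tuple[Vector, Vector]]]:
    """Yield consecutive slices of at most `size` rows (hypothetical helper)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def row_wise_cosine(left: List[Vector], right: List[Vector]) -> List[float]:
    """Cosine similarity of left[i] with right[i] for every row i."""
    scores = []
    for a, b in zip(left, right):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        scores.append(dot / (norm_a * norm_b))
    return scores


# Toy embedding pairs (made up): same direction, orthogonal, opposite direction.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]

scores: List[float] = []
for batch in batches(pairs, size=2):
    left = [pair[0] for pair in batch]
    right = [pair[1] for pair in batch]
    scores.extend(row_wise_cosine(left, right))

print(scores)  # [1.0, 0.0, -1.0]
```

Each row is compared only with its counterpart in the same row, so the cost grows linearly with the number of input pairs rather than quadratically.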


In [None]:
sql_semantic_join = f"""
--/
CREATE OR REPLACE {slc.language_alias} SET SCRIPT "SEMANTIC_JOIN"(text1 VARCHAR(2000000), text2 VARCHAR(2000000))
EMITS(text1 VARCHAR(2000000), text2 VARCHAR(2000000), similarity_score DOUBLE) AS
%perInstanceRequiredAcceleratorDevices GpuNvidia;

import pandas as pd
from transformers import AutoTokenizer, AutoModel, Pipeline, pipeline
import torch
import exasol.bucketfs as bfs
from pathlib import Path
from exasol_transformers_extension.utils import device_management
from exasol.python_extension_common.connections.bucketfs_location import (
    create_bucketfs_location_from_conn_object)
from exasol_transformers_extension.utils.bucketfs_model_specification import (
    BucketFSModelSpecification)
from exasol_transformers_extension.utils.huggingface_hub_bucketfs_model_transfer_sp import (
    HuggingFaceHubBucketFSModelTransferSP)
from exasol_transformers_extension.utils.load_local_model import LoadLocalModel


device = torch.device("cuda")

def get_bucketfs_location(exa, bucketfs_conn_name: str) -> bfs.path.PathLike:
    return create_bucketfs_location_from_conn_object(
        exa.get_connection(bucketfs_conn_name))

def load_transformers_pipline(exa,
                              bucketfs_conn_name: str,
                              sub_dir: str,
                              device: torch.device,
                              task_type: str,
                              model_name: str,
                              model_factory,
                              tokenizer_factory=AutoTokenizer) -> Pipeline:
    model_loader = LoadLocalModel(pipeline,
                                  base_model_factory=model_factory,
                                  tokenizer_factory=tokenizer_factory,  # type: ignore
                                  task_type=task_type,
                                  device=device)    # type: ignore

    model_spec = BucketFSModelSpecification(model_name, task_type, bucketfs_conn_name,
                                            Path(sub_dir))

    bucketfs_location = get_bucketfs_location(exa, bucketfs_conn_name)

    model_loader.clear_device_memory()
    model_loader.set_current_model_specification(model_spec)
    model_loader.set_bucketfs_model_cache_dir(bucketfs_location)
    return model_loader.load_models()

model_pipeline = load_transformers_pipline(
    exa,
    bucketfs_conn_name='{ai_lab_config.bfs_connection_name}',
    sub_dir='{ai_lab_config.bfs_model_subdir}',
    device=device,
    model_name="sentence-transformers/all-mpnet-base-v2",
    task_type='feature-extraction',
    model_factory=AutoModel)

# Function to calculate embeddings
def batch_get_embeddings(batch_texts, model_pipeline, device):
    # Tokenize the batch of texts
    inputs = model_pipeline.tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt').to(device)  # Send to GPU
    # Get embeddings from the model
    with torch.no_grad():
        outputs = model_pipeline.model(**inputs)
    # Use pooler_output if the model provides it; otherwise mean-pool last_hidden_state over the token dimension
    return outputs.pooler_output if 'pooler_output' in outputs else outputs.last_hidden_state.mean(dim=1)

# Function to compute row-by-row cosine similarity using PyTorch
def row_by_row_cosine_similarity_gpu(embeddings1, embeddings2):
    # Normalize each embedding to unit length (L2 norm per row)
    normalized_embeddings1 = embeddings1 / embeddings1.norm(dim=1)[:, None]
    normalized_embeddings2 = embeddings2 / embeddings2.norm(dim=1)[:, None]

    # Compute pair-wise cosine similarity for each corresponding pair
    similarities = (normalized_embeddings1 * normalized_embeddings2).sum(dim=1)
    return similarities

def run(ctx):
    # Batch processing configurations
    batch_size = 100  # Adjust this based on your memory capacity

    while True:
        df = ctx.get_dataframe(batch_size)
        if df is None:
            break

        data1 = df["text1"]
        data2 = df["text2"]

        # Compute embeddings for each dataset in batches
        embeddings1 = batch_get_embeddings(data1.to_list(), model_pipeline, device)
        embeddings2 = batch_get_embeddings(data2.to_list(), model_pipeline, device)

        # Calculate row-by-row cosine similarity using PyTorch on GPU
        similarity_scores = row_by_row_cosine_similarity_gpu(embeddings1, embeddings2)

        # Attach the scores to the batch; thresholding happens later in the SQL query
        df["SIMILARITY_SCORE"] = similarity_scores.cpu().numpy()
        ctx.emit(df)
/
"""
conn.execute(sql_semantic_join)

### Initialise JupySQL
Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation of <a href="https://github.com/exasol/sqlalchemy-exasol" target="_blank" rel="noopener">sqlalchemy-exasol</a> for details on how to connect to the database using the Exasol SQLAlchemy driver.

In [None]:
%run ../utils/jupysql_init.ipynb

### Run the UDF on the new data

Now we apply the `SEMANTIC_JOIN` UDF to the topics and titles from the `NEWS` table, which was created by the S3 Virtual Schema tutorial. The query pairs every news TITLE with every TOPIC, scores each pair, keeps only the strong matches (similarity > 0.5), and ranks the remaining TOPICs per TITLE by similarity. The result lists, for each TITLE, up to five TOPICs it is semantically closest to, in rank order (ties share a rank, so more rows can appear).
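
To make the filter-then-rank step concrete, here is a minimal sketch in plain Python that mirrors what the SQL query does with the UDF output. The function name `top_topics` and the sample rows are made up for illustration and are not part of the tutorial's code; in the notebook this logic runs in the database.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, str, float]  # (title, topic, similarity_score)


def top_topics(rows: Iterable[Row], threshold: float = 0.5,
               max_rank: int = 5) -> Dict[str, List[Tuple[int, str, float]]]:
    """Keep matches above `threshold`, rank them per title, keep ranks <= max_rank."""
    by_title: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for title, topic, score in rows:
        if score > threshold:  # corresponds to WHERE SIMILARITY_SCORE > 0.5
            by_title[title].append((topic, score))

    result: Dict[str, List[Tuple[int, str, float]]] = {}
    for title in sorted(by_title):  # corresponds to ORDER BY TITLE
        ranked = sorted(by_title[title], key=lambda match: match[1], reverse=True)
        kept = []
        rank = 0
        for position, (topic, score) in enumerate(ranked, start=1):
            # RANK() semantics: ties share a rank, the next distinct score skips ahead.
            if position == 1 or score != ranked[position - 2][1]:
                rank = position
            if rank > max_rank:
                break
            kept.append((rank, topic, score))
        result[title] = kept
    return result


# Made-up UDF output rows for two titles.
rows = [
    ("Wheat exports rise", "grain", 0.81),
    ("Wheat exports rise", "wheat", 0.81),
    ("Wheat exports rise", "crude", 0.32),
    ("Wheat exports rise", "trade", 0.60),
    ("Oil prices fall", "crude", 0.77),
]
matches = top_topics(rows)
print(matches["Wheat exports rise"])
# [(1, 'grain', 0.81), (1, 'wheat', 0.81), (3, 'trade', 0.6)]
```

The weak match (`crude`, 0.32) is dropped before ranking, and the two tied topics share rank 1, which is why the next rank is 3.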

In [None]:
%%sql
SELECT *
FROM (
    SELECT
        TEXT1 AS TITLE, TEXT2 AS TOPIC,
        RANK() OVER (PARTITION BY TEXT1 ORDER BY SIMILARITY_SCORE DESC) as r,
        SIMILARITY_SCORE
    FROM (
        SELECT
            "SEMANTIC_JOIN"(TITLE, TOPIC)
        FROM (
            SELECT distinct TOPIC, TITLE
            FROM (
                SELECT distinct title as TITLE FROM NEWS WHERE TITLE IS NOT NULL AND BODY IS NOT NULL
            )
            CROSS JOIN (
                SELECT distinct topic AS TOPIC FROM NEWS WHERE topic IS NOT NULL
            )
        )
    )
    WHERE SIMILARITY_SCORE > 0.5
)
WHERE r<=5
ORDER BY TITLE, r ASC

### Additional Notes

#### Parallel execution
Please note that executing queries with GPU UDFs in parallel can lead to unexpected errors, as the GPU libraries used within the UDFs may attempt to allocate the same GPU device and memory simultaneously. In the current Exasol version, it is the user's responsibility to ensure that the libraries are configured appropriately or that only one query uses a GPU device at a time.
Here are some links that might be helpful for the configuration:
- [Manage GPU Memory When Using TensorFlow and PyTorch](https://docs.ncsa.illinois.edu/systems/hal/en/latest/user-guide/prog-env/gpu-memory.html)
- [pytorch set per process memory fraction](https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory.set_per_process_memory_fraction.html)
- [TensorFlow LogicalDeviceConfiguration](https://www.tensorflow.org/api_docs/python/tf/config/LogicalDeviceConfiguration)
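
As one possible configuration, PyTorch can cap the share of GPU memory a single process may allocate. The sketch below wraps `torch.cuda.set_per_process_memory_fraction` in a small helper; the helper name `limit_gpu_memory` and the chosen fraction are assumptions for illustration, and whether a cap like this is sufficient for your workload is something to verify on your own system.

```python
def limit_gpu_memory(fraction: float, device_index: int = 0) -> bool:
    """Cap this process's share of GPU memory; return True if the cap was applied.

    Hypothetical helper: returns False when PyTorch or a CUDA device is unavailable.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    try:
        import torch  # imported lazily so the helper also loads on machines without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Allocations beyond this fraction of the device's memory raise an out-of-memory error.
    torch.cuda.set_per_process_memory_fraction(fraction, device_index)
    return True


# Example: allow this process at most half of the memory of GPU 0.
applied = limit_gpu_memory(0.5)
```

Inside a UDF, a call like this would need to run before the model is loaded onto the GPU so that the cap applies to the model's allocations as well.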

Additionally, it is possible to control the number of parallel UDF instances by configuring [UDF Instance Limiting](https://docs.exasol.com/db/latest/database_concepts/udf_scripts/udf_instance_limit.htm).
