# Generative text model

In this notebook we will load and use a generative language model that can produce a continuation for a given text. Learn more about the Text Generation task <a href="https://huggingface.co/tasks/text-generation" target="_blank" rel="noopener">here</a>.

We will be using a generic prediction UDF script. To execute queries and load data from the Exasol database we will be using the <a href="https://github.com/exasol/pyexasol" target="_blank" rel="noopener">`pyexasol`</a> module.

## Prerequisites

Before using this notebook, complete the following steps:
1. [Configure the sandbox](../sendbox_config.ipynb).
2. [Initialize the Transformer Extension](te_init.ipynb).

## Set up

In [1]:
# TODO: start using the secret store.

from collections import UserDict

class Secrets(UserDict):
    """This class mimics the Secret Store we will start using soon."""

    def save(self, key: str, value: str) -> "Secrets":
        self[key] = value
        return self

# For now just hardcode the configuration.
sb_config = Secrets({
    'EXTERNAL_HOST_NAME': '192.168.124.93',
    'HOST_PORT': '8888',
    'USER': 'sys',
    'PASSWORD': 'exasol',
    'BUCKETFS_PORT': '6666',
    'BUCKETFS_USER': 'w',
    'BUCKETFS_PASSWORD': 'write',
    'BUCKETFS_USE_HTTPS': 'False',
    'BUCKETFS_SERVICE': 'bfsdefault',
    'BUCKETFS_BUCKET': 'default',
    'SCRIPT_LANGUAGE_NAME': 'PYTHON3_60',
    'UDF_FLAVOR': 'python3-ds-EXASOL-6.0.0',
    'UDF_RELEASE': '20190116',
    'UDF_CLIENT': 'exaudfclient_py3',
    'SCHEMA': 'IDA',
    'TE_TOKEN': '',
    'TE_TOKEN_CONN': '',
    'TE_BFS_CONN': 'MyBFSConn',
    'TE_BFS_DIR': 'my_storage',
    'TE_MODELS_BFS_DIR': 'models',
    'TE_MODELS_CACHE_DIR': 'models_cache'
})

EXTERNAL_HOST = f"{sb_config.get('EXTERNAL_HOST_NAME')}:{sb_config.get('HOST_PORT')}"
WEBSOCKET_URL = f"exa+websocket://{sb_config.get('USER')}:{sb_config.get('PASSWORD')}" \
    f"@{EXTERNAL_HOST}/{sb_config.get('SCHEMA')}?SSLCertificate=SSL_VERIFY_NONE"
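The `Secrets` class above is just a dictionary with a chainable `save` method. As a minimal sketch of how it behaves, the snippet below repeats the class and fills it with illustrative values (`demo_config`, `localhost`, `8563` and `DEMO` are assumptions made for this example, not settings of the real sandbox):

```python
from collections import UserDict


class Secrets(UserDict):
    """Stand-in for the Secret Store, same as the class defined above."""

    def save(self, key: str, value: str) -> "Secrets":
        self[key] = value
        return self


# Illustrative values only; substitute the settings of your own sandbox.
demo_config = Secrets({'EXTERNAL_HOST_NAME': 'localhost'})

# save() returns the store itself, so calls can be chained.
demo_config.save('HOST_PORT', '8563').save('SCHEMA', 'DEMO')

# get() comes from the dictionary interface and returns None for a missing key.
demo_host = f"{demo_config.get('EXTERNAL_HOST_NAME')}:{demo_config.get('HOST_PORT')}"
print(demo_host)
```

Because `get` returns `None` for a missing key instead of raising an error, a misspelled key name would silently end up as the string `None` in a formatted host or URL, so it is worth double-checking the key names.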

## Get language model

To demonstrate the text generation task we will use [Open Pretrained Transformers (OPT)](https://huggingface.co/facebook/opt-125m), a decoder-only pre-trained transformer from Facebook.

We need to load the model from the Hugging Face Hub into BucketFS. This can take a long time, and unfortunately we cannot tell exactly when it has finished. The notebook's hourglass is not a reliable indicator, because BucketFS may still be doing some work when the call issued by the notebook returns. Please wait a few moments after that before querying the model.
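Since there is no reliable completion signal, one option is to poll until the model responds instead of waiting for a fixed time. The following is only a sketch of such a polling helper. Both `wait_until` and the readiness check are hypothetical names introduced here. This notebook does not provide a readiness check, so the example uses a fake one that succeeds on the third attempt. In practice the check could, for instance, be a trial query to the model that returns without an error.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll is_ready() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical readiness check: pretends the model becomes available on the third poll.
attempts = {'count': 0}


def fake_model_is_ready() -> bool:
    attempts['count'] += 1
    return attempts['count'] >= 3


ready = wait_until(fake_model_is_ready, timeout=5.0, interval=0.01)
print(ready)
```

The helper returns `False` rather than raising when the timeout expires, which leaves the decision on how to proceed to the caller.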

In [34]:
# This is the name of the model at the Huggingface Hub
MODEL_NAME = 'facebook/opt-125m'

In [35]:
%run ./model_retrieval.ipynb
load_huggingface_model(MODEL_NAME, sb_config)

Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 80.6kB/s]
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 996kB/s]
Downloading (…)olve/main/vocab.json: 100%|███████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 5.27MB/s]
Downloading (…)olve/main/merges.txt: 100%|███████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 3.05MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 325kB/s]
Downloading pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████████| 251M/251M [00:14<00:00, 17.6MB/s]


## Use language model

Let's put the start of our conversation in a variable.

In [36]:
MY_TEXT = 'The bar-headed goose can fly at much'

# Make sure our texts can be used in an SQL statement.
MY_TEXT = MY_TEXT.replace("'", "''")
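The `replace` call above doubles every single quote, which is how a quote character is written inside a single-quoted SQL string literal. A small sketch of the same step as a reusable function (the name `escape_sql_literal` is an assumption made for this example):

```python
def escape_sql_literal(text: str) -> str:
    """Double every single quote so the text can sit inside a '...' SQL literal."""
    return text.replace("'", "''")


sample = "The goose's wings"
print(escape_sql_literal(sample))
```

Escaping should be applied exactly once per text. Applying it to an already escaped string doubles the quotes again, which is why the notebook escapes the text only after each new result comes back from the model.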

In [37]:
# Let's put a limit on the length of text the model can generate in one call.
# The limit is specified in the number of characters.
MAX_LENGTH = 30

In [38]:
import pyexasol

We will be updating the `MY_TEXT` variable at every call to the model, so that each call continues the text produced by the previous one.
Please run the next cell multiple times to see how the text evolves.

In [46]:
sql = f"""
SELECT {sb_config.get("SCHEMA")}.TE_TEXT_GENERATION_UDF(
    NULL,
    '{sb_config.get("TE_BFS_CONN")}',
    '{sb_config.get("TE_TOKEN_CONN")}',
    '{sb_config.get("TE_MODELS_BFS_DIR")}',
    '{MODEL_NAME}',
    '{MY_TEXT}',
    {MAX_LENGTH},
    True
)
"""

with pyexasol.connect(
    dsn=EXTERNAL_HOST,
    user=sb_config.get('USER'),
    password=sb_config.get('PASSWORD'),
    compression=True,
) as conn:
    # The query returns a single row; squeeze() turns it into a pandas Series.
    result = conn.export_to_pandas(query_or_table=sql).squeeze()
    MY_TEXT = result['GENERATED_TEXT']
    # If the call fails, the error text can be found in result['ERROR_MESSAGE'].

print(MY_TEXT)
MY_TEXT = MY_TEXT.replace("'", "''")

The bar-headed goose can fly at much higher speeds than the average goose.
I'm not sure if you're being sarcastic or not, but the bar-headed goose is
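
Re-running the cell by hand amounts to feeding each result back in as the next input. The loop below is a sketch of that idea only. `extend_text` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that appends a fixed fragment instead of calling the UDF. A real implementation would wrap the `pyexasol` query shown above and escape the text before each call.

```python
from typing import Callable


def extend_text(generate: Callable[[str], str], text: str, rounds: int) -> str:
    """Feed the text through generate() repeatedly, as re-running the cell does."""
    for _ in range(rounds):
        text = generate(text)
    return text


# Hypothetical stand-in for the UDF call: it just appends a fixed fragment.
def fake_generate(text: str) -> str:
    return text + ' higher'


final_text = extend_text(fake_generate, 'The bar-headed goose can fly at much', rounds=2)
print(final_text)
```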
