<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Language Models<br>
   <span style="font-size: 24px;">An Introduction to Parallel CPU Inferencing of HuggingFace Models in Vantage</span>
       
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Hugging Face is a French-American company based in New York City that develops tools for building machine learning applications. It is best known for its <b>Transformers library</b>, which provides open-source implementations of transformer models for text, image, video, and audio tasks, including time series. These models include well-known architectures such as BERT and GPT. The library is compatible with the PyTorch, TensorFlow, and JAX deep learning libraries. <br>
    Deep learning models on Hugging Face are pretrained by individual users, open-source groups, and companies on various types of data: text (NLP), audio, images, videos, and more. The most popular tool among users is PyTorch, an open-source Python library that can be used to create a deep learning model from scratch, or to take an existing model and retrain or fine-tune it on a new dataset (transfer learning) before publishing it on Hugging Face. Models can be run for inference on either CPUs or GPUs; for smaller models, GPUs provide only a slight performance improvement.<br>
</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As many Hugging Face models are availble in ONNX Runtime, we can load them using the BYOM feature of Vantage and run them in Vantage. Because of Graph Optimizations on ONNX Runtime, there are proven benchmarks that show that inference with ONNX Runtime will be 20% faster than a native PyTorch model on a CPU. Vantage Parallelism on top of boosted ONNX Runtime inference can turn a Vantage system as effective as inference on GPUs. If we have a Vantage box with 72 AMPs, assuming the table is perfectly distributed, it will closely match the performance of a dedicated GPU and data never moves across the network saving time and I/O operations. As parallelism increases with number of AMPs, the model inference will complete faster in Teradata Vantage with the same amount of text data vs a GPU. We can of course quantize the model (change float8 weights to int8/int4) for inference on CPU to go even faster with some tradeoff with accuracy. However, If Model size goes up GPU advantage will widen – example LLM like LLama2 and costs will be disproportionate with GPU but for smaller models we can get comparable performance. 
</p>
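The 72-AMP argument above is essentially arithmetic. The sketch below is a back-of-the-envelope model, not a benchmark: it assumes every AMP scores only its own rows, one at a time, at the same per-row cost, and it ignores skew, model-loading time, and other overheads. Under those assumptions, the busiest AMP determines the elapsed time. The function names are hypothetical.

```python
import math
from typing import Sequence


def rows_per_amp(total_rows: int, amps: int) -> int:
    """Rows on the busiest AMP when rows are spread as evenly as possible."""
    return math.ceil(total_rows / amps)


def ideal_speedup(total_rows: int, amps: int) -> float:
    """Speedup over scoring every row on a single worker (no skew, no overhead)."""
    return total_rows / rows_per_amp(total_rows, amps)


def wall_clock_seconds(rows_on_each_amp: Sequence[int], seconds_per_row: float) -> float:
    """AMPs run in parallel, so the most heavily loaded AMP sets the elapsed time."""
    return max(rows_on_each_amp) * seconds_per_row


print(rows_per_amp(10_000, 72))   # → 139
print(ideal_speedup(7_200, 72))   # → 72.0
# Hypothetical skew: one AMP holds twice its fair share of 100 rows
print(wall_clock_seconds([200] + [100] * 71, 0.05))
```

The last call shows why the "perfectly distributed" assumption matters: a single overloaded AMP doubles the elapsed time even though the other 71 AMPs finish early.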

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:20px;font-family:Arial;color:#00233c'>1. Configuring the environment</b>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.1 Install the required libraries</b></p>

In [1]:
!pip install optimum sentence_transformers

Collecting optimum
  Downloading optimum-1.23.3-py3-none-any.whl (424 kB)
Collecting sentence_transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Collecting transformers>=4.29
  Downloading transformers-4.46.2-py3-none-any.whl (10.0 MB)
Collecting tokenizers<0.21,>=0.20
  Downloading tokenizers-0.20.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>Please restart the kernel after executing the cell above. The simplest way to restart the kernel is to press the zero key twice in command mode: <b> 0 0</b></i></p>
</div>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>1.2 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [2]:
# Standard libraries
import getpass
import json
import warnings

# Teradata libraries
from teradataml import *
display.max_rows = 5

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. After entering the password and pressing the Enter key, use the down arrow to go to the next cell. Run each step with the Shift + Enter keys.</p>

In [3]:
%run -i ../startup.ipynb
eng = create_context(host='host.docker.internal', username='demo_user', password=password)
print(eng)

Performing setup ...
Setup complete



Enter password:  ··············


... Logon successful
Connected as: teradatasql://demo_user:xxxxx@host.docker.internal/dbc
Engine(teradatasql://demo_user:***@host.docker.internal)


In [4]:
%%capture
execute_sql("SET query_band='DEMO=Language_Model_Init_Python.ipynb;' UPDATE FOR SESSION;")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step: we should execute the step below only if we want to see the status of the databases/tables created and the space used.</p>

In [5]:
%run -i ../run_procedure.py "call space_report();"

You have:  #databases=0 #tables=0 #views=0  You have used 0.8 MB of 30,678.3 MB available - 0.0%  ... Space Usage OK
 
   Database Name                  #tables  #views     Avail MB      Used MB
   demo_user                            0       0  30,678.3 MB       0.8 MB 


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Creation of functions</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The command below creates the database and functions required to run Hugging Face PyTorch models for text summarization and embeddings in Vantage.</p>

In [6]:
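with open('commands.json', 'r') as file:
    data = json.load(file)

# Run each initialization query; a failure most likely means the objects already exist
already_initialized = False
for item in data['queries']:
    try:
        execute_sql(item['query'])
    except Exception as e:
        # Uncomment to see the underlying error: print(f"Error: {e}")
        already_initialized = True

if already_initialized:
    print("The initialization steps have already been executed for this environment!")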
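The loop above only assumes that `commands.json` holds a top-level `queries` list whose items each carry a `query` string. The sketch below illustrates that shape with a hypothetical helper, `load_queries`; the two SQL statements are placeholders, since the real file's contents are not shown in this notebook.

```python
import json

# Hypothetical commands.json content; the real file's SQL statements differ
SAMPLE_COMMANDS = """
{
  "queries": [
    {"query": "SELECT CURRENT_DATE;"},
    {"query": "SELECT CURRENT_TIME;"}
  ]
}
"""


def load_queries(text: str) -> list:
    """Return the SQL strings, in order, from a commands.json-style document."""
    data = json.loads(text)
    return [item["query"] for item in data["queries"]]


for sql in load_queries(SAMPLE_COMMANDS):
    print(sql)
```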

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. HuggingFace Model installation</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the steps below, we will download the Hugging Face model and install it in Vantage.</p> 
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.1 Download the Model using the Optimum utility</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be using <a href = 'https://huggingface.co/BAAI/bge-small-en-v1.5'>BAAI/bge-small-en-v1.5</a><br> The bge-small-en model is a small-scale English text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) as part of their FlagEmbedding project.</p>

In [7]:
!optimum-cli export onnx --opset 16 --trust-remote-code --task sentence-similarity -m BAAI/bge-small-en-v1.5 bge-small-en-v1.5-onnx

modules.json: 100%|████████████████████████████| 349/349 [00:00<00:00, 43.1kB/s]
config_sentence_transformers.json: 100%|███████| 124/124 [00:00<00:00, 19.7kB/s]
README.md: 100%|███████████████████████████| 94.8k/94.8k [00:00<00:00, 4.98MB/s]
sentence_bert_config.json: 100%|█████████████| 52.0/52.0 [00:00<00:00, 21.5kB/s]
config.json: 100%|██████████████████████████████| 743/743 [00:00<00:00, 325kB/s]
model.safetensors: 100%|██████████████████████| 133M/133M [00:00<00:00, 136MB/s]
tokenizer_config.json: 100%|███████████████████| 366/366 [00:00<00:00, 41.0kB/s]
vocab.txt: 100%|█████████████████████████████| 232k/232k [00:00<00:00, 16.9MB/s]
tokenizer.json: 100%|████████████████████████| 711k/711k [00:00<00:00, 41.7MB/s]
special_tokens_map.json: 100%|█████████████████| 125/125 [00:00<00:00, 64.2kB/s]
1_Pooling/config.json: 100%|███████████████████| 190/190 [00:00<00:00, 23.5kB/s]
verbose: False, log level: Level.ERROR

Weight deduplication check in the ONNX export requires accelerate. Pl

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.2 Model Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the steps below, we fix the dynamic dimensions, pin the opset and IR versions for compatibility, and prepare the model for loading into Vantage.</p>

In [8]:
import onnx
import onnxruntime as rt
import transformers
from onnxruntime.tools.onnx_model_utils import make_dim_param_fixed
from sentence_transformers.util import cos_sim

# Target the default ONNX operator set (domain "") at opset 16
op = onnx.OperatorSetIdProto()
op.version = 16

model = onnx.load('bge-small-en-v1.5-onnx/model.onnx')

# Rebuild the model with IR version 8 and opset 16 to be sure they are compatible
model_ir8 = onnx.helper.make_model(model.graph, ir_version=8, opset_imports=[op])

# Fix the variable (symbolic) dimension sizes in our model
make_dim_param_fixed(model_ir8.graph, "batch_size", 1)
make_dim_param_fixed(model_ir8.graph, "sequence_length", 512)
make_dim_param_fixed(model_ir8.graph, "Divsentence_embedding_dim_1", 384)

# Remove the unused token_embeddings output; iterate over a copy so removal is safe
for node in list(model_ir8.graph.output):
    if node.name == "token_embeddings":
        model_ir8.graph.output.remove(node)

# Save the model
onnx.save(model_ir8, 'bge-small-en-v1.5-onnx/model_fixed.onnx')

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.2 Model Preparation: Local Validation</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here we check locally that the exported model works correctly in ONNX format.</p>

In [9]:
sentences_1 = 'How is the weather today?'
sentences_2 = 'What is the current weather like today?'

In [10]:
tokenizer = transformers.AutoTokenizer.from_pretrained("./bge-small-en-v1.5-onnx")
predef_sess = rt.InferenceSession("bge-small-en-v1.5-onnx/model_fixed.onnx")

In [11]:
# Tokenize to exactly 512 tokens to match the model's fixed sequence_length
enc = tokenizer(sentences_1, max_length=512, padding='max_length', truncation=True)

result = predef_sess.run(None, {"input_ids": [enc.input_ids],
                                "attention_mask": [enc.attention_mask]})

enc2 = tokenizer(sentences_2, max_length=512, padding='max_length', truncation=True)

result2 = predef_sess.run(None, {"input_ids": [enc2.input_ids],
                                 "attention_mask": [enc2.attention_mask]})
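Because the model's `batch_size` and `sequence_length` dimensions were fixed to 1 and 512 in section 4.2, every call must pass exactly 512 token IDs. The plain-Python sketch below mirrors what `padding='max_length'` combined with truncation to `max_length` produces for a single sentence. It assumes the IDs already start with `[CLS]` and end with `[SEP]`, and that the padding ID is 0 (the case for BERT-style uncased vocabularies; check `tokenizer.pad_token_id` for other models). The helper name and the token IDs are illustrative only.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], max_length: int = 512,
                    pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly max_length long.

    token_ids is assumed to start with [CLS] and end with [SEP].
    """
    if len(token_ids) > max_length:
        # Keep the first max_length - 1 tokens, then re-attach the final [SEP]
        token_ids = token_ids[: max_length - 1] + token_ids[-1:]
    attention_mask = [1] * len(token_ids)
    n_pad = max_length - len(token_ids)
    # Padding positions get pad_id in input_ids and 0 in the attention mask
    return token_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


ids, mask = pad_or_truncate([101, 2129, 102], max_length=5)
print(ids)   # → [101, 2129, 102, 0, 0]
print(mask)  # → [1, 1, 1, 0, 0]
```

The attention mask is what tells the model to ignore the padded positions, which is why fixing the sequence length at 512 does not change the embedding of a short sentence.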

In [12]:
from sentence_transformers.util import cos_sim

print(cos_sim(result[0][0], result2[0][0]))

tensor([[0.9186]])


In [13]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

print(cos_sim(embeddings_1, embeddings_2))

tensor([[0.9186]])
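Both cells print the same score, which is a useful sanity check that the exported ONNX model matches the original. `cos_sim` computes cosine similarity, cos(a, b) = (a · b) / (‖a‖ ‖b‖). The plain-Python sketch below (not the `sentence_transformers` implementation) also shows why `normalize_embeddings=True` leaves the score unchanged: cosine similarity ignores vector length, so scaling both vectors to unit length first gives the same value.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


a, b = [1.0, 2.0, 3.0], [2.0, 1.0, 0.5]
print(cosine_similarity(a, b))
print(cosine_similarity(l2_normalize(a), l2_normalize(b)))  # same value, up to rounding
```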


<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>4.3 Deploy Model and Tokenizer</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the steps above, we checked that the model works correctly in ONNX format: it produces the same cosine similarity (0.9186) as the original SentenceTransformer model. Now we will deploy the model and tokenizer in the database.</p>

In [14]:
model_ids = ['bge-small-en-v1.5', 'bge-small-en-v1.5']
model_files = ['bge-small-en-v1.5-onnx/model_fixed.onnx', 'bge-small-en-v1.5-onnx/tokenizer.json']
table_names = ['embeddings_models', 'embeddings_tokenizers']

for model_id, model_file, table_name in zip(model_ids, model_files, table_names):
    # 'embeddings_models' -> 'model', 'embeddings_tokenizers' -> 'tokenizer'
    artifact = table_name.split('_')[1][:-1]
    try:
        save_byom(model_id=model_id, model_file=model_file, table_name=table_name)
    except Exception as e:
        # TDML_2200 is treated as "model ID already exists": offer to delete and rewrite
        if 'TDML_2200' in str(e.args):
            print(f"{artifact} already exists in the database")
            user_confirmation = input(f"Do you want to reload the {artifact} (y/n)?")
            if user_confirmation.lower() == 'y':
                delete_byom(model_id=model_id, table_name=table_name)
                save_byom(model_id=model_id, model_file=model_file, table_name=table_name)
        else:
            raise ValueError(f"Unable to save the {artifact} '{model_id}' in '{table_name}' due to the following error: {e}") from e

Created the model table 'embeddings_models' as it does not exist.
Model is saved.
Created the model table 'embeddings_tokenizers' as it does not exist.
Model is saved.


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Next Steps</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> We have now initialized the environment and loaded the model into Vantage, so the notebooks listed below can be executed.
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
       <li>Semantic Similarity: creates embeddings of CFPB complaints and uses TD_VECTORDISTANCE to find complaints that match a given theme or topic. <a href = './Semantic_Similarity_Python.ipynb'>Semantic Similarity </a></li> 
     <li>Semantic Clustering: creates embeddings of CFPB complaints, uses K-means to cluster them, and performs post-hoc explanation/topic detection on the semantic clusters found. <a href = './Semantic_Clustering_Python.ipynb'>Semantic Clustering </a></li> 
     <li>RAG Notebook for TD Catalog: dumps TD Catalog metadata into a table, creates embeddings of both the metadata and the language model prompt, runs a semantic similarity search for the top N chunks, and hands them off to an LLM to answer the prompt. <a href = './RAG_and_Bedrock_Querycatalogue.ipynb'>RAG and Bedrock to query Catalogue </a></li> 
     <li>RAG Notebook for SEC-10K PDF: parses and chunks a Teradata SEC-10K PDF, creates embeddings, and uses a language model to answer prompts. <a href = './RAG_and_Bedrock_QueryPDF.ipynb'>RAG and Bedrock to query PDF </a></li> 
      
</ul>
    </p>
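The Semantic Similarity and RAG notebooks listed above rank stored embeddings by their similarity to a query embedding and keep the top N. As a rough, plain-Python illustration of that idea (not how TD_VECTORDISTANCE is implemented), the sketch below returns the N most similar chunks. It assumes every embedding is already L2-normalized, so the dot product equals the cosine similarity; the helper name, chunk IDs, and vectors are made up.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_n_chunks(query: Sequence[float],
                 chunks: Dict[str, Sequence[float]],
                 n: int = 3) -> List[Tuple[str, float]]:
    """Return the n (chunk_id, score) pairs with the highest dot-product score.

    Assumes every vector is unit length, so the dot product is the cosine similarity.
    """
    scores = ((chunk_id, sum(q * c for q, c in zip(query, vec)))
              for chunk_id, vec in chunks.items())
    # nlargest avoids sorting every chunk when only the top n are needed
    return heapq.nlargest(n, scores, key=lambda pair: pair[1])


# Made-up, already-normalized 2-D "embeddings" for illustration
chunks = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.6, 0.8],
    "chunk_c": [0.0, 1.0],
}
print(top_n_chunks([0.8, 0.6], chunks, n=2))
```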

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>