# Teradata as Embeddings Storage: Semantic Search

## Introduction

Embeddings are revolutionizing the way we process and understand text data. These dense vector representations capture the semantic essence of words, phrases, and even entire documents, enabling machines to grasp nuanced meanings and relationships within the data. Imagine a high-dimensional space where each point represents a word, phrase, or document as a vector. The position of each point is determined by the context in which it appears, meaning similar concepts are located closer together.

For instance, the words "king" and "queen" would have vectors that are close to each other, as would "Paris" and "France." This spatial proximity allows embeddings to capture intricate relationships and patterns that traditional keyword-based methods miss.

Unlike traditional keyword-based search methods, embedding-based search leverages these rich representations to find relevant information based on context and meaning rather than exact word matches. This is where cosine similarity comes into play. By measuring the cosine of the angle between vectors, cosine similarity allows us to quantify how similar two embeddings are, effectively identifying the most relevant documents or texts. This technique is crucial in applications ranging from information retrieval and recommendation systems to natural language understanding, providing more accurate and meaningful search results.
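To make the measure concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up values chosen for illustration only; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (not produced by any model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
paris = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))  # close to 1: nearly the same direction
print(cosine_similarity(king, paris))  # much lower: different direction
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are roughly orthogonal.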

### Illustration of Embeddings

To visualize embeddings, imagine a 2D plot (though embeddings typically exist in much higher dimensions):

![alt text](img/embeddings_1.jpg "Embeddings: King and Queen, Paris and France")


In this illustration:
- "King" and "Queen" are close together, indicating they are semantically similar.
- "Paris" and "France" are also close together, showing a geographical relationship.

By using embeddings, we can better understand and search through our data in ways that are meaningful and contextually relevant.

## Approach

In this demo, we showcase an advanced approach to embedding-based search using the Teradata database. Our methodology involves several key steps:

1. **Importing and Converting Model**: We begin by importing pre-trained models from Hugging Face, which are renowned for their ability to capture semantic meanings in text data effectively. To enhance performance and ensure compatibility with various execution environments, we convert these Hugging Face models into the ONNX (Open Neural Network Exchange) format using the [`optimum`](https://github.com/huggingface/optimum) utility.

2. **Model Deployment to Database to be Used with BYOM**: Leveraging Teradata's BYOM (Bring Your Own Model) capability, we deploy the model directly within the Teradata database. This integration minimizes data movement and optimizes performance by keeping the model execution close to the data storage.

3. **In-Database Embedding Generation and Building the Embedding Store**: We execute the embedding generation process directly within the Teradata database. Each text entry in our knowledge base is processed to create its corresponding embedding vector, which is then stored in a structured repository for efficient retrieval.

4. **Semantic Search with Cosine Similarity**: Finally, we utilize Teradata’s functionality to calculate cosine similarity between a query embedding and the embeddings stored in the database. Cosine similarity, which measures the angle between two vectors, effectively determines their similarity. This enables us to perform semantic searches directly within the database, retrieving the most relevant results based on the meaning of the text rather than exact keyword matches.

The advantage of this approach is that the data never leaves the database. This ensures data security and compliance while reducing latency and improving efficiency, as all operations are performed close to where the data resides.

This approach combines state-of-the-art embedding models with Teradata's robust data management and processing capabilities, facilitating efficient and accurate semantic searches at scale.
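The search step can be sketched outside the database as a ranking of stored vectors by their cosine similarity to a query vector. The snippet below is an in-memory stand-in written for illustration, not the Teradata implementation; the `top_k` helper, the document IDs, and the vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, store, k=2):
    """Rank (doc_id, vector) pairs by similarity to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embedding store: (document id, embedding) pairs.
store = [
    (1, [0.9, 0.1, 0.0]),
    (2, [0.1, 0.9, 0.1]),
    (3, [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

for doc_id, score in top_k(query, store, k=2):
    print(doc_id, round(score, 4))
```

In the demo itself this ranking runs inside Teradata, so neither the stored vectors nor the texts have to be pulled into client memory.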


![alt text](img/embeddings_diagram.jpg "Teradata in-database Embedding Store")



## Part 1. Importing and Converting Model

We start by importing the pre-trained [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model from Hugging Face. It is a compact English text-embedding model that produces 384-dimensional sentence embeddings, as the export log in this section shows.

To enhance performance and ensure compatibility with various execution environments, we'll use the Optimum utility to convert the model into the ONNX (Open Neural Network Exchange) format.

In [1]:
! optimum-cli export onnx --opset 16 --trust-remote-code -m BAAI/bge-small-en-v1.5 bge-small-en-v1.5-onnx

Framework not specified. Using pt to export the model.
Automatic task detection to feature-extraction (possible synonyms are: default, image-feature-extraction, mask-generation, sentence-similarity).
Using the export variant default. Available variants are:
    - default: The default ONNX variant.

***** Exporting submodel 1/1: SentenceTransformer *****
  param_schemas = callee.param_schemas()
  param_schemas = callee.param_schemas()
Using framework PyTorch: 2.3.1+cu121
Overriding 1 configuration item(s)
	- use_cache -> False
Post-processing the exported models...
Deduplicating shared (tied) weights...

Validating ONNX model bge-small-en-v1.5-onnx/model.onnx...
	-[✓] ONNX model output names match reference model (token_embeddings, sentence_embedding)
	- Validating ONNX Model output "token_embeddings":
		-[✓] (2, 16, 384) matches (2, 16, 384)
		-[✓] all values close (atol: 1e-05)
	- Validating ONNX Model output "sentence_embedding":
		-[✓] (2, 384) matches (2, 384)
		-[✓] all values clo

In [1]:
import onnx
import onnxruntime as rt

import transformers
from onnxruntime.tools.onnx_model_utils import *

from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformer

import teradataml as tdml

import getpass

Once the model is converted, we check the ONNX model's correctness: we compute the cosine similarity between two texts with both native SentenceTransformers and ONNX Runtime, then compare the results.

If the two values agree, the ONNX model reproduces the native model's output and is suitable for further use in the database.
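Because floating-point outputs from two runtimes rarely match bit for bit, "agree" in practice means "equal within a tolerance". The helper below is a small sketch of such a check; the name `embeddings_match` is hypothetical, and the default tolerance of `1e-5` is borrowed from the `atol` value reported in the export log above.

```python
import math

def embeddings_match(a, b, atol=1e-5):
    """Return True if two vectors have equal length and element-wise
    absolute differences no larger than atol."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b))

# Made-up values standing in for the two runtimes' outputs.
native = [0.1234567, -0.7654321, 0.5]
onnx_out = [0.1234570, -0.7654318, 0.5]

print(embeddings_match(native, onnx_out))         # differences are about 3e-7
print(embeddings_match(native, [0.2, -0.7, 0.5])) # clearly different vectors
```

The same check can be applied to the full 384-dimensional sentence embeddings after converting them to plain lists.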

In [2]:
sentences_1 = 'How is the weather today?'
sentences_2 = 'What is the current weather like today?'

In [3]:
# Calculate ONNX result

tokenizer = transformers.AutoTokenizer.from_pretrained("./bge-small-en-v1.5-onnx")
predef_sess = rt.InferenceSession("bge-small-en-v1.5-onnx/model.onnx")

# Tokenize each sentence and pad it to the 512-token maximum
enc1 = tokenizer(sentences_1, max_length=512, padding='max_length')
# run() returns [token_embeddings, sentence_embedding]; index 1 is used in the comparison
embeddings_1_onnx = predef_sess.run(
    None,
    {"input_ids": [enc1.input_ids], "attention_mask": [enc1.attention_mask]},
)

enc2 = tokenizer(sentences_2, max_length=512, padding='max_length')
embeddings_2_onnx = predef_sess.run(
    None,
    {"input_ids": [enc2.input_ids], "attention_mask": [enc2.attention_mask]},
)

In [4]:
# Calculate native model result using SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)

In [5]:
# Compare results

print("Cosine similarity for embeddings calculated with ONNX:" + str(cos_sim(embeddings_1_onnx[1][0], embeddings_2_onnx[1][0])))
print("Cosine similarity for embeddings calculated with SentenceTransformer:" + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))

Cosine similarity for embeddings calculated with ONNX:tensor([[0.9186]])
Cosine similarity for embeddings calculated with SentenceTransformer:tensor([[0.9186]])


## Part 2. Model Deployment to Database to be Used with BYOM

In this section, we deploy the model to the Teradata database using the BYOM (Bring Your Own Model) capability. We use the `teradataml` Python library, which manages connectivity and provides a Python API similar to the PySpark and pandas DataFrame APIs.




### Opening Connection to Teradata

We start by setting up a connection to the Teradata database. The `teradataml` library handles all the intricacies of database connectivity, allowing us to interact with Teradata in a manner similar to working with data in pandas DataFrames.


In [2]:
# Prompt for the password instead of hard-coding it in the notebook
tdml.create_context(host='teradata', username='sasha', password=getpass.getpass('Teradata password: '))

Engine(teradatasql://sasha:***@teradata)

### Deploying the Model and Tokenizer

After establishing the connection, we deploy two key artifacts to the database:
1. The model itself, converted to ONNX format.
2. The `tokenizer.json` file, which will be used for in-database tokenization.

Both artifacts are deployed using the `save_byom` function, which abstracts the underlying complexity and makes the deployment process straightforward. Internally, this function performs an insert operation into the database.

By using the `save_byom` function, we ensure that our model and tokenizer are readily available within the Teradata database for subsequent embedding generation and semantic search operations. This integration minimizes data movement and optimizes performance by keeping all operations within the database environment.
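To illustrate what "an insert operation" means here, the sketch below stores a binary artifact under an identifier, using Python's built-in `sqlite3` as a stand-in for Teradata. This is not how `teradataml` is implemented; the `save_model_bytes` helper is hypothetical, and the two-column layout (`model_id`, `model`) is an assumption inferred from the column names used in this notebook's queries.

```python
import sqlite3

def save_model_bytes(conn, table, model_id, payload):
    """Insert one binary artifact into `table`, creating the table if needed."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (model_id TEXT PRIMARY KEY, model BLOB)"
    )
    conn.execute(
        f"INSERT INTO {table} (model_id, model) VALUES (?, ?)", (model_id, payload)
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
fake_model = b"\x08\x07fake-onnx-bytes"  # placeholder bytes, not a real ONNX file
save_model_bytes(conn, "embeddings_models", "bge-small-en-v1.5", fake_model)

row = conn.execute("SELECT model_id, length(model) FROM embeddings_models").fetchone()
print(row)
```

Keying each artifact by an identifier is what lets a later query select a specific model or tokenizer with a simple `where model_id = ...` filter.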

In [13]:
# Uncomment if the table already exists
# tdml.db_drop_table('embeddings_models')
tdml.save_byom('bge-small-en-v1.5',
               'bge-small-en-v1.5-onnx/model.onnx',
               'embeddings_models')

# Uncomment if the table already exists
# tdml.db_drop_table('embeddings_tokenizers')
tdml.save_byom('bge-small-en-v1.5',
               'bge-small-en-v1.5-onnx/tokenizer.json',
               'embeddings_tokenizers')

Created the model table 'embeddings_models' as it does not exist.
Model is saved.
Created the model table 'embeddings_tokenizers' as it does not exist.
Model is saved.


## Part 3. In-Database Embedding Generation and Building the Embedding Store

In this step, we take the email history and build the embedding store with a single SQL statement:


In [7]:
#UNCOMMENT IF TABLE EXISTS
#tdml.db_drop_table('emails_embeddings_store')

tdml.execute_sql("""

create table emails_embeddings_store as (
    select 
            *
    from mldb.ONNXEmbeddings(
            on emails.emails as InputTable
            on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') as ModelTable DIMENSION
            on (select model as tokenizer from embeddings_tokenizers where model_id = 'bge-small-en-v1.5') as TokenizerTable DIMENSION
       
            using
                Accumulate('id', 'txt') 
                ModelOutputTensor('sentence_embedding')
                EnableMemoryCheck('false')
                OutputFormat('FLOAT32(384)')
        ) a 
) with data

""")

TeradataCursor uRowsHandle=33 bClosed=False
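The resulting table holds one column per embedding dimension, named `emb_0` through `emb_383` in this notebook's output. The sketch below shows one way to reassemble those columns into a single vector on the client side; the helper names are hypothetical, and the row is filled with made-up values rather than real embeddings.

```python
EMBEDDING_DIM = 384  # matches OutputFormat('FLOAT32(384)') in the query above

def embedding_columns(dim=EMBEDDING_DIM):
    """Return the per-dimension column names: emb_0, emb_1, ..."""
    return [f"emb_{i}" for i in range(dim)]

def row_to_vector(row, dim=EMBEDDING_DIM):
    """Collect the embedding columns of one row (a name -> value mapping)
    into an ordered list of floats."""
    return [float(row[name]) for name in embedding_columns(dim)]

# Hypothetical row with placeholder values, shaped like a row of the store.
row = {"id": 1, "txt": "example text"}
row.update({f"emb_{i}": i / 1000 for i in range(EMBEDDING_DIM)})

vector = row_to_vector(row)
print(len(vector), vector[:3])
```

Building the column list programmatically avoids typing 384 names by hand when selecting or comparing embeddings.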

In [8]:
tdf_embeddings_store = tdml.DataFrame('emails_embeddings_store')
tdf_embeddings_store.head(3)

| id | txt | emb_0 | emb_1 | … | emb_383 |
|---|---|---|---|---|---|
| 3 | In my meeting with CFO Alan Turner on March 10th, we reviewed financial projections for the next quarter. Our goal is to raise $100,000 for expansion, targeting a 20% growth in revenue. | -0.0144999176263809 | -0.0240447968244552 | … | -0.0138335153460502 |
| 2 | During my visit to London last week on February 3rd, I attended a conference hosted by Tech Innovators Ltd. Our collaboration aims to launch a groundbreaking product, with a budget of $80,000 and an expected completion time of six months. | -0.0144568970426917 | -0.0194675549864768 | … | -0.0060263494960963 |
| 1 | I recently met with Dr. Emily Johnson in San Francisco on June 8th to discuss upcoming projects for our organization. We plan to allocate $50,000 for the initiative, targeting a 15% increase in efficiency by the end of Q3. | -0.0586574599146842 | -0.0224278662353754 | … | … |

(Columns `emb_2` through `emb_382` are omitted for readability.)
8115959,-0.0022306460887193,-0.0240192450582981,-0.0061245872639119,-0.0209132712334394,0.0131003567948937,0.0814143717288971,-0.0486371554434299,-0.034486386924982,0.0277280360460281,-0.068457581102848,0.0610152296721935,0.0118201998993754,0.0018715811893343,-0.0042450036853551,-0.0055849622003734,-0.0709163844585418,0.009191159158945,-0.0377275720238685,-0.0012981459731236,0.0401418432593345,-0.0234655421227216,-0.2470960319042205,0.0430722832679748,-0.0392128601670265,0.0048901420086622,-0.0285920538008213,0.028031088411808,0.0301187876611948,-0.007767355069518,0.0005140074645169,0.0240713451057672,-0.0250622145831584,0.0287028960883617,-0.0024316534399986,0.0143383340910077,0.0732508227229118,-0.0402397140860557,0.0324310325086116,0.040371835231781,-0.0078095691278576,-0.0072830067947506,0.0390789993107318,-0.0033813440240919,0.1280341148376464,-0.0185109283775091,0.0099267493933439,0.0201259665191173,-0.0621888414025306,0.0232616700232028,0.0832216516137123,-0.0469208918511867,0.0187665056437253,-0.0440079532563686,0.0094871325418353,-0.0269471630454063,0.0180363357067108,-0.042952511459589,-0.0227949079126119,0.0077916076406836,-0.0094371195882558,0.0112145207822322,0.032924760133028,-0.039412684738636,-0.0122034717351198,-0.007780211046338,0.0547618605196476,0.0192679781466722,0.0173456817865371,-0.0531641580164432,0.0113085759803652,0.0344415195286273,-0.0295180063694715,-0.0168979056179523,-0.000471792765893,-0.0297828800976276,0.0162956416606903,0.0364563204348087,-0.0237110909074544,-0.0033958996646106,0.0045187529176473,0.0078428862616419,-0.0534362234175205,-0.0330426022410392,-0.0483867414295673,0.0138827841728925,0.0067642778158187


By following these steps, we efficiently generate and store embeddings within the Teradata database, making them readily available for high-performance semantic search operations.


Building the embedding store directly within the Teradata database has several benefits:

- **Performance**: By generating and storing embeddings in-database, we reduce data movement and leverage Teradata’s powerful processing capabilities. This results in faster query execution and lower latency.

- **Scalability**: Teradata is designed to handle large-scale data. Embedding generation and storage within Teradata ensures that we can scale our operations to handle vast amounts of text data without compromising on performance.

- **Security**: Keeping data within the database ensures that sensitive information remains secure and complies with data governance policies. There is no need to move data to external systems for processing.

- **Integration**: Keeping the embedding store in Teradata allows seamless integration with existing data and applications. This enables more comprehensive data analysis and supports advanced use cases such as real-time semantic search and analytics.

By leveraging Teradata's robust infrastructure and advanced capabilities, we can build an efficient, secure, and scalable embedding store that enhances our ability to perform sophisticated text analysis and semantic search.

## Part 4. Semantic Search with Cosine Similarity

In this final step, we perform semantic search using cosine similarity within the Teradata database. We use the `TD_VectorDistance` function to compare the embeddings in our embedding store with the embedding of a given example. With `DistanceMeasure('cosine')`, the function returns the cosine distance, which the query below converts to a similarity as `1 - distance`. The function runs on Teradata's Massively Parallel Processing (MPP) architecture, so the computation is distributed across the system.

The `TD_VectorDistance` function computes the cosine distance between the query embedding (representing the given example) and each embedding in our embedding store. Cosine distance depends only on the angle between two vectors in the multi-dimensional space, not on their lengths, so the smallest distances (equivalently, the highest similarities) identify the emails that are most semantically similar to the example.
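
To make the relationship between cosine similarity and cosine distance concrete, here is a minimal pure-Python sketch. It is not how `TD_VectorDistance` is implemented; the helper names `cosine_similarity` and `cosine_distance` are illustrative, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Vectors pointing in the same direction: similarity close to 1, distance close to 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: similarity 0, distance 1.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
orthogonal_distance = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Note that scaling a vector does not change the result, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as identical in direction.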

In this case, we take the email with `id = 3` as the example and retrieve the three most semantically similar emails from the rest of the embedding store. This allows us to efficiently identify relevant content and extract valuable insights from our email dataset.

By using Teradata's in-database functions such as `TD_VectorDistance`, we can run semantic search at scale without moving the embeddings out of the database. This lets us analyze large volumes of text data and extract meaningful information from it, supporting data-driven decision-making.
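
The search itself can be pictured as a brute-force top-k lookup: score every reference vector against the target and keep the k best. The sketch below shows that idea in plain Python on toy data. It is an illustration only, under the assumption that a top-k setting keeps the k closest reference vectors per target; the function `top_k_similar` and the reference ids are hypothetical and do not come from Teradata.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(target, references, k=3):
    """Return the k (reference_id, similarity) pairs most similar to target.

    `references` maps a reference id to its vector. Results are ordered
    from most to least similar.
    """
    scored = [
        (ref_id, cosine_similarity(target, vector))
        for ref_id, vector in references.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2D "embeddings"; real embeddings in this article have 384 dimensions.
target = [1.0, 0.0]
references = {
    "a": [0.9, 0.1],   # almost the same direction as the target
    "b": [0.0, 1.0],   # perpendicular
    "c": [-1.0, 0.0],  # opposite direction
    "d": [0.5, 0.5],   # 45 degrees away
}

top2 = top_k_similar(target, references, k=2)
top2_ids = [ref_id for ref_id, _ in top2]
```

Running this scan in the database rather than on a client avoids pulling every stored embedding over the network.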


In [9]:
tdf_embeddings_store_tgt = tdf_embeddings_store[tdf_embeddings_store.id == 3]
tdf_embeddings_store_ref = tdf_embeddings_store[tdf_embeddings_store.id != 3]

In [10]:
# Compare the target email (id = 3) with every other email in the store.
# TD_VectorDistance returns a cosine distance; 1 - distance gives the similarity.
# The two %s placeholders are filled with the SQL behind each teradataml DataFrame.
tdml.DataFrame.from_query("""
SELECT
    dt.target_id,
    dt.reference_id,
    e_tgt.txt AS target_txt,
    e_ref.txt AS reference_txt,
    (1.0 - dt.distance) AS similarity
FROM
    TD_VECTORDISTANCE (
        ON (%s) AS TargetTable
        ON (%s) AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            TopK(3)
    ) AS dt
JOIN emails.emails e_tgt ON e_tgt.id = dt.target_id
JOIN emails.emails e_ref ON e_ref.id = dt.reference_id;
""" % (tdf_embeddings_store_tgt.show_query(), tdf_embeddings_store_ref.show_query()))



target_id,reference_id,target_txt,reference_txt,similarity
3,20,"In my meeting with CFO Alan Turner on March 10th, we reviewed financial projections for the next quarter. Our goal is to raise $100,000 for expansion, targeting a 20% growth in revenue.","In my meeting with CFO Sarah Davis in Sydney on August 15th, we reviewed financial projections for the next quarter. Our goal is to raise $150,000 for expansion, targeting a 25% growth in revenue.",0.9038906556512574
3,11,"In my meeting with CFO Alan Turner on March 10th, we reviewed financial projections for the next quarter. Our goal is to raise $100,000 for expansion, targeting a 20% growth in revenue.","In my recent meeting with CFO Richard Anderson on November 18th, we discussed budget allocations for the next fiscal year. We aim to invest $200,000 in research and development, anticipating a 18% increase in product innovation.",0.8303144841832985
3,16,"In my meeting with CFO Alan Turner on March 10th, we reviewed financial projections for the next quarter. Our goal is to raise $100,000 for expansion, targeting a 20% growth in revenue.","I had a meeting with CFO Rachel Miller in Singapore on April 20th to finalize financial projections. We re planning to invest $120,000 in expanding our market presence, aiming for a 15% increase in market share.",0.8600493790460071
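
The rows above are not returned in order of similarity. As a small sketch, the scores can be ranked on the client side once the result has been fetched. The tuples below copy the `target_id`, `reference_id`, and score values from the output above; the variable names are illustrative.

```python
# (target_id, reference_id, similarity score) taken from the result above.
rows = [
    (3, 20, 0.9038906556512574),
    (3, 11, 0.8303144841832985),
    (3, 16, 0.8600493790460071),
]

# Rank the reference emails from most to least similar to the target.
ranked = sorted(rows, key=lambda row: row[2], reverse=True)
ranked_reference_ids = [row[1] for row in ranked]
```

Email 20, which repeats the target's wording almost verbatim with different names and figures, ranks first, followed by emails 16 and 11.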


In [31]:
tdml.remove_context()

True