### Retrieval-Augmented Generation

#### Vector embeddings

In [1]:
%pip install langchain_cohere -q
%pip install spacy -q 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
botocore 1.34.51 requires urllib3<2.1,>=1.25.4; python_version >= "3.10", but you have urllib3 2.2.1 which is incompatible.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.4 which is incompatible.
transformers 4.31.0 requires tokenizers!=0.11.3,<0.14,>=0.11.1, but you have tokenizers 0.19.1 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [None]:
# now you need to run this in a terminal window
# python -m spacy download en_core_web_md
# now restart your kernel

Standard imports for the libraries we will be using in this notebook. Try to keep your imports in the first cell so that this code can more easily be converted into a Python program later.

In [2]:
import boto3
import pandas as pd
import json
import numpy as np
import pyarrow
from langchain.embeddings import BedrockEmbeddings
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_cohere import CohereEmbeddings

# Create the AWS client for the Bedrock runtime with boto3
aws_client = boto3.client(service_name="bedrock-runtime")

#### Let's define functions that use various embedding models so we can generate vector embeddings
spaCy

In [3]:
def generate_spacy_vector_embedding(text):
    embedder = SpacyEmbeddings(model_name="en_core_web_md")
    query_embedding = embedder.embed_query(text)

    return(np.array(query_embedding))

Hugging Face BERT

In [4]:
def generate_huggingface_vector_embedding(text):
    embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
    # Embed the text passed in (the original referenced an undefined name, text_data)
    embedded_text = embeddings.embed_query(text)
    return np.array(embedded_text)

Cohere

In [5]:
# send in an array size of one and only return the 0th element
def generate_cohere_vector_embedding(text_data):
    input_type = "clustering"
    truncate = "NONE" # optional
    model_id = "cohere.embed-english-v3" # or "cohere.embed-multilingual-v3"
    
    # Create the JSON payload for the request
    json_params = {
            'texts': [text_data],
            'truncate': truncate, 
            "input_type": input_type
        }
    json_body = json.dumps(json_params)
    params = {'body': json_body, 'modelId': model_id,}
    
    # Invoke the model and parse the JSON response
    result = aws_client.invoke_model(**params)
    response = json.loads(result['body'].read().decode())
    return(np.array(response['embeddings'][0]))


In [9]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity
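
As a quick sanity check of the function above, here is a minimal sketch using hand-made 2-D vectors rather than model output. Cosine similarity is the dot product divided by the product of the two vector lengths, so it should come out as 0 for perpendicular vectors, 1 for vectors pointing the same way, and -1 for opposite vectors. The vectors `a`, `b`, `c`, and `d` are made up for illustration.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector lengths
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hand-made vectors, not real embeddings
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])   # perpendicular to a
c = np.array([2.0, 0.0])   # same direction as a, different length
d = np.array([-1.0, 0.0])  # opposite direction to a

sim_perpendicular = cosine_similarity(a, b)
sim_same_direction = cosine_similarity(a, c)
sim_opposite = cosine_similarity(a, d)

print(sim_perpendicular, sim_same_direction, sim_opposite)
```

Note that `c` is twice as long as `a` yet still scores 1: cosine similarity measures direction only, which is why it is a common choice for comparing embeddings whose magnitudes differ between models.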



Amazon Titan

In [6]:
# Let's generate a dense vector using Amazon Titan with LangChain
def generate_titan_vector_embedding(text):
    #create an Amazon Titan Text Embeddings client
    embeddings_client = BedrockEmbeddings(region_name="us-west-2") 

    #Invoke the model
    embedding = embeddings_client.embed_query(text)
    return(np.array(embedding))



Now let's generate vectors for simple words and compare them using cosine similarity.

In [7]:
#Titan
king_vector = generate_titan_vector_embedding("King")
queen_vector = generate_titan_vector_embedding("Queen")
man_vector = generate_titan_vector_embedding("man")
woman_vector = generate_titan_vector_embedding("woman")
print(king_vector[:5])

[-0.25       -0.91015625  0.55078125  0.59375     0.5546875 ]


In [10]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity of Titan (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity of Titan King to Queen: {similarity:.4f}")

Cosine similarity of Titan (King - man + woman) to Queen: 0.6676
Cosine similarity of Titan King to Queen: 0.6291
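
To see why the `king - man + woman` arithmetic is expected to land near `queen`, here is a minimal sketch with hypothetical 2-D vectors. The two axes ("royalty" and "gender") and all values are invented for illustration; real embeddings have hundreds of dimensions and no such clean axes, which is why the real scores in this notebook are well below 1.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hypothetical toy embeddings: axis 0 = royalty, axis 1 = gender
toy_vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# Remove the "man" direction from king and add the "woman" direction
toy_calculated_queen = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

toy_analogy_similarity = cosine_similarity(toy_calculated_queen, toy_vectors["queen"])
toy_direct_similarity = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])

print(f"(king - man + woman) vs queen: {toy_analogy_similarity:.4f}")
print(f"king vs queen: {toy_direct_similarity:.4f}")
```

In this idealized setup the calculated vector equals the toy `queen` vector exactly, so the analogy score is higher than the direct king-to-queen score.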


In [11]:
# Input cohere for embedding 
king_vector = generate_cohere_vector_embedding('King')
queen_vector = generate_cohere_vector_embedding("Queen")
man_vector = generate_cohere_vector_embedding("man")
woman_vector = generate_cohere_vector_embedding("woman")
print(king_vector[:5])


[ 0.02391052 -0.02467346  0.02523804 -0.04049683 -0.06323242]


In [12]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity of Cohere (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity of Cohere King to Queen: {similarity:.4f}")

Cosine similarity of Cohere (King - man + woman) to Queen: 0.7226
Cosine similarity of Cohere King to Queen: 0.7328


In [13]:
#Spacy
king_vector = generate_spacy_vector_embedding("King")
queen_vector = generate_spacy_vector_embedding("Queen")
man_vector = generate_spacy_vector_embedding("man")
woman_vector = generate_spacy_vector_embedding("woman")
print(king_vector[:5])

[ 0.19343001 -4.19689989  4.81750011 -0.72863001  2.31769991]


In [14]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine similarity of spaCy (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine similarity of spaCy King to Queen: {similarity:.4f}")

Cosine similarity of spaCy (King - man + woman) to Queen: 0.6892
Cosine similarity of spaCy King to Queen: 0.6593


Let's examine a pair of unrelated words

In [15]:
similarity = cosine_similarity(generate_titan_vector_embedding("cat"), generate_titan_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using Titan: {similarity:.4f}")

Cosine Similarity of cat to book using Titan: 0.3514


In [16]:
similarity = cosine_similarity(generate_cohere_vector_embedding("cat"), generate_cohere_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using Cohere: {similarity:.4f}")

Cosine Similarity of cat to book using Cohere: 0.5575


In [17]:
similarity = cosine_similarity(generate_spacy_vector_embedding("cat"), generate_spacy_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using spaCy: {similarity:.4f}")

Cosine Similarity of cat to book using spaCy: 0.0696


Now let's look at longer sentences and see how larger, more complex models handle the same task.
Here are 2 sentences that are semantically similar but use different words and phrasing.

1. The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the 
 indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces 
 a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence.

2. The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human
 innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing
 to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the 
 city's vibrant identity.


In [18]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_spacy_vector_embedding(sentence1), generate_spacy_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using spaCy: {similarity:.4f}")

Cosine Similarity of S1 to S2 using spaCy: 0.9776


In [19]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_titan_vector_embedding(sentence1), generate_titan_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Titan: {similarity:.4f}")

Cosine Similarity of S1 to S2 using Titan: 0.8726


In [20]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_cohere_vector_embedding(sentence1), generate_cohere_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Cohere: {similarity:.4f}")

Cosine Similarity of S1 to S2 using Cohere: 0.8156


#### Right model for the right job
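Bigger models aren't necessarily more capable for every task. On the paraphrased sentence pair above, spaCy scored the highest similarity (0.9776, versus 0.8726 for Titan and 0.8156 for Cohere), which makes it look like the best model for that task. Let's switch to a more complex task.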

#### Persistence of embeddings for later retrieval
In order to do semantic search and retrieve relevant content, we need to store the embeddings of that content for later use. We can store the embeddings in several different persistence technologies. To start simply, let's store the data in memory using a pandas DataFrame.
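
The idea of keeping embeddings in a DataFrame and saving them for later can be sketched as follows. This is a minimal example under stated assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts vowels), and the file name and temporary directory are invented so the example runs anywhere.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: counts of each vowel
    return np.array([text.lower().count(vowel) for vowel in "aeiou"], dtype=float)

df_demo = pd.DataFrame({"abstract": ["rib fracture detection", "plasma turbulence"]})

# Store one NumPy vector per row in a new column
df_demo["embedded_abstract"] = df_demo["abstract"].apply(toy_embed)

# Round-trip through a pickle file in a temporary directory
pickle_path = os.path.join(tempfile.mkdtemp(), "toy_embedded.pkl")
df_demo.to_pickle(pickle_path)
restored = pd.read_pickle(pickle_path)

print(restored["embedded_abstract"].iloc[0])
```

Pickle keeps each cell as a NumPy array. Writing the same DataFrame to CSV would turn the arrays into text, so they would need to be parsed back into numbers after loading.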

In [42]:
def clean_value(value):
    value_str = str(value)
    cleaned_value = ''.join(char for char in value_str if char.isalnum() or char.isspace())
    return cleaned_value

In [50]:
def limit_string_size(x, max_chars=2048):
    # Check if the input is a string
    if isinstance(x, str):
        return x[:max_chars]
    else:
        return x

In [51]:
# clean abstract text
df = pd.read_csv('data/latest_research_articles.csv')
df['abstract'] = df['abstract'].apply(clean_value)
# for cohere we need to limit the string size to 2048
df['abstract'] = df['abstract'].apply(limit_string_size)
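
To make the effect of the two cleaning helpers concrete, here is a small sketch that applies them to a made-up string. The functions are repeated (in condensed form) so the block runs on its own; the sample text and the `max_chars=11` value are chosen only for illustration.

```python
def clean_value(value):
    # Keep only letters, digits, and whitespace
    value_str = str(value)
    return ''.join(char for char in value_str if char.isalnum() or char.isspace())

def limit_string_size(x, max_chars=2048):
    # Truncate strings; pass any other type through unchanged
    return x[:max_chars] if isinstance(x, str) else x

sample = "Rib-fracture detection (n=12)!"
cleaned = clean_value(sample)
truncated = limit_string_size(cleaned, max_chars=11)
missing = clean_value(float("nan"))

print(repr(cleaned))
print(repr(truncated))
print(repr(missing))
```

Two side effects are worth knowing about. Removing punctuation joins hyphenated words ("Rib-fracture" becomes "Ribfracture"), and because `clean_value` calls `str()` first, a missing abstract becomes the literal string "nan" and will still be embedded.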

In [44]:
# Let's save the dataframe in 4 separate parts so we can parallelize the embedding
df1, df2, df3, df4 = np.array_split(df, 4)
df1.to_pickle('data/non-embedded_df1.pkl')
df2.to_pickle('data/non-embedded_df2.pkl')
df3.to_pickle('data/non-embedded_df3.pkl')
df4.to_pickle('data/non-embedded_df4.pkl')

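
The notebook does not show how the four parts were embedded in parallel. One possible approach, sketched below, is a thread pool, which suits embedding calls that spend most of their time waiting on a network API. This is an assumption, not the method used to produce the saved pickles. `toy_embed` and `embed_chunk` are hypothetical names, and the split is done on row positions rather than on the DataFrame itself.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def toy_embed(text):
    # Hypothetical stand-in for a slow embedding API call
    return np.array([len(text), text.count(" ")], dtype=float)

def embed_chunk(chunk):
    # Work on a copy so the original DataFrame slice is not modified
    chunk = chunk.copy()
    chunk["embedded_abstract"] = chunk["abstract"].apply(toy_embed)
    return chunk

df_demo = pd.DataFrame({"abstract": [f"abstract number {i}" for i in range(10)]})

# Split row positions into 4 roughly equal groups, then slice the DataFrame
position_groups = np.array_split(np.arange(len(df_demo)), 4)
chunks = [df_demo.iloc[positions] for positions in position_groups]

# pool.map returns results in the same order as the input chunks
with ThreadPoolExecutor(max_workers=4) as pool:
    embedded_chunks = list(pool.map(embed_chunk, chunks))

combined = pd.concat(embedded_chunks)
print(combined.shape)
```

Because each chunk keeps its original index labels, the combined DataFrame lines up row for row with the one that was split.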


In [53]:
# Generate embeddings using Cohere
# Add a new column 'embedded_abstract' by applying the function to an existing column
df['embedded_abstract'] = df['abstract'].apply(generate_cohere_vector_embedding)

# Embedding takes a while, so the output was generated ahead of time and saved as a pickle.
# The Titan search further down reads from dft, so load it here.
dft = pd.read_pickle('data/embedded_df.pkl')

In [54]:
df.to_pickle('data/cohere_embedded.pkl')

In [45]:
# spaCy embeddings were also generated ahead of time so we can compare
df1 = pd.read_pickle('data/embedded_spacey_df1.pkl')
df2 = pd.read_pickle('data/embedded_spacey_df2.pkl')
df3 = pd.read_pickle('data/embedded_spacey_df3.pkl')
df4 = pd.read_pickle('data/embedded_spacey_df4.pkl')


In [46]:
# Combine the four embedded dataframes into one
dfs = pd.concat([df1, df2, df3, df4])

### Retrieval from embedded sources
Now that we have a dataframe with embedded content of interest, we can use semantic similarity to retrieve the right data to feed to an LLM.
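
As a compact illustration of retrieval by semantic similarity, the sketch below scores every row against a query vector in one NumPy operation and returns the top matches. It is a minimal example under stated assumptions: `top_k_matches` is a hypothetical helper, and the titles and 2-D vectors are invented stand-ins for real article embeddings.

```python
import numpy as np
import pandas as pd

def top_k_matches(frame, query_vector, k=5, column="embedded_abstract"):
    # Stack the per-row vectors into one (rows x dimensions) matrix
    matrix = np.vstack(frame[column].tolist())
    # Cosine similarity of every row against the query in one step
    scores = (matrix @ query_vector) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector)
    )
    ranked = frame.assign(score=scores).sort_values("score", ascending=False)
    return ranked.head(k)

# Invented toy data, not real embeddings
toy_articles = pd.DataFrame({
    "title": ["ribs", "plasma", "femur"],
    "embedded_abstract": [
        np.array([1.0, 0.0]),
        np.array([0.0, 1.0]),
        np.array([1.0, 1.0]),
    ],
})

toy_query_vector = np.array([1.0, 0.1])
top_matches = top_k_matches(toy_articles, toy_query_vector, k=2)
print(top_matches[["title", "score"]])
```

A row-by-row loop gives the same ranking; the matrix form is simply faster on large DataFrames and keeps the score next to each title.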

In [56]:
# Let's set up a query that a user might ask
query = "What is the latest research for broken ribs in children"

#### spaCy Embeddings

In [34]:
# Search our records for the closest semantic matches
query_vector = generate_spacy_vector_embedding(query)

results = []
# Score every article embedding against the query
for index, row in dfs.iterrows():
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Highest similarity first
results.sort(key=lambda x: x[1], reverse=True)

print("Here are a few articles that may match your interest:")
# Show the top 5; iterrows yields index labels, so look the rows up with .loc
for index, score in results[:5]:
    article_title = dfs.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")

Here are a few articles that may match your interest:
Title: 'In situ evidence of magnetic reconnection in turbulent plasma' with a cosine match of: 0.8912633568632342
Title: 'Nodal s± pairing symmetry in an iron-based superconductor with only hole pockets' with a cosine match of: 0.8862041687900892
Title: 'A finite-temperature phase transition for disordered weakly interacting bosons in one dimension' with a cosine match of: 0.8836382547782047
Title: 'Enhanced production of multi-strange hadrons in high-multiplicity proton–proton collisions' with a cosine match of: 0.8823746488702473
Title: 'Destabilization of hidden order in URu2Si2 under magnetic field and pressure' with a cosine match of: 0.8814906267428417


#### Titan Embeddings

In [58]:
# Search our records for the closest semantic matches
query_vector = generate_titan_vector_embedding(query)

results = []
# Score every article embedding against the query
for index, row in dft.iterrows():
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Highest similarity first
results.sort(key=lambda x: x[1], reverse=True)

print("Here are a few articles that may match your interest:")
# Show the top 5; iterrows yields index labels, so look the rows up with .loc
for index, score in results[:5]:
    article_title = dft.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")

Here are a few articles that may match your interest:
Title: 'High sensitivity methods for automated rib fracture detection in pediatric radiographs' with a cosine match of: 0.46759500523692665
Title: 'Magnetic resonance imaging based finite element modelling of the proximal femur: a short-term in vivo precision study' with a cosine match of: 0.23057253235100217
Title: 'On the crashworthiness analysis of bio-inspired DNA tubes' with a cosine match of: 0.21674978237483608
Title: 'Reproduction of forearm rotation dynamic using intensity-based biplane 2D–3D registration matching method' with a cosine match of: 0.19907903320363604
Title: 'Propagation of extended fractures by local nucleation and rapid transverse expansion of crack-front distortion' with a cosine match of: 0.19056102046718387


#### Cohere Embeddings

In [57]:
# Search our records for the closest semantic matches
query_vector = generate_cohere_vector_embedding(query)

results = []
# Score every article embedding against the query
for index, row in df.iterrows():
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Highest similarity first
results.sort(key=lambda x: x[1], reverse=True)

print("Here are a few articles that may match your interest:")
# Show the top 5; iterrows yields index labels, so look the rows up with .loc
for index, score in results[:5]:
    article_title = df.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")

Here are a few articles that may match your interest:
Title: 'High sensitivity methods for automated rib fracture detection in pediatric radiographs' with a cosine match of: 0.5573478086622601
Title: 'Magnetic resonance imaging based finite element modelling of the proximal femur: a short-term in vivo precision study' with a cosine match of: 0.37145634366738145
Title: 'Non-invasive biomarkers for detecting progression toward hypovolemic cardiovascular instability in a lower body negative pressure model' with a cosine match of: 0.3522809783290321
Title: 'A machine learning-based multiscale model to predict bone formation in scaffolds' with a cosine match of: 0.35033143216672435
Title: 'Application of machine learning algorithms for accurate determination of bilirubin level on in vitro engineered tissue phantom images' with a cosine match of: 0.3347026832383261


#### Conclusion
For this task, Titan performed much better than spaCy: it ranked the pediatric rib fracture article first, well ahead of the next result (0.468 versus 0.231), while spaCy returned unrelated physics papers. Cohere also ranked that article first. There are many models out there, and they change daily, so before you commit to one, take some time to test a few on your own problem.

Here is a leaderboard of benchmarks for text embedding models; most models now publish benchmark results.

 [Massive Text Embedding Benchmark](https://huggingface.co/spaces/mteb/leaderboard)