### Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.
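
To make the flow concrete, here is a minimal sketch of the retrieve-then-generate idea. Everything in it is a hypothetical stand-in: the three-line `knowledge_base`, the word-overlap scoring in `retrieve`, and the `build_prompt` template are illustrative assumptions, and no LLM is called. Real systems, including the rest of this notebook, use vector embeddings instead of word overlap.

```python
# Toy illustration of the RAG flow: retrieve first, then build the prompt.
# The knowledge base, scoring rule, and prompt template are hypothetical.

knowledge_base = [
    "Our support desk is open Monday to Friday, 9am to 5pm.",
    "Refunds are processed within 14 days of a return request.",
    "The Model X router supports Wi-Fi 6 and has four LAN ports.",
]

def tokenize(text):
    # Lowercase and strip basic punctuation so "refunds?" matches "Refunds"
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents):
    # Pick the document that shares the most words with the question
    question_words = tokenize(question)
    return max(documents, key=lambda doc: len(question_words & tokenize(doc)))

def build_prompt(question, context):
    # The retrieved text is placed in the prompt ahead of the question
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How long do refunds take?"
context = retrieve(question, knowledge_base)
prompt = build_prompt(question, context)
print(prompt)
```

The model itself is never retrained here; only the prompt changes, which is why RAG can be a cost-effective way to bring in outside knowledge.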

#### Vector embeddings
Let's begin with the fundamental building blocks of how retrieval works.

In [None]:
%pip install langchain_cohere -q
%pip install spacy -q
#ignore error

In [None]:
# now you need to run this in a terminal window
# python -m spacy download en_core_web_md
# now restart your kernel

Standard imports for the libraries we will be using in this notebook. Try to keep your imports in the first cell so that this code can more easily be converted into a Python program later.

In [None]:
import boto3
import pandas as pd
import json
import numpy as np
import pyarrow
from langchain.embeddings import BedrockEmbeddings
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_cohere import CohereEmbeddings

# Create the AWS client for the Bedrock runtime with boto3
aws_client = boto3.client(service_name="bedrock-runtime")

#### Let's define functions that use various embedding models to generate vector embeddings
spaCy

In [None]:
def generate_spacy_vector_embedding(text):
    embedder = SpacyEmbeddings(model_name="en_core_web_md")
    query_embedding = embedder.embed_query(text)

    return(np.array(query_embedding))

Hugging Face BERT

In [None]:
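def generate_huggingface_vector_embedding(text):
    embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
    # Embed the text that was passed in to the function
    embedded_text = embeddings.embed_query(text)
    # Return a NumPy array, like the other embedding helpers
    return(np.array(embedded_text))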

Cohere

In [None]:
def limit_string_size(x, max_chars=2048):
    # Check if the input is a string
    if isinstance(x, str):
        return x[:max_chars]
    else:
        return x

In [None]:
# send in an array size of one and only return the 0th element
def generate_cohere_vector_embedding(text_data):
    input_type = "clustering"
    truncate = "NONE" # optional
    model_id = "cohere.embed-english-v3" # or "cohere.embed-multilingual-v3"
    trunc_data = limit_string_size(text_data)
    # Create the JSON payload for the request
    json_params = {
            'texts': [trunc_data],
            'truncate': truncate, 
            "input_type": input_type
        }
    json_body = json.dumps(json_params)
    params = {'body': json_body, 'modelId': model_id,}
    
    # Invoke the model and parse the JSON response
    result = aws_client.invoke_model(**params)
    response = json.loads(result['body'].read().decode())
    return(np.array(response['embeddings'][0]))


Amazon Titan

In [None]:
# Let's generate a dense vector using Amazon Titan with LangChain
def generate_titan_vector_embedding(text):
    #create an Amazon Titan Text Embeddings client
    embeddings_client = BedrockEmbeddings(region_name="us-west-2") 

    #Invoke the model
    embedding = embeddings_client.embed_query(text)
    return(np.array(embedding))



This is the mathematical formula to calculate cosine similarity between 2 vectors.
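
Written out, with $\mathbf{v}_1$ and $\mathbf{v}_2$ standing for the two vectors of dimension $n$ (the symbol names are chosen here to mirror `vec1` and `vec2` in the code):

$$
\text{similarity}(\mathbf{v}_1, \mathbf{v}_2) = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert} = \frac{\sum_{i=1}^{n} v_{1,i} \, v_{2,i}}{\sqrt{\sum_{i=1}^{n} v_{1,i}^{2}} \; \sqrt{\sum_{i=1}^{n} v_{2,i}^{2}}}
$$

The result ranges from $-1$ (opposite directions) through $0$ (orthogonal) to $1$ (same direction), regardless of the vectors' lengths.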

In [None]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity
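
As a sanity check, the same calculation can be spelled out without NumPy so that each term is visible. This is only a sketch: the function name `cosine_similarity_by_hand` and the small test vectors are made up for illustration, and returning `0.0` for a zero-length vector is a choice made here (the similarity is mathematically undefined in that case).

```python
import math

def cosine_similarity_by_hand(vec1, vec2):
    # Dot product: multiply matching components and add them up
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    # Euclidean length (norm) of each vector
    norm_vec1 = math.sqrt(sum(a * a for a in vec1))
    norm_vec2 = math.sqrt(sum(b * b for b in vec2))
    if norm_vec1 == 0 or norm_vec2 == 0:
        # Undefined for a zero vector; 0.0 is an assumption of this sketch
        return 0.0
    return dot_product / (norm_vec1 * norm_vec2)

same_direction = cosine_similarity_by_hand([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity_by_hand([1, 0], [0, 1])
opposite = cosine_similarity_by_hand([1, 2], [-1, -2])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1. Floating-point rounding means the printed values may differ from these in the last decimal places.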



Now let's generate vectors for simple words and compare them using cosine similarity.

In [None]:
#Titan
king_vector = generate_titan_vector_embedding("King")
queen_vector = generate_titan_vector_embedding("Queen")
man_vector = generate_titan_vector_embedding("man")
woman_vector = generate_titan_vector_embedding("woman")
print(f"This embedding has {len(king_vector)} dimensions")
print(king_vector[:5])

In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine Similarity of Titan (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine Similarity of Titan King to Queen: {similarity:.4f}")
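
The intuition behind `king - man + woman` is easier to see with hand-made vectors. The sketch below uses invented 2-D values where the first axis loosely means "royalty" and the second "gender". These numbers are assumptions chosen for illustration and do not come from any embedding model; real embeddings have hundreds of dimensions and rarely line up this cleanly.

```python
import numpy as np

# Hand-made 2-D vectors: axis 0 is "royalty", axis 1 is "gender".
# Illustrative values only, not output from an embedding model.
king = np.array([1.0, 1.0])
queen = np.array([1.0, -1.0])
man = np.array([0.0, 1.0])
woman = np.array([0.0, -1.0])

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Remove the "man" direction from king, then add the "woman" direction
calculated_queen = king - man + woman

analogy_score = cosine_similarity(calculated_queen, queen)
direct_score = cosine_similarity(king, queen)
print(calculated_queen)
print(analogy_score, direct_score)
```

In this toy space the arithmetic lands exactly on the `queen` vector, so the analogy score is far higher than the direct king-to-queen score. With real models, compare the two printed scores from the cells above to see how much of that structure each model actually captures.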

In [None]:
#Cohere
king_vector = generate_cohere_vector_embedding('King')
queen_vector = generate_cohere_vector_embedding("Queen")
man_vector = generate_cohere_vector_embedding("man")
woman_vector = generate_cohere_vector_embedding("woman")
print(f"This embedding has {len(king_vector)} dimensions")
print(king_vector[:5])


In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine Similarity of Cohere (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine Similarity of Cohere King to Queen: {similarity:.4f}")

In [None]:
#Spacy
king_vector = generate_spacy_vector_embedding("King")
queen_vector = generate_spacy_vector_embedding("Queen")
man_vector = generate_spacy_vector_embedding("man")
woman_vector = generate_spacy_vector_embedding("woman")
print(f"This embedding has {len(king_vector)} dimensions")
print(king_vector[:5])

In [None]:
calculated_queen_vector = king_vector - man_vector + woman_vector

similarity = cosine_similarity(calculated_queen_vector, queen_vector)
print(f"Cosine Similarity of spaCy (King - man + woman) to Queen: {similarity:.4f}")

similarity = cosine_similarity(king_vector, queen_vector)
print(f"Cosine Similarity of spaCy King to Queen: {similarity:.4f}")

Let's examine other phrases

In [None]:
similarity = cosine_similarity(generate_titan_vector_embedding("cat"), generate_titan_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using Titan: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_cohere_vector_embedding("cat"), generate_cohere_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using Cohere: {similarity:.4f}")

In [None]:
similarity = cosine_similarity(generate_spacy_vector_embedding("cat"), generate_spacy_vector_embedding("book"))
print(f"Cosine Similarity of cat to book using spaCy: {similarity:.4f}")

Now let's look at longer sentences and see how larger, more complex models handle the same task.
Here are 2 sentences that are semantically similar but use different words and phrasing.

1. The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the 
 indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces 
 a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence.

2. The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human
 innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing
 to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the 
 city's vibrant identity.


In [None]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_spacy_vector_embedding(sentence1), generate_spacy_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using spaCy: {similarity:.4f}")

In [None]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_titan_vector_embedding(sentence1), generate_titan_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Titan: {similarity:.4f}")

In [None]:
sentence1 = "The majestic, towering skyscrapers, their gleaming windows reflecting the golden rays of the setting sun, stood as a testament to human ingenuity and the indomitable spirit of progress, while the bustling streets below teemed with life as people from all walks of life hurried to their destinations, their faces a mix of determination and weariness, yet each individual contributing to the vibrant tapestry of the city's existence."
sentence2 = "The awe-inspiring, colossal high-rises, their polished glass facades mirroring the warm, amber glow of the fading daylight, served as a powerful symbol of human innovation and the unyielding drive for advancement, as the lively thoroughfares beneath pulsed with energy, filled with individuals from diverse backgrounds rushing to their intended locations, their expressions an amalgamation of resolve and fatigue, yet all playing a vital role in the dynamic, intricate mosaic that shaped the city's vibrant identity."
similarity = cosine_similarity(generate_cohere_vector_embedding(sentence1), generate_cohere_vector_embedding(sentence2))
print(f"Cosine Similarity of S1 to S2 using Cohere: {similarity:.4f}")

#### Right model for the right job
Bigger models aren't necessarily more capable at every task. On the simple comparisons above, the spaCy results suggest it is the best model for this task. Let's switch to a more complex task.

#### Persistence of embeddings for later retrieval
In order to do semantic search and retrieve relevant content, we need to store that content for later use. We can store the embeddings in several different persistence technologies. To start simply, let's store the data in memory using a pandas dataframe.
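
As a small illustration of the idea, the sketch below keeps two made-up embeddings in a dataframe column and round-trips them through a pickle file. The titles, the 3-dimensional vectors, and the file name `embedded_example.pkl` are hypothetical; the column name `embedded_abstract` matches the one used later in this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical titles and 3-dimensional vectors, used only for illustration
store = pd.DataFrame({
    "title": ["Article A", "Article B"],
    "embedded_abstract": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Pickle keeps the NumPy arrays intact; CSV would write them out as strings
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embedded_example.pkl")
    store.to_pickle(path)
    restored = pd.read_pickle(path)

print(type(restored.loc[0, "embedded_abstract"]))
print(restored.loc[1, "embedded_abstract"])
```

Pickle is convenient for a notebook because each cell value comes back as a NumPy array ready for arithmetic, but only load pickle files from sources you trust.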

In [None]:
def clean_value(value):
    value_str = str(value)
    cleaned_value = ''.join(char for char in value_str if char.isalnum() or char.isspace())
    return cleaned_value

Using Kaggle, I looked for an interesting dataset that we can use to answer specific questions and whose content is unlikely to be in the large corpus of data the models were trained on. Let's put the data into a pandas dataframe and examine it.

In [None]:
# clean abstract text
df = pd.read_csv('data/latest_research_articles.csv')
df['abstract'] = df['abstract'].apply(clean_value)

df

In [None]:
# Let's save the dataframe in 4 separate parts so we can parallelize the embedding
# df1, df2, df3, df4 = np.array_split(df, 4)
# df1.to_pickle('data/non-embedded_df1.pkl')
# df2.to_pickle('data/non-embedded_df2.pkl')
# df3.to_pickle('data/non-embedded_df3.pkl')
# df4.to_pickle('data/non-embedded_df4.pkl')

Now let's create a new column that represents the article abstract as a vector embedding

In [None]:
# Generate embeddings using spaCy
# Add a new column 'embedded_abstract' by applying the function to an existing column
# This step takes a while so I did it for you and saved the output as pickle
#df['embedded_abstract'] = df['abstract'].apply(generate_spacy_vector_embedding)
#df.to_pickle('data/spacey_embedded.pkl')

# Easier way
dfs = pd.read_pickle('data/spacey_embedded.pkl')

In [None]:
# Also generated spaCy embeddings in parts so we can compare
# df1 = pd.read_pickle('data/embedded_spacey_df1.pkl')
# df2 = pd.read_pickle('data/embedded_spacey_df2.pkl')
# df3 = pd.read_pickle('data/embedded_spacey_df3.pkl')
# df4 = pd.read_pickle('data/embedded_spacey_df4.pkl')
# Combine the dataframes that have been embedded into one
#dfs = pd.concat([df1, df2, df3, df4])

### Retrieval from embedded sources
Now that we have a dataframe with embedded content of interest, we can use semantic similarity to retrieve the right data to feed to an LLM

In [None]:
# Let's set up a query that a user might ask
query = "What is the latest research for broken ribs in children"

#### spaCy Embeddings

In [None]:
# Let's search our records for a good semantic search
# To compare the vectors we need to embed the string we want to compare and convert it to a vector
query_vector = generate_spacy_vector_embedding(query)

# Search through each row, calculate cosine similarity, and store the results in a list of tuples
results = []
# Iterate over each row in the DataFrame
for index, row in dfs.iterrows():
    # Extract the value from the specified column
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))

# Sort the stored results from highest to lowest
results.sort(key=lambda x: x[1], reverse=True)
# Print the top 5 results with article title and similarity score
print("Here are a few articles that may match your interest:")
for index, score in results[:5]:
    # results stores DataFrame index labels, so look the row up with .loc
    article_title = dfs.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")
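
Looping over rows works well for a small dataframe. For larger collections, one common alternative is to stack the embeddings into a matrix and score every row at once. The sketch below shows that approach under stated assumptions: `top_k_by_cosine` is a hypothetical helper, and the 2-D vectors are invented so the result is easy to check by eye.

```python
import numpy as np

def top_k_by_cosine(embedding_matrix, query_vector, k=5):
    # One matrix-vector product gives every dot product at once;
    # dividing by the norms turns them into cosine similarities
    row_norms = np.linalg.norm(embedding_matrix, axis=1)
    query_norm = np.linalg.norm(query_vector)
    scores = (embedding_matrix @ query_vector) / (row_norms * query_norm)
    # argsort is ascending, so reverse it and keep the first k positions
    top_positions = np.argsort(scores)[::-1][:k]
    return [(int(pos), float(scores[pos])) for pos in top_positions]

# Hypothetical 2-D "embeddings", one row per article
embedding_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query_vector = np.array([1.0, 0.1])

top_matches = top_k_by_cosine(embedding_matrix, query_vector, k=2)
for position, score in top_matches:
    print(position, round(score, 4))
```

With the notebook's data, the matrix could be built with something like `np.vstack(dfs['embedded_abstract'].tolist())`. The returned values are then row positions (suitable for `.iloc`) and not index labels.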

#### Titan Embeddings

In [None]:
# Generate embeddings using Titan
# Add a new column 'embedded_abstract' by applying the function to an existing column
# This step takes a while so I did it for you and saved the output as pickle
#dft = df.copy()
#dft['embedded_abstract'] = dft['abstract'].apply(generate_titan_vector_embedding)
#dft.to_pickle('data/embedded_df.pkl')

# Easier way
dft = pd.read_pickle('data/embedded_df.pkl')

In [None]:
# Let's search our records for a good semantic search
query_vector = generate_titan_vector_embedding(query)

results = []
# Iterate over each row in the DataFrame
for index, row in dft.iterrows():
    # Extract the value from the specified column
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))
    #print (index, value)

results.sort(key=lambda x: x[1], reverse=True)
# Print the top 5 results with article title and similarity score
print("Here are a few articles that may match your interest:")
for index, score in results[:5]:
    # results stores DataFrame index labels, so look the row up with .loc
    article_title = dft.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")

#### Cohere Embeddings

Each of these models also has a [max input tokens limit](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html), which determines the largest amount of text we can embed in a single request.

[https://cohere.com/blog/introducing-embed-v3](https://cohere.com/blog/introducing-embed-v3)
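
The `limit_string_size` helper defined earlier handles the limit by cutting text off at a fixed number of characters, which discards everything after the cut. One alternative, sketched below, is to split long text into several pieces that each fit under the limit so every piece can be embedded separately. `split_into_chunks` is a hypothetical helper and not part of any library used in this notebook, and a character count is only a rough proxy for a token limit.

```python
def split_into_chunks(text, max_chars=2048):
    # Split on whitespace so words are not cut in half, then pack words
    # greedily into chunks of at most max_chars characters
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than max_chars is hard-truncated
            current = word[:max_chars]
    if current:
        chunks.append(current)
    return chunks

chunks = split_into_chunks("alpha beta gamma delta", max_chars=11)
print(chunks)
```

Each chunk would then get its own embedding and its own row in the dataframe, at the cost of more model calls per document.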

In [None]:
# Generate embeddings using Cohere
# Add a new column 'embedded_abstract' by applying the function to an existing column
# This step takes a while so I did it for you and saved the output as pickle
#df['embedded_abstract'] = df['abstract'].apply(generate_cohere_vector_embedding)
#df.to_pickle('data/cohere_embedded.pkl')

# Easier way
dfc = pd.read_pickle('data/cohere_embedded.pkl')

In [None]:
# Let's search our records for a good semantic search
query_vector = generate_cohere_vector_embedding(query)

results = []
# Iterate over each row in the DataFrame
for index, row in dfc.iterrows():
    # Extract the value from the specified column
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))
    #print (index, value)

results.sort(key=lambda x: x[1], reverse=True)
# Print the top 5 results with article title and similarity score
print("Here are a few articles that may match your interest:")
for index, score in results[:5]:
    # results stores DataFrame index labels, so look the row up with .loc
    article_title = dfc.loc[index, 'title']
    print(f"Title: '{article_title}' with a cosine match of: {score}")

#### Conclusion
For this task, Titan performed much better. There are many models out there and they are changing daily, so before you settle on a solution, take some time to test a few of them.

Here is a list of benchmarks for text embedding models; most models now have published benchmark results.

 [Massive Text Embedding Benchmark](https://huggingface.co/spaces/mteb/leaderboard)