# Embedding-Based Retrieval with Deep Lake and OpenAI

Copyright 2024 Denis Rothman



# 1. Installing the environment

*First, run the following cells and restart the Google Colab session if prompted. Then run the notebook again, cell by cell, to explore the code.*

In [1]:
import deeplake



In [2]:
# The OpenAI key
import os
from dotenv import load_dotenv
import openai

# Load API Key
dotenv_path = 'D:/AdvancedR/knowbankedu/openai/.env'
load_dotenv(dotenv_path)
# OpenAI API Key
openai.api_key = os.getenv("OPENAI_API_KEY")
ACTIVELOOP_TOKEN = os.getenv('ACTIVELOOP_TOKEN')
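
Before calling either service, it can help to confirm that both keys were actually loaded from the `.env` file. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper introduced here for illustration, not part of `dotenv`, `openai`, or `deeplake`.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names in `names` that are unset or empty in `env`."""
    # Default to the process environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# The two variable names below are the ones read in the cell above
missing = missing_env_vars(["OPENAI_API_KEY", "ACTIVELOOP_TOKEN"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

An empty list means both variables are present and non-empty; it does not verify that the keys are valid.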

# Retrieval Augmented Generation

### Initiating the query process by indicating the location of the vector store

In [3]:
vector_store_path = "hub://zagamog/space_exploration_v1"

In [4]:
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import deeplake.util
ds = deeplake.load(vector_store_path)


This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/zagamog/space_exploration_v1




hub://zagamog/space_exploration_v1 loaded successfully.



[K[?25h

In [5]:
vector_store = VectorStore(path=vector_store_path)

Deep Lake Dataset in hub://zagamog/space_exploration_v1 already exists, loading from the storage


## Input and Query Retrieval

## Input

### Retrieval query

In [6]:
# Print all tensors in the dataset
print("Tensors in dataset:", list(ds.tensors.keys()))

Tensors in dataset: ['embedding_tensor', 'id', 'metadata', 'text']


In [7]:
def embedding_function(texts, model="text-embedding-3-small"):
   if isinstance(texts, str):
       texts = [texts]
   texts = [t.replace("\n", " ") for t in texts]
   return [data.embedding for data in openai.embeddings.create(input = texts, model=model).data]

In [8]:
def get_user_prompt():
    # Request user input for the search prompt
    return input("Enter your search query: ")

def search_query(prompt):
    # Assuming `vector_store` and `embedding_function` are already defined
    search_results = vector_store.search(embedding_data=prompt, embedding_function=embedding_function, embedding_tensor="embedding_tensor")
    return search_results

# Get the user's search query
#user_prompt = get_user_prompt()
# or enter prompt if it is in a queue
user_prompt="Tell me about space exploration on the Moon and Mars."

# Perform the search
search_results = search_query(user_prompt)

# Print the search results
print(search_results)

{'score': [0.5716322660446167, 0.5714114904403687, 0.5654380321502686, 0.561565637588501], 'id': ['chunk_79', 'chunk_121', 'chunk_119', 'chunk_42'], 'metadata': [{'source': 'Source URL: https://en.wikipedia.org/wiki/Exploration_of_Mars'}, {'source': ''}, {'source': ''}, {'source': 'Source URL: https://en.wikipedia.org/wiki/Exploration_of_Mars'}], 'text': ['udy of orbits to land on Mars and return to Earth (High School level) Planetary Society Mars page v t e Space exploration Benefits Future Topics Astronomy Deep space exploration Space colonization Space research Spaceflight Human Uncrewed Exploration targets Asteroids Comets Earth Moon Jupiter Mars Human mission Phobos Mercury Neptune Pluto Saturn Uranus Venus History List of spaceflight records Timeline of Solar System exploration Timeline of space exploration Space agencies CNSA CSA ESA ISRO JAXA NASA Roscosmos UAESA Category Outline v t e Mars Outline of Mars Geography Atmosphere Circulation Climate Dust devil tracks Methane Regio

In [9]:
print(user_prompt)

Tell me about space exploration on the Moon and Mars.


In [10]:
# Function to wrap text to a specified width
def wrap_text(text, width=80):
    lines = []
    while len(text) > width:
        split_index = text.rfind(' ', 0, width)
        if split_index == -1:
            split_index = width
        lines.append(text[:split_index])
        text = text[split_index:].strip()
    lines.append(text)
    return '\n'.join(lines)

In [11]:
import textwrap

# Assuming the search results are ordered with the top result first
top_score = search_results['score'][0]
top_text = search_results['text'][0].strip()
top_metadata = search_results['metadata'][0]['source']

# Print the top search result
print("Top Search Result:")
print(f"Score: {top_score}")
print(f"Source: {top_metadata}")
print("Text:")
print(wrap_text(top_text))

Top Search Result:
Score: 0.5716322660446167
Source: Source URL: https://en.wikipedia.org/wiki/Exploration_of_Mars
Text:
udy of orbits to land on Mars and return to Earth (High School level) Planetary
Society Mars page v t e Space exploration Benefits Future Topics Astronomy Deep
space exploration Space colonization Space research Spaceflight Human Uncrewed
Exploration targets Asteroids Comets Earth Moon Jupiter Mars Human mission
Phobos Mercury Neptune Pluto Saturn Uranus Venus History List of spaceflight
records Timeline of Solar System exploration Timeline of space exploration
Space agencies CNSA CSA ESA ISRO JAXA NASA Roscosmos UAESA Category Outline v t
e Mars Outline of Mars Geography Atmosphere Circulation Climate Dust devil
tracks Methane Regions Arabia Terra Cerberus (Mars) Cydonia Eridania Lake Iani
Chaos Olympia Undae Planum Australe Planum Boreum Quadrangles Sinus Meridiani
Tempe Terra Terra Cimmeria Terra Sabaea Tharsis Undae Ultimi Scopuli Vastitas
Borealis Physical featu
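
The cell above prints only the top hit. Since the printed `search_results` is a dictionary of parallel lists (`score`, `id`, `metadata`, `text`), every hit can be listed in rank order. This is a minimal sketch under that assumption; `format_results` and `sample_results` are hypothetical names, and the sample data is a stand-in rather than real retrieval output.

```python
def format_results(results, preview_chars=100):
    """Build one summary line per hit from a dict of parallel lists."""
    lines = []
    hits = zip(results["score"], results["id"], results["metadata"], results["text"])
    for rank, (score, chunk_id, metadata, text) in enumerate(hits, start=1):
        # Some chunks carry an empty source string, so fall back to a label
        source = metadata.get("source") or "unknown source"
        # Collapse runs of whitespace, then truncate to a short preview
        preview = " ".join(text.split())[:preview_chars]
        lines.append(f"{rank}. {chunk_id} score={score:.4f} [{source}] {preview}")
    return lines

# Hypothetical stand-in with the same keys as the printed search_results
sample_results = {
    "score": [0.57, 0.56],
    "id": ["chunk_a", "chunk_b"],
    "metadata": [{"source": "Source URL: https://example.org/a"}, {"source": ""}],
    "text": ["First   chunk of text.", "Second chunk of text."],
}
for line in format_results(sample_results):
    print(line)
```

Passing the notebook's own `search_results` instead of `sample_results` should list all four retrieved chunks, assuming the structure shown in the output above.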

## Augmented Input

In [12]:
augmented_input=user_prompt+" "+top_text

In [13]:
print(augmented_input)

Tell me about space exploration on the Moon and Mars. udy of orbits to land on Mars and return to Earth (High School level) Planetary Society Mars page v t e Space exploration Benefits Future Topics Astronomy Deep space exploration Space colonization Space research Spaceflight Human Uncrewed Exploration targets Asteroids Comets Earth Moon Jupiter Mars Human mission Phobos Mercury Neptune Pluto Saturn Uranus Venus History List of spaceflight records Timeline of Solar System exploration Timeline of space exploration Space agencies CNSA CSA ESA ISRO JAXA NASA Roscosmos UAESA Category Outline v t e Mars Outline of Mars Geography Atmosphere Circulation Climate Dust devil tracks Methane Regions Arabia Terra Cerberus (Mars) Cydonia Eridania Lake Iani Chaos Olympia Undae Planum Australe Planum Boreum Quadrangles Sinus Meridiani Tempe Terra Terra Cimmeria Terra Sabaea Tharsis Undae Ultimi Scopuli Vastitas Borealis Physical features "Canals" ( list ) Canyons Catenae Chaos terrain Craters Fossae 
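
The augmented input above appends only the top chunk, separated by a single space. One possible variation is to append several retrieved chunks with a visible separator. The sketch below assumes the chunks arrive as a list of strings, as in `search_results['text']`; `build_augmented_input`, `top_k`, and `separator` are hypothetical names chosen for illustration.

```python
def build_augmented_input(prompt, texts, top_k=2, separator="\n\n"):
    """Append the first `top_k` non-empty retrieved chunks to the user prompt."""
    chunks = [t.strip() for t in texts[:top_k] if t.strip()]
    return separator.join([prompt] + chunks)

example = build_augmented_input(
    "Tell me about space exploration on the Moon and Mars.",
    ["  first retrieved chunk ", "second retrieved chunk", "third retrieved chunk"],
)
print(example)
```

Adding more chunks gives the model more context but also lengthens the prompt, so `top_k` is a trade-off rather than a fixed recommendation.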

# Generation and output with the OpenAI reasoning model o1-mini

In [26]:
import openai
from openai import OpenAI
import time

client = OpenAI()
gpt_model="o1-mini"
start_time = time.time()  # Start timing before the request

def call_gpt4_with_full_text(itext):
    # Accept either a single string or a list of lines; joining a plain
    # string with '\n' would put every character on its own line
    text_input = itext if isinstance(itext, str) else '\n'.join(itext)
    prompt = f"Read the following text as a space exploration expert, then summarize or elaborate on the following content with as much explanation as possible and in different sections:\n{text_input}"


    try:
        response = client.chat.completions.create(
            model=gpt_model,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        return str(e)

gpt4_response = call_gpt4_with_full_text(augmented_input)

response_time = time.time() - start_time  # Measure response time
print(f"Response Time: {response_time:.2f} seconds")  # Print response time

print(gpt_model, "Response:", gpt4_response)

Response Time: 0.11 seconds
o1-mini Response: Error code: 403 - {'error': {'message': 'Project `proj_1JtpcJrXydlgaL4g69bUSu7n` does not have access to model `o1-mini`', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
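
Because the function above returns `str(e)` on failure, an error message travels through the rest of the notebook as if it were a model response. One way to keep the two cases apart is to return a status flag alongside the value. This is a minimal sketch, not the notebook's method; `call_with_status` and `fake_request` are hypothetical names, and `fake_request` stands in for the real API call so that the example runs without network access.

```python
def call_with_status(request_fn, *args, **kwargs):
    """Run `request_fn` and return (ok, value) instead of hiding errors in the text."""
    try:
        return True, request_fn(*args, **kwargs)
    except Exception as exc:  # broad on purpose, mirroring the cell above
        return False, f"{type(exc).__name__}: {exc}"

def fake_request(text):
    # Hypothetical stand-in for the API call; fails on empty input
    if not text:
        raise ValueError("empty input")
    return f"summary of {len(text)} characters"

ok, value = call_with_status(fake_request, "Moon and Mars")
failed_ok, failed_value = call_with_status(fake_request, "")
print(ok, value)
print(failed_ok, failed_value)
```

With a flag like this, later cells could skip formatting and evaluation when `ok` is `False`.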


### Formatted response

In [15]:
import textwrap
import re
from IPython.display import display, Markdown, HTML
import markdown

def print_formatted_response(response):
    # Check for markdown by looking for patterns like headers, bold, lists, etc.
    markdown_patterns = [
        r"^#+\s",           # Headers
        r"^\*+",            # Bullet points
        r"\*\*",            # Bold
        r"_",               # Italics
        r"\[.+\]\(.+\)",    # Links
        r"-\s",             # Dashes used for lists
        r"\`\`\`"           # Code blocks
    ]

    # If any pattern matches, assume the response is in markdown
    if any(re.search(pattern, response, re.MULTILINE) for pattern in markdown_patterns):
        # Markdown detected, convert to HTML for nicer display
        html_output = markdown.markdown(response)
        display(HTML(html_output))  # Use display(HTML()) to render HTML in Colab
    else:
        # No markdown detected, wrap and print as plain text
        wrapper = textwrap.TextWrapper(width=80)
        wrapped_text = wrapper.fill(text=response)

        print("Text Response:")
        print("--------------------")
        print(wrapped_text)
        print("--------------------\n")

print_formatted_response(gpt4_response)
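
Some of the patterns above are broad: `r"_"` matches any underscore and `r"-\s"` matches any hyphen followed by whitespace, so plain text can be classified as Markdown. The sketch below shows one stricter alternative that anchors list and header patterns to the start of a line. The pattern set and the name `looks_like_markdown` are illustrative assumptions, not a complete Markdown detector.

```python
import re

# Stricter variants of the patterns above; these are illustrative choices
STRICT_MARKDOWN_PATTERNS = [
    r"^#{1,6}\s+\S",             # ATX header at the start of a line
    r"^\s*[-*+]\s+\S",           # bullet list item at the start of a line
    r"^\s*\d+\.\s+\S",           # numbered list item at the start of a line
    r"\*\*[^*\n]+\*\*",          # bold span
    r"\[[^\]\n]+\]\([^)\n]+\)",  # inline link
    r"^```",                     # fenced code block
]

def looks_like_markdown(text):
    """Return True if any of the stricter Markdown patterns matches `text`."""
    return any(re.search(p, text, re.MULTILINE) for p in STRICT_MARKDOWN_PATTERNS)

print(looks_like_markdown("# Title\nBody"))
print(looks_like_markdown("snake_case and a well - spaced dash"))
```

Italics are deliberately left out here because single underscores and asterisks are common in plain text.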

# Evaluating the output with Cosine Similarity

With the initial user prompt:

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf[0:1], tfidf[1:2])
    return similarity[0][0]

similarity_score = calculate_cosine_similarity(user_prompt, gpt4_response)

print(f"Cosine Similarity Score: {similarity_score:.3f}")

Cosine Similarity Score: 0.000


With the augmented user prompt:

In [17]:
similarity_score = calculate_cosine_similarity(augmented_input, gpt4_response)

print(f"Cosine Similarity Score: {similarity_score:.3f}")

Cosine Similarity Score: 0.011
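
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so for non-negative term vectors it is 0 when the texts share no terms and 1 when the vectors point in the same direction. The sketch below computes it by hand on raw term counts. This is a simplification: it does not apply the TF-IDF weighting that `TfidfVectorizer` uses, so its values will generally differ from the scores above. `term_count_cosine` is a hypothetical name.

```python
import math
import re
from collections import Counter

def term_count_cosine(text1, text2):
    """Cosine similarity between raw term-count vectors of two texts."""
    counts1 = Counter(re.findall(r"\w+", text1.lower()))
    counts2 = Counter(re.findall(r"\w+", text2.lower()))
    # Only terms present in both texts contribute to the dot product
    dot = sum(counts1[term] * counts2[term] for term in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)

print(term_count_cosine("moon mars", "moon mars"))
print(term_count_cosine("moon mars", "error code"))
```

A score near zero therefore indicates that the two texts have almost no vocabulary in common, regardless of how long either text is.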


In [20]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

In [21]:
def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]


similarity_score = calculate_cosine_similarity_with_embeddings(augmented_input, gpt4_response)
print(f"Cosine Similarity Score: {similarity_score:.3f}")

Cosine Similarity Score: -0.050
