# Question answering using Embeddings

This notebook tests a different approach to getting GPT to answer questions about data outside its training data: search over embeddings.
It follows the official documentation and the OpenAI Cookbook tutorial at https://cookbook.openai.com/examples/question_answering_using_embeddings

### 1. Imports

In [10]:
# imports
import ast  # for converting embeddings saved as strings back to arrays
import openai # for calling the OpenAI API
import pandas as pd  # for storing text and embeddings data
import tiktoken  # for counting tokens
import os # for getting API token from env variable OPENAI_API_KEY
from scipy import spatial  # for calculating vector similarities for search
from dotenv import load_dotenv
import json


### 2. Select models and initialize client

In [12]:
load_dotenv()
EMBEDDING_MODEL = "text-embedding-ada-002"  # the model to use for embeddings
GPT_MODEL = 'gpt-3.5-turbo'  # the model to use for text generation

# Load OpenAI API key
api_key = os.getenv('OPENAI_API_KEY')
# initialize client
client = openai.OpenAI(api_key=api_key)

### 3. Generate embeddings for sushi JSON

For this trial, we generate one embedding per element of the JSON file. For sushi.json, this means each restaurant entry is embedded separately.

In [19]:
# Load sushi data
with open('../chatbot-backend/sushi.json', 'r') as file:
    sushi_data = json.load(file)

def generate_embeddings_from_sushi_json(sushi_data: list) -> pd.DataFrame:
    embeddings = []
    for restaurant in sushi_data:
        # Convert the restaurant data to a JSON string
        # We could also create descriptive texts, but this is easier for now
        json_str = json.dumps(restaurant)
        
        # Generate embedding for the JSON string
        embedding = client.embeddings.create(input = json_str, model=EMBEDDING_MODEL).data[0].embedding
        
        # Append the embedding and associated restaurant title to the list
        embeddings.append({'Restaurant': restaurant['title'], 'Embedding': embedding})
    
    return pd.DataFrame(embeddings)

df_embeddings_sushi = generate_embeddings_from_sushi_json(sushi_data)



In [20]:
df_embeddings_sushi

| | Restaurant | Embedding |
|---|---|---|
| 0 | Sasou | [0.007590070366859436, 0.02465105429291725, -0... |
| 1 | Galeria Restaurant | [0.0037229920271784067, 0.01896066591143608, -... |
| 2 | Shaokao Asian Grill&Wine | [0.022152040153741837, 0.01048024371266365, -0... |
| 3 | Secret Garden | [0.01569896563887596, 0.009590079076588154, -0... |
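
Generating embeddings costs one API call per restaurant, so it can be worth caching the DataFrame on disk and reloading it later. The following is a minimal sketch of such a round trip, assuming a CSV file is used; the toy DataFrame and the in-memory buffer are illustrative stand-ins, not part of the original notebook. CSV stores each embedding list as a string, which is why `ast.literal_eval` is needed when reading it back.

```python
import ast
import io

import pandas as pd

# Toy stand-in for df_embeddings_sushi; real embeddings have far more dimensions.
df = pd.DataFrame(
    {
        "Restaurant": ["Restaurant A", "Restaurant B"],
        "Embedding": [[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]],
    }
)

# Write to an in-memory buffer; a real notebook would pass a file path instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False avoids an extra unnamed index column
buffer.seek(0)

# Each list was saved as its string representation, so parse it back into a list.
df_loaded = pd.read_csv(buffer, converters={"Embedding": ast.literal_eval})
```

After loading, `df_loaded["Embedding"]` holds Python lists again and can be passed to the search function in place of freshly generated embeddings.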


### 4. Search function (modified)

In [24]:
# search function
def ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 1
) -> tuple[list[str], list[float]]:
    """Returns the restaurant names and their relatedness scores, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["Restaurant"], relatedness_fn(query_embedding, row["Embedding"]))
        for _, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    # Keep the top_n pairs and split them into two lists, matching the return annotation
    top = strings_and_relatednesses[:top_n]
    return [s for s, _ in top], [r for _, r in top]

In [28]:
# use the ranked_by_relatedness function
strings, relatednesses = ranked_by_relatedness("I want a restaurant with a low price range", df_embeddings_sushi, top_n=1)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.761


'Sasou'

#### Quick Recap

The sushi.json file was embedded with an OpenAI embedding model via the API, producing one embedding per restaurant.
Queries are embedded with the same model, and the ranked_by_relatedness function then calculates the cosine similarity between the query embedding and each restaurant embedding.
The names of the top N restaurants are returned together with their relatedness scores.
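
The ranking step can be illustrated without any API calls. The sketch below uses hand-made two-dimensional vectors in place of real embeddings; the names `cosine_similarity`, `rank_offline`, and the toy items are hypothetical and exist only for this illustration. The value computed here is the same quantity as `1 - spatial.distance.cosine(x, y)` in the search function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_offline(query_vec, items, top_n=1):
    """Score every (name, vector) pair against the query and keep the top_n."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors standing in for restaurant embeddings (assumption, not real data).
toy_items = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [1.0, 1.0])]

# The query points almost exactly along the first axis, so "A" ranks first.
ranking = rank_offline([1.0, 0.1], toy_items, top_n=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the score depends only on the angle between vectors, it measures how similar two texts are overall, not how they compare on any single attribute.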

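For the sushi or parking use case, this is not the right approach.
For example, when asking for a budget restaurant, ranking by embedding relatedness does not necessarily return the entry with the lowest price range indicator, because the score reflects overall semantic similarity between the query and the serialized JSON rather than a comparison of the price field.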
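
For a question like this, comparing the structured field directly may be a better fit than embedding search. The following is a rough sketch of that alternative, not something tested in this notebook. Only the `title` key is known from the code above; the `price_range` key, its numeric scale, and the entries themselves are assumptions made up for illustration.

```python
# Hypothetical entries: "price_range" and its 1 (cheapest) to 4 (most expensive)
# scale are assumptions; the real sushi.json may encode price differently.
restaurants = [
    {"title": "Restaurant A", "price_range": 3},
    {"title": "Restaurant B", "price_range": 1},
    {"title": "Restaurant C", "price_range": 2},
]


def cheapest(entries, top_n=1):
    """Return the top_n entries with the lowest price_range value."""
    return sorted(entries, key=lambda entry: entry["price_range"])[:top_n]


for entry in cheapest(restaurants, top_n=2):
    print(entry["title"], entry["price_range"])
```

A deterministic sort like this always returns the lowest value of the field, which is the guarantee that the relatedness ranking cannot give.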