# Custom Chatbot Project


This notebook uses retrieval-augmented generation (RAG) to answer questions about space missions in 2023 and 2024. The source data comes from the Wikipedia pages [2023 in spaceflight](https://en.wikipedia.org/wiki/2023_in_spaceflight) and [2024 in spaceflight](https://en.wikipedia.org/wiki/2024_in_spaceflight).


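This dataset is a good fit for RAG because it covers recent events: a model queried without this context may not know about missions that took place after its training data ends.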

## Data Wrangling

Task: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

Retrieving the data

In [1]:
import requests
import json

def retrieve_wiki_to_file(page_title: str):
    # Request the plain-text extract of the Wikipedia page named by page_title
    params = {
        "action": "query",
        "prop": "extracts",
        "exlimit": 1,
        "titles": page_title,
        "explaintext": 1,
        "formatversion": 2,
        "format": "json"
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, timeout=30)
    resp.raise_for_status()  # fail early on HTTP errors instead of saving an error payload
    response_dict = resp.json()
    with open(page_title + ".json", 'w', encoding='utf-8') as f:
        json.dump(response_dict,f)

retrieve_wiki_to_file("2023_in_spaceflight")
retrieve_wiki_to_file("2024_in_spaceflight")

In [2]:
#loading json
def load_json(filename):
    with open(filename, "r", encoding="utf-8") as f:
        json_data = json.load(f)
    return json_data

json_filename = "2023_in_spaceflight.json"
sp_2023_json = load_json(json_filename)
json_filename = "2024_in_spaceflight.json"
sp_2024_json = load_json(json_filename)

In [3]:
#cleaning data
def cleanup_data(original_data):
    cleaned_data = [item for item in original_data if len(item) > 0]
    cleaned_data = [item for item in cleaned_data if not (item.startswith("=") and item.endswith("="))]
    return cleaned_data
sp_2024_data = cleanup_data(sp_2024_json['query']['pages'][0]['extract'].split("\n"))
sp_2023_data = cleanup_data(sp_2023_json['query']['pages'][0]['extract'].split("\n"))
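
To make the cleaning step concrete, here is a minimal sketch that runs the same filter on a hand-written sample. The `sample_response` dictionary is hypothetical; it only mimics the `query` → `pages` → `extract` shape that the cell above reads, and its text is invented for illustration.

```python
# Hypothetical sample shaped like the saved API response
# (the notebook indexes "pages" as a list, so the sample does too).
sample_response = {
    "query": {
        "pages": [
            {
                "title": "Sample page",
                "extract": "Intro paragraph.\n\n== Overview ==\nBody paragraph.\n=== Details ===\n",
            }
        ]
    }
}

def cleanup_data(original_data):
    # Same filtering as the cell above: drop empty lines, then drop
    # section headings, which start and end with "=" in the plain-text extract
    cleaned_data = [item for item in original_data if len(item) > 0]
    return [
        item for item in cleaned_data
        if not (item.startswith("=") and item.endswith("="))
    ]

lines = sample_response["query"]["pages"][0]["extract"].split("\n")
sample_rows = cleanup_data(lines)
print(sample_rows)  # ['Intro paragraph.', 'Body paragraph.']
```

Each surviving line becomes one row of the `"text"` column, so one row corresponds roughly to one paragraph or list entry of the page.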

In [4]:
#creating the dataset
import pandas as pd
df = pd.DataFrame()
df["text"] = sp_2023_data + sp_2024_data

df.head()

| | text |
|---|---|
| 0 | The year 2023 saw rapid growth and significant... |
| 1 | In terms of national-level scientific space mi... |
| 2 | Two crewed space stations, the International S... |
| 3 | This year also saw the first time citizens of ... |
| 4 | European Space Agency's (ESA) Euclid satellite... |


## Custom Query Completion

Task: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [6]:
import os
os.environ["OPENAI_API_KEY"] = ""#TODO set key here

In [7]:
#importing openai and setting the client
import openai
import os
client = openai.OpenAI(
    base_url="https://openai.vocareum.com/v1",
    api_key=os.environ["OPENAI_API_KEY"]
)

In [8]:
# constants for openai
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
GPT_MODEL_NAME = "gpt-4o-mini"

Creating the embeddings

In [9]:
# creating the embeddings in batches, as in the case study
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = client.embeddings.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
#saving embeddings
df.to_csv("embeddings.csv")

In [None]:
#loading if needed
#import pandas as pd
#from ast import literal_eval
#df = pd.read_csv("embeddings.csv")
#df["embeddings"] = df["embeddings"].apply(literal_eval)
#df.head()
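
The `literal_eval` step in the commented-out cell is needed because a CSV file stores each embedding list as text. The sketch below illustrates that round trip with toy values. As an assumption made to keep it dependency-free, it uses the standard-library `csv` module as a stand-in for `to_csv` and `read_csv`; the rows and numbers are invented.

```python
import csv
import io
from ast import literal_eval

# Toy rows standing in for the "text" and "embeddings" columns (made-up values)
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second row", "embeddings": [0.0, 0.5, -0.5]},
]

# Writing: each list is stored as its string form, e.g. "[0.1, -0.2, 0.3]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
writer.writerows(rows)

# Reading: every cell comes back as a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["embeddings"]).__name__)  # str

# literal_eval parses the string back into a list of floats
for row in loaded:
    row["embeddings"] = literal_eval(row["embeddings"])

print(loaded[0]["embeddings"])  # [0.1, -0.2, 0.3]
```

Without the conversion, the distance computation further down would receive strings rather than numeric vectors.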

Function for sorting rows by similarity with the query

In [10]:
from scipy import spatial

def get_embedding(text: str):
    embedding_response = client.embeddings.create(
        input=text,
        model=EMBEDDING_MODEL_NAME
    )

    return embedding_response.data[0].embedding
    

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question = question.replace("\n", " ")
    question = question.replace("  ", " ")
    question_embeddings = get_embedding(question)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = df_copy["embeddings"].apply(
        lambda el_embeddings: spatial.distance.cosine(question_embeddings, el_embeddings)
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
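
As a small illustration of the ranking idea, the sketch below computes cosine distance by hand on made-up 2-D vectors and sorts in ascending order. The helper `cosine_distance` is a hypothetical pure-Python stand-in for `spatial.distance.cosine`, and the vectors are not real embeddings.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for the same direction, 1 for orthogonal
    # vectors, 2 for opposite directions
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up 2-D "embeddings" for a question and three candidate rows
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Ascending distance puts the most relevant row first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Vector length does not affect the result: `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]` because only the direction is compared.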

Function for creating the prompt

In [17]:
import tiktoken
import copy

PROMPT_MSGS_TEMPLATE = [
    {
        "role":"system", 
        "content":"""
You are a helpful assistant that will answer the question of the user with the information provided in the context. If the question can't be answered based on the context, say 'I don't know'."""
    },
    {
        "role":"user",
        "content": """
Context:
{}
_____"""
    },
    {
        "role":"user",
        "content":"""
Question:
{}"""
     }
]

def create_prompt_msgs(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return the list of chat messages to send to a chat completion model
    """
    tokenizer = tiktoken.encoding_for_model(GPT_MODEL_NAME)
    context_separator = "\n\n###\n\n"
    tokens_per_separator = len(tokenizer.encode(context_separator))

    
    #counting tokens for structure of message
    tokens_per_message = 4  # overhead tokens the chat format spends on structuring each message
    current_token_count = tokens_per_message*len(PROMPT_MSGS_TEMPLATE)
    current_token_count += 3  # overhead tokens the chat format spends on starting the reply
    for msg_el in PROMPT_MSGS_TEMPLATE:
        current_token_count += len(tokenizer.encode(msg_el["content"].format("")))
    current_token_count += len(tokenizer.encode(question))

    #putting the messages that fit in the context
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        text_token_count = len(tokenizer.encode(text))
        
        if text_token_count + current_token_count <= max_token_count:
            current_token_count += text_token_count
            context.append(text)
        else:
            break
            
        if current_token_count + tokens_per_separator <= max_token_count:
            current_token_count += tokens_per_separator
        else:
            break
    
    #creating the message
    prompt_msgs = copy.deepcopy(PROMPT_MSGS_TEMPLATE)
    prompt_msgs[1]["content"] = prompt_msgs[1]["content"].format(context_separator.join(context))
    prompt_msgs[2]["content"] = prompt_msgs[2]["content"].format(question)
    return prompt_msgs
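
The context-packing loop can be hard to follow inside the full function, so here is a stripped-down sketch of the same greedy budgeting. It rests on assumptions: `count_tokens` is a hypothetical whitespace-based stand-in for `len(tokenizer.encode(text))`, `pack_context` is an illustrative helper that does not exist in the notebook, and the texts and budget are invented.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); real token counts will differ
    return len(text.split())

def pack_context(ranked_texts, base_tokens, max_tokens, separator_tokens=1):
    """Greedily add texts, most relevant first, until the budget is exhausted."""
    used = base_tokens  # tokens already spent on the template and the question
    context = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # the next most relevant row no longer fits
        used += cost
        context.append(text)
        if used + separator_tokens > max_tokens:
            break  # no room left for another separator
        used += separator_tokens
    return context, used

ranked_texts = ["a b c", "d e f g", "h i", "j"]
context, used = pack_context(ranked_texts, base_tokens=2, max_tokens=12)
print(context, used)  # ['a b c', 'd e f g'] 11
```

Note that the loop stops at the first row that does not fit, even though the shorter `"j"` would still have fit. This mirrors the `break` in the function above and keeps the context limited to the top-ranked rows rather than filling it with less relevant ones.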

Function for answering questions.

In [12]:
def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI chat completion model
    
    If the model produces an error, return an empty string
    """

    prompt_msgs = create_prompt_msgs(question, df, max_prompt_tokens)

    try:
        response = client.chat.completions.create(
            model=GPT_MODEL_NAME,
            messages=prompt_msgs,
            max_tokens=max_answer_tokens
        )
        
        return response.choices[0].message.content
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

Task: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [13]:
#function for retrieving simple answer
def simple_answer(question, max_answer_tokens=150):
    """
    Given a question and a maximum number of answer tokens, return the
    answer to the question according to an OpenAI chat completion model
    
    If the model produces an error, return an empty string
    """
    simple_prompt_messages = [
        {"role": "system", "content": "You are an assistant that responds to the questions of the user."},
        {"role": "user", "content": question}
    ]

    try:
        response = client.chat.completions.create(
            model=GPT_MODEL_NAME,
            messages=simple_prompt_messages,
            max_tokens=max_answer_tokens
        )
        
        return response.choices[0].message.content
    except Exception as e:
        print(e)
        return ""

### Question 1

In [14]:
ingenuity_question = "When did the Ingenuity helicopter make its last flight on Mars?"


#### original answer

In [15]:
simple_ingenuity_answer = simple_answer(ingenuity_question)
print(simple_ingenuity_answer)

As of my last knowledge update in October 2023, the Ingenuity helicopter on Mars made its last recorded flight on September 6, 2022. However, please verify with the latest sources for any updates beyond that date.


#### custom answer

In [18]:
custom_ingenuity_answer = answer_question(ingenuity_question, df)
print(custom_ingenuity_answer)

The Ingenuity helicopter made its last flight on Mars on January 18, 2024.


### Question 2

In [19]:
satellite_question="Which is the heaviest geostationary satellite ever launched, and who launched it?"

#### original answer

In [20]:
simple_satellite_answer=simple_answer(satellite_question)
print(simple_satellite_answer)

As of October 2023, the heaviest geostationary satellite ever launched is the GSAT-19, which was launched by the Indian Space Research Organisation (ISRO) on June 5, 2017. GSAT-19 weighs approximately 3,136 kilograms (6,910 pounds).


#### custom answer

In [21]:
custom_satellite_answer = answer_question(satellite_question, df)
print(custom_satellite_answer)

The heaviest geostationary satellite ever launched is the Jupiter-3 (EchoStar-24), and it was launched by a Falcon Heavy rocket.
