# Custom Chatbot Project

This chatbot should answer questions about scientific news from the year 2024. The data source is the Wikipedia page 2024_in_science, which contains a short description of each scientific news item from that year. It's a good data source for this project because the LLM used includes only data from before 2022.

In [2]:
import json
import numpy as np
import openai
import os
import pandas as pd
import requests

from dateutil.parser import parse
from openai.embeddings_utils import get_embedding, distances_from_embeddings

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ.get('OPENAI_API_KEY')

## Data Wrangling


### Download 

Dataset is from Wikipedia: https://en.wikipedia.org/wiki/2024_in_science

In [3]:
title = "2024_in_science"
params = {
    "action": "query", 
    "prop": "extracts",
    "titles": title,
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
print("Full request URL:", resp.url)
response_dict = resp.json()

Full request URL: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=2024_in_science&explaintext=1&formatversion=2&format=json


The dataset contains headings that start with "==", empty lines, and a description of the data in the first line.

In [4]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict['query']['pages'][0]['extract'].split("\n")

file_name_unprocessed = "data/{}_unprocessed.csv".format(title)
df.to_csv(file_name_unprocessed)

print(df['text'].head(20))

0     The following scientific events occurred in 2024.
1                                                      
2                                                      
3                                          == Events ==
4                                                      
5                                                      
6                                       === January ===
7     2 January – The Japan Meteorological Agency (J...
8     3 January – The first functional semiconductor...
9     4 January – A review indicates digital rectal ...
10                                            5 January
11    Scientists report that newborn galaxies in the...
12    An analysis of sugar-sweetened beverage (SSB) ...
13                                            9 January
14    Scientists report studies that seem to support...
15    A group of scientists from around the globe ha...
16    Researchers have discovered a new phase of mat...
17    A study of proteins in cerebrospinal fluid

At the end of the data there are "See also" and "References" sections, which are not needed.

In [8]:
print(df['text'].tail(15))

342                                          27 December
343    A new technique for lifelike facial expression...
344    Carbon in outer space is shown to travel on a ...
345                                                     
346                                                     
347                                       == See also ==
348                                                     
349                                  2024 in spaceflight
350                              Category:Science events
351                           Category:Science timelines
352                        List of emerging technologies
353                             List of years in science
354                                                     
355                                                     
356                                     == References ==
Name: text, dtype: object


The unprocessed data is saved so that it doesn't have to be downloaded again every time the notebook is worked on; the cell below reloads it from the file.

In [72]:
# read the file earlier saved
df = pd.read_csv(file_name_unprocessed, index_col=0, keep_default_na=False)
print(df['text'].head(20))

0     The following scientific events occurred in 2024.
1                                                      
2                                                      
3                                          == Events ==
4                                                      
5                                                      
6                                       === January ===
7     2 January – The Japan Meteorological Agency (J...
8     3 January – The first functional semiconductor...
9     4 January – A review indicates digital rectal ...
10                                            5 January
11    Scientists report that newborn galaxies in the...
12    An analysis of sugar-sweetened beverage (SSB) ...
13                                            9 January
14    Scientists report studies that seem to support...
15    A group of scientists from around the globe ha...
16    Researchers have discovered a new phase of mat...
17    A study of proteins in cerebrospinal fluid
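
The cell above always reads the saved file. As a minimal sketch of how the download could be skipped automatically, the helper below returns the cached text if the file exists and only calls a fetch function otherwise. The name `load_or_fetch`, the stand-in `fake_fetch`, and the temporary path are assumptions for illustration and are not part of the notebook.

```python
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return the cached text from `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch()
    # Create the target directory if needed before writing the cache file
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

# Usage with a stand-in fetch function (hypothetical, no network access needed)
calls = []

def fake_fetch():
    calls.append("fetched")
    return "The following scientific events occurred in 2024."

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "data", "page.txt")
    first = load_or_fetch(cache_path, fake_fetch)   # file missing: calls fake_fetch
    second = load_or_fetch(cache_path, fake_fetch)  # file present: reads the cache
```

In the notebook, `fetch` could wrap the `requests.get` call from the download cell.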

### Preprocessing data

- remove empty lines
- remove all rows at the end, starting with the heading "See also"
- remove the first line which just explains this data is from 2024
- remove headings (they start with "==")
- add dates to every row with data and remove rows that only have a date
- add the year to all rows
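
The steps above can be sketched on a small made-up sample in plain Python. The sample lines are invented, and the `DATE_ONLY` regular expression is an assumption that stands in for the `dateutil` date check used in the notebook code.

```python
import re

# Toy lines mimicking the page extract (hypothetical sample, not real data)
lines = [
    "The following scientific events occurred in 2024.",
    "",
    "== Events ==",
    "=== January ===",
    "2 January – Event A.",
    "5 January",
    "Event B.",
    "Event C.",
    "== See also ==",
    "2024 in spaceflight",
]

# Assumption: a date-only line looks like "<day> <MonthName>"
DATE_ONLY = re.compile(r"^\d{1,2} [A-Z][a-z]+$")

def preprocess(lines, year="2024"):
    # Cut everything from the "See also" heading onwards
    if "== See also ==" in lines:
        lines = lines[:lines.index("== See also ==")]
    # Drop the introductory first line, empty lines and headings
    lines = [line for line in lines[1:] if line and "==" not in line]
    rows, prefix = [], ""
    for line in lines:
        if DATE_ONLY.match(line):
            prefix = line  # remember the date for the rows that follow
        elif " – " in line[:15]:
            rows.append(f"{year} {line}")  # row already starts with its date
        else:
            rows.append(f"{year} {prefix} – {line}")
    return rows

rows = preprocess(lines)
for row in rows:
    print(row)
```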

In [9]:
# remove empty lines
df = df[df['text'] != ""]

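# remove content after "See also", which is not relevant for the task
# (compare index labels, not positions: the labels have gaps after the empty lines were removed)
see_also_label = df.index[df['text'] == '== See also =='][0]
df = df[df.index < see_also_label]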

# remove first line
df = df.drop(df.index[0])

# remove headings
df = df[~df['text'].str.contains('==')]
df = df.reset_index(drop=True)

print(df['text'].head(20))

0     2 January – The Japan Meteorological Agency (J...
1     3 January – The first functional semiconductor...
2     4 January – A review indicates digital rectal ...
3                                             5 January
4     Scientists report that newborn galaxies in the...
5     An analysis of sugar-sweetened beverage (SSB) ...
6                                             9 January
7     Scientists report studies that seem to support...
8     A group of scientists from around the globe ha...
9     Researchers have discovered a new phase of mat...
10    A study of proteins in cerebrospinal fluid ind...
11    A study finds seaweed farming could be set up ...
12                                           10 January
13    Chemists report studies finding that long-chai...
14    Scientists report the extinction of Gigantopit...
15                                           11 January
16    Biologists report the discovery of the oldest ...
17    Scientists report the discovery of Tyranno

In [10]:
from dateutil.parser import parse

# a slightly modified version of the algorithm used in the course
# processes data so that every row has a corresponding date included

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
texts = []
for text in df["text"]:
    if " – " in text[0:15]:
        # The row already has the needed date prefix
        texts.append(text)
        continue
    try:
        # If the row's text is a date, set it as the new prefix
        parse(text)
        prefix = text
    except (ValueError, OverflowError):
        # If the row's text isn't a date, add the prefix
        texts.append(prefix + " – " + text)
# Rebuild the dataframe; date-only rows are dropped because they are never appended
df = pd.DataFrame({"text": texts})

df['text'] = '2024 ' + df['text']

print(df['text'].head(20))

0     2024 2 January – The Japan Meteorological Agen...
1     2024 3 January – The first functional semicond...
2     2024 4 January – A review indicates digital re...
3     2024 5 January – Scientists report that newbor...
4     2024 5 January – An analysis of sugar-sweetene...
5     2024 9 January – Scientists report studies tha...
6     2024 9 January – A group of scientists from ar...
7     2024 9 January – Researchers have discovered a...
8     2024 9 January – A study of proteins in cerebr...
9     2024 9 January – A study finds seaweed farming...
10    2024 10 January – Chemists report studies find...
11    2024 10 January – Scientists report the extinc...
12    2024 11 January – Biologists report the discov...
13    2024 11 January – Scientists report the discov...
14    2024 11 January – A study of the Caatinga regi...
15    2024 11 January – A graphene-based implant on ...
16    2024 11 January – A review of genetic data fro...
17    2024 11 January – The Upano Valley sites a

In [11]:
file_name_processed = "data/{}.csv".format(title)
df.to_csv(file_name_processed)

In [12]:
df = pd.read_csv(file_name_processed, index_col=0)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

file_name_embeddings = "data/{}_embeddings.csv".format(title)
df.to_csv(file_name_embeddings)

In [13]:
df = pd.read_csv(file_name_embeddings, index_col=0)
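# the CSV stores each embedding as a list literal; json.loads parses it without executing code
df["embeddings"] = df["embeddings"].apply(json.loads).apply(np.array)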
df["embeddings"][0][:10]

array([-0.00503582, -0.00736004, -0.02706596, -0.01150866, -0.0002343 ,
       -0.00426108, -0.01763161,  0.00950933, -0.01009663, -0.02619126])

## Custom Query Completion



### Create functions

The functions below encapsulate prompt generation and querying the model. They cover these steps:

- load our processed data with embeddings from file
- create embedding from question
- calculate distances to the embeddings of the data
- use the best fitting data rows as context for the prompt
- let the model process the prompt and return the answer
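
To illustrate the distance step, here is a minimal sketch of cosine-distance ranking in plain Python. The two-dimensional vectors are made up for illustration (real embeddings from the model are much longer), and `cosine_distance` and `rank_rows` are hypothetical helper names; the notebook itself relies on `distances_from_embeddings` for this.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_rows(question_vec, row_vecs):
    """Return row indices ordered from the closest to the farthest row."""
    distances = [cosine_distance(question_vec, vec) for vec in row_vecs]
    return sorted(range(len(row_vecs)), key=lambda i: distances[i])

# Made-up vectors standing in for the row embeddings and the question embedding
row_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
question_vec = [0.9, 0.1]

order = rank_rows(question_vec, row_vecs)
print(order)
```

The rows at the front of `order` are the ones that would be used as context for the prompt.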

In [14]:
title = "2024_in_science"
file_name_embeddings = "data/{}_embeddings.csv".format(title)
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def load_df(path):
    """Loads the dataframe including the embeddings from file"""
    # index_col=0 avoids an extra "Unnamed: 0" column from the saved index
    df = pd.read_csv(path, index_col=0)
    df["embeddings"] = df["embeddings"].apply(json.loads).apply(np.array)
    return df

def create_embeddings(question):
    """Creates the embeddings of a question"""
    return get_embedding(question, engine=EMBEDDING_MODEL_NAME)

def create_distances(df, question_embeddings):
    """Calculates the distances of a question's embedding to the rows in the dataframe"""
    return distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")

def create_prompt(question, context=None):
    """Creates a prompt from a question, if a context is given it adds the context to the prompt."""
    if context is not None:
        prompt_template = """
        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(context, question)
    else:
        prompt_template = """
        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(question)
    

def create_context(df):
    """Creates a context from a dataframe by including the best matching rows"""
    context = ""
    for row in df['text'].head(10):
        context += "###\n" + row + '\n'
    return context

def process_prompt(prompt):
    """Process the prompt with the model and return the answer if the call is successful.
    In case an error occurs, the error is printed and None is returned"""
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=150
        )
    # the pre-1.0 openai package used here keeps its exception classes in openai.error
    except openai.error.APIError as e:
        print(f"OpenAI API returned an API Error: {e}")
        return None
    except openai.error.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
        return None
    except openai.error.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
        return None

    answer = response["choices"][0]["text"].strip()
    return answer

def answer_question(question, enable_context=True, save_distances=False, print_prompt=False):
    """Generate a prompt from the given question, let the model process it and return the answer.

    Keyword arguments:
    question -- the question to ask the model
    enable_context -- if the context from the dataset should be added
    save_distances -- if the dataframe including the distances for the question should be saved
    print_prompt -- if the prompt should be printed (for debugging)
    """
    df = load_df(file_name_embeddings)
    embeddings_question = create_embeddings(question)
    df['distances'] = create_distances(df, embeddings_question)
    df = df.sort_values(by="distances")
    
    if save_distances:
        filename = "data/{}_question_{}.csv".format(title, question[0:20])
        df.to_csv(filename)

    context = None
    if (enable_context):
        context = create_context(df)
    
    prompt = create_prompt(question, context)

    if (print_prompt):
        print(prompt)
 
    answer = process_prompt(prompt)

    return answer



## Custom Performance Demonstration

Demonstration of the performance with embeddings. Two questions are put to the model. Each is first asked with a prompt that contains no data from our custom dataset, then with a prompt that contains the best matching rows of the dataset. The prompt is printed so that it can be checked that the correct prompt is used.

### Question 1

In [23]:
question = "Who reported the discovery of TOI-715 b? When was it announced?"
answer = answer_question(question, enable_context=False, print_prompt=True)
print("\n\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: Who reported the discovery of TOI-715 b? When was it announced?
        Answer:


Question: Who reported the discovery of TOI-715 b? When was it announced?
Answer: Mayor et al. made the discovery in 2021 and it was publicly announced on July 29, 2021.


In [24]:
answer = answer_question(question, enable_context=True, print_prompt=True)
print("\n\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
2024 31 January – NASA reports the discovery of a super-Earth called TOI-715 b, located in the habitable zone of a red dwarf star about 137 light-years away.
###
2024 12 August – An Earth-sized, ultra-short period exoplanet called TOI-6255b is found to be undergoing extreme tidal distortion, caused by the close proximity of its parent star. This has resulted in an egg-shaped planet, likely to be destroyed within 400 million years.
###
2024 15 May – Astronomers report an overview of preliminary analytical studies on returned samples of asteroid 101955 Bennu by the OSIRIS-REx mission.
###
2024 15 July – Scientists announce the discovery of a lunar cave, approximately 250 miles (400 km) from Apollo 11's landing site.
###
2024 11 October – Astronomers observe the "inside-out" growth of NGC 1549 by using the

#### Results for Question 1

- Model without context: Answered with hallucinated information that can't be verified on the internet.
- Model with context: Answered correctly with the information from the Wikipedia data.

### Question 2

In [22]:
question = "What is the luminous astronomical object ever discovered? When was it discovered?"
answer = answer_question(question, enable_context=False, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: What is the luminous astronomical object ever discovered? When was it discovered?
        Answer:

Question: What is the luminous astronomical object ever discovered? When was it discovered?
Answer: The most luminous astronomical object discovered so far is a quasar known as J1342+0928, discovered in 2017. However, it is possible that there are even more luminous objects yet to be discovered in the universe.


In [20]:
question = "What is the luminous astronomical object ever discovered? When was it discovered?"
answer = answer_question(question, enable_context=True, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
2024 19 February – Astronomers announce the most luminous object ever discovered, quasar QSO J0529-4351, located 12 billion light-years away in the constellation Pictor.
###
2024 18 September – The largest known pair of astrophysical jets is discovered within the radio galaxy Porphyrion, extending 23 million light-years from end to end. This surpasses Alcyoneus, the previous record holder at 16 million light-years.
###
2024 17 December – Zhúlóng ("Torch Dragon"), discovered by the James Webb Space Telescope, is reported as being the most distant known spiral galaxy ever found, seen as it appeared just 1.1 billion years after the Big Bang.
###
2024 11 July – Using the Hubble Space Telescope, scientists resolve the 3D velocity dispersion profile of a dwarf galaxy for the first time, helping to uncover its

#### Results for Question 2

- Model without context: Answered with outdated information or with "I don't know". The answer shown here may simply be outdated, since the quasar really is a very luminous object, though I couldn't verify that it was once the most luminous known object.
- Model with context: Answered correctly with the information from the Wikipedia data.

## Control question

The control question should show that the model can answer questions about topics which were known at training time.

In [26]:
question = "When has NASA declared the Mars rover Opportunity has ended its mission?"

answer = answer_question(question, enable_context=False, print_prompt=True)
print("\n\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: When has NASA declared the Mars rover Opportunity has ended its mission?
        Answer:


Question: When has NASA declared the Mars rover Opportunity has ended its mission?
Answer: NASA declared the Mars rover Opportunity's mission ended on February 13, 2019.


The answer is correct according to Wikipedia 2019_in_science.

## Chatbot

Chatbot for scientific news in 2024. It doesn't add the previous question-answer pair to the prompt, so every question is answered independently.

In [13]:
print("Chatbot for scientific news in 2024\n")
question = None

while True:
    question = input("Question: ")
    # stop before sending "exit" to the model as a question
    if question == "exit":
        break
    answer = answer_question(question, enable_context=True, print_prompt=False)
    print("\nQuestion: {}\nAnswer: {}".format(question, answer))

Chatbot for scientific news in 2024


Question: Was a new species discoverd in 2024?
Answer: Yes, many new species were discovered in 2024. Some examples include the new species of mussel named Vadumodiolus teredinicola, the new species of jellyfish named Santjordia pagesi, and the new species of giant snake, the northern green anaconda (Eunectes akayima).

Question: What events happened in 2024 with AI topics?
Answer: Promising innovations relating to global challenges are reported: LAION releases a first version of BUD-E, a fully open source voice assistant (8 Feb), Minesto's Dragon 12 underwater tidal kite turbines are demonstrated successfully, connected to the Faroe Island's power grid (11 Feb), rice grains as scaffolds containing cultured animal cells are demonstrated (14 Feb), an automatic waste sorting system (ZenRobotics 4.0) that can distinguish between over 500 waste categories is released (15 Feb), researchers describe an AI ecosystem interface of foundation models connecte
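
The loop above answers each question independently. As a rough sketch of how earlier turns could be carried into the prompt, the helper below prepends the most recent question-answer pairs. The function name `build_prompt_with_history`, the `max_turns` parameter, and the plain "Question/Answer" layout are assumptions for illustration; the notebook does not implement this.

```python
def build_prompt_with_history(question, history, max_turns=3):
    """Build a prompt that includes the last `max_turns` (question, answer) pairs.

    `history` is a list of (question, answer) tuples, oldest first.
    """
    parts = []
    for past_question, past_answer in history[-max_turns:]:
        parts.append(f"Question: {past_question}\nAnswer: {past_answer}")
    # The new question goes last, with an open "Answer:" for the model to complete
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Usage with a made-up previous turn
history = [("Was a new species discovered in 2024?", "Yes.")]
prompt = build_prompt_with_history("Which one?", history)
print(prompt)
```

Limiting the number of turns keeps the prompt short, which matters because the retrieved context already takes up part of the model's input.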

### Try the chatbot without context

This shows how the model answers without context.

In [23]:
print("Chatbot with the base model without context\n")
print("Please ask me questions. Stop the conversation with \"exit\"\n")
question = None

while True:
    question = input("Question: ")
    if question == "exit":
        break
    answer = answer_question(question, enable_context=False, print_prompt=False)
    print("\nQuestion: {}\nAnswer: {}".format(question, answer))

Chatbot with the base model without context

Please ask me questions. Stop the conversation with "exit"


Question: Who are you?
Answer: We are the *Bot Squad*; ask us anything!

Question: Are you GPT 3.5?
Answer: "(I'm) not GPT 3.5.

Question: Who are you then?
Answer: My name is GPT-3. I am OpenAI's latest artificial intelligence language model. I have been trained on a large dataset of texts and am capable of generating human-like text responses to various prompts. Is there anything else you would like to know?

Question: Who are you and when was your training finished?
Answer: I am AI agent assistant named Agent Assistant Martury. My training was finished on October/25/2019 by Dr Ma Cheng. 
        
        (Alternative that includes the year as well:)
        
        Answer: I am AI agent assistant named Agent Assistant Martury. My training was finished on October 25, 2019 by Dr Ma Cheng.

Question: Could it be that your answers are meant to be parsed?
Answer: I don't know

Quest

#### Reduced prompt with a kind of system prompt

In [27]:
print("Chatbot with the base model without context, reduced prompt\n")
print("Please ask me questions. Stop the conversation with \"exit\"\n")
question = None

while True:
    question = input("Question: ")
    if question == "exit":
        break
    prompt = "You are a helpful chatbot assistant and answer the questions carefully.\n\n======\n\nQuestion: " + question + "\n\nAnswer:"
    #print("Prompt: " + prompt + "\n")
    answer = process_prompt(prompt)
    
    print("\nQuestion: {}\nAnswer: {}".format(question, answer))

Chatbot with the base model without context, reduced prompt

Please ask me questions. Stop the conversation with "exit"


Question: Who are you?
Answer: I am an AI chatbot assistant designed to help answer questions and assist with tasks. I am always learning and improving to better serve my users. Is there something specific you would like to know or need help with?

Question: Is a hedgehog a mammal?
Answer: Yes, a hedgehog is a mammal. They are small, spiny animals with four legs, fur, and produce milk to feed their young. They also have specialized teeth and internal organs, which are defining characteristics of mammals.

Question: Who are you?
Answer: I am a chatbot assistant designed to help and provide answers to user's questions.

Question: Are you GPT 3.5?
Answer: No, I am not GPT 3.5. I am an AI chatbot trained to assist and answer questions to the best of my ability.

Question: Who are you and when was your training finished?
Answer: I am an AI chatbot designed to assist and 

#### Prompt without system prompt, just question and answer

In [26]:
print("Chatbot with the base model without context, mini prompt\n")
print("Please ask me questions. Stop the conversation with \"exit\"\n")
question = None

while True:
    question = input("Question: ")
    if question == "exit":
        break
    prompt = "Question: " + question + "\n\nAnswer:"
    #print("Prompt: " + prompt + "\n")
    answer = process_prompt(prompt)
    
    print("\nQuestion: {}\nAnswer: {}".format(question, answer))

Chatbot with the base model without context, mini prompt

Please ask me questions. Stop the conversation with "exit"


Question: Who are you and when was your training finished?
Answer: I am an AI digital assistant designed by OpenAI. My training is ongoing and will continue to improve and learn through interactions and feedback from users. There is no set date for when my training will be finished, as I am constantly evolving and adapting to new information and resources.

Question: Who are you and when was your training finished?
Answer: I am an AI digital assistant designed and created by a team of programmers and engineers. My training has been ongoing since my creation, but I was officially launched and made available for use on [insert launch date here].

Question: Who are you and when was your training finished?
Answer: I am an AI digital assistant designed and created by a team of developers. My training is an ongoing process as I constantly learn and adapt to new information a