# Custom Chatbot Project

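This project uses the Wikipedia page "2024 in science" as its dataset, retrieved as plain text through the MediaWiki API. The page lists dated scientific events from 2024. As the demonstration below shows, the completion model (`gpt-3.5-turbo-instruct`) cannot answer questions about these events reliably without extra context. That makes the page a good fit for showing how retrieved context improves answers, and its one-event-per-line layout splits naturally into short text samples for embedding.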

In [5]:
import json
import numpy as np
import openai
import os
import pandas as pd
import requests

from dateutil.parser import parse
from openai.embeddings_utils import get_embedding, distances_from_embeddings

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ.get('OPENAI_API_KEY')

## Data Wrangling


### Download dataset

In [2]:
title = "2024_in_science"
params = {
    "action": "query", 
    "prop": "extracts",
    "titles": title,
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
print("Full request URL:", resp.url)
response_dict = resp.json()
response_dict

Full request URL: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=2024_in_science&explaintext=1&formatversion=2&format=json


{'batchcomplete': True,
 'query': {'normalized': [{'fromencoded': False,
    'from': '2024_in_science',
    'to': '2024 in science'}],
  'pages': [{'pageid': 74041015,
    'ns': 0,
    'title': '2024 in science',

In [3]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict['query']['pages'][0]['extract'].split("\n")

# Make sure the output directory exists before writing
os.makedirs("data", exist_ok=True)
file_name_unprocessed = "data/{}_unprocessed.csv".format(title)
df.to_csv(file_name_unprocessed)

print(df['text'].head(20))

0     The following scientific events occurred in 2024.
1                                                      
2                                                      
3                                          == Events ==
4                                                      
5                                                      
6                                       === January ===
7     2 January – The Japan Meteorological Agency (J...
8     3 January – The first functional semiconductor...
9     4 January – A review indicates digital rectal ...
10                                            5 January
11    Scientists report that newborn galaxies in the...
12    An analysis of sugar-sweetened beverage (SSB) ...
13                                            9 January
14    Scientists report studies that seem to support...
15    A group of scientists from around the globe ha...
16    Researchers have discovered a new phase of mat...
17    A study of proteins in cerebrospinal fluid
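
The cell above indexes straight into `response_dict['query']['pages'][0]['extract']`, which raises a bare `KeyError` if the title is misspelled or the page has no text. A minimal sketch of a more defensive lookup follows. The helper name `extract_lines` is hypothetical, and the check on the `missing` flag assumes the `formatversion=2` response layout requested above.

```python
def extract_lines(response_dict):
    """Return the page extract split into lines, or raise a descriptive error."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        raise ValueError("response contains no pages")
    page = pages[0]
    # Assumption: with formatversion=2 a nonexistent page carries "missing": True
    if page.get("missing") or "extract" not in page:
        raise ValueError("no extract for page {!r}".format(page.get("title")))
    return page["extract"].split("\n")


# Small hand-made response in the same shape as the API output
sample = {
    "query": {
        "pages": [
            {
                "title": "2024 in science",
                "extract": "The following scientific events occurred in 2024.\n\n== Events ==",
            }
        ]
    }
}
print(extract_lines(sample))
```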

In [72]:
# Read the previously saved file
df = pd.read_csv(file_name_unprocessed, index_col=0, keep_default_na=False)
print(df['text'].head(20))

0     The following scientific events occurred in 2024.
1                                                      
2                                                      
3                                          == Events ==
4                                                      
5                                                      
6                                       === January ===
7     2 January – The Japan Meteorological Agency (J...
8     3 January – The first functional semiconductor...
9     4 January – A review indicates digital rectal ...
10                                            5 January
11    Scientists report that newborn galaxies in the...
12    An analysis of sugar-sweetened beverage (SSB) ...
13                                            9 January
14    Scientists report studies that seem to support...
15    A group of scientists from around the globe ha...
16    Researchers have discovered a new phase of mat...
17    A study of proteins in cerebrospinal fluid

### Preprocessing data

In [73]:
# remove empty lines
df = df[df['text'] != ""]

# remove content from "See also" onward, which is not relevant for the task
# (compare index labels: row positions shifted when the empty lines were dropped)
ref_index = df[df['text'] == '== See also =='].index
df = df[df.index < ref_index[0]]

# remove first line
df = df.drop(df.index[0])

# remove headings
df = df[~df['text'].str.contains('==')]
df = df.reset_index(drop=True)

print(df['text'].head(20))

0     2 January – The Japan Meteorological Agency (J...
1     3 January – The first functional semiconductor...
2     4 January – A review indicates digital rectal ...
3                                             5 January
4     Scientists report that newborn galaxies in the...
5     An analysis of sugar-sweetened beverage (SSB) ...
6                                             9 January
7     Scientists report studies that seem to support...
8     A group of scientists from around the globe ha...
9     Researchers have discovered a new phase of mat...
10    A study of proteins in cerebrospinal fluid ind...
11    A study finds seaweed farming could be set up ...
12                                           10 January
13    Chemists report studies finding that long-chai...
14    Scientists report the extinction of Gigantopit...
15                                           11 January
16    Biologists report the discovery of the oldest ...
17    Scientists report the discovery of Tyranno

In [74]:
from dateutil.parser import parse

# A slightly modified version of the algorithm used in the course.
# Processes the data so that every row includes its corresponding date.

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row has " – " near its start, it already has the needed date prefix
    if " – " not in row["text"][0:15]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix.
            # Write through df.at, since rows yielded by iterrows() may be copies
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)

df['text'] = '2024 ' + df['text']

print(df['text'].head(20))

0     2024 2 January – The Japan Meteorological Agen...
1     2024 3 January – The first functional semicond...
2     2024 4 January – A review indicates digital re...
3     2024 5 January – Scientists report that newbor...
4     2024 5 January – An analysis of sugar-sweetene...
5     2024 9 January – Scientists report studies tha...
6     2024 9 January – A group of scientists from ar...
7     2024 9 January – Researchers have discovered a...
8     2024 9 January – A study of proteins in cerebr...
9     2024 9 January – A study finds seaweed farming...
10    2024 10 January – Chemists report studies find...
11    2024 10 January – Scientists report the extinc...
12    2024 11 January – Biologists report the discov...
13    2024 11 January – Scientists report the discov...
14    2024 11 January – A study of the Caatinga regi...
15    2024 11 January – A graphene-based implant on ...
16    2024 11 January – A review of genetic data fro...
17    2024 11 January – The Upano Valley sites a
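
To make the date-prefix step easier to follow, here is a self-contained sketch of the same idea on a plain list of strings. It is not the notebook's code: it swaps `dateutil.parser.parse` for a stricter `datetime.strptime` check that only accepts "day month" headings, and the helper names `is_date_heading` and `prefix_dates` are made up for illustration. It also assumes English month names (`%B` is locale-dependent).

```python
from datetime import datetime

SEPARATOR = " – "  # en dash with spaces, as used on the Wikipedia page


def is_date_heading(text):
    """True if the line consists only of a day and month, e.g. '5 January'."""
    try:
        # Append the year so that '29 February' parses (2024 is a leap year)
        datetime.strptime(text.strip() + " 2024", "%d %B %Y")
        return True
    except ValueError:
        return False


def prefix_dates(lines):
    """Attach the most recent date heading to every undated line."""
    result = []
    current_date = ""
    for line in lines:
        if SEPARATOR in line[:15]:
            # The line already starts with its own date
            result.append(line)
        elif is_date_heading(line):
            # Remember the heading and drop it from the output
            current_date = line.strip()
        else:
            result.append(current_date + SEPARATOR + line)
    return result


lines = [
    "2 January – First event.",
    "5 January",
    "Second event.",
    "Third event.",
]
for line in prefix_dates(lines):
    print(line)
```

The stricter check avoids one risk of a general-purpose date parser, namely that an ordinary sentence which happens to parse as a date is treated as a heading.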

In [75]:
file_name_processed = "data/{}.csv".format(title)
df.to_csv(file_name_processed)

In [76]:
df = pd.read_csv(file_name_processed, index_col=0)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

file_name_embeddings = "data/{}_embeddings.csv".format(title)
df.to_csv(file_name_embeddings)

In [77]:
df["embeddings"][0]

[-0.004596157930791378,
 -0.007461335510015488,
 -0.026870805770158768,
 -0.011510703712701797,
 -0.0004264959425199777,
 -0.004136854317039251,
 -0.01769723743200302,
 0.009517264552414417,
 -0.01014841627329588,
 -0.02654585801064968,
 -0.007067646831274033,
 0.01418528612703085,
 -0.0012396503007039428,
 -0.012404314242303371,
 -0.0070239040069282055,
 0.004205593839287758,
 0.02894548326730728,
 -0.0037275434006005526,
 0.032669901847839355,
 -0.007005156949162483,
 0.008279957808554173,
 0.014235278591513634,
 -0.031895022839307785,
 0.020396817475557327,
 0.0037775353994220495,
 0.00894235447049141,
 -0.002513670129701495,
 -0.023046404123306274,
 0.027045778930187225,
 -0.011429466307163239,
 0.02350883185863495,
 0.004983597435057163,
 -0.019209502264857292,
 -0.00131073291413486,
 -0.006648962385952473,
 -0.014247776940464973,
 -0.022221531718969345,
 -0.012748010456562042,
 -0.005277301650494337,
 -0.02817060425877571,
 0.014397753402590752,
 0.0012568350648507476,
 0.0179347

In [78]:
df = pd.read_csv(file_name_embeddings, index_col=0)
# The embeddings were saved as list strings; parse them as JSON instead of using eval
df["embeddings"] = df["embeddings"].apply(json.loads).apply(np.array)
df["embeddings"][0]

array([-0.00459616, -0.00746134, -0.02687081, ...,  0.00448055,
       -0.02485862, -0.01650992], shape=(1536,))
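
The retrieval step below ranks rows by cosine distance between the question embedding and each row embedding. As a rough illustration of what that metric measures, here is a dependency-free sketch. It assumes the usual definition (one minus cosine similarity) and is not the `openai.embeddings_utils` implementation. The function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, vectors):
    """Return the indices of `vectors`, closest to `query` first."""
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: distances[i])


# Toy 2-dimensional "embeddings"; real ones here have 1536 dimensions
query = [1.0, 0.0]
rows = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(rank_by_distance(query, rows))
```

A distance near 0 means the vectors point in almost the same direction, which is why sorting in ascending order puts the most relevant rows first.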

## Custom Query Completion



In [11]:
title = "2024_in_science"
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

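def load_model(path):
    # index_col=0 restores the saved index instead of adding an "Unnamed: 0" column
    df = pd.read_csv(path, index_col=0)
    # The embeddings were saved as list strings; parse them as JSON instead of using eval
    df["embeddings"] = df["embeddings"].apply(json.loads).apply(np.array)
    return df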

def create_embeddings(question):
    return get_embedding(question, engine=EMBEDDING_MODEL_NAME)

def create_distances(df, question_embeddings):
    return distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")

def create_prompt(question, context=None):
    if context is not None:
        prompt_template = """
        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(context, question)
    else:
        prompt_template = """
        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(question)
    

def create_context(df):
    context = ""
    for row in df['text'].head(20):
        context += "###\n" + row + '\n'
    return context

def process_prompt(prompt):
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=150
        )
    # In the pre-1.0 openai package used here, these exception classes live in openai.error
    except openai.error.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
        return None
    except openai.error.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
        return None
    except openai.error.APIError as e:
        print(f"OpenAI API returned an API Error: {e}")
        return None

    answer = response["choices"][0]["text"].strip()
    return answer

def answer_question(question, enable_context=True, save_distances=False, print_prompt=False):
    file_name_embeddings = "data/{}_embeddings.csv".format(title)
    df = load_model(file_name_embeddings)
    embeddings_question = create_embeddings(question)
    df['distances'] = create_distances(df, embeddings_question)
    df = df.sort_values(by="distances")
    
    if save_distances:
        filename = "data/{}_question_{}.csv".format(title, question[0:20])
        df.to_csv(filename)

    context = None
    if enable_context:
        context = create_context(df)

    prompt = create_prompt(question, context)

    if print_prompt:
        print(prompt)
 
    answer = process_prompt(prompt)

    return answer
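

`create_context` always takes the 20 closest rows, however long they are. One alternative is to keep adding rows until a size budget is reached. The sketch below illustrates that idea under loud assumptions: `build_context` and `max_words` are hypothetical names, and a whitespace word count is only a crude stand-in for a real token count.

```python
def build_context(texts, max_words=300):
    """Join texts (already sorted by relevance) until the word budget is used up."""
    parts = []
    used = 0
    for text in texts:
        n_words = len(text.split())  # crude proxy for tokens
        if used + n_words > max_words:
            break
        parts.append("###\n" + text + "\n")
        used += n_words
    return "".join(parts)


ranked = [
    "2024 31 January – First sample.",
    "2024 12 August – Second sample.",
    "2024 11 October – Third sample that is quite a bit longer than the others.",
]
print(build_context(ranked, max_words=12))
```

With a budget of 12 words only the first two samples fit, so the least relevant row is dropped rather than truncated.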



## Custom Performance Demonstration

The cells below demonstrate the performance of the custom query using two questions. For each question, the answer from a basic `Completion` model query (no context) is shown first, followed by the answer from the custom query (with retrieved context).

### Question 1

In [80]:
question = "Who discovered TOI-715 b? When was it discovered?"
answer = answer_question(question, enable_context=False, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: Who discovered TOI-715 b? When was it discovered?
        Answer:

Question: Who discovered TOI-715 b? When was it discovered?
Answer: "I don't know"
    */
    const discoveredText = getTextMatch(['Who discovered TOI-715 b\?']);
    const discoveredWhenText = getTextMatch(['When was it discovered\?']);
    let answer = "I don't know";
    if (discoveredText && discoveredWhenText) {
        answer = "A team of astronomers at the Massachusetts Institute of Technology (MIT) using NASA's Transiting Exoplanet Survey Satellite (TESS) discovered TOI-715 b in 2019. This planet was one of five newly discovered worlds orbiting nearby stars and classified as a "super Neptune" due to its large size and gaseous composition."
    }
    return answer;
};


In [81]:
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


Question: Who discovered TOI-715 b? When was it discovered?
Answer: "I don't know"
    */
    const discoveredText = getTextMatch(['Who discovered TOI-715 b\?']);
    const discoveredWhenText = getTextMatch(['When was it discovered\?']);
    let answer = "I don't know";
    if (discoveredText && discoveredWhenText) {
        answer = "A team of astronomers at the Massachusetts Institute of Technology (MIT) using NASA's Transiting Exoplanet Survey Satellite (TESS) discovered TOI-715 b in 2019. This planet was one of five newly discovered worlds orbiting nearby stars and classified as a "super Neptune" due to its large size and gaseous composition."
    }
    return answer;
};


In [82]:
question = "Who discovered TOI-715 b? When was it discovered?"
answer = answer_question(question, enable_context=True, print_prompt=True)


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
2024 31 January – NASA reports the discovery of a super-Earth called TOI-715 b, located in the habitable zone of a red dwarf star about 137 light-years away.
###
2024 12 August – An Earth-sized, ultra-short period exoplanet called TOI-6255b is found to be undergoing extreme tidal distortion, caused by the close proximity of its parent star. This has resulted in an egg-shaped planet, likely to be destroyed within 400 million years.
###
2024 11 October – Astronomers observe the "inside-out" growth of NGC 1549 by using the James Webb Space Telescope. Researchers assume that it could solve the mystery of how these complex structures are being formed from gas clouds.
###
2024 15 July – Scientists announce the discovery of a lunar cave, approximately 250 miles (400 km) from Apollo 11's landing site.
###
2024 

In [83]:
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


Question: Who discovered TOI-715 b? When was it discovered?
Answer: NASA discovered TOI-715 b in 2024 on 31 January.


### Question 2

In [84]:
question = "What is the luminous astronomical object ever discovered? When was it discovered?"
answer = answer_question(question, enable_context=False, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: What is the luminous astronomical object ever discovered? When was it discovered?
        Answer:

Question: What is the luminous astronomical object ever discovered? When was it discovered?
Answer: GRB 080319B which occurred on March 19, 2008 was the most luminous astronomical object ever discovered with an estimated energy release of up to 9.4 × 10^52 joules.


In [85]:
question = "What is the luminous astronomical object ever discovered? When was it discovered?"
answer = answer_question(question, enable_context=True, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
2024 19 February – Astronomers announce the most luminous object ever discovered, quasar QSO J0529-4351, located 12 billion light-years away in the constellation Pictor.
###
2024 18 September – The largest known pair of astrophysical jets is discovered within the radio galaxy Porphyrion, extending 23 million light-years from end to end. This surpasses Alcyoneus, the previous record holder at 16 million light-years.
###
2024 17 December – Zhúlóng ("Torch Dragon"), discovered by the James Webb Space Telescope, is reported as being the most distant known spiral galaxy ever found, seen as it appeared just 1.1 billion years after the Big Bang.
###
2024 11 July – Using the Hubble Space Telescope, scientists resolve the 3D velocity dispersion profile of a dwarf galaxy for the first time, helping to uncover its

### Control question

The control question should show that the model can answer questions about topics that were known at training time.

In [87]:
question = "What diseases can hedgehogs have?"

answer = answer_question(question, enable_context=False, print_prompt=True)
print("\nQuestion: {}\nAnswer: {}".format(question, answer))


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: What diseases can hedgehogs have?
        Answer:

Question: What diseases can hedgehogs have?
Answer: ringworm, parasites

I don't know


## Chatbot

In [13]:
print("Chatbot for scientific news in 2024\n")

while True:
    question = input("Question: ")
    # Stop before sending the exit command to the model
    if question == "exit":
        break
    answer = answer_question(question, enable_context=True, print_prompt=False)
    print("\nQuestion: {}\nAnswer: {}".format(question, answer))

Chatbot for scientific news in 2024


Question: Was a new species discoverd in 2024?
Answer: Yes, many new species were discovered in 2024. Some examples include the new species of mussel named Vadumodiolus teredinicola, the new species of jellyfish named Santjordia pagesi, and the new species of giant snake, the northern green anaconda (Eunectes akayima).

Question: What events happened in 2024 with AI topics?
Answer: Promising innovations relating to global challenges are reported: LAION releases a first version of BUD-E, a fully open source voice assistant (8 Feb), Minesto's Dragon 12 underwater tidal kite turbines are demonstrated successfully, connected to the Faroe Island's power grid (11 Feb), rice grains as scaffolds containing cultured animal cells are demonstrated (14 Feb), an automatic waste sorting system (ZenRobotics 4.0) that can distinguish between over 500 waste categories is released (15 Feb), researchers describe an AI ecosystem interface of foundation models connecte