# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In [1]:
import json
import numpy as np
import openai
import os
import pandas as pd
import requests

from dateutil.parser import parse
from openai.embeddings_utils import get_embedding, distances_from_embeddings

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ.get('OPENAI_API_KEY')

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [49]:
title = "2024_in_science"
params = {
    "action": "query", 
    "prop": "extracts",
    "titles": title,
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
print("Full request URL:", resp.url)
response_dict = resp.json()
response_dict

Full request URL: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=2024_in_science&explaintext=1&formatversion=2&format=json


{'batchcomplete': True,
 'query': {'normalized': [{'fromencoded': False,
    'from': '2024_in_science',
    'to': '2024 in science'}],
  'pages': [{'pageid': 74041015,
    'ns': 0,
    'title': '2024 in science',

In [50]:
import re
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict['query']['pages'][0]['extract'].split("\n")

In [51]:
print(df['text'].head(20))

0     The following scientific events occurred in 2024.
1                                                      
2                                                      
3                                          == Events ==
4                                                      
5                                                      
6                                       === January ===
7     2 January – The Japan Meteorological Agency (J...
8     3 January – The first functional semiconductor...
9     4 January – A review indicates digital rectal ...
10                                            5 January
11    Scientists report that newborn galaxies in the...
12    An analysis of sugar-sweetened beverage (SSB) ...
13                                            9 January
14    Scientists report studies that seem to support...
15    A group of scientists from around the globe ha...
16    Researchers have discovered a new phase of mat...
17    A study of proteins in cerebrospinal fluid

In [52]:
# remove empty lines
df = df[df['text'] != ""]
df = df.reset_index(drop=True)

In [53]:
# remove content after "See also" with is not relevant for the task
ref_index = df[df['text'] == '== See also =='].index
dfn = df.drop(df.index[ref_index[0]:])

# remove first line
dfn = df.drop(df.index[0])

# remove headings
dfn = dfn[~dfn['text'].str.contains('==')]
dfn = dfn.reset_index(drop=True)

for row in dfn['text']:
    print(row)

2 January – The Japan Meteorological Agency (JMA) publishes its JRA-55 dataset, confirming 2023 as the warmest year on record globally, at 1.43 °C (2.57 °F) above the 1850–1900 baseline. This is 0.14 °C (0.25 °F) above the previous record set in 2016.
3 January – The first functional semiconductor made from graphene is created.
4 January – A review indicates digital rectal examination (DRE) is an outdated routine medical practice, with a lower cancer detection rate compared to prostate-specific antigen (PSA) testing and no benefit from combining DRE and PSA.
5 January
Scientists report that newborn galaxies in the very early universe were "banana"-shaped, much to the surprise of researchers.
An analysis of sugar-sweetened beverage (SSB) taxes concludes scaling them could yield substantial public health benefits.
9 January
Scientists report studies that seem to support the hypothesis that life may have begun in a shallow lake rather than otherwise – perhaps somewhat like a "warm little 

In [38]:
file_name = "data/{}_lines.csv".format(title)
dfn.to_csv(file_name)

In [64]:
df = pd.read_csv(file_name, index_col=0)

In [65]:
from dateutil.parser import parse

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " - ", it already has the needed date prefix
    if " – " not in row["text"][0:15]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except:
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)

In [66]:
for row in df['text']:
    print(row)

2 January – The Japan Meteorological Agency (JMA) publishes its JRA-55 dataset, confirming 2023 as the warmest year on record globally, at 1.43 °C (2.57 °F) above the 1850–1900 baseline. This is 0.14 °C (0.25 °F) above the previous record set in 2016.
3 January – The first functional semiconductor made from graphene is created.
4 January – A review indicates digital rectal examination (DRE) is an outdated routine medical practice, with a lower cancer detection rate compared to prostate-specific antigen (PSA) testing and no benefit from combining DRE and PSA.
5 January – Scientists report that newborn galaxies in the very early universe were "banana"-shaped, much to the surprise of researchers.
5 January – An analysis of sugar-sweetened beverage (SSB) taxes concludes scaling them could yield substantial public health benefits.
9 January – Scientists report studies that seem to support the hypothesis that life may have begun in a shallow lake rather than otherwise – perhaps somewhat like

In [68]:
file_name = "data/{}.csv".format(title)
dfn.to_csv(file_name)

In [None]:
df = pd.read_csv(file_name, index_col=0)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

file_name = "data/{}_embeddings.csv".format(title)
df.to_csv(file_name)

In [70]:
df = pd.read_csv(file_name, index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df["embeddings"][0]

array([-0.00309624, -0.0093812 , -0.02867035, ...,  0.00607019,
       -0.01890036, -0.01413451], shape=(1536,))

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [71]:

def load_model(path):
    df = pd.read_csv(path)
    df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
    return df

def create_embeddings(question):
    return get_embedding(question, engine=EMBEDDING_MODEL_NAME)

def create_distances(question_embeddings):
    return distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")

def create_prompt(question, context=""):
    if context != "":
        prompt_template = """
        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        {}

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(context, question)
    else:
        prompt_template = """
        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: {}
        Answer:"""

        return prompt_template.format(question)
    

def create_context():
    context = ""
    for row in df['text'].head(20):
        context += "###\n" + row + '\n'
    return context

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [72]:
question = "Who discovered TOI-715 b? When was it discovered?"

df = load_model(file_name)
embeddings = create_embeddings(question)
#print(embeddings)
distances = create_distances(embeddings)
df['distances'] = distances

filename_q1 = "data/{}_question1.csv".format(title)
df.to_csv(filename_q1)
with open("data/question_1_embeddings.json", 'w') as f:
    json.dump(embeddings, f)

In [74]:
df = pd.read_csv(filename_q1)
df=df.sort_values(by="distances")

In [75]:
prompt_simple = create_prompt(question)
print(prompt_simple)


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: Who discovered TOI-715 b? When was it discovered?
        Answer:


In [76]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_simple,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

The exoplanet TOI-715 b was discovered by NASA's TESS mission in 2020. However, the published paper on its discovery was released in 2021. It is unclear if one specific person can be credited with the discovery of this exoplanet.


In [77]:
context = create_context()
prompt_context = create_prompt(question, context)

In [78]:
print(prompt_context)


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
31 January – NASA reports the discovery of a super-Earth called TOI-715 b, located in the habitable zone of a red dwarf star about 137 light-years away.
###
An Earth-sized, ultra-short period exoplanet called TOI-6255b is found to be undergoing extreme tidal distortion, caused by the close proximity of its parent star. This has resulted in an egg-shaped planet, likely to be destroyed within 400 million years.
###
11 October – Astronomers observe the "inside-out" growth of NGC 1549 by using the James Webb Space Telescope. Researchers assume that it could solve the mystery of how these complex structures are being formed from gas clouds.
###
10 October – Scientists use a high-level machine learning model "SHBoost", to process data and estimate precise stellar properties for 217 million stars observed by t

In [79]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_context,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

NASA discovered TOI-715 b on January 31st.


### Question 2

In [80]:
question = "What is the luminous astronomical object ever discovered?"

df = load_model(file_name)
embeddings = create_embeddings(question)
#print(embeddings)
distances = create_distances(embeddings)
df['distances'] = distances

filename_q2 = "data/{}_question1.csv".format(title)
df.to_csv(filename_q2)
with open("data/question_2_embeddings.json", 'w') as f:
    json.dump(embeddings, f)

In [81]:
df = pd.read_csv(filename_q2)
df=df.sort_values(by="distances")

In [82]:
prompt_simple = create_prompt(question)
print(prompt_simple)


        Answer the question and if the question can't be answered from you, say 
        "I don't know"

        ---

        Question: What is the luminous astronomical object ever discovered?
        Answer:


In [83]:
context = create_context()
prompt_context = create_prompt(question, context)

In [84]:
prompt_simple = create_prompt(question)
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_simple,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

I don't know


In [85]:
context = create_context()
prompt_context = create_prompt(question, context)
print(prompt_context)


        Answer the question based on the context below, and if the 
        question can't be answered based on the context, say 
        "I don't know"

        Context: 

        ###
Astronomers announce the most luminous object ever discovered, quasar QSO J0529-4351, located 12 billion light-years away in the constellation Pictor.
###
The largest known pair of astrophysical jets is discovered within the radio galaxy Porphyrion, extending 23 million light-years from end to end. This surpasses Alcyoneus, the previous record holder at 16 million light-years.
###
17 December – Zhúlóng ("Torch Dragon"), discovered by the James Webb Space Telescope, is reported as being the most distant known spiral galaxy ever found, seen as it appeared just 1.1 billion years after the Big Bang.
###
30 May – NASA reports that the James Webb Space Telescope has discovered JADES-GS-z14-0, the most distant known galaxy, which existed only 290 million years after the Big Bang. Its redshift of 14.32 exceeds 

In [86]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_context,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Quasar QSO J0529-4351.
