# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In [186]:
import json
import numpy as np
import openai
import os
import pandas as pd
import requests

from dateutil.parser import parse
from openai.embeddings_utils import get_embedding, distances_from_embeddings

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.environ.get('OPENAI_API_KEY')

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [83]:
params = {
    "action": "query", 
    "prop": "extracts",
    "titles": "Hedgehog",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
print("Full request URL:", resp.url)
response_dict = resp.json()
response_dict

Full request URL: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Hedgehog&explaintext=1&formatversion=2&format=json


{'batchcomplete': True,
 'query': {'pages': [{'pageid': 46590,
    'ns': 0,
    'title': 'Hedgehog',
    'extract': 'A  hedgehog  is a spiny mammal of the subfamily Erinaceinae, in the eulipotyphlan family Erinaceidae. There are 17 species of hedgehog in five genera found throughout parts of Europe, Asia, and Africa, and in New Zealand by introduction. There are no hedgehogs native to Australia and no living species native to the Americas. However, the extinct genus Amphechinus was once present in North America.\nHedgehogs share distant ancestry with shrews (family Soricidae), with gymnures possibly being the intermediate link, and they have changed little over the last 15 million years. Like many of the first mammals, they have adapted to a nocturnal way of life. Their spiny protection resembles that of porcupines, which are rodents, and echidnas, a type of monotreme.\n\n\n== Etymology ==\nThe name hedgehog came into use around the year 1450, derived from the Middle English heyghoge, 

In [94]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict['query']['pages'][0]['extract'].split("\n")

In [95]:
df

Unnamed: 0,text
0,A hedgehog is a spiny mammal of the subfamil...
1,Hedgehogs share distant ancestry with shrews (...
2,
3,
4,== Etymology ==
...,...
114,Hedgehog reference at the University of Michig...
115,Hedgehog facts
116,Hedgehog care and advice
117,Natural History of European Hedgehogs (Wildlif...


In [None]:
# remove empty lines
df = df[df['text'] != ""]
df = df.reset_index(drop=True)
df['text'][7]

"Hedgehogs roll into a tight spiny ball when threatened, tucking in the furry face, feet, and belly. The hedgehog's back contains two large muscles that direct the quills. Some light-weight desert hedgehog species with fewer spines are more likely to flee or attack, ramming an intruder with the spines, rolling up only as a last resort."

In [147]:
# remove content after "See also" with is not relevant for the task
ref_index = df[df['text'] == '== References =='].index
dfn = df.drop(df.index[ref_index[0]:])

# remove headings
dfn = dfn[~dfn['text'].str.contains('==')]
dfn = dfn.reset_index(drop=True)

for row in dfn['text']:
    print(row)

A  hedgehog  is a spiny mammal of the subfamily Erinaceinae, in the eulipotyphlan family Erinaceidae. There are 17 species of hedgehog in five genera found throughout parts of Europe, Asia, and Africa, and in New Zealand by introduction. There are no hedgehogs native to Australia and no living species native to the Americas. However, the extinct genus Amphechinus was once present in North America.
Hedgehogs share distant ancestry with shrews (family Soricidae), with gymnures possibly being the intermediate link, and they have changed little over the last 15 million years. Like many of the first mammals, they have adapted to a nocturnal way of life. Their spiny protection resembles that of porcupines, which are rodents, and echidnas, a type of monotreme.
The name hedgehog came into use around the year 1450, derived from the Middle English heyghoge, from heyg, hegge 'hedge', because it frequents hedgerows, and hoge, hogge 'hog', from its piglike snout. Other names that are used are urchi

In [None]:
dfn.to_csv("data/hedgehog.csv")

In [234]:
df = pd.read_csv("data/hedgehog.csv", index_col=0)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

df.to_csv("data/hedgehog_embedding.csv")

In [155]:
df.to_csv("data/hedgehog_embedding.csv")

In [235]:
df = pd.read_csv("data/hedgehog_embedding.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df["embeddings"][0]

array([ 0.00173882,  0.00553422, -0.02343979, ..., -0.02628833,
        0.02005475, -0.02103833], shape=(1536,))

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [236]:
question = "What diseases can a hedgehog have?"
question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
question_embeddings

[0.017819583415985107,
 0.0036284802481532097,
 0.017070643603801727,
 -0.031016403809189796,
 -0.02195166051387787,
 0.033185746520757675,
 -0.02166757918894291,
 -0.009574797935783863,
 -0.01497877947986126,
 0.0024663335643708706,
 0.014513920992612839,
 0.005261941812932491,
 0.019846882671117783,
 0.011279279366135597,
 0.012008849531412125,
 0.03024164028465748,
 0.046769946813583374,
 0.022390693426132202,
 0.02993173338472843,
 -0.01412653923034668,
 -0.019265810027718544,
 0.014668873511254787,
 -0.006349840667098761,
 -0.009529602713882923,
 0.011259910650551319,
 0.007256960496306419,
 0.014462269842624664,
 -0.0005451597971841693,
 -0.01402323693037033,
 -0.0005165096372365952,
 0.025567227974534035,
 -0.011873265728354454,
 -0.02400478534400463,
 -0.004406473133713007,
 -0.0033185745123773813,
 0.01544363796710968,
 0.0205183457583189,
 -0.005968914367258549,
 0.020944466814398766,
 -0.008948530070483685,
 0.01780666969716549,
 0.006020565517246723,
 -0.004884244408458471,

In [237]:
distances = distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
distances

[np.float64(0.1685098447386707),
 np.float64(0.17453466683207564),
 np.float64(0.19175629331039912),
 np.float64(0.15753684369614707),
 np.float64(0.17865630509636976),
 np.float64(0.19205309106839696),
 np.float64(0.16177368751096155),
 np.float64(0.17216104983897795),
 np.float64(0.16029512822168246),
 np.float64(0.1863022733466445),
 np.float64(0.1656123245890232),
 np.float64(0.18130117562255565),
 np.float64(0.1932685578125578),
 np.float64(0.190154905375698),
 np.float64(0.22386607510464895),
 np.float64(0.16815643690643822),
 np.float64(0.18390116914158505),
 np.float64(0.17538397999768263),
 np.float64(0.17831211313175255),
 np.float64(0.1596217284286099),
 np.float64(0.19352944502866065),
 np.float64(0.1791722374733541),
 np.float64(0.10390895501024455),
 np.float64(0.1194462406921214),
 np.float64(0.14560525810979008),
 np.float64(0.1316294913438535),
 np.float64(0.14364816092566168),
 np.float64(0.15972122800962196),
 np.float64(0.24508189827840943),
 np.float64(0.1724163594

In [240]:
df['distances'] = distances
df.to_csv("data/hedgehog_distances.csv")

In [241]:
df = pd.read_csv("data/hedgehog_distances.csv")
df=df.sort_values(by="distances")
df.head(20)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,text,embeddings,distances
22,49,49,"Small-toothed forest hedgehog, Mesechinus miodon",[ 0.00677569 0.01798113 -0.00633337 ... -0.01...,0.103909
23,41,41,"European hedgehog, Erinaceus europaeus",[ 0.01139074 -0.01136534 -0.00212386 ... -0.02...,0.119446
25,0,0,A hedgehog is a spiny mammal of the subfamil...,[ 0.00173882 0.00553422 -0.02343979 ... -0.02...,0.131629
54,38,38,Genus Erinaceus,[ 0.00287111 -0.00555636 -0.00345968 ... -0.02...,0.14321
26,30,30,Hedgehogs are a food source in many cultures. ...,[ 0.01497972 0.01070624 -0.01369059 ... -0.00...,0.143648
24,15,15,The various species have many predators: while...,[ 0.01613636 0.00929608 -0.0104597 ... -0.00...,0.145605
44,9,9,The sense of smell has been little studied in ...,[ 0.01496064 -0.00112909 0.00321122 ... 0.01...,0.15032
39,21,21,"In 2022, it was reported that the hedgehog pop...",[-0.00144557 0.00037003 -0.01000671 ... 0.01...,0.151933
45,13,13,Hedgehogs have a relatively long lifespan for ...,[ 0.01745288 -0.00931666 0.00035124 ... -0.01...,0.153456
47,5,5,"Hedgehogs are primarily nocturnal, with some s...",[ 0.0232083 0.01066772 0.00103529 ... -0.00...,0.154808


In [242]:
for row in df['text'].head(20):
    print(row)

Small-toothed forest hedgehog, Mesechinus miodon
European hedgehog, Erinaceus europaeus
A  hedgehog  is a spiny mammal of the subfamily Erinaceinae, in the eulipotyphlan family Erinaceidae. There are 17 species of hedgehog in five genera found throughout parts of Europe, Asia, and Africa, and in New Zealand by introduction. There are no hedgehogs native to Australia and no living species native to the Americas. However, the extinct genus Amphechinus was once present in North America.
Genus Erinaceus
Hedgehogs are a food source in many cultures. They were eaten in Ancient Egypt and some recipes of the Late Middle Ages call for hedgehog meat. They are traded throughout Eurasia and Africa for traditional medicine and witchcraft. In the Middle East and especially among Bedouins, hedgehog meat is considered medicine against  rheumatism and arthritis. Hedgehogs are also said to cure a variety of disorders from tuberculosis to impotence. In Morocco, inhaling the smoke of the burnt skin or bri

In [171]:
for row in df['text'].head(20):
    print(row)

Hedgehogs suffer many diseases common to mammals, including cancer, fatty liver disease, and cardiovascular disease.
Cancer is very common in hedgehogs. The most common is squamous cell carcinoma, which spreads quickly from bone to the organs, unlike in humans. Surgery to remove the bone tumors is impractical.
Hedgehogs are also highly susceptible to pneumonia, with difficulty breathing and nasal discharge, caused by the bacterium Bordetella bronchiseptica.
Indian hedgehog, Paraechinus micropus 
Hedgehogs uncommonly transmit a fungal ringworm or dermatophytosis skin infection to human handlers and other hedgehogs, caused by Trichophyton erinacei, a distinct mating group among the Arthroderma benhamiae fungi.
Fatty liver and heart disease are believed to be caused by an unhealthy diet and obesity. Hedgehogs will eagerly eat foods high in fat and sugar, despite a metabolism adapted for low-fat, protein-rich insects.
Long-eared hedgehog, Hemiechinus auritus
Amur hedgehog, Erinaceus amuren

In [178]:
context = ""
for row in df['text'].head(20):
    context += "###\n" + row + '\n'
print(context)

###
Hedgehogs suffer many diseases common to mammals, including cancer, fatty liver disease, and cardiovascular disease.
###
Cancer is very common in hedgehogs. The most common is squamous cell carcinoma, which spreads quickly from bone to the organs, unlike in humans. Surgery to remove the bone tumors is impractical.
###
Hedgehogs are also highly susceptible to pneumonia, with difficulty breathing and nasal discharge, caused by the bacterium Bordetella bronchiseptica.
###
Indian hedgehog, Paraechinus micropus 
###
Hedgehogs uncommonly transmit a fungal ringworm or dermatophytosis skin infection to human handlers and other hedgehogs, caused by Trichophyton erinacei, a distinct mating group among the Arthroderma benhamiae fungi.
###
Fatty liver and heart disease are believed to be caused by an unhealthy diet and obesity. Hedgehogs will eagerly eat foods high in fat and sugar, despite a metabolism adapted for low-fat, protein-rich insects.
###
Long-eared hedgehog, Hemiechinus auritus
###

In [243]:

def load_model(path):
    df = pd.read_csv(path)
    df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
    return df

def create_embeddings(question):
    return get_embedding(question, engine=EMBEDDING_MODEL_NAME)

def create_distances(question_embeddings):
    return distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")

def create_prompt(question, context=""):
    prompt_template = """
    Answer the question based on the context below, and if the 
    question can't be answered based on the context, say 
    "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""

    return prompt_template.format(question, context)

def create_context():
    context = ""
    for row in df['text'].head(20):
        context += "###\n" + row + '\n'
    return context

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [244]:
question = "What diseases can a hedgehog have?"

df = load_model("data/hedgehog_embedding.csv")
embeddings = create_embeddings(question)
#print(embeddings)
distances = create_distances(embeddings)
df['distances'] = distances
df.to_csv("data/hedgehog_distances_question1.csv")
with open("data/question_1_embeddings.json", 'w') as f:
    json.dump(embeddings, f)

In [245]:
df = pd.read_csv("data/hedgehog_distances_question1.csv")
df=df.sort_values(by="distances")

In [246]:
prompt_simple = create_prompt(question)

In [247]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_simple,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

I


In [248]:
context = create_context()
prompt_context = create_prompt(question, context)

In [252]:
print(prompt_context)


    Answer the question based on the context below, and if the 
    question can't be answered based on the context, say 
    "I don't know"

    Context: 

    What diseases can a hedgehog have?

    ---

    Question: ###
Hedgehogs suffer many diseases common to mammals, including cancer, fatty liver disease, and cardiovascular disease.
###
Cancer is very common in hedgehogs. The most common is squamous cell carcinoma, which spreads quickly from bone to the organs, unlike in humans. Surgery to remove the bone tumors is impractical.
###
Hedgehogs are also highly susceptible to pneumonia, with difficulty breathing and nasal discharge, caused by the bacterium Bordetella bronchiseptica.
###
Indian hedgehog, Paraechinus micropus 
###
Hedgehogs uncommonly transmit a fungal ringworm or dermatophytosis skin infection to human handlers and other hedgehogs, caused by Trichophyton erinacei, a distinct mating group among the Arthroderma benhamiae fungi.
###
Fatty liver and heart disease are bel

In [250]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_context,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Cancer, pneumonia, fatty liver disease, cardiovascular disease, fungal ringworm or dermatophytosis skin infections, balloon syndrome, natural immunity against some snake venom and α-neurotoxin


### Question 2

In [223]:
question = "What do hedgehogs eat?"

df = load_model("data/hedgehog_embedding.csv")
embeddings = create_embeddings(question)
#print(embeddings)
distances = create_distances(embeddings)
df['distances'] = distances
df.to_csv("data/hedgehog_distances_question2.csv")
with open("data/question_2_embeddings.json", 'w') as f:
    json.dump(embeddings, f)

In [224]:
df = pd.read_csv("data/hedgehog_distances_question2.csv")
df=df.sort_values(by="distances")

In [228]:
print(df['text'].head(10))
df['embeddings']


10    Although traditionally classified in the aband...
15    The various species have many predators: while...
6     Hedgehogs are fairly vocal, with a variety of ...
5     Hedgehogs are primarily nocturnal, with some s...
24    Fatty liver and heart disease are believed to ...
54               Indian hedgehog, Paraechinus micropus 
30    Hedgehogs are a food source in many cultures. ...
22    Hedgehogs suffer many diseases common to mamma...
19    In areas where hedgehogs have been introduced,...
1     Hedgehogs share distant ancestry with shrews (...
Name: text, dtype: object


10    [ 0.02069382 -0.00985236 -0.03234452 ... -0.01...
15    [ 0.01609841  0.00932216 -0.01046654 ... -0.00...
6     [ 0.0040018  -0.00092159 -0.01209461 ... -0.01...
5     [ 0.02331606  0.01058268  0.00102079 ... -0.00...
24    [ 0.00949567 -0.00740482 -0.00498587 ... -0.00...
54    [-0.00159441 -0.00211778 -0.02009217 ... -0.03...
30    [ 0.01490215  0.0108221  -0.01376773 ... -0.00...
22    [ 0.01009876  0.00247485 -0.01219203 ... -0.00...
19    [ 0.00336399 -0.01525777  0.01016545 ... -0.00...
1     [ 0.02259638 -0.00435686 -0.01331549 ... -0.00...
48    [-0.00717033  0.00896451 -0.00677587 ... -0.03...
39    [ 0.00911664 -0.00654172 -0.02039699 ... -0.02...
0     [ 0.00166384  0.00559191 -0.02345343 ... -0.02...
44    [ 0.00678325  0.00045774  0.00011635 ... -0.02...
49    [ 0.00673264  0.01802801 -0.00633603 ... -0.01...
3     [ 0.00518654  0.0124542  -0.00962518 ... -0.01...
55    [ 0.0070671  -0.0065805   0.0057412  ... -0.02...
32    [ 0.00246984  0.00266743 -0.00789054 ... -

In [218]:
context = create_context()
prompt_context = create_prompt(question, context)

In [219]:
prompt_simple = create_prompt(question)
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_simple,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

Hedgehogs typically live for three to eight years in the wild, but captive ones can


In [220]:
prompt_simple

'\n    Answer the question based on the context below, and if the \n    question can\'t be answered based on the context, say \n    "I don\'t know"\n\n    Context: \n\n    How is the lifespan of a hedgehog?\n\n    ---\n\n    Question: \n    Answer:'

In [215]:
context

'###\nHedgehogs have a relatively long lifespan for their size. In captivity, lack of predators and controlled diet contribute to a lifespan of eight to ten years depending on size. In the wild, larger species live four to seven years (some recorded up to 16 years), and smaller species live two to four years (four to seven in captivity). This compares to a mouse at two years and a large rat at three to five years.\n###\nHedgehog gestation lasts 35–58 days, depending on species. The average litter is three to four newborns for larger species and five to six for smaller ones. As with many animals, it is not unusual for an adult male hedgehog to kill newborn males.\n###\nLong-eared hedgehog, Hemiechinus auritus\n###\nIndian hedgehog, Paraechinus micropus \n###\nIn 2022, it was reported that the hedgehog population in rural Britain was declining rapidly, down by 30–75% since 2000.\n###\nCancer is very common in hedgehogs. The most common is squamous cell carcinoma, which spreads quickly fr

In [221]:
context = create_context()
prompt_context = create_prompt(question, context)
prompt_context

'\n    Answer the question based on the context below, and if the \n    question can\'t be answered based on the context, say \n    "I don\'t know"\n\n    Context: \n\n    How is the lifespan of a hedgehog?\n\n    ---\n\n    Question: ###\nHedgehogs have a relatively long lifespan for their size. In captivity, lack of predators and controlled diet contribute to a lifespan of eight to ten years depending on size. In the wild, larger species live four to seven years (some recorded up to 16 years), and smaller species live two to four years (four to seven in captivity). This compares to a mouse at two years and a large rat at three to five years.\n###\nHedgehog gestation lasts 35–58 days, depending on species. The average litter is three to four newborns for larger species and five to six for smaller ones. As with many animals, it is not unusual for an adult male hedgehog to kill newborn males.\n###\nLong-eared hedgehog, Hemiechinus auritus\n###\nIndian hedgehog, Paraechinus micropus \n##

In [222]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_context,
    max_tokens=150
)
answer = response["choices"][0]["text"].strip()
print(answer)

The lifespan of a hedgehog varies depending on its habitat, with hedgehogs living four to ten years in captivity and two to seven years in the wild.
