# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

## Step 0

### Preparing data

In [1]:
import requests

In [2]:
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}

# Always include a User-Agent when calling Wikimedia APIs
headers = {
    "User-Agent": "MyApp/1.0 (https://example.com; cshreyastech@gmail.com)"
}

In [3]:
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params, headers=headers)
# Check status and parse
if resp.status_code == 200:
    response_dict = resp.json()
    # print(response_dict)
else:
    print("Error:", resp.status_code, resp.text)
    
# response_dict["query"]["pages"][0]["extract"].split("\n")
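
The commented-out line above reaches into the parsed JSON by position. A minimal sketch of the same lookup as a small helper, assuming the `formatversion=2` response shape requested here (`query`, then a `pages` list, then `extract`). The helper name `extract_lines` and the sample dictionary are hypothetical and only illustrate the shape:

```python
def extract_lines(response_dict):
    """Return the first page's plain-text extract split into lines.

    Assumes the formatversion=2 layout: response_dict["query"]["pages"]
    is a list of page dicts, each of which may carry an "extract" string.
    """
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    # A page without an "extract" key falls back to an empty string
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []


# Hypothetical stand-in for a parsed API response (not real API output)
sample = {"query": {"pages": [{"title": "2022", "extract": "First line\n\n== Events =="}]}}
lines = extract_lines(sample)
print(lines)
```

Returning an empty list for an unexpected shape keeps the later `pandas` cells from failing with a `KeyError` when the request did not succeed.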

## Step 1

### Loading the Data with `pandas`

In [4]:
import pandas as pd

In [5]:
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year began with another wave in the COVID-...
2,2022 was also dominated by wars and armed conf...
3,
4,
...,...
287,== References ==
288,
289,
290,== External links ==


In [6]:
df = df[df["text"].str.len() > 0]
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year began with another wave in the COVID-...
2,2022 was also dominated by wars and armed conf...
5,== Conflicts ==
7,The ongoing Russian invasion of Ukraine escala...
...,...
281,== Births and deaths ==
284,== Nobel Prizes ==
287,== References ==
290,== External links ==


In [7]:
df = df[~df["text"].str.startswith("==")]
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year began with another wave in the COVID-...
2,2022 was also dominated by wars and armed conf...
7,The ongoing Russian invasion of Ukraine escala...
14,January 1
...,...
272,December 24 – 2022 Fijian general election: Th...
273,December 29 – Brazilian football legend Pelé d...
274,December 31 – Former Pope Benedict XVI dies at...
278,The world population was estimated to have rea...


In [8]:
from dateutil.parser import parse

# Clean up text to remove empty lines and headings
# (.copy() so the assignments below modify this DataFrame, not a view)
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix. Write through
            # df.at because the row yielded by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
df.tail()

Unnamed: 0,text
202,December 24 – 2022 Fijian general election: Th...
203,December 29 – Brazilian football legend Pelé d...
204,December 31 – Former Pope Benedict XVI dies at...
205,December 7 – The world population was estimate...
206,December 7 – Media related to 2022 at Wikimed...
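
The loop above carries the most recent date heading forward and prepends it to undated lines. A minimal sketch of that idea on a plain list, assuming headings look like `January 1`. It swaps `dateutil.parser.parse` for the standard library's `datetime.strptime` with a fixed format so it runs without extra packages. The names `is_date_heading` and `add_date_prefixes` and the sample lines are hypothetical:

```python
from datetime import datetime

SEPARATOR = " – "  # en dash with spaces, as used in the article text


def is_date_heading(text):
    """True if text looks like 'January 1' (full month name plus day)."""
    try:
        # Append a leap year so 'February 29' parses too
        datetime.strptime(text.strip() + " 2000", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines):
    """Prefix undated lines with the most recent date heading."""
    prefix = ""
    result = []
    for text in lines:
        if SEPARATOR in text:
            result.append(text)   # already starts with a date
        elif is_date_heading(text):
            prefix = text         # remember the heading, drop the row itself
        else:
            result.append(prefix + SEPARATOR + text)
    return result


sample_lines = [
    "An undated introduction.",
    "January 1",
    "First event of the year.",
    "January 2 – Second event.",
]
prefixed = add_date_prefixes(sample_lines)
for line in prefixed:
    print(line)
```

As in the notebook's loop, the prefix only changes on standalone heading rows, so lines before the first heading get an empty prefix and a trailing undated line inherits the last such heading.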


In [9]:
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year began with another wave in the COV...
2,– 2022 was also dominated by wars and armed c...
3,– The ongoing Russian invasion of Ukraine esc...
4,January 1 – France takes over the Presidency ...
...,...
202,December 24 – 2022 Fijian general election: Th...
203,December 29 – Brazilian football legend Pelé d...
204,December 31 – Former Pope Benedict XVI dies at...
205,December 7 – The world population was estimate...


### Creating an Embeddings Index with `openai.Embedding`

In [10]:
# df = pd.read_csv("text.csv", index_col=0)


In [11]:
import openai
openai.api_key = "YOUR API KEY"


In [12]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# response = openai.Embedding.create(
#     input=df["text"].tolist(),
#     model=EMBEDDING_MODEL_NAME
# )

# type(response)

openai.openai_object.OpenAIObject

In [18]:
# response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [19]:
# type(response["data"])

list

In [20]:
# response["data"][0]

<OpenAIObject embedding at 0x7b6bf3e4f9f0> JSON: {
  "embedding": [
    8.00940779299708e-06,
    -0.01797325722873211,
    -0.017194541171193123,
    -0.012296169996261597,
    -0.009608345106244087,
    -0.0012190979905426502,
    -0.008320952765643597,
    0.013213044963777065,
    -0.0226204302161932,
    -0.0020708180963993073,
    0.030118707567453384,
    0.02435370162129402,
    -0.018990611657500267,
    -0.01680518314242363,
    -0.0021006478928029537,
    0.004593793768435717,
    0.01847565360367298,
    -0.004571814090013504,
    0.021201159805059433,
    0.009633464738726616,
    0.002265497110784054,
    0.009784184396266937,
    -0.002078668214380741,
    -0.014016879722476006,
    0.009966302663087845,
    0.01750854030251503,
    0.004477614536881447,
    -0.005036531016230583,
    0.03577067703008652,
    -0.0229721087962389,
    -0.0001466960384277627,
    -0.010839218273758888,
    -0.015046793967485428,
    -0.016943342983722687,
    0.003554459661245346,
    -0.0

In [22]:
# len(response["data"][0]["embedding"])

1536

In [26]:
# embeddings = [data["embedding"] for data in response["data"]]

In [28]:
# df["embeddings"] = embeddings
# df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[8.00940779299708e-06, -0.01797325722873211, -..."
1,– The year began with another wave in the COV...,"[-0.004340707324445248, -0.01989074982702732, ..."
2,– 2022 was also dominated by wars and armed c...,"[-0.009628579951822758, -0.015318196266889572,..."
3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014748847112059593, -0.007639929186552763,..."
4,January 1 – France takes over the Presidency ...,"[0.03009609319269657, -0.010577891953289509, -..."
...,...,...
202,December 24 – 2022 Fijian general election: Th...,"[-0.011724342592060566, -0.009384616278111935,..."
203,December 29 – Brazilian football legend Pelé d...,"[-0.007581493351608515, 0.0041535887867212296,..."
204,December 31 – Former Pope Benedict XVI dies at...,"[0.023394910618662834, 0.0076256790198385715, ..."
205,December 7 – The world population was estimate...,"[-0.0040730503387749195, -0.014438522979617119..."


In [29]:
# df.to_csv("embeddings.csv")

## Step 2

### Finding Relevant Data with Cosine Similarity

In [72]:
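import ast

import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv", index_col=0)
# literal_eval parses the stored list strings without executing arbitrary code
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)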
type(df)

pandas.core.frame.DataFrame
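
Writing a list column to CSV stores each embedding as its string representation, which is why the column has to be parsed back after loading. A minimal sketch of that round trip using only the standard library, serializing with `json.dumps` and parsing with `json.loads`, which only reads data and does not execute code. The in-memory buffer and the tiny three-dimensional vectors are made up for illustration:

```python
import csv
import io
import json

# Hypothetical rows; the real embeddings in this notebook have 1536 dimensions
rows = [
    {"text": "first sample", "embeddings": [0.1, -0.2, 0.3]},
    {"text": "second sample", "embeddings": [0.0, 0.5, -0.5]},
]

# Write: each vector is serialized to a JSON string in its CSV cell
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embeddings": json.dumps(row["embeddings"])})

# Read: every cell comes back as a string and must be parsed again
buffer.seek(0)
loaded = [
    {"text": r["text"], "embeddings": json.loads(r["embeddings"])}
    for r in csv.DictReader(buffer)
]
print(loaded[0]["embeddings"])
```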

In [31]:
question = "When did Russia invade Ukraine?"

In [34]:
# import openai
# from openai.embeddings_utils import get_embedding

# question_embeedings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
# # question_embeedings

[0.0016044961521402001,
 -0.019282648339867592,
 0.0034815892577171326,
 -0.013990121893584728,
 -0.02527659200131893,
 0.001979914726689458,
 -0.013735060580074787,
 -0.024689950048923492,
 -0.013352468609809875,
 -0.02120836079120636,
 0.022394398227334023,
 0.02462618611752987,
 -0.009048305451869965,
 -0.011873112060129642,
 -0.006309583317488432,
 -0.010489403270184994,
 0.010610557161271572,
 -0.003959829453378916,
 0.03336204215884209,
 -0.01868325285613537,
 -0.01435996126383543,
 -0.01624741591513157,
 0.0033349287696182728,
 0.0013040017802268267,
 -0.014831825159490108,
 0.006733622867614031,
 0.013735060580074787,
 -0.029102513566613197,
 0.015431219711899757,
 -0.014449232257902622,
 -0.01113981008529663,
 -0.022815249860286713,
 -0.020494189113378525,
 -0.016158144921064377,
 -0.035121966153383255,
 -0.03239280730485916,
 0.009150330908596516,
 -0.009367132559418678,
 0.015048626810312271,
 -0.004061853978782892,
 0.009507416747510433,
 0.017892561852931976,
 -0.004600671

In [37]:
# from openai.embeddings_utils import distances_from_embeddings

In [40]:
# distances = distances_from_embeddings(question_embeedings, df["embeddings"].tolist(), distance_metric="cosine")
# # distances

[0.2909545237960648,
 0.28059195106729784,
 0.18533158488669488,
 0.13048907343930882,
 0.2522891675333879,
 0.28964597462419406,
 0.2615814398449131,
 0.23043418428852702,
 0.21796327881307032,
 0.1776878085935384,
 0.2620326331892232,
 0.2627021441424129,
 0.28682575261373167,
 0.2751959421232052,
 0.25369833329875313,
 0.2644069297036914,
 0.23330397907636313,
 0.25343083817068657,
 0.2624541373837993,
 0.23041193302820318,
 0.2715976744892449,
 0.25559902018340985,
 0.24657768658038992,
 0.2002834245709776,
 0.26136517179444296,
 0.25015834575067564,
 0.2531973104210793,
 0.28307464353024536,
 0.12135054611393714,
 0.24553969837004996,
 0.14681021652618242,
 0.13769078384092426,
 0.13845269668472526,
 0.17965674271228205,
 0.1809236299612318,
 0.2714574547942369,
 0.1260494153249253,
 0.1563324094356685,
 0.19416467138070903,
 0.1654855759364826,
 0.12616036209793935,
 0.10917599419597646,
 0.1350110297508319,
 0.1547895684352053,
 0.14725880112096135,
 0.24187123756338424,
 0.2129

In [80]:
df["distances"] = distances
df

Unnamed: 0,text,embeddings,distances
0,– 2022 (MMXXII) was a common year starting on...,"[8.00940779299708e-06, -0.01797325722873211, -...",0.290955
1,– The year began with another wave in the COV...,"[-0.004340707324445248, -0.01989074982702732, ...",0.280592
2,– 2022 was also dominated by wars and armed c...,"[-0.009628579951822758, -0.015318196266889572,...",0.185332
3,– The ongoing Russian invasion of Ukraine esc...,"[-0.014748847112059593, -0.007639929186552763,...",0.130489
4,January 1 – France takes over the Presidency ...,"[0.03009609319269657, -0.010577891953289509, -...",0.252289
...,...,...,...
202,December 24 – 2022 Fijian general election: Th...,"[-0.011724342592060566, -0.009384616278111935,...",0.248067
203,December 29 – Brazilian football legend Pelé d...,"[-0.007581493351608515, 0.0041535887867212296,...",0.287878
204,December 31 – Former Pope Benedict XVI dies at...,"[0.023394910618662834, 0.0076256790198385715, ...",0.293132
205,December 7 – The world population was estimate...,"[-0.0040730503387749195, -0.014438522979617119...",0.263978


In [67]:
# df.to_csv("distances.csv")

In [86]:
df_sorted = df.sort_values(by="distances", ascending=True).reset_index(drop=True)
df_sorted.to_csv("distances_sorted.csv")
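
For reference, cosine distance is one minus the cosine similarity of two vectors, so smaller values mean more similar text. A minimal pure-Python sketch of the metric and the ascending sort, in case `openai.embeddings_utils` is not available in the installed `openai` version. The helper names and the toy two-dimensional vectors are hypothetical, and this is not the library's implementation:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction.

    Assumes neither vector is all zeros (that would divide by zero).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, candidates):
    """Return (distance, index) pairs sorted from most to least similar."""
    scored = [(cosine_distance(query, vec), i) for i, vec in enumerate(candidates)]
    return sorted(scored)


# Toy vectors for illustration only
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_by_distance(query_vec, candidate_vecs)
print([index for _, index in ranking])
```

The vector pointing the same way as the query ranks first regardless of its length, because cosine distance ignores magnitude.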

## Step 3

### Tokenizing with `tiktoken`

In [44]:
import tiktoken

In [48]:
tokenizer = tiktoken.get_encoding("cl100k_base")

In [49]:
tokenizer

<Encoding 'cl100k_base'>

In [50]:
tokenizer.encode("This is a question")

[2028, 374, 264, 3488]

In [51]:
tokenizer.encode(question)

[4599, 1550, 8524, 75272, 19278, 30]

In [52]:
len(tokenizer.encode(question))

6

### Composing a Custom Text Prompt

In [53]:
prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

In [54]:
question = "When did Russia invade Ukraine?"

In [55]:
print(prompt_template.format("context", question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

context

---

Question: When did Russia invade Ukraine?
Answer:


In [56]:
max_token_count = 1000

In [58]:
current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
current_token_count

47

In [59]:
context = []

In [89]:
df = pd.read_csv("distances_sorted.csv", index_col=0)
df

Unnamed: 0,text,embeddings,distances
0,March 2 – Russian invasion of Ukraine: Russia ...,[ 0.00081407 -0.01842358 0.0125922 ... 0.01...,0.109176
1,April 3 – Russian invasion of Ukraine: As Russ...,[-0.01213661 -0.0124027 0.00538684 ... 0.01...,0.111381
2,November 11 – Russian invasion of Ukraine: Ukr...,[-0.0122911 -0.01395855 0.0166113 ... 0.02...,0.115622
3,September 21 – Russian invasion of Ukraine: Fo...,[-0.02549138 -0.02205592 0.01317991 ... 0.00...,0.116824
4,May 16 – Russian invasion of Ukraine: The Sieg...,[-0.01833553 -0.00646986 0.01102178 ... 0.00...,0.119223
...,...,...,...
202,December 31 – Former Pope Benedict XVI dies at...,[ 0.02339491 0.00762568 -0.0130245 ... 0.00...,0.293132
203,"November 11 – The cryptocurrency exchange FTX,...",[ 2.81478046e-03 -2.54106168e-02 7.51158441e-...,0.294008
204,November 20 – 2022 Nepalese general election: ...,[-0.00426018 -0.00068145 -0.0016189 ... -0.00...,0.295567
205,April 4 – The Intergovernmental Panel on Clima...,[-0.01045529 -0.01644811 0.0026733 ... 0.01...,0.298559


In [90]:
for text in df["text"].values:
    text_token_count = len(tokenizer.encode(text))
    current_token_count += text_token_count
    
    if current_token_count <= max_token_count:
        context.append(text)
    else:
        break
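
The loop above adds rows in order of increasing distance until the token budget is used up. A minimal sketch of the same idea as a reusable function. `build_context` and its `count_tokens` parameter are hypothetical names, and the whitespace-based counter in the usage lines is only a stand-in for `len(tokenizer.encode(text))`:

```python
def build_context(texts, count_tokens, max_token_count, used_token_count=0):
    """Collect texts in order until adding one more would exceed the budget.

    texts is assumed to be sorted with the most relevant entry first.
    used_token_count covers tokens already spent on the template and question.
    """
    context = []
    current = used_token_count
    for text in texts:
        current += count_tokens(text)
        if current > max_token_count:
            break
        context.append(text)
    return context


# Stand-in counter: one "token" per whitespace-separated word
def word_count(text):
    return len(text.split())


sample_texts = ["one two three", "four five", "six seven eight nine"]
selected = build_context(sample_texts, word_count, max_token_count=6, used_token_count=1)
print(selected)
```

Keeping the running count local to the function means the counter and the `context` list do not carry over when a notebook cell is re-run.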

In [91]:
context

['March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.',
 "April 3 – Russian invasion of Ukraine: As Russia's forces retreat from areas near Kyiv, it is accused by Ukraine of war crimes, amid mounting evidence of indiscriminate civilian killings, including the Bucha massacre.",
 'November 11 – Russian invasion of Ukraine: Ukrainian forces recapture Kherson, the only regional capital to be taken by Russia since the start of the war.',
 'September 21 – Russian invasion of Ukraine: Following a major counteroffensive by Ukraine in the east of the country, Putin announces a partial mobilisation of Russia and threatens nuclear retaliation, saying "this is not a bluff".',
 'May 16 – Russian invasion of Ukraine: The Siege of Mariupol ends in a Russian victory as Ukrainian troops are evacuated from Mariupol.',
 'October 8 – Russian invasion of Ukraine: An explos

In [92]:
print(prompt_template.format(context, question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

['March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.', "April 3 – Russian invasion of Ukraine: As Russia's forces retreat from areas near Kyiv, it is accused by Ukraine of war crimes, amid mounting evidence of indiscriminate civilian killings, including the Bucha massacre.", 'November 11 – Russian invasion of Ukraine: Ukrainian forces recapture Kherson, the only regional capital to be taken by Russia since the start of the war.', 'September 21 – Russian invasion of Ukraine: Following a major counteroffensive by Ukraine in the east of the country, Putin announces a partial mobilisation of Russia and threatens nuclear retaliation, saying "this is not a bluff".', 'May 16 – Russian invasion of Ukraine: The Siege of Ma

In [95]:
print(prompt_template.format("\n\n###\n\n".join(context), question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.

###

April 3 – Russian invasion of Ukraine: As Russia's forces retreat from areas near Kyiv, it is accused by Ukraine of war crimes, amid mounting evidence of indiscriminate civilian killings, including the Bucha massacre.

###

November 11 – Russian invasion of Ukraine: Ukrainian forces recapture Kherson, the only regional capital to be taken by Russia since the start of the war.

###

September 21 – Russian invasion of Ukraine: Following a major counteroffensive by Ukraine in the east of the country, Putin announces a partial mobilisation of Russia and threatens nuclear retaliation, saying "this is not a bluff".

###

May 16 – Russian invasion of Ukraine: The S

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`

In [107]:
openai.Completion.create( 
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_template.format("\n\n###\n\n".join(context), question)
)["choices"][0]["text"]


' February 21 – February 24.'
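
The prompt composition and the completion call can be gathered into one function. A minimal sketch, assuming `context` has already been selected within the token budget. `answer_question` and its `complete` parameter are hypothetical names, and the stub below returns a fixed string so the example runs without an API key. In the notebook, `complete` would wrap the `openai.Completion.create` call shown above and return the `"text"` of the first choice.

```python
PROMPT_TEMPLATE = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""


def answer_question(question, context, complete):
    """Format the prompt from context and question, then call complete()."""
    prompt = PROMPT_TEMPLATE.format("\n\n###\n\n".join(context), question)
    # Completions often start with a space, so trim surrounding whitespace
    return complete(prompt).strip()


# Stub so the sketch runs offline; it does not call a real model
captured = []

def fake_complete(prompt):
    captured.append(prompt)
    return " stub answer"


result = answer_question(
    "When did Russia invade Ukraine?",
    ["First relevant line.", "Second relevant line."],
    complete=fake_complete,
)
print(result)
```

Passing the completion function in as an argument keeps the prompt logic testable without network access.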