# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

In [1]:
import requests

In [2]:
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}

response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
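# formatversion=2 returns "pages" as a list, so the first page is at index 0;
# the extract is split into a list of lines, one per paragraph or heading
extract_lines = response.json()["query"]["pages"][0]["extract"].split("\n")
extract_lines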

['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  ',
 'The year saw the removal of nearly all COVID-19 restrictions and the reopening of international borders in most countries, while the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022, though the year also witnessed numerous natural disasters, including two devastating Atlantic hurricanes (Fiona and Ian), and the most powerful volcano eruption of the century so far. The later part of the year also saw the first public release of ChatGPT by OpenAI starting an arms race in artifici
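
The indexing in the cell above works because `formatversion=2` makes the API return `pages` as a list. The sketch below isolates that extraction step so it can be tried without a network call. `extract_lines_from` and `sample_payload` are hypothetical names introduced here, and the payload is a hand-made stand-in, not real API output.

```python
def extract_lines_from(payload):
    """Return the plain-text extract of the first page, split into lines."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages:
        return []
    # A page without an "extract" key yields no lines
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []


# Hand-made stand-in for response.json(); assumption, not real API output
sample_payload = {
    "query": {
        "pages": [
            {
                "title": "2022",
                "extract": "First paragraph.\n\n== Events ==\nJanuary 1 – Something happens.",
            }
        ]
    }
}

lines = extract_lines_from(sample_payload)
print(lines)
```

Guarding against a missing `pages` list or `extract` key avoids a `KeyError` or `IndexError` if the title does not resolve to a page with text.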

## Step 1

### Loading the Data with `pandas`

In [3]:
import pandas as pd

In [4]:
df = pd.DataFrame()
df["text"] = response.json()["query"]["pages"][0]["extract"].split("\n")
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
3,
4,
...,...
253,
254,== Nobel Prizes ==
255,
256,


In [5]:
# Drop empty rows (blank lines in the extract)
df = df[df["text"].str.len() > 0]
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
5,== Events ==
8,=== January ===
...,...
247,== Demographics ==
248,The world population was estimated to have rea...
251,== Deaths ==
254,== Nobel Prizes ==


In [6]:
# Drop section headings such as "== Events ==" and "=== January ==="
df = df[~df["text"].str.startswith("==")]
df

Unnamed: 0,text
0,2022 (MMXXII) was a common year starting on Sa...
1,The year saw the removal of nearly all COVID-1...
2,2022 was also dominated by wars and armed conf...
9,January 1 – The Regional Comprehensive Econom...
10,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
241,December 21–December 26 – A major winter storm...
242,December 24 – 2022 Fijian general election: Th...
243,December 29 – Brazilian football legend Pelé d...
244,December 31 – Former Pope Benedict XVI dies at...
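
The two filters above can be checked on a small example. This is a minimal sketch that uses a plain Python list instead of a DataFrame; the `lines` sample is made up for illustration.

```python
# Made-up sample mimicking the structure of the Wikipedia extract
lines = [
    "2022 was a common year.",
    "",
    "== Events ==",
    "=== January ===",
    "January 1 – An event.",
    "",
]

# Step 1: drop empty lines (mirrors df["text"].str.len() > 0)
non_empty = [line for line in lines if len(line) > 0]

# Step 2: drop wiki section headings (mirrors ~df["text"].str.startswith("=="))
body = [line for line in non_empty if not line.startswith("==")]

print(body)
```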


In [7]:
df.tail(15)

Unnamed: 0,text
228,November 21 – A 5.6 earthquake strikes near Ci...
229,"November 30 – OpenAI releases ChatGPT, an arti..."
233,December 2 – The G7 and Australia join the EU ...
234,December 5 – The National Ignition Facility ac...
235,December 7
236,The Congress of Peru removes President Pedro C...
237,After substantial protests against China's Zer...
238,December 17 – Leo Varadkar succeeds Micheál Ma...
239,December 18 – Argentina wins the 2022 FIFA Wor...
240,December 19 – At the UN Biodiversity Conferenc...


In [8]:
# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates

from dateutil.parser import parse
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row yielded by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]

# Keep only rows that now carry a " – " separator; this drops bare date headings
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
df

Unnamed: 0,text
0,– 2022 (MMXXII) was a common year starting on...
1,– The year saw the removal of nearly all COVI...
2,– 2022 was also dominated by wars and armed c...
3,January 1 – The Regional Comprehensive Econom...
4,January 2 – Abdalla Hamdok resigns as Prime Mi...
...,...
179,December 21–December 26 – A major winter storm...
180,December 24 – 2022 Fijian general election: Th...
181,December 29 – Brazilian football legend Pelé d...
182,December 31 – Former Pope Benedict XVI dies at...


In [9]:
df.tail(15)

Unnamed: 0,text
169,November 20 – 2022 Nepalese general election: ...
170,November 21 – A 5.6 earthquake strikes near Ci...
171,"November 30 – OpenAI releases ChatGPT, an arti..."
172,December 2 – The G7 and Australia join the EU ...
173,December 5 – The National Ignition Facility ac...
174,December 7 – The Congress of Peru removes Pres...
175,December 7 – After substantial protests agains...
176,December 17 – Leo Varadkar succeeds Micheál Ma...
177,December 18 – Argentina wins the 2022 FIFA Wor...
178,December 19 – At the UN Biodiversity Conferenc...
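
The date-prefix logic can be tried in isolation on a short list. This is a sketch under stated assumptions: `is_date_heading` and `add_date_prefixes` are hypothetical helpers, and the date check uses the standard library's `strptime` with a month-name-plus-day pattern, which is stricter than the `dateutil` parser used in the notebook.

```python
from datetime import datetime


def is_date_heading(text):
    """Return True if text looks like 'December 7' (month name plus day)."""
    try:
        # A fixed year is appended only so the string parses unambiguously
        datetime.strptime(text + " 2022", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines, sep=" – "):
    """Prepend the most recent date heading to lines that lack a date."""
    result = []
    prefix = ""
    for line in lines:
        if sep in line:
            # Already dated; keep as-is
            result.append(line)
        elif is_date_heading(line):
            # Bare date heading; remember it and drop the heading itself
            prefix = line
        else:
            result.append(prefix + sep + line)
    return result


sample = [
    "January 1 – Event A.",
    "December 7",
    "The Congress of Peru removes the president.",
    "China eases restrictions.",
    "December 17 – Event B.",
]
for line in add_date_prefixes(sample):
    print(line)
```

As in the notebook, lines that appear before any date heading receive an empty prefix followed by the separator.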


### Creating an Embeddings Index with `openai.Embedding`

In [10]:
import openai
openai.api_key = "xxx"

In [11]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [12]:
type(response)

openai.openai_object.OpenAIObject

In [13]:
response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [14]:
type(response["data"])

list

In [15]:
response["data"][0]

<OpenAIObject embedding at 0x7e54f9fc3e00> JSON: {
  "embedding": [
    3.178619226673618e-05,
    -0.017844218760728836,
    -0.017203785479068756,
    -0.012218455784022808,
    -0.009562542662024498,
    -0.001151366508565843,
    -0.008400972932577133,
    0.013285843655467033,
    -0.02247793972492218,
    -0.00208925548940897,
    0.0300877895206213,
    0.024411795660853386,
    -0.01889904960989952,
    -0.0169526357203722,
    -0.002043734537437558,
    0.004580351058393717,
    0.018534881994128227,
    -0.004589769057929516,
    0.021234747022390366,
    0.009656723588705063,
    0.002326278481632471,
    0.009870201349258423,
    -0.0021834366489201784,
    -0.013926276937127113,
    0.009939268231391907,
    0.01751772314310074,
    0.004445357713848352,
    -0.0049978881143033504,
    0.035788897424936295,
    -0.022967683151364326,
    -0.0001852232526289299,
    -0.010912474244832993,
    -0.014955991879105568,
    -0.016965193673968315,
    0.0034125028178095818,
    -

In [16]:
response["data"][0]["embedding"]

[3.178619226673618e-05,
 -0.017844218760728836,
 -0.017203785479068756,
 -0.012218455784022808,
 -0.009562542662024498,
 -0.001151366508565843,
 -0.008400972932577133,
 0.013285843655467033,
 -0.02247793972492218,
 -0.00208925548940897,
 0.0300877895206213,
 0.024411795660853386,
 -0.01889904960989952,
 -0.0169526357203722,
 -0.002043734537437558,
 0.004580351058393717,
 0.018534881994128227,
 -0.004589769057929516,
 0.021234747022390366,
 0.009656723588705063,
 0.002326278481632471,
 0.009870201349258423,
 -0.0021834366489201784,
 -0.013926276937127113,
 0.009939268231391907,
 0.01751772314310074,
 0.004445357713848352,
 -0.0049978881143033504,
 0.035788897424936295,
 -0.022967683151364326,
 -0.0001852232526289299,
 -0.010912474244832993,
 -0.014955991879105568,
 -0.016965193673968315,
 0.0034125028178095818,
 -0.028932498767971992,
 -0.013813259080052376,
 -0.0017878751968964934,
 0.004815804306417704,
 -0.012287521734833717,
 0.01210543792694807,
 0.021134287118911743,
 0.0128400521

In [17]:
len(response["data"][0]["embedding"])

1536

In [18]:
embeddings = [data["embedding"] for data in response['data']]
embeddings

[[3.178619226673618e-05,
  -0.017844218760728836,
  -0.017203785479068756,
  -0.012218455784022808,
  -0.009562542662024498,
  -0.001151366508565843,
  -0.008400972932577133,
  0.013285843655467033,
  -0.02247793972492218,
  -0.00208925548940897,
  0.0300877895206213,
  0.024411795660853386,
  -0.01889904960989952,
  -0.0169526357203722,
  -0.002043734537437558,
  0.004580351058393717,
  0.018534881994128227,
  -0.004589769057929516,
  0.021234747022390366,
  0.009656723588705063,
  0.002326278481632471,
  0.009870201349258423,
  -0.0021834366489201784,
  -0.013926276937127113,
  0.009939268231391907,
  0.01751772314310074,
  0.004445357713848352,
  -0.0049978881143033504,
  0.035788897424936295,
  -0.022967683151364326,
  -0.0001852232526289299,
  -0.010912474244832993,
  -0.014955991879105568,
  -0.016965193673968315,
  0.0034125028178095818,
  -0.028932498767971992,
  -0.013813259080052376,
  -0.0017878751968964934,
  0.004815804306417704,
  -0.012287521734833717,
  0.01210543792694

In [19]:
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[3.178619226673618e-05, -0.017844218760728836,..."
1,– The year saw the removal of nearly all COVI...,"[-0.009815667755901814, -0.021536769345402718,..."
2,– 2022 was also dominated by wars and armed c...,"[-0.009612335823476315, -0.015313445590436459,..."
3,January 1 – The Regional Comprehensive Econom...,"[-0.0005963737494312227, -0.024176515638828278..."
4,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[-0.015023498795926571, 0.0012224462116137147,..."
...,...,...
179,December 21–December 26 – A major winter storm...,"[-0.024856366217136383, -0.023884402588009834,..."
180,December 24 – 2022 Fijian general election: Th...,"[-0.011570309288799763, -0.009243391454219818,..."
181,December 29 – Brazilian football legend Pelé d...,"[-0.007621751632541418, 0.004138795658946037, ..."
182,December 31 – Former Pope Benedict XVI dies at...,"[0.023392874747514725, 0.007689744234085083, -..."


In [20]:
df.to_csv("embeddings.csv")
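
One thing to keep in mind when saving list-valued columns: CSV has no list type, so each embedding is written as its string representation and comes back as a string when the file is read. The sketch below shows that round trip with the standard library's `csv` module rather than pandas; the row contents are made up, and `ast.literal_eval` is one way (an assumption here, not the only option) to turn the string back into a list.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["January 1 – An event.", [0.1, -0.2, 0.3]])  # list is stored via str()

# Read the row back: the list is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["embeddings"]
print(type(raw).__name__, raw)

# Parse the string back into a list of floats
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed)
```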

## Step 2

### Finding Relevant Data with Cosine Similarity

In [21]:
import ast
import numpy as np
import pandas as pd
df = pd.read_csv("embeddings.csv", index_col=0)
# The CSV stores each embedding as a string; parse it back into a NumPy array
# and overwrite the "embeddings" column in place
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
df

Unnamed: 0,text,embeddings
0,– 2022 (MMXXII) was a common year starting on...,"[3.178619226673618e-05, -0.017844218760728836,..."
1,– The year saw the removal of nearly all COVI...,"[-0.009815667755901814, -0.021536769345402718,..."
2,– 2022 was also dominated by wars and armed c...,"[-0.009612335823476315, -0.015313445590436459,..."
3,January 1 – The Regional Comprehensive Econom...,"[-0.0005963737494312227, -0.024176515638828278..."
4,January 2 – Abdalla Hamdok resigns as Prime Mi...,"[-0.015023498795926571, 0.0012224462116137147,..."
...,...,...
179,December 21–December 26 – A major winter storm...,"[-0.024856366217136383, -0.023884402588009834,..."
180,December 24 – 2022 Fijian general election: Th...,"[-0.011570309288799763, -0.009243391454219818,..."
181,December 29 – Brazilian football legend Pelé d...,"[-0.007621751632541418, 0.004138795658946037, ..."
182,December 31 – Former Pope Benedict XVI dies at...,"[0.023392874747514725, 0.007689744234085083, -..."


In [22]:
question = "when did Russia invade Ukraine?"

In [23]:
import openai
openai.api_key = "xxx"

In [24]:
from openai.embeddings_utils import get_embedding

In [25]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
question_embeddings = get_embedding(question, engine = EMBEDDING_MODEL_NAME)
question_embeddings

[-0.0016771359369158745,
 -0.0193809662014246,
 0.00020256849529687315,
 -0.0164154302328825,
 -0.021769141778349876,
 0.0048583634197711945,
 -0.01670411042869091,
 -0.027424657717347145,
 -0.01859365589916706,
 -0.024721557274460793,
 0.01888233609497547,
 0.026112474501132965,
 -0.011527542024850845,
 -0.010602451860904694,
 -0.010576208122074604,
 -0.008443907834589481,
 0.006636373698711395,
 -0.0005006804130971432,
 0.031072530895471573,
 -0.01893482357263565,
 -0.01064181700348854,
 -0.013856669887900352,
 0.0028556417673826218,
 -0.006987383123487234,
 -0.012314852327108383,
 0.0057145641185343266,
 0.015234462916851044,
 -0.027897045016288757,
 0.010549964383244514,
 -0.012806922197341919,
 -0.013686085119843483,
 -0.023409374058246613,
 -0.015142610296607018,
 -0.016900938004255295,
 -0.03267339617013931,
 -0.03681990131735802,
 0.009854506701231003,
 -0.008601370267570019,
 0.014998270198702812,
 0.0014835887122899294,
 0.007879668846726418,
 0.013764816336333752,
 -0.003331

In [26]:
from openai.embeddings_utils import distances_from_embeddings

In [29]:
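# Parse any embedding still stored as a string (the cause of "ValueError: Input vector should be 1-D.")
vectors = [np.array(eval(e)) if isinstance(e, str) else e for e in df["embeddings"]]
distances = distances_from_embeddings(question_embeddings, vectors, distance_metric="cosine")
distances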
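
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine distance, defined as one minus the cosine similarity of two vectors. The `cosine_distance` function and the tiny two-dimensional vectors are illustrative assumptions; the notebook itself relies on the `openai.embeddings_utils` helper and 1536-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for the question and document embeddings
question_vec = [1.0, 0.0]
doc_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

toy_distances = {name: cosine_distance(question_vec, vec) for name, vec in doc_vecs.items()}

# Smaller distance means more similar, so sort ascending to rank by relevance
ranked = sorted(toy_distances, key=toy_distances.get)
print(ranked)
```

Because cosine distance ignores vector length, the vector pointing the same way as the question ranks first even though its magnitude differs.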

## Step 3

### Tokenizing with `tiktoken`

### Composing a Custom Text Prompt
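
One possible shape for this step is sketched below: add the most relevant text rows to the prompt until a token budget is used up. Everything here is an assumption for illustration. `count_tokens` is a whitespace word count standing in for a real tokenizer such as `tiktoken`, and `PROMPT_TEMPLATE`, `build_prompt`, and the budget values are hypothetical.

```python
def count_tokens(text):
    # Stand-in: whitespace word count, NOT a real tokenizer
    return len(text.split())


# Hypothetical template; the two {} slots take the context and the question
PROMPT_TEMPLATE = """Answer the question based on the context below.

Context:

{}

---

Question: {}
Answer:"""


def build_prompt(question, ranked_texts, max_tokens):
    """Fill the template with as many ranked texts as fit in max_tokens."""
    used = count_tokens(PROMPT_TEMPLATE) + count_tokens(question)
    context = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            # Stop at the first text that would exceed the budget
            break
        context.append(text)
        used += cost
    return PROMPT_TEMPLATE.format("\n\n###\n\n".join(context), question)


# Texts assumed to be sorted from most to least relevant
ranked_texts = [
    "February 24 – Russia invades Ukraine.",
    "December 18 – Argentina wins the 2022 FIFA World Cup.",
]
prompt = build_prompt("when did Russia invade Ukraine?", ranked_texts, max_tokens=30)
print(prompt)
```

Swapping `count_tokens` for a real tokenizer's count would keep the rest of the logic unchanged.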

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`