# Codealong Notebook

Use this notebook as your "scratch pad" as you go through the course contents. Feel free to copy any example code and tweak it to get a better understanding of how it works!

Use the **+** button or `Insert` menu to add additional code cells as needed.

In [65]:
# Preparation code

import os

# USERNAME is not set on every platform, so fall back to an empty string
if "A306709" in os.environ.get("USERNAME", ""):
    print("Running on Christoph's computer: updating proxy settings.")
    os.environ["http_proxy"] = "http://sia-lb.telekom.de:8080"
    os.environ["https_proxy"] = "http://sia-lb.telekom.de:8080"
else:
    print("Not running on Christoph's computer: leaving proxy settings unchanged.")

import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = os.getenv("OPENAI_API_KEY")

ukraine_question = "When did Russia invade Ukraine?"
# f-string so the question is actually interpolated into the prompt
ukraine_prompt = f"""
Question: {ukraine_question}
Answer:
"""

# Without a prompt, the model would just ramble:
# openai.Completion.create(model="gpt-3.5-turbo-instruct")

# Commented out to avoid repeated API calls; uncomment to run.
#ukraine_answer = openai.Completion.create(
#    model="gpt-3.5-turbo-instruct",
#    prompt=ukraine_prompt
#)
#ukraine_answer


Running on Christoph's computer: updating proxy settings.


## Step 1

### Loading the Data with `pandas`

In [66]:
import requests

# Project idea: https://en.wikipedia.org/wiki/Finite_sphere_packing

# Get the Wikipedia page for "2022", since the training data of the
# OpenAI model used here ends in 2021
params = {
    "action": "query",
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2022",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()
response_dict["query"]["pages"][0]["extract"].split("\n")


['2022 (MMXXII) was a common year starting on Saturday of the Gregorian calendar, the 2022nd year of the Common Era (CE) and Anno Domini (AD) designations, the 22nd  year of the 3rd millennium and the 21st century, and the  3rd   year of the 2020s decade.  ',
 'The year began with another wave in the COVID-19 pandemic, with Omicron spreading rapidly and becoming the dominant variant of the SARS-CoV-2 virus worldwide. Tracking a decrease in cases and deaths, 2022 saw the removal of most COVID-19 restrictions and the reopening of international borders in the vast majority of countries, while the global rollout of COVID-19 vaccines continued. The global economic recovery from the pandemic continued, though many countries experienced an ongoing inflation surge; in response, many central banks raised their interest rates to landmark levels. The world population reached eight billion people in 2022. The year also witnessed numerous natural disasters, including two devastating Atlantic hurric
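
The cell above indexes straight into `response_dict["query"]["pages"][0]["extract"]`, which raises a `KeyError` or `IndexError` if the response has a different shape. A minimal sketch of a more defensive extraction follows. The helper name `extract_paragraphs` and the miniature `sample` response are hypothetical, and the sketch assumes the `formatversion=2` layout requested above.

```python
def extract_paragraphs(response_dict):
    """Return the non-empty lines of the first page's plain-text extract."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages or "extract" not in pages[0]:
        # Nothing to extract if the page list is empty or carries no text
        return []
    return [line for line in pages[0]["extract"].split("\n") if line.strip()]


# Hypothetical miniature response in the formatversion=2 shape
sample = {
    "query": {
        "pages": [
            {"title": "2022", "extract": "First line.\n\n== Events ==\nSecond line."}
        ]
    }
}
print(extract_paragraphs(sample))
```

Headings such as `== Events ==` are deliberately left in here; the notebook filters them out in the next cell.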

In [67]:
import pandas as pd
from dateutil.parser import parse  # copied from 4.10

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because changes to the iterrows() copy may not reach df
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)

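
To see the date-prefix logic in isolation, here is a small sketch that applies the same idea to a plain list instead of a dataframe. The function name `add_date_prefixes` and the three sample lines are made up for illustration; only `dateutil.parser.parse` comes from the cell above.

```python
from dateutil.parser import parse


def add_date_prefixes(lines, sep=" – "):
    """Prefix undated lines with the most recent date-only line."""
    prefix = ""
    result = []
    for line in lines:
        if sep in line:
            result.append(line)  # already starts with its date
            continue
        try:
            parse(line)          # raises if the line is not a date
            prefix = line        # date-only line: remember it, do not keep it
        except (ValueError, OverflowError):
            result.append(prefix + sep + line)
    return result


sample = [
    "January 1 – A dated event.",
    "March 2",
    "Russia captures its first large city.",
]
print(add_date_prefixes(sample))
```

Note that lines appearing before any date heading receive an empty prefix, which is why the first rows of the dataframe start with a bare "–".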


In [68]:
df
df.head()
df.tail()



|     | text |
| --- | --- |
| 193 | December 24 – 2022 Fijian general election: Th... |
| 194 | December 29 – Brazilian football legend Pelé d... |
| 195 | December 31 – Former Pope Benedict XVI dies at... |
| 196 | December 7 – The world population was estimate... |
| 197 | December 7 – Media related to 2022 at Wikimed... |


### Creating an Embeddings Index with `openai.Embedding`

In [69]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

df.to_csv("embeddings.csv")
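
The batching loop above can be separated from the API call, which makes it possible to try the slicing logic without spending tokens. The sketch below is one way to do that. `embed_in_batches` and `fake_embed_batch` are hypothetical names; in the notebook, the callable would wrap `openai.Embedding.create` and return the list of `embedding` values.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Call embed_batch on consecutive slices and collect the vectors in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            # Guard against silently misaligned rows in the dataframe
            raise ValueError("embedding count does not match batch size")
        embeddings.extend(vectors)
    return embeddings


def fake_embed_batch(batch):
    # Hypothetical stand-in for the API call: one 2-d vector per text
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```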

In [70]:
df


|     | text | embeddings |
| --- | --- | --- |
| 0 | – 2022 (MMXXII) was a common year starting on... | [-2.4516581106581725e-05, -0.01798415742814541... |
| 1 | – The year began with another wave in the COV... | [-0.004297760780900717, -0.01981227844953537, ... |
| 2 | – 2022 was also dominated by wars and armed c... | [-0.009628579951822758, -0.015318196266889572,... |
| 3 | – The ongoing Russian invasion of Ukraine esc... | [-0.014748847112059593, -0.007639929186552763,... |
| 4 | January 1 – The Regional Comprehensive Econom... | [-0.0005859995144419372, -0.024158434942364693... |
| ... | ... | ... |
| 193 | December 24 – 2022 Fijian general election: Th... | [-0.011724342592060566, -0.009384616278111935,... |
| 194 | December 29 – Brazilian football legend Pelé d... | [-0.007581493351608515, 0.0041535887867212296,... |
| 195 | December 31 – Former Pope Benedict XVI dies at... | [0.023601267486810684, 0.007735467050224543, -... |
| 196 | December 7 – The world population was estimate... | [-0.004104986786842346, -0.014414746314287186,... |


## Step 2

### Finding Relevant Data with Cosine Similarity

In [71]:
print(EMBEDDING_MODEL_NAME)
# Report only whether a key is configured; never print the key itself
print("API key set:", bool(openai.api_key))
print(openai.api_base)

from openai.embeddings_utils import get_embedding, distances_from_embeddings

question_embeddings = get_embedding(ukraine_question, engine=EMBEDDING_MODEL_NAME)
print(question_embeddings)


text-embedding-ada-002
API key set: True
https://openai.vocareum.com/v1
[0.0016044961521402001, -0.019282648339867592, 0.0034815892577171326, -0.013990121893584728, -0.02527659200131893, 0.001979914726689458, -0.013735060580074787, -0.024689950048923492, -0.013352468609809875, -0.02120836079120636, 0.022394398227334023, 0.02462618611752987, -0.009048305451869965, -0.011873112060129642, -0.006309583317488432, -0.010489403270184994, 0.010610557161271572, -0.003959829453378916, 0.03336204215884209, -0.01868325285613537, -0.01435996126383543, -0.01624741591513157, 0.0033349287696182728, 0.0013040017802268267, -0.014831825159490108, 0.006733622867614031, 0.013735060580074787, -0.029102513566613197, 0.015431219711899757, -0.014449232257902622, -0.01113981008529663, -0.022815249860286713, -0.020494189113378525, -0.016158144921064377, -0.035121966153383255, -0.03239280730485916, 0.009150330908596516, -0.009367132559418678, 0.015048626810312271, -0.0040618539787

In [72]:
# Cosine distance between the question and every text row (smaller = more similar)
distances = distances_from_embeddings(question_embeddings, df["embeddings"].tolist(), distance_metric="cosine")
df["distances"] = distances
df.to_csv("distances.csv")
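
For intuition about what the distance column holds: cosine distance is commonly defined as 1 minus the cosine similarity, and as far as I understand that is what `distance_metric="cosine"` computes. The NumPy sketch below implements that definition directly; the function name `cosine_distances` and the tiny 2-d vectors are illustrative only.

```python
import numpy as np


def cosine_distances(query, vectors):
    """Return 1 - cosine similarity between query and each row of vectors."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    similarities = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return 1.0 - similarities


query = [1.0, 0.0]
vectors = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
# Same direction, orthogonal, and opposite direction
print(cosine_distances(query, vectors))
```

Vector length does not matter, only direction: the first vector is twice as long as the query but still has distance 0.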


In [60]:
# Manual scan for the smallest distance; kept for reference only,
# since sorting the dataframe by distance is simpler:
#current_shortest = df.iloc[0]["distances"]
#current_shortest_index = 0
##print(current_shortest)
#for index, distance in enumerate(df["distances"].values):
#    #print(f'dist: {distance}')
#    if distance < current_shortest:
#        current_shortest_index = index
#        current_shortest = distance
#print(f'dist {current_shortest} at index {current_shortest_index}')
#print(df["text"][current_shortest_index])


dist 0.10922891508535504 at index 38
March 2 – Russian invasion of Ukraine: Russia captures its first large city, the Black Sea port of Kherson, as shelling intensifies across many parts of Ukraine, including civilian areas.


In [74]:
df.sort_values(by="distances", inplace=True)
df.to_csv("distances_sorted.csv")
df.head()


|     | text | embeddings | distances |
| --- | --- | --- | --- |
| 38 | March 2 – Russian invasion of Ukraine: Russia ... | [0.000781890528742224, -0.018386563286185265, ... | 0.109229 |
| 61 | April 3 – Russian invasion of Ukraine: As Russ... | [-0.012136607430875301, -0.012402704916894436,... | 0.111381 |
| 174 | November 11 – Russian invasion of Ukraine: Ukr... | [-0.01207408681511879, -0.014001138508319855, ... | 0.115526 |
| 145 | September 21 – Russian invasion of Ukraine: Fo... | [-0.02543523721396923, -0.022038765251636505, ... | 0.116915 |
| 87 | May 16 – Russian invasion of Ukraine: The Sieg... | [-0.018302420154213905, -0.00646719615906477, ... | 0.119179 |


## Step 3

### Tokenizing with `tiktoken`
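
This section is still empty, so here is a possible starting point rather than the course's solution. The idea is to count tokens per text row and keep the most relevant rows until a token budget is used up. `make_token_counter` wraps `tiktoken.get_encoding`; the choice of `cl100k_base` is an assumption and should be checked against the model actually used. `select_within_budget` is a hypothetical helper that works with any counting function, so the demo below uses a rough whitespace count and runs without `tiktoken`.

```python
def make_token_counter(encoding_name="cl100k_base"):
    """Return a function that counts tokens with tiktoken."""
    import tiktoken  # imported lazily so the helper below works without it

    encoding = tiktoken.get_encoding(encoding_name)
    return lambda text: len(encoding.encode(text))


def select_within_budget(texts, max_tokens, count_tokens):
    """Keep texts in order until adding the next one would exceed max_tokens."""
    selected, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected


# Whitespace counting is only a rough stand-in for real tokenization
rough_count = lambda text: len(text.split())
print(select_within_budget(["one two", "three four five", "six"], 5, rough_count))
```

In the notebook, `texts` would be `df["text"]` after sorting by distance, and `count_tokens` would be `make_token_counter()`.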

### Composing a Custom Text Prompt
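
A sketch of how the selected context rows and the question might be combined into one prompt. The template wording, the `###` separator, and the name `build_prompt` are assumptions chosen for illustration; any clear instruction that tells the model to rely on the context would serve the same purpose.

```python
PROMPT_TEMPLATE = """Answer the question based on the context below, and if the
question can't be answered based on the context, say "I don't know".

Context:

{context}

---

Question: {question}
Answer:"""


def build_prompt(question, context_texts):
    """Join context snippets and place them, with the question, into the template."""
    context = "\n\n###\n\n".join(context_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "When did Russia invade Ukraine?",
    ["March 2 – first snippet", "April 3 – second snippet"],
)
print(prompt)
```

Ending the prompt with `Answer:` mirrors the simple prompt in the preparation cell, so the completion model continues with the answer itself.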

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`
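
A possible final step, sketched with the pre-1.0 `openai.Completion.create` call that the preparation cell already uses in commented-out form. The helper names `completion_kwargs` and `ask`, and the default of 150 answer tokens, are assumptions. Only `completion_kwargs` runs in the demo line; `ask` needs a valid API key and is not called here.

```python
def completion_kwargs(prompt, max_answer_tokens=150):
    """Collect the arguments for a legacy openai.Completion.create call."""
    return {
        "model": "gpt-3.5-turbo-instruct",
        "prompt": prompt,
        "max_tokens": max_answer_tokens,
    }


def ask(prompt, max_answer_tokens=150):
    """Send the prompt and return the stripped answer text (needs an API key)."""
    import openai  # pre-1.0 client, as used elsewhere in this notebook

    response = openai.Completion.create(**completion_kwargs(prompt, max_answer_tokens))
    return response["choices"][0]["text"].strip()


print(completion_kwargs("Question: ...\nAnswer:")["model"])
```

Comparing `ask(ukraine_prompt)` with `ask()` on a context-enriched prompt would show whether the retrieved rows change the answer.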