# Step 2 - Finding Relevant Data

Add your API key to the cell below then run it.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = ""

The code below loads in the embeddings you previously created. Run it as-is.

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv("embeddings.csv", index_col=0)
df["embeddings"] = df["embeddings"].apply(eval).apply(np.array)
df

Unnamed: 0,text,embeddings
0,"On 6 February 2023, at 04:17 TRT (01:17 UTC), ...","[-0.007865791209042072, -0.01488738413900137, ..."
1,The Mw 7.8 earthquake is the largest in Turkey...,"[0.00019888307724613696, -0.022314351052045822..."
2,There was widespread damage in an area of abou...,"[-0.003678650129586458, -0.020112549886107445,..."
3,Central southern Turkey and northwestern Syria...,"[-0.005976187530905008, -0.011475914157927036,..."
4,The EAF has produced large or damaging earthqu...,"[0.0002380282385274768, -0.02387528494000435, ..."
...,...,...
96,The International Seismological Centre has a b...,"[-0.004583664704114199, -0.009662682190537453,..."
97,The International Seismological Centre has a b...,"[-0.004807258490473032, -0.01684679090976715, ..."
98,"Erdik, M., Tümsa, M. B. D., Pınar, A., Altunel...","[-0.006729048676788807, -0.04049292206764221, ..."
99,"""Kahramanmaraş Supersite science page"". Group ...","[0.0036264623049646616, -0.007848413661122322,..."
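
The `.apply(eval)` step above is needed because a CSV file stores each embedding as a string rather than a list. The sketch below shows that round trip on toy data. The `toy` DataFrame and the in-memory `buffer` are made up for illustration, and `ast.literal_eval` is swapped in as a safer stand-in for `eval`; it is not what the notebook cell uses.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy stand-in for embeddings.csv: two rows with 3-dimensional vectors.
toy = pd.DataFrame({
    "text": ["first chunk", "second chunk"],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Writing to CSV turns each list into its string representation.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

loaded = pd.read_csv(buffer, index_col=0)
print(type(loaded["embeddings"].iloc[0]))  # <class 'str'>

# Parse each string back into a list, then convert it to a NumPy array.
loaded["embeddings"] = (
    loaded["embeddings"].apply(ast.literal_eval).apply(np.array)
)
print(loaded["embeddings"].iloc[0].shape)  # (3,)
```

`ast.literal_eval` only accepts Python literals, so it would reject arbitrary code hidden in a CSV cell, which plain `eval` would execute.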


## TODO 1: Create Embeddings for the User's Question

In the previous exercise, you were given the code to create embeddings for everything in the dataset. Now your task is to create embeddings for just one string: the user's question. Assign the result to the variable `question_embeddings`. This variable should contain a list of 1,536 floating-point numbers; once it has been created correctly, the provided code prints the first 100.
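
To illustrate where the vector sits inside the API response, here is a sketch that uses a hand-written stand-in instead of a real API call. `mock_response` and `mock_embedding` are hypothetical names, the values are made up, and the vector is shortened to 3 numbers; a real response carries additional fields and a full-length vector.

```python
# Hand-written stand-in for the API response (values are made up and
# the vector is shortened to 3 numbers instead of 1,536).
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]},
    ],
    "model": "text-embedding-ada-002",
}

# One input string yields one item in "data", so index 0 holds
# the vector for the question.
mock_embedding = mock_response["data"][0]["embedding"]
print(len(mock_embedding))  # 3
```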

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
USER_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""

# Generate the embedding response
response = openai.Embedding.create(
    input=USER_QUESTION,
    engine=EMBEDDING_MODEL_NAME
)

# Extract the embeddings from the response
question_embeddings = response["data"][0]["embedding"]

print(question_embeddings[:100])
```

</details>

In [3]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
USER_QUESTION = """What were the estimated damages of the 2023 \
Turkey-Syria earthquake?"""

# Generate the embedding response
response = openai.Embedding.create(
    input = USER_QUESTION,
    model = EMBEDDING_MODEL_NAME)

# Extract the embeddings from the response
question_embeddings = response['data'][0]['embedding']

print(question_embeddings[:100])

[0.00553178321570158, -0.024947911500930786, 0.002177121117711067, -0.012254414148628712, -0.02144855074584484, 0.002601235406473279, -0.03387593477964401, -0.013079358264803886, 0.0024498847778886557, -0.014982052147388458, 0.0162061620503664, 0.044413935393095016, -0.010231969878077507, -0.013179149478673935, 0.01508849672973156, -0.005395401734858751, 0.012400775216519833, -0.013039441779255867, 0.008202873170375824, -0.005724714137613773, -0.007444456685334444, 0.011476038955152035, 0.012387469410896301, -0.009120956063270569, 0.015194941312074661, 0.03209299221634865, 0.008049859665334225, -0.0023667251225560904, -0.0008465657592751086, -0.012141316197812557, 0.005505172535777092, -0.006552984472364187, -0.023524217307567596, 0.015261468477547169, -0.03246554732322693, -0.0075309425592422485, 0.006865664850920439, -0.004883137531578541, 0.023231495171785355, -0.0078436229377985, 0.010112219490110874, 0.028553714975714684, 0.012959607876837254, -0.007697261869907379, -0.01757663488

## TODO 2: Find Cosine Distances

Create a new list called `distances`, which represents the cosine distances between `question_embeddings` and each value in the `'embeddings'` column of `df`.
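
For intuition, cosine distance is commonly defined as 1 minus the cosine similarity of two vectors. The sketch below computes it by hand with NumPy on tiny 2-dimensional vectors. `cosine_distance` is a hypothetical helper written for this illustration, not part of the `openai` library.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # at a right angle to the query
opposite = [-1.0, 0.0]        # pointing the other way

print(cosine_distance(query, same_direction))  # 0.0
print(cosine_distance(query, orthogonal))      # 1.0
print(cosine_distance(query, opposite))        # 2.0
```

Only the direction of the vectors matters, not their length, which is why `same_direction` has a distance of 0 from `query`.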

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
from openai.embeddings_utils import distances_from_embeddings

# Create a list containing the distances from question_embeddings
distances = distances_from_embeddings(
    question_embeddings,
    df["embeddings"],
    distance_metric="cosine"
)

print(distances[:100])
```

</details>

In [4]:
from openai.embeddings_utils import distances_from_embeddings

# Create a list containing the distances from question_embeddings
distances = distances_from_embeddings(
    question_embeddings,
    df['embeddings'].tolist(),
    distance_metric = 'cosine')

print(distances[:100])

## Sorting by Distance

The code below adds the `distances` list to `df` as a new column, then sorts `df` to find the most related rows. A shorter distance means greater similarity, so we'll sort in ascending order. Run the cell below as-is.

In [None]:
df["distances"] = distances
df.sort_values(by="distances", ascending=True, inplace=True)
df.head(5)
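
To see the same sort on a small scale, here is a sketch with made-up data. The `toy` DataFrame and the `toy_distances` values are hypothetical and unrelated to the earthquake dataset.

```python
import pandas as pd

# Three toy rows and a made-up distance for each one.
toy = pd.DataFrame({"text": ["a", "b", "c"]})
toy_distances = [0.31, 0.12, 0.27]

# Attach the distances as a column, then sort smallest-first.
toy["distances"] = toy_distances
toy.sort_values(by="distances", ascending=True, inplace=True)

print(toy["text"].tolist())  # ['b', 'c', 'a']
print(toy.index.tolist())    # [1, 2, 0]
```

The index labels travel with their rows, so each row's original position in the dataset is still visible after sorting.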

Great, now we have the dataset sorted from most relevant to least relevant! Let's save this as a CSV so we can load it in the next step and compose a custom prompt.

Run the cell below as-is.

In [None]:
df.to_csv("distances.csv")