# Step 1: Preparing a Dataset with Embeddings

Add your API key to the cell below, then run it.

In [1]:
import openai
openai.api_key = "YOUR API KEY"

## Loading the Data

We use the `requests` library ([documentation here](https://requests.readthedocs.io/en/latest/user/quickstart/)) to fetch the text of a Wikipedia page through the `extracts` API feature ([documentation here](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bextracts)). You can ignore the details of the `params` being sent. The important takeaway is that **`response_dict` is a Python dictionary containing the response to our query**.

Run the cell below as-is.

In [2]:
import requests

# Get the Wikipedia page for the 2023 Turkey–Syria earthquake
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023_Turkey–Syria_earthquake",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

In [3]:
response_dict

{'batchcomplete': True,
 'query': {'normalized': [{'fromencoded': False,
    'from': '2023_Turkey–Syria_earthquake',
    'to': '2023 Turkey–Syria earthquake'}],
  'pages': [{'pageid': 72964820,
    'ns': 0,
    'title': '2023 Turkey–Syria earthquake',
    'extract': '<!-- \nNewPP limit report\nParsed by mw‐api‐ext.eqiad.main‐5cb9485489‐nxg52\nCached time: 20240401153439\nCache expiry: 2592000\nReduced expiry: false\nComplications: [is‐preview]\nCPU time usage: 0.061 seconds\nReal time usage: 0.084 seconds\nPreprocessor visited node count: 213/1000000\nPost‐expand include size: 23695/2097152 bytes\nTemplate argument size: 9311/2097152 bytes\nHighest expansion depth: 12/100\nExpensive parser function count: 1/500\nUnstrip recursion depth: 0/20\nUnstrip post‐expand size: 1653/5000000 bytes\nLua time usage: 0.029/10.000 seconds\nLua memory usage: 854703/52428800 bytes\nNumber of Wikibase entities loaded: 0/400\n--><!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00%   71

### TODO: Parse `response_dict` to get a list of text data samples

Look at the nested data structure of `response_dict` and find the key-value pair whose key is `"extract"`. Its value is a string containing a long block of text. Split this text into a list of strings using the `"\n"` separator and assign the result to the variable `text_data`.

If you get stuck, click to reveal the solution, then copy and paste it into the cell below.

---

<details>
    <summary style="cursor: pointer"><strong>Solution (click to show/hide)</strong></summary>

```python
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
```

</details>

In [4]:
text_data = response_dict["query"]["pages"][0]["extract"].split("\n")
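
The one-line solution assumes every key is present. As a minimal sketch of a more defensive variant, the helper below (the name `extract_lines` and the miniature `sample` response are hypothetical, not part of the notebook) returns an empty list instead of raising `KeyError` or `IndexError` when the response has no pages or no extract.

```python
def extract_lines(response_dict):
    """Return the page extract split into lines, or [] if it is missing."""
    pages = response_dict.get("query", {}).get("pages", [])
    if not pages:
        return []
    # With formatversion=2, "pages" is a list; take the first page.
    extract = pages[0].get("extract", "")
    return extract.split("\n") if extract else []


# Hypothetical miniature response, shaped like the real one.
sample = {
    "query": {
        "pages": [
            {"title": "Example", "extract": "First line\n\n== Heading ==\nSecond line"}
        ]
    }
}
print(extract_lines(sample))
```

Calling `extract_lines(response_dict)` on the real response should yield the same list as the solution above whenever the extract is present.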

In [5]:
text_data

['<!-- ',
 'NewPP limit report',
 'Parsed by mw‐api‐ext.eqiad.main‐5cb9485489‐nxg52',
 'Cached time: 20240401153439',
 'Cache expiry: 2592000',
 'Reduced expiry: false',
 'Complications: [is‐preview]',
 'CPU time usage: 0.061 seconds',
 'Real time usage: 0.084 seconds',
 'Preprocessor visited node count: 213/1000000',
 'Post‐expand include size: 23695/2097152 bytes',
 'Template argument size: 9311/2097152 bytes',
 'Highest expansion depth: 12/100',
 'Expensive parser function count: 1/500',
 'Unstrip recursion depth: 0/20',
 'Unstrip post‐expand size: 1653/5000000 bytes',
 'Lua time usage: 0.029/10.000 seconds',
 'Lua memory usage: 854703/52428800 bytes',
 'Number of Wikibase entities loaded: 0/400',
 '--><!--',
 'Transclusion expansion time report (%,ms,calls,template)',
 '100.00%   71.567      1 Template:Redirect_category_shell',
 '100.00%   71.567      1 -total',
 ' 96.27%   68.900      1 Template:Mbox',
 ' 18.04%   12.909      3 Template:Redirect_template',
 ' 10.55%    7.553      
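
The lines shown here are parser-report comments rather than article prose, and the mention of `Template:Redirect_category_shell` suggests the requested title may now be a redirect. One possible adjustment, offered as an assumption rather than a confirmed fix, is to add the MediaWiki `redirects` query parameter so the API resolves redirects before returning the extract. The helper name `build_extract_params` is hypothetical; the other parameters are copied from the request cell.

```python
def build_extract_params(title, follow_redirects=True):
    """Build query parameters for the MediaWiki extracts request."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exlimit": 1,
        "titles": title,
        "explaintext": 1,
        "formatversion": 2,
        "format": "json",
    }
    if follow_redirects:
        # Assumption: the title is a redirect, so ask the API to resolve it.
        params["redirects"] = 1
    return params


params = build_extract_params("2023_Turkey–Syria_earthquake")
print(params)
```

The resulting dictionary can be passed to `requests.get("https://en.wikipedia.org/w/api.php", params=params)` exactly as in the request cell. Whether it returns the article text depends on the current state of the page.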

### Adding the Text Data to a DataFrame

Run the cell below as-is.

In [6]:
import pandas as pd

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = text_data

# Clean up dataframe to remove empty lines and headings
df = df[(
    (df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))
)].reset_index(drop=True)
df.head()

Unnamed: 0,text
0,<!--
1,NewPP limit report
2,Parsed by mw‐api‐ext.eqiad.main‐5cb9485489‐nxg52
3,Cached time: 20240401153439
4,Cache expiry: 2592000
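
To make the cleanup step concrete, here is a small sketch of the same filter written in plain Python and applied to a hypothetical list of lines (`clean_lines` and `sample_lines` are illustrative names, not part of the notebook). It keeps a line only if it is non-empty and does not start with `==`.

```python
def clean_lines(lines):
    """Drop empty strings and lines that start with '==' (section headings)."""
    return [
        line for line in lines
        if len(line) > 0 and not line.startswith("==")
    ]


# Hypothetical sample resembling a plain-text Wikipedia extract.
sample_lines = [
    "Intro paragraph.",
    "",
    "== History ==",
    "More text.",
    "=== Details ===",
]
print(clean_lines(sample_lines))
```

Note that this rule only targets empty lines and headings. Lines such as `<!-- ` are non-empty and do not start with `==`, which is why they remain in the DataFrame above.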


## Creating the Embeddings Index

Here is the text from the first row of our dataset. Run the cell below as-is.

In [7]:
df["text"][0]

'<!-- '

This code creates embeddings for that text sample. Run the cell below as-is.

In [8]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=[df["text"][0]],
    engine=EMBEDDING_MODEL_NAME
)

# Extract and print the first 20 numbers in the embedding
response_list = response["data"]
first_item = response_list[0]
first_item_embedding = first_item["embedding"]
print(first_item_embedding[:20])

RateLimitError: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
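
A `RateLimitError` that reports an exceeded quota generally has to be resolved in the account's plan and billing settings, and retrying will not help. For transient rate limits, though, a retry with exponential backoff is a common pattern. The sketch below is generic and hypothetical (`call_with_backoff` and its parameters are not part of the notebook or the `openai` package); it wraps any zero-argument callable.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * (2 ** attempt))


# Demonstration with a stand-in function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_call, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In this notebook, the callable could be `lambda: openai.Embedding.create(input=[df["text"][0]], engine=EMBEDDING_MODEL_NAME)`, with `retry_on` narrowed to the rate-limit exception class (in the pre-1.0 `openai` package this is `openai.error.RateLimitError`).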

### Creating a List of Embeddings

This code sends every row of `df["text"]` (converted to a list) to the `openai.Embedding.create` function in a single request, then collects the returned vectors into a list called `embeddings`.

Run the cell below as-is.

In [None]:
# Send text data to the model
response = openai.Embedding.create(
    input=df["text"].tolist(),
    engine=EMBEDDING_MODEL_NAME
)

# Extract embeddings
embeddings = [data["embedding"] for data in response["data"]]
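
Sending every row in a single request may run into request-size limits on longer pages. As a hedged sketch, the helper below splits the texts into fixed-size batches and concatenates the results in order. The names `embed_in_batches`, `embed_batch`, and `fake_embed_batch` are hypothetical, and the batch size of 100 is an arbitrary assumption, not a documented limit.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts in fixed-size batches; return one embedding per text, in order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # embed_batch must return one embedding per input text.
        embeddings.extend(embed_batch(batch))
    return embeddings


# Stand-in for the real API call: a fake one-number "embedding" per text.
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2))
```

With the API call used in this notebook, `embed_batch` could be a function that calls `openai.Embedding.create(input=batch, engine=EMBEDDING_MODEL_NAME)` and returns `[data["embedding"] for data in response["data"]]`.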

### Adding Embeddings to the DataFrame and Saving as CSV

Run the cell below as-is.

In [None]:
# Add embeddings list to dataframe
df["embeddings"] = embeddings
df.to_csv("embeddings.csv")
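
One thing to keep in mind when reading this file back: CSV stores each embedding as text, so a list such as `[0.1, -0.2]` comes back as the string `"[0.1, -0.2]"`. A minimal sketch of converting such a cell back into a list of floats follows; `parse_embedding` is a hypothetical helper name.

```python
import ast


def parse_embedding(cell):
    """Convert a stringified list such as '[0.1, -0.2]' into a list of floats."""
    values = ast.literal_eval(cell)  # Safely evaluates Python literals only.
    return [float(value) for value in values]


print(parse_embedding("[0.1, -0.2, 0.3]"))
```

After `df = pd.read_csv("embeddings.csv", index_col=0)`, the column could be restored with `df["embeddings"] = df["embeddings"].apply(parse_embedding)`.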

## Conclusion

You have now created and saved an embeddings index!