In [1]:
import tqdm
import numpy as np

# API Setup

In [2]:
from dotenv import load_dotenv
load_dotenv(dotenv_path="../.env")

True

# Dataset

For this example we use the [EleutherAI/wikitext_document_level](https://huggingface.co/datasets/EleutherAI/wikitext_document_level) dataset. This consists of 29,444 documents containing the raw text from their respective Wikipedia pages. For this demo, we use only 100 documents to ensure the code can be run quickly on a local machine.

In [3]:
from datasets import load_dataset
data = load_dataset("EleutherAI/wikitext_document_level", "wikitext-103-raw-v1", trust_remote_code=True)

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
sample_size = 100
data = data["train"][:sample_size]

In [5]:
data_index = 38
print(data["page"][data_index][:1000])
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"{'---'*20}\nTotal Length: {len(data['page'][data_index])} characters.")

 = Michael Jordan = 
 
 Michael Jeffrey Jordan ( born February 17 , 1963 ) , also known by his initials , MJ , is an American retired professional basketball player . He is also a businessman , and principal owner and chairman of the Charlotte Hornets . Jordan played 15 seasons in the National Basketball Association ( NBA ) for the Chicago Bulls and Washington Wizards . His biography on the NBA website states : " By acclamation , Michael Jordan is the greatest basketball player of all time . " Jordan was one of the most effectively marketed athletes of his generation and was considered instrumental in popularizing the NBA around the world in the 1980s and 1990s . 
 Jordan played three seasons for coach Dean Smith at the University of North Carolina . He was a member of the Tar Heels ' national championship team in 1982 . Jordan joined the NBA 's Chicago Bulls in 1984 as the third overall draft pick . He quickly emerged as a league star , entertaining crowds with his prolific scoring . 

# Preprocessing

Now that we have our dataset, we can load it into LlamaIndex and apply whatever preprocessing it needs. We intentionally chose this dataset because it isn't completely clean, which lets us illustrate some preprocessing practices.

**Data Extraction:** Since we are using an existing dataset, we do not have to worry about extracting data from other sources. 

**PII:** The text comes from public Wikipedia articles, so we do not apply any PII scrubbing here.

## Data Cleaning

This dataset was extracted at a particular point in time (check the dataset page for specifics), so we do not expect duplicate, outdated, or conflicting information. However, the dataset still has some artifacts, such as `=` symbols that denote headings and certain punctuation enclosed by `@` symbols. We use the former in the next section to build metadata, and remove the latter here with a straightforward regex, as shown below.

In [6]:
# Example sentence of this happening
print(data["page"][data_index][1350:1450])

with titles in 1992 and 1993 , securing a " three @-@ peat " . Although Jordan abruptly retired from


In [7]:
# Sentence after the fix
import re
pattern = re.compile(r" @(.)@ ")
print(re.sub(pattern, r"\1", data["page"][data_index])[1350:1450])

with titles in 1992 and 1993 , securing a " three-peat " . Although Jordan abruptly retired from bas


In [8]:
# Run this across the entire dataset
for i, page in enumerate(data["page"]):
    data["page"][i] = re.sub(pattern, r"\1", page)
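
As a quick check of what the pattern does and does not touch, here is a minimal standalone sketch. The helper name `clean_artifacts` and the sample strings are made up for illustration (only the `three @-@ peat` case comes from the page above), and the sketch assumes other punctuation such as commas and periods is wrapped the same way.

```python
import re

# Same pattern as in the cells above: one character wrapped as " @x@ ",
# including the surrounding spaces.
ARTIFACT_PATTERN = re.compile(r" @(.)@ ")


def clean_artifacts(text: str) -> str:
    """Collapse ' @x@ ' back to the single character x."""
    return ARTIFACT_PATTERN.sub(r"\1", text)


# Illustrative strings (hypothetical apart from the first one).
samples = [
    'securing a " three @-@ peat " .',
    "a crowd of 1 @,@ 000 people",
    "averaging 3 @.@ 5 points",
]
for sample in samples:
    print(clean_artifacts(sample))

# Sanity check: nothing matching the pattern should survive cleaning.
remaining = sum(
    len(ARTIFACT_PATTERN.findall(clean_artifacts(s))) for s in samples
)
print(f"Artifacts remaining: {remaining}")
```

Counting leftover matches after cleaning is a cheap way to confirm that the loop above covered every page before moving on.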

## Data Enrichment

No metadata is offered as part of the dataset but we can use the structure of the documents to extract information from them such as the title, headings, and subheadings. First, we take a look at the data:

In [9]:
print(data['page'][data_index])

 = Michael Jordan = 
 
 Michael Jeffrey Jordan ( born February 17 , 1963 ) , also known by his initials , MJ , is an American retired professional basketball player . He is also a businessman , and principal owner and chairman of the Charlotte Hornets . Jordan played 15 seasons in the National Basketball Association ( NBA ) for the Chicago Bulls and Washington Wizards . His biography on the NBA website states : " By acclamation , Michael Jordan is the greatest basketball player of all time . " Jordan was one of the most effectively marketed athletes of his generation and was considered instrumental in popularizing the NBA around the world in the 1980s and 1990s . 
 Jordan played three seasons for coach Dean Smith at the University of North Carolina . He was a member of the Tar Heels ' national championship team in 1982 . Jordan joined the NBA 's Chicago Bulls in 1984 as the third overall draft pick . He quickly emerged as a league star , entertaining crowds with his prolific scoring . 

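We can see that `= ... =` indicates a title. Headings and subheadings follow the same convention with more `=` symbols: `= = ... = =` indicates a heading and `= = = ... = = =` indicates a subheading (the `=` symbols are space-separated in this dataset). Let's build a function that uses regex to extract the title and store it as metadata.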

In [10]:
def extract_metadata(text):
    """Return a metadata dict holding the page title, if one can be found."""
    title_pattern = re.compile(r"\s=\s([^=]{1,50})\s=\s")
    # The regex can also match section headings, so we take the first match as the title
    match = title_pattern.search(text)
    title = match.group(1) if match else "Unknown Title"
    return {"title": title}

In [11]:
extract_metadata(data["page"][data_index])

{'title': 'Michael Jordan'}

The results here aren’t perfect but are shown as an example of how things can be extracted from the structure of the document programmatically. There may be some edge cases that are not handled but we do not solve them here in favor of utilizing simpler, more understandable code.
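
If headings and subheadings are wanted as well, one possible approach is to scan the page line by line and count the `=` symbols on each side. The sketch below is not part of the pipeline in this notebook: the function name `extract_outline` is hypothetical, and it assumes heading lines look like ` = = Heading = = ` with space-separated `=` symbols.

```python
import re

# Assumed line shapes (space-separated "=" symbols):
#   " = Title = ", " = = Heading = = ", " = = = Subheading = = = "
HEADING_LINE = re.compile(r"^\s*((?:= )+)(.+?)((?: =)+)\s*$")


def extract_outline(text: str) -> dict:
    """Collect the title, headings, and subheadings from one page."""
    outline = {"title": "Unknown Title", "headings": [], "subheadings": []}
    for line in text.splitlines():
        match = HEADING_LINE.match(line)
        if match is None:
            continue
        level = match.group(1).count("=")
        # Skip lines whose opening and closing markers do not balance.
        if level != match.group(3).count("="):
            continue
        name = match.group(2).strip()
        if level == 1 and outline["title"] == "Unknown Title":
            outline["title"] = name
        elif level == 2:
            outline["headings"].append(name)
        elif level == 3:
            outline["subheadings"].append(name)
    return outline


# Small made-up page in the assumed format.
sample_page = (
    " = Michael Jordan = \n \n Some text . \n"
    " = = Early years = = \n More text . \n"
    " = = = High school = = = \n"
)
print(extract_outline(sample_page))
```

Some vector stores only accept flat metadata values, so the lists may need to be joined into strings before they are attached to a document.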

As an example of other metadata we could add, we could use the [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page) to find the relevant URL for each article. We do not implement this here in order to keep things simple.
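
For a rough idea of what such a field could look like without calling the API, the sketch below guesses a URL directly from the title. The helper `guess_wikipedia_url` is hypothetical, and the guess assumes the extracted title matches the current English Wikipedia article name, which will not hold for renamed or disambiguated pages, or for titles still carrying the dataset's tokenization spacing.

```python
from urllib.parse import quote


def guess_wikipedia_url(title: str) -> str:
    """Build a best-guess article URL from a page title (no network call)."""
    # Wikipedia article paths use underscores in place of spaces.
    slug = title.strip().replace(" ", "_")
    # Percent-encode anything that is not URL-safe.
    return "https://en.wikipedia.org/wiki/" + quote(slug)


print(guess_wikipedia_url("Michael Jordan"))
# https://en.wikipedia.org/wiki/Michael_Jordan
```

A lookup through the Wikipedia API would be the more reliable route, since it can resolve redirects and confirm that the page exists.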

## Data Loading

Now, we load our dataset into a list of LlamaIndex documents alongside the extracted metadata.

In [12]:
from llama_index.core import Document

documents = []
for i in tqdm.tqdm(range(len(data["page"]))):
    documents.append(
        Document(
            text=data["page"][i],
            metadata=extract_metadata(data["page"][i]),
        )
    )

100%|██████████| 100/100 [00:00<00:00, 2414.35it/s]


In [13]:
documents[data_index]

Document(id_='2d35b458-78e0-4436-a743-b2cdcb41df3d', embedding=None, metadata={'title': 'Michael Jordan'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text=' = Michael Jordan = \n \n Michael Jeffrey Jordan ( born February 17 , 1963 ) , also known by his initials , MJ , is an American retired professional basketball player . He is also a businessman , and principal owner and chairman of the Charlotte Hornets . Jordan played 15 seasons in the National Basketball Association ( NBA ) for the Chicago Bulls and Washington Wizards . His biography on the NBA website states : " By acclamation , Michael Jordan is the greatest basketball player of all time . " Jordan was one of the most effectively marketed athletes of his generation and was considered instrumental in popularizing the NBA around the world in the 1980s and 1990s . \n Jordan played three seasons for coach Dean Smith at the University of North Carolina . He was a member of the Tar Heels \' natio