# Creating a Wikipedia database

Processing speed is typically about 2000 pages per second on an uncompressed dump,
and about 250 pages per second on the compressed bz2 file.


In [3]:
%load_ext autoreload
%autoreload 2

## Download the Wikipedia dump

This downloads the 28 GB Wikipedia dump from the web.


Instead of running the cell below, which downloads the Wikipedia dump from a web URL, you can run the following command in a terminal to download it via a torrent. The torrent download is a bit more manual but much faster: it takes about 15 minutes at ~20 MB/s, while the web download takes 1-2 hours at ~4 MB/s and is unreliable (the connection is often cut).

Torrents are listed [here](https://meta.wikimedia.org/wiki/Data_dump_torrents).

```bash
# MacOS: brew install aria2
# Linux: sudo apt install aria2
aria2c \
    -d wikipedia_data \
    --seed-time=0 \
    https://academictorrents.com/download/cd872797612d95384de3a0ab7e6a1f156bf91495.torrent
```


In [4]:
# You can safely run this cell even if you have already downloaded the Wikipedia dump:
# it will simply skip the download if the file already exists.

from wiki_dump_extractor import WikiXmlDumpExtractor, download_file
from pathlib import Path

dump_date = "20250501"
http_wiki_dump_dir = f"https://dumps.wikimedia.org/enwiki/{dump_date}/"
wiki_dump_name = f"enwiki-{dump_date}-pages-articles-multistream.xml.bz2"
dump_url = http_wiki_dump_dir + wiki_dump_name
wikipedia_data_dir = Path("wikipedia_data")
local_dump_path = wikipedia_data_dir / wiki_dump_name
download_file(dump_url, local_dump_path, replace=False)

wikipedia_data/enwiki-20250501-pages-articles-multistream.xml.bz2 already exists, skipping download.
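
The skip-if-present behaviour described above can be sketched with the standard library alone. This is only an illustration, not the implementation of `download_file`: the function name `download_if_missing` and the temporary `.part` suffix are assumptions made for this example. Writing to a temporary file first means an interrupted download is never mistaken for a complete one on the next run.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, target_path):
    """Download `url` to `target_path` unless the file is already there.

    Returns True if a download happened, False if it was skipped.
    (Hypothetical helper, not part of wiki_dump_extractor.)
    """
    target_path = Path(target_path)
    if target_path.exists():
        print(f"{target_path} already exists, skipping download.")
        return False
    target_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream to a ".part" file so that a cut connection does not leave a
    # truncated file under the final name.
    partial_path = target_path.with_name(target_path.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial_path, "wb") as out:
        shutil.copyfileobj(response, out)
    partial_path.rename(target_path)
    return True
```

Calling `download_if_missing(dump_url, local_dump_path)` a second time would return immediately without touching the network.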


## Extract the dump to an Avro file

This creates a more practical version of the archive. The resulting file is still 28 GB, but it is 10-20x faster to iterate through and makes it easier to fetch pages by title.

This will go over 24 million compressed pages (2400 batches of 10,000 pages). This operation takes ~40 minutes on a good processor, up to ~2-3 hours on an older one.


In [5]:
avro_dump_path = wikipedia_data_dir / "wiki_dump.avro"
page_index_db = wikipedia_data_dir / "wiki_dump_index_db"
redirects_db = wikipedia_data_dir / "wiki_dump_redirects_db"


if not avro_dump_path.exists():
    extractor = WikiXmlDumpExtractor(file_path=local_dump_path)
    ignored_fields = ["timestamp", "page_id", "revision_id", "redirect_title"]
    extractor.extract_pages_to_avro(
        output_file=avro_dump_path,
        redirects_db_path=redirects_db,
        batch_size=10_000,
        ignored_fields=ignored_fields,
    )
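
To give an idea of what iterating over the compressed dump involves, here is a minimal standard-library sketch that streams `<page>` elements out of a bz2-compressed XML file. It is not how `WikiXmlDumpExtractor` is implemented; the function name `iter_pages` and the two-page sample are made up for illustration, and only the title and text of each page are read.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_pages(bz2_source):
    """Yield (title, text) for each <page> in a bz2-compressed XML dump.

    `bz2_source` can be a file path or a binary file object.
    """
    with bz2.open(bz2_source, "rb") as xml_stream:
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            # Tags look like "{namespace}page"; keep only the local name.
            if elem.tag.rsplit("}", 1)[-1] != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                local_name = child.tag.rsplit("}", 1)[-1]
                if local_name == "title":
                    title = child.text
                elif local_name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # release the memory held by the processed page


# A made-up two-page dump, compressed in memory for the demonstration.
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Mumbai</title><revision><text>A city in [[India]].</text></revision></page>
  <page><title>Bombay</title><revision><text>#REDIRECT [[Mumbai]]</text></revision></page>
</mediawiki>"""
sample_pages = list(iter_pages(io.BytesIO(bz2.compress(SAMPLE_XML))))
```

Decompressing and parsing XML on every pass is the slow part, which is why a one-off conversion to a faster format pays off when the pages are iterated several times.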

## Index the pages

This creates an index so that pages can later be fetched by title. This takes ~4 minutes on a good processor, up to ~3x that on an older one.


In [6]:
from wiki_dump_extractor import WikiAvroDumpExtractor

if not page_index_db.exists():
    dump = WikiAvroDumpExtractor(avro_dump_path)
    dump.index_pages(index_dir=page_index_db)
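
The idea behind a title index can be illustrated with a toy example: record where each page starts in the file, then seek directly to that position when a title is requested. This sketch uses a JSON-lines file and an in-memory dictionary, which are assumptions made for the illustration; the index built by `index_pages` is stored on disk and may work differently.

```python
import json
import tempfile
from pathlib import Path


def build_title_index(records_path):
    """Map each page title to the byte offset of its record in the file."""
    index = {}
    with open(records_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            index[json.loads(line)["title"]] = offset
    return index


def fetch_page(records_path, index, title):
    """Read a single record by seeking to its offset, without a full scan."""
    with open(records_path, "rb") as f:
        f.seek(index[title])
        return json.loads(f.readline())


# Tiny demonstration with three made-up pages.
records_path = Path(tempfile.mkdtemp()) / "pages.jsonl"
with open(records_path, "w", encoding="utf-8") as f:
    for title in ["Mumbai", "Marie Louise", "1808"]:
        f.write(json.dumps({"title": title, "text": f"Text of {title}"}) + "\n")

index = build_title_index(records_path)
page = fetch_page(records_path, index, "Marie Louise")
```

Building the index requires one full pass over the file, after which each lookup costs a single seek and read.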

## Extract the titles of disambiguation pages

This finds the titles of pages which are disambiguation pages. For instance, "Marie Louise" is a valid page title, but the page doesn't refer to a particular person: it lists the different people named "Marie Louise".

This processes 12 million pages and takes 12 minutes on a good processor, probably 3 times that on an older one.


In [7]:
from wiki_dump_extractor import WikiAvroDumpExtractor

disambiguation_page_titles_path = (
    wikipedia_data_dir / "disambiguation_page_titles.json"
)
if not disambiguation_page_titles_path.exists():
    dump = WikiAvroDumpExtractor(avro_dump_path)
    dump.extract_disambiguation_page_titles(disambiguation_page_titles_path)
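
As a rough illustration of how a disambiguation page can be recognized, the sketch below looks for a disambiguation template in the page's wikitext. The list of template names is a small, non-exhaustive assumption chosen for this example, and the criterion actually used by `extract_disambiguation_page_titles` may differ.

```python
import re

# A few template names used on English Wikipedia disambiguation pages.
# This list is an assumption for the example and is not exhaustive.
DISAMBIGUATION_TEMPLATE = re.compile(
    r"\{\{\s*(disambiguation|disambig|hndis|geodis)\b[^{}]*\}\}",
    re.IGNORECASE,
)


def is_disambiguation_page(page_text):
    """Return True if the wikitext contains a known disambiguation template."""
    return DISAMBIGUATION_TEMPLATE.search(page_text) is not None


# Made-up page texts for the demonstration.
marie_louise = "'''Marie Louise''' may refer to:\n* [[Marie Louise, Duchess of Parma]]\n{{hndis}}"
mumbai = "{{Infobox settlement|name=Mumbai}}\n'''Mumbai''' is a city in India."
```

Here `is_disambiguation_page(marie_louise)` is true and `is_disambiguation_page(mumbai)` is false.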

## Extract links from the pages

This extracts the links from the pages. Every time there is a link like `[[Mumbai|Bombay]]`, it records an entry associating the shown text (Bombay) with the Wikipedia page it points to (Mumbai), which helps create a "dictionary of synonyms". Some of these associations are specific to the page in which they appear. For instance, you might find `[[Infanta Maria Theresa of Portugal|Marie-Therese]]` in one page, and `[[Maria Theresa of Spain|Maria Theresa]]` in another.

This processes 12 million pages (1200 batches of 10,000 pages) and takes 10 minutes on a good processor with 6 workers, probably 3 times that on an older one.


In [11]:
from utils import db_utils
from utils.extraction_utils import find_links_in_pages
from tqdm.auto import tqdm
from pathlib import Path
from wiki_dump_extractor import WikiAvroDumpExtractor

wiki_data_dir = Path("wikipedia_data")
dump = WikiAvroDumpExtractor(
    # Reuse the index directory created in the "Index the pages" step.
    wiki_data_dir / "wiki_dump.avro", index_dir=wiki_data_dir / "wiki_dump_index_db"
)

generated_data_dir = Path("generated_data")
page_links_db = generated_data_dir / "page_links_db"
if not page_links_db.exists():
    processed_batches = dump.process_page_batches_in_parallel(
        process_fn=find_links_in_pages, batch_size=10_000, num_workers=6
    )
    with db_utils.LMDBWriter(page_links_db, map_size=20_000_000_000) as db:
        for batch_result in tqdm(processed_batches):
            db.write_batch(batch_result)
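
The following sketch shows one way to turn wikilinks into a "shown text to target pages" dictionary. In wikitext the target page comes before the pipe and the shown text after it, and a link without a pipe shows its own target. The functions `find_links` and `build_synonyms` are written for this illustration and are not the `find_links_in_pages` helper used above, whose output format is not shown in this notebook.

```python
import re
from collections import defaultdict

# [[Target]] or [[Target|shown text]]
WIKILINK = re.compile(r"\[\[([^|\[\]]+)(?:\|([^\[\]]*))?\]\]")


def find_links(page_text):
    """Return a list of (shown_text, target_page) pairs found in the text."""
    links = []
    for target, shown in WIKILINK.findall(page_text):
        target = target.strip()
        shown = shown.strip() or target  # no pipe: the target is shown as-is
        links.append((shown, target))
    return links


def build_synonyms(page_texts):
    """Map each shown text to the set of pages it was used to link to."""
    synonyms = defaultdict(set)
    for page_text in page_texts:
        for shown, target in find_links(page_text):
            synonyms[shown].add(target)
    return synonyms


# Made-up page texts for the demonstration.
sample_texts = [
    "He sailed to [[Mumbai|Bombay]] and then to [[Goa]].",
    "She was a daughter of [[Infanta Maria Theresa of Portugal|Maria Theresa]].",
    "He married [[Maria Theresa of Spain|Maria Theresa]].",
]
synonyms = build_synonyms(sample_texts)
```

In this sample, "Maria Theresa" ends up associated with two different pages, which is why some entries only make sense in the context of the page they came from.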


## Extract infobox data

This extracts the infoboxes from the pages.

This processes 12 million pages (1200 batches of 10,000 pages) and takes 7mins on a good processor with 6 workers, probably 3 times that on an older one.


In [16]:
from pathlib import Path
from utils import db_utils, extraction_utils

generated_data_dir = Path("generated_data")
parsed_infoboxes_db = generated_data_dir / "parsed_infoboxes_db"
if not parsed_infoboxes_db.exists():
    counter = 0
    processed_batches = dump.process_page_batches_in_parallel(
        process_fn=extraction_utils.parse_infoboxes,
        batch_size=10_000,
        num_workers=6,
    )
    with db_utils.LMDBWriter(parsed_infoboxes_db, map_size=20_000_000_000) as db:
        for batch_result in tqdm(processed_batches):
            counter += len(batch_result)
            db.write_batch(batch_result)
    # Print inside the `if` so the cell does not fail with a NameError when
    # the database already exists and `counter` was never defined.
    print(counter)

4392949
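
For readers unfamiliar with infoboxes, here is a simplified sketch of how one could be parsed into a dictionary. It finds the first `{{Infobox ...}}` template by matching braces, then splits it on the `|` characters that are not nested inside another template or link. All function names are made up for this example; `extraction_utils.parse_infoboxes` may use a different approach and handle more edge cases.

```python
def extract_infobox(page_text):
    """Return the inside of the first {{Infobox ...}} template, or None."""
    start = page_text.find("{{Infobox")
    if start == -1:
        return None
    depth, i = 0, start
    while i < len(page_text):
        if page_text.startswith("{{", i):
            depth += 1
            i += 2
        elif page_text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return page_text[start + 2 : i - 2]
        else:
            i += 1
    return None  # unbalanced braces


def split_top_level(body):
    """Split on "|" characters that are outside nested {{...}} and [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i : i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(page_text):
    """Parse the first infobox into a dict of field name to raw value."""
    body = extract_infobox(page_text)
    if body is None:
        return None
    name, *fields = split_top_level(body)
    parsed = {"_template": name.strip()}
    for field in fields:
        key, separator, value = field.partition("=")
        if separator:
            parsed[key.strip()] = value.strip()
    return parsed


# Made-up page text for the demonstration.
SAMPLE_PAGE = """{{Infobox person
| name = Marie Curie
| birth_date = {{birth date|1867|11|7}}
| birth_place = [[Warsaw|Warszawa]], Poland
}}
'''Marie Curie''' was a physicist."""
infobox = parse_infobox(SAMPLE_PAGE)
```

The values are kept as raw wikitext here (nested templates and links included), so a further cleaning step would be needed before using them as structured data.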


## Extract pages linked in the "Year" pages

This goes through Wikipedia's "year pages" such as [1808 (year)](<https://en.wikipedia.org/wiki/1808_(year)>) and extracts all the pages linked in them, which include notable events, notable places, and the births and deaths of notable people. These are great first candidates for AI extraction of events. The process takes a couple of minutes.


In [None]:
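import re
from pathlib import Path
from tqdm.auto import tqdm
from utils import db_utils
from wiki_dump_extractor import WikiAvroDumpExtractor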

wiki_data_dir = Path("wikipedia_data")
link_regex = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?]]")
months = "January|February|March|April|May|June|July|August|September|October|November|December"
month_day_regex = re.compile(f"^({months}) \\d+$")
day_month_regex = re.compile(f"^\\d{{1,2}} ({months})$")
year_regex = re.compile(r"^\d{1,4}$")


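def extract_links(page_text):
    # Strip whitespace before filtering so that padded links such as
    # "[[ 1808 ]]" are still recognized as dates and excluded.
    links = {link.strip() for link in link_regex.findall(page_text)}
    return [
        link
        for link in links
        if not month_day_regex.match(link)
        and not day_month_regex.match(link)
        and not year_regex.match(link)
        and not link.startswith("Category:")
        and not link.startswith("File:")
    ]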


def year_to_page_name(year, redirects_db):
    if year < 0:
        page_name = f"{-year} BC"
    else:
        page_name = f"{year} (year)"
    redirect = db_utils.get_redirect(title=page_name, db=redirects_db)
    if redirect:
        return redirect
    return page_name


target = wiki_data_dir / "historical_pages.avro"
if not target.exists():
    with db_utils.LMDBReader(wiki_data_dir / "wiki_dump_redirects_db") as redirects_env:
        page_names = [
            year_to_page_name(year, redirects_env) for year in range(-500, 2000)
        ]
        pages = dump.get_page_batch_by_title(page_names)
        all_page_links = [link for page in pages for link in extract_links(page.text)]
        all_page_links = sorted(set(all_page_links))
        dump.extract_pages_titles_to_new_dump(
            all_page_links,
            target,
            redirects_env=redirects_env,
            ignore_titles_not_found=True,
        )
    historical_dump = WikiAvroDumpExtractor(target)
    historical_dump.index_pages(wiki_data_dir / "historical_pages_idx")
