<a href="https://www.kaggle.com/code/emmermarcell/create-a-wikipedia-corpus?scriptVersionId=160121848" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Create a Wikipedia corpus

The aim of this notebook is to preprocess an existing Wikipedia dump and create a Wikipedia corpus that can be used for NLP tasks. Since the dataset is quite large, I make use of the memory mapping between RAM and filesystem storage provided by the [Hugging Face Datasets library][1]. Under the hood, it uses the Apache Arrow memory format and the pyarrow library. Unfortunately, the [`wikipedia`][2] dataset is not streamable, so I stick to iterating through it.
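
To illustrate the general idea of memory mapping, here is a minimal sketch using Python's standard `mmap` module. It is not how the Datasets library or Arrow is implemented internally; the helper `read_slice_via_mmap` and the demo file are hypothetical and only show that a slice of a file can be read without loading the whole file into RAM.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, start, length):
    """Return `length` bytes starting at offset `start` without reading the whole file."""
    with open(path, "rb") as f:
        # Map the file into virtual memory; the OS loads pages lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:start + length]


# Hypothetical demo file standing in for a large on-disk table (10,000 bytes).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"0123456789" * 1000)

# Read only the last five bytes of the file.
chunk = read_slice_via_mmap(demo_path, 10_000 - 5, 5)
print(chunk)  # b'56789'
```

Only the pages that back the requested slice need to be resident in memory, which is why this approach scales to files larger than the available RAM.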

I used the following article as a starting point for implementing a RAG pipeline:

* [Steven van de Graaf - Pre-processing a Wikipedia dump for NLP model training — a write-up][3]

[1]: https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt
[2]: https://huggingface.co/datasets/wikipedia
[3]: https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67

In [1]:
!pip install blingfire

Collecting blingfire
  Downloading blingfire-0.1.8-py3-none-any.whl (42.1 MB)
Installing collected packages: blingfire
Successfully installed blingfire-0.1.8


In [2]:
import os
import gzip
import gc    # Garbage collector
from tqdm.auto import tqdm

from blingfire import text_to_sentences
import numpy as np
from datasets import load_dataset

The `all-MiniLM-L6-v2` model handles a maximum sequence length of 256 word pieces (tokens); longer input is truncated. As Wikipedia articles are often longer, I use the `text_to_sentences` function from `blingfire` to preprocess the data. This function takes a string (here, the text of a Wikipedia article) and returns it split into sentences, one per line. The sentences are then saved line by line into the local `wikipedia_processed_*.txt.gz` files. Afterwards, I can simply load them with the Hugging Face Datasets library.
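
As a rough sanity check on the length limit, the sketch below flags sentences that are still too long after splitting. It is only an approximation: it counts whitespace-separated words, whereas the model counts word pieces, so the real token count is usually higher. The helper name `find_long_sentences` and the sample data are made up for illustration.

```python
def find_long_sentences(sentences, max_words=256):
    """Yield (index, word_count) for sentences whose whitespace word count exceeds max_words."""
    for i, sentence in enumerate(sentences):
        n_words = len(sentence.split())
        if n_words > max_words:
            yield i, n_words


# Made-up sample: the second entry simulates an overlong "sentence" of 300 words.
sample = ["A short sentence.", "word " * 300, "Another short one."]
too_long = list(find_long_sentences(sample))
print(too_long)  # [(1, 300)]
```

For an exact check one would count tokens with the model's own tokenizer instead of splitting on whitespace.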

In [3]:
def process_wikipedia_dataset(wiki_dataset, output_dir, articles_per_file=1_000_000):
    file_count = 1
    article_count = 0

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Open the first output file for writing in compressed format
    out_f = gzip.open(f'{output_dir}/wikipedia_processed_{file_count}.txt.gz', 'wt', encoding='utf-8')

    # Iterate through each article in the dataset
    for article in tqdm(wiki_dataset, desc='Processing Articles'):
        # Process the article into sentences and write them into file
        sentences = text_to_sentences(article['text'])
        out_f.write(sentences + '\n')

        article_count += 1
        # Check if it's time to switch to a new file
        if article_count >= articles_per_file:
            out_f.close()
            file_count += 1
            article_count = 0
            out_f = gzip.open(f'{output_dir}/wikipedia_processed_{file_count}.txt.gz', 'wt', encoding='utf-8')

    # Close the last file
    out_f.close()
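
To show the on-disk format the function above produces, here is a minimal sketch that writes a tiny shard the same way (`sentences + '\n'` into a gzip text file) and reads it back one sentence per line. The helper `iter_sentences` and the demo strings are hypothetical; the strings stand in for the newline-separated output of `text_to_sentences`.

```python
import glob
import gzip
import os
import tempfile


def iter_sentences(shard_dir, pattern="wikipedia_processed_*.txt.gz"):
    """Yield every non-empty line (one sentence each) from all shards in shard_dir."""
    # sorted() orders names lexicographically, so shard 10 would come before shard 2.
    for path in sorted(glob.glob(os.path.join(shard_dir, pattern))):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    yield line


# Made-up stand-ins for text_to_sentences output: sentences joined by '\n'.
demo_articles = [
    "Sentence one of article A.\nSentence two of article A.",
    "Only sentence of article B.",
]

demo_dir = tempfile.mkdtemp()
with gzip.open(os.path.join(demo_dir, "wikipedia_processed_1.txt.gz"), "wt", encoding="utf-8") as out_f:
    for sentences in demo_articles:
        out_f.write(sentences + "\n")

recovered = list(iter_sentences(demo_dir))
print(recovered)
```

Note that article boundaries are not preserved in this format: the shard is a flat list of sentences, which is fine for sentence-level embedding but means the original article cannot be reconstructed from the file alone.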

In [4]:
# Load the Wikipedia dataset
wiki_dataset = load_dataset("wikipedia", "20220301.en", split='train')
print(f'Length of the Wikipedia dataset is {len(wiki_dataset)} articles.')

# Specify the working directory to save the processed files
output_dir = '/kaggle/working'
# Process the Wikipedia dataset and write the compressed output files to output_dir
process_wikipedia_dataset(wiki_dataset, output_dir)

Downloading and preparing dataset wikipedia/20220301.en (download: 19.18 GiB, generated: 18.88 GiB, post-processed: Unknown size, total: 38.07 GiB) to /root/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...


Dataset wikipedia downloaded and prepared to /root/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559. Subsequent calls will reuse this data.
Length of the Wikipedia dataset is 6458670 articles.


### The resulting files can be saved in a Kaggle dataset.