<a href="https://www.kaggle.com/code/emmermarcell/create-a-wikipedia-corpus?scriptVersionId=163531466" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Create a Wikipedia corpus

The aim of this notebook is to preprocess an existing Wikipedia dump and create a Wikipedia corpus that can be used for NLP tasks. Since the dataset is quite large, I make use of the memory mapping between RAM and filesystem storage provided by the [Hugging Face Datasets library][1]. Under the hood, it uses the Apache Arrow memory format and the pyarrow library. Unfortunately, the [`wikipedia`][2] dataset is not streamable, so I stick to iterating through it.

I used the following articles and notebooks as a starting point for implementing a RAG pipeline:

* [Steven van de Graaf - Pre-processing a Wikipedia dump for NLP model training — a write-up][3]

* [Chris Deotte - How To Train Open Book Model - Part 2][4]

[1]: https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt
[2]: https://huggingface.co/datasets/wikipedia
[3]: https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
[4]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2

In [1]:
!pip install blingfire

Collecting blingfire
  Downloading blingfire-0.1.8-py3-none-any.whl (42.1 MB)
Installing collected packages: blingfire
Successfully installed blingfire-0.1.8


In [2]:
import os
import time
import re
import gzip
import gc    # Garbage collector
from tqdm.auto import tqdm
from concurrent.futures import ProcessPoolExecutor, as_completed

import blingfire as bf
import numpy as np
from datasets import load_dataset

In [3]:
# Compile regular expressions only once at the beginning
infobox_pattern = re.compile(r'\{\{Infobox [^}]+\}\}', flags=re.DOTALL)
sidebar_pattern = re.compile(r'\{\{Sidebar [^}]+\}\}', flags=re.DOTALL)
link_pattern = re.compile(r'\[\[([^|\]]+\|)?([^\]]+)\]\]')
references_pattern = re.compile(r'==\s*(References|External links|See also|Notes)\s*==.*', flags=re.DOTALL)
citation_needed_pattern = re.compile(r'\{\{citation needed[^}]*\}\}', flags=re.DOTALL)
cn_pattern = re.compile(r'\{\{cn\}\}', flags=re.DOTALL)
curly_braces_pattern = re.compile(r'\{\{[^}]+\}\}', flags=re.DOTALL)
whitespace_pattern = re.compile(r'\s+')
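
To see what these patterns do, here is a small sketch that applies three of them to a made-up wikitext snippet. The patterns are copied from the cell above; the sample string is an assumption for illustration and does not come from the dataset.

```python
import re

# Patterns copied from the notebook cell above
link_pattern = re.compile(r'\[\[([^|\]]+\|)?([^\]]+)\]\]')
citation_needed_pattern = re.compile(r'\{\{citation needed[^}]*\}\}', flags=re.DOTALL)
whitespace_pattern = re.compile(r'\s+')

# Hypothetical wikitext snippet (illustration only)
sample = "[[Paris|The capital]] of [[France]]  is large.{{citation needed|date=May 2020}}"

cleaned = link_pattern.sub(r'\2', sample)                # keep only the visible link label
cleaned = citation_needed_pattern.sub('', cleaned)       # drop the citation-needed template
cleaned = whitespace_pattern.sub(' ', cleaned).strip()   # collapse runs of whitespace
print(cleaned)
```

Note that `[^}]+` stops at the first closing brace, so nested templates would only be partially removed by these patterns.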

The `all-MiniLM-L6-v2` model can handle a maximum sequence length of 256 word pieces. As Wikipedia articles are often longer, I use the `text_to_sentences_and_offsets` function from `blingfire` to preprocess the data. This function takes a string (representing a Wikipedia article) and splits it into sentences, returning their offsets as well. The sentences are then saved locally, one per line, in the `wikipedia_processed_*.txt.gz` files. Afterwards, I can simply load them with the Hugging Face Datasets library.
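
The idea of splitting an article into sentences and keeping only those within a length range can be sketched without `blingfire`. The splitter below is a naive regex stand-in (an assumption for illustration, far less robust than `blingfire`), and `split_with_offsets` and `filter_sentences` are hypothetical helper names.

```python
import re

def split_with_offsets(text):
    """Naive stand-in for a sentence splitter: returns (start, end) character offsets."""
    # A "sentence" here is a run of text ending in '.', '!' or '?' (or the end of the text)
    return [(m.start(), m.end()) for m in re.finditer(r'\S[^.!?]*[.!?]?', text)]

def filter_sentences(text, min_len, max_len):
    """Keep only the sentences whose character length lies in [min_len, max_len]."""
    kept = []
    for start, end in split_with_offsets(text):
        if min_len <= end - start <= max_len:
            kept.append(text[start:end])
    return kept

text = "Short. This sentence is long enough to be kept. Ok!"
kept = filter_sentences(text, min_len=10, max_len=100)
print(kept)
```

Filtering on the offsets rather than on the split strings avoids building sentence strings that are discarded right away.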

In [4]:
def preprocess_article(text: str) -> str:
    # Remove infoboxes and sidebars
    text = infobox_pattern.sub('', text)
    text = sidebar_pattern.sub('', text)
    
    # Simplify links - keep the text of the link only
    text = link_pattern.sub(r'\2', text)
    
    # Remove sections that start with == References ==, == External links ==, etc.
    text = references_pattern.sub('', text)
    
    # Optional: Remove citation needed and similar templates
    text = citation_needed_pattern.sub('', text)
    text = cn_pattern.sub('', text)  # Short form of citation needed
    
    # Remove any remaining curly braces content (catch-all for other templates)
    text = curly_braces_pattern.sub('', text)
    
    # Normalize whitespace to a single space
    text = whitespace_pattern.sub(' ', text).strip()
    
    return text

def process_article(article_text: str, min_len: int, max_len: int) -> str:
    # Preprocess the article text with regex
    article_text = preprocess_article(article_text)
    
    # Skip further processing if no relevant text remains
    if not article_text:
        return ""
    
    # Initialize an empty list to hold sentences that meet the length criteria.
    proper_sentences = []
    
    # Segment the preprocessed article text into sentences and obtain their offsets (start and end positions).
    _, offsets = bf.text_to_sentences_and_offsets(article_text)
    
    for o in offsets:
        # Check if the length of the current sentence (calculated as end position - start position)
        # falls within the specified minimum and maximum length bounds.
        if not min_len <= o[1] - o[0] <= max_len:
            # If the sentence does not meet the length criteria, skip to the next iteration.
            continue
        
        # Extract the sentence from the article text using the start and end positions from 'o'.
        sentence = article_text[o[0]:o[1]]
        
        # Add the sentence that meets the length criteria to the list of proper sentences.
        proper_sentences.append(sentence)
    
    # Join the proper sentences into a single string, separated by newline characters,
    # and return this string. This results in a string where each sentence is on a new line,
    # assuming it met the length criteria.
    return '\n'.join(proper_sentences)
    
def process_article_wrapper(args):
    return process_article(*args)

def process_wikipedia_dataset(wiki_dataset, output_dir, articles_per_file=1_000_000, batch_size=100):
    os.makedirs(output_dir, exist_ok=True)  # Ensure output directory exists
    file_count = 1
    article_count = 0
    out_f = gzip.open(f'{output_dir}/wikipedia_processed_{file_count}.txt.gz', 'wt', encoding='utf-8')

    with ProcessPoolExecutor() as executor:
        futures = {}
        for article in tqdm(wiki_dataset, desc='Processing Articles'):
            # Submit tasks as you iterate
            future = executor.submit(process_article_wrapper, (article['text'], 32, 2048))
            futures[future] = article['text']

            # Process completed tasks in batches to save memory
            if len(futures) >= batch_size:
                for future in as_completed(futures):
                    sentences = future.result()
                    out_f.write(sentences + '\n')
                    article_count += 1

                    if article_count >= articles_per_file:
                        out_f.close()
                        file_count += 1
                        article_count = 0
                        out_f = gzip.open(f'{output_dir}/wikipedia_processed_{file_count}.txt.gz', 'wt', encoding='utf-8')
                    # Remove the future from the dictionary once processed
                    del futures[future]
                    break  # Break after processing one to check if more tasks should be added

        # Process any remaining tasks
        for future in as_completed(futures):
            sentences = future.result()
            out_f.write(sentences + '\n')
            article_count += 1

            if article_count >= articles_per_file:
                out_f.close()
                file_count += 1
                article_count = 0
                out_f = gzip.open(f'{output_dir}/wikipedia_processed_{file_count}.txt.gz', 'wt', encoding='utf-8')

    # Ensure the last file is closed
    out_f.close()
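
The batching idea in `process_wikipedia_dataset`, keeping only a limited number of tasks in flight so that results do not pile up in memory, can also be expressed with `concurrent.futures.wait`. The following is an alternative sketch, not the code used above. It uses a `ThreadPoolExecutor` and a toy `work` function (both assumptions for illustration) so that it runs anywhere; `bounded_map` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def work(x):
    # Hypothetical stand-in for process_article
    return x * x

def bounded_map(fn, items, max_in_flight=4):
    """Yield results while keeping at most max_in_flight tasks pending."""
    with ThreadPoolExecutor() as executor:
        pending = set()
        for item in items:
            pending.add(executor.submit(fn, item))
            if len(pending) >= max_in_flight:
                # Block until at least one task finishes, then drain the finished ones
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    yield future.result()
        # Drain whatever is still running once the input is exhausted
        for future in pending:
            yield future.result()

results = sorted(bounded_map(work, range(10)))
print(results)
```

As with `as_completed`, results arrive in completion order rather than input order, which is why the demo sorts them before printing.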

In [5]:
# Load the Wikipedia dataset
wiki_dataset = load_dataset("wikipedia", "20220301.en", split='train')
print(f'Length of the Wikipedia dataset is {len(wiki_dataset):_} articles.')

# Specify the working directory to save the processed files
output_dir = '/kaggle/working'
# Process the Wikipedia dataset and write the sentences to .txt.gz files
process_wikipedia_dataset(wiki_dataset, output_dir)

Downloading and preparing dataset wikipedia/20220301.en (download: 19.18 GiB, generated: 18.88 GiB, post-processed: Unknown size, total: 38.07 GiB) to /root/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559...


Dataset wikipedia downloaded and prepared to /root/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559. Subsequent calls will reuse this data.
Length of the Wikipedia dataset is 6_458_670 articles.


### The resulting files can be saved in a Kaggle dataset.
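
Reading the processed files back needs only the standard library. The sketch below assumes the `wikipedia_processed_*.txt.gz` naming used in this notebook; `iter_sentences` is a hypothetical helper, and the demo writes a tiny stand-in file to a temporary directory instead of touching the real output.

```python
import glob
import gzip
import os
import tempfile

def iter_sentences(output_dir):
    """Yield non-empty lines from every wikipedia_processed_*.txt.gz file in output_dir."""
    pattern = os.path.join(output_dir, 'wikipedia_processed_*.txt.gz')
    # sorted() gives a stable, lexicographic file order
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, 'rt', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if line:  # articles with no kept sentences leave blank lines
                    yield line

# Demo with a tiny stand-in file in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'wikipedia_processed_1.txt.gz')
    with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
        f.write('First sentence.\nSecond sentence.\n\nThird sentence.\n')
    sentences = list(iter_sentences(tmp_dir))

print(sentences)
```

Because the function is a generator, the files are decompressed and read lazily, one line at a time.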