Wikipedia English Dataset Search

Description: Provides a text based search for Wikipedia (English only)

PROJECT STATUS

Project is currently on hold due to the current runtime performance of the search engine hitting the limits of my hardware (and thus the scope of the project).
- The sparse vector stage (TF-IDF/BM25) is taking around an hour, even with multiprocessing and multithreading. I'm also reaching the limits of my hardware (server) with those techniques.
- Ideally the whole system should take no more than 10 minutes per query but we are far from that at the moment.
- Considering a larger rewrite in Rust to allow for speedup and/or use of a database (ie sqlite) over parquet files to get around fileIO latency.
TODO
- Copy over work implemented in TF-IDF class over to BM25.
- Rust rewrite?
  - Use polars instead of pandas
- Re-validate stage 1 (sparse vector) performance and accuracy.
- Build out stage 2 (dense vector) and validate it's performance and accuracy.
- Comprehensive setup.py script for python.

Before Proceeding

This project was intended to be possible to replicate and run on a consumer level PC
- Minimum Hardware Requirements:
  - 500GB hard drive (SSD preferred)
  - 8 GB RAM minimum (16+ GB preferred)
  - Nvidia GPU is preferred for creating vector embeddings
    - There is some support for Apple Silicon with MPS too
  - 4 core CPU (8+ cores preferred)
That being said, this project is very demanding in regards to the resources it uses.
- Storage
  - Expect to use around 30GB of storage for downloading the compressed dataset from Wikipedia (see WikipediaEnDownload submodule).
  - Add 128GB to that storage overhead for the extracted/decompressed dataset (also see WikipediaEnDownload submodule).
  - And add another 10GB of storage for the necessary environment libraries/packages.
  - And add another 85GB of storage for the bag-of-words metadata (depending on the file format).
    - JSON: 60.7GB (57 GB doc-to-words + 3.7 GB word-to-docs)
    - msgpack: 25GB (23 GB doc-to-words + 1.9 GB word-to-docs)
  - And finally, add another 5GB of storage for the embedding module.
- Compute (CPU, GPU, RAM)
  - If you just want to use the search functions, I have linked where you can download the necssary data below.
  - Running the preprocessing necessary for the program to work is incredibly taxing depending on your hardware.
    - Generally more is better. More CPU cores, more GPU VRAM, more RAM, etc.
    - I ran a lot of the preprocessing for the sake of speed/time on my own server. The server has more room for expansion of its resources but it was not during this project. The server's hardware config at the time of writing this:
      - 2x Intel Xeon with total of 56 cores
      - 68 GB RAM
      - 3x Nvidia P100 GPU cards at 16GB VRAM each
      - 1 TB HDD storage
    - I leveraged multiprocessing with all available cores for the preprocessing work in this repo's preprocess.py. It took around 2 weeks to complete the preprocessing step with all available cores.

Setup

Use either the dockerfile or a virtual environment to setup the necessary environment to run the files.
Python virtual environment (venv)
- Initialize: python -m venv wiki-search
- Activate:
  - Windows: /wiki-search/Scripts/activate
  - Linux/MacOS: source /wiki-search/bin/activate
- Install dependencies: pip install -r requirements.txt
Docker
- Build: docker build -t wiki-search -f Dockerfile .
- Run: docker run -v {$pwd}:/wiki wiki-search
Download the data
- Wikipedia XML
  - Contains the actual wikipedia articles aggregated into XML files.
  - Huggingface dataset link
- Wikipedia metadata (needed for search)
  - Recommend you download only the .msgpack files as they have a smaller storage footprint than the JSON.
  - Huggingface dataset link

Usage

Preprocess Wikipedia data (preprocess.py)
- python preprocess.py
- Arguments
  - --bow: perform bag-of-words preprocessing for the data.
  - --vector: perform vector preprocessing for the data.
  - --num_proc: the number of processors to use for preprocessing (default is 1).
  - --gpu2cpu_limit: the maximum processors allowed before use of the GPU is forced to the CPU (default is 4).
  - --override_gpu2cpu_limit: whether to ignore the limit set by --gpu2cpu_limit argument (default is false/not specified).
  - Example: python preprocess.py --bow --vector --num_proc=16 --override_gpu2cpu_limit
Search Wikipedia (searchwiki.py)
- python searchwiki.py

Notes

Minimum system storage and memory requirements for each file:
- pages-articles-multistream.xml.bz2
  - 19+ GB disk space compressed
  - 86+ GB disk space decompressed
  - Expect to use at least 64+ GB of RAM for the decompression OR 8+ GB if using --shard when downloading the data.
Preprocessing consists of the following:
- Breaking each document (xml file/article) into a bag of words.
  - The bag of words is stored into a JSON dictionary.
  - For each word in the bag of words, extract the word count. This will make for faster compute by front loading that operation here. The word count is also stored in the bag of words JSON dictionary.
- Chunking each document (xml file/article) into vector embeddings.
  - The vector embeddings are stored into a faiss vector index.
  - For each embedding, a mapping of the vectorDB, vectorDB index, file path, and slice indices are stored in a JSON dictionary.
Nltk
- Downloading all nltk packages (stored in ~/nltk_data) results in 3.3 GB in storage.
- NLTK uses WordNet Lemmatizer by default.
- See the geeksforgeeks article on using NLTK's lemmatizer for the pros and cons of using this specific lemmatizer.
  - Advantages
    - Improves text analysis accuracy: Lemmatization helps in improving the accuracy of text analysis by reducing words to their base or dictionary form. This makes it easier to identify and analyze words that have similar meanings.
    - Reduces data size: Since lemmatization reduces words to their base form, it helps in reducing the data size of the text, which makes it easier to handle large datasets.
    - Better search results: Lemmatization helps in retrieving better search results since it reduces different forms of a word to a common base form, making it easier to match different forms of a word in the text.
  - Disadvantages
    - Time-consuming: Lemmatization can be time-consuming since it involves parsing the text and performing a lookup in a dictionary or a database of word forms.
    - Not suitable for real-time applications: Since lemmatization is time-consuming, it may not be suitable for real-time applications that require quick response times.
    - May lead to ambiguity: Lemmatization may lead to ambiguity, as a single word may have multiple meanings depending on the context in which it is used. In such cases, the lemmatizer may not be able to determine the correct meaning of the word.
Optimizations to consider for further development
- In general
  - Parralellizing workflows (especially in preprocessing) to leverage multiprocessing/threading.
  - The embedding models are neural networks and thus would benefit from leveraging on device GPUs.
- In preprocessing
  - Opting for utilizing a second machine or SQL database to load in article data/text from memory/RAM instead of using file IO to read xml files.
- In search
  - Opting for utilizing a second machine to load in and query indices from memory/RAM instead of using file IO to read the index files.
Multiprocessing
- Getting transformer models running with multiprocessing
  - Cannot re-initialize CUDA in forked subprocess github issue
- Videos going using python's multiprocessing
  - Unlocking your CPU cores in Python (multiprocessing) video
  - Multiprocessing in Python: Pool video
  - Python starmap (itertools): A short and easy intro video
Got ascii art from here
Dev/performance notes:
- Despite being designed to scale but also run from consumer level hardware, the data preprocessing is NOT ideal for consumer hardware.
- Consider also adding C++ code that serves as extensions and accelerates certain areas such as xml parsing, file loading, and basic dictionary operations.
- Bag of words + vector preprocessing
  - Could not run bag of words and vector preprocessing at the same time. Managed to OOM when the data was almost completely done. Current server hardware is not sufficient for this but it's interesting to see those limits.
  - Estimated runtime on 16 cores was around 24 hours per file.
  - Used only 16 cores because it would create 16 instances of the embedding model and my server has 16GB VRAM per card.
- Bag of words preprocessing only
  - Estimated runtime on 32 cores was around 4 hours per file.
  - Task is CPU bound only.
  - Word to document metadata around 50 MB per file.
  - Document to word metadata around 10 MB per file.
- Vector preprocessing only
  - Estimated runtime on 16 cores was around hours per file (gpu to cpu override enabled).
  - Task can be either CPU or GPU bound.
  - Cannot isolate the size of the vector data per file (both in number of vectors or size in Bytes).
    - Memory resources exhausted (OOM) on my server (has 68 GB) when running the above configuration (16 cores). This has introduced the need to dynamically loading the data into the vector DB (lancedb) during preprocessing (instead of waiting for the full file to be processed) without causing any issues with threading/race conditions.
- Secondary preprocessing (bag of words)
  - It seemed prudent to include a secondary preprocessing step for the data. Theidea was to take the outputs from the bag of words preprocessing step and filter out all non-articles, precompute inverse document frequency for all words, and precompute the TF-IDF and BM25 scores for all words.
    - Computing the BM25 required that we define a k1 and b in the config.json before running this step.
      - Chose the following default k1 and b scores based on this article from ElasticSearch
    - The idea was that we would save on the cost of these computations at inference time given the scale of the data (we could go back to passing in k1 and b to class initialization method if we either have a much smaller dataset or we find ourself with sufficient need and compute power to do this at runtime).
  - Now we just have to read the document, word TF-IDF or BM25 and compute only the scores.
    - There are still issues with disk IO being a bottleneck for this part but it does save us memory usage.
  - This secondary preprocessing also includes building the inverted index.

Parsing Documents & Queries

Read in the file and pass it to BeautifulSoup.
Extract only the <title> and <text> tags. Get the text for both tags and concatenate them together.
- Ignore other tags (including <links>) as they don't have relevant information for the article.
For TF-IDF/BM25 search:
- Preprocessing:
  - Lowercase all words.
  - Split and/or replace all punctuation with "" (empty string) or " " (whitespace).
  - Tokenize all words (with nltk word_tokenize OR splitting on " " whitespace).
  - Convert all numbers from digits to their written expanded form (1 -> 'one').
  - Lemmatize and stem all remainging words.
- Convert the list of all words into a set (bag of words).
- Document to words mapping:
  - Each document is going to contain its own list of the set of unique words (bag of words).
  - For each word in the document's bag of words, the word frequency will be kept track of.

document_path: {
    word_1: word_1_freq,
    word_2: word_2_freq,
    . . .
    word_n: word_n_freq,
}

 - Word to documents mapping:
     - Each word is going to maintain a list of the documents that it has occured in.
     -

word: [document_1_path, document_2_path, ... , document_n_path]

 - Store the above metadata into dictionaries and save them to JSON files.

For Vector search with language model:
- Preprocessing
  - Created a preprocessor the recursively splits the text into paragraphs, words, and characters.
    - Paragraph splitter splits on newline characters ("\n\n", "\n").
    - Word splitter splits on white space characters (" ").
    - Character splitter splits on characters ("" or empty strin).
  - When a text segment is too large for the embedding model, it is either passed to the next level splitter or further broken down on the current splitter.
  - Inspiration for the design of the recursive splitter is the Langchain RecursiveTextCharacterSplitter.
    - All links below are for v0.1 of Langchain.
    - Langchain RecursiveCharacterTextSplitter in the documentation
    - Langchain RecursiveCharacterTextSplitter implementation on github
  - Preprocessor captures subtext token sequence, index with respect to original text, and subtext length along with the article's SHA1 hash and the xml file it came from.
- Embed the texts with the language model, maximizing the number of tokens that can fit within the model's context window. The subtext metadata for each embedded chunk will come in handy for retrieval.
  - Generate embeddings with BERT example (geeksforgeeks)
    - Primarily good for understanding how to generate sentence embeddings (what we want). Concepts still apply to using other models, not just BERT.
- Store the embedding to the vector database. Keep track of the index of that embedding as well as the text and article that it came from.
  - Store this metadata (database index, article file, text slice indices) in a JSON dictionary.

[database_path, database_index, article_path, [text_slice_start_index, text_slice_end_index]]

Repeat the same steps from document parsing to parse search queries.

Search Methods

TF-IDF/BM25
- Requires no additional modules but would benefit from having numpy installed for faster computation (minimal package bloat).
- Is a more classical method for search.
- Should at least give you the top article(s) to consider for each query/search.
Vector search with language model
- Requires pytorch, transformers, and faiss libraries (meaning the package bloat will be heavier).
- Is a more abstract and deep learning method of search.
- Should give you the top passages of text to consider for each query/search.
- Uses language models to turn articles and search queries into embeddings and performms k nearest neighbors (knn) to get the embeddings that are most semantically similar.
- Models to be considered:
  - BERT
    - google-bert/bert-base-uncased
    - google/mobilebert-uncased
    - distilbert/distilbert-base-uncased
    - google-bert/bert-large-uncased
  - All MiniLM (sentence-transformers)
    - sentence-transformers/all-MiniLM-L6-v2
    - sentence-transformers/all-MiniLM-L6-v1
    - sentence-transformers/all-MiniLM-L12-v2
    - sentence-transformers/all-MiniLM-L12-v1
    - sentence-transformers/paraphrase-MiniLM-L3-v2
    - sentence-transformers/paraphrase-MiniLM-L6-v2
    - sentence-transformers/paraphrase-MiniLM-L12-v2
    - sentence-transformers/all-roberta-large-v1
    - sentence-transformers/all-distilroberta-v1
    - sentence-transformers/multi-qa-MiniLM-L6-cos-v1
    - sentence-transformers/multi-qa-MiniLM-L6-dot-v1
  - Instructor (hkunlp)
    - hkunlp/instructor-base
    - hkunlp/instructor-large
    - hkunlp/instructor-xl
- UPDATE 06/19/2024:
  - I ran out of space on my server when generating embeddings for just 1 file. Embedding process was about 75% complete and generated 187 GB of data in the lancedb vector DB when it stopped. My server has a 1 TB hard drive but there are other projects and files stored on the machine.
  - Performing a napkin calculation I found that embedding all vectors in the dataset would take around 2.2 TB (this was just the embeddings, not the additional metadata that is stored with the embedding in the vector DB).
    - 1 (BERT) embedding has 768 dims
      - Embeddings use float32 dtype which is 32-bit floating point number which equates to 4 bytes.
      - The total embedding storage overhead is 4 x 768 or 3,072 bytes.
    - 1 file had around 4 million vectors.
      - 1 file also had 150,000 articles.
    - There are just shy of 200 files.
    - Total formula:
      - 3,072 bytes per embeding vector x 4,000,000 vectors per file x 200 files
      - Divide accordingly to get conversion to TB, GB, MB, KB units.
  - Acquiring a 4 TB hard drive is not a challenge but that level of storage defeats the goal of the project to be able to run on a consumer grade computer.
    - As such, I have elected to stop pursuing this component of the project at this time. The creation of the vector database is just not worth the compute/energy and storage. I will only implement the TF-IDF/BM25 and the Rerank search engines.
Rerank search with TF-IDF/BM25 and vector search
- Combination of methods above.

Useful Links for Preprocessing Concepts

Stop words
- pythonspot blog
- geeksforgeeks
Num2words
- pypi
Lemmatization
- geeksforgeeks
Stemming
- geeksforgeeks
- guru99 post (applies also to lemmatization)
Previous search examples
- neuralsearch repo (mine)

Python Module to JS Package

nltk
- functions: stop words, lemmatization, stemming, and word tokenizing.
- natural
  - npm
  - main webpage
- winkjs
  - wink-nlp npm
  - wink-porter2-stemmer npm
  - wink-nlp-utils npm
  - wink-nlp-lemmatizer npm
  - main webpage
  - examples page
  - stemming and lemmatization tutorial
  - wink-nlp main webpage
  - wink-nlp-utils main webpage
num2word
- functions: convert numbers from their numerical representation to a written form.
- to-words
  - npm
beautifulsoup
- functions: parse xml data.
- some alternatives (curtesy of stack overflow)
- libxmljs
  - npm
  - github
- xml-stream
  - npm
  - github
- xmldoc
  - npm
  - github
faiss
- functions: store/retrieve and save/load vector embeddings.
- faiss-node
  - npm
chromadb
- functions: store/retrieve and save/load vector embeddings
- chromadb
  - npm
transformers
- functions: use pretrained language models to create vector embeddings from texts.
- transformers.js
  - npm

Misc Links

Chunk Size Matters blog post
- Relevant to how chunking can affect RAG systems.
- Argument for smaller chunk sizes (128 tokens seemed to be the best performing in the article).
Vector Databases blog post
- Great for exploring vector DB options
- Some features have had time to develop since the time the article was written/published (2023)

References

BeautifulSoup documentation
Faiss documentation
- wiki
NLTK documentation
Transformers
- getting started with embeddings blog post
- using sentencetransformers at huggingface documentation
- loading models direcdtly to gpu stackoverflow
- estimate model memory usage documentation
- tokenizer encode text documentation
- tokenizer decode text documentation
- tokenizer padding text documentation
- Uploading datasets to huggingface hub documentation
  - Using huggingface-cli documentation
Sentence Transformers
- sentence transformers in the huggingface hub blog post
- documentation
- github
- semantic search sbert documentation
Documentation of native python module used:
- argparse
- concurrent.futures
- copy
- cProfiler
- gc
- json
- math
- multiprocessing
- os
- shutil
- sqlite3
- string
- typing
Cython:
- Python documentation on extending python with c/c++
- Cython documentation
- Sentdex youtube video
- Sentdex tutorial blog post
- LinkedIn blog post
- Peter Baumgartner blog post
- Python Bindings: Calling C or C++ From Python (Real Python)
  - Note: Disable JS to avoid the sign-in popup.
  - Associated repo
- Learn Cython pdf
LanceDB
- Quick Start documentation
- Python API Reference documentation
- Filtering documentation
- Full Text Search documentation
- Working with Tables documentation
- ANN Indexes documentation
  - Primarily for the section on disk-based indexes
  - Vector Search documentation
  - Concepts:
- Storage docoumentation
  - Primarily for local disk storage
- FAQs documentation
- Lance documentation
- Restack Framework transaction overview with lancedb
- Benchmarking LanceDB medium article
- Benchmarking LanceDB lancedb blog
- Rust crate
Milvus documentation
Cosine Similarity
- Wikipedia article
Zipf's Law
- Wikipedia article
TF-IDF
- Document Similarity Search Examples in Python (LinkedIn blog post)
- Understanding TF-IDF geeksforgeeks
- What is inverse document frequency blog post
msgpack main page
DrDub Semantic Search repo
Tries
- Trie data structure geeksforgeeks
- Trie data structure | Insert and Search geeksforgeeks

Name		Name	Last commit message	Last commit date
Latest commit History 312 Commits
.github/workflows		.github/workflows
Archived		Archived
Whitepapers		Whitepapers
WikipediaEnDownload @ e2675bb		WikipediaEnDownload @ e2675bb
src		src
.gitignore		.gitignore
.gitmodules		.gitmodules
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Notes.md		Notes.md
README.md		README.md
Vocab to Doc Freq (top-1000).png		Vocab to Doc Freq (top-1000).png
Vocab to Doc Freq (top-500).png		Vocab to Doc Freq (top-500).png
Vocab to Doc Freq (top-5000).png		Vocab to Doc Freq (top-5000).png
config.json		config.json
download_category_tree.py		download_category_tree.py
generate_trie.py		generate_trie.py
link_category_tree.py		link_category_tree.py
precompute_sparse_vectors.py		precompute_sparse_vectors.py
preprocess.py		preprocess.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
search.py		search.py
search_helper.pyx		search_helper.pyx
searchwiki.py		searchwiki.py
semantic_category_inverted_index.py		semantic_category_inverted_index.py
setup.py		setup.py
title.txt		title.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wikipedia English Dataset Search

PROJECT STATUS

Before Proceeding

Setup

Usage

Notes

Parsing Documents & Queries

Search Methods

Useful Links for Preprocessing Concepts

Python Module to JS Package

Misc Links

References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Wikipedia English Dataset Search

PROJECT STATUS

Before Proceeding

Setup

Usage

Notes

Parsing Documents & Queries

Search Methods

Useful Links for Preprocessing Concepts

Python Module to JS Package

Misc Links

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages