JsParseWiki

JavaScript app that parses Wikipedia dumps and imports them into MongoDB. This fork adds authentication for remote databases and is slightly refactored. Like the original version, it uses worker-nodes to process pages in parallel and wtf_wikipedia to turn Wikipedia markup (wikitext) into JSON.

Install

This version:

$ npm install -g ashvardanian/JsParseWiki

Original version:

$ npm install -g dumpster-dive

Usage

wget -O archive.xml.bz2 https://some.wiki.dump
# Decompress with lbzip2 (a parallel bzip2 utility) using 8 threads.
lbzip2 -d -n 8 -v archive.xml.bz2
# Run this JS module.
npx JsParseWiki archive.xml \
	--workers=8 \
	--mongo_url=mongodb://user:password@localhost:27017 \
	--mongo_name_db=wiki \
	--mongo_name_collection=pages
# Remove the decompressed dump.
rm archive.xml
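
When connecting to a remote database, the user name and password in --mongo_url may contain characters such as @, : or /, which MongoDB connection strings expect to be percent-encoded. Below is a minimal sketch of building such a URL. The helper buildMongoUrl and the sample credentials are hypothetical and are not part of JsParseWiki.

```javascript
// Hypothetical helper (not part of JsParseWiki): builds a MongoDB
// connection string with percent-encoded credentials.
function buildMongoUrl(user, password, host, port) {
  // encodeURIComponent escapes reserved characters such as '@', ':' and '/'.
  const encodedUser = encodeURIComponent(user);
  const encodedPassword = encodeURIComponent(password);
  return `mongodb://${encodedUser}:${encodedPassword}@${host}:${port}`;
}

// Sample values for illustration only.
const url = buildMongoUrl('user', 'p@ss:word/1', 'localhost', 27017);
console.log(url);
```

The printed string can then be passed as the value of --mongo_url, quoted so that the shell does not interpret any special characters.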
