JavaScript app that parses Wikipedia dumps and imports them into MongoDB. This fork supports remote DB authentication and is slightly refactored. Like the original version, it uses worker-nodes to process pages in parallel and wtf_wikipedia to turn WikiScript into JSON.
This version:
$ npm install -g ashvardanian/JsParseWiki
Original version:
$ npm install -g dumpster-dive
# Download the dump (the URL below is a placeholder).
wget -O archieve.xml.bz2 https://some.wiki.dump
# Decompress using lbzip2, a parallel bzip2 utility, with 8 threads.
lbzip2 -d -n 8 -v archieve.xml.bz2
# Run this JS module.
npx JsParseWiki archieve.xml \
--workers=8 \
--mongo_url=mongodb://user:password@localhost:27017 \
--mongo_name_db=wiki \
--mongo_name_collection=pages
# Remove the dump.
rm archieve.xml
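When connecting to a remote database, credentials that contain characters such as `@`, `:` or `/` must be percent-encoded before they go into a MongoDB connection string. The sketch below shows one way to build such a URL. `buildMongoUrl` is a hypothetical helper written for illustration, not part of this package, and the default host and port are assumptions taken from the example above.

```javascript
// Hypothetical helper (not part of JsParseWiki): builds a MongoDB connection
// string and percent-encodes the user name and password so that reserved
// characters do not break URL parsing.
function buildMongoUrl(user, password, host = "localhost", port = 27017) {
  const credentials =
    `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `mongodb://${credentials}@${host}:${port}`;
}

// Example with a password containing reserved characters.
console.log(buildMongoUrl("user", "p@ss:word/1"));
// prints "mongodb://user:p%40ss%3Aword%2F1@localhost:27017"
```

The resulting string can then be passed as the value of `--mongo_url`.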