parsewiki

Parse Wikipedia page dump to manage page entities and revisions.
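
As a rough illustration of what parsing a page dump into pages and revisions can look like, here is a minimal sketch using only the Python standard library. It assumes the standard MediaWiki XML export layout (`page`, `title`, `revision`, `id`, `timestamp` elements). The names `iter_pages` and `local_name` are hypothetical and are not taken from this repository's code.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, revisions) pairs from a MediaWiki XML dump.

    `source` is a filename or a binary file object. Hypothetical helper,
    not the parser used by this project.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revisions = []
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                # Keep only the revision id and timestamp in this sketch.
                revisions.append({
                    local_name(field.tag): field.text
                    for field in child
                    if local_name(field.tag) in ("id", "timestamp")
                })
        yield title, revisions
        # Free the memory held by the finished page element.
        elem.clear()
```

Because the function streams with `iterparse` and clears each page once it has been consumed, it avoids building one tree for the whole dump, which matters for large files.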

Roadmap

Get some data (directly resulting from the parsing, not just entities) and show it to the professor
Add a command line utility to specify the source of the dump(s)
Add an importer for the configurations (Mongo, spark folder, executors conf...)
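
The two planned items above (a command line utility for the dump source and a configuration importer) could be sketched as follows. This is only a possible shape, not the project's implementation: the argument names `dumps` and `--conf`, the functions `build_parser`, `load_conf` and `main`, and the assumption that configurations are JSON objects (suggested by the `*-conf.json` files in the repository) are all hypothetical.

```python
import argparse
import json
from pathlib import Path


def build_parser():
    """Build a hypothetical CLI: one or more dump sources plus an optional config."""
    parser = argparse.ArgumentParser(
        prog="parsewiki",
        description="Parse Wikipedia page dumps (illustrative sketch).",
    )
    parser.add_argument(
        "dumps", nargs="+",
        help="path(s) of the dump file(s) to parse",
    )
    parser.add_argument(
        "--conf", type=Path, default=None,
        help="JSON configuration file (Mongo, Spark folder, executors...)",
    )
    return parser


def load_conf(path):
    """Load a JSON configuration file; return an empty dict if none is given."""
    if path is None:
        return {}
    with open(path, encoding="utf-8") as handle:
        conf = json.load(handle)
    if not isinstance(conf, dict):
        raise ValueError("configuration must be a JSON object")
    return conf


def main(argv=None):
    """Parse the arguments and return the resolved dump sources and configuration."""
    args = build_parser().parse_args(argv)
    return {"dumps": args.dumps, "conf": load_conf(args.conf)}
```

For example, `main(["dump1.xml", "dump2.xml", "--conf", "no-mongo-conf.json"])` would return the list of dump paths together with the parsed contents of the configuration file.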

For each new task:
