Isolate pages from Wikimedia dumps and process them with Pandoc

dylanburati/wikiplain

wikiplain

A toolkit for processing Wikimedia XML dumps and Wikitext. Also includes a part-of-speech tagging TCP service.

I use these tools to estimate the level of name recognition for each article's subject, drawing on an English Wikipedia snapshot, a collection of Reddit post logs, and the UMBC WebBase corpus. This helps when curating the default People, Places, and Characters decks in my trivia game.
