A toolkit for processing Wikimedia XML dumps and Wikitext. Also includes a part-of-speech tagging TCP service.
I use these tools to combine an English Wikipedia snapshot, a collection of Reddit post logs, and the UMBC WebBase corpus into an estimate of how recognizable each article's subject is. That estimate helps me curate the default People, Places, and Characters decks in my trivia game.
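
To give a feel for what processing a Wikimedia XML dump involves, here is a minimal sketch that streams pages out of a MediaWiki XML export using only Python's standard library. It is not this toolkit's API: `iter_pages`, `local_name`, and the `SAMPLE` document are hypothetical names made up for illustration, and the sketch assumes the usual `<page>` / `<title>` / `<revision><text>` layout of the export format.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    Hypothetical helper for illustration; `source` is a filename or a
    binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop the page's children so memory stays flat on large dumps.
        elem.clear()


# A tiny made-up document standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Ada Lovelace</title>
    <revision><text>'''Ada Lovelace''' was a [[mathematician]].</text></revision>
  </page>
  <page>
    <title>Paris</title>
    <revision><text>'''Paris''' is the capital of [[France]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

Streaming with `iterparse` rather than loading the whole tree matters here because a full English Wikipedia dump is far too large to hold in memory at once. For a real dump, pass the path of the decompressed XML file (or an open binary file object) instead of the in-memory sample.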