Alternative approach using live database #9
In case you are not aware, WMF provides [Tool Labs](https://wikitech.wikimedia.org/wiki/Help:Tool_Labs), a free hosting service for Wikimedia-related tools with access to live database replicas and the latest dumps. With live database access, I propose another way to discover pages needing citations:
This way, we can avoid the lag introduced by using dumps (what is shown reflects the current state of the article), and there is no need to pre-process the dumps. Editors also tend to trust tools hosted on Tool Labs more than those on third-party servers.
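A minimal sketch of the kind of replica query this approach implies: list main-namespace pages in the tracking category populated by {{citation needed}}. The category name, replica host, and credentials file below are assumptions about the Tool Labs setup, not tested details.

```python
# Sketch: find articles flagged with {{citation needed}} via a live
# database replica. The tracking-category name is an assumption.

TRACKING_CATEGORY = "All_articles_with_unsourced_statements"  # assumed name

def build_query():
    """SQL to list main-namespace pages in a given tracking category."""
    return (
        "SELECT page_id, page_title "
        "FROM page "
        "JOIN categorylinks ON cl_from = page_id "
        "WHERE cl_to = %s AND page_namespace = 0"
    )

# On Tool Labs this would run against the enwiki_p replica, e.g. with
# pymysql (untested sketch; host and credentials file are assumptions):
#   conn = pymysql.connect(host="enwiki.labsdb", db="enwiki_p",
#                          read_default_file="replica.my.cnf")
#   with conn.cursor() as cur:
#       cur.execute(build_query(), (TRACKING_CATEGORY,))
#       pages = cur.fetchall()
```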
Tools on Tool Labs need to be licensed under an OSI-approved license.
Zhaofeng Li (User:Zhaofeng Li on Wikipedia)
Tool Labs looks really great, thanks for bringing it up! I'll look at it in more detail in the next few days, but I see no reason not to host CitationHunt there.
As for using the live databases, I'd definitely like that. One thing that's missing from your proposal is categorizing pages. The ~2000 categories available on the site for filtering are picked by the assign_categories.py script using simple heuristics, given the set of pages with missing citations and their categories (as you can probably see, this is basically a weighted Set Cover problem).
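To make the Set Cover framing concrete, here is a generic greedy sketch of the category-selection step: repeatedly pick the category that covers the most still-uncovered pages. This is an illustration of the general heuristic, not the actual logic in assign_categories.py.

```python
# Greedy weighted-set-cover sketch (illustrative, not assign_categories.py).

def pick_categories(pages_per_category, max_categories):
    """pages_per_category: dict mapping category name -> set of page ids.

    Returns up to max_categories category names that greedily maximize
    coverage of the flagged pages.
    """
    remaining = set().union(*pages_per_category.values())
    chosen = []
    while remaining and len(chosen) < max_categories:
        # Take the category covering the most still-uncovered pages.
        best = max(pages_per_category,
                   key=lambda c: len(pages_per_category[c] & remaining))
        gain = pages_per_category[best] & remaining
        if not gain:
            break
        chosen.append(best)
        remaining -= gain
    return chosen

# Example (hypothetical data):
#   pick_categories({"History": {1, 2, 3}, "Science": {3, 4}}, 2)
#   covers all four pages with two categories.
```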
In order to keep categories consistent, it might make sense to periodically recompute the pages and categories as we do now, but against the live databases instead of the dumps, thus reducing lag. Note that the current database contains over 170,000 snippets lacking citations, so with a sufficiently recent recomputation, the probability of coming across an outdated snippet should be small enough that we don't need to do everything live.
Finally, a cursory look at the database schema suggests that the page contents could be grabbed from the
Thank you very much for your suggestions!
Sorry, my comment wasn't clear at all. My concern was that our way of categorizing requires all relevant pages and their categories to be known in advance. So even with the live database, some pre-processing is necessary to figure out which categories will be featured on the site, whereas without categories we could just select a random page and parse it on each access. This isn't really a big deal, and is definitely doable; I just wanted to point it out because it was missing from your proposal.
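For contrast, the category-free per-access path mentioned above could be as simple as the following sketch; the function names are hypothetical placeholders.

```python
# Sketch of the category-free alternative: pick a random flagged page at
# request time and parse it on the fly. fetch_wikitext and extract_snippet
# are hypothetical callables (e.g. an API fetch and a template scanner).

import random

def serve_snippet(flagged_page_ids, fetch_wikitext, extract_snippet):
    """Pick a random flagged page and extract one uncited snippet from it."""
    page_id = random.choice(list(flagged_page_ids))
    wikitext = fetch_wikitext(page_id)   # e.g. via the MediaWiki API
    return extract_snippet(wikitext)     # find text near {{citation needed}}
```

The trade-off is exactly the one described above: no pre-processing, but every request pays the cost of fetching and parsing a page.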