Alternative approach using live database #9

Closed
zhaofengli opened this Issue Mar 5, 2015 · 4 comments

zhaofengli commented Mar 5, 2015

In case you are not aware, WMF provides [Tool Labs](https://wikitech.wikimedia.org/wiki/Help:Tool_Labs), a free hosting service for Wikimedia-related tools, with access to live database replicas and the latest dumps. With live database access, I propose another way to discover pages needing citations:

  1. Using SQL, select a random page in article space (namespace 0) that transcludes {{cn}}, via the templatelinks table (see the database layout and example; there's a rough sketch after this list)
  2. Obtain the page content via the API (there doesn't seem to be a way to get it from the database)
  3. Parse the page with the existing code to get the uncited part
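To make steps 1 and 2 concrete, something along these lines should work. This is an untested sketch, not code from this repo; the replica host name, credentials file and templatelinks columns are what I'd expect on Tool Labs, so please double-check:

```python
import os

import pymysql
import requests

# Untested sketch: host name, database name, credentials file and the
# templatelinks column names are assumptions about the Tool Labs setup.
conn = pymysql.connect(
    host='enwiki.labsdb',
    db='enwiki_p',
    read_default_file=os.path.expanduser('~/replica.my.cnf'))

with conn.cursor() as cursor:
    # Step 1: random article (namespace 0) transcluding {{cn}}. Transclusions
    # through the {{cn}} redirect should also show up under the target title.
    # ORDER BY RAND() is fine for a sketch but slow on a table this large.
    cursor.execute("""
        SELECT page_id, page_title
        FROM page
        JOIN templatelinks ON tl_from = page_id
        WHERE page_namespace = 0
          AND tl_namespace = 10
          AND tl_title = 'Citation_needed'
        ORDER BY RAND()
        LIMIT 1
    """)
    page_id, page_title = cursor.fetchone()

# Step 2: the text table is not in the replicas, so fetch the wikitext
# of the selected page through the API instead.
resp = requests.get('https://en.wikipedia.org/w/api.php', params={
    'action': 'query',
    'prop': 'revisions',
    'rvprop': 'content',
    'pageids': page_id,
    'format': 'json',
}).json()
page = list(resp['query']['pages'].values())[0]
wikitext = page['revisions'][0]['*']
```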

This way, we can avoid the lag caused by the use of dumps (what is shown represents the current state of the article) and there will be no need to pre-process the dumps. Editors also tend to trust tools hosted on Tool Labs more than those on third-party servers.

Tools on Tool Labs need to be licensed under an OSI-approved license.

Zhaofeng Li (User:Zhaofeng Li on Wikipedia)

Owner

eggpi commented Mar 5, 2015

Tool Labs looks really great, thanks for bringing it up! I'll look at it in more detail in the next few days, but I see no reason not to host CitationHunt there.

As for using the live databases, I'd definitely like that. One thing that's missing from your proposal is categorizing pages. The ~2000 categories that are available on the site for filtering are picked using simple heuristics by the assign_categories.py script, given the set of pages with missing citations and their categories (as you can probably see, this is basically a weighted Set Cover problem).
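To give a rough idea of what that means, the heuristic is in the spirit of the greedy sketch below. This is a simplification I'm writing out here, not the actual assign_categories.py logic, and the function and parameter names are made up for illustration:

```python
# Simplified greedy Set Cover sketch (not the actual assign_categories.py code):
# repeatedly pick the category that covers the most still-uncovered pages with
# missing citations, until everything is covered or we hit the category limit.
def pick_categories(pages_by_category, max_categories=2000):
    """pages_by_category maps a category name to the set of page ids in it."""
    uncovered = set().union(*pages_by_category.values())
    chosen = []
    while uncovered and len(chosen) < max_categories:
        best = max(pages_by_category,
                   key=lambda c: len(pages_by_category[c] & uncovered))
        newly_covered = pages_by_category[best] & uncovered
        if not newly_covered:
            break  # remaining pages belong to no category
        chosen.append(best)
        uncovered -= newly_covered
    return chosen
```

Whatever the exact weighting, the key point is that it needs the full set of pages and their categories up front.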

In order to keep categories consistent, it might make sense to just periodically recompute the pages and categories, except doing this against the live databases instead of dumps, thus reducing lag. Note that the current database contains over 170000 snippets lacking citations, so for a sufficiently recent recomputation, the probability of coming across an outdated snippet shouldn't be large enough that we need to do everything live.

Finally, a cursory look at the database schema suggests that the page contents could be grabbed from the text table. The documentation for Quarry says that text is missing, but I'd expect it to be there for other applications. Please correct me if this is not the case.

Thank you very much for your suggestions!

zhaofengli commented Mar 6, 2015

Unfortunately, the text table is not available on the Labs replicas, so you will need to rely on the API to grab page text. As for the category-assigning code, I'm not sure I see the point there.

Owner

eggpi commented Mar 6, 2015

Sorry, my comment wasn't clear at all. My concern was that the way we categorize requires all relevant pages and their categories to be known in advance. So even with the live database, some pre-processing will be necessary to figure out which categories will be featured on the site, whereas if we didn't use categories, we could just select a random page and parse it on each access. This is not really a big deal and is definitely doable; I just wanted to point it out because it was missing from your proposal.

Owner

eggpi commented Mar 30, 2016

After 7a0e9df, we're (finally!) using the live databases and API and no longer depend on dumps :)

eggpi closed this Mar 30, 2016
