topic annotation for articles could be throttled #683

Open
egonw opened this issue Apr 25, 2019 · 6 comments
Labels
performance the way Scholia treats the machines using it

Comments

@egonw
Collaborator

egonw commented Apr 25, 2019

@Daniel-Mietchen, this is just a heads up... I've been lurking on #wikidata on IRC for some time now. The WDQS servers have been building up lag quite a few times in the past couple of months. Sjoerd has been looking into the issue, and it seems related to the size of the items being edited. Now, it seems that articles are generally large, or at least in terms of the number of statements:

image

If I'm not mistaken, your batches have been shut down (see your talk page).

So, here's one scalability issue for Scholia: mass annotation of articles with topics is fairly expensive, and the old WDQS cluster does not handle the data bandwidth well. One issue is that it needs to pass around the full content of the item. So, the problem scales with the size of the item. Andra further told me there seems to be an issue with the max JSON size of an item.
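A minimal sketch of what throttling a batch of topic annotations could look like, assuming a fixed ceiling on edits per minute. `Throttle`, `annotate_all`, and the `edit` callable are hypothetical names for illustration; this is not Scholia's or any existing bot's code, and the right ceiling would have to come from the Wikidata operators.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous edit; None before the first one

    def wait(self):
        """Block until the next edit is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def annotate_all(items, edit, throttle):
    """Apply `edit` (a hypothetical callable performing one write) to each item."""
    for item in items:
        throttle.wait()
        edit(item)
```

With `Throttle(30)`, consecutive edits are at least two seconds apart. A fixed interval is the simplest choice; since the cost reportedly scales with item size, a size-aware delay might fit better, but that is speculation.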

No action needed, but something we could report on at some point.

I will also run my code on the key types behind all Scholia aspects.

@egonw changed the title from "topic annotation for articles should be throttled to less than on per second" to "topic annotation for articles should be throttled" on Apr 25, 2019
@egonw
Collaborator Author

egonw commented Apr 25, 2019

Oh, the source code for that plot is now available from this Rmd file: https://github.com/egonw/wikidata-item-size/blob/master/wikidata_item_size.Rmd and as HTML at https://egonw.github.io/wikidata-item-size/wikidata_item_size.html
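For readers who want a rough feel for "item size" without opening the Rmd, here is a small sketch that counts statements in an entity's JSON and measures its serialized size. It assumes the standard Wikidata entity JSON layout, where `claims` maps each property ID to a list of statements. It is not the method used in the linked Rmd file, and `item_size` is a hypothetical helper name.

```python
import json


def item_size(entity):
    """Return (number of statements, serialized JSON size in bytes)."""
    claims = entity.get("claims", {})
    # Each property ID maps to a list of statements; count them all.
    n_statements = sum(len(statements) for statements in claims.values())
    # Size of a re-serialized copy; the server-side size may differ slightly.
    n_bytes = len(json.dumps(entity, ensure_ascii=False).encode("utf-8"))
    return n_statements, n_bytes


# Hand-made toy entity, not real Wikidata content.
toy = {"id": "Q0", "claims": {"P31": [{}], "P50": [{}, {}, {}]}}
n_statements, n_bytes = item_size(toy)
print(n_statements, n_bytes)
```

Both numbers matter here: the statement count is what the plot is described as showing, while the byte size relates to the maximum JSON size mentioned above.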

@wetneb

wetneb commented Apr 25, 2019

I know you have already read that many times, but just for the record: this is just one of the many symptoms of the inadequacy of Wikidata to host Wikicite. It's not just about annotating topics: disambiguating authors, adding publication identifiers, adding affiliations… running any of these operations at a significant scale involves editing many items, which happen to be quite large now. At the moment doing this at 60 edits/minute in this domain is already too much for the servers.

Even assuming that the WMF wins the lottery and gets servers that are 10 times more powerful, allowing you to edit at 600 edits/min, this throughput is still going to be way below what is needed to efficiently maintain a database of articles. In https://dissem.in/ we index more than 100 million papers and much higher edit rates are needed even just to keep the database in sync with the metadata sources. The orders of magnitude just do not match up.
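To put the orders of magnitude in numbers, here is a back-of-the-envelope sketch. It assumes exactly one edit per paper and uses the 100 million figure quoted for dissem.in as the corpus size; both are simplifying assumptions, and `days_for_one_pass` is a hypothetical helper.

```python
def days_for_one_pass(n_items, edits_per_minute):
    """Days needed to edit every item once at a constant edit rate."""
    minutes = n_items / edits_per_minute
    return minutes / (60 * 24)


for rate in (60, 600):
    days = days_for_one_pass(100_000_000, rate)
    print(f"{rate} edits/min: about {round(days)} days")
```

Under these assumptions a single pass takes roughly 1157 days (a little over three years) at 60 edits/min and roughly 116 days at 600 edits/min, before counting any repeated edits to the same item.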

I wish Wikicite acknowledged that fully and realized that the current edits in Wikidata are doing more harm than good (I weigh my words), given that they put a significant strain on Wikidata without any hope to reach a useful state any time soon. It would be great if the roadmap discussion could be taken seriously: please just stop editing in this domain while no solution has emerged from that debate.

@egonw
Collaborator Author

egonw commented Apr 25, 2019

Some other Scholia-related types:

image

@Daniel-Mietchen
Member

Let me just acknowledge that I've seen this.

@Daniel-Mietchen added the "performance the way Scholia treats the machines using it" label on Apr 25, 2019
@egonw added this to Backend in Robustifying Scholia on Apr 29, 2019
@egonw changed the title from "topic annotation for articles should be throttled" to "topic annotation for articles could be throttled" on May 3, 2019
@Daniel-Mietchen
Member

@wetneb We have started https://www.wikidata.org/wiki/Wikidata:WikiProject_Limits_of_Wikidata to keep track of discussions related to the limits of Wikidata. Your contributions there would be appreciated.

@wetneb

wetneb commented Jul 2, 2019

@Daniel-Mietchen I am not sure what I can contribute there?

IMHO what we would need is an RFC to decide on a clear inclusion criterion for scholarly articles, which would limit the random growth and hopefully make the dataset more useful to consumers by announcing a clear scope.


3 participants