Compute PageRank #10
Comments
Hi @brawer, seeing this just now. Hm, I designed danker to run with 8 GB of main memory on a 2-4 core CPU. The problem is disk space: until the link graph is established, it uses between 300 and 400 GB (later, the unzipped graph is around 100 GB). That is the only resource-intensive part, but I could run this on a Raspberry Pi with 8 GB of RAM and a connected USB disk. Danker can, in fact, run on all language editions together (see here for the link graph), and I also run the script irregularly and offer the scores here in multiple formats. The HDT format is particularly neat, as you can run federated queries against the Wikidata endpoint after downloading just some 200 MB and running it in docker-compose. Update 2023-NOV: I bought a Raspberry Pi 4B with 8 GB to save on cloud costs. So far, computation runs smoothly.
As of May 2024, Wikimedia does not dump the `interwikimap` table into the periodic SQL dumps, so we fetch it from live sites via the public API. We’ll need this table to resolve inter-wiki links. #10
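For illustration, here is a minimal Python sketch of fetching the interwiki map from a live site. It assumes the standard MediaWiki Action API `siteinfo` query with `siprop=interwikimap`; the function names, the User-Agent string, and the choice to keep only `prefix` and `url` are hypothetical and not taken from this project's code.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def interwikimap_url(site):
    """Build the siteinfo query URL for a wiki host such as 'en.wikipedia.org'."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "interwikimap",
        "format": "json",
    }
    return f"https://{site}/w/api.php?{urlencode(params)}"


def parse_interwikimap(payload):
    """Map each interwiki prefix to its URL pattern (e.g. 'https://.../wiki/$1')."""
    entries = payload.get("query", {}).get("interwikimap", [])
    return {entry["prefix"]: entry["url"] for entry in entries}


def fetch_interwikimap(site):
    """Fetch and parse the interwiki map of one live site (needs network access)."""
    # Hypothetical User-Agent; a real client should identify itself properly.
    request = Request(
        interwikimap_url(site),
        headers={"User-Agent": "interwikimap-sketch/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return parse_interwikimap(json.load(response))
```

Calling `fetch_interwikimap("en.wikipedia.org")` would return a dictionary keyed by prefix. Parsing is kept separate from fetching so it can be exercised without touching the network.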
We’ll eventually store other files than just `page_signals`. #10
Because it has the same format and sort order as the `titles` file, we can trivially merge the two together when resolving links. #10
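A rough Python sketch of why matching format and sort order makes the merge trivial: two sorted streams can be combined in a single pass without loading either into memory. The line layout (`key value`) and the function name are assumptions made for the example; the actual file format is not described in this thread.

```python
import heapq


def merge_sorted_lines(first, second):
    """Lazily merge two iterables of lines that are already sorted the same way.

    Assumes both inputs use plain string (code point) ordering, so the
    merged stream comes out in that same order.
    """
    return heapq.merge(first, second)


# Hypothetical sample data standing in for two sorted files.
titles = ["en/Berlin Q64", "en/Paris Q90"]
others = ["en/Bern Q70", "en/Zurich Q72"]

for line in merge_sorted_lines(titles, others):
    print(line)
```

Because `heapq.merge` is lazy, the same function works on open file objects, reading one line at a time from each.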
It would be nice if QRank were to run the PageRank algorithm on the link graph in Wikipedia and sister projects. Something like https://github.com/athalhammer/danker but less resource-hungry, so it can work on all language editions together, be put in production, and run (weekly or at least bi-weekly) in the Wikimedia cloud. The results should be made available for public download, similar to the existing QRank signal.
Personally, I’m actually not super convinced that PageRank provides a better ranking signal than the existing QRank. Basically, PageRank is a mathematical model that tries to predict hypothetical user behavior (by analyzing the structure of the link graph), whereas QRank measures what real users have done in practice (by analyzing Wikimedia logs). However, the choice of ranking algorithm should be left to users: if someone wants PageRank, they should be able to get it, easily and reliably. And there’s definitely value in being able to combine multiple ranking signals.
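To make the "mathematical model over the link graph" part concrete, here is a small in-memory sketch of PageRank by power iteration. It is the textbook formulation, not danker's or QRank's implementation; the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [linked pages]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every page passes an equal share of its rank along each link.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Tiny example graph: "a" is linked from both "b" and "c".
scores = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

A production run over all language editions would need an on-disk or streaming graph representation rather than Python dictionaries; this only shows the iteration itself.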