
Compute PageRank #10

Open
brawer opened this issue Dec 13, 2022 · 1 comment
Comments


brawer commented Dec 13, 2022

It would be nice if QRank were to run the PageRank algorithm on the link graph in Wikipedia and sister projects. Something like https://github.com/athalhammer/danker but less resource-hungry, so it can work on all language editions together, be put in production, and run (weekly or at least bi-weekly) in the Wikimedia cloud. The results should be made available for public download, similar to the existing QRank signal.

Personally, I’m not convinced that PageRank provides a better ranking signal than the existing QRank — PageRank is a mathematical model that predicts hypothetical user behavior (by analyzing the structure of the link graph), whereas QRank measures what real users have done in practice (by analyzing Wikimedia logs). However, the choice of ranking algorithm should be left to users: if someone wants PageRank, they should be able to get it, easily and reliably. And there’s definitely value in being able to combine multiple ranking signals.
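For reference, the algorithm in question can be sketched in a few lines. This is a minimal power-iteration PageRank on a tiny hand-made graph, purely illustrative — a real run over the Wikipedia link graph would stream edges from dump files rather than hold a dict in memory:

```python
# Minimal PageRank via power iteration. `links` maps each page to the
# pages it links to; ranks always sum to 1.
def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # distribute this page's rank over its out-links
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, "C" is linked from both "A" and "B", so it ends up ranked above "B", which only "A" links to.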


athalhammer commented Feb 16, 2023

Hi @brawer, seeing this just now. I designed danker to run with 8 GB of main memory on a 2-4 core CPU. The problem is that establishing the link graph uses between 300 and 400 GB of disk space (the unzipped graph is later around 100 GB). That is the only resource-intensive part, but I could run this on a Raspberry Pi with 8 GB RAM and a connected USB disk.

Danker can, in fact, run on all language editions together (see here for the link graph), and I'm also running the script irregularly and offering the scores here in multiple formats. The HDT format is particularly neat: after downloading some 200 MB and running it in docker-compose, you can issue federated queries against the Wikidata endpoint.

Update 2023-NOV: I bought a Raspberry Pi 4B (8 GB) to save on cloud costs. So far, the computation runs smoothly.

brawer added a commit that referenced this issue May 21, 2024
As of May 2024, Wikimedia does not dump the `interwikimap` table
into the periodic SQL dumps, so we fetch it from live sites via
the public API. We’ll need this table to resolve inter-wiki links.

#10
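The commit message above describes fetching the `interwikimap` table from live sites via the public MediaWiki API. A hedged sketch of what that lookup could look like, using the documented `action=query&meta=siteinfo&siprop=interwikimap` endpoint — the sample response below is illustrative of the API's JSON shape, not actually fetched, and is not QRank's own code:

```python
# Build the API request for a wiki's interwiki map. In production the
# URL would be fetched with json.load(urllib.request.urlopen(url)).
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
params = {"action": "query", "meta": "siteinfo",
          "siprop": "interwikimap", "format": "json"}
url = API + "?" + urlencode(params)

# Abbreviated, illustrative shape of the real API answer: a list of
# entries with an interwiki prefix and a URL template containing $1.
sample_response = {
    "query": {"interwikimap": [
        {"prefix": "wikt", "url": "https://en.wiktionary.org/wiki/$1"},
        {"prefix": "b", "url": "https://en.wikibooks.org/wiki/$1"},
    ]}
}

def interwiki_prefixes(response):
    """Map interwiki prefix -> URL template with a $1 placeholder."""
    return {e["prefix"]: e["url"] for e in response["query"]["interwikimap"]}

prefixes = interwiki_prefixes(sample_response)
```

Resolving an inter-wiki link like `wikt:foo` then amounts to looking up the `wikt` prefix and substituting the title for `$1`.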
brawer added a commit that referenced this issue May 24, 2024
We’ll eventually store other files than just `page_signals`.
#10
brawer added commits that referenced this issue on May 25, May 27, and May 28, 2024
brawer added a commit that referenced this issue May 31, 2024
Because it has the same format and sort order as the `titles` file,
we can trivially merge the two together when resolving links.

#10
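The merge described in that commit message — two line-sorted files with the same format combined in one pass — can be sketched with `heapq.merge`, which streams both inputs without loading either into memory. The file contents here are illustrative stand-ins, not the actual `titles` data:

```python
# Merge two line-sorted text streams into `out` in a single streaming
# pass; heapq.merge assumes each input is already sorted.
import heapq
import io

def merge_sorted(a, b, out):
    out.writelines(heapq.merge(a, b))

left = io.StringIO("Aachen\nZurich\n")
right = io.StringIO("Bern\nGeneva\n")
dest = io.StringIO()
merge_sorted(left, right, dest)
merged = dest.getvalue()
```

In practice the streams would be open file handles over the `titles` file and its companion, and the output would be written straight to disk.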