
Compute PageRank #10

Open
brawer opened this issue Dec 13, 2022 · 1 comment
Comments


brawer commented Dec 13, 2022

It would be nice if QRank were to run the PageRank algorithm on the link graph in Wikipedia and sister projects. Something like https://github.com/athalhammer/danker but less resource-hungry, so it can work on all language editions together, be put in production, and run (weekly or at least bi-weekly) in the Wikimedia cloud. The results should be made available for public download, similar to the existing QRank signal.

Personally, I’m not convinced that PageRank provides a better ranking signal than the existing QRank — PageRank is a mathematical model that predicts hypothetical user behavior (by analyzing the structure of the link graph), whereas QRank measures what real users have done in practice (by analyzing Wikimedia logs). However, the choice of ranking algorithm should be left to users: if someone wants PageRank, they should be able to get it, easily and reliably. And there’s definitely value in being able to combine multiple ranking signals.
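For reference, the algorithm in question can be sketched in a few lines. This is a minimal power-iteration PageRank on a tiny hand-made graph, purely illustrative — a real run over the Wikipedia link graph would stream edges from dump files rather than hold a dict in memory:

```python
# Minimal PageRank via power iteration. `links` maps each page to the
# pages it links to; ranks always sum to 1.
def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # distribute this page's rank over its out-links
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph, "C" is linked from both "A" and "B", so it ends up ranked above "B", which only "A" links to.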


athalhammer commented Feb 16, 2023

Hi @brawer, seeing this just now. I designed danker to run with 8 GB of main memory on a 2-4 core CPU. The problem is that establishing the link graph uses between 300 and 400 GB of disk space (the unzipped graph is later around 100 GB). That is the only resource-intensive part, but I could run this on a Raspberry Pi with 8 GB RAM and a connected USB disk.

Danker can, in fact, run on all language editions together (see here for the link graph), and I'm also running the script irregularly and offering the scores here in multiple formats. The HDT format is particularly neat: after downloading some 200 MB and running it in docker-compose, you can issue federated queries against the Wikidata endpoint.

Update 2023-NOV: I bought a Raspberry Pi 4B (8 GB) to save on cloud costs. So far, the computation runs smoothly.

brawer added a commit that referenced this issue May 21, 2024
As of May 2024, Wikimedia does not dump the `interwikimap` table
into the periodic SQL dumps, so we fetch it from live sites via
the public API. We’ll need this table to resolve inter-wiki links.

#10
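The commit message above describes fetching the `interwikimap` table from live sites via the public MediaWiki API. A hedged sketch of what that lookup could look like, using the documented `action=query&meta=siteinfo&siprop=interwikimap` endpoint — the sample response below is illustrative of the API's JSON shape, not actually fetched, and is not QRank's own code:

```python
# Build the API request for a wiki's interwiki map. In production the
# URL would be fetched with json.load(urllib.request.urlopen(url)).
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
params = {"action": "query", "meta": "siteinfo",
          "siprop": "interwikimap", "format": "json"}
url = API + "?" + urlencode(params)

# Abbreviated, illustrative shape of the real API answer: a list of
# entries with an interwiki prefix and a URL template containing $1.
sample_response = {
    "query": {"interwikimap": [
        {"prefix": "wikt", "url": "https://en.wiktionary.org/wiki/$1"},
        {"prefix": "b", "url": "https://en.wikibooks.org/wiki/$1"},
    ]}
}

def interwiki_prefixes(response):
    """Map interwiki prefix -> URL template with a $1 placeholder."""
    return {e["prefix"]: e["url"] for e in response["query"]["interwikimap"]}

prefixes = interwiki_prefixes(sample_response)
```

Resolving an inter-wiki link like `wikt:foo` then amounts to looking up the `wikt` prefix and substituting the title for `$1`.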
brawer added a commit that referenced this issue May 24, 2024
We’ll eventually store other files than just `page_signals`.
#10
brawer added commits that referenced this issue on May 25, May 27, and May 28, 2024
brawer added a commit that referenced this issue May 31, 2024
Because it has the same format and sort order as the `titles` file,
we can trivially merge the two together when resolving links.

#10
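The merge described in that commit message — two line-sorted files with the same format combined in one pass — can be sketched with `heapq.merge`, which streams both inputs without loading either into memory. The file contents here are illustrative stand-ins, not the actual `titles` data:

```python
# Merge two line-sorted text streams into `out` in a single streaming
# pass; heapq.merge assumes each input is already sorted.
import heapq
import io

def merge_sorted(a, b, out):
    out.writelines(heapq.merge(a, b))

left = io.StringIO("Aachen\nZurich\n")
right = io.StringIO("Bern\nGeneva\n")
dest = io.StringIO()
merge_sorted(left, right, dest)
merged = dest.getvalue()
```

In practice the streams would be open file handles over the `titles` file and its companion, and the output would be written straight to disk.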