Skip to content
Compute PageRank on ~3 billion Wikipedia links on off-the-shelf hardware.
Python Shell
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
danker linted danker.py Jan 15, 2020
docs pylint fixes Jan 19, 2020
script
test pylint fixes Jan 19, 2020
.gitignore added pyc files to gitignore Oct 9, 2019
.travis.yml introduced shellcheck Jan 19, 2020
LICENSE Create LICENSE Jul 1, 2017
README.md added reference to Haveliwala, 1999 Jan 18, 2020
README_PR.md updated doc [skip ci] Oct 6, 2019
danker.sh introduced shellcheck Jan 19, 2020
release_instructions.md gave up travis/pypi integration Sep 14, 2019
requirements.txt added requirements.txt Oct 9, 2019
setup.py added python 3.8 Dec 27, 2019

README.md

Build Status Coverage Status Documentation Status PyPI GitHub license

danker

danker is a compilation of Bash and Python3 scripts that enables the computation of PageRank on Wikipedia on normal off-the-shelf hardware (e.g., a quad-core CPU, 8 GB of main memory, and 250 GB hard disk storage). The "--bigmem/-b" option enables to speed up computation given that enough main memory is available (this depends on the Wikipedia language edition, project and your hardware configuration).

  • INPUT Wikipedia language edition, e.g. "en" OR "ALL" (for computing PR on the union of all language editions using the bag-of-links model); optional parameter "--bigmem/-b".
  • PROCESSING danker downloads the required Wikipedia dump files (https://dumps.wikimedia.org/LANGwiki/latest/), resolves links, redirects, Wikidata Q-IDs, produces a link-file and computes PageRank.
  • OUTPUT
    • LANG-DUMPDATE.links - a link file (edge list, tab-separated), the input for PageRank. Every line reads from left to right:
    # Q-ID left --links to--> Q-ID right
    Q30    Q46
    
    • LANG-DUMPDATE.links.rank - a series of Wikidata Q-ID with their respective PageRank (sorted descending)

Requirements

  • python>=3.5
  • csvkit>=1.0.4
  • numpy>=1.16.3 and networkx>=2.3 for running unit tests

Usage

  usage: ./danker.sh [-h] [-p PROJECT] [-i ITERATIONS] [-d DAMPING] [-s START]
                     [-b] [-l]
                     wikilang

  Compute PageRank on Wikipedia.

  positional arguments:
    wikilang              Wikipedia language edition, e.g. "en". "ALL" for
                          computing PageRank over all languages available in a
                          project.

  optional arguments:
    -h, --help            show this help message and exit
    -p PROJECT, --project PROJECT
                          Wiki project, currently supported [wiki, books,
                          source, versity, news]. (default: wiki)
    -i ITERATIONS, --iterations ITERATIONS
                          PageRank number of iterations. (default: 40)
    -d DAMPING, --damping DAMPING
                          PageRank damping factor. (default: 0.85)
    -s START, --start START
                          PageRank starting value. (default: 0.1)
    -b, --bigmem          PageRank big memory flag. (default: False)
    -l, --links           Only extract links (skip PageRank). (default: False)

Examples

  • Compute PageRank on the current dump of English Wikipedia:

    $ ./danker.sh en
    $ ./danker.sh en --bigmem
  • Compute PageRank on the union of all language editions:

    $ ./danker.sh ALL
    $ ./danker.sh ALL --bigmem    # caution, you will need some main memory for that
  • Compute PageRank for each Wikipedia language edition separately:

    $ for i in $(./script/get_languages.sh); do ./danker.sh "$i"; done
    $ for i in $(./script/get_languages.sh); do ./danker.sh "$i" -b; done
  • Compute PageRank on the English version of Wikibooks:

    $ ./danker.sh en --project books
    $ ./danker.sh en --bigmem --project books
  • Compute PageRank on any other graph

Download

Output of ./danker.sh ALL on bi-weekly Wikipedia dumps.

Previous work

Before danker, I performed a number of experiments with DBpedia "page links" datasets most of which are documented at https://web.archive.org/web/20180222182923/https://people.aifb.kit.edu/ath/.

Test

The unit tests assure correctness and compare the results of danker to the PageRank implementation of NetworkX. The tests need the numpy and networkx libraries installed.

Execute the unit tests as follows:

python3 -m unittest test/danker_test.py

In the directory test is a small graph with which you can try out the PageRank core of danker.

$ ./danker/danker.py ./test/graphs/test.links 0.85 40 1
1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.
Computation of PageRank on './test/graphs/test.links' with danker took 0.00 seconds.
C	3.1828140590777672
B	3.5642607869667629
A	0.30410528185693986
D	0.3626006631927996
F	0.3626006631927996
E	0.75035528185693967
G	0.15000000000000002
H	0.15000000000000002
I	0.15000000000000002
K	0.15000000000000002
L	0.15000000000000002
$ ./danker/danker.py ./test/graphs/test.links ./test/graphs/test.links.right 0.85 40 1
1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.
Computation of PageRank on './test/graphs/test.links' with danker took 0.01 seconds.
A	0.30410528185693986
B	3.5642607869667629
C	3.1828140590777672
D	0.3626006631927996
E	0.75035528185693967
F	0.3626006631927996
G	0.15000000000000002
L	0.15000000000000002
K	0.15000000000000002
I	0.15000000000000002
H	0.15000000000000002

If you normalize the output values (divide each by 11) the values compare well to https://commons.wikimedia.org/wiki/File:PageRank-Beispiel.png or, if you compute percentages (division by the sum), they are similar to https://commons.wikimedia.org/wiki/File:PageRanks-Example.svg (same graph).

License

This software is licensed under GPLv3. (see https://www.gnu.org/licenses/).

FAQ

  1. The source code of danker is licensed under GPL v3. What about the output?

    The output of danker has no license. It can be used without attribution. However, if you use the PageRank scores it would be nice if you would provide reference to this page or, if you use the scores in an academic work, cite the following paper:

    @InCollection{Thalhammer2016,
        Title                    = {{PageRank on Wikipedia: Towards General Importance Scores for Entities}},
        Author                   = {Andreas Thalhammer and Achim Rettinger},
        Booktitle                = {The Semantic Web: ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 -- June 2, 2016, Revised Selected Papers},
        Publisher                = {Springer International Publishing},
        Year                     = {2016},
        Address                  = {Cham},
        Month                    = oct,
        Pages                    = {227--240},
        Doi                      = {10.1007/978-3-319-47602-5_41},
        ISBN                     = {978-3-319-47602-5},
        Url                      = {https://dx.doi.org/10.1007/978-3-319-47602-5_41}
    }
    
  2. The output format is a tab-separated values (TSV) file with Wikidata Qids and the respective rank. Can I have format xyz?

    We consider TSV as sufficient. Any other format and/or mapping can easily be produced with a simple script.

  3. Why is this not programmed with Apache Hadoop/NetworkX/etc.?

    We believe that ranking computations should be transparent. In the best case, everyone who wants to verify the computed rankings should be enabled to do so. Therefore, we also support computation on off-the-shelf hardware. We do this taking into account that we don't need this to finish in one or two days (also the Wikipedia dumps are not that frequent). In general, the provided code can be extended and also be ported to other platforms (under consideration of the license terms).

  4. Why does it take so long (up to one week) to compute PageRank with the ALL option?

    This goes in line with the previous point: we want to provide software that everyone with a standard laptop and some time can use. Of course it is possible to speed the computation up at the cost of required memory/computation power but we strongly believe that "this is for everyone".

  5. Can I use danker to compute PageRank on other graphs than Wikipedia?

    Sure, you can use the file ./danker/danker.py for computing PageRank on your graph. If you pass a "right sorted" file with the optional parameter right_sorted automatically the slower method with low memory footprint will be used. The memory-intensive method will be used otherwise. You can sort tab-separated files with the Unix command sort --key=2 -o output-file input-file. Type ./danker/danker.py -h for options. In addition, you can use danker as a library in your Python 3.x code (cf.: https://danker.rtfd.org)

  6. Why do the scores not form a nice probability distribution?

    This has multiple reasons. First, we do not compute the normalized version of PageRank. Instead of (1 - damping)/N (N is the total number of nodes) we only use (1 - damping). This doesn't change the ranking as it is just multiplying by a constant factor. Second, according to the theory, given the non-normalized version, all scores should add up to N. This would only be true if there would be no dangling nodes (pages with no outlinks) - these serve as energy sinks. One way to mitigate this would be to create links from dangling nodes to all pages (including itself). However, this also would only introduce a constant factor and therefore also has no effect on the final ranking. More information on the topic can be found in Monica Bianchini, Marco Gori, and Franco Scarselli. 2005. Inside PageRank. ACM Trans. Internet Technol. 5, 1 (February 2005), 92-128. DOI: https://doi.org/10.1145/1052934.1052938

  7. Sorted edge lists are not a common graph representation format (compared to adjacency list or adjacency matrix). Why is it useful in this particular case?

    This is a good question and there are multiple aspects to it. We know that the graph would not easily fit in some 8GB of memory (as we have ~3bn edges). The good news is: We don't have to fit it. Random access to get all out/in links of a specific node is not needed for computing PageRank as we access every node anyway (some of these ideas are presented in Taher H. Haveliwala. 1999. Efficient Computation of PageRank. http://ilpubs.stanford.edu:8090/386/).

    With sorted edge lists we gain two main advantages:

    1. We can walk through the graph node by node just by reading the lines of a file consecutively.
    2. We can transform quickly from the best way accessing out-links to the best way of accessing in-links by sorting by the second column ("best way" refers to this specific case).

    Trade offs:

    1. We use much more diskspace than actually needed as we repeat nodes (compared to adjacency lists). Still, computation usually needs < 100GB of space and disk space is cheaper than memory.
    2. Isolated nodes can not be represented with edge lists. However their PageRank would be (1 - damping).
    3. Computation by iterating over files is much slower than storing the graph in memory. If you have a graph that can fit into memory you can use the --bigmem/-b option and speed up computation time.
  8. What is the ALL option and what is the bag-of-links model?

    Naturally, the PageRank algorithm works with lists rather than sets. It also does not make any assumptions on the uniqueness of a link on a given web site. Therefore, if a link to page B apears multiple times on page A (say twice) it basically means that page B gets double the share of A's PageRank score than any other page C that is only linked once on page A. The following example shows that scenario:

    A -> B
    A -> C
    A -> B
    

    B gets 2/3 * PR(A) while C gets 1/3 * PR(A) (the damping factor is ignored here for brevity). This basic principle holds for every web site and can also be leveraged for Wikipedia. However, in the particular case of Wikipedia there is a "one link only" policy and multiple occurrences of a link may be arbitrary or due to quality issues of the article. However, with the ALL option we extract and align all page links from all Wikipedia language editions via the Wikidata identifier of each article. In this mode, we leverage the different language editions as link voting entities - the above example with Wikipedia language editions:

    A -> B   # en
    A -> C   # en
    A -> B   # de
    

    For example, the link Q159 -> Q46 appears in 254 different language editions. As stated above, PageRank automatically handles multiple link occurrences and adjusts the scores accordingly. The whole methodology is similar to the bag-of-words model of the information retrieval domain where the word count is tracked and the resulting frequencies are fed into a probabilistic model (such as topic modeling). Due to this relation we call this method informally "bag-of-links model".

You can’t perform that action at this time.