Module for analyzing contributions to a topic on Wikipedia.
git clone
cd TopicContribs
python3 install


> python3 -m topics.cmdline
    cmdline --dumps=<path_to_dumps> --out=<path_to_output_dir>
            [--apm=<article_project_path>] [--pl=<project_list_path>]
            [--threads=<num_threads>] [--verbose] [<cohort_file> ... ]
    cmdline (-h | --help)
    --dumps=<path_to_dumps>      Directory containing the metadata dumps
    --out=<path_to_output_dir>   Directory in which to put output files
    --apm=<article_project_path> Path to a csv of page_id project_name pairs.
    --pl=<project_list_path>     Path to a csv with all project names that you
                                    would like to be included in the count.
    --threads=<num_threads>      Number of threads to be used. All available
                                    will be used if not specified.
    <cohort_file>                File containing usernames of interest.
    -v, --verbose                Generate verbose output.

Input files


The files in the --dumps directory must be full history dumps.

  • For minimal size and maximal parallelization, use <wiki>-<date>-stub-meta-history<number>.xml.gz
  • To use a single file instead, use <wiki>-<date>-stub-meta-history.xml.gz
  • If you have already downloaded the full-text history dumps, <wiki>-<date>-pages-meta-history<number>.xml-<page_range>.bz2 will also work.

You can use mwdumps to download the latest set of dumps:

  • python3 -m mwdumps.cmdline --wiki=enwiki -v /path/to/save/dumps
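Before starting a long run, it can help to check which files in the dump directory follow the stub naming scheme above. The following is a minimal sketch, not part of this module: find_stub_dumps is a hypothetical helper, and the pattern assumes an 8-digit date and a lowercase wiki name such as enwiki.

```python
import re
from pathlib import Path

# Assumed naming: <wiki>-<date>-stub-meta-history<number>.xml.gz, where the
# number is optional, the date is 8 digits and the wiki name is lowercase.
STUB_DUMP = re.compile(r"^[a-z0-9_]+-\d{8}-stub-meta-history\d*\.xml\.gz$")


def find_stub_dumps(dump_dir):
    """Return the sorted stub-meta-history dump paths found in dump_dir."""
    return sorted(
        path
        for path in Path(dump_dir).iterdir()
        if path.is_file() and STUB_DUMP.match(path.name)
    )
```

Calling find_stub_dumps("/path/to/save/dumps") returns only the matching files, so an empty result is a quick sign that the directory or the file names are not what the tool expects.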


This file (--apm) provides a map between articles and the projects they are included in. We expect it to be a .csv of page_id, project_name pairs.
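To make the expected shape concrete, here is a rough sketch of reading such a file into a lookup table. It is an illustration only: load_article_projects is a hypothetical name, and the sketch assumes page_id is the first column, project_name the second, and that any header row can be skipped because its first field is not numeric.

```python
import csv
import io
from collections import defaultdict


def load_article_projects(csv_file):
    """Map each page_id (int) to the set of project names it belongs to."""
    projects_by_page = defaultdict(set)
    for row in csv.reader(csv_file):
        # Skip blank lines and an optional header row (non-numeric page_id).
        if len(row) < 2 or not row[0].strip().isdigit():
            continue
        projects_by_page[int(row[0])].add(row[1].strip())
    return dict(projects_by_page)


# Made-up sample rows; a page may belong to more than one project.
sample = io.StringIO("page_id,project_name\n12,Medicine\n12,Biology\n25,Physics\n")
mapping = load_article_projects(sample)
```

Using a set per page keeps duplicate rows harmless and makes it cheap to test whether a page belongs to a given project.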


Generating this file

This file can be produced by running sql/page_project_map.sql on wmflabs and replacing <user_database> with your user database.


The project list (--pl) is a file listing all of the project names we are interested in. The names must match those in the project_name column of the article_project_path file in order for the corresponding pages to be counted.
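Because a mismatched name silently drops pages from the count, a quick consistency check before running can save time. The sketch below is an assumption-laden illustration rather than part of this module: unmatched_projects and read_column are hypothetical helpers, the project list is assumed to hold one name per row in its first column, and matching is assumed to be exact and case-sensitive.

```python
import csv
import io


def read_column(csv_file, index):
    """Collect the stripped values of one column, skipping short rows."""
    return {row[index].strip() for row in csv.reader(csv_file) if len(row) > index}


def unmatched_projects(project_list_file, article_project_file):
    """Return project names from the list that never occur in the map."""
    wanted = read_column(project_list_file, 0)
    known = read_column(article_project_file, 1)
    return sorted(wanted - known)


# Made-up data: "physics" differs in case from "Physics" in the map.
project_list = io.StringIO("Medicine\nphysics\n")
article_projects = io.StringIO("12,Medicine\n25,Physics\n")
print(unmatched_projects(project_list, article_projects))  # ['physics']
```

Any name reported here would contribute no pages under these assumptions and is worth correcting in the project list.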


The cohort files (<cohort_file>) are a file or set of files listing the usernames of the users we want to track. If multiple files are given, each cohort is summed separately and written to its own output file.
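The idea of keeping cohorts separate can be sketched as follows. This is not the module's own loader: load_cohort and load_cohorts are hypothetical names, and the sketch assumes a plain-text file with one username per line, which the text above does not specify.

```python
from pathlib import Path


def load_cohort(path):
    """Read one cohort file into a set of usernames (one per line assumed)."""
    with open(path, encoding="utf-8") as cohort_file:
        return {line.strip() for line in cohort_file if line.strip()}


def load_cohorts(paths):
    """Keep each cohort separate, keyed by the file name without extension."""
    return {Path(path).stem: load_cohort(path) for path in paths}
```

Keying by file name mirrors the one-output-per-cohort behaviour described above: each set can be counted independently of the others.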

Output files

We will output one timeseries file for each cohort_file and one extra general file for all activity.


You can use topicutils.tsvToCsv -i <input.tsv> -o <output.csv> to convert a .tsv generated by the wmflabs databases to a .csv.
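For readers curious what such a conversion involves, here is a minimal standard-library sketch. It is not the topicutils implementation: tsv_to_csv is a hypothetical function, and it assumes the .tsv uses plain tab separators with no quoting.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file):
    """Copy tab-separated rows to csv_file as comma-separated rows."""
    # QUOTE_NONE: treat quote characters in the TSV as ordinary text.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # The writer quotes any field that itself contains a comma.
    writer = csv.writer(csv_file, lineterminator="\n")
    writer.writerows(reader)


# Made-up sample: the project name contains a comma and must be quoted.
source = io.StringIO("page_id\tproject_name\n12\tArts, Culture\n")
target = io.StringIO()
tsv_to_csv(source, target)
```

Going through the csv module rather than replacing tabs with commas keeps fields that contain commas intact.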