WikiConv is a multilingual corpus encompassing the history of conversations on Wikipedia Talk Pages—including the deletion, modification and restoration of comments.
The dataset and the reconstruction process for the corpus are described in the paper WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community, presented at EMNLP 2018.
The work was also presented at the June 2018 Wikipedia research showcase (the first of this pair of talks describes our work using an earlier version of this dataset to predict conversations going awry).
The corpus currently includes all conversations extracted from the 2018-07-01 Wikipedia dumps of English, Chinese, German, Greek, and Russian.
Downloading the dataset
The data is available for download from
- Google Cloud: https://console.cloud.google.com/storage/browser/wikidetox-wikiconv-public-dataset
- Figshare: https://figshare.com/projects/WikiConv_A_Corpus_of_the_Complete_Conversational_History_of_a_Large_Online_Collaborative_Community/57110
If you believe there is any information in this dataset that should be removed, please file a GitHub issue in this repository or email
In the WikiConv paper we use a technical definition of a conversation in terms of a section created in a Wikipedia Talk Page. Using that definition, the scale of the dataset is as follows:
| Language | Talk Pages | Revisions | Users | Conversational Actions | Conversations | Conversations with > 1 participant |
|---|---|---|---|---|---|---|
However, note that this may not necessarily match an intuitive definition of what a conversation is; in particular, such discussions may contain the comments of only a single editor, with no reply from anyone else. A better estimate of the scale of human conversations can be obtained with a query of the form:
```sql
SELECT COUNT(*)
FROM (
  SELECT conversation_id, COUNT(DISTINCT user_text) AS cnt
  FROM `wikiconv_v2.en_20180701_conv`
  WHERE type = 'ADDITION' OR type = 'CREATION'
  GROUP BY conversation_id
)
WHERE cnt > 1
```
For the English language dataset, this results in 8.7M rows (conversations with comments from at least two distinct users).
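The same multi-participant filter can also be applied locally, without BigQuery. Below is a minimal Python sketch of that logic; the sample records are made up for illustration, but the field names (`conversation_id`, `type`, `user_text`) follow the schema described later in this document:

```python
from collections import defaultdict

def multi_participant_conversations(actions):
    """Return ids of conversations whose ADDITION/CREATION actions
    come from at least two distinct users (mirrors the SQL query)."""
    users_by_conv = defaultdict(set)
    for a in actions:
        if a["type"] in ("ADDITION", "CREATION"):
            users_by_conv[a["conversation_id"]].add(a["user_text"])
    return {cid for cid, users in users_by_conv.items() if len(users) > 1}

# Toy sample: "c1" has two participants; "c2" has only one
# (the MODIFICATION by Dave does not count as participation here).
sample = [
    {"conversation_id": "c1", "type": "CREATION", "user_text": "Alice"},
    {"conversation_id": "c1", "type": "ADDITION", "user_text": "Bob"},
    {"conversation_id": "c2", "type": "CREATION", "user_text": "Carol"},
    {"conversation_id": "c2", "type": "MODIFICATION", "user_text": "Dave"},
]
print(multi_participant_conversations(sample))  # {'c1'}
```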
Format of the reconstruction actions
Due to Wikipedia's format and editing style, conversations are just snapshots of a history of edits (or revisions) that take place on a given Talk Page. We parse these revisions, compute diffs, and categorize edits into 5 kinds of 'actions':
CREATION: An edit that creates a new section in wiki markup.
ADDITION: An edit that adds a new comment to the thread of a conversation.
MODIFICATION: An edit that modifies an existing comment on a Talk Page.
DELETION: An edit that removes a comment from a Talk Page.
RESTORATION: An edit (for example, a revert) that restores a previously removed comment.
Each row of the table corresponds to an action as defined in the section above. The schema for the table is then:
id: An id for the action that the row represents.
conversation_id: The id of the conversation in which the action took place. This is usually the id of the first action of the conversation.
page_title: The name of the Talk Page where the action occurred.
indentation: In wiki markup, the level of indentation represents the reply-depth in the conversation tree.
replyTo_id: The id of the action that this action is a reply to.
content: The text of the comment or section underlying the action.
cleaned_content: The text of the comment or section underlying the action without MediaWiki markup.
user_text: The name of the user that made the edit from which the action was extracted.
rev_id: The Wikipedia revision id of the edit from which the action was extracted.
type: The type of action that the row represents. This will be one of the types enumerated in the previous section.
user_id: The Wikipedia id of the user making the edit from which the action was extracted.
page_id: The Wikipedia id of the page on which the action took place.
timestamp: The timestamp of the edit from which the action was extracted.
parent_id: For MODIFICATION, DELETION, and RESTORATION actions, the id of the action that was modified, deleted, or restored, respectively.
ancestor_id: For MODIFICATION, DELETION, and RESTORATION actions, the id of the action that first created the comment being modified, deleted, or restored.
toxicity: The score assigned to the content by the Perspective API's TOXICITY attribute (only available for the English corpus).
severe_toxicity: The score assigned to the content by the Perspective API's SEVERE_TOXICITY attribute (only available for the English corpus).
The thresholds used in the paper are 0.64 and 0.92 for toxicity and severe_toxicity respectively.
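As a sketch, those thresholds can be applied to a row's scores as follows. The field names follow the schema above and the cutoffs come from the paper; treating a score exactly at the threshold as positive is an assumption made here for illustration:

```python
TOXICITY_THRESHOLD = 0.64
SEVERE_TOXICITY_THRESHOLD = 0.92

def label_action(row):
    """Label a dataset row using the paper's thresholds.
    Scores may be absent for non-English rows, so check for None."""
    labels = []
    if row.get("toxicity") is not None and row["toxicity"] >= TOXICITY_THRESHOLD:
        labels.append("toxic")
    if (row.get("severe_toxicity") is not None
            and row["severe_toxicity"] >= SEVERE_TOXICITY_THRESHOLD):
        labels.append("severely_toxic")
    return labels

print(label_action({"toxicity": 0.81, "severe_toxicity": 0.40}))  # ['toxic']
```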
Visualization of the English Dataset
You can play with our visualization of the English Dataset. This data represents conversations whose comments have been scored by the Perspective API, so you can browse the comments by toxicity level. Click on a comment to see the whole conversation in which it occurs; click on the link to be directed to the revision in which the comment was posted. You can also search comments by page or user.
If you find a comment that contains personal information, please contact us at yiqing(at)cs(dot)cornell(dot)edu and we will notify the Wikimedia Foundation, as well as remove it from this dataset.
This system is still under development; any suggestions are welcome!
The Conversation Reconstruction Process
The code in this repository contains a Python package with tools to reconstruct the conversation structure of Wikipedia Talk Pages.
Please note that this package (currently) uses Python 2.7.
This reconstruction tool aims to show Wikipedia conversations with their full history, including not just new posts but also modifications, deletions, and reverts. Rather than showing only a snapshot of a conversation, the WikiConv dataset includes all of the actions that led to its final state.
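As a simplified illustration of the difference, a hypothetical action history can be replayed into the final snapshot. Actions that were later deleted disappear from the snapshot but remain in the history. This sketch assumes each MODIFICATION/RESTORATION action carries the resulting text in `content`; the real reconstruction pipeline handles diffs and id chains more carefully:

```python
def replay(actions):
    """Replay a chronological list of WikiConv-style actions into the
    final set of visible comments, keyed by the id of the affected action."""
    visible = {}
    for a in actions:
        t = a["type"]
        if t in ("CREATION", "ADDITION"):
            visible[a["id"]] = a["content"]
        elif t == "MODIFICATION":
            visible[a["parent_id"]] = a["content"]
        elif t == "DELETION":
            visible.pop(a["parent_id"], None)
        elif t == "RESTORATION":
            visible[a["parent_id"]] = a["content"]
    return visible

# Hypothetical history: a section is created, a comment is added,
# then that comment is deleted. Only the section survives the replay.
history = [
    {"id": "c1", "type": "CREATION", "content": "== Section =="},
    {"id": "a1", "type": "ADDITION", "content": "A comment"},
    {"id": "d1", "type": "DELETION", "parent_id": "a1", "content": ""},
]
print(sorted(replay(history)))  # ['c1']
```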
Setup the environment
In the current directory:
- Follow the steps in section 1 to set up your cloud project. Note: do not install the newest Google Cloud Dataflow release, as it may be incompatible with some of the packages listed in requirements.txt.
- Use your service account to set up boto:
gsutil config -e
- Set up your Python environment:
  - Create a virtualenv environment.
  - Activate it: `. /path/to/directory/bin/activate`
  - Install the dependencies: `pip install -r requirements.txt`
Run the pipeline
- Copy the template configuration:

  `rsync --ignore-existing ./template.config ./config/wikiconv.config`

  Then edit the configuration for your cloud resources.