# Converting the Wikipedia Articles for Deletion (*AfD*) dataset into ConvoKit format

In this notebook we convert the Wikipedia Articles for Deletion [dataset](https://github.com/emayfield/AFD_Decision_Corpus) by Elijah Mayfield and Alan W. Black into ConvoKit format.

Here is an example of a Wikipedia Articles for Deletion page: https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Andrew_Nellis

In [1]:
import pandas as pd
from convokit import Corpus, Speaker, Utterance
import re
# import glob, os, csv
import json
import numpy as np
from tqdm import tqdm

## Load the data

Instructions on how to download `afd_2019_full_policies.json` as well as `pandas_afd` directory can be found [here](https://github.com/emayfield/AFD_Decision_Corpus).

In [2]:
with open("afd_2019_full_policies.json", 'r') as f:
    afd = json.load(f)

In [3]:
afd.keys()

dict_keys(['Users', 'Discussions', 'Outcomes', 'Contributions', 'Citations'])

We are also going to use the `users_df.csv` file, as it provides more information on Wikipedia users than `afd_2019_full_policies.json` does.

In [4]:
users_df = pd.read_csv("pandas_afd/users_df.csv")

## Create Speaker Objects

In [5]:
users_df = users_df.drop(columns=['Unnamed: 0'])
users_df.head(5)

Unnamed: 0,user_id,name,editcount,signup,gender
0,200000001,Mangojuice,19969.0,2005-01-27T20:54:10Z,unknown
1,200000002,Vic sinclair,51.0,2005-07-16T16:30:17Z,unknown
2,200000003,69.196.150.118,,,
3,200000004,TruthbringerToronto,6606.0,2006-05-07T20:34:14Z,unknown
4,200000005,Desertsky85451,3247.0,2006-07-17T16:54:56Z,unknown


Certain users are repeated in the csv file. In cases of duplicates, we will only include the last occurrence of the user.

In [6]:
pd.concat(g for _, g in users_df.groupby("user_id") if len(g) > 1)

Unnamed: 0,user_id,name,editcount,signup,gender
581,200000582,El C,103227.0,2004-08-09T10:55:09Z,unknown
582,200000582,El C,103229.0,2004-08-09T10:55:09Z,unknown
3575,200003575,Czar,95811.0,2005-06-11T17:16:04Z,unknown
3576,200003575,Czar,95812.0,2005-06-11T17:16:04Z,unknown
4163,200004162,K.e.coffman,91090.0,2014-09-22T03:23:48Z,unknown
4164,200004162,K.e.coffman,91100.0,2014-09-22T03:23:48Z,unknown
6984,200006982,David Fuchs,36826.0,2005-10-15T19:21:06Z,male
6985,200006982,David Fuchs,36827.0,2005-10-15T19:21:06Z,male
21280,200021277,Discospinster,277715.0,2004-06-27T18:41:07Z,unknown
21281,200021277,Discospinster,277716.0,2004-06-27T18:41:07Z,unknown


Modify the dataframe to only contain meta information for speakers and create the speaker objects.

In [7]:
speaker_meta = users_df.replace({np.nan:None}).drop_duplicates(subset=['user_id'], keep='last').set_index('user_id').T.to_dict()
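
To illustrate the keep-last rule on a small scale, here is a minimal sketch on toy rows copied from two of the duplicated users shown above. The names `toy_users` and `toy_meta` are introduced only for this illustration, and `to_dict(orient="index")` is used in place of `.T.to_dict()`; both yield one metadata dictionary per user.

```python
import pandas as pd

# Toy rows mirroring two of the duplicated users shown above (illustration only).
toy_users = pd.DataFrame({
    "user_id": [200000582, 200000582, 200003575, 200003575],
    "name": ["El C", "El C", "Czar", "Czar"],
    "editcount": [103227.0, 103229.0, 95811.0, 95812.0],
})

# keep='last' retains only the final row seen for each user_id.
toy_meta = (
    toy_users
    .drop_duplicates(subset=["user_id"], keep="last")
    .set_index("user_id")
    .to_dict(orient="index")
)

print(toy_meta[200000582]["editcount"])  # 103229.0
```

The later row wins, so the stored edit count is the most recent one recorded for that user.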

In [8]:
speaker_objects = {}
for s_id in speaker_meta:
    speaker_objects[str(s_id)] = Speaker(id=str(s_id), meta=speaker_meta[s_id])

In [9]:
len(speaker_objects)

179859

The full dataset contains 179,859 distinct users, each of which becomes a speaker object.

Here are examples of speaker objects:

In [10]:
speaker_objects['200000582']

Speaker({'obj_type': 'speaker', 'meta': {'name': 'El C', 'editcount': 103229.0, 'signup': '2004-08-09T10:55:09Z', 'gender': 'unknown'}, 'vectors': [], 'owner': None, 'id': '200000582'})

In [11]:
speaker_objects['200000003']

Speaker({'obj_type': 'speaker', 'meta': {'name': '69.196.150.118', 'editcount': None, 'signup': None, 'gender': None}, 'vectors': [], 'owner': None, 'id': '200000003'})

## Create Utterance Objects

Here, we use the `Contributions` list of the `afd_2019_full_policies.json` dictionary. The Mayfield data categorizes contributions into three classes: ***nominations*** for deletion (these tend to happen at the beginning of the discussion, but not all discussions start with a nomination), ***votes*** by users to delete/keep the article, each followed by a rationale for the vote, and general ***non-voting comments*** made by users.

Below are examples of a nomination, a vote, and a non-voting comment, in that respective order:

In [12]:
afd['Contributions'][0], afd['Contributions'][1], afd['Contributions'][10]

({'Parent': -1,
  'Discussion': 100000001,
  'Timestamp': 1158550020.0,
  'User': 200000002,
  'Text': 'Suspected vanity page.  Person clearly not encyclopedic ',
  'ID': 600000001},
 {'Parent': -1,
  'Discussion': 100000001,
  'Timestamp': 1158737220.0,
  'User': 200000003,
  'Label': 'keep',
  'Raw': 'keep',
  'Rationale': "*'''Keep''' - I think we should keep this page as Andrew is a notable figure in Canadian labour politics as is also a well-known figure in internet circles. He has appeared many times on local television news as well as his name appearing in all the local newspapers. --",
  'ID': 400000001},
 {'Parent': -1,
  'Discussion': 100000001,
  'Timestamp': 1158789840.0,
  'User': 200000008,
  'Text': '::This "Information" is completely irrelevant to whether the article merits deletion or not     and appears to be little more than an unfounded attempt to vilify those in support of keeping the page. ',
  'ID': 500000001})
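
In the examples above, the leading digit of the `ID` distinguishes the three classes (`4` for votes, `5` for non-voting comments, `6` for nominations), and votes keep their text under `Rationale` rather than `Text`. A minimal sketch of that convention follows; the helper names `contribution_type` and `contribution_text` are hypothetical, the prefix mapping is inferred from the examples rather than taken from the dataset documentation, and the vote rationale is abbreviated.

```python
# Leading digit of the contribution ID -> contribution class (inferred from the examples above).
TYPE_BY_PREFIX = {"4": "vote", "5": "non-voting comment", "6": "nomination"}


def contribution_type(contribution):
    """Return the contribution class encoded in the first digit of its ID."""
    return TYPE_BY_PREFIX[str(contribution["ID"])[0]]


def contribution_text(contribution):
    """Votes store their text under 'Rationale'; other contributions use 'Text'."""
    key = "Rationale" if contribution_type(contribution) == "vote" else "Text"
    return contribution[key]


# Abbreviated copies of the nomination and the vote shown above.
nomination = {"ID": 600000001, "Text": "Suspected vanity page.  Person clearly not encyclopedic "}
vote = {"ID": 400000001, "Label": "keep", "Raw": "keep",
        "Rationale": "*'''Keep''' - I think we should keep this page"}

print(contribution_type(nomination))  # nomination
print(contribution_type(vote))        # vote
```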

**Observe that** the `Parent` key in each of the contribution dictionaries has a value of `-1`. At this point the Mayfield data carries no information about conversation structure from which reply-to chains could be extracted. So, to make sure that ConvoKit's integrity checks do not throw errors, we introduce the following structure:
* Every first utterance (nomination, vote, or a non-voting comment) we encounter in the discussion does not have a parent utterance (i.e. reply-to is None)
* Voting comments and nominations (if they are not already first in the discussion) are replies to the first utterance in the discussion
* Non-voting comments are replies to either (i) the previous vote or (ii) the first utterance in the discussion if no vote has been cast yet.

In [13]:
utterance_objects = {}
seen_discussions = {}
previous_vote = '', '' #the last voting comments & discussion it occurred in

# We are also going to get citations information for each contributions from Citations list
citations_dct = {str(d['ID']): d['Citations'] for d in afd['Citations']}

for contribution in tqdm(afd['Contributions']):
         
            c_id = str(contribution['ID'])
            c_meta = {'citations': citations_dct.get(c_id, [])}
            c_speaker = str(contribution['User'])
            c_conversation_id = str(contribution['Discussion'])
            c_timestamp = contribution['Timestamp']

            
            #keep track of the first contribution in the discussion we encounter
            if c_conversation_id not in seen_discussions:
                seen_discussions[c_conversation_id] = c_id

                
            #if the contribution is a vote
            if c_id[0] == '4':
                c_meta.update({'type': 'vote', 
                               'label': contribution['Label'],
                               'raw_label':contribution['Raw']})
                #mask the bolded vote expression with the token "VOTE"
                c_text = re.sub("\'\'\'[^\']+\'\'\'", "VOTE", contribution['Rationale'])
                #votes are replies to the first contribution/utterance in the discussion
                c_reply_to = seen_discussions[c_conversation_id]                
                #keep track of the last voting comments & discussion it occurred in
                previous_vote = c_id, c_conversation_id

                
            #if the contribution is a non-voting comment    
            elif c_id[0] == '5':
                c_meta.update({'type':'non-voting comment',
                               'label': None,
                               'raw_label': None})
                c_text = contribution['Text']
                #when a non-voting comment happens before any vote was made, it is a reply to the first contribution in the discussion
                if previous_vote[1] != c_conversation_id: 
                    c_reply_to = seen_discussions[c_conversation_id]
                #when a comment happens after the vote in the discussion, it is a reply to that vote
                else:
                    c_reply_to = previous_vote[0]

                    
            #if contribution is a nomination        
            elif c_id[0] == '6':
                c_meta.update({'type':'nomination',
                               'label': None,
                               'raw_label': None})
                c_text = contribution['Text']
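                #nominations that do not open the discussion reply to its first contribution
                c_reply_to = seen_discussions[c_conversation_id]

                #nominations are expected only at the very beginning of a discussion; flag any exception
                if c_id != seen_discussions[c_conversation_id]:
                    print("Unexpected nomination in the middle of discussion", c_conversation_id, c_id)

            else:
                #unknown ID prefix: fail loudly rather than reuse values from the previous iteration
                raise ValueError("Unrecognized contribution type for ID " + c_id)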

                

            #The first comment is not a reply to any other contribution
            if c_id == seen_discussions[c_conversation_id]:
                    c_reply_to = None

            utterance_objects[c_id] = Utterance(id = c_id, 
                                                speaker = speaker_objects[c_speaker], 
                                                conversation_id = c_conversation_id, 
                                                reply_to = c_reply_to, 
                                                timestamp = c_timestamp, 
                                                text = c_text,
                                                meta = c_meta
                                               )       

100%|██████████| 3295340/3295340 [01:28<00:00, 37079.86it/s]
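
To make the three reply-to rules easier to check by hand, here is a minimal sketch that applies them to a toy discussion. The function `assign_reply_to` and the `(id, discussion, kind)` tuple format are hypothetical, and unlike the cell above the sketch tracks the last vote per discussion in a dictionary, which gives the same result when the contributions of a discussion appear contiguously.

```python
def assign_reply_to(contributions):
    """Assign reply-to ids following the three rules described above.

    `contributions` is a list of (id, discussion_id, kind) tuples in file order,
    where kind is 'nomination', 'vote' or 'comment' (toy format for illustration).
    """
    first_in_discussion = {}  # discussion id -> id of its first contribution
    last_vote = {}            # discussion id -> id of its most recent vote
    reply_to = {}
    for c_id, disc, kind in contributions:
        first = first_in_discussion.setdefault(disc, c_id)
        if c_id == first:
            # Rule 1: the first contribution of a discussion has no parent.
            reply_to[c_id] = None
        elif kind == "comment":
            # Rule 3: comments reply to the previous vote, else to the first contribution.
            reply_to[c_id] = last_vote.get(disc, first)
        else:
            # Rule 2: votes and late nominations reply to the first contribution.
            reply_to[c_id] = first
        if kind == "vote":
            last_vote[disc] = c_id
    return reply_to


toy = [
    ("n1", "d1", "nomination"),
    ("v1", "d1", "vote"),
    ("c1", "d1", "comment"),
    ("v2", "d1", "vote"),
    ("c2", "d1", "comment"),
    ("c3", "d2", "comment"),  # first contribution of a second discussion
    ("v3", "d2", "vote"),
]
print(assign_reply_to(toy))
```

In this toy run `c1` replies to `v1`, `c2` replies to `v2`, both votes in `d1` reply to the nomination `n1`, and `c3` is a root because it opens discussion `d2`.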


Number of discussions (i.e. ConvoKit conversations) in this data:

In [14]:
len(seen_discussions)

383918

Number of contributions (i.e. ConvoKit utterances) in this data:

In [15]:
len(utterance_objects)

3295340

However, note that some of these contributions are empty (or whitespace-only) strings, a result of the parsing/cleaning steps performed by the authors of the original dataset. The cell below collects them.

In [16]:
empty_string_contributions = []
for contribution in tqdm(afd['Contributions']):
    c_id = str(contribution['ID'])
    # votes store their text under 'Rationale'; all other contributions use 'Text'
    text = contribution['Rationale'] if c_id[0] == '4' else contribution['Text']
    # a contribution counts as empty when it contains no non-whitespace tokens
    if len(text.split()) == 0:
        empty_string_contributions.append(contribution)

100%|██████████| 3295340/3295340 [00:28<00:00, 114007.23it/s]


In [17]:
len(empty_string_contributions)

80290

Here is how a nomination, a vote, and a non-voting comment look as utterance objects (the nomination and the non-voting comment are the examples shown above):

In [18]:
utterance_objects['600000001']

Utterance({'obj_type': 'utterance', 'meta': {'citations': [], 'type': 'nomination', 'label': None, 'raw_label': None}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'name': 'Vic sinclair', 'editcount': 51.0, 'signup': '2005-07-16T16:30:17Z', 'gender': 'unknown'}, 'vectors': [], 'owner': None, 'id': '200000002'}), 'conversation_id': '100000001', 'reply_to': None, 'timestamp': 1158550020.0, 'text': 'Suspected vanity page.  Person clearly not encyclopedic ', 'owner': None, 'id': '600000001'})

In [19]:
utterance_objects['400000002']

Utterance({'obj_type': 'utterance', 'meta': {'citations': ['signatures'], 'type': 'vote', 'label': 'keep', 'raw_label': 'keep'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'name': 'TruthbringerToronto', 'editcount': 6606.0, 'signup': '2006-05-07T20:34:14Z', 'gender': 'unknown'}, 'vectors': [], 'owner': None, 'id': '200000004'}), 'conversation_id': '100000001', 'reply_to': '600000001', 'timestamp': 1158558120.0, 'text': '*VOTE. Notable Ottawa activist who has appeared on radio and television. See references. --01:42, 18 September 2006 (UTC) <small>—The preceding [[Wikipedia:Sign your posts on talk pages|unsigned]] comment was added by ', 'owner': None, 'id': '400000002'})
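
The `VOTE` token in the text above comes from masking the bolded wiki markup in the vote rationale. The sketch below applies the same regular expression as the conversion cell to a shortened rationale; the helper name `mask_vote` is introduced only for this illustration.

```python
import re

# Matches a span wrapped in triple apostrophes (wiki markup for bold text).
BOLD_SPAN = re.compile(r"'''[^']+'''")


def mask_vote(rationale):
    """Replace every bolded span in a vote rationale with the token VOTE."""
    return BOLD_SPAN.sub("VOTE", rationale)


print(mask_vote("*'''Keep''' - I think we should keep this page"))
# *VOTE - I think we should keep this page

print(mask_vote("'''Strong keep''' per '''WP:N'''"))
# VOTE per VOTE
```

Note that the pattern masks every bolded span in the rationale, not only the vote itself, as the second call shows.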

In [20]:
utterance_objects['500000001']

Utterance({'obj_type': 'utterance', 'meta': {'citations': [], 'type': 'non-voting comment', 'label': None, 'raw_label': None}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'name': 'Kroppie', 'editcount': 18.0, 'signup': '2006-02-14T14:30:43Z', 'gender': 'unknown'}, 'vectors': [], 'owner': None, 'id': '200000008'}), 'conversation_id': '100000001', 'reply_to': '400000009', 'timestamp': 1158789840.0, 'text': '::This "Information" is completely irrelevant to whether the article merits deletion or not     and appears to be little more than an unfounded attempt to vilify those in support of keeping the page. ', 'owner': None, 'id': '500000001'})

## Create Corpus Object

In [21]:
afd_corpus = Corpus(utterances=list(utterance_objects.values()))

In [22]:
afd_corpus.random_utterance()

Utterance({'obj_type': 'utterance', 'meta': {'citations': [], 'type': 'non-voting comment', 'label': None, 'raw_label': None}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'name': 'Spinningspark', 'editcount': 70773.0, 'signup': '2007-03-03T09:41:30Z', 'gender': 'male'}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e220>, 'id': '200004545'}), 'conversation_id': '100356850', 'reply_to': '401659650', 'timestamp': 1523457360.0, 'text': '*\'\'\'Comment\'\'\'. Yet again we have an AfD nomination of a food related article with the rationale "Wikipedia is not a recipe book" where the nominator seems to have completely failed to observe that there is actually no recipe in the article.  That just translates to IDONTLIKEIT and should be ignored by the closer as an invalid argument. ', 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e220>, 'id': '500849879'})

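Corpus summary information (the corpus contains only speakers who authored at least one contribution, so the speaker count is lower than the number of users in `users_df.csv`):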

In [23]:
afd_corpus.print_summary_stats()

Number of Speakers: 161266
Number of Utterances: 3295340
Number of Conversations: 383918


Add the dataset name.

In [24]:
afd_corpus.meta['name'] = 'Wikipedia Articles for Deletion Dataset'

## Add Metadata for Conversations

In the metadata field for each conversation we are going to include the title of the Wikipedia page suggested for deletion and information about the outcome of the discussion (as was determined by an admin).

In [25]:
afd['Discussions'][0]

{'ID': 100000001, 'Title': 'Andrew Nellis'}

In [26]:
afd['Outcomes'][0]

{'ID': 300000001,
 'Parent': 100000001,
 'Label': 'delete',
 'Raw': 'delete,',
 'User': 200000001,
 'Timestamp': 1159342800.0,
 'Rationale': "The result was '''delete,''' discounting SPA's.  "}

In [27]:
outcomes_dct = {str(d['Parent']): d for d in afd['Outcomes']}
disc_info_dct = {str(d['ID']): d['Title'] for d in afd['Discussions']}


for conversation in tqdm(afd_corpus.iter_conversations()):

    c_id = conversation.get_id()
    # discussions without a recorded outcome get None for every outcome field
    outcome = outcomes_dct.get(c_id, {})
    outcome_id = outcome.get('ID')
    outcome_user = outcome.get('User')

    conversation.meta.update({'article_title': disc_info_dct[c_id],
                              'outcome_id': None if outcome_id is None else str(outcome_id),
                              'outcome_label': outcome.get('Label'),
                              'outcome_raw_label': outcome.get('Raw'),
                              'outcome_decision_maker_id': None if outcome_user is None else str(outcome_user),
                              'outcome_timestamp': outcome.get('Timestamp'),
                              'outcome_rationale': outcome.get('Rationale')
                             })

383918it [00:07, 52868.44it/s]
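
The lookup pattern used here, two dictionaries keyed by the discussion ID, can be sketched in isolation. The function `build_conversation_meta` and the second toy discussion are hypothetical; the first discussion and its outcome are copied from the examples above. The sketch assumes that a discussion with no recorded outcome should receive `None` for every outcome field.

```python
# Metadata key -> key in the raw outcome dictionary.
OUTCOME_FIELDS = {
    "outcome_id": "ID",
    "outcome_label": "Label",
    "outcome_raw_label": "Raw",
    "outcome_decision_maker_id": "User",
    "outcome_timestamp": "Timestamp",
    "outcome_rationale": "Rationale",
}
# Fields stored as strings so that they match ConvoKit-style string ids.
ID_FIELDS = {"outcome_id", "outcome_decision_maker_id"}


def build_conversation_meta(disc_id, titles, outcomes):
    """Assemble the metadata dictionary for one discussion."""
    outcome = outcomes.get(disc_id)  # None when no outcome was recorded
    meta = {"article_title": titles[disc_id]}
    for meta_key, source_key in OUTCOME_FIELDS.items():
        value = None if outcome is None else outcome[source_key]
        if value is not None and meta_key in ID_FIELDS:
            value = str(value)
        meta[meta_key] = value
    return meta


titles = {"100000001": "Andrew Nellis", "100000002": "Hypothetical Article"}
outcomes = {
    "100000001": {"ID": 300000001, "Parent": 100000001, "Label": "delete",
                  "Raw": "delete,", "User": 200000001, "Timestamp": 1159342800.0,
                  "Rationale": "The result was '''delete,''' discounting SPA's.  "},
}

print(build_conversation_meta("100000001", titles, outcomes)["outcome_label"])  # delete
print(build_conversation_meta("100000002", titles, outcomes)["outcome_label"])  # None
```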


In [28]:
afd_corpus.get_conversation('100309419').meta

{'article_title': 'Hibiya High School',
 'outcome_id': '300281410',
 'outcome_label': 'keep speedy',
 'outcome_raw_label': 'speedy keep',
 'outcome_decision_maker_id': '200000595',
 'outcome_timestamp': 1176317760.0,
 'outcome_rationale': "The result was '''speedy keep'''.  Non-admin closure. "}

**Note** that some, but not all, of the outcome decision makers also appear as speakers in this corpus.

The user with ID `'200000595'`, who made the final decision in the example debate above, is also a speaker.

In [29]:
afd_corpus.get_speaker('200000595')

Speaker({'obj_type': 'speaker', 'meta': {'name': 'YechielMan', 'editcount': 13.0, 'signup': '2009-03-16T05:12:51Z', 'gender': 'unknown'}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e220>, 'id': '200000595'})

However, `76` of the outcome decision makers never appeared as contributors/speakers in debates of this corpus.

In [30]:
# a set makes each membership test constant-time instead of a scan over all speakers
speaker_ids = {speaker.id for speaker in afd_corpus.iter_speakers()}

missing_users = set()
for conversation in afd_corpus.iter_conversations():
    user_id = str(conversation.meta['outcome_decision_maker_id'])
    if user_id not in speaker_ids:
        missing_users.add(user_id)

len(missing_users)

76
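
The same count can be expressed as a set difference. The sketch below uses made-up toy ids (only `'200000595'` and `'200000001'` appear in this notebook; `'299999999'` is hypothetical) to show the idea.

```python
# Toy id sets for illustration; '299999999' is a hypothetical decision maker.
toy_speaker_ids = {"200000001", "200000002", "200000595"}
toy_decision_makers = {"200000001", "200000595", "299999999"}

# Decision makers who never appear as speakers.
toy_missing = toy_decision_makers - toy_speaker_ids

print(sorted(toy_missing))  # ['299999999']
```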

## Verify

In [31]:
afd_corpus.random_utterance()

Utterance({'obj_type': 'utterance', 'meta': {'citations': [], 'type': 'vote', 'label': 'keep', 'raw_label': 'strong keep:'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {'name': 'Ju-ju', 'editcount': 2.0, 'signup': '2006-06-23T16:41:53Z', 'gender': 'unknown'}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e220>, 'id': '200101465'}), 'conversation_id': '100186630', 'reply_to': '600140671', 'timestamp': 1151096100.0, 'text': "VOTE This article documents a major event that impacted South Florida tremendously. It was more than just a minor strike, it was the beginning of a chain reaction that raised the awareness of workers' rights not only in the University of Miami community, but also throughout the nation as the story circulated the national news wires. The article carries enough detail and information to be kept online for the education of others, and should not be deleted.", 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e2

In [32]:
afd_corpus.random_conversation()

Conversation({'obj_type': 'conversation', 'meta': {'article_title': 'Kyle Leopold', 'outcome_id': '300165514', 'outcome_label': 'delete speedy', 'outcome_raw_label': 'speedily deleted', 'outcome_decision_maker_id': '200016801', 'outcome_timestamp': 1231535280.0, 'outcome_rationale': 'The result was    \'\'\'speedily deleted\'\'\' by {{admin|TexasAndroid}} ([[WP:NAC|non-admin closure]]). <font face="Arial"> '}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x7f836804e220>, 'id': '100181561'})

#### Check reply-to chain integrity

In [33]:
broken = []
for convo in tqdm(afd_corpus.iter_conversations()):
    if not convo.check_integrity(verbose=False):
        broken.append(convo.id)

383918it [00:09, 38514.69it/s]


In [34]:
print(len(broken))

0


So, all conversations were verified to have valid reply-to chains.
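
For intuition about what a valid reply-to structure requires, here is a minimal sketch in plain Python. It is not ConvoKit's implementation of `check_integrity`; the function `reply_chains_are_valid` and the conditions it checks (one root per conversation, every parent present in the same conversation, no cycles) are assumptions about what a well-formed reply tree needs.

```python
def reply_chains_are_valid(utterances):
    """Check a toy mapping of utterance id -> (conversation_id, reply_to)."""
    root_counts = {}
    for u_id, (conv, reply_to) in utterances.items():
        if reply_to is None:
            root_counts[conv] = root_counts.get(conv, 0) + 1
        else:
            parent = utterances.get(reply_to)
            # The parent must exist and belong to the same conversation.
            if parent is None or parent[0] != conv:
                return False

    # Every conversation needs exactly one root utterance.
    conversations = {conv for conv, _ in utterances.values()}
    if any(root_counts.get(conv, 0) != 1 for conv in conversations):
        return False

    # Following reply_to links from any utterance must end at a root (no cycles).
    for u_id in utterances:
        seen = set()
        current = u_id
        while current is not None:
            if current in seen:
                return False
            seen.add(current)
            current = utterances[current][1]
    return True


valid = {
    "n1": ("d1", None),
    "v1": ("d1", "n1"),
    "c1": ("d1", "v1"),
}
dangling = {
    "n1": ("d1", None),
    "c1": ("d1", "v9"),  # parent id does not exist
}
print(reply_chains_are_valid(valid))     # True
print(reply_chains_are_valid(dangling))  # False
```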

## Dump the corpus

In [None]:
afd_corpus.dump("wiki-articles-for-deletion-corpus")