# Wikiconv Corpus Deletion Statistics 
This notebook shows how to compute comment deletion statistics similar to those presented in the Wikiconv paper (http://www.cs.cornell.edu/~cristian/index_files/wikiconv-conversation-corpus.pdf), which introduces a corpus covering the complete history of conversations on Wikipedia.

This notebook computes comment deletion statistics for a small subset of the corpus and shows the form of the provided utterances.
For reference, comments fall into three categories based on their `toxicity` and `sever_toxicity` scores (`sever_toxicity` is the field name used in the corpus metadata):
1. Normal Comments - have toxicity < 0.5 and sever_toxicity < 0.5
2. Toxic Comments - have toxicity > 0.5
3. Severe Toxic Comments - have sever_toxicity > 0.5
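
As a minimal sketch of the thresholds above, the hypothetical helper `comment_categories` (not part of ConvoKit or of this notebook's cells) returns every label a pair of scores qualifies for. It mirrors the comparisons used in the code later in this notebook, so a comment can be both toxic and severe toxic, and a score of exactly 0.5 matches no category.

```python
def comment_categories(toxicity, sever_toxicity, threshold=0.5):
    """Return the set of category labels for one comment (illustrative helper)."""
    labels = set()
    if toxicity < threshold and sever_toxicity < threshold:
        labels.add("normal")
    if toxicity > threshold:
        labels.add("toxic")
    if sever_toxicity > threshold:
        labels.add("severe_toxic")
    return labels


# Scores chosen for illustration only
print(sorted(comment_categories(0.12, 0.06)))
print(sorted(comment_categories(0.90, 0.70)))
```

Because the categories overlap, the toxic and severe toxic totals computed below are not disjoint.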

For deletion time intervals of up to 365 days, the deletion rates are expected to increase with toxicity:
1. Normal Comments - deleted at the lowest rate
2. Toxic comments - deleted at a rate greater than Normal Comments but less than Severe Toxic Comments
3. Severe Toxic Comments - deleted at the greatest rate

Finally, two caveats apply: deletion time intervals greater than 365 days may inflate the numbers because of periodic page cleanups, and individual corpora may show deletion rates that differ from the pattern above due to variance.

In [1]:
#import relevant modules
from datetime import datetime, timedelta
from convokit import Corpus, User, Utterance, Conversation, download

In [2]:
# Load the 2003 wikiconv corpus (feel free to change this to a year of your preference)
wikiconv_corpus = Corpus(filename=download('wikiconv-2003'))

Dataset already exists at /home/jonathan/.convokit/downloads/wikiconv-2003


Some basic facts about this subset of the corpus: 91,787 conversations and 140,265 utterances

In [3]:
len(list(wikiconv_corpus.iter_conversations()))

91787

In [4]:
len(list(wikiconv_corpus.iter_utterances()))

140265

Each utterance has the structure shown below. In this example the `modification`, `deletion`, and `restoration` lists are empty; when such actions occur on the original comment, these lists are filled with the corresponding utterances.

In [5]:
list(wikiconv_corpus.iter_utterances())[3]

Utterance({'id': '5021479.2081.2077', 'user': User([('name', 'Jay')]), 'root': '5021479.1277.1272', 'reply_to': '5021479.1277.1272', 'timestamp': 1070614595.0, 'text': "You're right about separating the sandwich of war names and MiG names. Each plane should be sorted chronologically and have its own sentence detailing its importance. ", 'meta': {'is_section_header': True, 'indentation': '2', 'toxicity': 0.1219038, 'sever_toxicity': 0.06112729, 'ancestor_id': '5021479.2081.2077', 'rev_id': '5021479', 'parent_id': None, 'original': None, 'modification': [], 'deletion': [], 'restoration': []}})
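
The fields that matter for the deletion statistics are `timestamp` and the `toxicity`, `sever_toxicity`, and `deletion` entries of `meta`. The sketch below reads them from a stand-in object built with `SimpleNamespace` so that it runs without downloading the corpus; the stand-in is an assumption for illustration, and only copies values from the utterance printed above. Converting the timestamp in UTC is also a choice made here for reproducibility; the cells below use local time.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Stand-in mimicking the attributes of the utterance shown above
# (id, timestamp, meta); it is not a real ConvoKit Utterance.
utt = SimpleNamespace(
    id="5021479.2081.2077",
    timestamp=1070614595.0,
    meta={"toxicity": 0.1219038, "sever_toxicity": 0.06112729, "deletion": []},
)

# Convert the POSIX timestamp to a timezone-aware datetime
posted_at = datetime.fromtimestamp(utt.timestamp, tz=timezone.utc)

# A comment counts as deleted when its deletion list is non-empty
was_deleted = len(utt.meta["deletion"]) > 0

print(posted_at.isoformat(), was_deleted)
```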

First, we will write a function that takes as input the deletion list and returns the counts of each type of deletion instance (normal, toxic, and severe toxic) that occurs within a given number of days of the original posting time.

In [6]:
def check_deletion_list_data(list_of_deletion_utterances, original_posting_time, timedelta_value):
    #Count the total number of deleted utterances of each type
    count_normal = 0
    count_toxic= 0
    count_sever_toxic = 0
    
    for deletion_utt in list_of_deletion_utterances:
        toxicity_val = deletion_utt.meta['toxicity']
        sever_toxicity_val = deletion_utt.meta['sever_toxicity']
        timestamp_value = deletion_utt.timestamp
        deletion_datetime_val = datetime.fromtimestamp(timestamp_value)
        
        #delta_value is the time delta between when the deletion utt happened and the original utt's posting
        if (original_posting_time is None):
            #Use a zero timedelta so the comparison below stays between timedelta objects
            delta_value = timedelta(0)
        else:
            delta_value = deletion_datetime_val - original_posting_time

        #If the delta value is at most the provided time delta, consider its type
        if (delta_value <= timedelta(days = timedelta_value)):
                if (toxicity_val < 0.5 and sever_toxicity_val < 0.5):
                    count_normal +=1
                if (toxicity_val > 0.5):
                    count_toxic +=1
                if (sever_toxicity_val > 0.5):
                    count_sever_toxic +=1 
    
    #Return in  tuple  form the number of each type of affected comment
    return (count_normal, count_toxic, count_sever_toxic)
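
The core of the function above is the window test: a deletion counts only if it happened at most `timedelta_value` days after the original posting. A minimal sketch of that test in isolation follows, assuming a hypothetical helper `deleted_within` and UTC conversion (the function above uses local time instead).

```python
from datetime import datetime, timedelta, timezone


def deleted_within(posted_ts, deleted_ts, days):
    """True if the deletion happened at most `days` days after posting."""
    posted = datetime.fromtimestamp(posted_ts, tz=timezone.utc)
    deleted = datetime.fromtimestamp(deleted_ts, tz=timezone.utc)
    return deleted - posted <= timedelta(days=days)


posted_ts = 1070614595.0                 # timestamp of the sample utterance
two_days_later = posted_ts + 2 * 86400   # hypothetical deletion time

print(deleted_within(posted_ts, two_days_later, days=1))
print(deleted_within(posted_ts, two_days_later, days=7))
```

As in the function above, the comparison is inclusive, so a deletion exactly `days` days after posting falls inside the window.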
                    

Next, we compute, over the full list of utterances, how many comments of each type exist and how many of each type are deleted.

In [7]:
def get_deletion_counts(individual_utterance_list, timedelta_value):
    #Normal Data count
    count_normal_deleted = 0
    count_normal_total = 0
    set_of_normal_comments = set()
    

    #Toxic data count
    count_toxic_deleted = 0
    count_toxic_total = 0
    set_of_toxic_comments = set()

    #Sever Toxic Data count
    count_sever_deleted = 0
    count_sever_total = 0
    set_of_sever_comments = set()
    
    #Check each utterance
    for utterance_value in individual_utterance_list:
        toxicity_val = utterance_value.meta['toxicity']
        sever_toxicity_val = utterance_value.meta['sever_toxicity']
        
        #Find the total number of comments of each type
        if (toxicity_val < 0.5 and sever_toxicity_val < 0.5):
            if (utterance_value.id not in set_of_normal_comments):
                count_normal_total +=1     
                set_of_normal_comments.add(utterance_value.id)  
                
        if (toxicity_val > 0.5):
            if (utterance_value.id not in set_of_toxic_comments):
                count_toxic_total +=1
                set_of_toxic_comments.add(utterance_value.id)
   
        if (sever_toxicity_val > 0.5):
            if (utterance_value.id not in set_of_sever_comments):
                count_sever_total +=1
                set_of_sever_comments.add(utterance_value.id)
                
        #Find the time that the original utterance is posted
        original_utterance = utterance_value.meta['original']
        if (original_utterance is not None):
            original_time = original_utterance.timestamp
            original_date_time = datetime.fromtimestamp(original_time)
        else:
            original_date_time =  datetime.fromtimestamp(utterance_value.timestamp)
        
        
        #Count the number of deleted comments 
        if (len(utterance_value.meta['deletion']) >0):
            deletion_list = utterance_value.meta['deletion']
            ind_normal, ind_toxic, ind_sever = check_deletion_list_data(deletion_list, original_date_time, timedelta_value)
            count_normal_deleted += ind_normal
            count_toxic_deleted += ind_toxic
            count_sever_deleted += ind_sever
    
    return (count_normal_deleted, count_toxic_deleted, count_sever_deleted, 
            count_normal_total, count_toxic_total, count_sever_total)
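
Note that the function above adds one count per entry in a comment's `deletion` list, so a comment that is deleted, restored, and deleted again contributes more than once to the deleted count. As an alternative sketch (an assumption about what a reader might want, not the method used in this notebook), the hypothetical helper `was_deleted_within` counts each comment at most once. It uses stand-in objects and compares raw timestamps against the comment's own `timestamp`, ignoring the `original` field for brevity.

```python
from types import SimpleNamespace


def was_deleted_within(utt, days):
    """True if any deletion of `utt` falls within `days` days of its timestamp."""
    limit_seconds = days * 86400
    return any(d.timestamp - utt.timestamp <= limit_seconds
               for d in utt.meta["deletion"])


# Hypothetical stand-ins: one comment deleted twice, one never deleted
first = SimpleNamespace(timestamp=3600.0, meta={})
second = SimpleNamespace(timestamp=7200.0, meta={})
comment = SimpleNamespace(timestamp=0.0, meta={"deletion": [first, second]})
untouched = SimpleNamespace(timestamp=0.0, meta={"deletion": []})

comments = [comment, untouched]
n_deleted = sum(was_deleted_within(c, days=1) for c in comments)
print(n_deleted, "of", len(comments), "comments deleted within one day")
```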







Finally, we will define a function to print out the different statistics.

In [8]:
def print_statistics(count_normal_deleted, count_toxic_deleted, count_sever_deleted, total_normal, total_toxic, total_sever):
    #Guard against empty categories (possible in small corpora) to avoid a ZeroDivisionError
    prop_normal = count_normal_deleted/float(total_normal) if total_normal else float('nan')
    prop_toxic = count_toxic_deleted/float(total_toxic) if total_toxic else float('nan')
    prop_sever = count_sever_deleted/float(total_sever) if total_sever else float('nan')

    print ('Proportion of normal comments deleted: ' + str(prop_normal))
    print ('Proportion of toxic comments deleted: ' + str(prop_toxic))
    print ('Proportion of sever toxic comments deleted: ' + str(prop_sever))

Now we can process the corpus and output the deletion statistics for a one-day interval.

In [9]:
#Set the default values we will need to compute the corpus statistics
individual_utterance_list = list(wikiconv_corpus.iter_utterances())
len_utterances = len(individual_utterance_list)
timedelta_value = 1

#Find the counts of deleted comments and print statistics with a time delta of One Day
(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
 total_normal, total_toxic, total_sever) = get_deletion_counts(individual_utterance_list, timedelta_value)
print_statistics(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
                 total_normal,  total_toxic, total_sever)



Proportion of normal comments deleted: 0.04678097466457062
Proportion of toxic comments deleted: 0.07529684332464523
Proportion of sever toxic comments deleted: 0.0836092715231788


We can also change the time delta, given in days, which counts comments deleted up to that many days after posting. Here we use seven days.

In [10]:
timedelta_value = 7
(count_normal_deleted, count_toxic_deleted, count_sever_deleted, 
 total_normal, total_toxic, total_sever) = get_deletion_counts(individual_utterance_list, timedelta_value)
print_statistics(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
                 total_normal,  total_toxic, total_sever)



Proportion of normal comments deleted: 0.10941395824955215
Proportion of toxic comments deleted: 0.1448016217781639
Proportion of sever toxic comments deleted: 0.14072847682119205


We can change to the 30-day view.

In [11]:
timedelta_value = 30
(count_normal_deleted, count_toxic_deleted, count_sever_deleted, 
 total_normal, total_toxic, total_sever) = get_deletion_counts(individual_utterance_list, timedelta_value)
print_statistics(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
                 total_normal,  total_toxic, total_sever)


Proportion of normal comments deleted: 0.18123788981098965
Proportion of toxic comments deleted: 0.24123950188242108
Proportion of sever toxic comments deleted: 0.23178807947019867


Finally, we use the 365-day time delta.

In [12]:
timedelta_value = 365
(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
 total_normal, total_toxic, total_sever) = get_deletion_counts(individual_utterance_list, timedelta_value)
print_statistics(count_normal_deleted, count_toxic_deleted, count_sever_deleted,
                 total_normal,  total_toxic, total_sever)

Proportion of normal comments deleted: 0.2552188059810624
Proportion of toxic comments deleted: 0.32696206197509414
Proportion of sever toxic comments deleted: 0.3129139072847682
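
To compare these results with the expected pattern (normal < toxic < severe toxic), the sketch below checks the ordering for each window. The proportions are copied from the outputs above and rounded to four decimals; the helper name `follows_expected_order` is hypothetical.

```python
# Window in days -> (normal, toxic, severe toxic) proportions,
# copied from the outputs above and rounded to four decimals
reported = {
    1: (0.0468, 0.0753, 0.0836),
    7: (0.1094, 0.1448, 0.1407),
    30: (0.1812, 0.2412, 0.2318),
    365: (0.2552, 0.3270, 0.3129),
}


def follows_expected_order(rates):
    """True if normal < toxic < severe toxic for one window."""
    normal, toxic, severe = rates
    return normal < toxic < severe


ordering = {days: follows_expected_order(r) for days, r in reported.items()}
print(ordering)
```

On this 2003 subset, normal comments have the lowest deletion rate in every window, but only the one-day window shows the full ordering. In the longer windows toxic comments are deleted slightly more often than severe toxic ones, which is consistent with the earlier note that individual corpora may deviate from the pattern due to variance.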
