## Converting the paired data from "Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions" into ConvoKit format (the data used in section 4 of their paper).

#### Note: we are only converting the subset of the data used to measure successful vs. unsuccessful arguments. All data provided by:
--------------------

Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions
Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, Lillian Lee. 
In Proceedings of the 25th International World Wide Web Conference (WWW'2016).

The paper, data, and associated materials can be found at:
http://chenhaot.com/pages/changemyview.html

If you use this data, please cite:
@inproceedings{tan+etal:16a, 
    author = {Chenhao Tan and Vlad Niculae and Cristian Danescu-Niculescu-Mizil and Lillian Lee}, 
    title = {Winning Arguments: Interaction Dynamics and Persuasion Strategies in Good-faith Online Discussions}, 
    year = {2016}, 
    booktitle = {Proceedings of WWW} 
}

Note: on the page linked above, the data we used is the original release (the one linked alongside the corresponding README, PDF, and slides). We did *not* use the updated data provided on 11/11/2016.

Before starting the data conversion, you need to download the data, linked above, and extract the data from the tar archive.
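
The extraction step can also be scripted. The following is a minimal sketch using Python's standard `tarfile` module. The function name `extract_archive` and any paths passed to it are hypothetical, so substitute the actual location of the archive you downloaded.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # mode "r:*" lets tarfile detect the compression (none, gzip, bz2, xz)
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        try:
            # newer Python versions accept a safety filter for extraction
            archive.extractall(dest, filter="data")
        except TypeError:
            # older versions do not know the `filter` argument
            archive.extractall(dest)
    return names
```

A call would look like `extract_archive('path/to/archive', 'cmv_data')`, where both arguments are placeholders.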

------------------------------------

In [1]:
import os

In [2]:
#here I set the working directory to where I store the convokit package 
# os.chdir('C:\\Users\\Andrew\\Desktop\\Cornell-Conversational-Analysis-Toolkit')
from convokit import Corpus, User, Utterance, meta_index

In [3]:
import pandas as pd

Load the original pair data:

In [4]:
pairDF=pd.read_json('C:\\Users\\Andrew\\Downloads\\train_pair_data.jsonlist',lines=True)
pairDF.tail()

Unnamed: 0,op_author,op_text,op_title,positive,negative,op_name
3451,helpful_hank,"In opposing injustice, we must strive not to p...",CMV: Drawing images of Mohammed and posting th...,"{'ancestor': 't1_cniw4jr', 'author': 'cold08',...","{'ancestor': 't1_cniu655', 'author': 'learhpa'...",t3_2rsgv3
3452,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnirwl5', 'author': 'scottevi...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a
3453,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnjiwww', 'author': 'AdmiralC...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a
3454,GetCapeFly,It seems logical to me that school hours shoul...,CMV: School hours should be 9am to 5pm to matc...,"{'ancestor': 't1_cnii75i', 'author': '[deleted...","{'ancestor': 't1_cnijhp3', 'author': 'funchy',...",t3_2rqvf8
3455,luxo42,My argument assumes the Christian theology tau...,"CMV: In heaven, as long as an individual has f...","{'ancestor': 't1_cnj7d44', 'author': 'Field-K'...","{'ancestor': 't1_cnih5d9', 'author': '____Matt...",t3_2rq5g3


In [5]:
len(pairDF)

3456

Note: Each observation has the reply comments in a conversation that changes the OP's (OP: original poster) mind (positive column) and a conversation that does not change the OP's mind (negative column). Unfortunately, this does not include the comments that OP made after their original post: the comments made by the OP in response to the second conversant's arguments. To find the comments made by OP (i.e. the other half of the conversation), we need to retrieve them from the 'all' dataset.
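
To make the layout concrete, here is a small sketch of the one-JSON-object-per-line format that `lines=True` expects. The two records are made up for illustration. Only the field names (`op_name`, `op_author`, `positive`, `negative`, `author`, `comments`, `id`) are taken from this notebook.

```python
import json

# Two made-up lines in the one-object-per-line layout.
raw_lines = [
    json.dumps({"op_name": "t3_post1", "op_author": "alice",
                "positive": {"author": "bob", "comments": [{"id": "c1"}]},
                "negative": {"author": "carol", "comments": [{"id": "c7"}]}}),
    json.dumps({"op_name": "t3_post1", "op_author": "alice",
                "positive": {"author": "dave", "comments": [{"id": "c9"}]},
                "negative": {"author": "carol", "comments": [{"id": "c7"}]}}),
]

# Parse each line independently, as a JSON-lines reader would.
records = [json.loads(line) for line in raw_lines]

for record in records:
    challengers = {record["positive"]["author"], record["negative"]["author"]}
    # In this toy data the OP is a separate field, not one of the paired commenters.
    print(record["op_name"], record["op_author"], sorted(challengers))
```

Note that the same original post (and here the same negative argument) can appear in more than one record, which matters for the counts later in the notebook.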

First: collect the unique identifiers for each original post in our dataset

In [6]:
nyms = list(set(pairDF.op_name))
len(nyms)

2509

Collect each post from the full dataset (this has the full comment threads, whereas the pair data above only has the first response):

Note: if you have not run this notebook before, you will need to uncomment the following three code cells. They load the full 2 GB dataset into working memory and save only the roughly 2,500 observations that match the posts in the pair data above.

In [7]:
# #note: this is over 2 GB of data, uncomment the following two lines to read in the data

# data = pd.read_json('/Users/andrewszmurlo/workspace/cmv/all/train_period_data.jsonlist', lines=True)
# len(data)

Keep only the posts that are identified in our original dataset:

In [8]:
# #note: this reduces the 2 GB dataset to a similar size as our original dataset

# data=data[data.name.isin(nyms)]
# len(data)

Saving the posts from the full dataset that are the same as posts in our pair data. 

In [9]:
# #note: I save the data as a pickle file so I don't have to reload the 2 GB dataset in my working memory

# data.to_pickle('/Users/andrewszmurlo/workspace/cmv/pairAll.pkl')
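
If holding the full file in memory is a problem, one alternative is to filter it line by line. This is a sketch under assumptions, not what was run for this notebook: it assumes each line of the file is a JSON object with a `name` field (as in the `data.name.isin(nyms)` filter above), and the helper `filter_jsonlines` is hypothetical.

```python
import json


def filter_jsonlines(lines, wanted_names):
    """Yield parsed records whose 'name' is in wanted_names, one line at a time."""
    wanted = set(wanted_names)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("name") in wanted:
            yield record


# Toy demonstration with in-memory lines instead of the real 2 GB file.
toy_lines = [
    '{"name": "t3_a", "comments": []}',
    '{"name": "t3_b", "comments": []}',
    '',
    '{"name": "t3_c", "comments": []}',
]
kept = list(filter_jsonlines(toy_lines, ["t3_a", "t3_c"]))
print([record["name"] for record in kept])
```

With a real file, the first argument would be an open file handle, so only one line is parsed at a time.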

Here, I have already run this notebook, so I can just load this dataset back into working memory.

In [10]:
data = pd.read_pickle('C:\\Users\\Andrew\\Downloads\\pairAll.pkl')

In [11]:
data.tail()

Unnamed: 0,approved_by,archived,author,author_flair_css_class,author_flair_text,banned_by,clicked,comments,created,created_utc,...,stickied,subreddit,subreddit_id,suggested_sort,thumbnail,title,ups,url,user_reports,visited
18349,,0.0,DirtyStanBoozie,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1420759479,1420759479,...,False,changemyview,t5_2w2s8,qa,,CMV: Pugs,35,http://www.reddit.com/r/changemyview/comments/...,[],False
18350,,0.0,helpful_hank,points,1∆,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1420755416,1420755416,...,False,changemyview,t5_2w2s8,qa,,CMV: Drawing images of Mohammed and posting th...,0,http://www.reddit.com/r/changemyview/comments/...,[],False
18351,,0.0,VIRMD,points,1∆,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1420750270,1420750270,...,False,changemyview,t5_2w2s8,qa,,CMV: The rate at which one's income is taxed s...,1,http://www.reddit.com/r/changemyview/comments/...,[],False
18357,,0.0,GetCapeFly,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1420728251,1420728251,...,False,changemyview,t5_2w2s8,qa,,CMV: School hours should be 9am to 5pm to matc...,819,http://www.reddit.com/r/changemyview/comments/...,[],False
18362,,0.0,luxo42,,,,False,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non...",1420705507,1420705507,...,False,changemyview,t5_2w2s8,qa,,"CMV: In heaven, as long as an individual has f...",3,http://www.reddit.com/r/changemyview/comments/...,[],False


In [12]:
len(data)

2509

In [13]:
len(pairDF)

3456

In [14]:
data.columns

Index(['approved_by', 'archived', 'author', 'author_flair_css_class',
       'author_flair_text', 'banned_by', 'clicked', 'comments', 'created',
       'created_utc', 'distinguished', 'domain', 'downs', 'edited', 'gilded',
       'hidden', 'id', 'is_self', 'likes', 'link_flair_css_class',
       'link_flair_text', 'media', 'media_embed', 'mod_reports', 'name',
       'num_comments', 'num_reports', 'over_18', 'permalink', 'report_reasons',
       'saved', 'score', 'secure_media', 'secure_media_embed', 'selftext',
       'selftext_html', 'stickied', 'subreddit', 'subreddit_id',
       'suggested_sort', 'thumbnail', 'title', 'ups', 'url', 'user_reports',
       'visited'],
      dtype='object')

Only keep the comments and the identifier for merging with the original dataset:

In [15]:
data=data[['comments','name']]

In [16]:
pairDF.columns

Index(['op_author', 'op_text', 'op_title', 'positive', 'negative', 'op_name'], dtype='object')

This joins the comments in the 'all' data with the posts we are interested in studying:

In [17]:
pairDF=pairDF.join(data.set_index('name'), on='op_name')

In [18]:
len(pairDF)

3456
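
The row count stays at 3456 because the join keeps every row of `pairDF` and attaches the matching post's comments to it. A post that appears in several pairs is simply matched several times. The same idea in plain Python, on made-up rows (all names and values below are illustrative):

```python
# Three toy pair rows; two of them share the same original post.
pair_rows = [
    {"op_name": "t3_a", "positive": "p1"},
    {"op_name": "t3_a", "positive": "p2"},
    {"op_name": "t3_b", "positive": "p3"},
]

# Toy stand-in for the 'all' data: one entry per original post.
comments_by_post = {"t3_a": ["c1", "c2"], "t3_b": ["c3"]}

# Keep every pair row and look up its post's comments (None if missing).
joined = [
    dict(row, comments=comments_by_post.get(row["op_name"]))
    for row in pair_rows
]

print(len(pair_rows), len(joined))
```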

In [19]:
pairDF.tail()

Unnamed: 0,op_author,op_text,op_title,positive,negative,op_name,comments
3451,helpful_hank,"In opposing injustice, we must strive not to p...",CMV: Drawing images of Mohammed and posting th...,"{'ancestor': 't1_cniw4jr', 'author': 'cold08',...","{'ancestor': 't1_cniu655', 'author': 'learhpa'...",t3_2rsgv3,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
3452,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnirwl5', 'author': 'scottevi...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
3453,VIRMD,The rate at which income is taxed (at least in...,CMV: The rate at which one's income is taxed s...,"{'ancestor': 't1_cnjiwww', 'author': 'AdmiralC...","{'ancestor': 't1_cnjrwds', 'author': 'natha105...",t3_2rs57a,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
3454,GetCapeFly,It seems logical to me that school hours shoul...,CMV: School hours should be 9am to 5pm to matc...,"{'ancestor': 't1_cnii75i', 'author': '[deleted...","{'ancestor': 't1_cnijhp3', 'author': 'funchy',...",t3_2rqvf8,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."
3455,luxo42,My argument assumes the Christian theology tau...,"CMV: In heaven, as long as an individual has f...","{'ancestor': 't1_cnj7d44', 'author': 'Field-K'...","{'ancestor': 't1_cnih5d9', 'author': '____Matt...",t3_2rq5g3,"[{'subreddit_id': 't5_2w2s8', 'banned_by': Non..."


Now that we have all comments made within every CMV post in our dataset, we need to extract only the comments that correspond to a positive argument and negative argument (i.e. the ones recorded as either changing OP's mind or not).

First, collect the identifiers for each comment made by the respondent attempting to change the OP's mind (there is a respondent in both the positive and negative columns).

In [20]:
def collectResponses(responseList):
    iDs=[]
    if len(responseList['comments'])>0:
        for each in responseList['comments']:
            iDs.append(each['id'])
    return iDs
pairDF['negIDs']=pairDF.negative.apply(lambda x: collectResponses(x))
pairDF['posIDs']=pairDF.positive.apply(lambda x: collectResponses(x))

Now collect each of the comment identifiers that signify a response to the challenger by OP

In [21]:
def collectOPcommentIDs(op_auth, allComments, replyIDs):
    opIds =[]
    for comment in allComments:
        if comment['parent_id'].split('_')[1] in replyIDs: 
            if 'author' in comment.keys():
                if comment['author'] == op_auth:
                    opIds.append(comment['id'])

    return opIds

In [22]:
pairDF['opRepliesPos'] = pairDF[['op_author','comments','posIDs']].apply(lambda x: collectOPcommentIDs(x['op_author'],x['comments'],x['posIDs']),axis=1)

In [23]:
pairDF['opRepliesNeg'] = pairDF[['op_author','comments','negIDs']].apply(lambda x: collectOPcommentIDs(x['op_author'],x['comments'],x['negIDs']),axis=1)
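
The matching logic is easier to see on a toy thread. The sketch below restates the idea with hypothetical helper names (`strip_prefix`, `op_reply_ids`) and made-up comments. It relies on Reddit's `<type>_<id>` naming, where `t1_` marks a comment and `t3_` a post, which is why the parent ID is split on the underscore.

```python
def strip_prefix(fullname):
    """'t1_abc' -> 'abc' (drop the Reddit type prefix)."""
    return fullname.split("_", 1)[1]


def op_reply_ids(op_author, all_comments, challenger_ids):
    """IDs of comments written by the OP directly under a challenger comment."""
    wanted = set(challenger_ids)
    return [
        c["id"]
        for c in all_comments
        if c.get("author") == op_author
        and strip_prefix(c["parent_id"]) in wanted
    ]


# Made-up thread: alice is the OP, bob is the challenger.
thread = [
    {"id": "c1", "parent_id": "t3_post", "author": "bob"},
    {"id": "c2", "parent_id": "t1_c1", "author": "alice"},  # OP replies to bob
    {"id": "c3", "parent_id": "t1_c2", "author": "bob"},
    {"id": "c4", "parent_id": "t1_c3", "author": "alice"},  # OP replies to bob
    {"id": "c5", "parent_id": "t1_c1", "author": "dave"},   # bystander
    {"id": "c6", "parent_id": "t1_c5", "author": "alice"},  # OP replies to bystander
    {"id": "c8", "parent_id": "t1_c1"},                     # stub without author
]

print(op_reply_ids("alice", thread, ["c1", "c3"]))  # ['c2', 'c4']
```

The OP's reply to the bystander (`c6`) is left out because its parent is not one of the challenger's comments.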

Here I collect and properly order each of the comment IDs made in the thread _only_ by either OP or the 2nd conversant studied, for both successful and unsuccessful arguments:

In [24]:
def orderThreadids(comments, replyIDs, opCommentIDs):
    threadIDs=list(replyIDs)
    for comment in comments:
        if comment['id'] in opCommentIDs:
            pID= comment['parent_id'].split('_')[1]
            if pID in replyIDs:
                threadIDs.insert(threadIDs.index(pID)+1,comment['id'])
            
    return threadIDs

In [25]:
pairDF['posOrder']= pairDF[['comments','posIDs','opRepliesPos']].apply(lambda x: orderThreadids(x['comments'],x['posIDs'],x['opRepliesPos']) ,axis = 1)

In [26]:
pairDF['negOrder']= pairDF[['comments','negIDs','opRepliesNeg']].apply(lambda x: orderThreadids(x['comments'],x['negIDs'],x['opRepliesNeg']) ,axis = 1)
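
As a quick check of the ordering step, the sketch below re-implements the same insertion approach on a made-up thread. The name `interleave_op_replies` and the data are illustrative only.

```python
def interleave_op_replies(comments, challenger_ids, op_reply_ids):
    """Return the challenger IDs with each OP reply placed right after its parent."""
    thread = list(challenger_ids)
    op_replies = set(op_reply_ids)
    for comment in comments:
        if comment["id"] in op_replies:
            parent = comment["parent_id"].split("_")[1]
            if parent in challenger_ids:
                # insert the OP reply directly after the comment it answers
                thread.insert(thread.index(parent) + 1, comment["id"])
    return thread


toy_comments = [
    {"id": "c1", "parent_id": "t3_post"},
    {"id": "c2", "parent_id": "t1_c1"},   # OP reply to c1
    {"id": "c3", "parent_id": "t1_c2"},
    {"id": "c4", "parent_id": "t1_c3"},   # OP reply to c3
]

print(interleave_op_replies(toy_comments, ["c1", "c3"], ["c2", "c4"]))
# ['c1', 'c2', 'c3', 'c4']
```

One property of this approach worth keeping in mind: if the OP replies twice to the same challenger comment, the reply encountered later in `comments` is inserted directly after the parent, and so ends up ahead of the reply encountered earlier.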

This function takes the ordered thread IDs for only the successful and unsuccessful arguments measured in the original paper (although, note: I have also collected the OP replies from the 'all' data, which weren't included in the smaller pair_data).

Note: I don't convert this section into convokit format, but instead I convert the full comment threads later in this notebook. If you are interested in looking at the successful and unsuccessful arguments in the convokit format, see the 'success' attribute in each utterance's metadata

In [27]:
def collectThread(comments, orderedThreadids):
    threadComments=[]
    for iD in orderedThreadids:
        for comment in comments:
            if iD==comment['id']:
                threadComments.append(comment)
    return threadComments

In [28]:
pairDF['positiveThread'] = pairDF[['comments','posOrder']].apply(lambda x: collectThread(x['comments'],x['posOrder']),axis=1)
pairDF['negativeThread'] = pairDF[['comments','negOrder']].apply(lambda x: collectThread(x['comments'],x['negOrder']),axis=1)

Note above: I have just collected each individual thread (with OP comments). However, when studying this data, we may be interested in looking at the entire conversation. Therefore, instead of only converting the positive threads and negative threads into convokit format, here I simply add an attribute to the comments if they are part of either the positive or negative thread.

Here I add the success attribute and the pair identification (see my readme file for a more detailed explanation of 'success' and 'pair_ids') :

In [29]:
# Create an identification # for the paired unsuccessful/successful arguments,
# Note: the pair # will be the same for successful-unsuccessful matched pairs with the prefix 'p_' for pair 
# if there is no paired argument for the comment (i.e. it was either the original post by OP or an uncategorized comment), 
# then pair_id = None
c=0
pairIDS={}
for i, r in pairDF.iterrows():
    
    c=c+1
    for comment in r.comments:
        
        if comment['id'] in r.posOrder:
            comment['success']=1
            if comment['name'] in pairIDS.keys():
                pairIDS[comment['name']].append('p_'+str(c))
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
            else:
                pairIDS[comment['name']]=['p_'+str(c)]
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
                
                
        elif comment['id'] in r.negOrder:
            comment['success']=0

            if comment['name'] in pairIDS.keys():
                pairIDS[comment['name']].append('p_'+str(c))
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
            else:
                pairIDS[comment['name']]=['p_'+str(c)]
                pairIDS[comment['name']]=list(set(pairIDS[comment['name']]))
                

        
        if comment['name'] not in pairIDS.keys():
            pairIDS[comment['name']]=[]
        if 'success' not in comment.keys():
            comment['success']=None
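
The bookkeeping in the cell above can be summarized with a smaller sketch. This is a simplified variant, not the notebook's code: it works on made-up ID lists instead of `pairDF`, keys by comment ID rather than `name`, and the helper `label_pairs` is hypothetical.

```python
from collections import defaultdict


def label_pairs(rows):
    """Assign 'p_<row number>' labels and a success flag to each comment ID."""
    pair_ids = defaultdict(set)   # comment id -> set of pair labels
    success = {}                  # comment id -> 1 (successful) or 0 (unsuccessful)
    for number, row in enumerate(rows, start=1):
        label = "p_" + str(number)
        for comment_id in row["negOrder"]:
            pair_ids[comment_id].add(label)
            success[comment_id] = 0
        # positives are written last so they win if an ID is in both lists
        for comment_id in row["posOrder"]:
            pair_ids[comment_id].add(label)
            success[comment_id] = 1
    return pair_ids, success


# Two toy rows that reuse the same unsuccessful thread ("x", "y").
toy_rows = [
    {"posOrder": ["a", "b"], "negOrder": ["x", "y"]},
    {"posOrder": ["c"], "negOrder": ["x", "y"]},
]
pair_ids, success = label_pairs(toy_rows)
print(sorted(pair_ids["x"]))  # ['p_1', 'p_2']
print(sorted(pair_ids["a"]))  # ['p_1']
```

Using a set per comment avoids the repeated `list(set(...))` de-duplication, and it shows why a comment can carry more than one pair ID: the same unsuccessful argument may be matched against several successful ones.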

In [30]:
# Make a column of pair_ids collected at the original-post level.
# Note: this won't be unique at the observation level in our pairDF dataframe,
# but I'm just doing this for quick conversion. After converting to convokit,
# I add the list as conversation-level metadata, where it is unique per conversation.
threads = list(set(pairDF.op_name))
pids =[]
for thread in threads:
    pid=[]
    for i,r in pairDF[pairDF.op_name==thread].iterrows():
        for comment in r.comments:
            if len(pairIDS[comment['name']])>0:
                for p in pairIDS[comment['name']]:
                    pid.append(p)
    pid=list(set(pid))
    pids.append(pid)
pairDF['pIDs']=pairDF.op_name.apply(lambda x: pids[threads.index(x)])

Now the data is collected in a pandas dataframe with each thread's comments fully accounted for. Convert it into convokit format:

The first step is to create a list of all Redditors, or 'users' in convokit parlance:

In [31]:
users = list(set(pairDF.op_author))

for i,r in pairDF.iterrows():
    for comment in r.comments:
        if 'author' in comment.keys():
            if comment['author'] not in users:
                users.append(comment['author'])
        else: continue

In [32]:
len(users)

29997
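
Checking `not in users` against a list rescans the list for every comment, which gets slow with this many comments. A set gives the same unique names with constant-time membership checks. The sketch below shows the idea on toy data; `collect_authors` is a hypothetical helper, not part of the notebook.

```python
def collect_authors(op_authors, threads):
    """Return the set of all author names among OPs and thread comments."""
    authors = set(op_authors)
    for comments in threads:
        for comment in comments:
            # some comment stubs carry no 'author' key at all
            if "author" in comment:
                authors.add(comment["author"])
    return authors


toy_threads = [
    [{"id": "c1", "author": "bob"}, {"id": "c2", "author": "alice"}],
    [{"id": "c3", "author": "bob"}, {"id": "c4"}],
]
print(sorted(collect_authors(["alice", "erin"], toy_threads)))
# ['alice', 'bob', 'erin']
```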

Note: I don't have metadata on individual users. There may be more data on individual Redditors in the 'all' data file (which I used to collect the OP comments in each thread), but since individual users aren't our focus (in class), I did not seek it out. I briefly considered creating a unique identifier for each user and including the username as metadata, but since each Reddit username is already unique, that would be superfluous. I believe other relevant information (such as whether a Redditor is the original poster) is specific to individual conversations and utterances.

Two metadata fields of note: 'author_flair_css_class' and 'author_flair_text' both describe flags that appear next to an author in a subreddit. In the changemyview subreddit, the moderators use this to show whether the author has changed someone's mind, and it can be seen as both an award and evidence of credibility in the subreddit. Although this could be included as author metadata, I believe it is actually 'conversation' metadata, because the flag is updated over time as the author changes more people's minds over the course of many conversations. Since this data was collected over time, the flag is likely to change per user across multiple conversations, and possibly across utterances.

I will include the user_meta dictionary, just in case, so data can be added to it later.

In [33]:
user_meta={}
for user in users:
    user_meta[user]={}

In [34]:
corpus_users = {k: User(name = k, meta = v) for k,v in user_meta.items()}

In [35]:
print("number of users in the data = {0}".format(len(corpus_users)))

number of users in the data = 29997


Next: create utterances

In [36]:
c=0
count=0
errors=[]
utterance_corpus = {}

for i , r in pairDF.iterrows():
    #this creates an Utterance using the metadata provided in the original file. Note: this is for the original post in each observation within the pandas dataframe
    utterance_corpus[r.op_name]=Utterance(id=r.op_name ,
                                          user=corpus_users[r.op_author],
                                          root=r.op_name ,
                                          reply_to=None,
                                          timestamp=None,
                                          text=r.op_text,
                                          meta= {'pair_ids':[],
                                                 'success':None,
                                                 'approved_by': None,
                                                 'author_flair_css_class': None,
                                                 'author_flair_text': None,
                                                 'banned_by': None,
                                                 'controversiality': None,
                                                 'distinguished': None,
                                                 'downs': None,
                                                 'edited': None,
                                                 'gilded': None,
                                                 'likes': None,
                                                 'mod_reports':None,
                                                 'num_reports': None,
                                                 'replies': [com['id'] for com in r.comments if com['parent_id']==r.op_name],
                                                 'report_reasons': None,
                                                 'saved': None,
                                                 'score': None,
                                                 'score_hidden': None,
                                                 'subreddit': None,
                                                 'subreddit_id': None,
                                                 'ups': None,
                                                 'user_reports': None})
    #note: now for every comment in the original thread, make an utterance
    for comment in r.comments:
        try:
            utterance_corpus[comment['name']]=Utterance(id=comment['name'],
                                                        user=corpus_users[comment['author']],
                                                        root=r.op_name,
                                                        reply_to=comment['parent_id'],
                                                        timestamp=comment['created'],
                                                        text=comment['body'] ,
                                                        meta={
                                                            'pair_ids':pairIDS[comment['name']],
                                                            'success':comment['success'],
                                                            'approved_by': comment['approved_by'],
                                                            'author_flair_css_class': comment['author_flair_css_class'],
                                                            'author_flair_text': comment['author_flair_text'],
                                                            'banned_by': comment['banned_by'],
                                                            'controversiality': comment['controversiality'],
                                                            'distinguished': comment['distinguished'],
                                                            'downs': comment['downs'],
                                                            'edited': comment['edited'],
                                                            'gilded': comment['gilded'],
                                                            'likes': comment['likes'],
                                                            'mod_reports':comment['mod_reports'],
                                                            'num_reports': comment['num_reports'],
                                                            'replies':comment['replies'],
                                                            'report_reasons': comment['report_reasons'],
                                                            'saved': comment['saved'],
                                                            'score': comment['score'],
                                                            'score_hidden': comment['score_hidden'],
                                                            'subreddit': comment['subreddit'],
                                                            'subreddit_id': comment['subreddit_id'],
                                                            'ups': comment['ups'],
                                                            'user_reports': comment['user_reports']
                                                             })

        #this except catches comments that are missing an expected key (e.g. no text body), see error examples below
        except KeyError:
            c=c+1
            errors.append(comment)

print('there were '+str(c)+' comments that were not collected because they were missing one of the common attributes')

there were 462 comments that were not collected because they were missing one of the common attributes


Examples of uncollected comments (note that none of them have a text body):

In [37]:
errors[9]

{'children': ['cmsgxzm'],
 'count': 0,
 'id': 'cmsgxzm',
 'name': 't1_cmsgxzm',
 'parent_id': 't1_cmsgpv3',
 'success': None}

In [38]:
errors[345]

{'children': ['cqybfct'],
 'count': 1,
 'id': 'cqybfct',
 'name': 't1_cqybfct',
 'parent_id': 't1_cqy6fa6',
 'success': None}
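
To see why such an entry is skipped, one can list which expected keys it lacks. The sketch below uses the first stub shown above. The `REQUIRED` tuple is an illustrative subset of the keys read in the utterance-building cell, and `missing_fields` is a hypothetical helper.

```python
# Illustrative subset of the keys the utterance-building cell reads.
REQUIRED = ("author", "body", "created")


def missing_fields(comment, required=REQUIRED):
    """Return the required keys that are absent from a comment dict."""
    return [key for key in required if key not in comment]


# The stub printed by errors[9] above.
stub = {
    "children": ["cmsgxzm"],
    "count": 0,
    "id": "cmsgxzm",
    "name": "t1_cmsgxzm",
    "parent_id": "t1_cmsgpv3",
    "success": None,
}

print(missing_fields(stub))  # ['author', 'body', 'created']
```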

In [39]:
len(utterance_corpus)

242360

Note above: the number of utterances stored is smaller than the total number of posts and comments visited across all rows of our dataset. This is expected, for two reasons:
    1. each positive and negative thread corresponds to the same original post.
    2. original posts were re-used to compare different successful/unsuccessful arguments.
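
Because `utterance_corpus` is a dictionary keyed by utterance ID, a post or comment that appears in several rows is stored only once. A toy illustration with made-up IDs:

```python
# Two toy rows share the same original post and therefore the same comments.
toy_rows = [
    {"op_name": "t3_a", "comments": ["t1_1", "t1_2"]},
    {"op_name": "t3_a", "comments": ["t1_1", "t1_2"]},
    {"op_name": "t3_b", "comments": ["t1_3"]},
]

utterances = {}
visited = 0
for row in toy_rows:
    utterances[row["op_name"]] = "post"
    visited += 1
    for name in row["comments"]:
        # writing the same key again overwrites instead of adding an entry
        utterances[name] = "comment"
        visited += 1

print(visited, len(utterances))  # 8 5
```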

##### Creating a corpus from a list of utterances:

In [40]:
utterance_list = [utterance for k,utterance in utterance_corpus.items()]

In [41]:
change_my_view_corpus = Corpus(utterances=utterance_list, version=1)

In [42]:
print("number of conversations in the dataset = {}".format(len(change_my_view_corpus.get_conversation_ids())))

number of conversations in the dataset = 2509


Note: 2509 is the number of original posts recorded in the dataset

In [43]:
convo_ids = change_my_view_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:2]):
    print("sample conversation {}:".format(i))
    print(change_my_view_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['t3_2ro9ux', 't1_cnhplrm', 't1_cnhrvq7', 't1_cnhz66d', 't1_cniauhy', 't1_cnibfev', 't1_cnic0gj', 't1_cnhpsmr', 't1_cnhpvqs', 't1_cnhq7iw', 't1_cnhqrw1', 't1_cnhqzsf', 't1_cni8tcx', 't1_cnhpp4o', 't1_cnhqouu', 't1_cnhrd8u', 't1_cnhrwsq', 't1_cnhs6sc', 't1_cnhtr4t', 't1_cnhuopi', 't1_cnio1bg', 't1_cnhq330', 't1_cnhs7xb', 't1_cnhpnmr', 't1_cnhqhxa', 't1_cnhrkoc', 't1_cnhq7nv', 't1_cnhqcwz', 't1_cnhsyft', 't1_cnhww76', 't1_cnhz5wq', 't1_cni80dr', 't1_cni8e2y']
sample conversation 1:
['t3_2ro0ti', 't1_cnhpddf', 't1_cnhpqan', 't1_cnhuxye', 't1_cni1m79', 't1_cni24ug', 't1_cnhrcu4', 't1_cni06fr', 't1_cnhp0bu', 't1_cnhppsw', 't1_cnhwhma', 't1_cnho6mi', 't1_cnhot32', 't1_cnhp1pb', 't1_cnho7iy', 't1_cnhoqp4', 't1_cnhobzs', 't1_cnhop4t', 't1_cnhp1nq', 't1_cnhpgyd', 't1_cnhp5lp', 't1_cnhplmn', 't1_cni3tyd', 't1_cnhqck4', 't1_cnhpee3', 't1_cnhregg', 't1_cniogf7', 't1_cnhowj2', 't1_cnhxuu1', 't1_cniedbg', 't1_cnixgm0']


##### Add conversation-level metadata:

In [62]:
convos = change_my_view_corpus.iter_conversations()
for convo in convos:
    # look up the first pairDF row for this conversation's original post once
    op_row = pairDF[pairDF.op_name==convo._id].iloc[0]
    convo.add_meta('op-userID',op_row['op_author'])
    convo.add_meta('op-text-body',op_row['op_text'])
    convo.add_meta('op-title',op_row['op_title'])
    convo.add_meta('pair_ids',op_row['pIDs'])
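
The cell above filters `pairDF` by `op_name` for every conversation. With a few thousand conversations this is fine, but the lookup could also be built once up front. A plain-Python sketch of that idea, on made-up rows and with a hypothetical helper name:

```python
def first_row_per_post(rows):
    """Map each op_name to the first row that mentions it."""
    lookup = {}
    for row in rows:
        # setdefault keeps the first row seen for each post
        lookup.setdefault(row["op_name"], row)
    return lookup


toy_rows = [
    {"op_name": "t3_a", "op_author": "alice", "op_title": "CMV: first"},
    {"op_name": "t3_a", "op_author": "alice", "op_title": "CMV: first"},
    {"op_name": "t3_b", "op_author": "erin", "op_title": "CMV: second"},
]
lookup = first_row_per_post(toy_rows)
print(lookup["t3_b"]["op_author"])  # erin
```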


##### Add corpus title:

In [63]:
change_my_view_corpus.meta['name'] = "Change My View Corpus"

In [64]:
change_my_view_corpus.print_summary_stats()

Number of Users: 29997
Number of Utterances: 242360
Number of Conversations: 2509


In [65]:
change_my_view_corpus.dump('change-my-view-corpus', base_path='C:\\Users\\Andrew\\Desktop\\CMV data')