## Analyzing CMV-mod data
`CMV-mod` is a dataset from the Change My View (CMV) subreddit, extracted with moderator (`mod`) access.

A *submission* is a single page in CMV that starts with the "Change My View" post (OP = original post) and contains a number of threads.
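To make the submission/thread distinction concrete, here is a minimal sketch that splits a toy submission into threads. It assumes that a thread is a root-to-leaf path through the reply tree; the actual `RedditThread.reconstruct_threads_from_submission` may work differently. The function `root_to_leaf_threads` and the `(comment_id, parent_id)` input format are hypothetical and used only for illustration.

```python
from collections import defaultdict


def root_to_leaf_threads(comments):
    """Split a submission into threads (hypothetical helper).

    comments: list of (comment_id, parent_id) pairs, where parent_id is None
    for a top-level reply to the OP. Assumption: one thread = one root-to-leaf
    path in the reply tree.
    """
    children = defaultdict(list)
    roots = []
    for comment_id, parent_id in comments:
        if parent_id is None:
            roots.append(comment_id)
        else:
            children[parent_id].append(comment_id)

    threads = []
    # Depth-first traversal; each stack entry is the path from a root so far.
    stack = [[root] for root in reversed(roots)]
    while stack:
        path = stack.pop()
        replies = children.get(path[-1], [])
        if not replies:
            threads.append(path)  # reached a leaf: the path is a full thread
        else:
            for reply in reversed(replies):
                stack.append(path + [reply])
    return threads


# Toy submission: c1 has two replies (c2, c3), c2 has one reply (c4),
# and c5 is a second top-level comment without replies.
toy_submission = [("c1", None), ("c2", "c1"), ("c3", "c1"), ("c4", "c2"), ("c5", None)]
threads = root_to_leaf_threads(toy_submission)
for thread in threads:
    print(" -> ".join(thread))
```

Under this assumption a comment that has several replies belongs to several threads, so the sum of thread lengths can exceed the number of distinct comments.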

## Submissions and threads

In [3]:
from RedditThread import RedditThread
import os
from pandas import DataFrame
import seaborn
import matplotlib.pyplot as plt
%matplotlib inline

# main_dir points to a folder of JSON files scraped via the Reddit API (praw)
# with mod access granted by Reddit CMV OPs. The data is *not* part of the
# repository due to its size (0.5 GB compressed, 3.5 GB uncompressed);
# it is available upon request.
main_dir = '/home/user-ukp/data2/cmv-full-2017-09-22/'
files = [f for f in os.listdir(main_dir) if os.path.isfile(os.path.join(main_dir, f))]

thread_counts = []
comment_counts = []

for f in files:
    comments = RedditThread.load_comments_from_file(os.path.join(main_dir, f))
    clean_threads = RedditThread.discard_corrupted_threads(RedditThread.reconstruct_threads_from_submission(comments))
    
    # drop empty threads and outliers (threads longer than 200 comments)
    clean_threads = [thread for thread in clean_threads if 200 >= len(thread.comments) > 0]
    
    thread_counts.append(len(clean_threads))
    comment_counts.extend([len(thread.comments) for thread in clean_threads])

print("Submissions: ", len(thread_counts))
print("Threads: ", len(comment_counts))
print("Comments: ", sum(comment_counts))

Submissions:  31926
Threads:  780040
Comments:  4041394


In [4]:
# stats
df = DataFrame(data={"Threads / Submission": thread_counts})
df.describe()

| | Threads / Submission |
|---|---|
| count | 31926.0 |
| mean | 24.432751 |
| std | 38.335248 |
| min | 0.0 |
| 25% | 8.0 |
| 50% | 14.0 |
| 75% | 25.0 |
| max | 1179.0 |


In [5]:
df = DataFrame(data={"Comments / Thread": comment_counts})
df.describe()

| | Comments / Thread |
|---|---|
| count | 780040.0 |
| mean | 5.181009 |
| std | 3.840263 |
| min | 1.0 |
| 25% | 2.0 |
| 50% | 4.0 |
| 75% | 6.0 |
| max | 192.0 |


## Distribution of fallacy labels

In [6]:
from AnnotatedRedditComment import AnnotatedRedditComment
import pandas

fallacy_labels = dict()

for f in files:
    comments = RedditThread.load_comments_from_file(os.path.join(main_dir, f))
    clean_threads = RedditThread.discard_corrupted_threads(RedditThread.reconstruct_threads_from_submission(comments))
    
    # drop empty threads and outliers (threads longer than 200 comments)
    clean_threads = [thread for thread in clean_threads if 200 >= len(thread.comments) > 0]
    
    for comment in RedditThread.collect_all_comments(clean_threads):
        assert isinstance(comment, AnnotatedRedditComment)
        label = comment.violated_rule
        # update counter
        fallacy_labels[label] = fallacy_labels.get(label, 0) + 1

print(fallacy_labels)

# turn into a nice table
rule_to_str = {0: 'None', 1: 'Direct responses must challenge OP', 2: 'Rude or hostile',
               3: 'Accusing of being unwilling to change view',
               4: 'Not awarded a delta although you have acknowledged a change',
               5: 'Low effort post'}

labels = []
counts = []

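# A label encodes the violated rules as concatenated digits,
# e.g. 125 means rules 1, 2 and 5 were violated; 0 means no violation.
for key in fallacy_labels:
    string_label = []
    for rule in rule_to_str:
        # substring match works because every rule number is a single digit
        if str(rule) in str(key):
            string_label.append(rule_to_str[rule])
    labels.append(' & '.join(string_label))
    counts.append(fallacy_labels[key])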
    
pandas.options.display.max_colwidth = 200
DataFrame(data={'labels': labels, 'counts': counts}).sort_values(by='counts', ascending=False)

{0: 2054378, 1: 4709, 2: 4364, 35: 5, 4: 34, 5: 4146, 1235: 3, 135: 2, 45: 2, 12: 49, 13: 21, 15: 487, 3: 708, 235: 2, 23: 75, 25: 110, 123: 1, 125: 55}


| | counts | labels |
|---|---|---|
| 0 | 2054378 | None |
| 1 | 4709 | Direct responses must challenge OP |
| 2 | 4364 | Rude or hostile |
| 5 | 4146 | Low effort post |
| 12 | 708 | Accusing of being unwilling to change view |
| 11 | 487 | Direct responses must challenge OP & Low effort post |
| 15 | 110 | Rude or hostile & Low effort post |
| 14 | 75 | Rude or hostile & Accusing of being unwilling to change view |
| 17 | 55 | Direct responses must challenge OP & Rude or hostile & Low effort post |
| 9 | 49 | Direct responses must challenge OP & Rude or hostile |
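The table lists each label combination separately. As a complementary view, the sketch below aggregates the counts per rule, so that a comment labeled `125` contributes to rules 1, 2 and 5. The counts are copied from the printed dictionary; `decode_label` and `per_rule` are hypothetical names introduced here, and the decoding assumes that every rule number is a single digit.

```python
from collections import Counter

# Label counts copied from the printed output of the cell above.
fallacy_labels = {0: 2054378, 1: 4709, 2: 4364, 35: 5, 4: 34, 5: 4146,
                  1235: 3, 135: 2, 45: 2, 12: 49, 13: 21, 15: 487, 3: 708,
                  235: 2, 23: 75, 25: 110, 123: 1, 125: 55}


def decode_label(label):
    """Split a combined label such as 125 into the rule numbers [1, 2, 5].

    Assumes single-digit rule numbers; label 0 (no violation) yields [].
    """
    return [int(digit) for digit in str(label) if digit != "0"]


# Count every comment once for each rule it violates.
per_rule = Counter()
for label, count in fallacy_labels.items():
    for rule in decode_label(label):
        per_rule[rule] += count

for rule in sorted(per_rule):
    print(rule, per_rule[rule])
```

Because a comment can violate several rules, these per-rule totals sum to more than the number of comments with a violation.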


So roughly 0.2% of all comments (about 4k out of 2M, i.e. the 4,364 comments labeled only "Rude or hostile") are ad hominem arguments.
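The percentage can be checked directly from the printed label counts. This sketch assumes that ad hominem corresponds to rule 2 ("Rude or hostile"), which is the label whose count matches the 4k figure; the variable names are illustrative.

```python
# Label counts copied from the printed output of the labeling cell.
fallacy_labels = {0: 2054378, 1: 4709, 2: 4364, 35: 5, 4: 34, 5: 4146,
                  1235: 3, 135: 2, 45: 2, 12: 49, 13: 21, 15: 487, 3: 708,
                  235: 2, 23: 75, 25: 110, 123: 1, 125: 55}

total_comments = sum(fallacy_labels.values())

# Comments labeled with rule 2 only.
only_rule_2 = fallacy_labels[2]

# Comments labeled with rule 2, alone or combined with other rules
# (assumes single-digit rule numbers, as in the label encoding above).
any_rule_2 = sum(count for label, count in fallacy_labels.items()
                 if "2" in str(label))

share_only = 100 * only_rule_2 / total_comments
share_any = 100 * any_rule_2 / total_comments

print(f"total comments: {total_comments}")
print(f"rule 2 only:    {only_rule_2} ({share_only:.2f}%)")
print(f"rule 2 (any):   {any_rule_2} ({share_any:.2f}%)")
```

Either way of counting gives a share close to 0.2%.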