# This notebook is used to load the CMV data, explore its structure, and output a cleaned version.

The line below will download the data into a directory outside the repo and unzip it.

In [120]:
! cd ../.. && mkdir -p CMV && cd CMV && curl -# -O https://chenhaot.com/data/cmv/cmv_20161111.jsonlist.gz && gunzip cmv_20161111.jsonlist.gz

The file holds one JSON object per line, so it can now be loaded into Python line by line.

In [8]:
import glob
import json
data = glob.glob('../../CMV/*')[0] # Get string for JSON file with data

In [66]:
f = open(data, 'rb')

In [67]:
for line in f:
    l = json.loads(line.decode('utf-8'))
    print(l)
    break

{'contest_mode': False, 'suggested_sort': 'qa', 'banned_by': None, 'media_embed': {}, 'subreddit': 'changemyview', 'selftext_html': '<!-- SC_OFF --><div class="md"><p>I have to say that I am very disappointed with the current state of political discourse in today&#39;s society. Both in mass media and the Internet, political discussion seems to be ruled by angry extremists who think that the other side is evil and shout down, insult and in some case censor anyone who doesn&#39;t think so. Many times over I&#39;ve dealt with those types of people (both from the left and the right) only to find out that reason and logic rarely, if ever work on them. Their extreme views also focus a lot of outrage (sometimes from other extremists), often derailing the discussion from the original topic.</p>\n\n<p>So I have a question: Do you guys this can be fixed? Or is this just <a href="http://fishbowl.pastiche.org/archives/pictures/greater_internet_fuckwad_theory.jpg">the GIFT in practise?</a></p>\n\n<
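The loop above stops after the first record. To inspect a few more without reading the whole file, something like the following minimal sketch may help. The helper name `peek_records` is hypothetical (it is not part of the dataset's tooling), and the sketch assumes the file is UTF-8 text with one JSON object per line.

```python
import json
from itertools import islice


def peek_records(path, n=3):
    """Return the first n records of a JSON-lines file as a list of dicts."""
    with open(path, 'r', encoding='utf-8') as fh:
        # islice stops after n lines, so the rest of the file is never read
        return [json.loads(line) for line in islice(fh, n)]
```

For example, `peek_records(data, n=3)` would return the first three posts, where `data` is the path found with `glob` earlier.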

The authors have released some [code](https://vene.ro/blog/winning-arguments-attitude-change-reddit-cmv.html) that we can use to better visualize the textual data:

In [68]:
from IPython.display import Markdown

def show_post(cmv_post):
    md_format = "**{title}** \n\n {selftext}".format(**cmv_post)
    md_format = "\n".join(["> " + line for line in md_format.splitlines()])
    return Markdown(md_format)

In [69]:
show_post(l)

> **CMV: The current state of political discussion makes it pointless to seriously discuss politics on the Internet.** 
> 
>  I have to say that I am very disappointed with the current state of political discourse in today's society. Both in mass media and the Internet, political discussion seems to be ruled by angry extremists who think that the other side is evil and shout down, insult and in some case censor anyone who doesn't think so. Many times over I've dealt with those types of people (both from the left and the right) only to find out that reason and logic rarely, if ever work on them. Their extreme views also focus a lot of outrage (sometimes from other extremists), often derailing the discussion from the original topic.
> 
> So I have a question: Do you guys this can be fixed? Or is this just [the GIFT in practise?](http://fishbowl.pastiche.org/archives/pictures/greater_internet_fuckwad_theory.jpg)
> _____
> 
> > *Hello, users of CMV! This is a footnote from your moderators. We'd just like to remind you of a couple of things. Firstly, please remember to* ***[read through our rules](http://www.reddit.com/r/changemyview/wiki/rules)***. *If you see a comment that has broken one, it is more effective to report it than downvote it. Speaking of which,* ***[downvotes don't change views](http://www.reddit.com/r/changemyview/wiki/guidelines#wiki_upvoting.2Fdownvoting)****! If you are thinking about submitting a CMV yourself, please have a look through our* ***[popular topics wiki](http://www.reddit.com/r/changemyview/wiki/populartopics)*** *first. Any questions or concerns? Feel free to* ***[message us](http://www.reddit.com/message/compose?to=/r/changemyview)***. *Happy CMVing!*

In [70]:
l.keys()

dict_keys(['contest_mode', 'suggested_sort', 'banned_by', 'media_embed', 'subreddit', 'selftext_html', 'selftext', 'likes', 'domain', 'user_reports', 'secure_media', 'saved', 'id', 'gilded', 'secure_media_embed', 'clicked', 'report_reasons', 'author', 'media', 'comments', 'name', 'score', 'approved_by', 'over_18', 'hidden', 'thumbnail', 'subreddit_id', 'edited', 'link_flair_css_class', 'author_flair_css_class', 'downs', 'mod_reports', 'archived', 'removal_reason', 'is_self', 'hide_score', 'spoiler', 'permalink', 'locked', 'stickied', 'created', 'url', 'author_flair_text', 'quarantine', 'title', 'created_utc', 'link_flair_text', 'distinguished', 'num_comments', 'visited', 'num_reports', 'ups'])

In [71]:
l['selftext']

"I have to say that I am very disappointed with the current state of political discourse in today's society. Both in mass media and the Internet, political discussion seems to be ruled by angry extremists who think that the other side is evil and shout down, insult and in some case censor anyone who doesn't think so. Many times over I've dealt with those types of people (both from the left and the right) only to find out that reason and logic rarely, if ever work on them. Their extreme views also focus a lot of outrage (sometimes from other extremists), often derailing the discussion from the original topic.\n\nSo I have a question: Do you guys this can be fixed? Or is this just [the GIFT in practise?](http://fishbowl.pastiche.org/archives/pictures/greater_internet_fuckwad_theory.jpg)\n_____\n\n> *Hello, users of CMV! This is a footnote from your moderators. We'd just like to remind you of a couple of things. Firstly, please remember to* ***[read through our rules](http://www.reddit.co

In [72]:
l['title']

'CMV: The current state of political discussion makes it pointless to seriously discuss politics on the Internet.'

In [73]:
l['name']

't3_5c8xdc'

In [74]:
l['author']

'ralpher313'

In [75]:
l['id']

'5c8xdc'

In [76]:
l['created']

1478826205.0

In [77]:
l['url']

'https://www.reddit.com/r/changemyview/comments/5c8xdc/cmv_the_current_state_of_political_discussion/'

Now to also inspect the comments:

In [94]:
l['comments'][2]

{'approved_by': None,
 'archived': False,
 'author': 'ralpher313',
 'author_flair_css_class': None,
 'author_flair_text': None,
 'banned_by': None,
 'body': "Yes, but it seems like this can't be achieved without heavy moderation. We shouldn't need mods to not fling shit at other people.",
 'body_html': '<div class="md"><p>Yes, but it seems like this can&#39;t be achieved without heavy moderation. We shouldn&#39;t need mods to not fling shit at other people.</p>\n</div>',
 'controversiality': 0,
 'created': 1478826860.0,
 'created_utc': 1478798060.0,
 'distinguished': None,
 'downs': 0,
 'edited': False,
 'gilded': 0,
 'id': 'd9ujt0e',
 'likes': None,
 'link_id': 't3_5c8xdc',
 'mod_reports': [],
 'name': 't1_d9ujt0e',
 'num_reports': None,
 'parent_id': 't1_d9ujmbi',
 'removal_reason': None,
 'replies': {'data': {'after': None,
   'before': None,
   'children': ['d9ujw71'],
   'modhash': ''},
  'kind': 'Listing'},
 'report_reasons': None,
 'saved': False,
 'score': 1,
 'score_hidden': T

In [78]:
l['comments'][0].keys()

dict_keys(['subreddit_id', 'banned_by', 'removal_reason', 'link_id', 'likes', 'replies', 'user_reports', 'saved', 'id', 'gilded', 'archived', 'report_reasons', 'author', 'parent_id', 'score', 'approved_by', 'controversiality', 'body', 'edited', 'author_flair_css_class', 'downs', 'body_html', 'stickied', 'subreddit', 'score_hidden', 'name', 'created', 'author_flair_text', 'created_utc', 'ups', 'mod_reports', 'num_reports', 'distinguished'])
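Each comment carries both a `created` and a `created_utc` field. In the comment printed above they differ by 28800 seconds, i.e. 8 hours. The sketch below assumes that `created_utc` is the true UTC epoch time in seconds (the usual Reddit convention, not verified here) and converts it to a timezone-aware `datetime`. The helper name `to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Values copied from the comment record shown above
comment = {'created': 1478826860.0, 'created_utc': 1478798060.0}

# Offset between the two fields, in hours
offset_hours = (comment['created'] - comment['created_utc']) / 3600
print(offset_hours)
print(to_utc(comment['created_utc']).isoformat())
```

If the assumption holds, `created_utc` is probably the safer field to use when ordering comments in time.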

In [103]:
for c in l['comments']:
    try:
        print(c['author'])
        print(c['created'])
        print(c['body'])
        print('\n')
        print(c)
    except (KeyError, TypeError):  # skip entries that lack these fields
        pass

Ansuz07
1478826647.0
I would offer this very sub as a counterpoint - throughout this election we have had very civil, well informed discussions about both candidates (albeit more democratic leaning just due to Reddit's natural demographics).  Vitriolic discussion has been shut down by the mod team and only level headed discussion allowed to remain.

The internet allows for echo-chamber discussions to be certain, but its not _impossible_ to find civil discourse on the internet.  It just takes communities like this one who commit to that ideal and this can and does happen.


{'subreddit_id': 't5_2w2s8', 'banned_by': None, 'removal_reason': None, 'link_id': 't3_5c8xdc', 'likes': None, 'replies': {'kind': 'Listing', 'data': {'modhash': '', 'children': ['d9ujt0e', 'd9ujrsj'], 'after': None, 'before': None}}, 'user_reports': [], 'saved': False, 'id': 'd9ujmbi', 'gilded': 0, 'archived': False, 'report_reasons': None, 'author': 'Ansuz07', 'parent_id': 't3_5c8xdc', 'score': 1, 'approved_by': No

The comments on this post include one from the delta bot, which confirms that the original poster (OP) awarded a delta to a user, Ansuz07. The OP awarded the delta by including the ∆ symbol in a reply to the second comment by Ansuz07. We can infer that this comment changed the OP's view.

In [85]:
def has_delta(comment_text):
    # True if the comment body contains the delta symbol
    return '∆' in comment_text

In [86]:
print('∆')

∆


In [90]:
for c in l['comments']:
    try:
        if has_delta(c['body']):
            print("DELTA DETECTED")
            print(c['body'])
    except (KeyError, TypeError):  # skip entries without a body
        pass

DELTA DETECTED
I didn't said that it can't occur, it just seems to me that way. Or maybe this election has left me VERY worn out with politics and I've just started noticing this more.

∆

Also, kinda off-topic, but don't know where else to ask this:

What should I do when I'm REALLY fucking sick of politics?

DELTA DETECTED
Confirmed: 1 delta awarded to /u/Ansuz07 ([75∆](/r/changemyview/wiki/user/Ansuz07)).

^[Delta System Explained](https://www.reddit.com/r/changemyview/wiki/deltasystem) ^| ^[Deltaboards](https://www.reddit.com/r/changemyview/wiki/deltaboards)
[​](HTTP://DB3PARAMSSTART
{
  "comment": "This is hidden text for DB3 to parse. Please contact the author of DB3 if you see this",
  "issues": {},
  "parentUserName": "Ansuz07"
}
DB3PARAMSEND)


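TODO: The delta is currently attached to the comment that awards it (and to the delta bot's confirmation), not to the comment that earned it. Either fix this here, so that the delta is credited to the comment being replied to (written by the user named in the delta bot's confirmation), or handle it in a later step.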
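One possible way to handle the TODO is sketched below. It is not the dataset authors' method. It assumes that every comment dict has `name`, `parent_id`, and `body` fields as in the records shown earlier, that a delta is awarded by replying directly to the persuasive comment, and that bot confirmations can be recognised by the "Confirmed: 1 delta awarded to /u/..." wording seen in the output above. The function name `attribute_deltas` is hypothetical.

```python
import re

DELTA = '∆'
# Wording taken from the bot confirmation printed above
CONFIRMATION = re.compile(r'Confirmed: \d+ delta awarded to /u/([\w-]+)')


def attribute_deltas(comments):
    """Return the names (e.g. 't1_d9ujmbi') of comments that earned a delta."""
    by_name = {c['name']: c for c in comments if 'name' in c}
    credited = set()
    for c in comments:
        body = c.get('body', '')
        # Ignore comments without a delta and the bot's own confirmations
        if DELTA not in body or CONFIRMATION.search(body):
            continue
        # Credit the comment being replied to; top-level comments have a
        # post ('t3_...') as parent, which is not in by_name
        parent = by_name.get(c.get('parent_id'))
        if parent is not None:
            credited.add(parent['name'])
    return credited
```

This sketch does not check who awarded the delta or whether the bot confirmed it, so it may over-count. Cross-checking against the username captured by `CONFIRMATION` would be a natural refinement.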

Now read in each line, extract important information, and store it.

In [114]:
posts_dict = {}
comments_dict = {}
# The with-block closes the file even if a line fails to parse
with open(data, 'rb') as f:
    for line in f:
        post = json.loads(line.decode('utf-8'))
        post_info = {
            'title': post['title'],
            'text': post['selftext'],
            'author': post['author'],
            'num_comments': post['num_comments'],
            'time': post['created'],
            'url': post['url'],
            'name': post['name'],
            'score': post['score']
        }
        comment_list = []
        for c in post['comments']:
            try:
                comment_list.append({
                    'author': c['author'],
                    'time': c['created'],
                    'text': c['body'],
                    'parent': c['parent_id'],
                    'score': c['score'],
                    'delta': has_delta(c['body'])
                })
            except (KeyError, TypeError):  # Skip comments that lack these attributes
                continue
        posts_dict[post['id']] = post_info
        comments_dict[post['id']] = comment_list
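The per-comment extraction can also be written as a small function that checks for the required fields explicitly instead of relying on exception handling. This is only a sketch: `REQUIRED_KEYS` and `extract_comment` are hypothetical names, and `has_delta` is repeated here so the block runs on its own.

```python
REQUIRED_KEYS = ('author', 'created', 'body', 'parent_id', 'score')


def has_delta(comment_text):
    return '∆' in comment_text


def extract_comment(c):
    """Return the fields kept for one comment, or None if any are missing."""
    if not isinstance(c, dict) or any(k not in c for k in REQUIRED_KEYS):
        return None
    return {
        'author': c['author'],
        'time': c['created'],
        'text': c['body'],
        'parent': c['parent_id'],
        'score': c['score'],
        'delta': has_delta(c['body']),
    }
```

With this helper, the inner loop could build `comment_list` by calling `extract_comment(c)` on each entry and keeping the results that are not `None`.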

In [115]:
import pickle

with open('post_info.p', 'wb') as out:
    pickle.dump(posts_dict, out)

In [116]:
with open('comment_info.p', 'wb') as out:
    pickle.dump(comments_dict, out)

In [117]:
!ls -lh

total 1974024
-rw-r--r--  1 trd54  staff   909M Sep  2 17:48 comment_info.p
-rw-r--r--  1 trd54  staff    76K Sep  2 17:49 data_exploration.ipynb
-rw-r--r--  1 trd54  staff    55M Sep  2 17:48 post_info.p


The file containing the comment info is now almost a gigabyte in size (909M). The posts file is much smaller (55M) but still not worth putting on GitHub.
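If file size becomes a problem, one option is to compress the pickles on write. The sketch below wraps `pickle` with `gzip` from the standard library. The helper names `dump_compressed` and `load_compressed` are hypothetical, and how much space this saves on the CMV data has not been measured here.

```python
import gzip
import pickle


def dump_compressed(obj, path):
    """Pickle obj into a gzip-compressed file."""
    with gzip.open(path, 'wb') as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    """Load an object written by dump_compressed."""
    with gzip.open(path, 'rb') as fh:
        return pickle.load(fh)
```

For example, `dump_compressed(comments_dict, 'comment_info.p.gz')` would write the comments, and `load_compressed('comment_info.p.gz')` would read them back.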