In [1]:
import os
os.chdir('../../..')

In [2]:
os.listdir()

['LICENSE.md',
 'convokit',
 '.DS_Store',
 'requirements.txt',
 'Makefile',
 'website',
 'datasets',
 'tests',
 'README.md',
 'setup.py',
 '.gitignore',
 'ldavis_prepared_8',
 'CONTRIBUTING.md',
 'examples',
 'doc',
 'ldavis_prepared_news_8',
 '.git',
 '.idea']

In [6]:
from convokit import Corpus, LanguageModel, download

In [98]:
progun_corpus = Corpus(download('subreddit-gunpolitics'))

Downloading subreddit-gunpolitics to /Users/calebchiam/.convokit/downloads/subreddit-gunpolitics
Downloading subreddit-gunpolitics from http://zissou.infosci.cornell.edu/convokit/datasets/subreddit-corpus/corpus-zipped/guitarpractice~-~gwent/gunpolitics.corpus.zip (53.1MB)... Done


In [99]:
progun_corpus.print_summary_stats()

Number of Users: 17482
Number of Utterances: 324345
Number of Conversations: 20240


In [127]:
antigun_corpus = Corpus(download('subreddit-GunsAreCool'))

Dataset already exists at /Users/calebchiam/.convokit/downloads/subreddit-GunsAreCool


In [128]:
antigun_corpus.print_summary_stats()

Number of Users: 15559
Number of Utterances: 346866
Number of Conversations: 86943


Let's exclude users who appear in both corpora, since they may be arguing against the subreddit's prevailing position rather than representing it.

In [129]:
antigun_users = set([user.name for user in antigun_corpus.iter_users()])
progun_users = set([user.name for user in progun_corpus.iter_users()])

In [130]:
len(progun_users.intersection(antigun_users))

2163

In [131]:
skipped_users = antigun_users.intersection(progun_users)

In [132]:
# Keep '[deleted]' users: the label covers many accounts whose identity we don't know.
# discard() (unlike remove()) does not raise if this cell is re-run.
skipped_users.discard('[deleted]')

In [133]:
'[deleted]' in skipped_users

False

In [134]:
'AutoModerator' in antigun_users

True

In [135]:
# Basic cleaning: skip automated AutoModerator comments
skipped_users.add('AutoModerator')

In [136]:
len(skipped_users)

2163

To avoid biasing the model toward frequent users, we sample at most 3 utterances per user; users with fewer than 3 keep all of theirs. This wastes little data (the corpus is small) while ensuring that no single user has an outsized impact on the model.

We also restrict the training corpus to utterances that are at least 25 characters long. This drops short non-sentence responses while keeping most short sentences, so the corpus consists mainly of sentences rather than phrases. It also excludes comments that are deleted (i.e. "[deleted]") or missing (i.e. the empty string).

When computing perplexity scores, we apply the same filter, i.e. we only score utterances that are at least 25 characters long.
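
The length filter and per-user cap described above can be sketched as a single reusable helper. This is a minimal sketch that works on plain `(user, text)` pairs rather than ConvoKit objects; the name `sample_per_user` and its parameters are hypothetical and not part of ConvoKit.

```python
import random
from collections import defaultdict

def sample_per_user(pairs, skipped, min_chars=25, max_per_user=3, seed=None):
    """Hypothetical helper: filter (user, text) pairs, then cap each user's share."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, text in pairs:
        # Keep only sufficiently long texts from users we are not skipping
        if len(text) >= min_chars and user not in skipped:
            by_user[user].append(text)
    sampled = []
    for user, texts in by_user.items():
        # Users with fewer than max_per_user texts keep all of them
        k = min(max_per_user, len(texts))
        sampled.extend((user, t) for t in rng.sample(texts, k))
    return sampled

# Toy data: user "a" is prolific, "b" has one short text, "c" is skipped
toy = [("a", "x" * 30)] * 5 + [("b", "y" * 30), ("b", "short"), ("c", "z" * 40)]
result = sample_per_user(toy, skipped={"c"}, seed=0)
```

Using `min(max_per_user, len(texts))` avoids relying on `ValueError` to detect users with too few utterances.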

In [154]:
def clean_text(txt):
    return txt.replace("\n", " ")

## Progun corpus

In [137]:
progun_utts = [utt for utt in progun_corpus.iter_utterances() if len(utt.text) >= 25 and utt.user.name not in skipped_users]

In [138]:
from collections import defaultdict
progun_user_to_utt = defaultdict(list)
for utt in progun_utts:
    progun_user_to_utt[utt.user.name].append(utt)

In [139]:
# Average number of utterances per user (after filtering)
len(progun_utts) / len(progun_user_to_utt)

12.488815740255854

In [140]:
from random import sample
for user_id, user_utts in progun_user_to_utt.items():
    # Cap each user at 3 utterances; users with 3 or fewer keep all of theirs
    if len(user_utts) > 3:
        progun_user_to_utt[user_id] = sample(user_utts, 3)

In [141]:
sampled_utts = []
for user_id, user_utts in progun_user_to_utt.items():
    sampled_utts.extend(user_utts)

In [142]:
[utt.text for utt in sample(sampled_utts, 10)]

['Except nigras, it *is* the south, after all. ',
 "From one high schooler to another, good job on standing up for your beliefs :) I live in a gun-less  country, so there were no protests or anything. But if there was, I would've been on the same side as you. ",
 'http://i.imgur.com/5ottssc.gif',
 '[Hey, no problem](http://imgur.com/vJSzz8a)',
 "First off, I'm new in this subreddit. My name is Mark, and I'm from Texas. I've had a gun in the house since the day I was born. \n\nAnyways, I got curious about this so I looked it up expecting to see some graphs or data sheets with shootings at establishments with guns banned and guns allowed. (Not) surprisingly I was slapped in the face with 5 pages of anti gun articles and all that not so good stuff. So I'm here to ask this community, do you think antigun or pro gun establishments are more likely to be involved in a mass shooting? \n\nIf you have a link to a page about this that provides statistics on this I would be very greatful. \nThanks

In [155]:
pro_gun_text = ''
for utt in sampled_utts:
    # Strip first so trailing whitespace does not hide an existing final period
    text = clean_text(utt.text).strip()
    pro_gun_text += text + (' ' if text.endswith('.') else '. ')

In [175]:
from nltk import sent_tokenize

In [179]:
# Write one sentence per line, skipping fragments of 5 characters or fewer
with open('progun_corpus.txt', 'w') as f:
    for sentence in sent_tokenize(pro_gun_text):
        if len(sentence) > 5:
            f.write(sentence)
            f.write("\n")

## Antigun corpus

In [157]:
antigun_utts = [utt for utt in antigun_corpus.iter_utterances() if len(utt.text) >= 25 and utt.user.name not in skipped_users]

In [158]:
antigun_user_to_utt = defaultdict(list)
for utt in antigun_utts:
    antigun_user_to_utt[utt.user.name].append(utt)

In [159]:
# Average number of utterances per user (after filtering)
len(antigun_utts) / len(antigun_user_to_utt)

11.761225944404847

In [160]:
for user_id, user_utts in antigun_user_to_utt.items():
    # Cap each user at 3 utterances; users with 3 or fewer keep all of theirs
    if len(user_utts) > 3:
        antigun_user_to_utt[user_id] = sample(user_utts, 3)

In [161]:
sampled_utts = []
for user_id, user_utts in antigun_user_to_utt.items():
    sampled_utts.extend(user_utts)

In [162]:
[utt.text for utt in sample(sampled_utts, 10)]

['It was the guns fault, it made him do it. Im saying it right on this subreddit right?',
 'Looking at the picture I said, "That doesn\'t look like Will Smith or Jada Smith."\n\n Looked closer at the name that had a caption "...not to be confused with the actor."',
 'Just like our rape culture, right?',
 "Over 500 casualties in one event isn't a real problem? Yikes.",
 'Ask Trayvon Martin how great justice-by-gun is. Oh ... um ... never mind.',
 'A defeated safe, you say?\n\nTell us more about this safe and how it was defeated. ',
 "Yay a response to my question. I was beginning to wonder. Don't worry, there are like 20 other mods besides the ones you seem to have a problem with and (gasp!) a lot of the mods are liberal. So pretty much everyone stays in check and we too have had the admins ban mods who have been caught misbehaving. That's the funny thing about actual evidence, when you have some people act on it versus what you guys do which is make accusations without credible evidenc

In [164]:
anti_gun_text = ''
for utt in sampled_utts:
    # Strip first so trailing whitespace does not hide an existing final period
    text = clean_text(utt.text).strip()
    anti_gun_text += text + (' ' if text.endswith('.') else '. ')

In [180]:
# Write one sentence per line, skipping fragments of 5 characters or fewer
with open('antigun_corpus.txt', 'w') as f:
    for sentence in sent_tokenize(anti_gun_text):
        if len(sentence) > 5:
            f.write(sentence)
            f.write("\n")

In [230]:
lm = LanguageModel(SRILM_path='/Users/calebchiam/Documents/GitHub/cs6742-fork/convokit/SRILM/srilm-1.7.3',
                  working_dir='/Users/calebchiam/Documents/GitHub/cs6742-fork/convokit/SRILM/dump/',
                  lm_output_path='progun.lm',
                  lm_type='laplace',
                  count_output_path='progun_counts.txt',
                  order=2,
                  verbose=False)

In [231]:
lm.train('progun_corpus.txt')




In [232]:
lm.str_perplexity("I love guns.")

213.3894

In [233]:
lm.str_perplexity("I hate guns.")

523.6776

In [234]:
lm.str_perplexity("We have a right to arm ourselves.")

1721.711

In [235]:
lm.str_perplexity("We do not have a right to arm ourselves.")

1525.072
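
To give some intuition for the scores above, here is a minimal pure-Python sketch of perplexity under an add-one (Laplace) smoothed bigram model, matching the `lm_type='laplace'` and `order=2` settings in spirit. It is an illustration only: the function names are hypothetical, the tokenization is a plain whitespace split, and SRILM's actual handling of vocabulary, unknown words, and sentence boundaries may differ, so the numbers it produces are not comparable to the outputs above.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigram contexts and bigrams, padding each sentence with <s> and </s>."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1          # how often prev was seen as a context
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)

def laplace_perplexity(sentence, contexts, bigrams, vocab_size):
    """Perplexity = exp(-mean log P(w_i | w_{i-1})) with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing: unseen bigrams still get non-zero probability
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    n_predicted = len(tokens) - 1  # every token except the initial <s>
    return math.exp(-log_prob / n_predicted)

contexts, bigrams, vocab_size = train_bigram_counts(["i love guns", "i love dogs"])
seen = laplace_perplexity("i love guns", contexts, bigrams, vocab_size)
unseen = laplace_perplexity("cats hate rain", contexts, bigrams, vocab_size)
```

In this toy setup `seen` comes out lower than `unseen`: text resembling the training data is less "surprising" to the model and therefore gets a lower perplexity.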

In [225]:
lm2 = LanguageModel(SRILM_path='/Users/calebchiam/Documents/GitHub/cs6742-fork/convokit/SRILM/srilm-1.7.3',
                  working_dir='/Users/calebchiam/Documents/GitHub/cs6742-fork/convokit/SRILM/dump/',
                  lm_output_path='antigun.lm',
                  lm_type='laplace',
                  count_output_path='antigun_counts.txt',
                  order=2,
                  verbose=False)

In [226]:
lm2.train('antigun_corpus.txt')




In [254]:
import json
with open("skipped_users.json", 'w') as f:
    json.dump({"users": list(skipped_users)}, f)