# Reddit Political Analysis
## Insight into the Behavior of Politically Active Redditors

## Introduction

### Reddit as a Social Network

This report deals heavily with the nature of Reddit and how users interact with the platform. It is therefore essential to first outline these characteristics, as they form the premise of our analysis and later discussion.

Reddit is a social media network centered around forum-based interactions. Lying somewhere between 4Chan and Facebook, Reddit shifts focus away from the user and toward content-specific communities, without going so far as to be completely anonymous. These communities tend to be cliquey, with their own vernacular of inside jokes and jargon: for instance, HighQualityGifs tends to have meta-discussion on GIF-making. However, users are not restricted to a single community and can participate in any of them. This leads to the two most important characteristics of Reddit:

1. Reddit is essentially composed of subreddits.
2. Users interact on subreddits, and are free to do so on any subreddit.
    
### Political Activism on Reddit

Reddit is highly politically active. The platform lends itself to confrontation and conversation between people of varying backgrounds and political leanings, more so than Facebook or even Twitter, because people are not restricted to their own friend circles. Differing political factions pour out of their respective subreddits into common spaces across Reddit, influencing the nature of discussion in the mainstream. Moreover, the platform is an important source of news for many of its users. According to a study by the Pew Research Center, although fewer than 1 in 10 American adults use Reddit, more than 7 in 10 of those users rely on it as their primary news source. This makes Reddit an important place to look to see how people consume news in the Age of the Internet.

The epicenter of political activism on the platform lies in . There are subreddits for almost every type of political affiliation, though some of the more extreme ideologies have had their subreddits shut down, e.g. r/nationalsocialism. The most famous of these subreddits is r/The_Donald, with 600,000 subscribers. The_Donald is notorious across the platform for being highly insular, hosting almost exclusively content that conforms to what the community wants to hear. This is true for most, if not all, of the political subreddits on both the left and the right.

### The Problem

All this being said, we wanted to look at just how insular Reddit is, evaluating whether it could reasonably be split into a "Left" and a "Right" Reddit, each running in its own echo chamber. This is of interest because it would give us a better understanding of:

1. Reddit's efficacy as an informative News Source
2. Organic Organization of Political Activist groups
3. The differing structure of "Leftist" Reddit and "Right" Reddit.

## Data Collection

In [1]:
import praw, json, sys
from pprint import pprint

reddit = praw.Reddit(client_id=sys.argv[1],
                     client_secret=sys.argv[2],
                     # adjacent string literals are joined, so no stray indentation ends up in the header
                     user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                                '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36')
sub = sys.argv[3].lower()
print(sub)

users = set()
usernames = set()
user_subs = dict()

with open('users/' + sub + '.json', 'w') as f:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.top('month'): # search 100 top posts of the month
        post.comments.replace_more(limit=None) # all comments on each post
        comments = post.comments.list()
        for comment in comments:
            if comment.author is not None:  # skip comments without an author (e.g. deleted accounts)
                users.add(comment.author)
                usernames.add(comment.author.name)
                subs = user_subs.get(comment.author.name, set())
                subs.add(sub)
                user_subs[comment.author.name] = subs

    json.dump(list(usernames), f, separators=(',', ':'))

print(len(usernames))
with open('subs/' + sub + '.json', 'w') as f:
    for user in list(users):
        try:
            for comment in user.comments.top('month'): # search 100 top comments
                # record every subreddit this user commented in
                user_subs[user.name].add(str(comment.subreddit).lower())
        except Exception:
            # skip users whose comment history cannot be fetched
            pass
    for user in user_subs:
        user_subs[user] = list(user_subs[user])
    json.dump(user_subs, f, separators=(',', ':'))
print("done")

IndexError: list index out of range

## Classification

In [68]:
import json
import numpy as np

In [69]:
with open('reddit data/all_subs.json') as subs_json:
    subreddits = json.load(subs_json)

In [70]:
print(len(subreddits))
print(subreddits[:10])

18069
['moviepass', 'greendawn', 'disneyporn', 'lithuaniaspheres', 'meghanmarkle', 'anarchy101', 'linuxfromscratch', 'lojban', 'hvacadvice', 'ratemysinging']


In [71]:
with open('reddit data/left.json') as left_json:
    left = json.load(left_json)
with open('reddit data/right.json') as right_json:
    right = json.load(right_json)

In [72]:
print("Left User Count: %d, Right User Count: %d" % (len(left), len(right)))

Left User Count: 22059, Right User Count: 17999


In [73]:
all_data = []
all_users = []

for user, subs in left.items():
    all_data.append(subs)
    all_users.append(user)
    

for user, subs in right.items():
    all_data.append(subs)
    all_users.append(user)

In [74]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
converted_data = mlb.fit_transform(all_data)
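
To make the encoding step concrete, the following is a minimal pure-Python sketch of the kind of multi-hot matrix that `MultiLabelBinarizer` builds: one row per user, one column per subreddit, and a 1 wherever the user was seen in that subreddit. The helper `multi_hot` and the toy users are hypothetical and only illustrate the idea; the notebook itself uses scikit-learn's sparse output.

```python
def multi_hot(rows):
    """Encode each list of labels as a 0/1 row over the sorted label vocabulary."""
    classes = sorted({label for row in rows for label in row})
    index = {label: i for i, label in enumerate(classes)}
    matrix = []
    for row in rows:
        encoded = [0] * len(classes)
        for label in row:
            encoded[index[label]] = 1
        matrix.append(encoded)
    return classes, matrix


# Hypothetical toy users, not taken from the scraped data.
toy_users = [["politics", "news"], ["the_donald", "news"], ["politics"]]
classes, matrix = multi_hot(toy_users)

print(classes)
for encoded_row in matrix:
    print(encoded_row)
```

Each column corresponds to one subreddit, so a classifier trained on these rows only sees which communities a user is active in, not how often.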

In [75]:
len_left = len(left)
len_right = len(right)

labels = np.append(np.zeros(len_left), np.ones(len_right))

In [76]:
from sklearn.naive_bayes import BernoulliNB
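
The `alpha` parameter tuned in the cells that follow is the additive smoothing term of Bernoulli Naive Bayes. As a rough sketch (the function name and the counts below are hypothetical, chosen only for illustration), the smoothed estimate of "a user of this class is active in this subreddit" looks like this:

```python
def smoothed_feature_prob(users_with_sub, users_in_class, alpha):
    """Additively smoothed estimate of P(active in subreddit | class)."""
    # Each feature is binary, hence the factor 2 in the denominator.
    return (users_with_sub + alpha) / (users_in_class + 2 * alpha)


# Hypothetical counts: a subreddit seen for 3 of 1000 users in one class.
for alpha in (1.0, 20.0):
    print(alpha, smoothed_feature_prob(3, 1000, alpha))
```

A larger `alpha` pulls the estimates for rarely seen subreddits toward 0.5, which reduces the influence of the many small subreddits that appear for only a handful of users.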

In [77]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV

In [79]:
alpha_range = np.linspace(1, 40, 50)  # coarse grid; 50 points is also np.linspace's default
param_grid = dict(alpha=alpha_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)


def run_grid(param_grid, classifier):
    grid = GridSearchCV(classifier, param_grid=param_grid, cv=cv, verbose=0, n_jobs=-1)
    grid.fit(converted_data, labels)
    return (grid.best_params_, grid.best_score_)
    
best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))

NameError: name 'best_param' is not defined

In [81]:
alpha_range = np.linspace(20, 22, 40)
param_grid = dict(alpha=alpha_range)

best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))

The best parameters are {'alpha': 20.820512820512821} with a score of 0.89
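
The two grid-search cells follow a coarse-to-fine pattern: scan a wide range first, then scan a narrow range around the best value. The sketch below shows that pattern in plain Python. It is an illustration only: `toy_score` is a made-up stand-in for the cross-validated accuracy, and `coarse_to_fine` narrows the range automatically, whereas the notebook picked the range 20 to 22 by hand.

```python
def linspace(start, stop, num):
    """Return `num` evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def coarse_to_fine(score, low, high, num=50, rounds=2):
    """Repeatedly grid-search [low, high], zooming in on the best point."""
    best = None
    for _ in range(rounds):
        grid = linspace(low, high, num)
        best = max(grid, key=score)
        step = (high - low) / (num - 1)
        # Next round: search one grid step on either side of the best point.
        low, high = best - step, best + step
    return best


def toy_score(alpha):
    """Hypothetical score with a single peak at alpha = 20.8."""
    return -(alpha - 20.8) ** 2


best_alpha = coarse_to_fine(toy_score, 1.0, 40.0)
print(round(best_alpha, 2))
```

This only works well when the score has a single broad peak; with a noisy score, the second pass can end up refining around a spurious maximum.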


In [None]:
from sklearn import svm

In [None]:
C_range = np.logspace(1, 3, 4)
gamma_range = np.logspace(-6, -2, 5)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

In [None]:
C_range = np.logspace(1.9, 2.1, 4)
gamma_range = np.logspace(-4.1, -3.9, 4)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

In [85]:
nb_clf = BernoulliNB(alpha=best_param_nb['alpha'])

nb_clf.fit(converted_data, labels)
nb_clf.score(converted_data, labels)

# use the parameters found by the SVM grid search above
clf = svm.SVC(C=best_param_svm['C'], gamma=best_param_svm['gamma'])
clf.fit(converted_data, labels)
clf.score(converted_data, labels)


0.87300913675170999

In [86]:
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
joblib.dump(nb_clf, 'nb.pkl') 
joblib.dump(clf, 'svm.pkl')
joblib.dump(mlb, 'mlb.pkl')

['mlb.pkl']

In [87]:
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
nb_clf = joblib.load('nb.pkl')
svm_clf = joblib.load('svm.pkl')
mlb = joblib.load('mlb.pkl')

In [88]:
with open('reddit data/test_subs.json') as test_json:
    test = json.load(test_json)
    
test_data = []
test_users = []

for user, subs in test.items():
    test_data.append(subs)
    test_users.append(user)

In [89]:
print(len(test_data))

30771


In [90]:
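# keep only subreddits seen during training; a set makes each membership test O(1)
known_subs = set(subreddits)
filtered_data = [[sub for sub in subs if sub in known_subs] for subs in test_data]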

converted_test = mlb.transform(filtered_data)

predicted_labels = nb_clf.predict(converted_test)

In [91]:
with open('reddit data/test_preds_nb.json', 'w') as test_preds:
    json.dump(dict(zip(test_users, predicted_labels)), test_preds)

## Results

In [2]:
import json, math
import matplotlib.pyplot as plt
import networkx as nx
from pprint import pprint

In [99]:
with open('reddit data/left.json') as f:
    left = json.load(f)
with open('reddit data/right.json') as f:
    right = json.load(f)
with open('reddit data/left_subs.json') as f:
    left_subs = set(json.load(f))
with open('reddit data/right_subs.json') as f:
    right_subs = set(json.load(f))
with open('reddit data/left_users.json') as f:
    left_users = set(json.load(f))
with open('reddit data/right_users.json') as f:
    right_users = set(json.load(f))
with open('reddit data/left_sources.txt') as f:
    left_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/right_sources.txt') as f:
    right_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/all_subs.json') as f:
    all_subs = json.load(f)

In [100]:
# invert user:subs dict
sub_users = dict()
for s in all_subs:
    sub_users[s] = set()
for u in left:
    for s in left[u]:
        sub_users[s].add(u)
for u in right:
    for s in right[u]:
        sub_users[s].add(u)
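
The inversion above assumes every subreddit already has an entry created from `all_subs`. As an alternative sketch, a `defaultdict` creates entries on demand, so a subreddit missing from that list would not raise a `KeyError`. The function name and toy data here are hypothetical.

```python
from collections import defaultdict


def invert_user_subs(*user_sub_maps):
    """Turn one or more {user: [subs]} mappings into a single {sub: {users}} mapping."""
    sub_users = defaultdict(set)
    for mapping in user_sub_maps:
        for user, subs in mapping.items():
            for sub in subs:
                sub_users[sub].add(user)
    return dict(sub_users)


# Hypothetical toy data standing in for the left/right user files.
toy_left = {"alice": ["politics", "news"]}
toy_right = {"bob": ["news", "the_donald"]}
toy_sub_users = invert_user_subs(toy_left, toy_right)
print(toy_sub_users)
```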

In [123]:
# filter out small subs
counts = {s:len(sub_users[s]) for s in sub_users}
large_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 1500}
print(len(large_subs))
large_sub_users = set()
for s in large_subs:
    for u in large_subs[s]:
        large_sub_users.add(u)
print(len(large_sub_users))

most_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 50}
print(len(most_subs))

59
38795
1483


In [129]:
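def score_sub(sub):
    """Leaning of a subreddit, roughly in [-1, 1]: negative is left, positive is right."""
    users = sub_users[sub]
    score = 0
    for u in users:
        if u in left_users: score -= 1
        if u in right_users: score += 1
    sub_score = score / len(users)
    # subtract the score expected purely from the left/right imbalance of the sample
    baseline = (len(right_users) - len(left_users)) / (len(left_users) + len(right_users))
    return sub_score - baseline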

scored_subs = {s: score_sub(s) for s in most_subs}
pprint(scored_subs)

{'1200isplenty': -0.22473391842697654,
 '2007scape': 0.005462627135858489,
 '2healthbars': -0.42591968917796463,
 '2meirl4meirl': -0.36668819901863947,
 '3dprinting': -0.2857437360987858,
 '3ds': -0.390172385634051,
 '40klore': -0.0014507002229944138,
 '49ers': -0.1812556575574113,
 '4chan': 0.1573530380947626,
 '4panelcringe': -0.2154573067328236,
 'abandonedporn': -0.34365974962902257,
 'aboringdystopia': -0.6441015073597829,
 'absolutelynotme_irl': -0.4474274497101155,
 'absolutelynotmeirl': -0.25713752794297323,
 'accidentalracism': -0.07750875052312359,
 'accidentalrenaissance': -0.42303720580767645,
 'accidentalwesanderson': -0.5420578146184157,
 'accounting': -0.08342957060088957,
 'actlikeyoubelong': -0.35319241645069194,
 'actuallesbians': -0.6986469619052375,
 'adhd': -0.3915041047623803,
 'adorableporn': -0.41812748138575684,
 'advice': -0.058410275514704874,
 'adviceanimals': -0.36568195269713427,
 'againsthatesubreddits': -0.6076573016393585,
 'ainbow': -0.8068102272113599

 'minipainting': -0.17864696190523743,
 'minneapolis': -0.43198029523857073,
 'minnesota': -0.36416420328454774,
 'minnesotavikings': -0.02109594149707414,
 'misleadingthumbnails': -0.3752094619052374,
 'mkebucks': -0.15864696190523742,
 'mlb': -0.12506205624486005,
 'mlbtheshow': 0.11859441740510743,
 'mls': -0.3594312756307276,
 'mma': -0.05692034320020142,
 'mmorpg': -0.04322527515824945,
 'moderatepolitics': -0.32118217317284303,
 'monarchism': 1.064085957349421,
 'monero': 0.23981457655630106,
 'monsterhunter': -0.36803471700727824,
 'monsterhunterworld': -0.22843419594779057,
 'morbidquestions': 0.02992446666619117,
 'morbidreality': -0.2674588430933562,
 'morrowind': -0.29487337699957705,
 'mostbeautiful': -0.3328574882210269,
 'motorcycles': -0.16055172380999932,
 'mountandblade': -0.017068014536816348,
 'moviedetails': -0.4197975541218195,
 'moviepassclub': -0.4171654804237559,
 'movies': -0.27761262900649775,
 'moviescirclejerk': -0.565313628571904,
 'moviesinthemaking': -0.5
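
A few scores above fall outside [-1, 1] (for example `monarchism`). That follows from the imbalance correction in `score_sub`, as the small worked example below shows. The function `leaning_score` and the toy user sets are hypothetical and simply restate the same arithmetic on data small enough to check by hand.

```python
def leaning_score(sub_members, left_users, right_users):
    """Mean of +1 (right) / -1 (left) over members, minus the sample-wide baseline."""
    raw = sum((u in right_users) - (u in left_users) for u in sub_members) / len(sub_members)
    baseline = (len(right_users) - len(left_users)) / (len(left_users) + len(right_users))
    return raw - baseline


# Hypothetical sample with three left users and one right user: baseline = (1 - 3) / 4 = -0.5.
toy_left = {"a", "b", "c"}
toy_right = {"d"}

print(leaning_score({"d"}, toy_left, toy_right))       # purely right-leaning subreddit
print(leaning_score({"a", "d"}, toy_left, toy_right))  # evenly mixed subreddit
print(leaning_score({"a", "b"}, toy_left, toy_right))  # purely left-leaning subreddit
```

If `left_users` and `right_users` have the sizes printed earlier for the `left` and `right` dicts (22059 and 17999, which is an assumption), the correction is (22059 - 17999) / 40058, or about +0.10, so a subreddit used only by right-leaning users would score about 1.10.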

In [128]:
political_sub_edges = dict()
sources_set = set(left_sources).union(set(right_sources))
sorted_subs = sorted(list(large_subs))

for s in sorted(sources_set):
    if s not in large_subs: continue
    for u in large_subs[s]:
        for s2 in sorted_subs:
            if s != s2 and u in large_subs[s2]:
                count = political_sub_edges.get((s, s2), 0)
                political_sub_edges[(s, s2)] = count + 1

In [103]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(political_sub_edges[e] for e in political_sub_edges)
G = nx.Graph()
for e in political_sub_edges:
    G.add_edge(e[0], e[1], weight=political_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def r2bgradient(score):
    r = math.floor(255 * score)
    b = math.floor(255 * (1 - score))
    color = '#' + '%02x' % r + '00' + '%02x' % b
    return color

def calculate_color(s):
    sub_score = score_sub(s) + 1
    sub_score /= 2
    if sub_score > 1: sub_score = 1
    elif sub_score < 0: sub_score = 0
    return r2bgradient(sub_score)
    
colors = {s: calculate_color(s) for s in G.nodes}
left_sources_set = set(left_sources)
right_sources_set = set(right_sources)
    
max_size = max(len(large_subs[s]) for s in G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in G.nodes}
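
As a quick check of the colour mapping, the sketch below combines the rescaling, clamping, and gradient steps into one hypothetical helper, `score_to_hex`, and evaluates it at a few scores. It mirrors `calculate_color` and `r2bgradient` but takes the score directly instead of a subreddit name.

```python
import math


def score_to_hex(score):
    """Map a leaning score in [-1, 1] to a blue-to-red hex colour, clamping out-of-range values."""
    unit = min(1.0, max(0.0, (score + 1) / 2))  # rescale [-1, 1] to [0, 1]
    red = math.floor(255 * unit)
    blue = math.floor(255 * (1 - unit))
    return '#%02x00%02x' % (red, blue)


print(score_to_hex(-1))     # fully left: pure blue
print(score_to_hex(0))      # neutral: purple
print(score_to_hex(1.064))  # beyond the range: clamped to pure red
```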

In [104]:
from bokeh.io import show, output_file, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, BoxZoomTool, WheelZoomTool, PanTool, TapTool
from bokeh.models import GraphRenderer, StaticLayoutProvider, Oval
from bokeh.palettes import Spectral4
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.layouts import row

In [105]:
X_SPACING = 1000
def x(sub):
    return score_sub(sub) * X_SPACING
    
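# next free vertical slot per column; the *_slot names avoid overwriting the `left`/`right` dicts loaded earlier
left_slot = 0
center_slot = 0
right_slot = 0
def y(sub):
    """Return alternating vertical offsets (0, -2, 2, -4, ...) so nodes in a column do not overlap."""
    global left_slot, center_slot, right_slot
    if sub in left_sources_set:
        result = left_slot
        if left_slot >= 0: left_slot += 2
        left_slot *= -1
    elif sub in right_sources_set:
        result = right_slot
        if right_slot >= 0: right_slot += 2
        right_slot *= -1
    else:
        result = center_slot
        if center_slot >= 0: center_slot += 1
        center_slot *= -1
    return result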

FIGURE_SIZE = 1500
plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
              tools='')

graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))

### start of layout code

hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
hover.show_arrow = False
plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
graph.node_renderer.data_source.data['name'] = list(G.nodes)
graph.node_renderer.data_source.data['score'] = [score_sub(s) for s in G.nodes]

graph_layout = {node: (x(node), y(node) * MAX_SIZE) for node in G.nodes}
graph.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)

graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
graph.edge_renderer.glyph.line_width = {'field': 'line_width'}

graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
graph.selection_policy = NodesAndLinkedEdges()

plot.renderers.append(graph)
reset_output()
output_notebook()
show(plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='29f970c5-6a02-4aad-9f8d-a1d3c25b1e73', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='7c2111be-5764-47d8-9e55-b83105132c84', ...)]


In [106]:
def make_edges(edges, large_subs, large_sub_users, sorted_subs):
    for i in range(len(sorted_subs)):
        s = sorted_subs[i]
        if s not in large_subs: continue
        for u in large_sub_users:
            if u in large_subs[s]:
                for j in range(i + 1, len(sorted_subs)):
                    s2 = sorted_subs[j]
                    if u in large_subs[s2]:
                        count = edges.get((s, s2), 0)
                        edges[(s, s2)] = count + 1

all_sub_edges = dict()
make_edges(all_sub_edges, large_subs, large_sub_users, sorted_subs = sorted(list(large_subs)))
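
The edge weight between two subreddits is the number of users active in both. One alternative way to compute the same counts, sketched below under the assumption that the input is a `{sub: set_of_users}` mapping like `large_subs`, is to group subreddits by user first and then count each pair once per user. The function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_shared_users(sub_members):
    """Count, for each pair of subreddits, how many users are members of both."""
    # Group subreddits by user so each user is visited once.
    user_subs = {}
    for sub, members in sub_members.items():
        for user in members:
            user_subs.setdefault(user, []).append(sub)

    edges = Counter()
    for subs in user_subs.values():
        # Sorting gives every pair a canonical (smaller, larger) key.
        for pair in combinations(sorted(subs), 2):
            edges[pair] += 1
    return edges


# Hypothetical toy membership data.
toy_members = {"news": {"alice", "bob"}, "politics": {"alice"}, "the_donald": {"bob"}}
toy_edges = count_shared_users(toy_members)
print(toy_edges)
```

The work here scales with the number of subreddit pairs per user rather than with users times subreddit pairs, which matters once the 1500-member threshold is lowered and many more subreddits qualify.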

In [107]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(all_sub_edges[e] for e in all_sub_edges)
all_G = nx.Graph()
for e in all_sub_edges:
    all_G.add_edge(e[0], e[1], weight=all_sub_edges[e] / max_weight * MAX_WEIGHT)
colors = {s: calculate_color(s) for s in all_G.nodes}
max_size = max(len(large_subs[s]) for s in all_G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in all_G.nodes}

In [121]:
def plot_plain_network(G, colors, node_sizes, scores):
    FIGURE_SIZE = 2.1
    plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
                  tools='')

    graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))

    ### start of layout code

    hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
    hover.show_arrow = False
    plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
    graph.node_renderer.data_source.data['name'] = list(G.nodes)
    graph.node_renderer.data_source.data['score'] = scores

    graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
    graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
    graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
    graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
    graph.edge_renderer.glyph.line_width = {'field': 'line_width'}
    
    graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
    graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
    graph.selection_policy = NodesAndLinkedEdges()

    plot.renderers.append(graph)
    reset_output()
    output_notebook()
    return plot

original_plot = plot_plain_network(G, colors, node_sizes, [score_sub(s) for s in G.nodes])
show(original_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='a3f980e1-0ded-4977-8178-a92e302acea4', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='0b1d5354-62c2-499c-8e79-8b3265ee51b7', ...)]


In [109]:
with open('reddit data/test_subs.json') as f:
    test = json.load(f)
with open('reddit data/test_users.json') as f:
    test_users = json.load(f)
with open('reddit data/test_preds_nb.json') as f:
    test_preds = json.load(f)

In [110]:
# invert user:subs dict
test_sub_users = dict()
test_subs = set()
for u in test_users:
    test_subs.update(set(test[u]))
for s in test_subs:
    test_sub_users[s] = set()
for u in test:
    for s in test[u]:
        test_sub_users[s].add(u)
print(len(test_sub_users))
test_large_subs = {s:test_sub_users[s] for s in test_sub_users if len(test_sub_users[s]) >= 1500}
print(len(test_large_subs))

16245
45


In [111]:
test_sub_edges = dict()
make_edges(test_sub_edges, test_large_subs, test_users, sorted(list(test_large_subs)))

In [114]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(test_sub_edges[e] for e in test_sub_edges)
test_G = nx.Graph()
for e in test_sub_edges:
    test_G.add_edge(e[0], e[1], weight=test_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def predicted_score(s, users):
    return sum(test_preds[u] for u in users) / len(users)

    
def calculate_predicted_color(s):
    users = test_large_subs[s]
    return r2bgradient(predicted_score(s, users))
    
test_colors = {s: calculate_predicted_color(s) for s in test_G.nodes}
max_size = max(len(test_large_subs[s]) for s in test_G.nodes)
test_node_sizes = {s:len(test_large_subs[s]) / max_size * MAX_SIZE for s in test_G.nodes}

In [122]:
new_plot = plot_plain_network(test_G, test_colors, test_node_sizes,
                              [predicted_score(s, test_sub_users[s]) * 2 - 1 for s in test_G.nodes])
show(new_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='6feb0ab2-8cf8-4a70-a0b4-f98c7c046766', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='515cf29f-3d06-4cdf-a217-e3beaf71955c', ...)]
