# Reddit Political Analysis
## Insight into the Behavior of Politically Active Redditors

## Introduction

### Reddit as a Social Network

This report deals heavily with the nature of Reddit and how users interact with the platform. It is therefore imperative that we first outline these characteristics, as they form the premise of our analysis and later discussion.

Reddit is a social media network centered around forum-based interactions. Lying somewhere between 4chan and Facebook, Reddit shifts focus away from the user and toward content-specific communities, without going so far as to be completely anonymous. These communities tend to be cliquey, with their own vernacular of inside jokes and jargon: for instance, HighQualityGifs tends to have meta-discussion on GIF-making. However, users are not restricted to a single community and can participate in all of them. This leads to the two most important characteristics of Reddit:
    
1. Reddit is essentially composed of subreddits.
2. Users interact on subreddits, and are free to do so on any subreddit.
    
### Political Activism on Reddit

Reddit is highly politically active. The platform lends itself to confrontation and conversation between people of varying backgrounds and political leanings, more so than Facebook and even Twitter, because people aren't restricted to their own friend circles. Differing political factions pour out of their respective subreddits into common spaces across Reddit, influencing the nature of discussion in the mainstream. Moreover, the platform is an important source of news for many of its users. According to a study by the Pew Research Center, although fewer than 1 in 10 American adults use Reddit, more than 7 in 10 of its users rely on it as their primary news source. This makes Reddit an important place to look to see how people consume news in the Age of the Internet.

The epicenter of political activism on the platform lies in its political subreddits. There are subreddits for almost every type of political affiliation, though some of the more extreme ones, e.g. r/nationalsocialism, have been shut down. The most famous of these subreddits is r/The_Donald, with 600,000 subscribers. The_Donald is notorious across the platform for being highly insular, really only hosting content that conforms to what the community wants to hear. This is true for most, if not all, of the political subreddits on both the left and right.

### The Problem

With all this in mind, we wanted to look at just how insular Reddit is, evaluating whether Reddit could reasonably be split into a "Left" and a "Right" Reddit, each running in its own echo chamber. This is of interest because it would give us a better understanding of:

1. Reddit's efficacy as an informative News Source
2. Organic Organization of Political Activist groups
3. The differing structure of "Leftist" Reddit and "Right" Reddit.

## Data Collection

We decided that the appropriate way to approach the problem was to gather user data from politicized subreddits and see how those users spread themselves across the general subreddits. We gathered this data through the Reddit API using PRAW, the Python Reddit API Wrapper.

To gather training data, we created a list of left-leaning and right-leaning subreddits and gathered lists of contributors and the subreddits they contributed to. The idea was that these users could be classified as left-users and right-users respectively. By looking at the subreddits that they contribute to regularly, we could see the number of contributors for each subreddit on the right and left. Because each redditor's subscription information is private, we used each user's top 100 comments of the past month to determine which subreddits they recently contributed to. For the training data, we were able to collect 17,999 Redditors in right-leaning subreddits and 22,059 Redditors in left-leaning subreddits, for a total of 659,005 contributions to unique subreddits.

For test data, we collected 30,771 Redditors who participated in discussion on the top posts of the month across all of Reddit (which, to be sure, does not include any posts from the subreddits in the training data). The idea was that we could classify these users with a classifier trained on the training data, to better understand how the political split relates to mainstream Reddit. The test data comprised 538,602 total contributions to unique subreddits.
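
As a rough illustration of the shape of this data, the sketch below uses a hypothetical miniature of the user-to-subreddit mapping (the usernames and the helper `count_contributions` are made up for this example). It assumes that "contributions to unique subreddits" means one count per distinct (user, subreddit) pair, which is our reading of the figures above rather than something the collection script states explicitly.

```python
# Hypothetical miniature of the collected data: each user maps to the
# subreddits they recently commented in (all names here are illustrative).
left = {
    "user_a": ["politics", "pics", "askreddit"],
    "user_b": ["politics", "news"],
}
right = {
    "user_c": ["the_donald", "guns", "pics"],
}


def count_contributions(user_subs):
    """Total (user, subreddit) pairs, counting each subreddit once per user."""
    return sum(len(set(subs)) for subs in user_subs.values())


n_users = len(left) + len(right)
n_contributions = count_contributions(left) + count_contributions(right)
print(n_users, n_contributions)
```

The group a user was collected from (left or right) later serves as that user's training label.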


In [None]:
import praw, json, sys
from pprint import pprint

reddit = praw.Reddit(client_id=sys.argv[1],
                     client_secret=sys.argv[2],
                     # adjacent string literals avoid embedding the continuation line's indentation
                     user_agent=('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'))
sub = sys.argv[3].lower()
print(sub)

users = set()
usernames = set()
user_subs = dict()

with open('users/' + sub + '.json', 'w') as f:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.top(time_filter='month'): # search 100 top posts of the month
        post.comments.replace_more(limit=None) # all comments on each post
        comments = post.comments.list()
        for comment in comments:
            if comment.author is not None: # deleted accounts have no author
                users.add(comment.author)
                usernames.add(comment.author.name)
                subs = user_subs.get(comment.author.name, set())
                subs.add(sub)
                user_subs[comment.author.name] = subs

    json.dump(list(usernames), f, separators=(',', ':'))

print(len(usernames))
with open('subs/' + sub + '.json', 'w') as f:
    for user in users:
        try:
            for comment in user.comments.top(time_filter='month'): # search 100 top comments
                user_subs[user.name].add(str(comment.subreddit).lower())
        except Exception:
            # skip users whose comment history cannot be fetched
            pass
    for user in user_subs:
        user_subs[user] = list(user_subs[user])
    json.dump(user_subs, f, separators=(',', ':'))
print("done")

## Classification

In [28]:
import json
import numpy as np

In [None]:
with open('reddit data/all_subs.json') as subs_json:
    subreddits = json.load(subs_json)

In [None]:
print(len(subreddits))
print(subreddits[:10])

In [None]:
with open('reddit data/left.json') as left_json:
    left = json.load(left_json)
with open('reddit data/right.json') as right_json:
    right = json.load(right_json)

In [None]:
print("Left User Count: %d, Right User Count: %d" % (len(left), len(right)))

In [None]:
all_data = []
all_users = []

for user, subs in left.items():
    all_data.append(subs)
    all_users.append(user)
    

for user, subs in right.items():
    all_data.append(subs)
    all_users.append(user)

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
converted_data = mlb.fit_transform(all_data)
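
To make the encoding step concrete, here is a plain-Python sketch of a multi-label indicator encoding: one row per user, one column per subreddit, and a 1 wherever the user contributed. The helper `binarize` and the toy data are hypothetical; it only mimics the dense equivalent of what `MultiLabelBinarizer` produces and is not scikit-learn's implementation.

```python
def binarize(rows):
    """Encode lists of labels as 0/1 indicator rows with sorted columns."""
    classes = sorted({sub for subs in rows for sub in subs})
    index = {sub: i for i, sub in enumerate(classes)}
    matrix = []
    for subs in rows:
        row = [0] * len(classes)
        for sub in subs:
            row[index[sub]] = 1
        matrix.append(row)
    return classes, matrix


# Two hypothetical users and the subreddits they contributed to.
toy_data = [["politics", "pics"], ["guns", "pics"]]
classes, matrix = binarize(toy_data)
print(classes)
print(matrix)
```

With tens of thousands of users and subreddits, most entries are zero, which is presumably why the notebook asks for sparse output.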

In [None]:
len_left = len(left)
len_right = len(right)

labels = np.append(np.zeros(len_left), np.ones(len_right))

In [None]:
from sklearn.naive_bayes import BernoulliNB

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV

In [None]:
alpha_range = np.linspace(1, 40)
param_grid = dict(alpha=alpha_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)


def run_grid(param_grid, classifier):
    grid = GridSearchCV(classifier, param_grid=param_grid, cv=cv, verbose=0, n_jobs=-1)
    grid.fit(converted_data, labels)
    return (grid.best_params_, grid.best_score_)
    
best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))

In [None]:
alpha_range = np.linspace(20, 22, 40)
param_grid = dict(alpha=alpha_range)

best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))
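
The `alpha` searched over above is the additive smoothing parameter of Bernoulli naive Bayes. The sketch below shows, with hypothetical counts, how it pulls the estimated probability that a user of a given class contributes to a subreddit toward 0.5. The function name `smoothed_feature_prob` is made up for this illustration; the formula is the standard additive-smoothing estimate for a binary feature.

```python
import math


def smoothed_feature_prob(users_with_sub, users_in_class, alpha):
    """P(subreddit present | class) with additive smoothing on a binary feature."""
    return (users_with_sub + alpha) / (users_in_class + 2 * alpha)


# Hypothetical counts: 3 of 1000 users in a class contribute to a small subreddit.
for alpha in (1, 20, 40):
    p = smoothed_feature_prob(3, 1000, alpha)
    print(alpha, round(p, 4), round(math.log(p), 3))
```

Larger values of `alpha` make rarely seen subreddits less decisive, which may explain why a fairly large value can score well on data with many small subreddits.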

In [None]:
from sklearn import svm

In [None]:
C_range = np.logspace(1, 3, 4)
gamma_range = np.logspace(-6, -2, 5)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

In [None]:
C_range = np.logspace(1.9, 2.1, 4)
gamma_range = np.logspace(-4.1, -3.9, 4)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

In [None]:
nb_clf = BernoulliNB(alpha=best_param_nb['alpha'])

nb_clf.fit(converted_data, labels)
nb_clf.score(converted_data, labels)

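# Refit the SVM with the best parameters found by the grid search
svm_clf = svm.SVC(C=best_param_svm['C'], gamma=best_param_svm['gamma'])
svm_clf.fit(converted_data, labels)
svm_clf.score(converted_data, labels)  # accuracy on the training data


In [None]:
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
joblib.dump(nb_clf, 'nb.pkl')
joblib.dump(svm_clf, 'svm.pkl')
joblib.dump(mlb, 'mlb.pkl')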

In [1]:
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
nb_clf = joblib.load('nb.pkl')
svm_clf = joblib.load('svm.pkl')
mlb = joblib.load('mlb.pkl')

In [None]:
with open('reddit data/test_subs.json') as test_json:
    test = json.load(test_json)
    
test_data = []
test_users = []

for user, subs in test.items():
    test_data.append(subs)
    test_users.append(user)

In [None]:
print(len(test_data))

In [None]:
# keep only subreddits seen in training; a set makes each membership test O(1)
known_subs = set(subreddits)
filtered_data = [[sub for sub in subs if sub in known_subs] for subs in test_data]

converted_test = mlb.transform(filtered_data)

predicted_labels = nb_clf.predict(converted_test)

In [None]:
with open('reddit data/test_preds_nb.json', 'w') as test_preds:
    json.dump(dict(zip(test_users, predicted_labels)), test_preds)

## Results

Now let's take a look at what the data has to say! We used Bokeh and NetworkX to generate a series of network graphs showing the relationships between the subreddits and how contributors flow between them.

In [15]:
import json, math
import matplotlib.pyplot as plt
import networkx as nx
from pprint import pprint

In [16]:
with open('reddit data/left.json') as f:
    left = json.load(f)
with open('reddit data/right.json') as f:
    right = json.load(f)
with open('reddit data/left_subs.json') as f:
    left_subs = set(json.load(f))
with open('reddit data/right_subs.json') as f:
    right_subs = set(json.load(f))
with open('reddit data/left_users.json') as f:
    left_users = set(json.load(f))
with open('reddit data/right_users.json') as f:
    right_users = set(json.load(f))
with open('reddit data/left_sources.txt') as f:
    left_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/right_sources.txt') as f:
    right_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/all_subs.json') as f:
    all_subs = json.load(f)

In [17]:
# invert user:subs dict
sub_users = dict()
for s in all_subs:
    sub_users[s] = set()
for u in left:
    for s in left[u]:
        sub_users[s].add(u)
for u in right:
    for s in right[u]:
        sub_users[s].add(u)

In [18]:
# filter out small subs
counts = {s:len(sub_users[s]) for s in sub_users}
large_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 1500}
print(len(large_subs))
large_sub_users = set()
for s in large_subs:
    for u in large_subs[s]:
        large_sub_users.add(u)
print(len(large_sub_users))

most_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 50}
print(len(most_subs))

59
38795
1483


In [19]:
from bokeh.io import show, output_file, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, BoxZoomTool, WheelZoomTool, PanTool, TapTool
from bokeh.models import GraphRenderer, StaticLayoutProvider, Oval, Span
from bokeh.palettes import Spectral4
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.layouts import row

In [20]:
def score_sub(sub):
    users = sub_users[sub]
    score = 0
    for u in users:
        if u in left_users: score -= 1
        if u in right_users: score += 1
    sub_score = score / len(users)
    # offset for the imbalance between the number of left and right users
    sub_score += (len(left_users) - len(right_users)) / (len(left_users) + len(right_users))
    if sub_score > 1: sub_score = 1
    elif sub_score < -1: sub_score = -1
    return sub_score

scored_subs = {s: score_sub(s) for s in most_subs}
sorted_scored_subs = list(reversed(sorted(scored_subs)))
scores = [scored_subs[s] for s in sorted_scored_subs]

X_RANGE = 1  # score_sub clamps scores to [-1, 1]
scatter = figure(x_range=(-X_RANGE, X_RANGE), y_range=sorted_scored_subs)
zero = Span(location=0, dimension='height', line_color='black',
            line_width=3)
scatter.add_layout(zero)
scatter.circle(scores, sorted_scored_subs, size=5, fill_color="green", line_color="black", line_width=3)
reset_output()
output_notebook()
show(scatter)

Many of these classifications come as no surprise: for instance, gun subreddits like r/glock and r/guns are very right-leaning. Others are less intuitive: the majority of video-game subreddits lean right even though their content is not political. Generally, however, the subreddits in the test data lean left.
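
The score behind this plot can be read as the net share of right-labelled over left-labelled users in a subreddit, shifted to compensate for there being more left users overall, and clamped to [-1, 1]. The sketch below restates that calculation with explicit counts. The function `lean_score` and the per-subreddit counts are hypothetical, and the class sizes are taken from the training-data figures reported earlier, which may not match the contents of the user files exactly.

```python
def lean_score(n_left, n_right, n_total, total_left, total_right):
    """Net right-minus-left share of a subreddit's users, recentred for class imbalance."""
    raw = (n_right - n_left) / n_total
    offset = (total_left - total_right) / (total_left + total_right)
    return max(-1.0, min(1.0, raw + offset))


# Class sizes from the training-data counts; per-subreddit counts are made up.
TOTAL_LEFT, TOTAL_RIGHT = 22059, 17999

# A subreddit whose users mirror the overall left/right mix scores close to 0.
balanced = lean_score(551, 449, 1000, TOTAL_LEFT, TOTAL_RIGHT)
# A subreddit dominated by right-labelled users scores close to +1.
right_heavy = lean_score(100, 900, 1000, TOTAL_LEFT, TOTAL_RIGHT)
print(round(balanced, 3), round(right_heavy, 3))
```

Without the offset, a subreddit that simply mirrors the overall user mix would appear left-leaning only because more left users were collected.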

In [21]:
political_sub_edges = dict()
sources_set = set(left_sources).union(set(right_sources))
sorted_subs = sorted(list(large_subs))

for s in sorted(sources_set):
    if s not in large_subs: continue
    for u in large_subs[s]:
        for s2 in sorted_subs:
            if s != s2 and u in large_subs[s2]:
                count = political_sub_edges.get((s, s2), 0)
                political_sub_edges[(s, s2)] = count + 1

In [22]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(political_sub_edges[e] for e in political_sub_edges)
G = nx.Graph()
for e in political_sub_edges:
    G.add_edge(e[0], e[1], weight=political_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def r2bgradient(score):
    r = math.floor(255 * score)
    b = math.floor(255 * (1 - score))
    color = '#' + '%02x' % r + '00' + '%02x' % b
    return color

def calculate_color(s):
    sub_score = score_sub(s) + 1
    sub_score /= 2
    if sub_score > 1: sub_score = 1
    elif sub_score < 0: sub_score = 0
    return r2bgradient(sub_score)
    
colors = {s: calculate_color(s) for s in G.nodes}
left_sources_set = set(left_sources)
right_sources_set = set(right_sources)
    
max_size = max(len(large_subs[s]) for s in G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in G.nodes}

In [23]:
X_SPACING = 1000
def x(sub):
    return score_sub(sub) * X_SPACING
    
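# Row counters per column; named *_row so they do not overwrite the left/right user dicts
left_row = 0
center_row = 0
right_row = 0
def y(sub):
    # place successive nodes alternately below and above the x-axis
    global left_row, center_row, right_row
    if sub in left_sources_set:
        result = left_row
        if left_row >= 0: left_row += 2
        left_row *= -1
    elif sub in right_sources_set:
        result = right_row
        if right_row >= 0: right_row += 2
        right_row *= -1
    else:
        result = center_row
        if center_row >= 0: center_row += 1
        center_row *= -1
    return result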

FIGURE_SIZE = 1500
plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
              tools='')

graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))


hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
hover.show_arrow = False
plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
graph.node_renderer.data_source.data['name'] = list(G.nodes)
graph.node_renderer.data_source.data['score'] = [score_sub(s) for s in G.nodes]

graph_layout = {node: (x(node), y(node) * MAX_SIZE) for node in G.nodes}
graph.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)

graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
graph.edge_renderer.glyph.line_width = {'field': 'line_width'}

graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
graph.selection_policy = NodesAndLinkedEdges()

plot.renderers.append(graph)
reset_output()
output_notebook()
show(plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='52f5aae3-0486-4054-b118-85949c3cfc6d', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='ff229889-ef49-4455-9827-0a1cef546d46', ...)]


Again, it is clear that the majority of subreddits, especially the larger and more popular ones, tend to lean slightly to the left. A few smaller subreddits veer significantly to the right, indicating that right-leaning redditors seem to form their own spaces away from the liberal Reddit community.

In [24]:
def make_edges(edges, large_subs, large_sub_users, sorted_subs):
    for i in range(len(sorted_subs)):
        s = sorted_subs[i]
        if s not in large_subs: continue
        for u in large_sub_users:
            if u in large_subs[s]:
                for j in range(i + 1, len(sorted_subs)):
                    s2 = sorted_subs[j]
                    if u in large_subs[s2]:
                        count = edges.get((s, s2), 0)
                        edges[(s, s2)] = count + 1

all_sub_edges = dict()
make_edges(all_sub_edges, large_subs, large_sub_users, sorted_subs = sorted(list(large_subs)))

In [25]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(all_sub_edges[e] for e in all_sub_edges)
all_G = nx.Graph()
for e in all_sub_edges:
    all_G.add_edge(e[0], e[1], weight=all_sub_edges[e] / max_weight * MAX_WEIGHT)
colors = {s: calculate_color(s) for s in all_G.nodes}
max_size = max(len(large_subs[s]) for s in all_G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in all_G.nodes}

In [26]:
def plot_network(G, colors, node_sizes, scores):
    FIGURE_SIZE = 2.1
    plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
                  tools='')

    graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))


    hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
    hover.show_arrow = False
    plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
    graph.node_renderer.data_source.data['name'] = list(G.nodes)
    graph.node_renderer.data_source.data['score'] = scores

    graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
    graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
    graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
    graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
    graph.edge_renderer.glyph.line_width = {'field': 'line_width'}
    
    graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
    graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
    graph.selection_policy = NodesAndLinkedEdges()

    plot.renderers.append(graph)
    reset_output()
    output_notebook()
    return plot

original_plot = plot_network(all_G, colors, node_sizes, [score_sub(s) for s in all_G.nodes])
show(original_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='1d74113e-08d4-4eb2-b6e9-59e56d506aba', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='8214c5c8-d90a-40df-aaf7-bdf978a880ac', ...)]


This graph shows the network of subreddits and how they are intertwined. The graph is highly connected, which indicates that redditors interact fluidly across all of these subreddits.
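
The edge weights in this network count how many users are active in both subreddits of a pair. As a sketch, the same counts can be built from the user-to-subreddits mapping with `itertools.combinations`. The function `co_membership_edges` and the toy data are hypothetical, and this is an alternative formulation rather than the notebook's own `make_edges`.

```python
from collections import Counter
from itertools import combinations


def co_membership_edges(user_subs, keep):
    """Count, for each pair of kept subreddits, the users active in both."""
    edges = Counter()
    for subs in user_subs.values():
        # sorting gives every pair a canonical (smaller, larger) key
        shared = sorted(set(subs) & keep)
        edges.update(combinations(shared, 2))
    return edges


toy = {
    "user_a": ["pics", "politics", "news"],
    "user_b": ["pics", "politics"],
    "user_c": ["pics", "tinysub"],
}
edges = co_membership_edges(toy, keep={"pics", "politics", "news"})
print(edges)
```

Iterating once per user keeps the work proportional to the number of subreddit pairs each user actually shares, instead of testing every user against every pair.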

In [29]:
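# feature_log_prob_ has shape (n_classes, n_features); transpose to one row per subreddit
probs = np.exp(nb_clf.feature_log_prob_.T)
# mlb.classes_ holds the subreddit names in the classifier's column order
prob_lookup = dict(zip(mlb.classes_, probs))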

In [30]:
def prob_score(sub):
    sub_probs = prob_lookup[sub]
    score = sub_probs[1] / sum(sub_probs)
    return score

def prob_color(sub):
    return r2bgradient(prob_score(sub))

prob_colors = {s: prob_color(s) for s in all_G.nodes}
prob_plot = plot_network(all_G, prob_colors, node_sizes, [prob_score(s) * 2 - 1 for s in all_G.nodes])
show(prob_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='0bc03ff5-a1f0-4f7b-962a-d4e9d831d321', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='fdc3e8fb-d486-489e-9fa9-868b353a6b3b', ...)]
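
The node colors in the plot above come from normalizing the two per-class feature probabilities of the naive Bayes model into a single "right" share between 0 and 1. A minimal sketch with hypothetical log-probabilities follows; the function name `right_share` is made up for this example. Note that this ratio does not include the class priors, so it describes how characteristic a subreddit is of each class rather than a full posterior.

```python
import math


def right_share(log_p_left, log_p_right):
    """Normalize two per-class feature probabilities into a 0..1 'right' share."""
    p_left, p_right = math.exp(log_p_left), math.exp(log_p_right)
    return p_right / (p_left + p_right)


# Hypothetical log-probabilities for one subreddit under each class.
share = right_share(math.log(0.02), math.log(0.06))
# Rescale to the -1..1 range used for the hover score.
print(round(share, 2), round(share * 2 - 1, 2))
```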


Here we can see how a similar network graph looks for the test data. As a point of reference, look at r/pics: in the graph of the training data it was very small, as not many political users interacted with it, but it is now very large, reflecting its actual size as one of the largest subreddits. Notice again that the majority of subreddits are left-leaning, with a select few being highly conservative.

In [31]:
with open('reddit data/test_subs.json') as f:
    test = json.load(f)
with open('reddit data/test_users.json') as f:
    test_users = json.load(f)
with open('reddit data/test_preds_nb.json') as f:
    test_preds = json.load(f)

In [32]:
# invert user:subs dict
test_sub_users = dict()
test_subs = set()
for u in test_users:
    test_subs.update(set(test[u]))
for s in test_subs:
    test_sub_users[s] = set()
for u in test:
    for s in test[u]:
        test_sub_users[s].add(u)
print(len(test_sub_users))
test_large_subs = {s:test_sub_users[s] for s in test_sub_users if len(test_sub_users[s]) >= 1500}
print(len(test_large_subs))

16245
45


In [33]:
test_sub_edges = dict()
make_edges(test_sub_edges, test_large_subs, test_users, sorted(list(test_large_subs)))

In [34]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(test_sub_edges[e] for e in test_sub_edges)
test_G = nx.Graph()
for e in test_sub_edges:
    test_G.add_edge(e[0], e[1], weight=test_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def predicted_score(s, users):
    return sum(test_preds[u] for u in users) / len(users)

    
def calculate_predicted_color(s):
    users = test_large_subs[s]
    return r2bgradient(predicted_score(s, users))
    
test_colors = {s: calculate_predicted_color(s) for s in test_G.nodes}
max_size = max(len(test_large_subs[s]) for s in test_G.nodes)
test_node_sizes = {s:len(test_large_subs[s]) / max_size * MAX_SIZE for s in test_G.nodes}

In [35]:
new_plot = plot_network(test_G, test_colors, test_node_sizes,
                              [predicted_score(s, test_sub_users[s]) * 2 - 1 for s in test_G.nodes])
show(new_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='f9dcb89e-6090-40fd-be4a-a68b00afb8fa', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='08b7f8c6-0d5a-430f-8ef1-fe41570ba1c1', ...)]


This graph has the same form, but covers the main subreddits and is colored using the classifier's predictions. It is clear from the visualization that the major subreddits on the platform are largely liberal. This seems to confirm our hypothesis that Reddit is highly insular, and that people tend to stay in their respective communities.
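
For reference, the coloring in this last graph boils down to two steps: take the fraction of a subreddit's users that the classifier labelled right-leaning, then map that fraction onto a blue-to-red gradient. The sketch below reproduces those steps on hypothetical predictions; the names `predicted_share` and `red_blue_hex` are made up for this illustration.

```python
import math


def predicted_share(users, preds):
    """Fraction of a subreddit's users predicted as right-leaning (label 1)."""
    return sum(preds[u] for u in users) / len(users)


def red_blue_hex(share):
    """Map 0 (blue, left) .. 1 (red, right) to a hex color."""
    r = math.floor(255 * share)
    b = math.floor(255 * (1 - share))
    return "#%02x00%02x" % (r, b)


# Hypothetical predictions: one of four users is labelled right-leaning.
preds = {"user_a": 0.0, "user_b": 0.0, "user_c": 1.0, "user_d": 0.0}
share = predicted_share(["user_a", "user_b", "user_c", "user_d"], preds)
print(share, red_blue_hex(share))
```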