# Reddit Political Analysis
## Insight into the Behavior of Politically Active Redditors

## Introduction

### Reddit as a Social Network

This report deals heavily with the nature of Reddit and how users interact with the platform. We therefore begin by outlining these characteristics, since they form the premise of our analysis and later discussion.

Reddit is a social media network centered around forum-based interactions. Lying somewhere between 4Chan and Facebook, Reddit shifts focus away from the user and toward content-specific communities, without going so far as to be completely anonymous. These communities tend to be cliquey, with their own vernacular of inside jokes and jargon: for instance, HighQualityGifs tends to have meta-discussion on GIF-making. However, users are not restricted to a single community and can participate in all of them. This leads to the two most important characteristics of Reddit:
    
1. Reddit is essentially comprised of Subreddits
2. Users interact on Subreddits, and are free to do so on any subreddit.
    
### Political Activism on Reddit

Reddit is highly politically active. The platform lends itself to confrontation and conversation between people of varying backgrounds and political leanings, more so than Facebook and even Twitter, because people aren't restricted to their own friend circles. Differing political factions pour out of their respective subreddits into common spaces across Reddit, influencing the nature of discussion in the mainstream. Moreover, the platform is an important source of news for many of its users. According to a study by the Pew Research Center, although fewer than 1 in 10 American adults use Reddit, more than 7 in 10 of its users rely on it as their primary news source. This makes Reddit an important place to look to see how people consume news in the Age of the Internet.

The epicenter of political activism on the platform lies in its political subreddits. There are subreddits for almost every type of political affiliation, though some more extreme ideologies have had their subreddits shut down, e.g. r/nationalsocialism. The most famous of these subreddits is r/The_Donald, with 600,000 subscribers. The_Donald is notorious across the platform for being highly insular, hosting only content that conforms to what the community wants to hear. This is true for most, if not all, of the subreddits on both the left and the right.

### The Problem: 

With all this in mind, we wanted to look at just how insular Reddit is, evaluating whether Reddit could reasonably be split into a "Left" and a "Right" Reddit, each running in its own echo chamber. This is a point of interest because it would give us a better understanding of:

1. Reddit's efficacy as an informative News Source
2. Organic Organization of Political Activist groups
3. The differing structure of "Leftist" Reddit and "Right" Reddit.

### Previous Research:

A lot has been said about Reddit as a hotbed for rabid political discussion, especially with respect to The_Donald. Even the Trump campaign itself understood its value, looking at the subreddit for insight into whether his messages hit home. FiveThirtyEight published a wonderful report digging deep into The_Donald, its users, and its effect and place in the Trump phenomenon: https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/.
This shows how important Reddit can be as a political platform. A specific example of this power at play can be seen in Quartz's report on the origins of the "alt-left" and how conservative redditors created the term artificially: https://qz.com/1083444/analysis-of-500-million-reddit-comments-shows-how-the-alt-right-made-the-alt-left-a-thing/. These articles do a fantastic job of analyzing how conservative redditors behave in their own circles; we would like to extend this and look at whether they come out of their circles and interact with mainstream Reddit frequently.

## Data Collection

We decided that the appropriate way to approach the problem was to gather user data from politicized subreddits and see how those users spread themselves across the general subreddits. We gathered this data using PRAW, the Python Reddit API Wrapper, to query the Reddit API.

To gather training data, we created a list of left-leaning and right-leaning subreddits and gathered lists of their contributors and the subreddits those users contributed to. The idea was that these users could be classified as left-users and right-users respectively. By looking at the subreddits that they contribute to regularly, we could see the number of left and right contributors for each subreddit. Because each redditor's subscription information is private, we used each user's top 100 comments of the past month to determine which subreddits they recently contributed to. For the training data, we were able to collect 17,999 Redditors in right-leaning subreddits and 22,059 Redditors in left-leaning subreddits, for a total of 659,005 contributions to unique subreddits.

For test data, we collected 30,771 Redditors who participated in discussion on the top posts of the month across all of Reddit (which, to be sure, does not include any posts from the subreddits in the training data). The idea here was that we would be able to classify these users using a classifier trained on the training data, to get a better understanding of how this relates to mainstream Reddit. The test data ended up comprising 538,602 total contributions to unique subreddits.


In [None]:
import praw, json, sys
from pprint import pprint

reddit = praw.Reddit(client_id=sys.argv[1],
                     client_secret=sys.argv[2],
                     user_agent=('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'))
sub = sys.argv[3].lower()
print(sub)

users = set()
usernames = set()
user_subs = dict()

with open('users/' + sub + '.json', 'w') as f:
    subreddit = reddit.subreddit(sub)
    for post in subreddit.top(time_filter='month'):  # top posts of the month (PRAW's default limit is 100)
        post.comments.replace_more(limit=None)  # expand every "load more comments" stub
        comments = post.comments.list()
        for comment in comments:
            if comment.author is not None:  # deleted accounts have no author
                users.add(comment.author)
                usernames.add(comment.author.name)
                subs = user_subs.get(comment.author.name, set())
                subs.add(sub)
                user_subs[comment.author.name] = subs

    json.dump(list(usernames), f, separators=(',', ':'))

print(len(usernames))
with open('subs/' + sub + '.json', 'w') as f:
    for user in list(users):
        try:
            for comment in user.comments.top(time_filter='month'):  # top comments of the month (default limit 100)
                subs = user_subs[user.name]
                subs.add(str(comment.subreddit).lower())
                user_subs[user.name] = subs
        except Exception:  # skip users whose comment history cannot be fetched
            pass
    for user in user_subs:
        user_subs[user] = list(user_subs[user])
    json.dump(user_subs, f, separators=(',', ':'))
print("done")
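
To make the output format concrete, the sketch below builds a tiny user-to-subreddits mapping of the kind the script writes to the `subs/` files and counts contributions from it. The usernames and subreddits are invented, and it assumes that one "contribution" means one distinct (user, subreddit) pair.

```python
import json

# Hypothetical sample in the same shape as the collected files:
# each username maps to the list of subreddits that user commented in.
user_subs = {
    "alice": ["politics", "pics", "askreddit"],
    "bob": ["the_donald", "guns"],
    "carol": ["politics"],
}

# Assumption: one "contribution" is one distinct (user, subreddit) pair.
total_contributions = sum(len(set(subs)) for subs in user_subs.values())
unique_subreddits = {s for subs in user_subs.values() for s in subs}

print(len(user_subs), "users")
print(total_contributions, "contributions")
print(len(unique_subreddits), "unique subreddits")

# The script serializes the mapping compactly, without spaces.
serialized = json.dumps(user_subs, separators=(",", ":"))
```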

##  Classification

We need to import our data, which is currently stored in `json` files. Thus, we import `json`, along with `numpy` for later.

In [28]:
import json
import numpy as np

In [None]:
with open('reddit data/all_subs.json') as subs_json:
    subreddits = json.load(subs_json)

We can see how many unique subreddits we have collected, as well as a few of the names:

In [None]:
print(len(subreddits))
print(subreddits[:10])

We now load in our data, which consists of dictionaries of users, with a list of the subreddits they contribute to:

In [73]:
with open('reddit data/left.json') as left_json:
    left = json.load(left_json)
with open('reddit data/right.json') as right_json:
    right = json.load(right_json)

We can see how many left and right redditors we have collected:

In [74]:
print("Left User Count: %d, Right User Count: %d" % (len(left), len(right)))

Left User Count: 22059, Right User Count: 17999


We need to merge our data into one large array, so that we can train over it. We also collect the list of users for later.

In [75]:
all_data = []
all_users = []

for user, subs in left.items():
    all_data.append(subs)
    all_users.append(user)
    

for user, subs in right.items():
    all_data.append(subs)
    all_users.append(user)

Now, we need to turn our data into a form that can be trained over. It is currently a list of lists of subreddit names. Using Scikit-Learn, we construct a binary vector for each user, which indicates which subreddits they participate in.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
converted_data = mlb.fit_transform(all_data)
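
As a rough illustration of what the binarizer produces, the sketch below builds the same kind of multi-hot encoding by hand for three invented users. It is a plain-Python stand-in, not scikit-learn's implementation; like `MultiLabelBinarizer`, it orders the columns by sorted label name.

```python
# Invented example data: each inner list is one user's subreddits.
toy_data = [
    ["politics", "pics"],
    ["guns", "pics"],
    ["politics"],
]

# One column per distinct subreddit, in sorted order.
columns = sorted({sub for subs in toy_data for sub in subs})
column_index = {sub: i for i, sub in enumerate(columns)}

def to_multi_hot(subs):
    """Return a 0/1 row with a 1 in each column the user contributed to."""
    row = [0] * len(columns)
    for sub in subs:
        row[column_index[sub]] = 1
    return row

toy_matrix = [to_multi_hot(subs) for subs in toy_data]
print(columns)
for row in toy_matrix:
    print(row)
```

With roughly 18,000 distinct subreddits and at most a few dozen per user, almost every entry is zero, which is why the real cell asks for sparse output.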

We also construct our labels vector. Since we appended left users, then right users, to `all_data`, we simply construct a vector of zeros followed by ones, with lengths matching the number of users in each class.

In [None]:
len_left = len(left)
len_right = len(right)

labels = np.append(np.zeros(len_left), np.ones(len_right))

We now need to work on classifying our users. We will test both Naive Bayes and SVM. To train and determine hyperparameters, we will use Scikit-Learn's `GridSearchCV`. To evaluate the classifiers, we will use repeated stratified train/test splits (`StratifiedShuffleSplit`) rather than a single split.

In [None]:
from sklearn.naive_bayes import BernoulliNB

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV

For Naive Bayes, we only need to find the single parameter alpha. We start by testing values between 1 and 40:

In [None]:
alpha_range = np.linspace(1, 40)
param_grid = dict(alpha=alpha_range)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)


def run_grid(param_grid, classifier):
    grid = GridSearchCV(classifier, param_grid=param_grid, cv=cv, verbose=0, n_jobs=-1)
    grid.fit(converted_data, labels)
    return (grid.best_params_, grid.best_score_)
    
best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))

Since the above found a best value of 21, we will look around that value with more granularity:

In [None]:
alpha_range = np.linspace(20, 22, 40)
param_grid = dict(alpha=alpha_range)

best_param_nb, best_score_nb = run_grid(param_grid, BernoulliNB())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_nb, best_score_nb))

We now have a pretty good value for alpha, which has not improved accuracy much over the previous grid search.
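
For intuition about what alpha controls: in Bernoulli Naive Bayes it is the additive smoothing applied to each per-class feature probability. The sketch below uses the standard smoothed estimate, (count + alpha) / (n + 2 * alpha), with made-up counts, to show how a large alpha such as 21 pulls the estimate for a subreddit with few observations toward 0.5. The function name is hypothetical.

```python
def smoothed_prob(count, n_users, alpha):
    """Smoothed estimate of P(user contributes to a subreddit | class)."""
    return (count + alpha) / (n_users + 2 * alpha)

# Invented counts: 3 of 10 users in a class contribute to some subreddit.
for alpha in (0.0, 1.0, 21.0):
    print(alpha, round(smoothed_prob(3, 10, alpha), 4))
```

A large alpha therefore mostly affects small subreddits; for subreddits with thousands of contributors the counts dominate and the estimate barely moves.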


We will also begin searching for values for SVM. Here we need to tune two parameters: C, the penalty applied to misclassified examples, and gamma, the coefficient of the RBF kernel (the default kernel of `svm.SVC`).
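
To see what gamma does on data like ours, note that for 0/1 vectors the squared Euclidean distance between two users is just the number of subreddits that exactly one of them contributes to. The sketch below evaluates the RBF kernel, exp(-gamma * distance^2), for two invented users. It assumes the default RBF kernel of `svm.SVC`; the function name is hypothetical.

```python
import math

def rbf_similarity(subs_a, subs_b, gamma):
    """RBF kernel between two users represented as sets of subreddits."""
    squared_distance = len(set(subs_a) ^ set(subs_b))  # symmetric difference
    return math.exp(-gamma * squared_distance)

user_a = {"politics", "pics", "askreddit"}
user_b = {"guns", "pics", "askreddit", "movies"}

for gamma in (1e-4, 1e-2, 1.0):
    print(gamma, rbf_similarity(user_a, user_b, gamma))
```

With gamma near 1e-4, even users with quite different subreddit lists look nearly identical to the kernel, so the model relies on a large C and many support vectors to separate the classes.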

In [None]:
from sklearn import svm

In [1]:
C_range = np.logspace(1, 3, 4)
gamma_range = np.logspace(-6, -2, 5)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

The best parameters are {'C': 100.0, 'gamma': 0.0001} with a score of 0.98


As we already achieved very good accuracy, we will only do one more search for parameters:

In [2]:
C_range = np.logspace(1.9, 2.1, 4)
gamma_range = np.logspace(-4.1, -3.9, 4)
param_grid = dict(gamma=gamma_range, C=C_range)

best_param_svm, best_score_svm = run_grid(param_grid, svm.SVC())

print("The best parameters are %s with a score of %0.2f"
      % (best_param_svm, best_score_svm))

The best parameters are {'C': 125.89254117941675, 'gamma': 0.00012589254117941674} with a score of 0.98


With our best found parameters, we will construct our classifiers and store them in files, so that they can be saved and loaded later:

In [None]:
nb_clf = BernoulliNB(alpha=best_param_nb['alpha'])

nb_clf.fit(converted_data, labels)
nb_clf.score(converted_data, labels)

svm_clf = svm.SVC(C=best_param_svm['C'], gamma=best_param_svm['gamma'])
svm_clf.fit(converted_data, labels)
svm_clf.score(converted_data, labels)


In [None]:
# on scikit-learn >= 0.23 this module was removed; use "import joblib" instead
from sklearn.externals import joblib
joblib.dump(nb_clf, 'nb.pkl')
joblib.dump(svm_clf, 'svm.pkl')
joblib.dump(mlb, 'mlb.pkl')

Here, we load our classifiers, so that we can use them to attempt to classify the new test users:

In [1]:
from sklearn.externals import joblib
nb_clf = joblib.load('nb.pkl')
svm_clf = joblib.load('svm.pkl')
mlb = joblib.load('mlb.pkl')

We can load in our test data:

In [None]:
with open('reddit data/test_subs.json') as test_json:
    test = json.load(test_json)
    
test_data = []
test_users = []

for user, subs in test.items():
    test_data.append(subs)
    test_users.append(user)

We have just over 30,000 test users:

In [None]:
print(len(test_data))

In order to predict classes for our users, we must first filter out any subreddits we have not seen before, as our classifier only works for the subreddits (features) it was trained on. Then, we use our previously fitted binarizer to convert the examples to a form that can be predicted over:

In [None]:
known_subs = set(subreddits)  # set membership is much faster than scanning the list
filtered_data = [[sub for sub in subs if sub in known_subs] for subs in test_data]

converted_test = mlb.transform(filtered_data)

predicted_labels = nb_clf.predict(converted_test)
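
The filtering step can be seen on a toy example. The names below are invented; the point is only that subreddits missing from the training vocabulary are dropped before encoding, and that a user can end up with an empty list (an all-zero feature vector).

```python
# Invented vocabulary of subreddits seen during training.
known_subs = {"politics", "pics", "guns"}

# Invented test users; "knitting" and "chess" were never seen in training.
toy_test = [
    ["pics", "knitting"],
    ["chess"],
    ["guns", "politics"],
]

filtered = [[sub for sub in subs if sub in known_subs] for subs in toy_test]
print(filtered)
```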

Finally, we save our predicted classes back to `json`, to be used later:

In [None]:
with open('reddit data/test_preds_nb.json', 'w') as test_preds:
    json.dump(dict(zip(test_users, predicted_labels)), test_preds)

## Analysis

Now let's take a look at what the data has to say! We used Bokeh and NetworkX to generate a series of network graphs showing the relationships between the subreddits and how contributors flow between them.

### Training Data Analysis

In [15]:
import json, math
import matplotlib.pyplot as plt
import networkx as nx
from pprint import pprint

We start by loading the data we collected earlier.

In [51]:
with open('reddit data/left.json') as f:
    left = json.load(f)
with open('reddit data/right.json') as f:
    right = json.load(f)
with open('reddit data/left_subs.json') as f:
    left_subs = set(json.load(f))
with open('reddit data/right_subs.json') as f:
    right_subs = set(json.load(f))
with open('reddit data/left_users.json') as f:
    left_users = set(json.load(f))
with open('reddit data/right_users.json') as f:
    right_users = set(json.load(f))
with open('reddit data/left_sources.txt') as f:
    left_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/right_sources.txt') as f:
    right_sources = set([l.strip().lower() for l in f.readlines() if l.strip() != ''])
with open('reddit data/all_subs.json') as f:
    all_subs = json.load(f)

This gives us dictionaries mapping users to the subreddits they contribute to, but since we want to analyze relationships between subreddits, we'll invert the dictionaries to map subreddits to users.

In [52]:
sub_users = dict()
for s in all_subs:
    sub_users[s] = set()
for u in left:
    for s in left[u]:
        sub_users[s].add(u)
for u in right:
    for s in right[u]:
        sub_users[s].add(u)
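
The same inversion on a toy mapping, using `collections.defaultdict` so that subreddits do not need to be pre-registered. The data is invented and this is only a sketch of the idea, not a replacement for the cell above.

```python
from collections import defaultdict

# Invented user -> subreddits mapping.
toy_users = {
    "alice": ["politics", "pics"],
    "bob": ["pics"],
}

# Invert it to subreddit -> set of users.
toy_sub_users = defaultdict(set)
for user, subs in toy_users.items():
    for sub in subs:
        toy_sub_users[sub].add(user)

print(dict(toy_sub_users))
```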

Next we filter out small subreddits. Our initial list had over 18000 unique subreddits, but many of them had few contributors so we excluded them from our analysis. 

In [53]:
# filter out small subs
counts = {s:len(sub_users[s]) for s in sub_users}
large_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 1500}
print(len(large_subs))
large_sub_users = set()
for s in large_subs:
    for u in large_subs[s]:
        large_sub_users.add(u)
most_subs = {s:sub_users[s] for s in sub_users if len(sub_users[s]) >= 50}
print(len(most_subs))

59
1483


We created two collections of subreddit data: one with 59 entries for only the largest subreddits (>= 1500 contributors) and one with 1483 entries for subreddits with at least 50 contributors. We use the smaller dataset to generate a network graph showing connections between subreddits and the larger dataset to analyze the political leanings of subreddits in general.

In [54]:
from bokeh.io import show, output_file, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, BoxZoomTool, WheelZoomTool, PanTool, TapTool
from bokeh.models import GraphRenderer, StaticLayoutProvider, Oval, Span
from bokeh.palettes import Spectral4
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.layouts import row

First we take a look at the political leanings of the subreddits in the larger set. To do this, we calculate a score for each subreddit by averaging the political leanings of its contributors, with values ranging from -1 for left-leaning subreddits to 1 for right-leaning subs.

In [55]:
def score_sub(sub):
    """
    Calculates a political leaning score for a sub by averaging its
    contributors' political leanings
    """
    users = sub_users[sub]
    score = 0
    for u in users:
        if u in left_users: score -= 1
        if u in right_users: score += 1
    sub_score = score / len(users)
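    # subtract the score expected purely from the left/right imbalance of the training users
    baseline = (len(right_users) - len(left_users)) / (len(left_users) + len(right_users))
    sub_score -= baseline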
    if sub_score > 1: sub_score = 1
    elif sub_score < -1: sub_score = -1
    return sub_score

scored_subs = {s: score_sub(s) for s in most_subs}
sorted_scored_subs = list(reversed(sorted(scored_subs)))
scores = [scored_subs[s] for s in sorted_scored_subs]

# create a huge scatter plot with all subs and their scores
X_RANGE = 50
scatter = figure(x_range=(-X_RANGE, X_RANGE), y_range=sorted_scored_subs)
zero = Span(location=0, dimension='height', line_color='black',
            line_width=3)
scatter.add_layout(zero)
scatter.circle(scores, sorted_scored_subs, size=5, fill_color="green", line_color="black", line_width=3)
reset_output()
output_notebook()
show(scatter)
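
A worked example of the score may help. The sketch below repeats the arithmetic of `score_sub` for one invented subreddit, using the training totals reported earlier (22059 left users, 17999 right users). It assumes every contributor carries exactly one label; the function name is hypothetical.

```python
def corrected_score(n_left, n_right, total_left, total_right):
    """Mean leaning of a subreddit minus the data set's overall leaning."""
    raw = (n_right - n_left) / (n_left + n_right)
    baseline = (total_right - total_left) / (total_left + total_right)
    return max(-1.0, min(1.0, raw - baseline))

# Invented subreddit with 30 left-labeled and 10 right-labeled contributors.
score = corrected_score(30, 10, total_left=22059, total_right=17999)
print(round(score, 4))
```

Because the training set contains more left users than right users, the correction shifts every score slightly toward the right, so a subreddit whose contributors mirror the overall mix lands at zero.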

Now we'll generate a network graph for the largest subreddits to see the connections between political subreddits and the most popular subreddits.

In [56]:
# create graph edges for political subs and their users
political_sub_edges = dict()
sources_set = set(left_sources).union(set(right_sources))
sorted_subs = sorted(list(large_subs))

for s in sorted(sources_set):
    if s not in large_subs: continue
    for u in large_subs[s]:
        for s2 in sorted_subs:
            if s != s2 and u in large_subs[s2]:
                count = political_sub_edges.get((s, s2), 0)
                political_sub_edges[(s, s2)] = count + 1

In [57]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(political_sub_edges[e] for e in political_sub_edges)
G = nx.Graph()
for e in political_sub_edges:
    G.add_edge(e[0], e[1], weight=political_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def r2bgradient(score):
    """
    Returns an RGB hex color string for a score in [0, 1],
    with 0 as blue and 1 as red
    """
    r = math.floor(255 * score)
    b = math.floor(255 * (1 - score))
    color = '#' + '%02x' % r + '00' + '%02x' % b
    return color

def calculate_color(s):
    """
    Calculates a score and color for a subreddit s
    and returns the color
    """
    sub_score = score_sub(s) + 1
    sub_score /= 2
    if sub_score > 1: sub_score = 1
    elif sub_score < 0: sub_score = 0
    return r2bgradient(sub_score)
    
# calculate colors and sizes for the nodes (subreddits)
colors = {s: calculate_color(s) for s in G.nodes}
left_sources_set = set(left_sources)
right_sources_set = set(right_sources)
    
max_size = max(len(large_subs[s]) for s in G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in G.nodes}
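
To check the color mapping at its endpoints, the sketch below combines the rescaling from `calculate_color` and the gradient from `r2bgradient` into one hypothetical helper, `score_to_hex`, that takes a leaning score in [-1, 1] directly.

```python
import math

def score_to_hex(score):
    """Map a leaning score in [-1, 1] to a blue-to-red hex color."""
    t = min(1.0, max(0.0, (score + 1) / 2))  # rescale to [0, 1]
    r = math.floor(255 * t)
    b = math.floor(255 * (1 - t))
    return '#%02x00%02x' % (r, b)

for s in (-1, 0, 1):
    print(s, score_to_hex(s))
```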

In this graph, each node represents a subreddit and each edge represents one or more redditors who contribute to both subreddits. The radii of the nodes and the line widths of the edges are proportional to the number of contributors, and the color and x-position of each node reflect the political leaning of the subreddit, with left-leaning subs in blue on the left and right-leaning subs in red on the right. A lot of these classifications come as no surprise: for instance, all of the gun subreddits like r/glock and r/guns are very right-leaning. Others are less intuitive: the majority of video-game subreddits tend to be more right-leaning, despite their content not being political. Generally, however, the subreddits in the test data lean left.

In [58]:
X_SPACING = 1000
def x(sub):
    """calculates x position for node based on political leaning"""
    return score_sub(sub) * X_SPACING
    
# running y offsets per group, named *_y so they do not overwrite the left/right user dicts
left_y = 0
center_y = 0
right_y = 0
def y(sub):
    """
    Calculates y position for node based on political leaning
    and spreads nodes out in y direction
    """
    global left_y, center_y, right_y
    if sub in left_sources_set:
        result = left_y
        if left_y >= 0: left_y += 2
        left_y *= -1
    elif sub in right_sources_set:
        result = right_y
        if right_y >= 0: right_y += 2
        right_y *= -1
    else:
        result = center_y
        if center_y >= 0: center_y += 1
        center_y *= -1
    return result

FIGURE_SIZE = 1500
plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
              tools='')

graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))


hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
hover.show_arrow = False
plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
graph.node_renderer.data_source.data['name'] = list(G.nodes)
graph.node_renderer.data_source.data['score'] = [score_sub(s) for s in G.nodes]

graph_layout = {node: (x(node), y(node) * MAX_SIZE) for node in G.nodes}
graph.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)

graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
graph.edge_renderer.glyph.line_width = {'field': 'line_width'}

graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
graph.selection_policy = NodesAndLinkedEdges()

plot.renderers.append(graph)
reset_output()
output_notebook()
show(plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='cf793b88-2a19-4ae0-92f3-25f4bda7273e', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='3950477c-7800-4037-b4c3-34bde52d1e4e', ...)]


You can hover over the nodes to see their names and political leaning scores, and use the tools in the toolbar on the right to pan and zoom into the graph. Again, it is clear that the majority of subreddits, especially the larger and more popular ones, tend to lean slightly to the left. A few smaller subreddits veer significantly to the right, indicating that right-leaning redditors seem to form their own spaces away from the liberal Reddit community.

In the next graph, we plot the largest subreddits and *all* connections between them.

In [59]:
def make_edges(edges, large_subs, large_sub_users, sorted_subs):
    for i in range(len(sorted_subs)):
        s = sorted_subs[i]
        if s not in large_subs: continue
        for u in large_sub_users:
            if u in large_subs[s]:
                for j in range(i + 1, len(sorted_subs)):
                    s2 = sorted_subs[j]
                    if u in large_subs[s2]:
                        count = edges.get((s, s2), 0)
                        edges[(s, s2)] = count + 1

all_sub_edges = dict()
make_edges(all_sub_edges, large_subs, large_sub_users, sorted_subs = sorted(list(large_subs)))
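
For small inputs, the same kind of edge weights can be computed more compactly by intersecting contributor sets pairwise. The sketch below does this on invented data; it is an alternative formulation for illustration, not the code used to produce the graphs.

```python
from collections import Counter
from itertools import combinations

# Invented subreddit -> contributors mapping.
toy_subs = {
    "pics": {"alice", "bob", "carol"},
    "politics": {"alice", "carol"},
    "guns": {"bob"},
}

# Edge weight = number of users who contribute to both subreddits.
toy_edges = Counter()
for a, b in combinations(sorted(toy_subs), 2):
    shared = len(toy_subs[a] & toy_subs[b])
    if shared:
        toy_edges[(a, b)] = shared

print(dict(toy_edges))
```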

In [60]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(all_sub_edges[e] for e in all_sub_edges)
all_G = nx.Graph()
for e in all_sub_edges:
    all_G.add_edge(e[0], e[1], weight=all_sub_edges[e] / max_weight * MAX_WEIGHT)
colors = {s: calculate_color(s) for s in all_G.nodes}
max_size = max(len(large_subs[s]) for s in all_G.nodes)
node_sizes = {s:len(large_subs[s]) / max_size * MAX_SIZE for s in all_G.nodes}

In [61]:
def plot_network(G, colors, node_sizes, scores):
    FIGURE_SIZE = 2.1
    plot = figure(x_range=(-FIGURE_SIZE, FIGURE_SIZE), y_range=(-FIGURE_SIZE, FIGURE_SIZE),
                  tools='')

    graph = from_networkx(G, nx.spring_layout, scale=2, center=(0,0))


    hover = HoverTool(tooltips=[("sub name", "@name"), ("score", "@score")])
    hover.show_arrow = False
    plot.add_tools(hover, BoxZoomTool(), PanTool(), WheelZoomTool(), TapTool())
    graph.node_renderer.data_source.data['name'] = list(G.nodes)
    graph.node_renderer.data_source.data['score'] = scores

    graph.edge_renderer.data_source.data["line_width"] = [G.get_edge_data(a,b)['weight'] for a, b in G.edges()]
    graph.node_renderer.data_source.data['node_color'] = [colors[n] for n in G.nodes]
    graph.node_renderer.data_source.data['node_size'] = [node_sizes[n] for n in G.nodes]
    graph.node_renderer.glyph = Circle(size='node_size', fill_color={'field': 'node_color'})
    graph.edge_renderer.glyph.line_width = {'field': 'line_width'}
    
    graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width={'field': 'line_width'})
    graph.edge_renderer.selection_glyph = MultiLine(line_color='#000000', line_width={'field': 'line_width'})
    graph.selection_policy = NodesAndLinkedEdges()

    plot.renderers.append(graph)
    reset_output()
    output_notebook()
    return plot

original_plot = plot_network(all_G, colors, node_sizes, [score_sub(s) for s in all_G.nodes])
show(original_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='c968c3d2-7399-4e18-a22e-6a2add13edb2', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='e72e9862-a541-40e8-a73b-c99632de475c', ...)]


Clearly, redditors interact fluidly across all of the subreddits, as the graph is highly connected. 

To verify that our classifier has truly learned something about the data, we plot the same graph as above, but with the political leaning score calculated using the log probabilities of features given a class. 

In [62]:
logs = np.rollaxis(nb_clf.feature_log_prob_, 1)
probs = math.e ** logs
prob_lookup = dict(zip(sorted(all_subs), probs))
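
In isolation, turning a pair of per-class log probabilities into a leaning score looks like the sketch below. The log probabilities are invented, the function name is hypothetical, and the class order (left first, right second) follows the 0/1 labels used in training.

```python
import math

def right_share(log_p_left, log_p_right):
    """Share of the right class in the two per-class feature probabilities."""
    p_left = math.exp(log_p_left)
    p_right = math.exp(log_p_right)
    return p_right / (p_left + p_right)

# Invented log probabilities for one subreddit under each class.
share = right_share(math.log(0.02), math.log(0.06))
print(round(share, 2), round(share * 2 - 1, 2))
```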

In [63]:
def prob_score(sub):
    sub_probs = prob_lookup[sub]
    score = sub_probs[1] / sum(sub_probs)
    return score

def prob_color(sub):
    return r2bgradient(prob_score(sub))

prob_colors = {s: prob_color(s) for s in all_G.nodes}
prob_plot = plot_network(all_G, prob_colors, node_sizes, [prob_score(s) * 2 - 1 for s in all_G.nodes])
show(prob_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='ed3183af-3d36-4e19-9d98-13a1e7055441', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='aaecaa76-8fa0-4565-87d8-f2d9d53e08a3', ...)]


The resulting graph is almost identical to the one created by averaging contributors' political leanings, which indicates that our classifier is reasonably accurate. For a point of reference, look at r/pics: in the graph of the training data it was very small, as not many political users interacted with it, but it is now very large, reflecting its actual size as one of the largest subreddits. Notice again that the majority of subreddits are left-leaning, with a select few being highly conservative.

### Test Data Analysis
Now we'll visualize and classify the test data. Previously we classified the political leanings of the users in the test data, so we'll load and use those predictions to calculate scores for the test subreddits. The steps to create the Bokeh graphs for the test data will be similar to what we did for the training data.

In [64]:
with open('reddit data/test_subs.json') as f:
    test = json.load(f)
with open('reddit data/test_users.json') as f:
    test_users = json.load(f)
with open('reddit data/test_preds.json') as f:
    test_preds_svm = json.load(f)
with open('reddit data/test_preds_nb.json') as f:
    test_preds_nb = json.load(f)

In [79]:
# invert user:subs dict
test_sub_users = dict()
test_subs = set()
for u in test_users:
    test_subs.update(set(test[u]))
for s in test_subs:
    test_sub_users[s] = set()
for u in test:
    for s in test[u]:
        test_sub_users[s].add(u)
print(len(test_sub_users))
test_large_subs = {s:test_sub_users[s] for s in test_sub_users if len(test_sub_users[s]) >= 1500}
test_most_subs = {s:test_sub_users[s] for s in test_sub_users if len(test_sub_users[s]) >= 50}
print(len(test_large_subs))

16245
45


We'll start looking at the test data by using the users classified with Naive Bayes to calculate political leaning scores for the subreddits collected from r/all.

In [80]:
def predicted_score(preds, s, users):
    return sum(preds[u] for u in users) / len(users)

scored_subs_nb = {s: predicted_score(test_preds_nb, s, test_most_subs[s]) for s in test_most_subs}
sorted_scored_subs_nb = list(reversed(sorted(scored_subs_nb)))
scores_nb = [scored_subs_nb[s] * 2 - 1 for s in sorted_scored_subs_nb]

# create a huge scatter plot with all subs and their scores
X_RANGE = 50
scatter_nb = figure(x_range=(-X_RANGE, X_RANGE), y_range=sorted_scored_subs_nb)
zero = Span(location=0, dimension='height', line_color='black',
            line_width=3)
scatter_nb.add_layout(zero)
scatter_nb.circle(scores_nb, sorted_scored_subs_nb, size=5, fill_color="green", line_color="black", line_width=3)
reset_output()
output_notebook()
show(scatter_nb)
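
On a toy subreddit, the predicted score works out as follows. The usernames and predictions are invented; the 0 = left, 1 = right convention matches the labels built during training.

```python
# Invented predictions: 0 = left, 1 = right.
toy_preds = {"alice": 0, "bob": 1, "carol": 0, "dave": 0}
toy_contributors = {"alice", "bob", "carol", "dave"}

share_right = sum(toy_preds[u] for u in toy_contributors) / len(toy_contributors)
leaning = share_right * 2 - 1  # rescale from [0, 1] to [-1, 1]
print(share_right, leaning)
```

Unlike the training scores, this value is not corrected for the overall left/right balance of the predicted users.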

Compared to our results from the training data, the classified data shows that most subreddits are left-leaning, with very few right-leaning subreddits.

Now we use the classified data to create network graphs and learn more. First we use the data from users classified by the SVM.

In [66]:
test_sub_edges = dict()
make_edges(test_sub_edges, test_large_subs, test_users, sorted(list(test_large_subs)))

In [67]:
MAX_SIZE = 50
MAX_WEIGHT = 30

max_weight = max(test_sub_edges[e] for e in test_sub_edges)
test_G = nx.Graph()
for e in test_sub_edges:
    test_G.add_edge(e[0], e[1], weight=test_sub_edges[e] / max_weight * MAX_WEIGHT)
    
def predicted_score(preds, s, users):
    return sum(preds[u] for u in users) / len(users)

    
def calculate_predicted_color(preds, s):
    users = test_large_subs[s]
    return r2bgradient(predicted_score(preds, s, users))
    
max_size = max(len(test_large_subs[s]) for s in test_G.nodes)
test_node_sizes = {s:len(test_large_subs[s]) / max_size * MAX_SIZE for s in test_G.nodes}

In [68]:
test_colors_svm = {s: calculate_predicted_color(test_preds_svm, s) for s in test_G.nodes}
svm_plot = plot_network(test_G, test_colors_svm, test_node_sizes,
                        [predicted_score(test_preds_svm, s, test_sub_users[s]) * 2 - 1 for s in test_G.nodes])
show(svm_plot)

ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='c860d166-5b49-41f8-b84c-8068f1f1b334', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='f9ce28c7-fcaf-426c-a419-7b7c4e5e5b9e', ...)]


Ok. That doesn't look right. It doesn't make much sense that most of reddit would be conservative, or that so many subreddits would be nearly as conservative as 'the_donald'. While there is a large conservative presence on reddit, it is unlikely to be large enough to make participation in the largest subreddits (such as pics and askreddit) overwhelmingly conservative. There must be some reason for this.

In [69]:
right_users_svm = sum(test_preds_svm.values())
left_users_svm = len(test_preds_svm) - right_users_svm

print('Classified left: %d, Classified right: %d' % (left_users_svm, right_users_svm))

Classified left: 4107, Classified right: 26664


It seems as though our SVM classifier tends to classify far more users as right-leaning than left-leaning. That doesn't seem accurate: since reddit is an online platform, its users should tend to be closer to the middle, or perhaps slightly left-leaning. Let's see if that is consistent with Naive Bayes:

In [70]:
right_users_nb = sum(test_preds_nb.values())
left_users_nb = len(test_preds_nb) - right_users_nb

print('Classified left: %d, Classified right: %d' % (left_users_nb, right_users_nb))

Classified left: 20246, Classified right: 10525


That looks like it makes more sense. It seems as though our SVM classifier does not do well on our test data. Keep in mind that our test users were taken from r/all, which holds the most popular posts on reddit and is likely to attract a wide range of contributors. Yet our SVM classifier reached 98% accuracy on held-out data during cross-validation. Perhaps we can learn something by looking at the classifier itself:

In [71]:
print(len(svm_clf.support_))

32409


So there are over 32,000 support vectors, i.e. the important examples on which the SVM bases its classification.

In [76]:
for i in svm_clf.support_[:10]:
    print(all_data[i])

['animesuggest', 'pics', 'livefromnewyork', 'watchitfortheplot', 'unresolvedmysteries', 'tumblrinaction', 'nostalgiafapping', 'documentaries', 'wtfgaragesale', 'progressive', 'familyguy', 'todayilearned', 'television', 'playboybunnies', 'horror', 'whitepeopletwitter', 'youshouldknow', 'bestof', 'politics', 'supergirltv', 'technology', 'gifrecipes', 'askreddit', 'blackpeopletwitter', 'teentitans', 'wtf', 'unexpected', 'intel', 'movies', 'moviepassclub', 'oldschoolcool']
['askanamerican', 'askreddit', 'relationships', 'christianity', 'funny', 'murica', 'the_mueller', 'askmen', 'adviceanimals', 'cyberpunk']
['science', 'worldoftanks', 'meditation', 'worldofwarships', 'games', 'criticalrole', 'joerogan', 'personalfinance', 'zenhabits', 'westworld', 'dota2', 'worldnews', 'doki', 'pathofexile', 'socialism', 'elderscrollsonline', 'frostpunk', 'elderscrollslegends']
['coolguides', 'quityourbullshit', 'bluemidterm2018']
['depression', 'pics', 'fo4', 'bookporn', 'jokes', 'starbound', 'anormalday

Many of these examples contain conflicting subreddits: the_donald and the_mueller, liberal and republican, democrats and conservative, etc. So it seems as though the SVM is doing its job: finding support vectors, the hard-to-classify examples that are likely to define the margin.

Let's see if this holds for the test data:

In [77]:
# One set of subreddits per test user
all_test_subs = [set(subs) for subs in test.values()]

left_set = set(left_sources)
right_set = set(right_sources)

# Keep users who contributed to at least one left AND one right source subreddit
all_conflicting = [list(subs) for subs in all_test_subs
                   if subs & left_set and subs & right_set]
print(len(all_conflicting))

145


Ok, that explains a lot. Our SVM classifier is a margin-based classifier, and its margin is well defined over our training set. However, our test set contains significantly fewer of these margin-defining data points, leading to a failure of the SVM on our test set.
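
To put the numbers printed so far side by side, the short calculation below turns them into shares. It is only arithmetic on the outputs above, and it assumes that `test` contains the same users that both classifiers labeled (the two tallies do sum to the same total).

```python
# Counts copied from the outputs printed earlier in this notebook
svm_left, svm_right = 4107, 26664
nb_left, nb_right = 20246, 10525
conflicting = 145

total_users = svm_left + svm_right
# Sanity check: both classifiers labeled the same number of users
assert total_users == nb_left + nb_right

svm_right_share = svm_right / total_users
nb_right_share = nb_right / total_users
# Assumption: `test` holds exactly these users, so this is the share of
# test users appearing in both a left and a right source subreddit
conflict_share = conflicting / total_users

print(f"SVM classified right: {svm_right_share:.1%}")
print(f"NB classified right:  {nb_right_share:.1%}")
print(f"Conflicting users:    {conflict_share:.2%}")
```

Under that assumption, fewer than half a percent of test users sit in both camps, while the SVM labels roughly 87% of users as right-leaning compared with about 34% for Naive Bayes.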

For the remainder of our analysis, we will use the Naive Bayes classifier as our sole classifier, as it seems to be a much more reasonable classifier.

In [78]:
test_colors = {s: calculate_predicted_color(test_preds_nb, s) for s in test_G.nodes}
nb_plot = plot_network(test_G, test_colors, test_node_sizes,
                       [predicted_score(test_preds_nb, s, test_sub_users[s]) * 2 - 1 for s in test_G.nodes])
show(nb_plot)


ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: line_width [renderer: GlyphRenderer(id='077125f3-558c-4587-b3fb-e1e58efa4b9f', ...)]
ERROR:bokeh.core.validation.check:E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: node_color, node_size [renderer: GlyphRenderer(id='fbf0db32-b0bd-402c-bba7-7a408c89d12c', ...)]


It is clear from the visualization that the major subreddits on the platform are largely liberal. This seems to confirm our hypothesis that reddit is highly insular, and that people seem to stay in their respective communities.

## Conclusions

Overall, there are many conclusions that could be drawn from the conducted analysis. For example, the clear split between right- and left-leaning subreddits, and their corresponding topics, offers an interesting insight into the insular nature of subreddits. The data suggest that contributors to conservative and liberal subreddits tend to contribute only to subreddits that largely share their perspectives. This shows that even in social networks that aren't restrained by friend circles, users will carve out communities of similar perspectives. It's important to understand this about Reddit because of Reddit's importance as a news source for its users. As would be expected from what has previously been written about The_Donald and alt-right subredditors, our analysis suggests that people on both the left and the right seem to avoid news from the other perspective, giving a lot of power to fake news and conspiracy theories. This detracts from Reddit's efficacy as a news source because the content is unverified and unbalanced.

Given more time, much more analysis of this (or further) data could be done. First, many improvements could be made to the classifiers, by using more, or more representative, data, and by adding features such as contribution frequency or comment topics to improve classifier accuracy. Different classifier types could also be used, such as AdaBoost-based classifiers or the Perceptron, as could unsupervised techniques such as PCA or k-means, to find insular groups on reddit.
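
As one illustration of the alternatives mentioned above, here is a minimal pure-Python sketch of a perceptron over "bag of subreddits" features. It was not run as part of this analysis. The training examples, their labels and the function names are all hypothetical; the only thing carried over from the notebook is the 0 = left, 1 = right encoding.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Train on (set_of_subreddits, label) pairs, label 0 = left, 1 = right."""
    weights = defaultdict(float)  # one weight per subreddit seen in training
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for subs, label in examples:
            target = 1 if label == 1 else -1
            activation = bias + sum(weights[s] for s in subs)
            if target * activation <= 0:  # misclassified (or on the boundary)
                for s in subs:
                    weights[s] += target
                bias += target
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: converged
            break
    return weights, bias


def predict(weights, bias, subs):
    """Return 1 (right) if the activation is positive, else 0 (left)."""
    return 1 if bias + sum(weights.get(s, 0.0) for s in subs) > 0 else 0


# Hypothetical toy data; labels are illustrative only
examples = [
    ({"politics", "askreddit"}, 0),
    ({"the_donald", "askreddit"}, 1),
    ({"progressive", "pics"}, 0),
    ({"conservative", "pics"}, 1),
]

weights, bias = train_perceptron(examples)
train_preds = [predict(weights, bias, subs) for subs, _ in examples]
print(train_preds)
print(predict(weights, bias, {"the_donald", "pics"}))
```

On this toy data the shared subreddits (askreddit, pics) end up with zero weight, so only the politically distinctive subreddits drive the prediction. A user with no known subreddits falls back to label 0, which is a design choice of this sketch rather than a claim about reddit.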

Furthermore, more analysis could be done on some of the smaller subreddits, such as the many gun-based subreddits, or on the more obscure political movements, and more investigation could be done into the apparent tendency of many game-related subreddits to lean right. The large number of subreddits and the complex relationships between them permit many studies into why certain people contribute to certain subreddits, or why certain subreddits attract a particular group of users.