# Data Science for Social Justice Workshop: Module 5

## Network Analysis

In this notebook, we'll access the Reddit Application Programming Interface (API) to do a small network analysis of "influencers" in our data. An API is effectively a protocol for how to obtain information from someone else's servers. For example, Reddit has all its users, comments, upvotes, downvotes, etc., stored in its own databases. We could obtain some of this information by simply scraping the web, but this is an arduous and time-consuming process. So, Reddit has provided a way to access portions of its database in a streamlined manner, and the API is the "guidebook" for how to do so. We can use this API to get more information on the users in the AITA subreddit.

### Getting an API Set Up

1. **Sign Up.** First, you will need to sign up with Reddit to run some of the code. Go to http://www.reddit.com and **sign up** for an account.

2. **Create an App.** Go to [this page](https://ssl.reddit.com/prefs/apps/) and click on the `are you a developer? create an app` button at the bottom.

3. **Fill Out the Form.** Fill out the form that appears. For the name, you can enter whatever you'd like. Select "script". Enter the redirect uri as shown. Otherwise, you can leave everything else blank. Then, click "create app".

![redditapi](../../img/reddit_api.png)

4. **Note API Credentials.** You should see a new box appear, with some important information. This includes:
    - Client ID: A string of at least 14 characters, listed just under “personal use script” for your application.
    - Client Secret: A string of at least 27 characters, listed next to “secret” for your application.
    - Username: The username of the Reddit account used to register the application.
    - Password: This is not shown here, but you should remember your password to your account.
    
![redditapi2](../../img/reddit_api2.png)

## Importing and Using `praw`

Even though we're set up with the API, we still need to have a way to use Python to interface with the API. Luckily, this is already done for us via the Python Reddit API Wrapper: `praw`. This is a package we can download and use.

In [None]:
!pip install praw

To use `praw`, you need to grab the information that you noted before from your API. Fill in the details below. **Do not share your credentials with anyone. Be especially careful not to share them via a public portal (e.g., GitHub). If you do so, you should consider them compromised, and obtain a new set.**

In [None]:
import praw

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID_HERE',
                     client_secret='YOUR_CLIENT_SECRET_HERE',
                     password='YOUR_REDDIT_PSW_HERE',
                     user_agent='Get Reddit network data, v1.0, by /u/YOUR_USERNAME_HERE',
                     username='YOUR_USERNAME_HERE')
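
One way to keep credentials out of a shared notebook is to read them from environment variables instead of typing them into a cell. The sketch below assumes you have set four variables yourself. The names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USERNAME`, and `REDDIT_PASSWORD`, as well as the helper `load_reddit_credentials`, are hypothetical choices for this example and not something `praw` requires.

```python
import os

# Hypothetical variable names; set these in your own environment beforehand.
REQUIRED_VARS = {
    'client_id': 'REDDIT_CLIENT_ID',
    'client_secret': 'REDDIT_CLIENT_SECRET',
    'username': 'REDDIT_USERNAME',
    'password': 'REDDIT_PASSWORD',
}

def load_reddit_credentials(env=None):
    """Collect Reddit credentials from environment variables."""
    # Default to the real environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    # Fail early with a clear message if anything is missing or empty
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

Because the dictionary keys match the keyword arguments used above, the result could be passed along as `praw.Reddit(user_agent='...', **load_reddit_credentials())`.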

## Network Analysis: Finding Influencers in Reddit Data

When working with Reddit data, we can't determine the most influential users at a glance. Other social media platforms have follower counts, which directly quantify the reach a user is likely to have. Redditors only have karma (roughly, the net total of upvotes and downvotes received since account creation) and a log of their posts and comments in different subreddits. These two statistics can give a rough idea of a user's activity.

It has already been found that a very small percentage of Reddit’s users create the vast majority of the site’s content, so we would not be surprised if only a few users could influence the discourse of entire subreddits. Identifying these users would help us understand how a subreddit's discourse is shaped. 

In [None]:
import os
os.chdir('../../data')

In [None]:
import pandas as pd
df = pd.read_csv('aita_sub_top_sm.csv')

Let's sort by score and just get the top 1000 posts.

In [None]:
df = df.sort_values(by='score', ascending=False)[:1000]
# Sanity check
print(df.shape)

How many *unique* authors do we have in our data?

In [None]:
df.author.nunique()

Let's examine a potential relationship between score and number of comments:

In [None]:
df.plot('score', 'num_comments', kind='scatter', color='black', alpha=0.25, logy=True)

This scatter plot shows that the number of comments doesn't necessarily increase as a post's net score increases.
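
A scatter plot gives a visual impression, and a rank correlation is one way to put a number on it. The sketch below uses a small made-up table (not the AITA data) and computes Spearman's rank correlation by ranking both columns and then taking the ordinary Pearson correlation of the ranks. The name `toy_posts` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up scores and comment counts, purely for illustration
toy_posts = pd.DataFrame({
    'score': [100, 200, 300, 400, 500],
    'num_comments': [50, 10, 40, 20, 30],
})

# Spearman's rank correlation: Pearson correlation computed on the ranks
rank_corr = toy_posts['score'].rank().corr(toy_posts['num_comments'].rank())
print(round(rank_corr, 2))
```

Values near 1 or -1 indicate a strong monotonic relationship, while values near 0 indicate a weak one. The same expression could be applied to the `score` and `num_comments` columns of `df`.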

Let's only look at the users who posted more than once:

In [None]:
repeating = df[df.duplicated(['author'], keep=False)]
# Get rid of deleted users
repeating = repeating[repeating['author'] != '[deleted]']

In [None]:
# Number of unique authors who posted more than once among the top posts
repeating.author.nunique()
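
To see what `duplicated(..., keep=False)` does, here is a minimal sketch on a made-up table. The author names are invented for illustration. With `keep=False`, every row whose author appears more than once is flagged, rather than only the second and later occurrences.

```python
import pandas as pd

# Made-up posts; 'ana' and '[deleted]' each appear twice
toy_authors = pd.DataFrame({
    'author': ['ana', 'ben', 'ana', '[deleted]', 'cy', '[deleted]'],
    'score': [90, 80, 70, 60, 50, 40],
})

# Flag every row belonging to an author with more than one post
is_repeat = toy_authors.duplicated(['author'], keep=False)
toy_repeating = toy_authors[is_repeat]
# Drop the placeholder used for deleted accounts
toy_repeating = toy_repeating[toy_repeating['author'] != '[deleted]']
print(toy_repeating)
```

Filtering out `[deleted]` matters here because unrelated deleted accounts all share that one placeholder name and would otherwise look like a single prolific author.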

Next, we need to decide which of these users we consider to be "influencers". Let's first have a look at where and how often these popular authors are posting. We'll define a function that can get us the other posts by the influencers we have found in our data.

In [None]:
def get_user_posts(author, n_submissions):
    """Gets the top posts by a Reddit user as a dataframe."""
    # Initialize the list first so it exists even if the API request fails
    user_posts_list = []
    try:
        # Create a "redditor" object
        redditor = reddit.redditor(author)
        # Iterate over the top N submissions for the redditor
        for submission in redditor.submissions.top(limit=n_submissions):
            # Obtain information about each submission
            info_list = [submission.id,
                         submission.score,
                         str(submission.author),
                         submission.num_comments,
                         str(submission.subreddit)]
            user_posts_list.append(info_list)
    # Skip redditors we can't fetch (e.g., banned or deleted accounts)
    except Exception:
        pass

    # Sort submissions in decreasing order of score
    sorted_submissions = sorted(user_posts_list, key=lambda x: x[1], reverse=True)
    # Place submissions in a dataframe.
    user_posts_df = pd.DataFrame(sorted_submissions,
                                 columns=['id', 'score', 'author', 'n_comments', 'subreddit'])
    return user_posts_df

In [None]:
# The "influencers" are the authors who posted more than once
authors = repeating['author'].unique()
# Make an empty dataframe
authors_df = pd.DataFrame()
# Loop through every "influencer" and get their 20 top posts
for author in authors:
    user_posts_df = get_user_posts(author, 20)
    authors_df = pd.concat([authors_df, user_posts_df])

In [None]:
# Renumber the rows from 0, discarding the old per-user index
authors_df = authors_df.reset_index(drop=True)

In [None]:
print(authors_df.shape)
authors_df.head(10)

Next, let's find out where else these influencers posted. To form a network graph, we need data about the particular subreddits where our influencers appeared, so we'll count how often each subreddit shows up among their top posts. For the sake of simplicity, we visualize only those subreddits with more than two posts made by the influencers. The y-axis shows the number of posts and the x-axis shows the respective subreddits.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Count of each subreddit
counts = authors_df['subreddit'].value_counts()

In [None]:
# Only plot the subreddits that appear more than twice
ax = counts[counts > 2].plot(
    kind='bar',
    title='Distribution of other subreddits where influencers post',
    figsize=(10, 5),
    rot=15) 
ax.set_xlabel('Subreddits')
ax.set_ylabel('Number of Posts')
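
The counting and filtering step can be checked on a small made-up example. The subreddit names below are placeholders, not results from the data. `value_counts` returns the counts sorted from most to least frequent, and the boolean mask keeps only the entries above the threshold.

```python
import pandas as pd

# Made-up subreddit column: 4, 3, 2, and 1 posts respectively
toy_subs = pd.Series(['AmItheAsshole'] * 4 + ['cats'] * 3 + ['cooking'] * 2 + ['chess'])

# Count the posts per subreddit
toy_counts = toy_subs.value_counts()
# Keep only subreddits with strictly more than two posts
frequent = toy_counts[toy_counts > 2]
print(frequent)
```

Note that the comparison is strict, so a subreddit with exactly two posts (`cooking` here) is left out of the plot.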

## Conducting a Network Analysis

Finally, we're going to use a package called `networkx` to visualize the subreddits that people post in. The graphical representation will allow us to better assess how each Redditor posts, and where they are influential.

First, let's install `networkx`.

In [None]:
!pip install networkx

Now, we extract the author and subreddit in our data frame. This is what we're going to use to create our graph.

In [None]:
# Create a dataframe for network graph purposes 
network_df = authors_df[['author', 'subreddit']]
network_df.head()

In [None]:
# Make list of unique subreddits to use in network graph
subreddits = list(network_df.subreddit.unique()) 
# Make list of unique authors to use in network graph 
authors = list(network_df.author.unique())

Let's create the graph. There are a lot of moving parts here. In effect, what we're doing is visualizing the *other* subreddits that the top posters on AITA are posting in.

In [None]:
import networkx as nx

plt.figure(figsize=(18, 18))

# Create the graph from the dataframe
g = nx.from_pandas_edgelist(network_df, source='author', target='subreddit') 

# Create a layout for nodes 
layout = nx.spring_layout(g, iterations=50, scale=2)

# Draw the graph piece by piece:
# - Subreddits appear in light blue, sized by their number of connections
# - Authors appear small and grey
# - Authors connected to more than one subreddit are highlighted in orange
# - Edges are thin and grey

# Go through every subreddit and ask the graph how many connections it has.
# Multiply that by 80 to get the circle size
sub_size = [g.degree(sub) * 80 for sub in subreddits]
nx.draw_networkx_nodes(g, 
                       layout, 
                       nodelist=subreddits, 
                       node_size=sub_size, # a LIST of sizes, based on g.degree
                       node_color='lightblue')

# Draw all the authors (using the list of unique authors, so each node is drawn once)
nx.draw_networkx_nodes(g, layout, nodelist=authors, node_color='#cccccc', node_size=100)

# Draw highly connected influencers (authors linked to more than one subreddit)
influencers = [person for person in authors if g.degree(person) > 1]
nx.draw_networkx_nodes(g, layout, nodelist=influencers, node_color='orange', node_size=100)
# Draw edges
nx.draw_networkx_edges(g, layout, width=1, edge_color="#cccccc")

# Labels for subreddits and authors
node_labels = dict(zip(subreddits, subreddits))
auth_labels = dict(zip(authors, authors))

nx.draw_networkx_labels(g, layout, labels=node_labels)
nx.draw_networkx_labels(g, layout, labels=auth_labels)

# No axis needed
plt.axis('off')
plt.title("Network Graph of Related Subreddits")
plt.show()

In this graph, influencer nodes appear small and grey. The influencers who have more connections than just r/amitheasshole are highlighted in orange. The subreddits appear in light blue and are sized according to their respective number of connections. All the Redditors shown post in AITA, but they also post in other communities: what do those communities tell you about these Redditors and their interests?
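
To make the idea of "connections" concrete, here is a minimal sketch on a made-up edge list. The authors and subreddits are invented for illustration. Because `nx.from_pandas_edgelist` builds a simple undirected graph by default, repeated author-subreddit pairs collapse into a single edge, so an author's degree is the number of distinct subreddits they posted in, not their number of posts.

```python
import networkx as nx
import pandas as pd

# Made-up posts: 'ana' posts in two subreddits, 'ben' posts twice in one
toy_edges = pd.DataFrame({
    'author': ['ana', 'ana', 'ana', 'ben', 'ben'],
    'subreddit': ['AmItheAsshole', 'AmItheAsshole', 'cats',
                  'AmItheAsshole', 'AmItheAsshole'],
})

# Each row becomes an edge; duplicate rows map to the same edge
toy_g = nx.from_pandas_edgelist(toy_edges, source='author', target='subreddit')

# Authors connected to more than one subreddit count as highlighted nodes
toy_author_names = list(toy_edges['author'].unique())
toy_influencers = [a for a in toy_author_names if toy_g.degree(a) > 1]

print(dict(toy_g.degree()))
print(toy_influencers)
```

In this toy graph only `ana` would be highlighted, even though `ben` has just as many posts in r/AmItheAsshole.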