# Exploring Reddit with the pushshift API
This notebook gives examples of how to use the Pushshift API to query Reddit data.

* Pushshift doc: https://github.com/pushshift/api
* FAQ about Pushshift: https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/

In [None]:
import requests
import pandas as pd

We define a convenient function to get data from Pushshift:

In [None]:
def get_pushshift_data(data_type, params):
    """
    Gets data from the pushshift api.
 
    data_type can be 'comment' or 'submission'
    The rest of the args are interpreted as payload.
 
    Read more: https://github.com/pushshift/api
    
    This function is inspired by:
    https://www.jcchouinard.com/how-to-use-reddit-api-with-python/
    """
 
    base_url = f"https://api.pushshift.io/reddit/search/{data_type}/"
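    response = requests.get(base_url, params=params)
    print('Query:')
    print(response.url)
    try:
        # .json() raises a ValueError when the body is not valid JSON
        # (e.g. an HTML error page); a missing "data" key yields an empty list
        data = response.json().get("data", [])
    except ValueError:
        print('--- Request failed ---')
        data = []
    return data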


This function accepts the parameters of the pushshift API detailed in the doc at https://github.com/pushshift/api. An example is given below.
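
To make the `Query:` line printed by the function easier to read, here is a minimal sketch of how a dictionary of parameters becomes a query string. It uses the standard library's `urlencode` rather than `requests`, so the URL that `requests` actually prints may differ slightly in its encoding details.

```python
from urllib.parse import urlencode

base_url = "https://api.pushshift.io/reddit/search/comment/"
params = {
    "subreddit": "askscience",
    "after": "7d",
    "size": 100,
    "author": "![deleted]",
}

# urlencode joins key=value pairs with "&" and percent-encodes
# reserved characters such as "!", "[" and "]"
query_string = urlencode(params)
url = base_url + "?" + query_string
print(url)
```

Nothing is sent over the network here; the sketch only shows the shape of the request URL.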

## Example of request to the API
Let us collect the comments written in the last 7 days in the subreddit `askscience`. The number of results returned is limited to 100, the upper limit of the API.

In [None]:
# parameters for the pushshift API
data_type = "comment"    # accept "comment" or "submission", search in comments or submissions
params = {
    "subreddit" : "askscience", # limit to one or a list of subreddit(s)
    "after" : "7d", # Select the timeframe. Epoch value or Integer + "s,m,h,d" (i.e. "second", "minute", "hour", "day")
    "size" : 100, # Number of results to return (limited to max 100 in the API)
    "author" : "![deleted]" # limit to a list of authors or ignore authors with a "!" mark in front
}
# Note: the option "aggs" (aggregate) has been de-activated in the API

data = get_pushshift_data(data_type, params)
if data: # control if something is returned
    df = pd.DataFrame.from_records(data)
    print('Some of the data returned:')
    # print explicitly: a bare expression inside an if-block is not displayed by the notebook
    print(df[['author', 'subreddit', 'score', 'created_utc', 'body']].head())
else:
    print('The returned data is empty. Change the parameters.')
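
The `after` parameter accepts an epoch value as well as a relative string such as `"7d"`. The sketch below shows one way to compute such a value with the standard library. The helper name `epoch_days_ago` is hypothetical and not part of Pushshift or the notebook, and the fixed reference date is only there to make the result reproducible.

```python
from datetime import datetime, timedelta, timezone

def epoch_days_ago(days, now=None):
    """Return the Unix timestamp (in seconds) of the moment `days` days before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return int((now - timedelta(days=days)).timestamp())

# Fixed reference date (an assumption for the example): 8 January 2021, midnight UTC
reference = datetime(2021, 1, 8, tzinfo=timezone.utc)
after_epoch = epoch_days_ago(7, now=reference)
print(after_epoch)

# The same conversion in reverse turns a `created_utc` value into a readable date
print(datetime.fromtimestamp(after_epoch, tz=timezone.utc).isoformat())
```

The resulting integer could be passed as `params["after"]` in place of `"7d"`.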

## Authors of comments
Let us collect the authors of comments in a subreddit over the last few days. The next function bypasses the result limit by sending several queries, excluding the authors already collected from each new query to avoid duplicates.

In [None]:
# Get the list of unique authors of comments in the API results
# bypass the limit of 100 results by sending multiple queries
def get_unique_authors(n_results, params):
    results_per_request = 100 # default nb of results per query
    n_queries = n_results // results_per_request + 1
    author_list = []
    author_neg_list = ["![deleted]"]
    for query in range(n_queries):
        params["author"] = author_neg_list
        data = get_pushshift_data(data_type="comment", params=params)
        df = pd.DataFrame.from_records(data)
        if df.empty:
            return author_list
        authors = list(df['author'].unique())
        # add ! mark
        authors_neg = ["!"+ a for a in authors]
        author_list += authors
        author_neg_list += authors_neg
    return author_list
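
The exclusion-list idea can be tried offline. In the sketch below, `fake_fetch` is a hypothetical stand-in for the API (it is not a Pushshift call) that returns at most two authors per query and honours the `!name` exclusions. `collect_unique_authors` mirrors the loop of `get_unique_authors` under that assumption.

```python
# Hypothetical data: the authors of six comments, in the order the "API" returns them
COMMENT_AUTHORS = ["ann", "bob", "ann", "cy", "bob", "dee"]

def fake_fetch(excluded, page_size=2):
    """Stand-in for the API: return up to `page_size` authors not in `excluded`."""
    banned = {name[1:] for name in excluded}  # strip the leading "!"
    remaining = [a for a in COMMENT_AUTHORS if a not in banned]
    return remaining[:page_size]

def collect_unique_authors(fetch, n_queries):
    """Query repeatedly, excluding every author already seen."""
    seen = []
    excluded = ["![deleted]"]
    for _ in range(n_queries):
        batch = fetch(excluded)
        if not batch:
            break  # nothing new is returned, stop early
        new_authors = list(dict.fromkeys(batch))  # de-duplicate, keep order
        seen += new_authors
        excluded += ["!" + a for a in new_authors]
    return seen

authors = collect_unique_authors(fake_fetch, n_queries=5)
print(authors)
```

Each query can only return authors that were not seen before, so the list grows without duplicates until the source is exhausted.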

Let us make a list of authors commenting on the subreddit "askscience".

In [None]:
# Ask for the authors of comments in the last days, collect at least "n_results"
subreddit = "askscience"
data_type = "comment"
params = {
    "subreddit" : subreddit,
    "after" : "2d"
}
n_results = 500
author_list = get_unique_authors(n_results, params)
print("Number of authors:",len(author_list))

From the list of authors obtained, let us collect the other subreddits where they commented.

In [None]:
# Collect the subreddits where the authors wrote comments and the number of comments
from collections import Counter
data_type = "comment"
params = {
    "size" : 100
}
subreddits_count = Counter()
for author in author_list:
    params["author"] = author
    print(params["author"])
    data = get_pushshift_data(data_type=data_type, params=params)
    if data: # skip authors for which the request failed or returned nothing
        df = pd.DataFrame.from_records(data)
        subreddits_count += Counter(dict(df['subreddit'].value_counts()))
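
As a small illustration of how the per-author counts accumulate, the sketch below adds two hand-made `Counter` objects. The subreddit names and numbers are invented for the example.

```python
from collections import Counter

# Invented per-author comment counts, one dictionary per author
per_author_counts = [
    {"askscience": 3, "physics": 1},
    {"askscience": 2, "space": 4},
]

total = Counter()
for counts in per_author_counts:
    # "+=" adds the counts key by key and creates missing keys
    total += Counter(counts)

print(total.most_common())
```

Subreddits seen for several authors end up with the sum of their comment counts, which is what `subreddits_count` holds after the loop above.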

## Network of subreddits (ego-graph)
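Let us build the ego-graph of the subreddit. Another subreddit is connected to the main one when the collected authors also commented there. The edge weight is the number of comments in that subreddit divided by the number of comments in the main one, and edges whose ratio falls below a threshold are dropped.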

In [None]:
# module for networks
import networkx as nx

In [None]:
threshold = 0.05
G = nx.Graph()
G.add_node(subreddit)
self_refs = subreddits_count[subreddit]
for sub,value in subreddits_count.items():
    post_ratio = value/self_refs
    if post_ratio >= threshold:
        G.add_edge(subreddit,sub, weight=post_ratio)
print("Total number of edges in the graph:",G.number_of_edges())

Here is an alternative way of generating the graph using pandas dataframes instead of a for loop (it might scale better on bigger graphs).

In [None]:
threshold = 0.05
subreddits_count_df = pd.DataFrame.from_dict(subreddits_count, orient='index', columns=['total'])
subreddits_ratio_df = subreddits_count_df/subreddits_count_df.loc[subreddit]
subreddits_ratio_df.rename(columns={'total': 'weight'}, inplace=True)
filtered_sr_df = subreddits_ratio_df[subreddits_ratio_df['weight'] >= threshold].copy() # filter weights < threshold
filtered_sr_df['source'] = subreddit
filtered_sr_df['target'] = filtered_sr_df.index
Gdf = nx.from_pandas_edgelist(filtered_sr_df, source='source', target='target', edge_attr=True)
print("Total number of edges in the graph:",Gdf.number_of_edges())

In [None]:
# Write the graph to a file
path = 'egograph.gexf'
nx.write_gexf(G,path)
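
The weighting rule of the ego-graph can be checked by hand on a toy example. This is a minimal sketch in plain Python, without networkx, and the counts are invented. The helper `ego_edges` is hypothetical, and it makes two choices that differ from the cells above: it skips the central subreddit (the cells above keep it, which produces a self-loop of weight 1) and it returns no edges when the central subreddit has no comments, which avoids a division by zero.

```python
def ego_edges(counts, center, threshold=0.05):
    """Return {subreddit: ratio} for subreddits whose ratio reaches the threshold."""
    center_count = counts.get(center, 0)
    if center_count == 0:
        return {}  # no reference count, no ratio can be computed
    edges = {}
    for sub, value in counts.items():
        if sub == center:
            continue  # skip the self-loop
        ratio = value / center_count
        if ratio >= threshold:
            edges[sub] = ratio
    return edges

# Invented comment counts
toy_counts = {"askscience": 200, "physics": 30, "space": 8, "aww": 20}
edges = ego_edges(toy_counts, "askscience")
print(edges)
```

Here `space` is dropped because 8/200 = 0.04 is below the threshold of 0.05.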

## Network of subreddit neighbors
This second collection distinguishes between the related subreddits. For each author, the subreddits where they commented (limited to their 10 most frequent ones) are connected to each other. The weight of each connection is the number of authors who commented in both subreddits it joins. The ego-graph becomes an approximate neighbor network for the central subreddit.

In [None]:
data_type = "comment"
params = {
    "size" : 100
}
count_list = []
for author in author_list:
    params["author"] = author
    print(params["author"])
    data = get_pushshift_data(data_type=data_type, params=params)
    if data:
        df = pd.DataFrame.from_records(data)
        count_list.append(Counter(dict(df['subreddit'].value_counts())))

In [None]:
import itertools
threshold = 0.05
G = nx.Graph()

for author_sub_count in count_list:
    sub_list = author_sub_count.most_common(10)
    # Compute all the combinations of subreddit pairs
    sub_combinations = list(itertools.combinations(sub_list, 2))
    for sub_pair in sub_combinations:
        node1 = sub_pair[0][0]
        node2 = sub_pair[1][0]
        if G.has_edge(node1, node2):
            G[node1][node2]['weight'] +=1
        else:
            G.add_edge(node1, node2, weight=1)
print("Total number of edges {}, and nodes {}".format(G.number_of_edges(),G.number_of_nodes()))
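
The pair-counting step can be sketched without networkx to see how the weights arise. The per-author counts below are invented, and the pairs are stored in a plain `Counter` keyed by sorted name pairs, which is an assumption of this sketch rather than what the cell above does.

```python
import itertools
from collections import Counter

# Invented per-author comment counts
toy_count_list = [
    Counter({"askscience": 5, "physics": 2, "space": 1}),
    Counter({"askscience": 3, "physics": 1}),
]

pair_weights = Counter()
for author_sub_count in toy_count_list:
    # keep the names of the author's 10 most frequent subreddits
    top_subs = [name for name, _ in author_sub_count.most_common(10)]
    # sort so that (a, b) and (b, a) map to the same key
    for pair in itertools.combinations(sorted(top_subs), 2):
        pair_weights[pair] += 1

print(pair_weights)
```

A pair shared by two authors gets weight 2, while a pair seen for a single author stays at 1.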

In [None]:
# Sparsify the graph
to_remove = [edge for edge in G.edges.data() if edge[2]['weight'] < 2]
G.remove_edges_from(to_remove)

In [None]:
# Remove isolated nodes
G.remove_nodes_from(list(nx.isolates(G)))
print("Total number of edges {}, and nodes {}".format(G.number_of_edges(),G.number_of_nodes()))
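
The effect of the two clean-up steps can be previewed on a small hand-made edge list. This is a plain-Python sketch with invented weights, not a replacement for the networkx calls above.

```python
# Invented weighted edges: (subreddit, subreddit) -> number of shared authors
toy_edges = {
    ("askscience", "physics"): 2,
    ("askscience", "space"): 1,
    ("physics", "space"): 1,
    ("askscience", "chemistry"): 3,
}

# Sparsify: keep only the edges shared by at least 2 authors
kept_edges = {pair: w for pair, w in toy_edges.items() if w >= 2}

# Nodes that still touch an edge; the others are isolated and dropped
kept_nodes = sorted({node for pair in kept_edges for node in pair})

print(kept_edges)
print(kept_nodes)
```

`space` loses both of its edges and therefore disappears from the node list.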

In [None]:
# Write the graph to a file
path = 'graph.gexf'
nx.write_gexf(G,path)

An example of the graph visualization you can obtain using Gephi:
![Reddit neighbors](figures/redditneighbors.png "Reddit neighbors")