# Data extraction for Reddit

Build a strategy to download and store all Reddit posts and comments (including upvotes and downvotes) for a given subreddit (e.g. reddit.com/r/sanfrancisco).

Write an executable script in any language that implements your strategy.

Store the data in Redis, MongoDB, or MySQL: choose whichever you think fits best and/or you are most familiar with.
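As one possible take on the storage choice, the sketch below assumes MongoDB, since posts and comments arrive as nested JSON and map directly onto documents. The helper names `to_document` and `upsert_things` are hypothetical; `collection` is assumed to be a pymongo `Collection`, and only its `replace_one(filter, doc, upsert=True)` method is used. Using the Reddit fullname as `_id` means that re-running the download overwrites existing documents (and refreshes their vote counts) instead of duplicating them.

```python
def to_document(thing):
    """Turn a Reddit 'thing' ({'kind': ..., 'data': {...}}) into a MongoDB document."""
    doc = dict(thing['data'])
    # The fullname (e.g. 't3_ovq541') is unique across posts and comments.
    doc['_id'] = f"{thing['kind']}_{thing['data']['id']}"
    doc['kind'] = thing['kind']
    return doc


def upsert_things(collection, things):
    """Insert or overwrite each thing; returns the number of documents written."""
    written = 0
    for thing in things:
        doc = to_document(thing)
        # upsert=True inserts the document when no match exists yet
        collection.replace_one({'_id': doc['_id']}, doc, upsert=True)
        written += 1
    return written
```

With a running MongoDB instance this could be called as `upsert_things(MongoClient().reddit.things, res.json()['data']['children'])`, where the database and collection names are placeholders.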

Tutorial from: https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c

In [None]:
with open('api_key.txt', 'r') as key_file:
    CLIENT_ID, SECRET_KEY = key_file.read().strip('\n').split('\n')

In [None]:
import requests

In [None]:
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY)

In [None]:
with open('secret.txt', 'r') as pw_file:
    user, pw = pw_file.read().strip('\n').split('\n')

In [None]:
user_data = {
    'grant_type': 'password',
    'username': user,
    'password': pw
}

In [None]:
headers = {'User-Agent': 'HwAPI/0.0.1'}

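### Important: The access token expires after a limited time (the `expires_in` field of the token response gives its lifetime in seconds; Reddit's OAuth2 wiki describes it as one hour), after which a new one has to be requested. https://github.com/reddit-archive/reddit/wiki/OAuth2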
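A minimal sketch of how token expiry could be handled in a long-running download: cache the token together with its expiry time and fetch a new one shortly before it runs out. The `TokenCache` class and its `margin` and `clock` parameters are hypothetical; `fetch_token` is assumed to be any callable that returns the token endpoint's JSON as a dict containing `access_token` and `expires_in`.

```python
import time


class TokenCache:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(self, fetch_token, margin=60, clock=time.time):
        self._fetch_token = fetch_token  # callable returning the token JSON as a dict
        self._margin = margin            # refresh this many seconds early
        self._clock = clock              # injectable clock, handy for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            payload = self._fetch_token()
            self._token = payload['access_token']
            self._expires_at = now + payload['expires_in']
        return self._token
```

In this notebook, `fetch_token` could wrap the `requests.post(...)` call to `https://www.reddit.com/api/v1/access_token` and return `res.json()`; the `Authorization` header would then be rebuilt from `cache.get()` before each request.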

In [None]:
res = requests.post('https://www.reddit.com/api/v1/access_token', 
                     auth=auth, data=user_data, headers=headers)

In [None]:
# res.json()

In [None]:
TOKEN = res.json()['access_token']

In [None]:
headers['Authorization'] = f'bearer {TOKEN}'

### Important: The API has a limit of requests per minute, monitor the usage. https://github.com/reddit-archive/reddit/wiki/API
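One way to monitor usage is to read the rate-limit headers that Reddit attaches to API responses (`X-Ratelimit-Remaining` and `X-Ratelimit-Reset`). The sketch below is a hypothetical helper, `seconds_to_wait`, that turns those headers into a pause length; the `min_remaining` threshold is an assumption chosen for illustration.

```python
def seconds_to_wait(response_headers, min_remaining=1):
    """Return how long to pause before the next request, based on the rate-limit headers."""
    # Normalise header names so plain dicts work as well as requests' headers
    lowered = {key.lower(): value for key, value in response_headers.items()}
    remaining = lowered.get('x-ratelimit-remaining')
    reset = lowered.get('x-ratelimit-reset')
    if remaining is None or reset is None:
        return 0.0  # headers absent: nothing to act on
    if float(remaining) > min_remaining:
        return 0.0  # enough requests left in the current window
    return float(reset)  # seconds until the window resets
```

After any request, `sleep(seconds_to_wait(res.headers))` would pause only when the remaining budget is nearly used up.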

In [None]:
res = requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

In [None]:
# res.json()

In [None]:
sf_hot = requests.get('https://oauth.reddit.com/r/sanfrancisco/hot', headers=headers)

In [None]:
sf_hot.json()['data']

In [None]:
import pandas as pd
from datetime import datetime
from time import sleep

In [None]:
data = pd.DataFrame()  # initialize dataframe
params = {'limit': 5}

In [None]:
# we use this function to convert responses to dataframes
def df_from_response(res):
    # collect one dict per post (DataFrame.append was removed in pandas 2.0)
    rows = []

    # loop through each post pulled from res and collect its fields
    for post in res.json()['data']['children']:
        rows.append({
            'subreddit': post['data']['subreddit'],
            'title': post['data']['title'],
            'selftext': post['data']['selftext'],
            'upvote_ratio': post['data']['upvote_ratio'],
            'ups': post['data']['ups'],
            'downs': post['data']['downs'],
            'score': post['data']['score'],
            'link_flair_css_class': post['data']['link_flair_css_class'],
            # created_utc is a UTC epoch timestamp, so format it in UTC to match the 'Z' suffix
            'created_utc': pd.to_datetime(post['data']['created_utc'], unit='s', utc=True).strftime('%Y-%m-%dT%H:%M:%SZ'),
            'id': post['data']['id'],
            'kind': post['kind']
        })

    return pd.DataFrame(rows)

In [None]:
# loop 3 times with limit=5 (returning up to 15 posts)
for i in range(3):
    # make request
    res = requests.get("https://oauth.reddit.com/r/sanfrancisco/new",
                       headers=headers,
                       params=params)

    # get dataframe from response
    new_df = df_from_response(res)
    # stop when the listing has no more posts
    if new_df.empty:
        break
    # take the final row (oldest entry)
    row = new_df.iloc[-1]
    # create fullname
    fullname = row['kind'] + '_' + row['id']
    # add/update fullname in params
    params['after'] = fullname

    # append new_df to data (DataFrame.append was removed in pandas 2.0)
    data = pd.concat([data, new_df], ignore_index=True)

    sleep(1)
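The loop above builds the `after` cursor by hand from the last row. Listing responses also carry that cursor directly in `data.after`, which is `None` on the last page, so pagination can be sketched as a generator that follows it. The name `iter_listing` and the `fetch_page` callable are hypothetical; `fetch_page(after)` is assumed to return the parsed JSON of one listing page.

```python
def iter_listing(fetch_page, max_pages=3):
    """Yield every child of a listing, following the `after` cursor page by page."""
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        children = listing['data']['children']
        yield from children
        # Reddit returns the cursor for the next page, or None when there is none
        after = listing['data']['after']
        if after is None or not children:
            break
```

Here `fetch_page` could be `lambda after: requests.get(url, headers=headers, params={'limit': 100, 'after': after}).json()`, since `requests` omits parameters whose value is `None` and listings accept a `limit` of at most 100.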

In [None]:
data

In [None]:
post_id = "ovq541"

In [None]:
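# pass fresh params here: the listing `params` still carry the `after` cursor from the loop above
res = requests.get(f"https://oauth.reddit.com/r/sanfrancisco/comments/{post_id}",
                   headers=headers,
                   params={'limit': 5})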

In [None]:
res.json()

In [None]:
single_comment = res.json()[1]['data']['children'][0]
single_comment

In [None]:
comment_keys = single_comment['data'].keys()
comment_keys

API call to fetch the "more" comments that were left out of the response to sanfrancisco/comments/post_id, which returns only a limited number of comments.

See the response format at https://www.reddit.com/dev/api#GET_api_morechildren
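The IDs to request do not have to be copied by hand: in the comments response, omitted comments appear as children of kind `more` whose `data.children` lists their IDs. The sketch below walks a comment tree to collect them and splits them into comma-joined batches. The function names are hypothetical, and the batch size of 100 IDs per request is an assumption to check against the API documentation.

```python
def collect_more_ids(children):
    """Walk a comment tree and return the IDs hidden behind 'more' placeholders."""
    ids = []
    for child in children:
        if child['kind'] == 'more':
            ids.extend(child['data']['children'])
        elif child['kind'] == 't1':
            replies = child['data'].get('replies')
            # `replies` is an empty string when a comment has no replies
            if replies:
                ids.extend(collect_more_ids(replies['data']['children']))
    return ids


def batches(ids, size=100):
    """Split the ID list into comma-joined strings of at most `size` IDs each."""
    return [','.join(ids[i:i + size]) for i in range(0, len(ids), size)]
```

For the comments response above, the input would be `res.json()[1]['data']['children']`, and each string returned by `batches` could serve as the `children` parameter of one `morechildren` request.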

In [None]:
more_comments = ['h7bz4gz', 'h7brflr', 'h7cxasf']

In [None]:
full_post_id = 't3_ovq541'

In [None]:
children_correct = ','.join(more_comments)
children_correct

In [None]:
more_params = {'api_type': 'json',
              'children': children_correct,
              'limit_children': False,
              'link_id': full_post_id,
              'sort': 'new'}

In [None]:
response = requests.get("https://oauth.reddit.com/api/morechildren",
                        headers=headers,
                        params=more_params)

In [None]:
response.json()

In [None]:
response.json()['json']['data']['things']