# Spiketrap Homework

Build a strategy to download and store all reddit posts and comments (including upvotes and downvotes) for a given subreddit (eg reddit.com/r/sanfrancisco).

Write down an executable script in any language to run your strategy.

Storage of your choice among Redis, MongoDB, or Mysql. Up to you choose which one you think fits best and/or you are more familiar with.

Tutorial from: https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c

In [49]:
with open('api_key.txt', 'r') as key_file:
    CLIENT_ID, SECRET_KEY = key_file.read().strip('\n').split('\n')

In [50]:
import requests

In [51]:
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY)

In [52]:
with open('secret.txt', 'r') as pw_file:
    user, pw = pw_file.read().strip('\n').split('\n')

In [55]:
user_data = {
    'grant_type': 'password',
    'username': user,
    'password': pw
}

In [56]:
headers = {'User-Agent': 'HwAPI/0.0.1'}

 ### Important: This access Token will expire after 2 hours (or 1?), a new one has to be requested (permanent?) https://github.com/reddit-archive/reddit/wiki/OAuth2

In [57]:
res = requests.post('https://www.reddit.com/api/v1/access_token', 
                     auth=auth, data=user_data, headers=headers)

In [None]:
# res.json()

In [59]:
TOKEN = res.json()['access_token']

In [60]:
headers['Authorization'] = f'bearer {TOKEN}'

### Important: The API has a limit of requests per minute, monitor the usage. https://github.com/reddit-archive/reddit/wiki/API

In [61]:
res = requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

In [None]:
# res.json()

In [15]:
sf_hot = requests.get('https://oauth.reddit.com/r/sanfrancisco/hot', headers=headers)

In [16]:
sf_hot.json()['data']

{'after': 't3_ovkd2u',
 'dist': 26,
 'modhash': None,
 'geo_filter': None,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'sanfrancisco',
    'selftext': 'Post about upcoming events, new things you’ve spotted around the city, or just little mundane sanfranciscoisms that strike your fancy. You can even do a little self-promotion here, if you abide by the rules in the sidebar. \n\n----\n\n* [Archive of previous daily discussions](https://www.reddit.com/r/sanfrancisco/search/?q=author%3Aautomoderator&amp;sort=new&amp;restrict_sr=on)\n* [Official San Francisco COVID-19 Data Tracker.](https://data.sfgov.org/stories/s/fjki-2fab) Complete with data &amp; easy to read charts &amp; graphs.\n* [Additional Covid info](https://covidactnow.org/us/ca/county/san_francisco_county?s=61890)',
    'author_fullname': 't2_6l4z3',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'DAILY BULLSHIT — Sunday August 1, 2021',

In [17]:
import pandas as pd
from datetime import datetime
from time import sleep

In [18]:
data = pd.DataFrame()  # initialize dataframe
params = {'limit': 5}

In [19]:
# we use this function to convert responses to dataframes
def df_from_response(res):
    # initialize temp dataframe for batch of data in response
    df = pd.DataFrame()

    # loop through each post pulled from res and append to df
    for post in res.json()['data']['children']:
        df = df.append({
            'subreddit': post['data']['subreddit'],
            'title': post['data']['title'],
            'selftext': post['data']['selftext'],
            'upvote_ratio': post['data']['upvote_ratio'],
            'ups': post['data']['ups'],
            'downs': post['data']['downs'],
            'score': post['data']['score'],
            'link_flair_css_class': post['data']['link_flair_css_class'],
            'created_utc': datetime.fromtimestamp(post['data']['created_utc']).strftime('%Y-%m-%dT%H:%M:%SZ'),
            'id': post['data']['id'],
            'kind': post['kind']
        }, ignore_index=True)

    return df

In [20]:
# loop through 10 times (returning 1K posts)
for i in range(3):
    # make request
    res = requests.get("https://oauth.reddit.com/r/sanfrancisco/new",
                       headers=headers,
                       params=params)

    # get dataframe from response
    new_df = df_from_response(res)
    # take the final row (oldest entry)
    row = new_df.iloc[len(new_df)-1]
    # create fullname
    fullname = row['kind'] + '_' + row['id']
    # add/update fullname in params
    params['after'] = fullname
    
    # append new_df to data
    data = data.append(new_df, ignore_index=True)
    
    sleep(1)

In [21]:
data

Unnamed: 0,created_utc,downs,id,kind,link_flair_css_class,score,selftext,subreddit,title,ups,upvote_ratio
0,2021-08-02T09:00:47Z,0.0,ow9dhm,t3,,1.0,Question for home owners - has anyone appealed...,sanfrancisco,San Francisco Property Tax Appeal,1.0,0.67
1,2021-08-02T06:45:44Z,0.0,ow7ng7,t3,pic,139.0,,sanfrancisco,Feels like home..,139.0,0.97
2,2021-08-02T06:37:05Z,0.0,ow7jac,t3,,76.0,Today was such a beautiful day in the city! Wi...,sanfrancisco,Gorgeous Day,76.0,0.89
3,2021-08-02T06:10:10Z,0.0,ow75ji,t3,pic,41.0,,sanfrancisco,Things get pretty spectacular around here when...,41.0,0.91
4,2021-08-02T05:27:30Z,0.0,ow6jf0,t3,,281.0,They are trying to do their job. I want to get...,sanfrancisco,PSA: Please stop harassing the muni operators ...,281.0,0.9
5,2021-08-02T05:27:03Z,0.0,ow6j7i,t3,pic,62.0,,sanfrancisco,One of the bay bridges,62.0,0.92
6,2021-08-02T04:54:02Z,0.0,ow61kp,t3,pic,10.0,,sanfrancisco,"Under the bridge (golden gate, OC)",10.0,0.77
7,2021-08-02T04:26:24Z,0.0,ow5n4a,t3,,0.0,"Hi all,\n\nSo me and my friend are moving from...",sanfrancisco,Looking for a 2 bedroom apartment in SF with $...,0.0,0.29
8,2021-08-02T04:25:21Z,0.0,ow5mkh,t3,,2.0,only big hero 6 fans will get this.,sanfrancisco,would you like it if san francisco was remodel...,2.0,0.52
9,2021-08-02T03:55:06Z,0.0,ow55t8,t3,,162.0,,sanfrancisco,Richmond at night,162.0,0.95


In [22]:
post_id = "ovq541"

In [23]:
res = requests.get(f"https://oauth.reddit.com/r/sanfrancisco/comments/{post_id}",
                   headers=headers,
                   params=params)

In [24]:
res.json()

[{'kind': 'Listing',
  'data': {'after': None,
   'dist': 1,
   'modhash': None,
   'geo_filter': '',
   'children': [{'kind': 't3',
     'data': {'approved_at_utc': None,
      'subreddit': 'sanfrancisco',
      'selftext': 'Post about upcoming events, new things you’ve spotted around the city, or just little mundane sanfranciscoisms that strike your fancy. You can even do a little self-promotion here, if you abide by the rules in the sidebar. \n\n----\n\n* [Archive of previous daily discussions](https://www.reddit.com/r/sanfrancisco/search/?q=author%3Aautomoderator&amp;sort=new&amp;restrict_sr=on)\n* [Official San Francisco COVID-19 Data Tracker.](https://data.sfgov.org/stories/s/fjki-2fab) Complete with data &amp; easy to read charts &amp; graphs.\n* [Additional Covid info](https://covidactnow.org/us/ca/county/san_francisco_county?s=61890)',
      'user_reports': [],
      'saved': False,
      'mod_reason_title': None,
      'gilded': 0,
      'clicked': False,
      'title': 'DAIL

In [25]:
single_comment = res.json()[1]['data']['children'][0]
single_comment

{'kind': 't1',
 'data': {'total_awards_received': 0,
  'approved_at_utc': None,
  'author_is_blocked': False,
  'comment_type': None,
  'awarders': [],
  'mod_reason_by': None,
  'banned_by': None,
  'ups': 11,
  'author_flair_type': 'richtext',
  'removal_reason': None,
  'link_id': 't3_ovq541',
  'author_flair_template_id': None,
  'likes': None,
  'replies': {'kind': 'Listing',
   'data': {'after': None,
    'dist': None,
    'modhash': None,
    'geo_filter': '',
    'children': [{'kind': 't1',
      'data': {'total_awards_received': 0,
       'approved_at_utc': None,
       'author_is_blocked': False,
       'comment_type': None,
       'awarders': [],
       'mod_reason_by': None,
       'banned_by': None,
       'ups': 16,
       'author_flair_type': 'richtext',
       'removal_reason': None,
       'link_id': 't3_ovq541',
       'author_flair_template_id': 'ebc9c89a-3ac7-11e3-832e-12313d18f999',
       'likes': None,
       'replies': '',
       'author_fullname': 't2_g6wrh',
 

In [26]:
comment_keys = single_comment['data'].keys()
comment_keys

dict_keys(['total_awards_received', 'approved_at_utc', 'author_is_blocked', 'comment_type', 'awarders', 'mod_reason_by', 'banned_by', 'ups', 'author_flair_type', 'removal_reason', 'link_id', 'author_flair_template_id', 'likes', 'replies', 'author_fullname', 'saved', 'id', 'banned_at_utc', 'mod_reason_title', 'gilded', 'archived', 'collapsed_reason_code', 'no_follow', 'author', 'can_mod_post', 'send_replies', 'parent_id', 'score', 'approved_by', 'report_reasons', 'author_premium', 'all_awardings', 'subreddit_id', 'body', 'edited', 'user_reports', 'author_flair_css_class', 'downs', 'is_submitter', 'collapsed', 'author_flair_richtext', 'author_patreon_flair', 'body_html', 'gildings', 'collapsed_reason', 'associated_award', 'stickied', 'subreddit_type', 'can_gild', 'top_awarded_type', 'author_flair_text_color', 'score_hidden', 'permalink', 'num_reports', 'locked', 'name', 'created', 'subreddit', 'author_flair_text', 'treatment_tags', 'created_utc', 'subreddit_name_prefixed', 'controversial

API Call to obtain the "more" comments after a call to sanfrancisco/comments/post_id. The number of comments returned is limited in number.

Check the response of https://www.reddit.com/dev/api#GET_api_morechildren

In [39]:
more_comments = ['h7bz4gz', 'h7brflr', 'h7cxasf']

In [71]:
full_post_id = 't3_ovq541'

In [72]:
children_correct = ','.join(more_comments)
children_correct

'h7bz4gz,h7brflr,h7cxasf'

In [73]:
more_params = {'api_type': 'json',
              'children': children_correct,
              'limit_children': False,
              'link_id': full_post_id,
              'sort': 'new'}

In [74]:
response = requests.get(f"https://oauth.reddit.com/api/morechildren",
                        headers=headers,
                        params=more_params)

In [75]:
response.json()

{'json': {'errors': [],
  'data': {'things': [{'kind': 't1',
     'data': {'total_awards_received': 0,
      'approved_at_utc': None,
      'author_is_blocked': False,
      'comment_type': None,
      'edited': False,
      'mod_reason_by': None,
      'banned_by': None,
      'author_flair_type': 'text',
      'removal_reason': None,
      'link_id': 't3_ovq541',
      'author_flair_template_id': None,
      'likes': None,
      'replies': '',
      'author_fullname': 't2_wc0we',
      'saved': False,
      'id': 'h7cxasf',
      'banned_at_utc': None,
      'mod_reason_title': None,
      'gilded': 0,
      'archived': False,
      'collapsed_reason_code': None,
      'no_follow': True,
      'author': 'sf_slugger',
      'can_mod_post': False,
      'send_replies': True,
      'parent_id': 't3_ovq541',
      'score': 3,
      'approved_by': None,
      'author_premium': False,
      'mod_note': None,
      'all_awardings': [],
      'subreddit_id': 't5_2qh3u',
      'body': 'fuck y