# Reddit API Data Collection
###### By: Nick Gayliard

In [1]:
import requests
import time
import pandas as pd
import numpy as np
import re
import json
import pdb

### GET requests

In [2]:
url = 'https://www.reddit.com/r/nba.json'

req = requests.get(url)

In [3]:
req

<Response [429]>

https://httpstatuses.com/429

### Requests with parameters / queries

The reddit API returned a 429 (Too Many Requests) error because our request had no 'User-agent' header. For the reddit API, that value can be any string. Requirements differ from API to API: some need no special headers at all, while many require a private key issued to you by the company. Be sure to PROTECT your API keys, especially ones attached to bank accounts / credit cards (e.g. Amazon Web Services and Google API keys).
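
One common way to keep a key out of the notebook itself is to read it from an environment variable. The sketch below is illustrative only: the `build_headers` helper, the `MY_API_KEY` variable name, and the bearer-token header format are all assumptions, not something the reddit endpoint used here requires (it only needed a 'User-agent').

```python
import os

def build_headers(user_agent, key_env_var='MY_API_KEY'):
    """Return request headers, adding a key only if the environment provides one.

    Hypothetical helper: the env var name and the bearer-token header format
    are assumptions; check your API's documentation for the real scheme.
    """
    headers = {'User-agent': user_agent}
    api_key = os.environ.get(key_env_var)   # None if the variable is not set
    if api_key:
        headers['Authorization'] = 'Bearer ' + api_key
    return headers

# The resulting dict can be passed as requests.get(url, headers=demo_headers)
demo_headers = build_headers('Nick')
print(sorted(demo_headers.keys()))
```

Because the key lives in the environment rather than in a cell, it does not end up in the saved notebook or in version control.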

In [4]:
req = requests.get(url, headers = {'User-agent' : 'Nick'})

In [5]:
req.status_code

200

In [89]:
req.content

b'{"kind": "Listing", "data": {"modhash": "", "dist": 27, "children": [{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "nba", "selftext": "# Today\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Yesterday\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Top Highlights:\\n\\n0. [Danny Green on why Marc Gasol didn\'t have a speech at the Raptors parade: \\"He\'s like \'I\'m drunk bro, that\'s my song!\' The real Memphis came out in Marc when he started drinking. I think he bit me at one point too. That\'s when I told Matt, \'You can\'t give Marc the mic. He might say something crazy.\\"](https://streamable.com/73m1y) | [(Comments)](https://reddit.com/r/nba/comments/c5crq0)\\n\\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\\n\\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydi

#### Sample URL with a query

In [70]:
req2 = requests.get(url, headers = {'User-agent' : 'Jonnel'}, params = {'before' : 't3_c5rayb'})

In [71]:
req2.url

'https://www.reddit.com/r/nba.json?before=t3_c5rayb'

##### Everything after the '?' symbol in the URL is the query string, which asks the API for specific information. Check the API documentation to see which parameters you can use and what information each one returns.
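
To see how a `params` dict maps onto the part of the URL after the '?', here is a small sketch using only the standard library's `urllib.parse` (requests does the equivalent encoding for you). The `base_url`, `full_url`, and `query` names are just for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'https://www.reddit.com/r/nba.json'
params = {'before': 't3_c5rayb'}

# Build the full URL by hand: base + '?' + encoded key=value pairs
full_url = base_url + '?' + urlencode(params)
print(full_url)

# Going the other way: pull the query back out of a URL as a dict of lists
query = parse_qs(urlsplit(full_url).query)
print(query)
```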

In [73]:
req2.text

'{"kind": "Listing", "data": {"modhash": "", "dist": 3, "children": [{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "nba", "selftext": "# Today\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Yesterday\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Top Highlights:\\n\\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\\n\\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\\n\\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? \\"Kobe Bryant\\"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\\n\\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: \\"I said, \\u2018You need to trade Patrick Ewin

### Another reason to not use pd.read_json()

In [75]:
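# read_json only unpacks the top level of the response: 'kind' and 'data' become
# columns, and the posts stay buried in a single 'children' cell
df = pd.read_json(req.text)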

In [76]:
df

Unnamed: 0,kind,data
after,Listing,t3_c5p5zn
before,Listing,
children,Listing,"[{'kind': 't3', 'data': {'approved_at_utc': No..."
dist,Listing,27
modhash,Listing,


In [67]:
json.loads(req.content).keys()

dict_keys(['kind', 'data'])
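
Since the posts sit two levels down, it can be easier to parse the response yourself and flatten only the part you need. A minimal sketch, using a tiny made-up string in the same Listing shape as the response above (the `sample_text` values and the `parsed` and `rows` names are invented for illustration):

```python
import json

# Tiny stand-in for req.text: same nesting as a reddit Listing, made-up values
sample_text = '''
{"kind": "Listing",
 "data": {"after": "t3_bbb", "before": null,
          "children": [
            {"kind": "t3", "data": {"name": "t3_aaa", "title": "First post"}},
            {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second post"}}
          ]}}
'''

parsed = json.loads(sample_text)

# Each child wraps the actual post under its own 'data' key
rows = [child['data'] for child in parsed['data']['children']]
print(rows)
```

A list of flat dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, which gives one row per post.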

### Let's check out our response content

In [77]:
# Raw response body as bytes (note the b'' prefix), not yet parsed into Python objects

req2.content

b'{"kind": "Listing", "data": {"modhash": "", "dist": 3, "children": [{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "nba", "selftext": "# Today\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Yesterday\'s Games:\\n\\n|Tip-off|Away||Home||GDT|PGT|\\n|:--|:--|:-:|:--|--:|:-:|:-:|\\n\\n# Top Highlights:\\n\\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\\n\\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\\n\\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? \\"Kobe Bryant\\"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\\n\\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: \\"I said, \\u2018You need to trade Patrick Ewi

#### Parse the JSON into a Python dictionary and navigate through it to the data we want

In [78]:
page_pull = req2.json()

In [79]:
page_pull

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 3,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'nba',
     'selftext': '# Today\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Yesterday\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Top Highlights:\n\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\n\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\n\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? "Kobe Bryant"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\n\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: "I said, ‘You need to trade Patrick Ewing... [Shaq] 

In [80]:
page_pull.keys()

dict_keys(['kind', 'data'])

In [81]:
page_pull['data']

{'modhash': '',
 'dist': 3,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'nba',
    'selftext': '# Today\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Yesterday\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Top Highlights:\n\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\n\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\n\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? "Kobe Bryant"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\n\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: "I said, ‘You need to trade Patrick Ewing... [Shaq] would love to come to NY.](https:/

In [82]:
page_pull['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [83]:
page_pull['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'nba',
   'selftext': '# Today\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Yesterday\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Top Highlights:\n\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\n\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\n\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? "Kobe Bryant"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\n\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: "I said, ‘You need to trade Patrick Ewing... [Shaq] would love to come to NY.](https://streamable.com/dt0b6) | [(Comments)](https:

In [84]:
page_pull['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'nba',
  'selftext': '# Today\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Yesterday\'s Games:\n\n|Tip-off|Away||Home||GDT|PGT|\n|:--|:--|:-:|:--|--:|:-:|:-:|\n\n# Top Highlights:\n\n0. [Michael Jordan hits a Triple Clutch Layup](https://gfycat.com/leafydimwittedarcticfox) | [(Comments)](https://reddit.com/r/nba/comments/c5pw9l)\n\n0. [Donovan Mitchell\'s luggage gets mixed up with a tourist\'s](https://streamable.com/xt8p8) | [(Comments)](https://reddit.com/r/nba/comments/c5fw47)\n\n0. [Andre Iguodala asked: Who\'s tougher to guard, Kawhi Leonard or LeBron James? "Kobe Bryant"](https://streamable.com/yocnw) | [(Comments)](https://reddit.com/r/nba/comments/c5oy6t)\n\n0. [Don Nelson says he got fired by the Knicks for trying to trade Ewing in 1996 for Shaq: "I said, ‘You need to trade Patrick Ewing... [Shaq] would love to come to NY.](https://streamable.com/dt0b6) | [(Comments)](https://re

In [85]:
len(page_pull['data']['children'])

3

Fields to pull from each post: name, subreddit, selftext, title, num_comments, url, score
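
The fields listed above can be pulled out of a single post with a small helper. This is only a sketch: `extract_fields`, `FIELDS`, and `sample_post` are hypothetical names, and the sample values are made up.

```python
FIELDS = ['name', 'subreddit', 'selftext', 'title', 'num_comments', 'url', 'score']

def extract_fields(post, fields=FIELDS):
    """Return only the requested keys from one post's 'data' dict.

    Hypothetical helper; uses .get() so a missing key becomes None
    instead of raising a KeyError.
    """
    post_data = post['data']
    return {field: post_data.get(field) for field in fields}

# Made-up post in the same shape as one entry of page_pull['data']['children']
sample_post = {'kind': 't3',
               'data': {'name': 't3_aaa', 'subreddit': 'nba', 'selftext': '',
                        'title': 'Example title', 'num_comments': 5,
                        'url': 'https://example.com', 'score': 10,
                        'approved_at_utc': None}}

print(extract_fields(sample_post))
```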

In [50]:
# When you are indexing deeply into json, it can help to make variable names for certain levels of indexing
# that you plan on reusing, to improve readability and make sure you don't make indexing errors as often

post_list = page_pull['data']['children']

In [51]:
post_list[1].keys()

dict_keys(['kind', 'data'])

In [52]:
for post in post_list:
    print(post['data']['name'])

t3_c5qx6b
t3_c5oy5m
t3_c5pw9l
t3_c5rayb
t3_c5shdu
t3_c5qc2k
t3_c5oy6t
t3_c5olzb
t3_c5q8u4
t3_c5ta9z
t3_c5qwrb
t3_c5q4pb
t3_c5rax5
t3_c5pfjd
t3_c5p68t
t3_c5ra92
t3_c5qlr9
t3_c5iu4g
t3_c5oadk
t3_c5ufj8
t3_c5rnzj
t3_c5rnbu
t3_c5pnwq
t3_c5pcj6
t3_c5phw1
t3_c5te75
t3_c5p5zn


In [53]:
post_list[0]['data']['title']

'Game Threads Index + Daily Discussion (June 26, 2019)'

### Scrape and build a dictionary to make a dataframe

In [54]:
# Sloppy way! Too much indexing in loop

post_dict = {}

for count, post in enumerate(post_list):
    post_dict[post_list[count]['data']['name']] = [post_list[count]['data']['title'], post_list[count]['data']['num_comments']]

In [55]:
# CLEAN WAY - using an indexer variable!!

post_dict = {}

for post in post_list:
    post_indexer = post['data']
    post_dict[post_indexer['name']] = [post_indexer['title'], post_indexer['num_comments']]

In [56]:
df = pd.DataFrame(post_dict).T
df.columns = ['title', 'num_comments']
df

Unnamed: 0,title,num_comments
t3_c5qx6b,Game Threads Index + Daily Discussion (June 26...,20
t3_c5oy5m,[Serious Discussion] Season Review: Portland T...,114
t3_c5pw9l,Michael Jordan hits a Triple Clutch Layup,1076
t3_c5rayb,[Wojnarowski] Golden State Warriors star Kevin...,922
t3_c5shdu,PSA: Carmelo Anthony is at the same age as Vin...,181
t3_c5qc2k,DeMar DeRozan delivers a powerful message on m...,87
t3_c5oy6t,"Andre Iguodala asked: Who's tougher to guard, ...",919
t3_c5olzb,The amount of confusion surrounding casual fan...,226
t3_c5q8u4,Kobe Bryant makes two defenders collide into e...,160
t3_c5ta9z,[Enes Kanter] I kind of feel like Zion is over...,410


## Put it in a function!

In [59]:
# function to scrape up to 40 pages of a reddit listing (takes a reddit .json url)
# returns a list of raw post dictionaries (the 'children' entries from each page)

headers = {'User-agent' : 'Jonnel'}

def scraper_bike(url):
    posts = []
    after = None    # no 'after' token yet on the first request

    for page in range(40):
        params = {'after' : after}
        pagepull = requests.get(url = url, params = params, headers = headers)
        page_dict = pagepull.json()
        posts.extend(page_dict['data']['children'])
        after = page_dict['data']['after']
        # stop early once the API reports no further pages
        if after is None:
            break
        # sleep is a best practice (probably not necessary for such a small scrape)
        time.sleep(1)

    return posts

In [58]:
nba_post_list = scraper_bike('https://www.reddit.com/r/nba.json')

In [30]:
len(nba_post_list)

982
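
The paging logic can be tried out without touching the network by swapping the HTTP call for a stand-in function. In this sketch, `FAKE_PAGES`, `fake_fetch`, and `collect_pages` are hypothetical names and the pages are made up; it follows the `after` cursor as the scraper above does, and stops once the cursor comes back empty.

```python
# Made-up pages keyed by the 'after' token that requests them (None = first page)
FAKE_PAGES = {
    None:   {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}], 'after': None}},
}

def fake_fetch(after):
    """Stand-in for requests.get(...).json(); returns one canned page."""
    return FAKE_PAGES[after]

def collect_pages(fetch, max_pages=40):
    """Follow the 'after' cursor until it runs out or max_pages is reached."""
    posts = []
    after = None
    for _ in range(max_pages):
        page_dict = fetch(after)
        posts.extend(page_dict['data']['children'])
        after = page_dict['data']['after']
        if after is None:      # no further pages
            break
    return posts

names = [post['data']['name'] for post in collect_pages(fake_fetch)]
print(names)
```

Passing the fetch step in as an argument is a design choice for testing; a real run would wrap the `requests.get` call in a function with the same signature.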

In [31]:
# function to convert posts to DataFrame - duplicate posts collapse into one row because
# the unique id 'name' is used as the dictionary key (and then becomes the index)
# Extract: name (as index) and subreddit, title, selftext (as columns)

def posts_to_df(post_list):
    post_dict = {}
    
    for post in post_list:
        ind = post['data']
        post_dict[ind['name']] = [ind['subreddit'], ind['title'], ind['selftext']]

    df_name = pd.DataFrame(post_dict)
    df_name = df_name.T
    df_name.columns = ['subreddit', 'title', 'selftext']
    
    return df_name

In [32]:
posts_to_df(nba_post_list)

Unnamed: 0,subreddit,title,selftext
t3_c5qx6b,nba,Game Threads Index + Daily Discussion (June 26...,# Today's Games:\n\n|Tip-off|Away||Home||GDT|P...
t3_c5oy5m,nba,[Serious Discussion] Season Review: Portland T...,**PORTLAND TRAIL BLAZERS** [](/POR)\n\nHEAD CO...
t3_c5pw9l,nba,Michael Jordan hits a Triple Clutch Layup,
t3_c5rayb,nba,[Wojnarowski] Golden State Warriors star Kevin...,
t3_c5shdu,nba,PSA: Carmelo Anthony is at the same age as Vin...,
t3_c5qc2k,nba,DeMar DeRozan delivers a powerful message on m...,
t3_c5oy6t,nba,"Andre Iguodala asked: Who's tougher to guard, ...",
t3_c5ta9z,nba,[Enes Kanter] I kind of feel like Zion is over...,
t3_c5q8u4,nba,Kobe Bryant makes two defenders collide into e...,
t3_c5olzb,nba,The amount of confusion surrounding casual fan...,Literally every instagram post where the winne...


## Couple extra functions for simplicity in running

In [86]:
# takes scraper function and url - outputs dataframe

def scrape_to_df(scrape_func, url):
    
    return posts_to_df(scrape_func(url))

### Function to scrape and save to csv. HIGHLY recommended when gathering data online, so that you keep a local copy (and a remote one if you want to be safe)

In [87]:
# NOTE: YOU NEED A CSV ALREADY MADE TO SAVE TO IN THIS CASE. 
# YOU COULD ADD CODE TO CREATE A NEW CSV IF NONE EXISTS

# scrape, import csv, concat, drop duplicate, and output to csv

# takes in scraper function, url, csv filename to import, csv filename to output

# Outputs - Concatenated DataFrame as csv

def scrape_add(scrape_func, url, import_file, export_file):
    
    scrape_df = posts_to_df(scrape_func(url))
    
    imported_df = pd.read_csv(import_file, index_col = 'Unnamed: 0')
    
    concat_df = pd.concat([imported_df, scrape_df])
    
    concat_df = concat_df[~concat_df.index.duplicated(keep='first')]
    
    concat_df.to_csv(export_file)
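
The note above mentions adding code to create the CSV when none exists. One possible shape for that check is sketched below with the standard library's `csv` module rather than pandas, so it runs without a DataFrame; `merge_into_csv` and its arguments are hypothetical names. Like `keep='first'` above, rows already in the file win over new rows with the same `name`.

```python
import csv
import os

def merge_into_csv(new_rows, csv_path, fieldnames):
    """Merge new_rows (list of dicts with a 'name' key) into csv_path.

    Creates the file if it does not exist; returns the number of rows written.
    """
    merged = {}

    # Load existing rows first, if there is a file to load
    if os.path.exists(csv_path):
        with open(csv_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                merged.setdefault(row['name'], row)

    # setdefault only stores a row when its name has not been seen yet
    for row in new_rows:
        merged.setdefault(row['name'], row)

    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged.values())

    return len(merged)
```

The same existence check (`os.path.exists`) could guard the `pd.read_csv` call in a pandas version, falling back to the freshly scraped DataFrame when there is no file yet.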