# Project 3: JSON Extraction

The two subreddits chosen are TalesFromYourServer and TalesFromTheFrontDesk.

TalesFromYourServer posts are mainly personal accounts of serving staff experiences with customers.

TalesFromTheFrontDesk posts are mainly personal accounts of front desk staff experiences with customers.

In [1]:
# urls for the subreddits:
# https://www.reddit.com/r/TalesFromYourServer
# https://www.reddit.com/r/TalesFromTheFrontDesk



In [2]:
url = 'https://www.reddit.com/r/TalesFromYourServer.json'

In [3]:
import requests
import pandas as pd
import time
import random

In [4]:
res = requests.get(url)

In [5]:
res.status_code

429

A status code of 429 means "Too Many Requests". When a browser such as Chrome on a Mac requests this address, it identifies itself to Reddit through its User-Agent header. Python's `requests` library sends its own default user agent instead. Because so many scripts already hit Reddit's API with that default, Reddit throttles requests that carry it.

We will change our request slightly so that it does not use the default user agent.

In [6]:
res = requests.get(url, headers={'User-agent': 'Spider Time!'})

In [7]:
res.status_code

200
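The header change above can be wrapped in a small helper so that every request identifies itself the same way. This is only a sketch: as far as we know, Reddit's API guidelines suggest a descriptive string of the form `<platform>:<app ID>:<version> (by /u/<username>)`, and the helper names and all four values below are hypothetical placeholders rather than anything used in this notebook.

```python
def build_user_agent(platform, app_id, version, username):
    """Return a descriptive User-Agent string (all arguments are placeholders)."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


def build_headers(user_agent):
    """Wrap the User-Agent string in the headers dict that requests.get() accepts."""
    return {"User-Agent": user_agent}


# Hypothetical values, chosen only for illustration.
headers = build_headers(
    build_user_agent("python", "tales-scraper", "0.1", "example_user")
)
print(headers)
```

The resulting dict could then be passed along as `requests.get(url, headers=headers)`, in place of the literal dict used in cell 6.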

In [8]:
reddit_dict = res.json()

In [9]:
print(len(reddit_dict))

2


In [10]:
reddit_dict

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'TalesFromYourServer',
     'selftext': "Whether you're a new server, busser, host, or bartender with a specific/general question or a veteran who has some good advice, feel free to jump in!\n\nThis thread will be pinned until it gets archived.",
     'author_fullname': 't2_8er8z',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Dedicated thread for new server advice',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/TalesFromYourServer',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': None,
     'hide_score': False,
     'name': 't3_dvvyw7',
     'quarantine': False,
     'link_flair_text_color': 'dark',
     'author_flair_background_color': None,
     'subreddit_type': 'public',
     'ups': 122,
   

In [11]:
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [12]:
reddit_dict['data']

{'modhash': '',
 'dist': 26,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'TalesFromYourServer',
    'selftext': "Whether you're a new server, busser, host, or bartender with a specific/general question or a veteran who has some good advice, feel free to jump in!\n\nThis thread will be pinned until it gets archived.",
    'author_fullname': 't2_8er8z',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'Dedicated thread for new server advice',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/TalesFromYourServer',
    'hidden': False,
    'pwls': 6,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': None,
    'hide_score': False,
    'name': 't3_dvvyw7',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'author_flair_background_color': None,
    'subreddit_type': 'public',
    'ups': 122,
    'total_awards_received': 0,
    'media_embed': {},
  

In [13]:
len(reddit_dict['data'])

5

In [14]:
reddit_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [15]:
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'TalesFromYourServer',
   'selftext': "Whether you're a new server, busser, host, or bartender with a specific/general question or a veteran who has some good advice, feel free to jump in!\n\nThis thread will be pinned until it gets archived.",
   'author_fullname': 't2_8er8z',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'Dedicated thread for new server advice',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/TalesFromYourServer',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': None,
   'hide_score': False,
   'name': 't3_dvvyw7',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 122,
   'total_awards_received': 0,
   'media_embed': {},
   'thumbnail_width': None,
   'author_flair_template_id': '71f86610-

In [16]:
len(reddit_dict['data']['children'])

26

In [17]:
reddit_dict['data']['children'][25]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'TalesFromYourServer',
  'selftext': 'Our managers have decided to dock EVERYONE’s hours because it’s an extremely slow season, and because of the virus.\n\nWe have now dropped to 1-2 shifts a week, and people who were more full time than many of us even, lost their hours.\n\nI have gotten a weekday job that pays more. I have kept myself on the weekends, but I’m not going to take a forced vacation just because the company is being frugal with how they want to pay their employees.\n\nThis is a major company with a TON of money. There are many of them but they can’t even afford parking passes for us.\n\nI’m not going to take my only day on the weekend off. Don’t ask me to give up my income because you can’t afford to pay me.',
  'author_fullname': 't2_4fjzzgpr',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'They’re requesting us to take time off because it’s the only way to kee

In [18]:
reddit_dict['data']['children'][25]['data']['subreddit']

'TalesFromYourServer'

The cell directly above gives you the class label, i.e. your target.

In [19]:
reddit_dict['data']['children'][0]['data']['title']

'Dedicated thread for new server advice'

That is the title of the first post (index 0).

In [20]:
reddit_dict['data']['children'][25]['data']['title']

'They’re requesting us to take time off because it’s the only way to keep their employees.'
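The indexing pattern used in the cells above (`['data']['children'][i]['data'][field]`) can be sketched as one helper that pulls a few fields out of every post. The `sample_dict` below is a hand-made stand-in for `reddit_dict`, trimmed to two posts, and `extract_fields` is a hypothetical name introduced only for this illustration.

```python
# Hand-made stand-in for reddit_dict: same nesting, far fewer keys.
sample_dict = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "TalesFromYourServer",
                                    "title": "Dedicated thread for new server advice",
                                    "name": "t3_dvvyw7"}},
            {"kind": "t3", "data": {"subreddit": "TalesFromYourServer",
                                    "title": "Finally stood up to a table last weekend",
                                    "name": "t3_fhe03c"}},
        ],
        "after": "t3_fhe03c",
    },
}


def extract_fields(listing, fields=("subreddit", "title", "name")):
    """Return one small dict per post; .get() yields None for a missing key."""
    return [
        {field: child["data"].get(field) for field in fields}
        for child in listing["data"]["children"]
    ]


rows = extract_fields(sample_dict)
for row in rows:
    print(row["name"], "|", row["title"])
```

Using `.get()` instead of square brackets is a deliberate choice here: not every post carries every key, so a missing field becomes `None` rather than raising a `KeyError`.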

We want to get all these posts into a pandas DataFrame, which we can then save to a CSV file.

In [21]:
posts = [p['data'] for p in reddit_dict['data']['children']]

In [22]:
pd.DataFrame(posts)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video
0,,TalesFromYourServer,"Whether you're a new server, busser, host, or ...",t2_8er8z,False,,0,False,Dedicated thread for new server advice,[],...,dark,/r/TalesFromYourServer/comments/dvvyw7/dedicat...,all_ads,True,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1573670000.0,0,,False
1,,TalesFromYourServer,"I've been serving for a few years, work at a s...",t2_jkbi9,False,,0,False,Finally stood up to a table last weekend,[],...,,/r/TalesFromYourServer/comments/fhe03c/finally...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1584008000.0,0,,False
2,,TalesFromYourServer,Before I was a mom I was a server and bartende...,t2_4k7ggsgu,False,,0,False,Counterpoint to the Creepy Old Men Post,[],...,,/r/TalesFromYourServer/comments/fh1zs9/counter...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583952000.0,0,,False
3,,TalesFromYourServer,"So this lady called in an order, it was a calz...",t2_15a6jg,False,,0,False,She was mad we didn't make the food she didn't...,[],...,,/r/TalesFromYourServer/comments/fh8wuf/she_was...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583979000.0,0,,False
4,,TalesFromYourServer,Thank you so much for actually parenting. Your...,t2_23eum4ut,False,,0,False,To my first table of the day,[],...,dark,/r/TalesFromYourServer/comments/fhb019/to_my_f...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583989000.0,0,,False
5,,TalesFromYourServer,This only works if he’s only making slightly v...,t2_5xyn8f,False,,1,False,!!YOUNG FEMALE SERVERS!! My tried and true met...,[],...,,/r/TalesFromYourServer/comments/fgs7hk/young_f...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583905000.0,1,,False
6,,TalesFromYourServer,"Alright there’s a TL;DR at the bottom, i’m on ...",t2_4xxvrl9s,False,,0,False,"No, I don’t want to see your crusty facebook m...",[],...,,/r/TalesFromYourServer/comments/fhbxta/no_i_do...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583994000.0,0,,False
7,,TalesFromYourServer,I work part-time at a retirement home in the U...,t2_265j4glq,False,,0,False,The Coronavirus Literally Took My Job Away,[],...,,/r/TalesFromYourServer/comments/fh8i9h/the_cor...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583977000.0,0,,False
8,,TalesFromYourServer,"I was serving a 12 top of a bunch of bikers, m...",t2_zm30o,False,,0,False,"For the love of God, stop telling us to smile",[],...,,/r/TalesFromYourServer/comments/fh3f5l/for_the...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583957000.0,0,,False
9,,TalesFromYourServer,I work at a tex-mex style restaurant in an Aus...,t2_opi9x80,False,,0,False,I didn't come to a Mexican restaurant to get F...,[],...,,/r/TalesFromYourServer/comments/fh5ia8/i_didnt...,all_ads,False,https://www.reddit.com/r/TalesFromYourServer/c...,331114,1583965000.0,0,,False


In [23]:
pd.DataFrame(posts).to_csv('posts.csv')

In [24]:
reddit_dict['data']['after']

't3_fhbdys'

In [25]:
reddit_dict['data']['children'][25]['data']['name']

't3_fhbdys'

This is the name of the last post, and it matches the `after` value above.

In [26]:
pd.DataFrame(posts)['name']

0     t3_dvvyw7
1     t3_fhe03c
2     t3_fh1zs9
3     t3_fh8wuf
4     t3_fhb019
5     t3_fgs7hk
6     t3_fhbxta
7     t3_fh8i9h
8     t3_fh3f5l
9     t3_fh5ia8
10    t3_fhh4mf
11    t3_fhftu7
12    t3_fhjr0v
13    t3_fh8tzz
14    t3_fhik58
15    t3_fh91m0
16    t3_fhfkh2
17    t3_fhiggc
18    t3_fh648d
19    t3_fh6j5d
20    t3_fh506f
21    t3_fh0cb7
22    t3_fhdfak
23    t3_fhcsyl
24    t3_fh5s0e
25    t3_fhbdys
Name: name, dtype: object

In [27]:
reddit_dict['data']['after']

't3_fhbdys'

Appending the `after` value to the URL gives the address of the next 25 posts:

In [28]:
url + '?after=' + reddit_dict['data']['after']

'https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fhbdys'
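The string concatenation above can also be sketched with `urllib.parse.urlencode`, which keeps the query string well-formed when more than one parameter is involved. The helper name `build_page_url` is hypothetical, and the optional `limit` parameter is an assumption: to our knowledge Reddit listings accept it, but this notebook never uses it.

```python
from urllib.parse import urlencode


def build_page_url(base_url, after=None, limit=None):
    """Return base_url with optional 'after' and 'limit' query parameters."""
    params = {}
    if after is not None:
        params["after"] = after
    if limit is not None:
        params["limit"] = limit  # assumption: not used anywhere in this notebook
    if not params:
        return base_url  # first page: no query string needed
    return base_url + "?" + urlencode(params)


base = "https://www.reddit.com/r/TalesFromYourServer.json"
print(build_page_url(base))
print(build_page_url(base, after="t3_fhbdys"))
```

With only `after` supplied, the result is the same URL as the one shown in the cell above.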

## Looping through the posts, 25 posts at a time

In [29]:
posts = []
after = None

# request up to 42 pages; the first request has no 'after' token
for _ in range(42):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    pd.DataFrame(posts).to_csv('talesfromyourserver.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromYourServer.json
44
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fhbdys
33
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fgpng3
39
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_ffq2fu
20
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_ff6u06
20
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fejk4d
21
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fdying
13
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fcto99
21
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fcmdsx
27
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fc8e2w
5
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fblsg8
25
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_fapy1c
47
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_f9ua9j
11
https://www.reddit.com/r/TalesFromYourServer.json?after=t3_f8wm2b
46
https://www.reddit.com/r/TalesFromYourServer.json?a

In [30]:
len(posts)

1036
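The loop above falls back to the first page whenever `after` is `None`, so if the listing runs out before all 42 requests are made, earlier posts may be collected a second time. A minimal sketch for checking and removing such repeats by their `name` field follows. `drop_duplicate_posts` and `sample_posts` are hypothetical and hand-made, and whether the real `posts` list contains repeats is not established here.

```python
def drop_duplicate_posts(posts, key="name"):
    """Keep the first occurrence of each post, identified by its `key` field."""
    seen = set()
    unique = []
    for post in posts:
        identifier = post[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(post)
    return unique


# Hand-made sample in which the first post appears twice.
sample_posts = [
    {"name": "t3_dvvyw7", "title": "Dedicated thread for new server advice"},
    {"name": "t3_fhe03c", "title": "Finally stood up to a table last weekend"},
    {"name": "t3_dvvyw7", "title": "Dedicated thread for new server advice"},
]
unique_posts = drop_duplicate_posts(sample_posts)
print(len(sample_posts), "->", len(unique_posts))
```

Comparing `len(posts)` with `len(drop_duplicate_posts(posts))` would show how many repeats, if any, ended up in the scrape.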

In [31]:
pd.DataFrame(posts).to_csv('../datasets/talesfromyourserver.csv', index = False)

In [32]:
posts

[{'approved_at_utc': None,
  'subreddit': 'TalesFromYourServer',
  'selftext': "Whether you're a new server, busser, host, or bartender with a specific/general question or a veteran who has some good advice, feel free to jump in!\n\nThis thread will be pinned until it gets archived.",
  'author_fullname': 't2_8er8z',
  'saved': False,
  'mod_reason_title': None,
  'gilded': 0,
  'clicked': False,
  'title': 'Dedicated thread for new server advice',
  'link_flair_richtext': [],
  'subreddit_name_prefixed': 'r/TalesFromYourServer',
  'hidden': False,
  'pwls': 6,
  'link_flair_css_class': None,
  'downs': 0,
  'thumbnail_height': None,
  'hide_score': False,
  'name': 't3_dvvyw7',
  'quarantine': False,
  'link_flair_text_color': 'dark',
  'author_flair_background_color': None,
  'subreddit_type': 'public',
  'ups': 119,
  'total_awards_received': 0,
  'media_embed': {},
  'thumbnail_width': None,
  'author_flair_template_id': '71f86610-60b2-11e2-8c3b-12313b088941',
  'is_original_conten

In [33]:
pd.DataFrame(posts)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,author_cakeday,crosspost_parent_list,crosspost_parent
0,,TalesFromYourServer,"Whether you're a new server, busser, host, or ...",t2_8er8z,False,,0,False,Dedicated thread for new server advice,[],...,331114,1.573670e+09,0,,False,,,,,
1,,TalesFromYourServer,"I've been serving for a few years, work at a s...",t2_jkbi9,False,,0,False,Finally stood up to a table last weekend,[],...,331114,1.584008e+09,0,,False,,,,,
2,,TalesFromYourServer,Before I was a mom I was a server and bartende...,t2_4k7ggsgu,False,,0,False,Counterpoint to the Creepy Old Men Post,[],...,331114,1.583952e+09,0,,False,,,,,
3,,TalesFromYourServer,"So this lady called in an order, it was a calz...",t2_15a6jg,False,,0,False,She was mad we didn't make the food she didn't...,[],...,331114,1.583979e+09,0,,False,,,,,
4,,TalesFromYourServer,Thank you so much for actually parenting. Your...,t2_23eum4ut,False,,0,False,To my first table of the day,[],...,331114,1.583989e+09,0,,False,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1031,,TalesFromYourServer,TW for what seems to be alcoholism\n\n\n\nDefi...,t2_5k6y3pbq,False,,0,False,Had a table get drunk during their child's bir...,[],...,331115,1.583738e+09,0,,False,,,,,
1032,,TalesFromYourServer,"As in cocaine, in a little baggie between the ...",t2_40ucgq3,False,,0,False,Found coke in my tip,[],...,331115,1.583739e+09,0,,False,,,,,
1033,,TalesFromYourServer,I don’t understand why people like to order se...,t2_4fwko748,False,,0,False,Ordering little by little.,[],...,331115,1.583780e+09,0,,False,,,,,
1034,,TalesFromYourServer,I work at a hibachi restaurant and people like...,t2_pm3eq,False,,0,False,Babies afraid of fire,[],...,331115,1.583718e+09,0,,False,,,,,


In [34]:
url = 'https://www.reddit.com/r/TalesFromTheFrontDesk.json'



posts = []
after = None

# request up to 42 pages; the first request has no 'after' token
for _ in range(42):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    pd.DataFrame(posts).to_csv('talesfromthefrontdesk.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromTheFrontDesk.json
12
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_fgbz9m
34
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_ff7yk1
14
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_fdyumz
55
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_fcqm8j
39
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_fbqj21
60
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_faftfp
22
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f9ptt9
50
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f895mf
18
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f7omgu
30
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f71hy9
18
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f5n6xs
59
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f4jqgr
54
https://www.reddit.com/r/TalesFromTheFrontDesk.json?after=t3_f311qm
36
https://www.reddit.com

In [35]:
len(posts)

1031

In [36]:
pd.DataFrame(posts).to_csv('../datasets/talesfromthefrontdesk.csv', index = False)
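Since the two scraping loops differ only in the URL and the output file, they could be folded into one function. The sketch below is an assumption-laden variant, not the code that produced the files above: `scrape_listing`, `fetch_with_requests`, and the `example.com` pages are hypothetical. It stops as soon as `after` is `None` (the original loops do not), and it leaves the CSV writing to the caller. The page fetcher is passed in as an argument so the paging logic can be tried offline.

```python
import random
import time


def scrape_listing(base_url, fetch_page, max_pages=42,
                   min_sleep=2, max_sleep=60, sleep=time.sleep):
    """Collect post dicts page by page until max_pages or the listing ends."""
    posts = []
    after = None
    for _ in range(max_pages):
        current_url = base_url if after is None else base_url + "?after=" + after
        page = fetch_page(current_url)
        if page is None:  # the fetcher signals a failed request with None
            break
        posts.extend(child["data"] for child in page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # no further pages; avoids restarting at page one
            break
        sleep(random.randint(min_sleep, max_sleep))
    return posts


def fetch_with_requests(url):
    """Network fetcher mirroring the notebook's request; not run in the demo."""
    import requests  # imported lazily so the offline demo has no dependency
    res = requests.get(url, headers={"User-agent": "Pony Inc 1.0"})
    if res.status_code != 200:
        print("Status error", res.status_code)
        return None
    return res.json()


# Offline demonstration with two hand-made pages.
fake_pages = {
    "https://example.com/r/demo.json":
        {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "https://example.com/r/demo.json?after=t3_a":
        {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}
demo_posts = scrape_listing("https://example.com/r/demo.json",
                            fake_pages.get, sleep=lambda seconds: None)
print([post["name"] for post in demo_posts])
```

For a real run, one would pass `fetch_with_requests` as `fetch_page` and then save the result with `pd.DataFrame(posts).to_csv(..., index=False)` as in the cells above.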