# Project 3: Web APIs & Classification

## 1. Problem Statement

For project 3, your goal is two-fold:

Using Reddit's API, you'll collect posts from two subreddits of your choosing.
You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

## 2. Executive Summary

### Content

## 3. Data Collection

### 3.1 Import libraries:

In [37]:
### import libraries
import pandas as pd
import requests
import random
import time

# Show all columns when displaying a DataFrame
pd.options.display.max_columns = 999

### 3.2 Explore which items to scrape from reddit.com

The two subreddits to scrape are:
1. nosleep
2. Thetruthishere

Use `requests` to fetch the URL to scrape.

In [6]:
### url for the first subreddit:
url = 'https://www.reddit.com/r/nosleep.json'

Because Reddit throttles Python's default user agent, I need to set a custom user-agent for the requests to work.

In [7]:
### custom user-agent
headers = {'User-agent': 'Pony Inc 1.0'}
res = requests.get(url, headers = headers)

In [8]:
### check the status; 200 means the request succeeded
res.status_code

200

In [9]:
#Use res.json() to convert the response into a dictionary format and set this to a variable
nosleep_dict = res.json()

#### Initial exploration of the data

In [13]:
# 1st layer of dict: It has two keys, 'kind' & 'data'
sorted(nosleep_dict.keys())

['data', 'kind']

In [12]:
# 2nd layer of dict: the value under key 'data' has 5 keys.
# 'children' & 'after' are the two keys that I would like to scrape
sorted(nosleep_dict['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [35]:
# 3rd layer of dict: each item in 'children' has another two keys, 'kind' & 'data'
# again, 'data' holds the info that I will need
sorted(nosleep_dict['data']['children'][0].keys())

['data', 'kind']

In [40]:
# convert the 'data' values of the 3rd layer into a DataFrame for a better view
# selftext (only populated from the 3rd row onward) is the post text that I would like to compile for modelling

df = pd.DataFrame(p['data'] for p in nosleep_dict['data']['children'])
df.head(3)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,author_cakeday
0,,nosleep,,t2_c446v4f,False,,0,False,February 2020 contest nominations,[],r/nosleep,False,6,,0,False,t3_fdub8s,False,dark,,public,49,1,{},,False,[],,False,False,,{},,False,49,,False,,False,,[],{},[writing],False,,1583439000.0,text,6,,,text,redd.it,False,,,,,,False,False,False,False,False,"[{'count': 1, 'is_enabled': True, 'subreddit_i...",[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fdub8s,True,,TheCusterWolf,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fdub8s/february_2020_conte...,all_ads,True,https://redd.it/fduax3,13822519,1583410000.0,0,,False,,
1,,nosleep,,t2_m297o,False,,0,False,January 2020 Winners!,[],r/nosleep,False,6,,0,False,t3_fecu80,False,dark,,public,30,0,{},,False,[],,False,False,,{},,False,30,,False,,False,,[],{},[writing],False,,1583527000.0,text,6,,,text,redd.it,False,,,,,,False,False,False,False,False,[],[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fecu80,True,,poppy_moonray,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fecu80/january_2020_winners/,all_ads,True,https://redd.it/fectho,13822519,1583498000.0,0,,False,,
2,,nosleep,A half-dozen police cars crowded the gravel dr...,t2_6fer5,False,,1,False,There's Been a String of Suicides in my Town. ...,[],r/nosleep,False,6,,0,False,t3_fee8yr,False,dark,,public,1967,6,{},,False,[],,False,False,,{},,False,1967,,True,,1.58351e+09,,[],"{'gid_1': 2, 'gid_2': 1, 'gid_3': 1}",[writing],True,,1583534000.0,text,6,,,text,self.nosleep,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'count': 1, 'is_enabled': True, 'subreddit_i...",[],False,False,False,False,,False,,,,t5_2rm4d,,,,fee8yr,True,,Worchester_St,,49,True,all_ads,False,[],False,,/r/nosleep/comments/fee8yr/theres_been_a_strin...,all_ads,False,https://www.reddit.com/r/nosleep/comments/fee8...,13822519,1583505000.0,0,,False,,


In [41]:
# By default, Reddit returns 25 posts per request. (note the first 2 rows are stickied moderator posts with no post text)
df.shape

(27, 101)
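
The exploration above suggests that each post sits under `['data']['children'][i]['data']`, and that the two extra rows appear to be stickied moderator posts with an empty `selftext`. Below is a minimal sketch of pulling out only the fields of interest, assuming the listing layout shown above. The helper name `extract_posts` and the sample listing are made up for illustration and are not part of the original notebook.

```python
def extract_posts(listing, skip_stickied=True):
    """Return name, title, and selftext for each post in a listing dict."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        if skip_stickied and post.get('stickied'):
            continue  # stickied announcements carry no story text
        rows.append({
            'name': post['name'],
            'title': post['title'],
            'selftext': post.get('selftext', ''),
        })
    return rows


# Hypothetical listing with the same nesting as res.json()
sample_listing = {
    'kind': 'Listing',
    'data': {
        'after': 't3_bbb',
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa', 'title': 'Contest',
                                    'selftext': '', 'stickied': True}},
            {'kind': 't3', 'data': {'name': 't3_bbb', 'title': 'A story',
                                    'selftext': 'Once...', 'stickied': False}},
        ],
    },
}

print(extract_posts(sample_listing))
```

With the real response, `extract_posts(nosleep_dict)` would be the equivalent call.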

By default, Reddit returns 25 posts per request.
To get the next 25 posts, I need the name (ID) of the last post returned,
which is stored under the key 'after' mentioned a few cells earlier.


In [12]:
# This is the name of the last post.
nosleep_dict['data']['after']

't3_fe0yvl'
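
In the runs recorded later in this notebook, `after` comes back as `None` when the listing runs out, and a request without `after` starts again from the first page. The sketch below shows one way to follow the `after` token and stop at that point. It assumes a caller-supplied `fetch_page` callable (a hypothetical name) so the paging logic can be tried without network access. The fake pages are invented for illustration.

```python
def collect_pages(fetch_page, max_pages=100):
    """Follow the 'after' token until it runs out or max_pages is reached.

    fetch_page is any callable that takes the current 'after' value
    (None for the first page) and returns a listing dict.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # no further pages; stop rather than start over
            break
    return posts


# Hypothetical two-page listing standing in for the Reddit responses
fake_pages = {
    None: {'data': {'after': 't3_b',
                    'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}]}},
    't3_b': {'data': {'after': None,
                      'children': [{'data': {'name': 't3_c'}}]}},
}

posts = collect_pages(lambda after: fake_pages[after])
print([p['name'] for p in posts])
```

Against Reddit itself, `fetch_page` could wrap the same call used in this notebook, for example `lambda after: requests.get(url, params={'after': after} if after else {}, headers=headers).json()`.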

## Data Collection



In [3]:
##### Loop to scrape posts from reddit.com ######

posts = []     # empty list to store the posts after scraping
after = None   # None for the first request; replaced by the last post ID for subsequent requests
headers = {'User-agent': 'Apple 10.1'}       # custom user-agent

for i in range(100):
    if after is None:
        params = {}
    else:
        params = {'after': after}       # name of the last post; needed to request the next page
        print('id', params)
        
    #url = 'https://www.reddit.com/r/nosleep.json'
    url = 'https://www.reddit.com/r/Thetruthishere.json'
    res = requests.get(url, params=params, headers=headers)
    
    if res.status_code == 200:          # status_code 200 means the request succeeded
        current_dict = res.json()       # convert the response into a dictionary (current_dict)
        
        # take the 'data' value of each item under ['data']['children']
        current_post = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_post)  # extend adds each post as its own item instead of nesting the list in posts
        
        after = current_dict['data']['after']    # save the last post ID to continue with the next 25 posts
        
    else:
        print('status error!', res.status_code)  # stop if the request to Reddit fails
        break
    
    
    # sleep for a random duration to look more 'natural' than a fixed timer
    sleep_duration = random.randint(2,10)
    print(sleep_duration)
    time.sleep(sleep_duration)

# Export the posts to csv
#pd.DataFrame(posts).to_csv('../datasets/nosleep2.csv', index = False)
pd.DataFrame(posts).to_csv('../datasets/thetrueishere.csv', index = False)

6
id {'after': 't3_fda074'}
4
id {'after': 't3_fbs4b0'}
4
id {'after': 't3_fb6osl'}
9
id {'after': 't3_f9mzvx'}
9
id {'after': 't3_f8kvvz'}
5
id {'after': 't3_f6yi8x'}
5
id {'after': 't3_f5kagc'}
10
id {'after': 't3_f426vw'}
9
id {'after': 't3_f23hw6'}
3
id {'after': 't3_f0nr7b'}
9
id {'after': 't3_eyzoto'}
2
id {'after': 't3_ewragu'}
4
id {'after': 't3_eu9njd'}
2
id {'after': 't3_et4biq'}
10
id {'after': 't3_ermpa1'}
8
id {'after': 't3_eqqna6'}
5
id {'after': 't3_ep3cid'}
7
id {'after': 't3_enshqd'}
6
id {'after': 't3_emv3ko'}
5
id {'after': 't3_eltott'}
9
id {'after': 't3_ekw1ra'}
3
id {'after': 't3_ejyxpx'}
9
id {'after': 't3_eibjnd'}
7
id {'after': 't3_eggvqm'}
9
id {'after': 't3_efbh79'}
6
id {'after': 't3_ee6fkz'}
2
id {'after': 't3_ec1ckf'}
8
id {'after': 't3_eahoqm'}
2
id {'after': 't3_e8m4kj'}
9
id {'after': 't3_e7rs6v'}
2
id {'after': 't3_e65blo'}
6
id {'after': 't3_e46563'}
6
id {'after': 't3_e1wny6'}
6
id {'after': 't3_e08lis'}
8
id {'after': 't3_dytj1o'}
3
id {'after': 't3

### 3.3 Collecting posts from two subreddits

Below is the function that loops over requests to collect more posts from reddit.com.

However, Reddit limits the number of requests per second you're allowed to make, so a delay is added between requests using `time.sleep()`.


In [145]:
##### Function to scrape posts from reddit.com ######
def collect_post(url, after):
    posts = []      # empty list to store the posts after scraping
    headers = {'User-agent': 'coscos 8.1'}       # custom user-agent
    
    for i in range(50):
        if after is None:
            params = {}          # the first 25 posts
        else: 
            params = {'after': after}    # name of the last post; requests the next 25 posts
            #print('id', params)
        
        res = requests.get(url, params = params, headers = headers)
        
        if res.status_code == 200:          # status_code 200 means the request succeeded
            current_dict = res.json()       # convert the response into a dictionary (current_dict)
                
            # take the 'data' value of each item under ['data']['children']
            current_post = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_post)  # extend adds each post as its own item instead of nesting the list in posts
            
            # ID for the next 25 posts; it is None once the listing is exhausted,
            # in which case the next request starts again from the first page
            after = current_dict['data']['after']
            print('last ID:', after)
        else:
            print('status error!', res.status_code)  # stop if the request to Reddit fails
            break
        
        # sleep for a random duration to look more 'natural' than a fixed timer
        sleep_duration = random.randint(2,10)
        #print(sleep_duration)
        time.sleep(sleep_duration)
    
    # check the number of posts collected
    print('length of collected posts:', len(posts))
    # check the number of unique post IDs
    print('uniqueID:', len(set([p['name'] for p in posts])))
    
    return posts
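
The function above stops at the first non-200 status. When Reddit throttles a client, the response typically carries a non-200 status such as 429 (Too Many Requests), so an alternative is to wait and retry. This is a minimal sketch of that idea and not what the notebook actually does. `fetch_with_backoff`, `FakeResponse`, and the delay values are assumptions chosen for illustration.

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns status 200, waiting longer after each failure.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, such as a requests.Response.
    """
    for attempt in range(max_retries + 1):
        res = fetch()
        if res.status_code == 200:
            return res
        if attempt < max_retries:
            # exponential delay plus up to one second of random jitter
            sleep(base_delay * 2 ** attempt + random.random())
    return res  # last failed response, for the caller to inspect


class FakeResponse:
    """Hypothetical stand-in for a response object, used only in this demo."""
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 429, 200])
waits = []  # record the requested delays instead of actually sleeping
res = fetch_with_backoff(lambda: FakeResponse(next(statuses)), sleep=waits.append)
print(res.status_code, len(waits))
```

In the notebook, `fetch` could be `lambda: requests.get(url, params=params, headers=headers)`, leaving the default `sleep=time.sleep` in place.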


#### To scrape posts from the first subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [46]:
to_scrape = True

**Uncomment** the line below to initialise an empty list for the FIRST scrape,
then **comment it out again** by adding back the `#` before re-running the scrape in the next cell. Otherwise the list starts empty again instead of continuing to append the collected posts.

In [None]:
#posts_1 = []
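
As an alternative to commenting and uncommenting the line above, the list can be created only when it does not exist yet. This is a small sketch of that notebook pattern, not something the original workflow uses:

```python
# Create the list only if it is not defined yet, so re-running this cell
# keeps whatever was collected earlier in the session.
try:
    posts_1
except NameError:
    posts_1 = []

print(len(posts_1))
```

Restarting the kernel clears `posts_1`, so the next run of the cell starts from an empty list again.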

In [139]:
#### scrape posts from the 1st subreddit
if to_scrape:
    after = None    # first 25 posts
    url_1 = 'https://www.reddit.com/r/nosleep.json'    # url for 1st subreddit to scrape
    scrape_1 = collect_post(url_1, after)           # call collect_post to loop and scrape from Reddit
    
    posts_1.extend(scrape_1)
    df_1 = pd.DataFrame(posts_1)
    df_1 = df_1.drop_duplicates(subset = 'title')   # drop duplicated 'title'; assign back, since the call returns a new DataFrame
    df_1.to_csv('../datasets/nosleep.csv', index = False)    # export file to csv

last ID: t3_fdumzk
last ID: t3_fe8ymf
last ID: t3_fdwthg
last ID: t3_fdool9
last ID: t3_fdmtt6
last ID: t3_fdpfgt
last ID: t3_fdho05
last ID: t3_fdkgtq
last ID: t3_fcuakp
last ID: t3_fcsoax
last ID: t3_fcu9jo
last ID: t3_fd4ztq
last ID: t3_fcwpmw
last ID: t3_fd1s4n
last ID: t3_fcix6k
last ID: t3_fcpnvt
last ID: t3_fcjm0k
last ID: t3_fck34j
last ID: t3_fc79l2
last ID: t3_fck6je
last ID: t3_fc839k
last ID: t3_fcayae
last ID: t3_fcbiy6
last ID: t3_fcc8w0
last ID: t3_f86drm
last ID: t3_f8im2c
last ID: t3_f7wgnl
last ID: t3_f7idhs
last ID: t3_f7gtro
last ID: t3_f6t8kk
last ID: t3_f6jpfk
last ID: t3_f6na6u
last ID: None
last ID: t3_fdumzk
last ID: t3_fe8ymf
last ID: t3_fdwthg
last ID: t3_fdool9
last ID: t3_fdmtt6
last ID: t3_fdpfgt
last ID: t3_fdho05
last ID: t3_fdkgtq
last ID: t3_fcuakp
last ID: t3_fcsoax
last ID: t3_fcu9jo
last ID: t3_fd4ztq
last ID: t3_fcwpmw
last ID: t3_fd1s4n
last ID: t3_fcix6k
last ID: t3_fcpnvt
last ID: t3_fcjm0k
last ID: t3_fck34j
last ID: t3_fc79l2
last ID: t3_fck6j

KeyboardInterrupt: 

#### To scrape posts from the 2nd subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [137]:
to_scrape = False

100

**Uncomment** the line below to initialise an empty list for the 1st scrape, that is, when 'after' is **None** in the cell after next.

In [None]:
#posts_2 = []

In [146]:
#### scrape posts from the 2nd subreddit
if to_scrape:
    after = None    # first 25 posts
    url_2 = 'https://www.reddit.com/r/Thetruthishere.json'    # url for 2nd subreddit to scrape
    scrape_2 = collect_post(url_2, after)           # call collect_post to loop and scrape from Reddit
    
    posts_1.extend(scrape_2)                        # note: this appends to posts_1, not posts_2
    df_1 = pd.DataFrame(posts_1)
    df_1.drop_duplicates(subset = 'title', inplace = True)         # drop duplicated 'title'
    pd.DataFrame(scrape_2).to_csv('../datasets/thetrueishere.csv', index = False)    # export only this scrape to csv

last ID: t3_fd41ch
last ID: t3_fbgyng
last ID: t3_fatdoi
last ID: t3_f9h0s0
last ID: t3_f8kayp
last ID: t3_f72jpx
last ID: t3_f5tp9h
last ID: t3_f4szq2
last ID: t3_f28t9p
last ID: t3_f0nv9e
last ID: t3_eyozp7
last ID: t3_exfobo
last ID: t3_etzp15
last ID: t3_eszs65
last ID: t3_erhg1x
last ID: t3_eqyemz
last ID: t3_epocto
last ID: t3_eoa6u9
last ID: t3_emw8f6
last ID: t3_elwhwi
last ID: t3_ekzh9b
last ID: t3_ejuesf
last ID: t3_eixn4q
last ID: t3_egndh5
last ID: t3_eevhsg
last ID: t3_ee73if
last ID: t3_ec69q9
last ID: t3_eaxe0y
last ID: t3_e92xvk
last ID: t3_e7kiy0
last ID: t3_e649c5
last ID: t3_e3pwi8
last ID: t3_e28q4o
last ID: t3_e0i82f
last ID: t3_dzan50
last ID: t3_dxho2j
last ID: t3_dw5mki
last ID: t3_dttui5
last ID: t3_ds3v6c
last ID: None
last ID: t3_fd41ch
last ID: t3_fbgyng
last ID: t3_fatdoi
last ID: t3_f9h0s0
last ID: t3_f8kayp
last ID: t3_f72jpx
last ID: t3_f5tp9h
last ID: t3_f4szq2
last ID: t3_f28t9p
last ID: t3_f0nv9e
length of collected posts: 1248
uniqueID: 998


In [147]:
### Check the number of posts collected
len(posts_1)

1348

In [151]:
### Check the number of unique IDs (name) among the posts collected
len(set([p['name'] for p in posts_1]))

1098
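
The gap between the total count and the unique-ID count above points to repeated posts. A minimal sketch of removing them by the unique `name` field rather than by `title` follows. The helper `dedupe_by_name` and the sample posts are made up for illustration.

```python
def dedupe_by_name(posts):
    """Keep the first occurrence of each post, keyed on its 'name' field."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique


# Hypothetical posts; the third repeats the first
sample = [
    {'name': 't3_a', 'title': 'x'},
    {'name': 't3_b', 'title': 'y'},
    {'name': 't3_a', 'title': 'x'},
]
unique_posts = dedupe_by_name(sample)
print(len(sample), len(unique_posts))
```

On a DataFrame, `df.drop_duplicates(subset='name')` gives the same result. Keying on `name` avoids dropping two different posts that happen to share a title.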

In [149]:
df_1.shape

(1348, 103)