# Project 3: Web APIs & Classification

## 1. Problem Statement

For project 3, your goal is two-fold:

1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

## 2. Executive Summary

### Content

## 3. Data Collection

### 3.1 Import libraries:

In [41]:
### import libraries
import pandas as pd
import requests
import random
import time

# Show up to 999 columns so the whole DataFrame is visible
pd.options.display.max_columns = 999

### 3.2 Explore which items to scrape from reddit.com

The two subreddits to scrape are:
1. nosleep
2. Thetruthishere

The `requests` library is used to gather the data, i.e. posts from reddit.com.

In [42]:
### URL for the first subreddit:
url = 'https://www.reddit.com/r/nosleep.json'

Because Reddit throttles Python's default user agent, I need to set a custom user-agent for the requests to work.

In [43]:
### custom user-agent
headers = {'User-agent': 'Pony Inc 1.0'}
res = requests.get(url, headers = headers)

In [44]:
### check the status; 200 means the request succeeded
res.status_code

200

In [45]:
#Use res.json() to convert the response into a dictionary format and set this to a variable
nosleep_dict = res.json()

#### Initial exploration of the data

In [46]:
# 1st layer of dict: It has two keys, 'kind' & 'data'
sorted(nosleep_dict.keys())

['data', 'kind']

In [47]:
# 2nd layer of dict: the value under 'data' has 5 keys.
# 'children' & 'after' are the two keys that I would like to scrape
sorted(nosleep_dict['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [48]:
# 3rd layer of dict: each item in 'children' again has two keys, 'kind' & 'data'
# again, 'data' holds the info that I will need
sorted(nosleep_dict['data']['children'][0].keys())

['data', 'kind']

In [49]:
# convert the 'data' of each item in the 3rd layer into a DataFrame for a better view
# selftext (only has a value from the 3rd row onwards) is the post text that I would like to compile for modelling

df = pd.DataFrame(p['data'] for p in nosleep_dict['data']['children'])
df.head(3)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id
0,,nosleep,,t2_c446v4f,False,,0,False,February 2020 contest nominations,[],r/nosleep,False,6,,0,False,t3_fdub8s,False,dark,,public,56,1,{},,False,[],,False,False,,{},,False,56,,False,,False,,[],{},[writing],False,,1583439000.0,text,6,,,text,redd.it,False,,,,,,False,False,False,False,False,"[{'count': 1, 'is_enabled': True, 'subreddit_i...",[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fdub8s,True,,TheCusterWolf,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fdub8s/february_2020_conte...,all_ads,True,https://redd.it/fduax3,13825083,1583410000.0,0,,False,
1,,nosleep,,t2_m297o,False,,0,False,January 2020 Winners!,[],r/nosleep,False,6,,0,False,t3_fecu80,False,dark,,public,54,0,{},,False,[],,False,False,,{},,False,54,,False,,False,,[],{},[writing],False,,1583527000.0,text,6,,,text,redd.it,True,,,,,,False,False,False,False,False,[],[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fecu80,True,,poppy_moonray,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fecu80/january_2020_winners/,all_ads,True,https://redd.it/fectho,13825083,1583498000.0,0,,False,
2,,nosleep,"**1.**\n\nIt was a shock when our family cat, ...",t2_4t6heh2e,False,,1,False,"ALL EIGHTEEN LIVES OF OMEN, THE CAT",[],r/nosleep,False,6,,0,False,t3_fetxdk,False,dark,,public,2351,4,{},,False,[],,False,False,,{},,False,2351,,True,,1.58359e+09,,[],"{'gid_1': 2, 'gid_2': 1}",[writing],True,,1583609000.0,text,6,,,text,self.nosleep,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'count': 1, 'is_enabled': True, 'subreddit_i...",[],False,False,False,False,,False,,,,t5_2rm4d,,,,fetxdk,True,,Max-Voynich,,78,True,all_ads,False,[],False,,/r/nosleep/comments/fetxdk/all_eighteen_lives_...,all_ads,False,https://www.reddit.com/r/nosleep/comments/fetx...,13825083,1583580000.0,3,,False,


In [50]:
# By default, Reddit returns 25 posts per request. There are 27 rows here because the first 2 are stickied moderator posts with no selftext.
df.shape

(27, 100)
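The two rows without text are stickied moderator posts, so they carry nothing useful for a text classifier. One way to drop them before building the DataFrame is sketched below. The helper name `drop_empty_posts` and the sample records are illustrative assumptions, not part of the notebook; only the `title` and `selftext` keys come from the Reddit response shown above.

```python
def drop_empty_posts(posts):
    """Keep only posts whose 'selftext' holds non-whitespace text."""
    return [p for p in posts if (p.get('selftext') or '').strip()]


# Toy records shaped like the items in nosleep_dict['data']['children'][i]['data']
sample_posts = [
    {'title': 'February 2020 contest nominations', 'selftext': ''},
    {'title': 'January 2020 Winners!', 'selftext': ''},
    {'title': 'ALL EIGHTEEN LIVES OF OMEN, THE CAT',
     'selftext': 'It was a shock when our family cat, ...'},
]

kept = drop_empty_posts(sample_posts)
print(len(kept), kept[0]['title'])
```

The same filter could be applied to the list returned by the scraping function before it is converted with `pd.DataFrame`.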

By default, Reddit returns 25 posts per request.
To get the next 25 posts, I need the name (ID) of the last post,
which is the value of the 'after' key mentioned in the previous few cells.


In [51]:
# This is the name of the last post.
nosleep_dict['data']['after']

't3_fey7gw'
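How the `after` value turns into the next request can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `next_page_params` and a hand-written stand-in for `res.json()`; it makes no network call.

```python
def next_page_params(listing):
    """Build the query parameters for the page that follows `listing`.

    Returns None when the listing reports no further page.
    """
    after = listing['data']['after']
    if after is None:
        return None
    return {'after': after}


# Minimal stand-in for res.json(); only the 'after' field matters here
listing = {'data': {'after': 't3_fey7gw', 'children': []}}
params = next_page_params(listing)
print(params)
```

The resulting dictionary would then be passed along as `requests.get(url, params=params, headers=headers)`.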

### 3.3 Collecting posts from two subreddits

Below is the function that loops to collect more posts from reddit.com.

However, Reddit limits the number of requests per second you're allowed to make. Thus, a delay is needed between requests, using `time.sleep()`.


In [52]:
##### Function to scrape posts from reddit.com ######
def collect_post(url, after):
    posts = []      # list to store the scraped posts
    headers = {'User-agent': 'tamtam 8.1'}       # custom user-agent

    for i in range(40):
        if after is None:
            params = {}          # the first 25 posts
        else:
            params = {'after' : after}    # name of the last post on the previous page, used to request the next 25 posts
            #print('id', params)
        
        res = requests.get(url, params = params, headers = headers)
        
        if res.status_code == 200:           # status_code 200 means the request succeeded
            current_dict = res.json()        # convert the JSON response into a dictionary

            # keep the 'data' value of every item in ['data']['children']
            current_post = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_post)  # extend adds each post as its own element instead of nesting the list

            after = current_dict['data']['after']   # name of the last post, used to request the next 25 posts
            print('last ID:', after)
            if after is None:           # no further page; stop rather than restart from the first page
                break
        else:
            print('status error!', res.status_code)
            break
        
        # sleep for a random duration to look more 'natural', instead of using a fixed timer
        sleep_duration = random.randint(2,10)
        #print(sleep_duration)
        time.sleep(sleep_duration)

    # check the number of posts collected
    print('length of collected posts:', len(posts))
    # check the number of unique IDs
    print('uniqueID:', len(set([p['name'] for p in posts])))
    
    return posts


#### To scrape posts from the first subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [53]:
to_scrape = False

**Uncomment** the line below to initialise an empty list for the FIRST scrape.
**Comment it out again** by adding back the `#` after the first scrape, that is, before re-running the scrape in the next cell. Otherwise, `posts_1` will be reset to an empty list instead of continuing to accumulate the collected posts.
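An alternative to commenting and uncommenting the line is to create the list only when it does not exist yet, so re-running the cell never wipes posts already collected. A minimal sketch, assuming a hypothetical helper `ensure_list` that is not part of this notebook:

```python
def ensure_list(namespace, name):
    """Return namespace[name], creating an empty list there on first use."""
    return namespace.setdefault(name, [])


# In a notebook, globals() is the namespace that holds the notebook's variables
posts_1 = ensure_list(globals(), 'posts_1')
posts_1.extend([{'name': 't3_example'}])      # placeholder record for illustration

# Running the same line again returns the existing list instead of resetting it
posts_1 = ensure_list(globals(), 'posts_1')
print(len(posts_1))
```

With this pattern the initialisation cell can stay as it is between scraping runs.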

In [54]:
#posts_1 = []

In [55]:
#### scrape posts from the 1st subreddit
if to_scrape:
    after = None    # set to None for the first scrape; afterwards, use the final 'last ID:' value
                    # printed by the function, e.g. 't3_f0nv9e'

    url_1 = 'https://www.reddit.com/r/nosleep.json'    # url for 1st subreddit to scrape
    scrape_1 = collect_post(url_1, after)              # call collect_post to loop and scrape from reddit

    posts_1.extend(scrape_1)
    df_1 = pd.DataFrame(posts_1)
    df_1.drop_duplicates(subset = 'title', inplace = True)  # drop duplicated 'title'
    df_1.to_csv('../datasets/nosleep1.csv', index = False)   # export compiled df_1 to csv file

#### To scrape posts from the 2nd subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [56]:
to_scrape = False

Similarly, **uncomment** the line below to initialise an empty list for the first scrape of the 2nd subreddit.
**Comment it out again** by adding back the `#` after the first scrape, that is, before re-running the scrape in the next cell. Otherwise, `posts_2` will be reset to an empty list instead of continuing to accumulate the collected posts.

In [57]:
#posts_2 = []

In [58]:
#### scrape posts from the 2nd subreddit
if to_scrape:
    after = 't3_dsmmvk'    # resuming from an earlier run; set to None for the first scrape, afterwards use
                           # the final 'last ID:' value printed by the function, e.g. 't3_f0nv9e'

    url_2 = 'https://www.reddit.com/r/Thetruthishere.json'    # url for 2nd subreddit to scrape
    scrape_2 = collect_post(url_2, after)           # call collect_post to loop and scrape from reddit

    posts_2.extend(scrape_2)
    df_2 = pd.DataFrame(posts_2)
    df_2.drop_duplicates(subset = 'title', inplace = True)         # drop duplicated 'title'
    df_2.to_csv('../datasets/thetrueishere1.csv', index = False)   # export the de-duplicated df_2 to csv file

#### Check the number of posts collected & ensure only unique posts are saved

In [59]:
### Check the number of posts collected
len(posts_2)

1996

In [60]:
### Check the number of unique titles among the posts collected
len(set([p['title'] for p in posts_1]))

793
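Counting unique titles and counting unique post IDs are not the same check: Reddit's `name` field identifies a post, while two different posts can share a title. The sketch below uses made-up records (the IDs and titles are placeholders, and `count_unique` is a hypothetical helper) to show how the two counts can diverge.

```python
def count_unique(posts, key):
    """Number of distinct values stored under `key` across the posts."""
    return len({p[key] for p in posts})


# Placeholder records: three distinct posts, two of which share a title
toy_posts = [
    {'name': 't3_aaa', 'title': 'Update'},
    {'name': 't3_bbb', 'title': 'Update'},
    {'name': 't3_ccc', 'title': 'A different story'},
]

print('unique names: ', count_unique(toy_posts, 'name'))
print('unique titles:', count_unique(toy_posts, 'title'))
```

Dropping duplicates on `title` therefore also discards distinct posts that happen to share a title, whereas dropping on `name` would keep them.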

In [61]:
### check the shape of the data saved to csv
## the number of rows saved in the csv file matches the number of unique titles.
# This ensures that only unique posts are saved.
df_1.shape

(793, 101)

In [62]:
df_2.shape

(994, 102)