# Project 3: Web APIs & Classification
### Notebook 1: Data collection

## 1. Problem Statement

Product reviews are widely available online, and people tend to read them before making a purchase. However, some reviews are fake, written by paid writers to boost the sales of a particular product. How, then, can a consumer tell a genuine user review from a fake one?

We can use Natural Language Processing (NLP) to train a classifier that differentiates the two.

To do this, content from two subreddits was collected to evaluate different classifier models (Logistic Regression and Naive Bayes) on this binary classification problem.
The subreddits selected are:
1. [nosleep](https://www.reddit.com/r/nosleep/)
2. [Thetruthishere](https://www.reddit.com/r/Thetruthishere/)

These two subreddits were selected because both are about horror stories. Posts in `nosleep` are made-up horror stories, whereas posts in `Thetruthishere` recount true personal horror experiences.

Accuracy is used to evaluate the model, that is, how effectively it differentiates posts from the two subreddits.

Once a model can distinguish between these two subreddits, a similar approach could be adapted to detect made-up (fake) product reviews for consumers.

## 2. Executive Summary

### Content
- [Data Collection](#3.-Data-Collection)
- Data Cleaning and EDA
- Preprocessing
- Modeling
- Model Evaluation
- Conclusion and Recommendations


This project is divided into 3 notebooks:
- Notebook 1: Data Collection
- Notebook 2: Data Cleaning, EDA, Preprocessing
- Notebook 3: Modeling, Evaluation, Conclusion and Recommendations

**Data Collection** is done by web scraping with the `requests` library. By default, Reddit returns 25 posts per request. To get enough data, I use a `for` loop to scrape repeatedly, calling `time.sleep()` at the end of each iteration to pause between requests.

**Data Cleaning and EDA**
Cleaning involved removing duplicate posts and posts with null content. This left 834 posts from `nosleep` and 937 posts from `Thetruthishere`. Posts in `nosleep` were observed to be much longer than those in `Thetruthishere`. In addition, CountVectorizer was used to find the most frequent words in the two subreddits.

**Preprocessing**
Preprocessing includes removing HTML, removing non-letters, lemmatizing, and removing stopwords. The `Subreddit` column, which indicates the subreddit a post came from, is converted into a binary label by assigning `1` to `nosleep` and `0` to `Thetruthishere`. The data is divided into a train set (75%) and a test set (25%).
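
The cleaning steps listed above could look roughly like the following. This is a minimal, dependency-free sketch rather than the code used in notebook 2: the `clean_post` helper, the tiny `STOPWORDS` set, and the `LABELS` mapping name are all hypothetical, and lemmatizing is left out to avoid extra libraries.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list (assumption, not the project's list)
STOPWORDS = {"a", "an", "and", "i", "in", "is", "it", "of", "the", "to", "was"}

# Label encoding described above: nosleep -> 1, Thetruthishere -> 0
LABELS = {"nosleep": 1, "Thetruthishere": 0}


def clean_post(raw_text):
    """Return a lowercased, space-joined string of letter-only, non-stopword tokens."""
    text = html.unescape(raw_text)            # e.g. '&lt;p&gt;' becomes '<p>'
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)    # replace non-letters with spaces
    tokens = text.lower().split()
    # Lemmatizing would go here; omitted to keep the sketch dependency-free.
    return " ".join(t for t in tokens if t not in STOPWORDS)


sample = '&lt;div class="md"&gt;&lt;p&gt;It was 3 AM in the house!&lt;/p&gt;&lt;/div&gt;'
print(clean_post(sample))   # am house
```
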

**Modeling**
The baseline score is defined as the accuracy of the null model, which always predicts the majority class.
Baseline accuracy = *53%*

|Target Variable|Normalized Counts|
|---|---|
|1|0.529678|
|0|0.470322|

*where 1 equals `nosleep`, 0 equals `Thetruthishere`*
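
As an illustration of how a majority-class baseline is obtained, the sketch below computes it for a toy label list. The `baseline_accuracy` helper and the labels are hypothetical and are not the project data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical toy labels (1 = nosleep, 0 = Thetruthishere)
toy_labels = [1, 1, 1, 0, 0]
label, accuracy = baseline_accuracy(toy_labels)
print(label, accuracy)   # 1 0.6
```

Any trained model needs to beat this number to be worth using.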

Two vectorizers are used to transform each post's content (a string of words) into a numeric X matrix that can be used for modeling:
- CountVectorizer
- TfidfVectorizer

For each Vectorizer, two classification models are built:
- Multinomial Naive Bayes
- Logistic Regression

Model optimization was done with GridSearchCV to identify the optimal hyperparameters, which were then built into the classification models.
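
A grid search over one vectorizer and classifier pair might be set up as sketched below, using scikit-learn's `Pipeline` and `GridSearchCV`. The toy corpus and the parameter grid are assumptions for illustration only; the grid actually searched in notebook 3 is not shown in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = nosleep, 0 = Thetruthishere), not the scraped data
texts = [
    "the creature followed me home through the dark forest",
    "something in the basement whispered my name every night",
    "the mirror showed a face that was not mine",
    "my reflection smiled before i did",
    "i saw a strange light over my grandmother's house",
    "my dog barked at an empty corner for an hour",
    "we heard footsteps upstairs when nobody was home",
    "a door in our old house opened by itself",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier chained so both are tuned together
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed grid for illustration only
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Wrapping the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.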

**Model Evaluation**
Accuracy was used to evaluate how well the classification models perform. This is because a false positive (a post that is actually from `Thetruthishere` but is predicted to come from `nosleep`) is no more detrimental than a false negative.
Generally, all models perform well, with accuracy in the range of 92%-94%, and all outperform the baseline. New posts from each subreddit were pulled to check how well the models generalize to unseen data. The *Logistic Regression model using TfidfVectorizer* performs the best, as it predicts equally well on unseen data, with a consistent accuracy score of around **93%**.
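
To make the accuracy and false-positive terminology concrete, here is a small pure-Python sketch. The `confusion_counts` and `accuracy` helpers and the toy predictions are hypothetical; the positive class is taken to be `1` (`nosleep`), as in the table above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    counts = confusion_counts(y_true, y_pred)
    return (counts["tp"] + counts["tn"]) / len(y_true)


# Hypothetical toy labels and predictions (1 = nosleep, 0 = Thetruthishere)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))   # {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
print(accuracy(y_true, y_pred))
```

Accuracy weights both kinds of error equally, which suits a problem where neither mistake is costlier than the other.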

## 3. Data Collection

### 3.1 Import libraries:

In [1]:
### import libraries
import pandas as pd
import requests
import random
import time

#To visualize the whole grid
pd.options.display.max_columns = 999

### 3.2 Explore which items to scrape from reddit.com

The two subreddits selected for web scraping are:
1. nosleep
2. Thetruthishere

The `requests` library is used to gather the data, i.e. posts from reddit.com.

In [2]:
### URL for the first subreddit:
url = 'https://www.reddit.com/r/nosleep.json'

Because Reddit throttles Python's default user agent, I need to set a custom user-agent to get the requests to work.

In [3]:
### custom user-agent
headers = {'User-agent': 'Pony Inc 1.0'}
res = requests.get(url, headers = headers)

In [4]:
### check the status; 200 means the request succeeded
res.status_code

200

In [5]:
#Use res.json() to convert the response into a dictionary format and set this to a variable
nosleep_dict = res.json()

#### Initial exploration of the data

In [6]:
# 1st layer of dict: It has two keys, 'kind' & 'data'
sorted(nosleep_dict.keys())

['data', 'kind']

In [7]:
# 2nd layer of dict: the value under key 'data' has 5 keys.
# 'children' & 'after' are the two keys that I would like to scrape
sorted(nosleep_dict['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [8]:
# 3rd layer of dict: each child has another two keys, 'kind' & 'data'
# again, 'data' holds the info that I need
sorted(nosleep_dict['data']['children'][0].keys())

['data', 'kind']
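
The three layers explored above can be summarised with a heavily trimmed mock of the response. The `listing` dictionary, its post IDs, and the `extract_posts` helper below are hypothetical stand-ins that only mirror the nesting seen in the cells above.

```python
# Hypothetical, trimmed listing with the same nesting as the real response
listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example2",
        "children": [
            {"kind": "t3", "data": {"name": "t3_example1", "selftext": "first post"}},
            {"kind": "t3", "data": {"name": "t3_example2", "selftext": "second post"}},
        ],
    },
}


def extract_posts(listing):
    """Return the list of post dicts and the token for the next page."""
    posts = [child["data"] for child in listing["data"]["children"]]
    return posts, listing["data"]["after"]


posts, after = extract_posts(listing)
print(len(posts), after)   # 2 t3_example2
```
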

In [9]:
# convert the 3rd layer of dict (key 'data') into a DataFrame for a better view
# selftext (which only has a value from the 3rd row onward) is the post text that I would like to compile for modelling

df = pd.DataFrame(p['data'] for p in nosleep_dict['data']['children'])
df.head(3)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday,link_flair_template_id
0,,nosleep,,t2_c446v4f,False,,0,False,February 2020 contest nominations,[],r/nosleep,False,6,,0,False,t3_fdub8s,False,dark,,public,86,1,{},,False,[],,False,False,,{},,False,86,,False,,False,,[],{},[writing],False,,1583439000.0,text,6,,,text,redd.it,False,,,,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fdub8s,True,,TheCusterWolf,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fdub8s/february_2020_conte...,all_ads,True,https://redd.it/fduax3,13840304,1583410000.0,0,,False,,
1,,nosleep,,t2_m297o,False,,0,False,January 2020 Winners!,[],r/nosleep,False,6,,0,False,t3_fecu80,False,dark,,public,99,0,{},,False,[],,False,False,,{},,False,99,,False,,False,,[],{},[writing],False,,1583527000.0,text,6,,,text,redd.it,True,,,,,,False,False,False,False,False,[],[],False,False,False,True,,False,,,moderator,t5_2rm4d,,,,fecu80,True,,poppy_moonray,,0,True,all_ads,False,[],False,,/r/nosleep/comments/fecu80/january_2020_winners/,all_ads,True,https://redd.it/fectho,13840304,1583498000.0,0,,False,True,
2,,nosleep,"42 years, 6 months and 3 days ago, on the 5th ...",t2_20gz4yg3,False,,0,False,42 years ago we sent Voyager 1 into space to l...,[],r/nosleep,False,6,,0,False,t3_fgwmy3,False,dark,,public,4933,1,{},,False,[],,False,False,,{},,False,4933,,True,,False,,[],{'gid_1': 1},[writing],True,,1583960000.0,text,6,,,text,self.nosleep,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,False,,,,t5_2rm4d,,,,fgwmy3,True,,RichardSaxon,,202,True,all_ads,False,[],False,,/r/nosleep/comments/fgwmy3/42_years_ago_we_sen...,all_ads,False,https://www.reddit.com/r/nosleep/comments/fgwm...,13840304,1583931000.0,4,,False,,


In [10]:
# Check the total rows collected per request to reddit.com
df.shape

(27, 101)

By default, Reddit returns the top 25 posts per request (the 27 rows above also include two stickied moderator posts).
To get the next 25 posts, the request needs the name ID of the last post,
which is the value of the 'after' key mentioned a few cells earlier.


In [11]:
# This is the name of the last post.
nosleep_dict['data']['after']

't3_fh9ag7'
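
For reference, passing this value as the `after` query parameter amounts to requesting the URL built below. This is a small standard-library sketch; the `base_url` and `next_url` names are only illustrative.

```python
from urllib.parse import urlencode

base_url = "https://www.reddit.com/r/nosleep.json"
after = "t3_fh9ag7"   # value returned in the cell above

# Equivalent to what requests builds from params={'after': after}
next_url = f"{base_url}?{urlencode({'after': after})}"
print(next_url)
# https://www.reddit.com/r/nosleep.json?after=t3_fh9ag7
```
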

### 3.3 Collecting post from two subreddits

Below is the function that loops to collect more posts from reddit.com.

However, Reddit limits the number of requests you are allowed to make per second. Thus, a timer is needed to delay the loop between requests, using `time.sleep()`.


In [12]:
##### Function to scrape posts from reddit.com ######
def collect_post(url, after):
    posts = []      # list to store the posts after scraping
    headers = {'User-agent': 'Appletree 8.1'}       # custom user-agent

    for i in range(4):     # number of requests to make; increase to scrape more posts
        if after is None:
            params = {}          # the first 25 posts
        else:
            params = {'after': after}    # name ID of the last post, used to request the next 25 posts

        res = requests.get(url, params=params, headers=headers)

        if res.status_code == 200:      # status_code 200 means the request succeeded
            current_dict = res.json()   # convert the response into a dictionary

            # store the 'data' value of each item under ['data']['children']
            current_post = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_post)  # extend adds each post as its own item instead of nesting a list in posts

            after = current_dict['data']['after']   # ID for the next 25 posts
            print('last ID:', after)
        else:
            print('status error!', res.status_code)
            break

        # sleep for a random duration to look more 'natural' than a fixed timer
        sleep_duration = random.randint(2, 10)
        time.sleep(sleep_duration)

    # check the number of posts collected
    print('length of collected posts:', len(posts))
    # check the number of unique IDs
    print('uniqueID:', len({p['name'] for p in posts}))

    return posts
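
The pagination logic can be exercised without touching the network by passing in the page-fetching step as a function. The sketch below is a hypothetical variant, not a replacement for `collect_post`: `collect_pages`, `fetch_page`, and `FAKE_PAGES` are made-up names, and it assumes that `after` comes back as `None` when there are no further pages.

```python
def collect_pages(fetch_page, after=None, max_requests=4):
    """Follow 'after' tokens, calling fetch_page(after) up to max_requests times."""
    posts = []
    for _ in range(max_requests):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:      # assumed signal that no further pages exist
            break
    return posts, after


# Hypothetical two-page listing standing in for Reddit's responses
FAKE_PAGES = {
    None: {"data": {"after": "t3_b",
                    "children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}]}},
    "t3_b": {"data": {"after": None,
                      "children": [{"data": {"name": "t3_c"}}]}},
}

posts, last_after = collect_pages(FAKE_PAGES.get)
print([p["name"] for p in posts], last_after)   # ['t3_a', 't3_b', 't3_c'] None
```

Returning the final `after` token alongside the posts lets a later run resume from where the previous one stopped, instead of copying the last printed ID by hand.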


#### To scrape posts from the first subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [13]:
to_scrape = False

**Uncomment** the line below to initialize an empty post list for the FIRST scrape,
then **comment it out again** by adding back the `#` before re-running the scrape in the next cell. Otherwise, the list restarts empty instead of continuing to append the collected posts.

In [14]:
#posts_1 = []

In [15]:
#### scrape posts from the 1st subreddit
if to_scrape:
    after = None    # set to None for the first round of scraping; after that, use the last
                    # 'last ID:' value, e.g. 't3_fgi720', printed out by the loop function

    url_1 = 'https://www.reddit.com/r/nosleep.json'    # url for 1st subreddit to scrape
    scrape_1 = collect_post(url_1, after)              # call collect_post to loop and scrape from reddit

    posts_1.extend(scrape_1)
    df_1 = pd.DataFrame(posts_1)
    df_1.drop_duplicates(subset = 'title', inplace = True)  # drop duplicated 'title'
    df_1.to_csv('../datasets/nosleep1.csv', index = False)   # export compiled df_1 to csv file

last ID: t3_fh9ag7
last ID: t3_fglt67
last ID: t3_fgt5sp
last ID: t3_fgi720
length of collected posts: 102
uniqueID: 102
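
When several scraping rounds are combined, duplicates can also be dropped on the post ID (`name`) rather than on the title. The sketch below shows that alternative with pandas; `merge_batches` and the toy rows are hypothetical and not part of the notebook's workflow.

```python
import pandas as pd


def merge_batches(existing, new_posts):
    """Append new_posts to existing and keep the first row for each post ID ('name')."""
    combined = pd.concat([existing, pd.DataFrame(new_posts)], ignore_index=True)
    return combined.drop_duplicates(subset="name").reset_index(drop=True)


# Hypothetical rows standing in for two scraping rounds that overlap on one post
existing = pd.DataFrame([
    {"name": "t3_a", "title": "first"},
    {"name": "t3_b", "title": "second"},
])
new_posts = [
    {"name": "t3_b", "title": "second"},
    {"name": "t3_c", "title": "third"},
]

merged = merge_batches(existing, new_posts)
print(merged["name"].tolist())   # ['t3_a', 't3_b', 't3_c']
```

Keying on `name` avoids discarding two different posts that happen to share a title.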


#### To scrape posts from the 2nd subreddit:
- change `to_scrape` to **True** in the cell below
- change it back to **False** after the scrape has completed

In [16]:
to_scrape = False

Similarly, **uncomment** the line below to initialize an empty post list for the 2nd subreddit's scrape,
then **comment it out again** by adding back the `#` before re-running the scrape in the next cell. Otherwise, the list restarts empty instead of continuing to append the collected posts.

In [17]:
#posts_2 = []

In [18]:
#### scrape posts from the 2nd subreddit
if to_scrape:
    after = None    # set to None for the first round of scraping; after that, use the last
                    # 'last ID:' value, e.g. 't3_fbdupg', printed out by the loop function

    url_2 = 'https://www.reddit.com/r/Thetruthishere.json'    # url for 2nd subreddit to scrape
    scrape_2 = collect_post(url_2, after)           # call collect_post to loop and scrape from reddit

    posts_2.extend(scrape_2)
    df_2 = pd.DataFrame(posts_2)
    df_2.drop_duplicates(subset = 'title', inplace = True)         # drop duplicated 'title'
    df_2.to_csv('../datasets/thetrueishere1.csv', index = False)   # export compiled, de-duplicated df_2 to csv file

last ID: t3_ffx2x1
last ID: t3_fejf90
last ID: t3_fda074
last ID: t3_fbdupg
length of collected posts: 100
uniqueID: 100
