# SubReddit Web Scraping, Analysis & Classification: Data Collection

## Problem Statement

Taking the role of a business analyst at a company specializing in home appliances (light fixtures, vacuum cleaners, etc.), I am tasked with researching smart devices as we explore expanding and enhancing our current product line. We have acquired a very large market-research dataset from varied sources to aid this effort; unfortunately, part of the metadata on data origin was lost. All we know is that the segment collected from Reddit originated from r/homeassistant and r/homeautomation. Owing to time and data-storage constraints, we can only focus on posts from the one subreddit that is more relevant to our needs, namely exploring current trends and issues around popular, easy-to-install smart-home devices. Data from the less relevant subreddit will be disposed of to alleviate the data-storage issue.

<b>The goal is to build a text classifier, based on a small subset of subreddit data scraped from the Web, that can differentiate between posts from r/homeassistant and those from r/homeautomation, so that we can reliably classify and preserve the relevant text data from our market-research dataset.</b>

## Executive Summary

This project is focused on building a text classifier to differentiate between posts from r/homeassistant and r/homeautomation.

The *Data Collection* phase scraped the Reddit JSON API and yielded 501 posts from each subreddit.

The *Data Cleaning* phase revealed that some r/homeassistant posts contained programmatic code, which could affect text analysis. In both subreddits, a proportion of posts contained only a title, as users saw no need to elaborate in the main body.

In the *Data Analysis* phase, distributions of word counts for titles and posts were charted, but the two subreddits did not show any significant difference in content length. Word-cloud and top-20 frequent-word analysis indicated that users in r/homeassistant were centered around tinkering with code and individual components (e.g. single-board computers, wireless modules, sensors, etc.) to create new devices or enhance existing ones. On the other hand, r/homeautomation posts tended towards smart-home devices from well-known consumer brands on the market, without going to the level of crafting a smart-home device from separate parts. All in all, r/homeassistant seems to attract hobbyist electronics enthusiasts, while r/homeautomation is a more suitable place for typical consumers of tech gadgets.

In the *Pre-processing, Model Creation & Benchmarking* phase, a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer was used to transform our text data into vectors. r/homeautomation was set as the positive class because it matches our company's business requirement of exploring issues faced by owners and prospective buyers of smart-home devices. The scoring metrics chosen were test-set accuracy and recall.

Multinomial Naive Bayes classifier and Support Vector classifier were modeled, fitted and scored, eventually showing that the Multinomial Naive Bayes classifier performed slightly better in terms of test-set accuracy score. Subsequent efforts to tune the Multinomial Naive Bayes classifier and Support Vector Classifier did not yield significantly better results, especially in terms of test-set accuracy score and recall score.

In favor of the simpler model that was faster to train at the same time, we selected the Multinomial Naive Bayes classifier (alpha=1.0) as our choice model for classification of incoming content from the 2 subreddits, with the aim of identifying r/homeautomation posts to further our business objectives. It has a test-set Accuracy Score of 81.6% and Recall Score of 0.824.

# Data Collection

This section collects posts from the two subreddits chosen above. These posts form the dataset used for analysis and subsequent classification.

We use the [Reddit JSON API](https://github.com/reddit-archive/reddit/wiki/JSON) to retrieve posts in JSON format. Appending `.json` to the original web address `https://www.reddit.com/r/homeassistant` gives `https://www.reddit.com/r/homeassistant.json`, which grants direct access to the JSON API of the Home Assistant subreddit.
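As a minimal sketch of that URL transformation, the helper below appends the `.json` suffix to a subreddit address. The function name `to_json_url` is hypothetical and not part of the original notebook, which hard-codes the two URLs instead.

```python
def to_json_url(subreddit_url):
    """Return the JSON API address for a subreddit web address."""
    # Drop a trailing slash so we do not produce '.../homeassistant/.json'.
    base = subreddit_url.rstrip('/')
    # Leave the address untouched if it already points at the JSON API.
    if base.endswith('.json'):
        return base
    return base + '.json'


print(to_json_url('https://www.reddit.com/r/homeassistant'))
```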

The key-value structure of the response makes it intuitive to navigate and manipulate the response data like a conventional Python dictionary.

## Import all necessary libraries

In [1]:
import pandas as pd
import requests
import time
import random
from os import listdir
from os.path import isfile, join
import re
import datetime

## Define user-agent for HTTP request header

Adhering to the [guidelines](https://github.com/reddit-archive/reddit/wiki/API) laid down by Reddit on web scraping, we define a user-agent value for usage when making HTTP requests to the Reddit API for scraping data:

In [2]:
# instantiate user-agent
user_agent = 'android:com.hwrobotics.sc:v1.3.0'

h_assistant_url = 'https://www.reddit.com/r/homeassistant.json'

h_automation_url = 'https://www.reddit.com/r/homeautomation.json'

subreddit_url_list = [h_assistant_url, h_automation_url]
subreddit_title_list = ['home_assistant', 'home_automation']

## Execute a single HTTP request and inspect the response structure:

We execute an HTTP GET request to the Home Automation JSON API, supplying the user-agent value defined in the previous cell:

In [3]:
# execute a GET request
res = requests.get('https://www.reddit.com/r/homeautomation.json', headers={'User-agent': user_agent})

# check that response is successful
print('HTTP response: {} {}'.format(res.status_code, res.reason))

HTTP response: 200 OK


After verifying that the HTTP request has yielded a successful response, we begin to inspect the HTTP response data:

In [4]:
# .json() parses the JSON response body into a Python dictionary
reddit_dict = res.json()
reddit_dict.keys()

dict_keys(['kind', 'data'])

Among the 2 keys, we navigate to the `data` node:

In [5]:
data_node = reddit_dict['data']
data_node.keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

The data segment should have a list of posts, so we navigate further to the `children` node within it:

In [6]:
posts = reddit_dict['data']['children']
len(posts)

25

The length check above indicates that we have found the 25 posts sent in response to the initial HTTP GET request. We inspect the `data` node of one of the posts:

In [7]:
single_post = reddit_dict['data']['children'][5]['data']

In [8]:
single_post.keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort',

In [9]:
pd.DataFrame([single_post]) \
    .loc[:,['stickied','created_utc', 'is_video', 'title', 'selftext','url','subreddit_name_prefixed']]

| | stickied | created_utc | is_video | title | selftext | url | subreddit_name_prefixed |
|---|---|---|---|---|---|---|---|
| 0 | False | 1607235000.0 | False | Multiple August Smart Lock Pros with Connect | I have three August Smart Lock Pros with conne... | https://www.reddit.com/r/homeautomation/commen... | r/homeautomation |


In [10]:
list(pd.DataFrame([single_post]).loc[:,'url'])[0]

'https://www.reddit.com/r/homeautomation/comments/k7oen6/multiple_august_smart_lock_pros_with_connect/'

Each post contains a lot of metadata, but the fields we are interested in are `title` and `selftext`. These 2 fields hold the title and main contents of the post.
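To illustrate keeping only those two fields, here is a small sketch that reduces a list of `children` nodes to title/selftext records. The helper name `extract_text_fields` and the sample posts are made up for illustration; the notebook itself saves all the metadata and defers field selection to later steps.

```python
def extract_text_fields(children):
    """Keep only the title and body text of each post node."""
    records = []
    for node in children:
        post = node.get('data', {})
        records.append({
            'title': post.get('title', ''),
            # selftext is empty for posts that only have a title
            'selftext': post.get('selftext', ''),
        })
    return records


# Made-up sample mimicking the shape of reddit_dict['data']['children'].
sample_children = [
    {'kind': 't3', 'data': {'title': 'Example title A', 'selftext': 'Example body', 'ups': 3}},
    {'kind': 't3', 'data': {'title': 'Example title B', 'selftext': '', 'ups': 1}},
]

for record in extract_text_fields(sample_children):
    print(record)
```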

## Create web scraping procedure

### Define parameters for iterative web scraping

Each HTTP GET request retrieves 25 posts (the 1st request is likely to retrieve more than 25 posts if the subreddit contains at least 1 stickied post). We aim to retrieve 500 posts per subreddit:

In [11]:
# Run with is_live_data_collection=True only for live data collection. 
#
# For demo purpose, set is_live_data_collection=False to scrape only 50 posts for proof-of-concept.
#
is_live_data_collection = False

if is_live_data_collection:
    target_number_of_post = 500
else:
    target_number_of_post = 50

post_count_per_request = 25

iteration_count = int(target_number_of_post / post_count_per_request)
iteration_count

2
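The cell above truncates the division, which is exact for targets of 50 and 500. As a hedged alternative for targets that are not a multiple of 25, the sketch below rounds up so the target is always reached. The function name `requests_needed` is hypothetical.

```python
import math


def requests_needed(target_posts, posts_per_request=25):
    """Number of GET requests needed to collect at least target_posts."""
    # Round up: 60 posts at 25 per request needs 3 requests, not 2.
    return math.ceil(target_posts / posts_per_request)


print(requests_needed(50))   # demo setting
print(requests_needed(500))  # live setting
```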

### Define web scraping function for 1 subreddit:

In [12]:
def collect_reddit_posts(url, iter_count, after_value=None):
    posts = []
    after = after_value

    for index in range(iter_count):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        req = requests.get(current_url, headers={'User-agent': user_agent})

        if req.status_code != 200:
            print('[collect_reddit_posts] Status error: {} {}'.format(req.status_code, req.reason))
            break
        else:
            print('[collect_reddit_posts] Iteration #{}, HTTP response: {} {}'.format(index, req.status_code, req.reason))

        request_dict = req.json()
        subreddit_posts = [node['data'] for node in request_dict['data']['children']]
        posts.extend(subreddit_posts)
        after = request_dict['data']['after']
        
        sleep_duration = random.randint(10, 60) # set delay of random no. of seconds
        print('[collect_reddit_posts] Extracted {} post(s), now sleeping for {}s'.format(len(subreddit_posts), sleep_duration))
        time.sleep(sleep_duration) # execute delay before next iteration
        
    return posts, after
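
The function above pages through a subreddit by feeding each response's `after` value into the next request. The sketch below isolates that cursor logic with a stand-in for the HTTP call, so it runs without network access. It also adds two safeguards that are assumptions on my part rather than part of the original function: it stops when `after` comes back as `None`, and it skips posts whose `name` has already been seen. `collect_with_cursor`, `fetch_page` and `FAKE_PAGES` are hypothetical names.

```python
def collect_with_cursor(fetch_page, max_pages):
    """Follow the `after` cursor until it runs out or max_pages is reached."""
    posts, seen, after = [], set(), None
    for _ in range(max_pages):
        page = fetch_page(after)  # expected to return the parsed JSON listing
        for node in page['data']['children']:
            post = node['data']
            if post['name'] not in seen:  # `name` identifies the post, e.g. 't3_...'
                seen.add(post['name'])
                posts.append(post)
        after = page['data']['after']
        if after is None:  # no further pages to request
            break
    return posts, after


# Stand-in for the HTTP call: two fake pages keyed by cursor value.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_b'}},
                                   {'data': {'name': 't3_c'}}],
                      'after': None}},
}

posts, after = collect_with_cursor(lambda cursor: FAKE_PAGES[cursor], max_pages=5)
print([p['name'] for p in posts], after)
```

In a live run, `fetch_page` would wrap the `requests.get(...)` call and the random sleep used in `collect_reddit_posts`.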

### Define aggregation function to collectively scrape and save data from multiple subreddits:

In [13]:
def aggregate_reddit_posts(url_list, filename_list, iter_count, after_list):
    print('url list: {}'.format(url_list))
    print('filename list: {}'.format(filename_list))
    print('iter_count: {}'.format(iter_count))
    for item in zip(url_list, filename_list, after_list):
        print('[aggregate_reddit_posts] Scraping post for [{}]'.format(item[0]))
        posts, after = collect_reddit_posts(
                                        item[0], 
                                        iter_count, 
                                        item[2])
        print('[aggregate_reddit_posts] Saving posts to csv file: {}_[{}].csv'.format(item[1], after))
        pd.DataFrame(posts).to_csv('../data/{}_[{}].csv'.format(item[1], after))
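
Because each CSV filename embeds the last `after` value, a later run could in principle resume from it via `after_list`. The sketch below shows one way to recover the subreddit title and cursor from such a filename. It assumes the `{title}_[{after}].csv` naming used above; the helper `parse_saved_filename` is hypothetical and does not appear in the original notebook.

```python
import re

# Matches names such as 'home_assistant_[t3_k796m5].csv'.
FILENAME_PATTERN = re.compile(r'^(?P<title>.+)_\[(?P<after>[^\]]*)\]\.csv$')


def parse_saved_filename(filename):
    """Return (title, after) for a saved CSV name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    after = match.group('after')
    # str.format() writes a missing cursor as the text 'None'.
    return match.group('title'), (None if after == 'None' else after)


print(parse_saved_filename('home_assistant_[t3_k796m5].csv'))
```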

## Perform web scraping and saving of data from the 2 subreddits

In [14]:
after_list = [None, None]

In [15]:
subreddit_url_list

['https://www.reddit.com/r/homeassistant.json',
 'https://www.reddit.com/r/homeautomation.json']

In [16]:
aggregate_reddit_posts(subreddit_url_list, subreddit_title_list, iteration_count, after_list)

url list: ['https://www.reddit.com/r/homeassistant.json', 'https://www.reddit.com/r/homeautomation.json']
filename list: ['home_assistant', 'home_automation']
iter_count: 2
[aggregate_reddit_posts] Scraping post for [https://www.reddit.com/r/homeassistant.json]
https://www.reddit.com/r/homeassistant.json
[collect_reddit_posts] Iteration #0, HTTP response: 200 OK
[collect_reddit_posts] Extracted 26 post(s), now sleeping for 34s
https://www.reddit.com/r/homeassistant.json?after=t3_k7k3m4
[collect_reddit_posts] Iteration #1, HTTP response: 200 OK
[collect_reddit_posts] Extracted 25 post(s), now sleeping for 24s
[aggregate_reddit_posts] Saving posts to csv file: home_assistant_[t3_k796m5].csv
[aggregate_reddit_posts] Scraping post for [https://www.reddit.com/r/homeautomation.json]
https://www.reddit.com/r/homeautomation.json
[collect_reddit_posts] Iteration #0, HTTP response: 200 OK
[collect_reddit_posts] Extracted 25 post(s), now sleeping for 40s
https://www.reddit.com/r/homeautomation.js

Subreddit data in the saved CSV files can be used for cleaning and analysis in the next notebook.