# Reddit Post Classifier

## Problem Statement

Some people wish they had been born with super powers that let them do amazing things ordinary human beings cannot. On the other hand, some super powers would bring more inconvenience than benefit. For example, the ability to turn objects into gold by touching them (like the Midas touch) may seem like an awesome super power. However, it also turns food into gold, so you would be unable to eat anything at all. In such cases, people would call it a "shitty super power". The subreddit dedicated to such discussion is r/shittysuperpowers. The super powers discussed there are usually hilarious ones that affect daily life and everyday activities.

The more extreme and more positive version can be found at r/godtiersuperpowers. According to its description, it is "r/shittysuperpowers, but they're actually god tier". In this subreddit, users also discuss hilarious powers, but ones strong enough to be comparable to a god's. Hence, both subreddits feature hilarious powers; the difference lies in the outcome of those super powers.

Which subreddit would your super power be classified under? The aim of this notebook is to build a classifier model based on Reddit users' sentiments. The classifier should determine whether a super power is "god-tier" [1] or "shitty" [0] by analysing the text of the post. Since the training dataset comes from Reddit users, a model with high accuracy should be able to indicate what Reddit users would think about your super power. The classifier models which will be explored are Logistic Regression, Multinomial NB and RandomForest, each vectorised either by Count or by TFIDF.

**Measurement of Success**

The model's predictions will be scored and its accuracy determined. The higher the accuracy, the better the model is at predicting whether a super power is considered god-tier or shitty, based on Redditors' sentiments and opinions.

Another measure of success will be feeding the model two super powers and judging whether its predictions are plausible.

**Primary Stakeholders**

The model could be useful to content creators, such as novel writers or comic creators, who want to gauge whether a new super hero with a new super power would be received favourably by the community. For content creators this is valuable information, since it lets them sense early on how an idea might land. Furthermore, when vying for consumers' attention, creators with a head start have a higher chance of extending their lead in the market.

**Secondary Stakeholders**

Another useful application would be as a tool in Reddit's backend. Subreddit moderators could use it to automatically flag posts which may not belong in a given subreddit.

## Executive Summary

Reddit is a go-to place for communities to engage in discussions and express themselves across various topics in different subreddits. Wouldn't it be nice to know how the community feels about certain issues without looking through thousands of posts? Wouldn't we then be able to know what grabs their attention?

Based on two subreddits, r/godtiersuperpowers and r/shittysuperpowers, a model was developed to determine which subreddit a post belongs to. The model was trained on redditors' sentiments and opinions in both subreddits. Through such a model, we could understand whether a super power would be considered "God-tier" or "Shitty".

During the EDA, the words used in each subreddit were analysed by frequency (using Count Vectorisation) and by salience (using TFIDF Vectorisation). Firstly, the 20 most frequent words were compared between the two subreddits to see which words appear most in each. The more often a word appears in one subreddit relative to the other, the stronger it is as a determinant in the classifier model. Secondly, the TFIDF Vectoriser showed the most significant (or distinctive) words in each post, relative to all the text. The more distinctive the word, the stronger it is as a determinant in the classifier model. Finally, the salient words in each subreddit were compared against each other to identify the words unique to each subreddit and the words they share.
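
To make the two views concrete, the sketch below computes raw term counts and a simple TF-IDF weight by hand on three made-up posts. The posts, the `tokenise` helper and the unsmoothed formula `tf * log(N / df)` are illustrative assumptions; library vectorisers apply their own smoothing and normalisation, so their numbers would differ.

```python
import math
import re
from collections import Counter

# Made-up posts, for illustration only (not taken from the scraped data).
posts = [
    "you can fly but only while asleep",
    "you can read minds",
    "you can fly",
]

def tokenise(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

tokenised = [tokenise(p) for p in posts]

# Count view: how often each word appears across all posts.
term_counts = Counter(word for doc in tokenised for word in doc)

# Document frequency: in how many posts each word appears at least once.
doc_freq = Counter(word for doc in tokenised for word in set(doc))

def tfidf(doc, n_docs=len(tokenised)):
    """Unsmoothed TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

weights = [tfidf(doc) for doc in tokenised]

print(term_counts.most_common(3))           # most frequent words overall
print(max(weights[1], key=weights[1].get))  # most distinctive word in post 2
```

Words that occur in every post (here "you" and "can") get a TF-IDF weight of zero, which is why frequent words and salient words can tell different stories.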

**Findings**:

- The median word counts for r/godtiersuperpowers and r/shittysuperpowers are 56 and 46 words respectively, with many outlier posts running beyond 500 words.

- Both subreddits have quite a lot in common:
    - Among the 20 most frequent words, 10 were common to both.
    - There are about 700 words unique to each subreddit, with 166 common words.

- The common words tend to describe god-tier powers more, making the model likely to predict false negatives.

- Context is still very important: words without context are insufficient to qualify a super power.
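
The overlap figures in the findings come down to set operations on each subreddit's vocabulary. A minimal sketch with made-up word lists (the variable names and data are illustrative assumptions, not the notebook's actual vocabulary):

```python
from collections import Counter

# Made-up token lists standing in for each subreddit's cleaned text.
god_tier_words = "time money power time control money time".split()
shitty_words = "time toes power sneeze toes time".split()

def top_words(words, n):
    """Return the set of the n most frequent words."""
    return {word for word, _ in Counter(words).most_common(n)}

top_god = top_words(god_tier_words, 3)
top_shitty = top_words(shitty_words, 3)
common_top = top_god & top_shitty        # frequent in both subreddits

vocab_god = set(god_tier_words)
vocab_shitty = set(shitty_words)
only_god = vocab_god - vocab_shitty      # words seen only in god-tier posts
only_shitty = vocab_shitty - vocab_god   # words seen only in shitty posts
shared = vocab_god & vocab_shitty        # words seen in both

print(sorted(common_top), sorted(only_god), sorted(only_shitty), sorted(shared))
```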


**Model Evaluation and Selection**

Using Count Vectorisation and TFIDF Vectorisation to preprocess the dataset, and applying each to Logistic Regression, Multinomial Naive Bayes and RandomForest, a total of 6 models were developed. The models were also optimised by tuning their hyperparameters using GridSearchCV and RandomizedSearchCV. Out of the 6 models, the top 2 performing models were scored on accuracy and AUC. In the end, RandomForest with Count Vectorisation performed the best, with the least overfitting, high accuracy and the higher AUC score.
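
A rough sketch of how the six combinations could be set up with scikit-learn. The toy texts, the tiny parameter grid and the dictionary names are assumptions for illustration only; the actual data and hyperparameter grids belong to the modelling notebook and are not reproduced here.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up posts: 1 = god-tier, 0 = shitty.
texts = [
    "you can stop time whenever you want",
    "you can heal any illness instantly",
    "you never run out of money",
    "you can teleport anywhere in the world",
    "you can turn your toes invisible",
    "you can sneeze on command once a year",
    "you always know when milk has expired next door",
    "you can fly but only one centimetre high",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Factories so every pipeline gets fresh, unfitted components.
vectorisers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "logreg": lambda: LogisticRegression(max_iter=1000),
    "nb": MultinomialNB,
    "rf": lambda: RandomForestClassifier(n_estimators=20, random_state=0),
}

# Illustrative grid: tune only the n-gram range of the vectoriser.
param_grid = {"vec__ngram_range": [(1, 1), (1, 2)]}

results = {}
for (vec_name, make_vec), (clf_name, make_clf) in product(
    vectorisers.items(), classifiers.items()
):
    pipe = Pipeline([("vec", make_vec()), ("clf", make_clf())])
    search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    results[(vec_name, clf_name)] = search.best_score_

for combo, score in results.items():
    print(combo, round(score, 3))
```

Wrapping the vectoriser and classifier in one `Pipeline` means the vocabulary is refitted inside each cross-validation fold, so validation text does not leak into the features.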

**Results**

The model gives an indication of whether a super power would be god-tier or shitty. It has an accuracy of about 65% with an AUC score of 0.649. Content creators and Reddit moderators could leverage such an algorithm, but its limitations should be overcome before it is practical and useful.
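
For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how accuracy and AUC can be computed. The labels and scores are made up, and the pairwise AUC definition is a textbook formulation, not necessarily the routine used in the modelling notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = god-tier, 0 = shitty) and predicted probabilities.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # threshold at 0.5

print(accuracy(y_true, y_pred))  # 0.5
print(auc(y_true, scores))       # 0.75
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the posts, which is why the two numbers can differ.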





### Contents:

- [Data Dictionary](#Data-Dictionary)
- [Scraping from Reddit](#1.-Scraping-from-Reddit)
- [Data Cleaning and Basic EDA](2%20of%203_Reddit%20Classifier%20(Data%20Cleaning%20and%20EDA).ipynb#2.-Data-Cleaning-and-EDA)
- [Preprocessing and Modelling](3%20of%203_Reddit%20Classifier%20(Modelling%2C%20Evaluation%2C%20Conclusion).ipynb#3.-Preprocessing-and-Modelling)
- [Model Evaluation](3%20of%203_Reddit%20Classifier%20(Modelling%2C%20Evaluation%2C%20Conclusion).ipynb#4.-Evalution)
- [Conclusion and Recommendations](3%20of%203_Reddit%20Classifier%20(Modelling%2C%20Evaluation%2C%20Conclusion).ipynb#5.-Conclusion-and-Recommendations)

## Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**name**|object|god_tier_df, shitty_df, combined_df|Unique ID of each post|
|**selftext**|object|god_tier_df, shitty_df, combined_df|The body text of the post, where users can write more than the title allows|
|**title**|object|god_tier_df, shitty_df, combined_df|The title of the post|
|**subreddit**|object|god_tier_df, shitty_df, combined_df|The subreddit the post belongs to|
|**combined_text**|object|god_tier_df, shitty_df, combined_df|The combination of selftext and title|

## 1. Scraping from Reddit

In [6]:
import requests
import pandas as pd
import time
import random

In [7]:
# set headers for scraping reddit

headers= {'User-agent': 'Mr. Celeryboy'}

In [8]:
# create a function to scrape, since I am going to scrape 2 subreddits.

def scrap_reddit(subreddit_name, num_res):
    """Scrape the top posts of a subreddit, where num_res is the number of
    requests to make. Since reddit returns 25 posts per request by default,
    num_res = number of posts / 25"""
    
    url = f"https://www.reddit.com/r/{subreddit_name}/top/.json?sort=top&t=all"
    
    posts = [] # creates an empty list in which to store all the posts
    after = None # the identifier or "name" of the last post, which allows loading the next 25.

    for a in range(num_res):
        # for the first 25, 'after' would be None.
        if after is None:
            current_url = url
        else:
            # goes to the next 25 posts
            current_url = url + '&after=' + after
        print(current_url)
        res = requests.get(current_url, headers=headers)

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        # the full JSON response (a listing of posts plus metadata) as a dict
        current_dict = res.json()
        
        # returns only the posts
        current_posts = [p['data'] for p in current_dict['data']['children']]
        
        # extends the posts(list) created above
        posts.extend(current_posts)
        
        # changes 'after' to the identifier or the "name" of the 25th or the last post.
        # note: if 'after' comes back as None, the next loop starts again from the first page
        after = current_dict['data']['after']

        # generate a random sleep duration to avoid clogging the server
        sleep_duration = random.randint(2,20)
        print(f"Sleep Time: {sleep_duration} seconds, Loop {a+1}/{num_res} completed")
        time.sleep(sleep_duration)
        
    return posts
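
The pagination logic can be exercised offline. The sketch below swaps the HTTP call for a hypothetical `fake_fetch` function that serves a small in-memory listing, and makes one assumption-based change to the loop: it stops as soon as `after` comes back as `None`, so it does not start again from the first page when a listing runs out of posts. `paginate`, `fake_fetch` and `FAKE_LISTING` are illustrative names, not part of the notebook.

```python
def paginate(fetch_page, max_requests):
    """Collect posts page by page, following the 'after' token.

    fetch_page(after) must return (posts, next_after); next_after is None
    when there is nothing left to load.
    """
    posts = []
    after = None
    for _ in range(max_requests):
        page, after = fetch_page(after)
        posts.extend(page)
        if after is None:  # listing exhausted: stop instead of refetching page one
            break
    return posts

# Hypothetical stand-in for the Reddit listing: 5 fake posts.
FAKE_LISTING = [{"name": f"t3_fake{i}"} for i in range(5)]

def fake_fetch(after, page_size=2):
    """Serve one page of FAKE_LISTING, starting after the post named 'after'."""
    names = [p["name"] for p in FAKE_LISTING]
    start = 0 if after is None else names.index(after) + 1
    page = FAKE_LISTING[start:start + page_size]
    end = start + len(page)
    # Like a listing API, report no 'after' token once the last post is served.
    next_after = page[-1]["name"] if end < len(FAKE_LISTING) else None
    return page, next_after

posts = paginate(fake_fetch, max_requests=10)
print(len(posts), len({p["name"] for p in posts}))
```

Without the early `break`, asking for more pages than the listing holds would append the first pages a second time, which is one possible source of duplicate rows.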

In [9]:
# a function to create a dataframe from one or more lists of posts

def create_df(*list_of_posts, columns=('name', 'selftext', 'title', 'subreddit')):
    """
    Create a dataframe from any number of scraped subreddits, for example
    create_df(subreddit_1, subreddit_2, subreddit_3, subreddit_4).
    
    `columns` lists the post attributes we have an interest in; the rest are dropped.
    
    """
    combined_posts = []
    for posts in list_of_posts:
        combined_posts.extend(posts)
    return pd.DataFrame(combined_posts, columns=list(columns))

In [10]:
# 40 requests x 25 posts per request, aiming for 1000 posts
god_tier_posts = scrap_reddit("godtiersuperpowers" , 40)

https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all
Sleep Time: 17 seconds, Loop 1/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_czgzsl
Sleep Time: 2 seconds, Loop 2/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_cag882
Sleep Time: 18 seconds, Loop 3/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_dngfr5
Sleep Time: 17 seconds, Loop 4/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_cgqf9t
Sleep Time: 7 seconds, Loop 5/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_gu0x4p
Sleep Time: 17 seconds, Loop 6/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_dv73ym
Sleep Time: 2 seconds, Loop 7/40 completed
https://www.reddit.com/r/godtiersuperpowers/top/.json?sort=top&t=all&after=t3_dhpg3m
Sleep Time: 14 seconds, Loop 8/

In [11]:
# 41 requests (up to 1025 posts) because 40 loops did not give me 1000 posts
shitty_posts = scrap_reddit("shittysuperpowers", 41)

https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all
Sleep Time: 3 seconds, Loop 1/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_an2s53
Sleep Time: 5 seconds, Loop 2/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_ah1u4w
Sleep Time: 4 seconds, Loop 3/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_dr8sm1
Sleep Time: 15 seconds, Loop 4/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_dnnv0m
Sleep Time: 19 seconds, Loop 5/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_fpxstu
Sleep Time: 11 seconds, Loop 6/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_fccafb
Sleep Time: 13 seconds, Loop 7/41 completed
https://www.reddit.com/r/shittysuperpowers/top/.json?sort=top&t=all&after=t3_crkdrk
Sleep Time: 17 seconds, Loop 8/41 compl

In [12]:
# Check the number of posts per subreddit, should have around 1000.

print(f"Total God Tier Posts: {len(god_tier_posts)}")
print(f"Total Shitty Posts: {len(shitty_posts)}")

Total God Tier Posts: 996
Total Shitty Posts: 1019


### Examine the Scrape

Here, I check the shape of the dataframe and for duplicates.

In [13]:
# create dataframe and check shape
god_tier_df = create_df(god_tier_posts)
god_tier_df.shape

(996, 4)

In [14]:
# check duplicates
god_tier_df.drop_duplicates().shape

# it has no duplicates

(996, 4)

In [15]:
# create dataframe and check shape

shitty_df = create_df(shitty_posts)
shitty_df.shape

(1019, 4)

In [16]:
# check duplicates

shitty_df.drop_duplicates().shape

# quite a bit of duplicates

(969, 4)

For both subreddits, the number of posts is close to a thousand. Even though there are quite a few duplicates for `r/shittysuperpowers` (50 of the 1019 rows), there is still a high number of unique posts for each subreddit.
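
To see which posts are repeated rather than only how many, the rows can be inspected with `duplicated`. A small sketch on a made-up dataframe (the rows are illustrative, and keying on the `name` column is an assumption; the notebook compares whole rows):

```python
import pandas as pd

# Made-up rows: the first and last share the same post ID ("name").
df = pd.DataFrame({
    "name": ["t3_aaa", "t3_bbb", "t3_aaa"],
    "title": ["Power A", "Power B", "Power A"],
})

# keep=False flags every member of a duplicated group, not just the repeats.
repeated = df[df.duplicated(subset="name", keep=False)]

# Number of extra copies for each repeated ID.
extra_copies = df["name"].value_counts() - 1
extra_copies = extra_copies[extra_copies > 0]

# Non-inplace variant of dropping duplicates and resetting the index.
deduped = df.drop_duplicates(subset="name").reset_index(drop=True)

print(repeated)
print(extra_copies)
print(deduped)
```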

The two dataframes will be combined and the duplicates will be dropped.

### Delete Duplicates and Save as CSV

In [17]:
# create data frame, check dataframe shape to make sure it is correct
reddit_df = create_df(god_tier_posts, shitty_posts)
reddit_df.shape

(2015, 4)

In [18]:
# take a look
reddit_df.head()

| |name|selftext|title|subreddit|
|---|---|---|---|---|
|0|t3_dg9js8||You have the power to successfully say “Fuck t...|godtiersuperpowers|
|1|t3_clcajt||Instant argument win.|godtiersuperpowers|
|2|t3_dokgtk||Saying "Fuck it" just before making a decision...|godtiersuperpowers|
|3|t3_cn2eyf||You can toot as much as you please|godtiersuperpowers|
|4|t3_ctjwz5||You can reply before someone even asks the que...|godtiersuperpowers|


In [19]:
# remove duplicates

reddit_df.drop_duplicates(inplace=True)

In [20]:
# reset index
reddit_df.reset_index(drop = True, inplace=True)

In [21]:
# check size again to make sure it is dropped successfully
reddit_df.shape

(1965, 4)

In [24]:
# save it as CSV in data folder

reddit_df.to_csv('data/reddit_df.csv',index=False)
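
Two details are worth keeping in mind when saving and reloading: `to_csv` does not create a missing folder, and empty `selftext` strings are read back as `NaN` by default. The sketch below shows both on a made-up dataframe, writing to a temporary directory (an assumption for the example, so nothing is written into the project folder):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows; the first post has no body text.
df = pd.DataFrame({
    "name": ["t3_aaa", "t3_bbb"],
    "selftext": ["", "some text"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = out_dir / "reddit_df.csv"
    df.to_csv(path, index=False)

    default_read = pd.read_csv(path)                       # "" comes back as NaN
    kept_empty = pd.read_csv(path, keep_default_na=False)  # "" is preserved

print(default_read["selftext"].isna().tolist())
print(kept_empty["selftext"].tolist())
```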

In [25]:
# last extracted on 28th September 2020, 11.20am

Click [here](2%20of%203_Reddit%20Classifier%20(Data%20Cleaning%20and%20EDA).ipynb#2.-Data-Cleaning-and-EDA) to access next notebook!