# Problem Statement

Reddit has grown tremendously in popularity recently. An article on [The Verge](https://www.theverge.com/2020/12/1/21754984/reddit-dau-daily-users-revealed) revealed that Reddit now has 52 million daily active users (that's about 9 times Singapore's population!), and this was recorded before the r/wallstreetbets saga of early 2021. Given Reddit's recent popularity, it would be interesting to design a Machine Learning model that can classify posts as belonging to one of two subreddits. This project aims to classify posts between r/GainIt and r/LoseIt. r/GainIt is a subreddit where people share ways to gain weight and bulk up, as well as success stories on how they did it. r/LoseIt is the opposite: people share their experiences with weight loss. Whether you're someone looking for the right 'keywords' to use in your Reddit post, or simply want to play around with Machine Learning models for classification, this model should be of interest to you.

The key question is this: **What are the key words one should use in a Reddit post body so that it is distinctly classified as belonging to either r/LoseIt or r/GainIt?**

Due to the nature of the topics, one can imagine that the contents of the two subreddits are likely to be very similar. Despite this, the model performed well, with an accuracy score of more than 92%. The baseline accuracy score to beat is 50%. This project has shown that bag-of-words vectorizing techniques like CountVectorizer/TfidfVectorizer, combined with Logistic Regression or a Naive Bayes classifier, are effective despite the similarities between the two subreddits.
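
The 50% baseline follows from the two classes being roughly balanced: a model that always predicts the more common subreddit would be right about half the time. The snippet below is a minimal sketch of that calculation, assuming about 700 labelled posts per subreddit; the label strings and the `baseline_accuracy` helper are illustrative and not taken from the project notebooks.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the largest class.
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical label column: roughly 700 posts from each subreddit.
labels = ["gainit"] * 700 + ["loseit"] * 700
print(baseline_accuracy(labels))  # 0.5
```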

# Executive Summary

For ease of readability, this project is split into 4 workbooks:

1. Scraping (`01. Scraping.ipynb`)
2. Data Cleaning (`02. Data Cleaning.ipynb`)
3. EDA (`03. EDA.ipynb`)
4. Modelling (`04. Modelling.ipynb`)

`01. Scraping.ipynb`
This workbook uses for loops to page through the `.json` listings that Reddit serves for each subreddit. I managed to scrape about 700 posts per subreddit, giving me a total of about 1,400 posts. As these two subreddits are very text-based, this gave me a good amount of text to run my model on.

`02. Data Cleaning.ipynb`
Data cleaning takes the raw Reddit scrape through several text-cleaning steps: lemmatizing, removal of special characters and digits, and stop-word removal. Lemmatizing was found to be more effective than stemming with the Porter stemmer, hence its use.
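
As a rough illustration of these cleaning steps, here is a dependency-free sketch. The `clean_text` helper and the tiny stop-word list are assumptions made for this example; the notebook itself may use a fuller stop-word list, and lemmatizing is left out here because it needs an external library.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use
# a much fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "i", "to", "of", "in", "my", "is", "it", "for"}


def clean_text(text):
    """Lowercase, strip non-letters, and drop stop words.

    Lemmatizing is deliberately omitted to keep the sketch self-contained.
    """
    # Replace anything that is not a letter (digits, punctuation, newlines)
    # with a space so that neighbouring words stay separated.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(clean_text("I gained 25kg in 2 years!\n\nMy tips: eat MORE."))
# gained kg years tips eat more
```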

`03. EDA.ipynb`
Here I do some exploratory data analysis to look for interesting observations. You can find plots of the top word counts in each subreddit, as well as word clouds that visualize them.
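
Counting the top words can be sketched with the standard library alone. The `top_words` helper and the sample posts below are hypothetical and only show the idea; the notebook's own plots are built from the real cleaned data.

```python
from collections import Counter


def top_words(posts, n=3):
    """Return the n most frequent words across an iterable of cleaned posts."""
    counts = Counter()
    for post in posts:
        counts.update(post.split())
    return counts.most_common(n)


# Hypothetical cleaned post bodies, for illustration only.
gainit_posts = ["eat more protein", "protein shake", "eat eat bulk"]
print(top_words(gainit_posts, n=2))  # [('eat', 3), ('protein', 2)]
```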

`04. Modelling.ipynb`
This part covers the actual modelling. It uses bag-of-words techniques (CountVectorizer and TfidfVectorizer) to vectorize the words, and compares several classifiers (Logistic Regression, Multinomial Naive Bayes and Random Forest Classifier) to see which is the most effective.
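
To make "bag-of-words" concrete, the sketch below builds a small document-term count matrix by hand. It is a simplified stand-in for what a count vectorizer produces, not the project's code; the `bag_of_words` helper and the two sample documents are made up for illustration.

```python
def bag_of_words(docs):
    """Build a sorted vocabulary and a document-term count matrix."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            row[index[word]] += 1
        matrix.append(row)
    return vocab, matrix


# Two hypothetical cleaned posts.
docs = ["bulk eat eat", "diet eat"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['bulk', 'diet', 'eat']
print(matrix)  # [[1, 0, 2], [0, 1, 1]]
```

Each row is one post and each column one vocabulary word, so word order is discarded and only the counts are passed on to the classifier.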

# Reddit Scraping

This section contains the code for scraping two subreddits, r/GainIt and r/LoseIt. We simply use a for loop to page through each subreddit's listing and extend the posts into a single list. We also use a random user-agent name generator to reduce the chance of Reddit blocking us during the scraping process.

Import libraries

In [3]:
import requests
import pandas as pd
import time
import random
import string

Create a user-agent name generator to reduce the chance of Reddit blocking the scrape

In [5]:
def name_generator():
    # Build a random 9-letter lowercase string to send as the User-agent header.
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(9))

Create a function to scrape Reddit and export the results to a .csv file. The function takes a URL, the number of times to scrape (note that you get 25 posts per scrape) and an output filename. I scraped both the main subreddit listing and its 'new' section in order to obtain more rows.

In [16]:
def scrape_reddit(url, count_of_scrape, filename):
    results = []
    after = None

    for _ in range(count_of_scrape):
        # The first request has no 'after' token; later ones continue from the last post seen.
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': name_generator()})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        results.extend(current_posts)
        after = current_dict['data']['after']
        # Stop when Reddit reports no further pages, rather than restarting from the first page.
        if after is None:
            break
        # Pause briefly between requests to avoid hammering Reddit's servers.
        time.sleep(1)

    results = pd.DataFrame(results)
    results.to_csv('../datasets/' + filename, index=False)

    return results
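
The `after` token is what drives the pagination: each response names the last post it returned, and passing that token back requests the next page. The sketch below isolates that loop using a fake in-memory listing instead of a network call, so it can be run without hitting Reddit. `collect_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical names introduced for this example only.

```python
def collect_pages(fetch_page, max_pages):
    """Follow Reddit-style 'after' tokens until they run out or max_pages is hit.

    fetch_page is a hypothetical callable that takes an 'after' token (or None)
    and returns a dict shaped like Reddit's listing JSON.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        if after is None:
            # No further pages are available, so stop early.
            break
    return posts


# Fake two-page listing standing in for the network call.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"id": "c"}}], "after": None}},
}

posts = collect_pages(FAKE_PAGES.get, max_pages=5)
print([p["id"] for p in posts])  # ['a', 'b', 'c']
```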

In [17]:
scrape_reddit('https://www.reddit.com/r/gainit.json', 20, 'gainit_raw.csv')
scrape_reddit('https://www.reddit.com/r/gainit/new.json', 20, 'gainit_raw2.csv')
scrape_reddit('https://www.reddit.com/r/loseit.json', 20, 'loseit_raw.csv')
scrape_reddit('https://www.reddit.com/r/loseit/new.json', 20, 'loseit_raw2.csv')

https://www.reddit.com/r/gainit.json
https://www.reddit.com/r/gainit.json?after=t3_m4c0un
https://www.reddit.com/r/gainit.json?after=t3_m1nu7s
https://www.reddit.com/r/gainit.json?after=t3_lz6t31
https://www.reddit.com/r/gainit.json?after=t3_lxx9kl


| Unnamed: 0 | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | permalink | parent_whitelist_status | stickied | url | subreddit_subscribers | created_utc | num_crossposts | media | is_video | link_flair_template_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | gainit | **Welcome to the weekly stupid questions threa... | t2_6l4z3 | False | | 0 | False | [Mod] Simple Questions - the weekly stupid que... | [] | ... | /r/gainit/comments/m5j4gj/mod_simple_questions... | all_ads | True | https://www.reddit.com/r/gainit/comments/m5j4g... | 308755 | 1.615810e+09 | 0 | | False | |
| 1 | | gainit | Good morning everyone!\n\nThird week! Here we ... | t2_2pz9dpsj | False | | 0 | False | The Gainit Program Party - 5/3/1 BBB Beefcake ... | [] | ... | /r/gainit/comments/m5jg9i/the_gainit_program_p... | all_ads | True | https://www.reddit.com/r/gainit/comments/m5jg9... | 308755 | 1.615811e+09 | 0 | | False | c08085c4-3cc0-11e4-92c2-12313d16464b |
| 2 | | gainit | http://imgur.com/a/GowELgh\n\nExactly two year... | t2_16zgu6 | False | | 0 | False | [Progress] - Two years, 25kg gained. 19/6'4/100kg | [] | ... | /r/gainit/comments/m5pvg4/progress_two_years_2... | all_ads | False | https://www.reddit.com/r/gainit/comments/m5pvg... | 308755 | 1.615829e+09 | 0 | | False | |
| 3 | | gainit | i either have insomnia or DSPD, basically i us... | t2_4gn6rlq3 | False | | 0 | False | how many of you have sleep disorders ? | [] | ... | /r/gainit/comments/m5id3k/how_many_of_you_have... | all_ads | False | https://www.reddit.com/r/gainit/comments/m5id3... | 308755 | 1.615807e+09 | 0 | | False | |
| 4 | | gainit | https://imgur.com/gallery/1QoYqRa\n\nI’ll try ... | t2_16xhmr | False | | 0 | False | Feeling Directionless (nutrition/Cals) | [] | ... | /r/gainit/comments/m6165f/feeling_directionles... | all_ads | False | https://www.reddit.com/r/gainit/comments/m6165... | 308755 | 1.615870e+09 | 0 | | False | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 122 | | gainit | I’m training for a 5k right now so my emphasis... | t2_1cjvuxse | False | | 0 | False | 2 Day Push/Pull + 1 Full | [] | ... | /r/gainit/comments/lwepo2/2_day_pushpull_1_full/ | all_ads | False | https://www.reddit.com/r/gainit/comments/lwepo... | 308755 | 1.614725e+09 | 0 | | False | |
| 123 | | gainit | Like for example, following PHAT:\n\n* Dumbbel... | t2_8apfe | False | | 0 | False | Exercise order and hypertrophy, does it matter... | [] | ... | /r/gainit/comments/lw0r1h/exercise_order_and_h... | all_ads | False | https://www.reddit.com/r/gainit/comments/lw0r1... | 308755 | 1.614688e+09 | 0 | | False | |
| 124 | | gainit | | t2_3rdqewr5 | False | | 0 | False | I am currently 6’3 185 and I’m planning to bul... | [] | ... | /r/gainit/comments/lwhief/i_am_currently_63_18... | all_ads | False | https://www.reddit.com/r/gainit/comments/lwhie... | 308755 | 1.614734e+09 | 0 | | False | |
| 125 | | gainit | [PHAT](https://simplyshredded.com/mega-feature... | t2_6bmrcwal | False | | 0 | False | Going to start PHAT again, but some of the exe... | [] | ... | /r/gainit/comments/lwhdx2/going_to_start_phat_... | all_ads | False | https://www.reddit.com/r/gainit/comments/lwhdx... | 308755 | 1.614733e+09 | 0 | | False | |
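
Because the main listing and the 'new' listing of the same subreddit can return some of the same posts, the combined files may contain duplicates. The sketch below shows one way to merge scrapes and keep only the first copy of each post. It assumes every scraped row carries Reddit's post `id` field, which is not visible in the truncated output; `merge_unique` and the sample rows are hypothetical.

```python
def merge_unique(*batches, key="id"):
    """Concatenate batches of post dicts, keeping the first copy of each key."""
    seen = set()
    merged = []
    for batch in batches:
        for post in batch:
            if post[key] not in seen:
                seen.add(post[key])
                merged.append(post)
    return merged


# Hypothetical rows standing in for the main-listing and 'new' scrapes.
hot = [{"id": "m5j4gj", "title": "A"}, {"id": "m5pvg4", "title": "B"}]
new = [{"id": "m5pvg4", "title": "B"}, {"id": "m6165f", "title": "C"}]
print([p["id"] for p in merge_unique(hot, new)])  # ['m5j4gj', 'm5pvg4', 'm6165f']
```

With the scraped data already in pandas, the same effect could likely be had with `pd.concat` followed by `drop_duplicates` on the chosen key column.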
