# Project 3: Subreddit Classification with NLP

# Introduction

Reddit is a collection of interest-based communities known as subreddits, with content that covers almost anything you can imagine.  Within each subreddit, users can create text or image posts, and upvote or downvote posts to express approval or disapproval regarding the content of the post. The number of upvotes and downvotes are fed into a hot-ranking algorithm to determine a score for the post, with higher scoring posts rising to the top of the subreddit.

The goal of this project is simple -- to <b>classify posts from two different subreddits based on their title and selftext</b>, or text within the body of the post. This can serve multiple purposes including:
- Help new users post to the right subreddit (subreddit names can be highly misleading)
- Allow subreddit moderators to easily identify posts that might not belong to their particular subreddit.

For this project, I selected two subreddits that are ostensibly similar on the surface -- `r/MensLib` and `r/MensRights`. Both subreddits focus mainly on male issues such as suicide or homelessness, and allow men a space to discuss their personal beliefs regarding their position in society. The main difference between these subreddits is that `r/MensLib` has a [broad definition of masculinity and supports feminism](https://www.reddit.com/r/MensLib/comments/93oyty/menslibs_official_position_on_the_mens_rights/), while `r/MensRights` generally has a narrower definition of masculinity and is driven by the idea that [feminism/feminists are actively harming men](https://www.reddit.com/r/MensRights/comments/jz1lzz/how_feminist_shaming_tactics_needlessly_cause/), and that there is [serious discrimination against men inherent in western societies](https://www.reddit.com/r/MensRights/wiki/faq#wiki_1._what_is_r.2Fmensrights.3F).

Beyond throwing my text into a vectorizer and feeding it to various classification models, I looked into various forms of sentiment analysis to further explore the differences between both subreddits. 

In this project, I experimented with vectorizers including CountVectorizer and TfidfVectorizer, as well as models including Logistic Regression, Random Forest, Graident Boosting, Naive Bayes and Support Vector Machines. Each of these models was evaluated mainly on their accuracy as well as f-score.

This project has been split into three separate notebooks:

<b>1. Introduction and Data Scraping (Current Notebook)</b><br>
[2. Data Cleaning & EDA](2_Data_Cleaning_EDA.ipynb)<br>
[3. Model Selection, Tuning & Insights](3_Model_Selection_Tuning_and_Insights.ipynb)

# Data Scraping

In this notebook, we'll look to extract text data from two different subreddits -- `r/MensLib` and `r/MensRights`. There are various methods for scraping data, but the most straightforward way is to access the <b>Application Programming Interface (API)</b> of a site and extract a <b>JavaScript Object notation (JSON)</b>, if available. This .JSON can be then read into Python as a dictionary using the json module.

For Reddit, each subreddit has it's own individual JSON that we can access. While this makes data scraping much easier, there are some limitations with using this method, namely that you can only scrape 1000 posts at a time. Off the bat, I knew that I wanted to track the volume of posts in each Reddit, which means that I would have to find another method to scrape that much data. Additionally, I wanted more infomation such as the number of upvotes and comments that each post had.

To accomplish this, I took a dual approach. I used the Python Reddit API Wrapper (PRAW) - a third party tool - to extract 1000 of the most popular posts from each subreddit and the PushShift API (PSAW) to extract all posts between Aug 1 and Nov 28.

This allowed me to capture the following information:
- All posts from each subreddit between Aug 1 - Nov 28
- 1000 top posts from each subreddit including:
    - Date of post
    - Title of post
    - Text within post
    - Number of upvotes
    - Upvote ratio
    - Number of comments
    - Type of post (moderator/non-moderator post)
    - Permalink
    - Author of post

## 1. Default Scraping with Subreddit .JSON

In [101]:
# Import libraries needed for webscraping
import requests
from datetime import datetime
import time
import random
import pandas as pd
from tqdm import tqdm

import praw
from psaw import PushshiftAPI

In [2]:
url = 'https://www.reddit.com/r/menslib.json'
res = requests.get(url, headers={'User-agent': 'Benjamin D'})

In [3]:
# Checking that there are no problems with accessing the .JSON
res.status_code

200

In [61]:
### IMPORT R/MENSLIB POSTS ###
url = 'https://www.reddit.com/r/menslib.json'
res = requests.get(url, headers={'User-agent': 'Benjamin D'})
print('STATUS CODE:', res.status_code)
posts = []
after = None

for a in range(4):
    if after == None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Benjamin D'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

STATUS CODE: 200
https://www.reddit.com/r/menslib.json
3
https://www.reddit.com/r/menslib.json?after=t3_jx84b1
3
https://www.reddit.com/r/menslib.json?after=t3_jumwsu
4
https://www.reddit.com/r/menslib.json?after=t3_jp4jj0
3


In [4]:
### IMPORT R/MENSRIGHTS POSTS ###
url = 'https://www.reddit.com/r/mensrights.json'
res = requests.get(url, headers={'User-agent': 'Benjamin D'})
print('STATUS CODE:', res.status_code)
posts = []
after = None

for a in range(4):
    if after == None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Benjamin D'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

STATUS CODE: 200
https://www.reddit.com/r/mensrights.json
6
https://www.reddit.com/r/mensrights.json?after=t3_k3cfz0
5
https://www.reddit.com/r/mensrights.json?after=t3_k2khx7
3
https://www.reddit.com/r/mensrights.json?after=t3_k2874v
6


## 2. Scraping with Python Reddit API Wrapper (PRAW)

In [102]:
# Using a placeholder for user and password here
reddit = praw.Reddit(client_id="EE1MnzB-y9M1-Q",
                     client_secret="7azXvqdMdT2aCiimpobPQYL1Jjvu7Q",
                     password="password",
                     user_agent="user_agent",
                     username="user")

In [103]:
# Check if able to access Reddit API
print(reddit.user.me())

Raynor77


In [104]:
# Documentation: https://praw.readthedocs.io/en/latest/code_overview/models/submission.html
def praw_scrapper(target, n_to_scrape):
    date = [] #.created_utc
    title = [] #.title
    is_self = [] #.is_self
    selftext = [] # .selftext 
    upvotes = [] #.score
    upvote_ratio = [] #.upvote_ratio
    n_comments = [] #.num_comments
    permalink = [] #.permalink
    author = [] #.author
    distinguished = [] #.distinguished
    
    for post in tqdm(reddit.subreddit(target).hot(limit=n_to_scrape)):
        date.append(str(datetime.fromtimestamp(post.created_utc)))
        title.append(post.title)
        upvotes.append(post.score)
        upvote_ratio.append(post.upvote_ratio)
        n_comments.append(post.num_comments)
        is_self.append(post.is_self)
        selftext.append(post.selftext)
        author.append(post.author)
        permalink.append(post.permalink)
        distinguished.append(post.distinguished)
        
    df = pd.DataFrame({'date':date,
                  'title':title,
                  'selftext':selftext,
                  'is_self':is_self,
                  'upvotes':upvotes,
                  'upvote_ratio':upvote_ratio,
                  'n_comments':n_comments,
                  'distinguished':distinguished,
                  'permalink':permalink,
                  'author':author})
    
    return df

In [105]:
# Scrape 1000 hot posts
menslib_df = praw_scrapper('menslib', 1000)

981it [00:20, 47.64it/s]


In [108]:
menslib_df.head(3)

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,distinguished,permalink,author
0,2020-11-20 07:19:28,Happy International Men's Day from MensLib,"Officially, the theme is ""better health for me...",True,323,0.99,28,,/r/MensLib/comments/jxdphk/happy_international...,delta_baryon
1,2020-12-04 14:00:22,Weekly Free Talk Friday Thread!,Welcome to our weekly Free Talk Friday thread!...,True,3,1.0,1,moderator,/r/MensLib/comments/k6f4uf/weekly_free_talk_fr...,MLModBot
2,2020-12-04 02:30:10,What Happens When Men and Women Reverse Dating...,**NOTE: IF YOU COMMENT HERE WITHOUT READING TH...,True,750,0.9,244,,/r/MensLib/comments/k6310t/what_happens_when_m...,TAKEitTOrCIRCLEJERK


In [107]:
# Convert menslib df to CSV
pd.DataFrame(menslib_df).to_csv('menslib.csv', index = False)

In [112]:
# Scrape 1000 hot posts
mensrights_df = praw_scrapper('mensrights', 1000)

779it [00:16, 47.19it/s]


In [113]:
mensrights_df.head(3)

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,distinguished,permalink,author
0,2020-11-23 03:22:47,How feminist shaming tactics needlessly cause ...,A crosspost from /r/LeftWingMaleAdvocates Bec...,True,404,0.97,165,,/r/MensRights/comments/jz1lzz/how_feminist_sha...,Forgetaboutthelonely
1,2020-11-21 12:00:33,A Non-Feminist FAQ,,False,140,0.98,16,,/r/MensRights/comments/jy47ns/a_nonfeminist_faq/,throwaway3745387r
2,2020-12-04 08:18:10,Facebook will no longer flag posts saying “Men...,,False,550,0.99,34,,/r/MensRights/comments/k69v93/facebook_will_no...,brutusdamius


In [114]:
# Convert menslib df to CSV
pd.DataFrame(mensrights_df).to_csv('mensrights.csv', index = False)

## 3. Scraping with PushShift (PSAW)

In [121]:
# Adapted from https://rareloot.medium.com/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563
def getPushshiftData(after, before, sub):
    url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' \
            +str(sub)+'&size=1000&after='+str(after)+'&before='+str(before)
    
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

In [122]:
def post_scrapper(data):
    date = [] #.created_utc
    title = [] #.title
    is_self = [] #.is_self
    selftext = [] # .selftext 
    upvotes = [] #.score
    upvote_ratio = [] #.upvote_ratio
    n_comments = [] #.num_comments
    permalink = [] #.permalink
    author = [] #.author
    
    for post in data:
        date.append(str(datetime.fromtimestamp(post['created_utc'])))
        title.append(post['title'])
        upvotes.append(post['score'])
        upvote_ratio.append(post['upvote_ratio'])
        n_comments.append(post['num_comments'])
        is_self.append(post['is_self'])
        try:
            selftext.append(post['selftext'])
        except:
            selftext.append('NA')
        author.append(post['author'])
        permalink.append(post['permalink'])
    
    df = pd.DataFrame({'date':date,
                  'title':title,
                  'selftext':selftext,
                  'is_self':is_self,
                  'upvotes':upvotes,
                  'upvote_ratio': upvote_ratio,
                  'n_comments':n_comments,
                  'permalink':permalink,
                  'author':author})
    
    return df

In [146]:
def parse_posts(after, before, sub):
    
    # Initialise list
    list_of_dfs = []
    data = getPushshiftData(after, before, sub)
    
    while len(data) > 0:
        current_df = post_scrapper(data)
        # Calls getPushshiftData() with the created date of the last submission
        print(len(data))
        print(str(datetime.fromtimestamp(data[-1]['created_utc'])))
        after = data[-1]['created_utc']
        data = getPushshiftData(after, before, sub)
        list_of_dfs.append(post_scrapper(data))
        
    return list_of_dfs

#### Extracting Posts from r/MensLib

In [148]:
# Setting date slightly further back to ensure all posts in timeframe are captured
menslib_psaw = parse_posts('1593561600', '1606521600', 'menslib')

https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1593561600&before=1606521600
100
2020-07-10 06:24:42
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1594333482&before=1606521600
100
2020-07-18 11:41:39
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1595043699&before=1606521600
100
2020-07-25 07:39:26
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1595633966&before=1606521600
100
2020-08-01 23:57:36
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1596297456&before=1606521600
100
2020-08-09 10:05:16
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1596938716&before=1606521600
100
2020-08-16 04:59:40
https://api.pushshift.io/reddit/search/submission/?subreddit=menslib&size=1000&after=1597525180&before=1606521600
100
2020-08-25 07:17:10
https://api.pushshift.io/reddit/se

In [153]:
menslib_vdf = pd.concat(menslib_psaw, ignore_index=True)

In [154]:
# Dropping entries before July 2017
menslib_vdf = menslib_vdf.drop(menslib_vdf[menslib_vdf['date'].str.contains('2020-07')].index)

In [155]:
menslib_vdf.head()

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,permalink,author
285,2020-08-01 00:06:49,Domestic violence double standard in Daily Mai...,[removed],1.0,1.0,1.0,36.0,/r/MensLib/comments/i1a5iv/domestic_violence_d...,avengahM
286,2020-08-01 01:17:53,I’ve notice some self projection from soically...,[removed],1.0,1.0,1.0,0.0,/r/MensLib/comments/i1bghz/ive_notice_some_sel...,Zero_The-Hero
287,2020-08-01 02:26:48,"Hey r/Menslib, what’s your experience with not...",A female friend of mine recently got the Ol’ S...,1.0,1.0,1.0,39.0,/r/MensLib/comments/i1cqcd/hey_rmenslib_whats_...,mandolin6648
288,2020-08-01 04:27:09,Why is Brie Larson painted as a Man Hater?,[removed],1.0,1.0,1.0,1.0,/r/MensLib/comments/i1ewcd/why_is_brie_larson_...,weeping_angel_fish
289,2020-08-01 04:51:04,Does anyone else have a messed up view of sex?...,[removed],1.0,1.0,1.0,1.0,/r/MensLib/comments/i1fbn3/does_anyone_else_ha...,fademeprobably


In [156]:
menslib_vdf.to_csv('menslib_vdf.csv', index=False)

#### Extracting Posts from r/MensRights

In [157]:
mensrights_psaw = parse_posts('1593561600', '1606521600', 'mensrights')

https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1593561600&before=1606521600
100
2020-07-02 15:28:59
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1593674939&before=1606521600
100
2020-07-03 19:17:35
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1593775055&before=1606521600
100
2020-07-05 03:03:59
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1593889439&before=1606521600
100
2020-07-06 21:55:23
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1594043723&before=1606521600
100
2020-07-08 02:49:28
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1594147768&before=1606521600
100
2020-07-09 15:21:29
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1594279289&before=1606521600
100
2020-07-10 23:35:09
https://api.p

100
2020-09-25 20:23:33
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601036613&before=1606521600
100
2020-09-27 06:43:38
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601160218&before=1606521600
100
2020-09-28 22:28:56
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601303336&before=1606521600
100
2020-09-30 06:34:07
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601418847&before=1606521600
100
2020-10-01 18:17:51
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601547471&before=1606521600
100
2020-10-03 04:07:49
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601669269&before=1606521600
100
2020-10-04 12:44:18
https://api.pushshift.io/reddit/search/submission/?subreddit=mensrights&size=1000&after=1601786658&before=1606521600
100
2020-10-0

In [158]:
mensrights_vdf = pd.concat(mensrights_psaw, ignore_index=True)

In [161]:
mensrights_vdf.head(3)

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,permalink,author
2246,2020-08-01 00:52:05,Lesbian women gets off for domestic abuse by c...,,0.0,1.0,1.0,42.0,/r/MensRights/comments/i1az48/lesbian_women_ge...,Egalitarianwhistle
2247,2020-08-01 00:53:32,The notion that only men can be pedophiles sho...,,0.0,1.0,1.0,5.0,/r/MensRights/comments/i1b01i/the_notion_that_...,VulgarGenes
2248,2020-08-01 01:09:01,If I see one more person saying shit like teac...,"I don't care if this gets removed, I just need...",1.0,1.0,0.99,115.0,/r/MensRights/comments/i1baf4/if_i_see_one_mor...,TheRealTgc


In [159]:
# Dropping entries before July 2017
mensrights_vdf = mensrights_vdf.drop(mensrights_vdf[mensrights_vdf['date'].str.contains('2020-07')].index)

In [160]:
mensrights_vdf.to_csv('mensrights_vdf.csv', index=False)