## Classifying Subreddit Posts

### Overview

**Goal:** Build a model that classifies Reddit posts into the subreddit they belong to.

### Methodology 

To do this, I will build and compare two classifiers: **Logistic Regression** and **Random Forest**. Each classifier applies natural language processing (NLP) to the text of each post to capture its characteristics (and, ideally, its context). From these, the classifier learns which subreddit a post belongs to.

Here is an overview of the steps taken to build each classifier:

Steps common to both classifiers:
- Pull posts from the subreddits being examined using Reddit's API
- Clean the gathered data to extract post content (text) and any other potentially identifying characteristics of each post
- NLP:
    - Tokenize & lemmatize/stem data
    - Vectorize data (CountVectorizer, HashingVectorizer, TF-IDF)
- Modelling (Logistic Regression, Random Forest)
- Evaluate Model (initial)
- Changes + hyperparameter tuning: GridSearch, possibly others
- Evaluate Model (final)
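
The vectorize-then-model steps above can be sketched roughly as follows. This is only an illustration, not the final setup: the titles and labels are made up, TF-IDF stands in for whichever vectorizer ends up being chosen, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are placeholder assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for post titles and their subreddit labels (made up for illustration).
titles = [
    "Study links vitamin D levels to lower cancer risk",
    "Researchers measure shrinking Sierra Nevada snow pack",
    "New phone ships with a faster processor",
    "Startup releases open source browser extension",
]
labels = ["science", "science", "technology", "technology"]

# One pipeline per classifier: vectorize the text, then fit the model.
models = {
    "logreg": make_pipeline(TfidfVectorizer(stop_words="english"),
                            LogisticRegression(max_iter=1000)),
    "forest": make_pipeline(TfidfVectorizer(stop_words="english"),
                            RandomForestClassifier(n_estimators=50, random_state=42)),
}

predictions = {}
for name, model in models.items():
    model.fit(titles, labels)
    predictions[name] = list(model.predict(["Scientists publish new climate study"]))
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on training data only, which matters once cross-validation or GridSearch is added.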


Logistic Regression:

Random Forest:


### Data Gathering

In [2]:
# Import Libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

import json
import requests
import time
import datetime
from bs4 import BeautifulSoup

%matplotlib inline
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"} # For dark themed j-notebooks

In [4]:
# Pulling data from r/showerthoughts

url = 'https://www.reddit.com/r/showerthoughts.json'
agent = {'User-agent': 'redditter04'}
res = requests.get(url, headers=agent)

In [5]:
res.status_code

200

In [8]:
shower_json = res.json()

In [10]:
# Examine json file
sorted(shower_json.keys())

['data', 'kind']

In [26]:
sorted(shower_json['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [None]:
shower_json['data']

In [24]:
# The 'children' key stores each post. Note that the first post is a "welcome" post - this will
# be excluded from the analysis.

len(shower_json['data']['children'])
# There are 25 posts (excluding the welcome post) per request

26

In [55]:
type(shower_json['data']['children'][0]['data'])

dict

In [86]:
# Pulling posts from subreddits

## Set timer
t0 = time.time()

# Define parameters used in the for-loop below
# Other subreddits: 'technology/', 'fitness/', 'sports/', 'showerthoughts/', 'mildlyinteresting/'
# Other filters: 'controversial', 'top', 'rising'
subreddits = ['science/']
filters = ['new'] # The filter that returns the most posts
for subreddit in subreddits:
    posts = []
    print('Pulling posts from:', subreddit, '...')
    print()
    for filter in filters: 
        print('Pulling filter:', filter, '...')
        print()
        url = 'https://www.reddit.com/r/' + subreddit + filter + '.json'
        after = None
        agent = {'User-agent': 'red_bk'}
        num_requests = 100 # Number of times to request posts from reddit's API
        for i in range(num_requests):
            if i % 10 == 0:
                print('Requests made:', i)
            if after is None:
                params = {} # first request: no 'after' token yet
            else:
                params = {'after': after} # 'after' is the tag of the last post pulled
            res = requests.get(url, headers=agent, params=params)
            if res.status_code == 200:
                the_json = res.json()
                posts.extend(the_json['data']['children'])
                after = the_json['data']['after']
                if i % 5 == 0:
                    agent['User-agent'] = 'red_bk' + str(i)
            else:
                print('Error:', res.status_code)
                break
    # Convert the list of post dictionaries to a DataFrame
    df_rising = pd.DataFrame(posts)
#     df.to_csv('./data/' + subreddit[:-1] + '_' + time.strftime('%Y-%m-%d-%I%p'), index=False)
    time.sleep(1)

time_elapsed_secs = time.time() - t0 # time elapsed in seconds
print()
print('Num of Posts Collected (last subreddit):', len(posts))
print('Time Elapsed:', datetime.timedelta(seconds=time_elapsed_secs))
print('Code ended at:', time.strftime('%Y-%m-%d-%I%p:%M'))

Pulling posts from: science/ ...

Pulling filter: rising ...

Requests made: 0
Requests made: 10
Requests made: 20
Requests made: 30
Requests made: 40
Requests made: 50
Requests made: 60
Requests made: 70
Requests made: 80
Requests made: 90

Num of Posts Collected (last subreddit): 200
Time Elapsed: 0:00:19.205557
Code ended at: 2018-12-17-01PM:51
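
The pagination logic in the loop above can be isolated into a small helper, sketched below. The helper name `collect_posts` and the `fetch_page` callable are hypothetical; in the notebook, `fetch_page` would wrap `requests.get(url, headers=agent, params=params).json()`. The sketch also assumes that the listing's `after` value comes back as null on the last page, in which case stopping there avoids restarting from the first page and collecting duplicates.

```python
def collect_posts(fetch_page, max_requests=100):
    """Follow Reddit-style 'after' tokens until the listing ends or the request cap is hit."""
    posts, after = [], None
    for _ in range(max_requests):
        params = {} if after is None else {"after": after}
        listing = fetch_page(params)["data"]
        posts.extend(listing["children"])
        after = listing["after"]
        if after is None:  # no further pages (assumed end-of-listing signal)
            break
    return posts


# Made-up pages standing in for the JSON returned by the API.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                 {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}],
                      "after": None}},
}


def fake_fetch(params):
    """Hypothetical stand-in for a real request; looks up a canned page."""
    return FAKE_PAGES[params.get("after")]


posts = collect_posts(fake_fetch)
names = [p["data"]["name"] for p in posts]
print(names)
```

Passing the fetch function in as an argument keeps the paging logic testable without touching the network.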


In [52]:
len(posts)

2437

In [88]:
# shower_df = pd.read_csv('./data/showerthoughts2018-12-16-04PM')
science_df = pd.read_csv('./data/science_2018-12-17-01PM')
# tech_df = pd.read_csv('./data/technology2018-12-17-12PM')
# fit_df = pd.read_csv('./data/fitness2018-12-16-04PM')
# sports_df = pd.read_csv('./data/sports2018-12-16-04PM')
# mild_df = pd.read_csv('./data/mildlyinteresting2018-12-16-04PM')

In [96]:
list_of_dfs = [science_df, df_top, df_new, df_cont, df_rising, df_test]

In [89]:
# Convert dictionaries stored as strings back to dictionaries
# for df in list_of_dfs:
science_df['data'] = science_df['data'].map(lambda x: eval(x))
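
As a possible alternative to `eval`, the standard library's `ast.literal_eval` parses only Python literals (dicts, lists, strings, numbers, `None`, booleans), which is safer when the strings come from an external source. A minimal sketch, using a shortened, made-up post string:

```python
import ast

# A shortened, made-up example of how one post looks after a CSV round trip.
raw = "{'approved_at_utc': None, 'subreddit': 'science', 'num_comments': 85, 'is_self': False}"

post = ast.literal_eval(raw)  # parses literals only; does not execute arbitrary code
print(type(post).__name__, post["subreddit"])
```

In the notebook this would presumably become `science_df['data'].map(ast.literal_eval)`, assuming every stored string is a plain dictionary literal.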

In [21]:
# Check type of dict entry
type(science_df['data'][0])

dict

In [None]:
science_df.drop(columns='post_name', inplace=True)
df_top.drop(columns='post_name', inplace=True)
df_new.drop(columns='post_name', inplace=True)

In [97]:
# Num of unique posts in each subreddit (using 'name' for each post as an identifier)

for df in list_of_dfs:
    sr_name = df['data'].iloc[0]['subreddit']  # subreddit name taken from the first post
    df['post_name'] = df['data'].map(lambda x: x['name'])
    print(sr_name, 'Num of Unique Posts:', len(set(df['post_name'])))

science Num of Unique Posts: 704
science Num of Unique Posts: 33
science Num of Unique Posts: 868
science Num of Unique Posts: 33
science Num of Unique Posts: 2
science Num of Unique Posts: 868
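
Since several pulls return overlapping posts, one way to combine them is to keep only the first occurrence of each `name`. The helper below is a hypothetical sketch on made-up post dictionaries shaped like the entries under `children`:

```python
def dedupe_by_name(posts):
    """Keep the first occurrence of each post, keyed on its 'name' field."""
    seen = set()
    unique = []
    for post in posts:
        name = post["data"]["name"]
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Made-up pull in which the first post appears twice.
pulled = [
    {"data": {"name": "t3_a6tbhu"}},
    {"data": {"name": "t3_a6ocrh"}},
    {"data": {"name": "t3_a6tbhu"}},
]
unique_posts = dedupe_by_name(pulled)
print(len(pulled), "->", len(unique_posts))
```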


In [92]:
df_test = pd.concat([science_df, df_top, df_new, df_rising, df_cont])

In [75]:
df_test.head()

Unnamed: 0,data,kind
0,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3
1,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3
2,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3
3,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3
4,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3


### Data Cleaning

Let's begin by cleaning posts from the r/science and r/technology subreddits. These subreddits are relatively similar in terms of content.

In [None]:
# Separate the dictionary entries for each post into individual columns 

# r/science df

dict_keys = sorted(eval(science_df['data'][0]).keys())

In [144]:
# Check number of keys (i.e. data fields) for each post
science_df['data'].map(lambda x: len(eval(x).keys())).value_counts()

# Results show that roughly 13% of posts (316 of 2,479) have 94 keys, 2 fewer than the
# majority (96 keys), while very few posts have any other number of keys.

96    2147
94     316
92       6
97       4
99       3
90       3
Name: data, dtype: int64

In [130]:
# List of data fields within each post
print(dict_keys)

['approved_at_utc', 'approved_by', 'archived', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category', 'clicked', 'content_categories', 'contest_mode', 'created', 'created_utc', 'distinguished', 'domain', 'downs', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'likes', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'mod_note', 'mod_reason_by', 'mod_reason_title', 'mod_reports', 'name', 'no_follow', 'num_comments', 'num_crossposts', 'num_rep

Inspecting the data fields above shows that many of them hold admin/formatting data (e.g. "author_flair_background_color", "is_robot_indexable", etc.) rather than information about the content of the post that might indicate which subreddit it belongs to. Thus, I will select a subset of these data fields to examine in further detail.

In [145]:
# Selected data fields

post_fields = ['approved_by', 'author', 'category', 'content_categories', 'created', 'domain', 
              'likes', 'media', 'name', 'num_comments', 'num_crossposts', 'num_reports','selftext',
               'subreddit', 'title', 'wls']

In [146]:
for field in post_fields:
    science_df[field] = science_df['data'].map(lambda x: eval(x)[field])
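
Because the number of keys varies between posts (see the counts above), indexing with `[field]` could raise a `KeyError` if a selected field happens to be absent. A more defensive variant, sketched here with a hypothetical helper `extract_fields` and a made-up post, uses `dict.get` so that absent fields become `None`:

```python
def extract_fields(post, fields):
    """Return the selected fields from one post dict, using None for absent keys."""
    return {field: post.get(field) for field in fields}


post_fields = ["author", "domain", "title", "wls"]

# Made-up post that lacks the 'wls' key, to mimic posts with fewer data fields.
post = {"author": "mvea", "domain": "psypost.org", "title": "Example title"}

row = extract_fields(post, post_fields)
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame`, which also means each post string only needs to be parsed once rather than once per field.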

Let's examine some columns in more detail:

In [150]:
science_df.head(10)

Unnamed: 0,data,kind,approved_by,author,category,content_categories,created,domain,likes,media,name,num_comments,num_crossposts,num_reports,selftext,subreddit,title,wls
0,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,Wagamaga,,,1545026000.0,irishcentral.com,,,t3_a6tbhu,85,1,,,science,Healthy levels of Vitamin D are linked to a 75...,6
1,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,mvea,,,1544988000.0,news.psu.edu,,,t3_a6ocrh,916,3,,,science,People who met and became acquainted with at l...,6
2,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,avogadros_number,,,1545021000.0,e360.yale.edu,,,t3_a6smhu,24,0,,,science,Nearly 70% of infrastructure in the Arctic - i...,6
3,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,Wagamaga,,,1545017000.0,eastbaytimes.com,,,t3_a6rxe0,11,4,,,science,Sierra Nevada snow pack on track to shrink 79 ...,6
4,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,mvea,,,1545023000.0,psypost.org,,,t3_a6st6e,24,1,,,science,Men tend to perceive both polygyny — in which ...,6
5,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,p1percub,,,1545003000.0,self.science,,,t3_a6pvdj,70,0,,"#Thank you, readers of r/science, for helping ...",science,r/science has reached 20M subscribers! To cele...,6
6,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,mvea,,,1544966000.0,psypost.org,,,t3_a6mi7e,35,1,,,science,"The development of a kind, caring, and warm at...",6
7,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,mvea,,,1544936000.0,psypost.org,,,t3_a6ipw7,2235,7,,,science,"Being in a committed relationship, having excl...",6
8,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,The_Old_Wise_One,,,1545002000.0,jasoncollins.blog,,,t3_a6pp34,23,1,,,science,"Multi-lab, high-powered replication of widespr...",6
9,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,,jq1984_is_me,,,1544925000.0,liebertpub.com,,,t3_a6h7fc,337,1,,,science,Breastfeeding Greater Than 6 Months Is Associa...,6


In [None]:
science_df.tail(10)

In [155]:
# Check for null values within columns in df
for col in science_df.columns[2:]:
    print(col, science_df[col].isnull().sum())
    
# Columns in which every value is null (2,479 of 2,479 rows) will be removed

approved_by 2479
author 0
category 2479
content_categories 2479
created 0
domain 0
likes 2479
media 2479
name 0
num_comments 0
num_crossposts 0
num_reports 2479
selftext 0
subreddit 0
title 0
wls 0
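
The fully null columns could also be identified programmatically instead of being read off the printout. A small sketch on a made-up frame (the column values are illustrative only):

```python
import pandas as pd

# Tiny made-up frame: 'likes' is entirely null, 'selftext' only partly.
df = pd.DataFrame({
    "author": ["mvea", "Wagamaga"],
    "likes": [None, None],
    "selftext": ["", None],
})

null_counts = df.isnull().sum()
# Columns whose null count equals the number of rows carry no information.
all_null = null_counts[null_counts == len(df)].index.tolist()
cleaned = df.drop(columns=all_null)
print(all_null, list(cleaned.columns))
```

`df.dropna(axis=1, how='all')` should give the same result in one call; the explicit version just makes the list of dropped columns visible.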


In [158]:
# Drop null columns and original data columns
science_df.drop(columns=['approved_by', 'category', 'content_categories', 'likes', 'media',
                         'num_reports', 'data', 'kind'], inplace=True)

In [None]:
# Most values in column 'selftext' appear to be empty strings, and the cells with values don't 
# appear to be good candidates for feature selection. Thus, let's remove 'selftext' as well.
science_df['selftext']

In [167]:
science_df.drop(columns='selftext', inplace=True)

In [174]:
# Below is the draft list of candidates for feature selection for the r/science subreddit.
# Index.drop returns a new Index, so the result must be assigned back.
features_list = science_df.columns.drop('subreddit')
features_list

Index(['author', 'created', 'domain', 'name', 'num_comments', 'num_crossposts',
       'title', 'wls'],
      dtype='object')

In [175]:
# The target value is the column with the subreddit's name
target_field = 'subreddit'

In [176]:
science_df.head(10)

Unnamed: 0,author,created,domain,name,num_comments,num_crossposts,subreddit,title,wls
0,Wagamaga,1545026000.0,irishcentral.com,t3_a6tbhu,85,1,science,Healthy levels of Vitamin D are linked to a 75...,6
1,mvea,1544988000.0,news.psu.edu,t3_a6ocrh,916,3,science,People who met and became acquainted with at l...,6
2,avogadros_number,1545021000.0,e360.yale.edu,t3_a6smhu,24,0,science,Nearly 70% of infrastructure in the Arctic - i...,6
3,Wagamaga,1545017000.0,eastbaytimes.com,t3_a6rxe0,11,4,science,Sierra Nevada snow pack on track to shrink 79 ...,6
4,mvea,1545023000.0,psypost.org,t3_a6st6e,24,1,science,Men tend to perceive both polygyny — in which ...,6
5,p1percub,1545003000.0,self.science,t3_a6pvdj,70,0,science,r/science has reached 20M subscribers! To cele...,6
6,mvea,1544966000.0,psypost.org,t3_a6mi7e,35,1,science,"The development of a kind, caring, and warm at...",6
7,mvea,1544936000.0,psypost.org,t3_a6ipw7,2235,7,science,"Being in a committed relationship, having excl...",6
8,The_Old_Wise_One,1545002000.0,jasoncollins.blog,t3_a6pp34,23,1,science,"Multi-lab, high-powered replication of widespr...",6
9,jq1984_is_me,1544925000.0,liebertpub.com,t3_a6h7fc,337,1,science,Breastfeeding Greater Than 6 Months Is Associa...,6


_Along with 'title', the 'author' and 'domain' columns may be good features to help our classifier predict which subreddit a post belongs to._

#### Classifier 1: Logistic Regression

### Resources

Total number of posts by subreddit as of June, 2017: https://gist.github.com/anonymous/ef075ee973dd5f883ae17729c147c1de