# Classifying Subreddit Posts <br>

_**Author:** Bala Krishnamoorthy_

### Overview <br>

**Goal:** Build a model that classifies Reddit posts by the subreddit they belong to. <br>

**Note:** Throughout this workbook, you'll find my comments/insights on the workflow in _italics_. <br>

### Methodology <br>

To do this, I will build and compare two classifiers: **Logistic Regression** and **RandomForest**. Each classifier will rely on natural language processing (NLP) on the text within each post to better understand the characteristics (and ideally context) of the post. By doing so, the classifier will learn which post belongs in which subreddit. 

Here is an overview of the steps taken to build each classifier: <br>

Steps common to both classifiers: 
- Pull posts from subreddits being examined using Reddit's API
- Clean gathered data to extract post content (text), and any other potential identifying characteristics within each post. <br>
- NLP: <br>
    - Tokenize & lemmatize/stem data
    - Vectorize data (CountVectorizer, HashingVectorizer, TF-IDF)
- Modelling (Logistic Regression, Random Forest)
- Evaluate Model (initial)
- Changes + Hyperparameter tuning: GridSearch, others? <br>
- Evaluate Model (final)
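
To make the tokenize and vectorize steps above concrete, here is a minimal standard-library sketch of a bag-of-words count matrix. It only illustrates the idea behind `CountVectorizer`; it is not scikit-learn's implementation, and the `tokenize` and `bag_of_words` helpers and the two sample titles are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, rows); each row holds the per-term counts for one doc."""
    token_lists = [tokenize(doc) for doc in docs]
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    rows = []
    for toks in token_lists:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical titles, used only to show the shape of the output.
titles = [
    "Saturn's rings may disappear",
    "Rings of Saturn: new study",
]
vocab, matrix = bag_of_words(titles)
print(vocab)
print(matrix)
```

Each column is one vocabulary term and each row is one title, which is the kind of input a classifier such as logistic regression can be trained on.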


Logistic Regression: <br>

Random Forest: <br> 


### Data Gathering

#### Import Libraries

In [208]:
## Data Manipulation
import numpy as np
import pandas as pd
import regex as re

## Machine Learning: sklearn
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words module is not importable in newer sklearn releases
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, VotingClassifier)

# Machine Learning: nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

## JSON Manipulation / API Access
import json
import requests
import time
import datetime
from bs4 import BeautifulSoup

## Plotting
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"} # For dark themed j-notebooks

In [None]:
# # Pulling posts from subreddits

# ## Set timer
# t0 = time.time()

# # Define parameters used in for-loop below
# # Subreddits: 'technology/', 'fitness/', 'sports/', 'showerthoughts/', 'mildlyinteresting/'
# # 'controversial', 'top', 'rising'
# subreddits = ['science/', 'technology/', 'fitness/', 'sports/', 'showerthoughts/', 
#               'mildlyinteresting/']
# filters = ['new'] 
# # The 'new' filter exposes the largest number of posts. The default subreddit url
# # brings up the 'hot' filter within the subreddit. As part of this analysis, I found that reddit
# # splits all 'new' posts into 'hot', 'top', 'controversial' and 'rising'.
# for subreddit in subreddits:
#     posts = []
#     print('Pulling posts from:', subreddit, '...')
#     print()
#     for filter in filters: 
#         print()
#         print('Pulling filter:', filter, '...')
#         print()
#         url = 'https://www.reddit.com/r/' + subreddit + filter + '.json'
#         after = None
#         agent = {'User-agent': 'red_bk'}
#         num_requests = 100 # Number of times to request posts from reddit's API
#         for i in range(num_requests):
#             if i % 10 == 0:
#                 print('Requests made:', i)
#             if after is None:
#                 params = {} # this dict represents the unique tag that tells us the last post pulled
#             else:
#                 params = {'after': after}
#             res = requests.get(url, headers=agent, params=params)
#             if res.status_code == 200:
#                 the_json = res.json()
#                 posts.extend(the_json['data']['children'])
#                 after = the_json['data']['after']
#                 if i % 5 == 0:
#                     agent['User-agent'] = 'red_bk' + str(i)
#             else:
#                 print('Error:', res.status_code)
#                 break
#         # Convert subreddit list of dictionaries to posts
#     df = pd.DataFrame(posts)
#     df.to_csv('./data/' + subreddit[:-1] + '_' + time.strftime('%Y-%m-%d-%I%p'), index=False)
#     time.sleep(1)

# time_elapsed_secs = time.time() - t0 # time elapsed in seconds
# print()
# print('Num of Posts Collected (last subreddit):', len(posts))
# print('Time Elapsed:', datetime.timedelta(seconds=time_elapsed_secs))
# print('Code ended at:', time.strftime('%Y-%m-%d-%I%p:%Mmins'))
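
The paging logic in the commented-out cell above can be sketched in isolation. This is a minimal sketch, assuming the listing reports `after` as null on its last page; `collect_posts`, `fetch_page`, and `fake_pages` are hypothetical names, and the fake pages stand in for the network call so the example runs offline.

```python
def collect_posts(fetch_page, max_requests=100):
    """Follow Reddit-style 'after' tokens until the listing ends or the cap is hit.

    fetch_page is a hypothetical callable: it takes the current 'after' token
    (None for the first page) and returns the decoded JSON of one listing page.
    """
    posts, seen, after = [], set(), None
    for _ in range(max_requests):
        data = fetch_page(after)['data']
        for child in data['children']:
            name = child['data']['name']
            if name not in seen:  # skip posts that were already collected
                seen.add(name)
                posts.append(child)
        after = data['after']
        if after is None:  # no further pages to request
            break
    return posts


# Stand-in for the network call: two fake pages with one overlapping post.
fake_pages = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_b'}},
                                   {'data': {'name': 't3_c'}}],
                      'after': None}},
}
posts = collect_posts(lambda after: fake_pages[after])
names = [p['data']['name'] for p in posts]
print(names)
```

Stopping when `after` comes back empty, and skipping names already seen, avoids collecting the same posts twice when the listing runs out.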

### Import Gathered Data

In [3]:
shower_df = pd.read_csv('./data/showerthoughts_2018-12-17-03PM')
science_df = pd.read_csv('./data/science_2018-12-17-03PM')
tech_df = pd.read_csv('./data/technology_2018-12-17-03PM')
fit_df = pd.read_csv('./data/fitness_2018-12-17-03PM')
sports_df = pd.read_csv('./data/sports_2018-12-17-03PM')
mild_df = pd.read_csv('./data/mildlyinteresting_2018-12-17-03PM')

In [4]:
list_of_dfs = [science_df, tech_df, fit_df, sports_df, mild_df, shower_df]

In [5]:
# Convert dictionaries stored as strings back to dictionaries (literal_eval is safer than eval)
import ast
for df in list_of_dfs:
    df['data'] = df['data'].map(ast.literal_eval)
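
A minimal sketch of parsing one stringified post with `ast.literal_eval` from the standard library, which accepts only Python literals and so cannot run arbitrary code the way `eval` can. The `raw` string is a hypothetical value shaped like the entries in the 'data' column.

```python
import ast

# Hypothetical stringified post, shaped like the values in the 'data' column.
raw = "{'approved_at_utc': None, 'subreddit': 'science', 'num_comments': 3}"
post = ast.literal_eval(raw)
print(type(post), post['subreddit'])

# literal_eval rejects anything that is not a plain literal, such as a call.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print('Call expression rejected:', rejected)
```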

In [21]:
# Check type of dict entry
type(science_df['data'][0])

dict

In [6]:
# Num of unique posts in each subreddit (using 'name' for each post as an identifier)

for df in list_of_dfs:
    sr_name = df['data'][0]['subreddit']
    df['post_name'] = df['data'].map(lambda x: x['name'])
    print(sr_name, 'Num of Unique Posts:', len(set(df['post_name'])))

science Num of Unique Posts: 870
technology Num of Unique Posts: 536
Fitness Num of Unique Posts: 914
sports Num of Unique Posts: 498
mildlyinteresting Num of Unique Posts: 998
Showerthoughts Num of Unique Posts: 998


In [8]:
science_df.head()

Unnamed: 0,data,kind,post_name
0,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74sy0
1,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74lvx
2,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74jhp
3,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74bl1
4,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74aro


In [11]:
# Each df can contain duplicate posts: once reddit has served everything it lists under "new",
# further requests return posts that were already collected.
for df in list_of_dfs:
    df.drop_duplicates(subset='post_name', inplace=True)

In [12]:
# Check to make sure duplicates were dropped correctly
for df in list_of_dfs:
    sr_name = df['data'][0]['subreddit']
    print(sr_name, 'Rows:', df.shape[0])

science Rows: 870
technology Rows: 536
Fitness Rows: 914
sports Rows: 498
mildlyinteresting Rows: 998
Showerthoughts Rows: 998


_Subreddits to compare: r/science and r/technology_ <br>
- _The two are relatively similar, but science focuses on academic research while technology is more news-oriented._

In [14]:
# Combine science and technology dfs
df_st = pd.concat([science_df, tech_df], axis=0)

In [35]:
# Reset index range after combining dfs 
df_st.reset_index(inplace=True)

In [36]:
df_st.index # Check index range

RangeIndex(start=0, stop=1406, step=1)

### Data Cleaning

Let's begin by cleaning posts from the r/science and r/technology subreddits. These subreddits are relatively similar in terms of content.

In [38]:
# Separate the dictionary entries for each post into individual columns 

# Combined df

dict_keys = sorted(df_st['data'][0].keys())

In [39]:
# Check number of keys (i.e. data fields) for each post
(df_st['data'].map(lambda x: len(x.keys()))).value_counts(normalize=True)

# Most posts have a similar number of keys. The keys missing in certain posts are unlikely to
# be significant features for this analysis

96    0.540541
92    0.369844
94    0.086771
99    0.000711
97    0.000711
93    0.000711
90    0.000711
Name: data, dtype: float64

In [40]:
# List of data fields within each post
print(dict_keys)

['approved_at_utc', 'approved_by', 'archived', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category', 'clicked', 'content_categories', 'contest_mode', 'created', 'created_utc', 'distinguished', 'domain', 'downs', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'likes', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'mod_note', 'mod_reason_by', 'mod_reason_title', 'mod_reports', 'name', 'no_follow', 'num_comments', 'num_crossposts', 'num_rep

Inspecting the data fields above shows that many of them hold admin/formatting data (e.g. "author_flair_background_color", "is_robot_indexable", etc.) rather than information about the content of the post that might indicate which subreddit it belongs to. Thus, I will select a subset of these data fields to examine in further detail.

In [41]:
# Selected data fields

post_fields = ['approved_by', 'author', 'category', 'content_categories', 'created', 'domain', 
              'likes', 'media', 'name', 'num_comments', 'num_crossposts', 'num_reports','selftext',
               'subreddit', 'title', 'wls']

In [43]:
for field in post_fields:
    df_st[field] = df_st['data'].map(lambda x: x[field])

Let's examine the selected columns in more detail:

In [45]:
df_st.head()

Unnamed: 0,index,data,kind,post_name,approved_by,author,category,content_categories,created,domain,likes,media,name,num_comments,num_crossposts,num_reports,selftext,subreddit,title,wls
0,0,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74sy0,,Wagamaga,,,1545115000.0,news.vcu.edu,,,t3_a74sy0,3,0,,,science,Children of parents who have alcohol use disor...,6
1,1,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74lvx,,cromatron,,,1545114000.0,gizmodo.com,,,t3_a74lvx,3,0,,,science,"Astronomers Just Discovered 'Farout,' the Most...",6
2,2,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74jhp,,ididntwin,,,1545113000.0,nature.com,,,t3_a74jhp,3,0,,,science,A novel and safe small molecule enhances hair ...,6
3,3,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74bl1,,Wagamaga,,,1545112000.0,ldi.upenn.edu,,,t3_a74bl1,14,0,,,science,"A study of 500 U.S. hospitals, and from the pe...",6
4,4,"{'approved_at_utc': None, 'subreddit': 'scienc...",t3,t3_a74aro,,Edude60,,,1545112000.0,nasa.gov,,,t3_a74aro,5,0,,,science,Saturn's Rings May Disappear in 100 million years,6


In [46]:
df_st.tail()

Unnamed: 0,index,data,kind,post_name,approved_by,author,category,content_categories,created,domain,likes,media,name,num_comments,num_crossposts,num_reports,selftext,subreddit,title,wls
1401,531,"{'approved_at_utc': None, 'subreddit': 'techno...",t3,t3_a4wyyf,,khayrirrw,,,1544487000.0,nytimes.com,,,t3_a4wyyf,4,0,,,technology,Chinese Court Says Apple Infringed on Qualcomm...,6
1402,532,"{'approved_at_utc': None, 'subreddit': 'techno...",t3,t3_a4wsh4,,chrisarchitect,,,1544486000.0,reuters.com,,,t3_a4wsh4,8,0,,,technology,China says rejecting physical cash is illegal ...,6
1403,533,"{'approved_at_utc': None, 'subreddit': 'techno...",t3,t3_a4wgsz,,swingadmin,,,1544484000.0,arstechnica.com,,,t3_a4wgsz,17,1,,,technology,Elon Musk makes mockery of SEC settlement in 6...,6
1404,534,"{'approved_at_utc': None, 'subreddit': 'techno...",t3,t3_a4wc4a,,speckz,,,1544483000.0,techdirt.com,,,t3_a4wc4a,8,0,,,technology,AT&amp;T Finds Yet Another Way To Nickel-And-D...,6
1405,535,"{'approved_at_utc': None, 'subreddit': 'techno...",t3,t3_a4w8rd,,spsheridan,,,1544482000.0,venturebeat.com,,,t3_a4w8rd,1,0,,,technology,Qualcomm wins iPhone import and sales ban in C...,6


In [47]:
# Check for null values within columns in df
for col in df_st.columns[2:]:
    print(col, df_st[col].isnull().sum())
    
# Columns with null values will be removed

kind 0
post_name 0
approved_by 1406
author 0
category 1406
content_categories 1406
created 0
domain 0
likes 1406
media 1406
name 0
num_comments 0
num_crossposts 0
num_reports 1406
selftext 0
subreddit 0
title 0
wls 0


In [49]:
# Drop the all-null columns, along with 'kind'
df_st.drop(columns=['approved_by', 'category', 'content_categories', 'likes', 'media',
                    'num_reports', 'kind'], inplace=True)

In [None]:
# Most values in column 'selftext' appear to be empty strings, and the cells with values don't 
# appear to be good candidates for feature selection. Thus, let's remove 'selftext' as well.
df_st['selftext']

In [51]:
df_st.drop(columns='selftext', inplace=True)

In [52]:
df_st.columns

Index(['index', 'data', 'post_name', 'author', 'created', 'domain', 'name',
       'num_comments', 'num_crossposts', 'subreddit', 'title', 'wls'],
      dtype='object')

_Along with 'title', 'author' and 'domain' may be good features to help our classifier predict which subreddit a post belongs to. However, I will begin by running NLP on the post's 'title' alone._

In [149]:
# Use the post title as the only feature for now (r/science and r/technology combined).
X = df_st['title']
y = df_st['subreddit'] # The target value is the column with the subreddit's name

In [150]:
X.head()

0    Children of parents who have alcohol use disor...
1    Astronomers Just Discovered 'Farout,' the Most...
2    A novel and safe small molecule enhances hair ...
3    A study of 500 U.S. hospitals, and from the pe...
4    Saturn's Rings May Disappear in 100 million years
Name: title, dtype: object

In [151]:
y.value_counts()

science       870
technology    536
Name: subreddit, dtype: int64

In [152]:
# Encode the target classes
y = y.map({'science': 0, 'technology': 1})

In [153]:
y.value_counts(normalize=True)

0    0.618777
1    0.381223
Name: subreddit, dtype: float64

_For this dataset, the baseline accuracy is 61.9%: the score of a model that always predicts the majority class (r/science). This is the accuracy my models must beat._
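
As a quick check of that figure, the baseline can be recomputed from the class counts reported by `y.value_counts()`. This is a minimal standard-library sketch; the variable names are illustrative.

```python
from collections import Counter

# Class counts taken from the y.value_counts() output in this notebook.
class_counts = Counter({'science': 870, 'technology': 536})

# A model that always predicts the most common class is right this often.
majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(class_counts.values())
print(majority_class, round(baseline_accuracy, 3))  # science 0.619
```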

### Modelling with NLP

In [177]:
# Train/Test Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

In [None]:
# Instantiate Vectorizers

count_vect = CountVectorizer() # CountVectorizer requires a Series as an input, *not* df.
tfidf = TfidfVectorizer()

In [189]:
# Instantiate Model
# liblinear supports both the 'l1' and 'l2' penalties searched in Pipeline 2
logreg = LogisticRegression(random_state=42, solver='liblinear')

#### Classifier 1: Logistic Regression

**Pipeline 1: CountVectorizer + LogisticRegression**

In [182]:
# Set up Pipeline 1
pipe1 = Pipeline([
    ('count_vect', count_vect),
    ('logreg', logreg)  
])

In [214]:
# Tune parameters and evaluate model
params = {
    'count_vect__stop_words': ['english'],
    'count_vect__max_features': [None, 6000],
    'count_vect__ngram_range': [(1,3), (1,4)],
#     'logreg__C': [0.6, 0.8, 1.0] # C = 1.0 was the best parameter
}

gs = GridSearchCV(pipe1, param_grid=params, cv=5)
gs.fit(X_train, y_train)
print('Best Training Set Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test Set Score:', gs.score(X_test, y_test))



Best Training Set Score: 0.9193548387096774
Best Parameters: {'count_vect__max_features': None, 'count_vect__ngram_range': (1, 3), 'count_vect__stop_words': 'english'}
Test Set Score: 0.9147727272727273


_The first pipeline (vectorizer + model) performs quite well relative to the baseline accuracy: roughly 92% cross-validated accuracy on the training set and 91% on the test set, versus 62%. In other words, the model predicts the right subreddit (in this case, r/science or r/technology) about 9 times out of 10._ <br> <br>
_Let's see what we can do to improve model performance._ 
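
Accuracy alone does not show which subreddit the errors fall on, and the classes here are imbalanced. One possible next step is to break predictions down into a confusion matrix. The sketch below does this with the standard library on hypothetical labels; they are not results from this notebook, and `confusion_counts` is an illustrative helper rather than a scikit-learn function.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts['tp' if truth == positive else 'fp'] += 1
        else:
            counts['fn' if truth == positive else 'tn'] += 1
    return counts


# Hypothetical labels (0 = science, 1 = technology), for illustration only.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

c = confusion_counts(y_true, y_pred)
accuracy = (c['tp'] + c['tn']) / len(y_true)
recall_technology = c['tp'] / (c['tp'] + c['fn'])
print(dict(c), accuracy, recall_technology)
```

On real predictions, the same breakdown would show whether the minority class (r/technology) is misclassified more often than r/science.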

**Pipeline 2: Tfidf + Logistic Regression**

In [185]:
# Set up Pipeline 2
pipe2 = Pipeline([
    ('tfidf', tfidf),
    ('logreg', logreg)  
])

In [192]:
# Tune parameters and evaluate model
params = {
    'tfidf__stop_words': [None, 'english'],
    'tfidf__max_features': [None, 5000, 6000],
    'logreg__penalty': ['l1', 'l2']
}

gs = GridSearchCV(pipe2, param_grid=params, cv=3)
gs.fit(X_train, y_train)
print('Best Training Set Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test Set Score:', gs.score(X_test, y_test))



Best Training Set Score: 0.8425047438330171
Best Parameters: {'logreg__penalty': 'l2', 'tfidf__max_features': None, 'tfidf__stop_words': None}
Test Set Score: 0.8721590909090909


_With logistic regression, CountVectorizer yields better results than the TFIDF vectorizer._
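
For context on what differs between the two vectorizers: TF-IDF scales each term count by an inverse document frequency, so words that appear in many titles carry less weight. The sketch below computes the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is my understanding of the `TfidfVectorizer` default. It omits the row normalization that follows, uses plain whitespace tokenization, and runs on hypothetical titles.

```python
import math


def smoothed_idf(docs):
    """Return idf(t) = ln((1 + n) / (1 + df(t))) + 1 for whitespace-tokenized docs."""
    n = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc.lower().split()):  # count each term once per doc
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}


# Hypothetical titles for illustration.
titles = ["new study finds", "new phone released", "new study published"]
idf = smoothed_idf(titles)
for term in ("new", "study", "phone"):
    print(term, round(idf[term], 3))
```

A term found in every title ("new") gets the minimum weight of 1, while a term found in only one title ("phone") gets the largest.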

#### Classifier 2: Random Forest

In [None]:
rf = RandomForestClassifier(random_state=42)

**Pipeline 3: CountVectorizer + RandomForest**

_Let's see if an ensemble method such as RandomForest will yield better results..._

In [193]:
# Set up pipeline 3
pipe3 = Pipeline([
    ('count_vect', count_vect),
    ('rf', rf)  
])

In [202]:
# Tune parameters and evaluate model

## Set timer
t0 = time.time()

params = {
    'count_vect__stop_words': ['english'],
    'count_vect__max_features': [950, 1000, 1050],
    'rf__n_estimators': [475, 500, 525],
    'rf__max_depth': [None, 5, 6]
}

gs = GridSearchCV(pipe3, param_grid=params, cv=3)
gs.fit(X_train, y_train)
print('Best Training Set Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test Set Score:', gs.score(X_test, y_test))
print()

time_elapsed_secs = time.time() - t0 # time elapsed in seconds
print('Time Elapsed:', datetime.timedelta(seconds=time_elapsed_secs))
print('Code ended at:', time.strftime('%I:%M %p'))

Best Training Set Score: 0.8795066413662239
Best Parameters: {'count_vect__max_features': 1000, 'count_vect__stop_words': 'english', 'rf__max_depth': None, 'rf__n_estimators': 500}
Test Set Score: 0.8806818181818182

Time Elapsed: 0:01:17.488705
Code ended at: 04PM:44mins


_Iterations on the parameters above show that LogReg is still the better-performing model of the two._

**Pipeline 4: TFIDF + RandomForest**

In [204]:
# Set up pipeline 4
pipe4 = Pipeline([
    ('tfidf', tfidf),
    ('rf', rf)  
])

In [205]:
# Tune parameters and evaluate model

## Set timer
t0 = time.time()

params = {
    'tfidf__stop_words': ['english'],
    'tfidf__max_features': [800, 1000, 1200],
    'rf__n_estimators': [450, 500, 550],
    'rf__max_depth': [None, 2, 3]
}

gs = GridSearchCV(pipe4, param_grid=params, cv=3)
gs.fit(X_train, y_train)
print('Best Training Set Score:', gs.best_score_)
print('Best Parameters:', gs.best_params_)
print('Test Set Score:', gs.score(X_test, y_test))
print()

time_elapsed_secs = time.time() - t0 # time elapsed in seconds
print('Time Elapsed:', datetime.timedelta(seconds=time_elapsed_secs))
print('Code ended at:', time.strftime('%I:%M %p'))

Best Training Set Score: 0.8795066413662239
Best Parameters: {'rf__max_depth': None, 'rf__n_estimators': 500, 'tfidf__max_features': 1000, 'tfidf__stop_words': 'english'}
Test Set Score: 0.8693181818181818

Time Elapsed: 0:01:20.696398
Code ended at: 04PM:47mins


_With RandomForest, CountVectorizer provides slightly better results than TFIDF. LogReg with CountVectorizer is still the best performing combo out of those evaluated, but there is still room for improvement._

#### Ensemble Models using VotingClassifier

In [217]:
# Instantiate voter
voter = VotingClassifier([
    ('lr', pipe1),
    ('rf', pipe3)
])

In [220]:
voter.fit(X_train, y_train)
print('Training Score:', voter.score(X_train, y_train))
print('Test Score:', voter.score(X_test, y_test))



Training Score: 0.9895635673624289
Test Score: 0.8409090909090909
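
One thing to keep in mind with this setup: `VotingClassifier` uses hard (majority) voting by default, and with only two voters every disagreement is a tie. scikit-learn's documentation states that ties are resolved by the ascending sort order of the class labels. The sketch below mimics that rule with the standard library; `hard_vote` is an illustrative helper, not the library's code, and this may be only one of several reasons the ensemble's test score trails Pipeline 1.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over one sample's predicted labels; ties go to the lowest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


# 0 = science, 1 = technology, as encoded earlier in the notebook.
agree = hard_vote([1, 1])     # both voters say technology
disagree = hard_vote([0, 1])  # split vote between the two voters
print(agree, disagree)  # 1 0
```

Under this rule a split vote always resolves to class 0 (r/science), so adding a third voter or switching to soft voting are options worth testing.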


### Model Performance with Other Subreddits

### Visualizing Results

### Resources

Total number of posts by subreddit as of June, 2017: https://gist.github.com/anonymous/ef075ee973dd5f883ae17729c147c1de