# NLP Project: Subreddit Binary Text Classification 
- In this binary classification problem, I scrape data from two subreddits (*r/userexperience* and *r/UXResearch*) using [Pushshift’s API](https://github.com/pushshift/api), then use Natural Language Processing (NLP) to train a classifier that predicts which subreddit a given post came from.
- This is the first of two notebooks for this project. In this notebook, I scrape the subreddit data, explore the data, and prepare the dataset that I use for modeling.
---

# Contents
- [Scrape Reddit data with Pushshift API](#Scrape-Reddit-data-with-Pushshift-API)
- [EDA and data cleaning](#EDA-and-data-cleaning)
- [Prepare merged dataset for modeling](#Prepare-merged-dataset-for-modeling) 

# Import Libraries

In [1]:
# the magic trio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# processing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline

# scraping
import requests
from bs4 import BeautifulSoup
import json
import time

# nlp
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Scrape Reddit data with Pushshift API

In [2]:
# function to pull text from the Pushshift API
def pull_text(post_type, subreddit, n_iter):
    """Pull n_iter pages (up to 1,000 rows each) of comments or submissions."""
    url = 'https://api.pushshift.io/reddit/search/' + str(post_type)
    df_list = []
    current_time = 1587495632  # 2020-04-21 19:00:32 UTC (1 PM Mountain Daylight Time)
    for _ in range(n_iter):
        res = requests.get(
            url,
            params={
                'subreddit': subreddit,
                'size': 1000,
                'before': current_time  # only fetch posts older than this timestamp
            }
        )
        res.raise_for_status()  # stop early on an HTTP error

        df = pd.DataFrame(res.json()['data'])
        if post_type == 'comment':
            df = df.loc[:, ['id', 'created_utc', 'author', 'body', 'subreddit']]
        if post_type == 'submission':
            df = df.loc[:, ['id', 'created_utc', 'author', 'title', 'selftext', 'subreddit']]
        df_list.append(df)
        # the next request picks up where this page ended
        current_time = df.created_utc.min()

    return pd.concat(df_list, axis=0)

# function adapted from Tim Book
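
The function above pages backwards through time: each request asks for posts older than the oldest timestamp seen so far. A minimal sketch of that cursor pattern, with the HTTP call swapped for an injected function so it can run offline. The names `pull_pages`, `fetch_page`, and `fake_fetch` are hypothetical and not part of the notebook or of Pushshift; the early stop on an empty page and the pause between requests are my additions.

```python
import time
import pandas as pd

def pull_pages(fetch_page, n_iter, start_before, pause=0.0):
    """Walk backwards in time, one page per call to fetch_page(before)."""
    frames = []
    before = start_before
    for _ in range(n_iter):
        rows = fetch_page(before)          # list of dicts, all older than `before`
        if not rows:                       # nothing older left: stop early
            break
        page = pd.DataFrame(rows)
        frames.append(page)
        before = int(page['created_utc'].min())  # the next page ends here
        time.sleep(pause)                  # optional pause between requests
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Hypothetical stand-in for the API: posts at timestamps 1..10, 4 per page
def fake_fetch(before, size=4):
    older = [t for t in range(10, 0, -1) if t < before]
    return [{'id': f'p{t}', 'created_utc': t} for t in older[:size]]

demo = pull_pages(fake_fetch, n_iter=5, start_before=11)
print(demo.shape)
```

Because each page starts strictly before the previous minimum, no post is fetched twice, and the loop ends as soon as a page comes back empty even if `n_iter` has not been reached.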

In [3]:
# save raw text files as variables
ex_comment_raw = pull_text('comment', 'userexperience', 6)  # userexperience comments
res_comment_raw = pull_text('comment', 'UXResearch', 6)     # uxresearch comments
ex_subm_raw = pull_text('submission', 'userexperience', 2)  # userexperience submissions
res_subm_raw = pull_text('submission', 'UXResearch', 2)     # uxresearch submissions

# EDA and data cleaning

In [4]:
print('exp comment shape:', ex_comment_raw.shape)
print('res comment shape:', res_comment_raw.shape)
print('exp submission shape:', ex_subm_raw.shape)
print('res submission shape:', res_subm_raw.shape)

exp comment shape: (6000, 5)
res comment shape: (5105, 5)
exp submission shape: (2000, 6)
res submission shape: (1492, 6)


In [5]:
ex_comment_raw.columns

Index(['id', 'created_utc', 'author', 'body', 'subreddit'], dtype='object')

In [6]:
ex_subm_raw.columns

Index(['id', 'created_utc', 'author', 'title', 'selftext', 'subreddit'], dtype='object')

In [7]:
# check nulls
print('exp comment null:', ex_comment_raw.isnull().sum())
print('res comment null:', res_comment_raw.isnull().sum())
print('exp submission null:', ex_subm_raw.isnull().sum())
print('res submission null:', res_subm_raw.isnull().sum())

exp comment null: id             0
created_utc    0
author         0
body           0
subreddit      0
dtype: int64
res comment null: id             0
created_utc    0
author         0
body           0
subreddit      0
dtype: int64
exp submission null: id              0
created_utc     0
author          0
title           0
selftext       12
subreddit       0
dtype: int64
res submission null: id             0
created_utc    0
author         0
title          0
selftext       4
subreddit      0
dtype: int64


In [8]:
# count documents with '[removed]' (i.e., post deleted)
print('\nexp comments [removed]')
print(ex_comment_raw[(ex_comment_raw['body'] == '[removed]')].count())
print('\nres comments [removed]')
print(res_comment_raw[(res_comment_raw['body'] == '[removed]')].count())
print('\nexp submissions [removed]')
print(ex_subm_raw[(ex_subm_raw['selftext'] == '[removed]')].count())
print('\nres submissions [removed]')
print(res_subm_raw[(res_subm_raw['selftext'] == '[removed]')].count())


exp comments [removed]
id             40
created_utc    40
author         40
body           40
subreddit      40
dtype: int64

res comments [removed]
id             17
created_utc    17
author         17
body           17
subreddit      17
dtype: int64

exp submissions [removed]
id             712
created_utc    712
author         712
title          712
selftext       712
subreddit      712
dtype: int64

res submissions [removed]
id             196
created_utc    196
author         196
title          196
selftext       196
subreddit      196
dtype: int64


In [9]:
# concatenate comments and submissions dataframes
comments = pd.concat([ex_comment_raw, res_comment_raw], ignore_index=True)
submissions = pd.concat([ex_subm_raw, res_subm_raw], ignore_index=True)

In [10]:
# delete rows with [removed]
comments = comments.drop(comments[comments.body == '[removed]'].index)
submissions = submissions.drop(submissions[submissions.selftext == '[removed]'].index)
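
The same rows can be dropped with a boolean mask instead of `drop(...index)`. A minimal sketch on made-up data: the frame `toy` is hypothetical, and treating `'[deleted]'` as a second placeholder is my assumption, since the notebook only filters `'[removed]'`.

```python
import pandas as pd

# Toy stand-in for the comments dataframe
toy = pd.DataFrame({
    'body': ['Great point', '[removed]', 'Thanks!', '[deleted]'],
    'subreddit': ['userexperience'] * 2 + ['UXResearch'] * 2,
})

# Placeholder strings to filter out; '[deleted]' is an assumption, not from the notebook
placeholders = ['[removed]', '[deleted]']

# Keep only rows whose body is not a placeholder, then renumber the index
kept = toy[~toy['body'].isin(placeholders)].reset_index(drop=True)
print(kept)
```

Resetting the index here avoids the gaps (0, 3, 4, 6, ...) that show up in the submissions previews further down.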

In [11]:
# for submissions, concatenate 'title' and 'selftext' features
# in order to a) get more text data, and b) match the shape of comments
# fill null selftext with '' first; otherwise the concatenated text would be NaN
# and those rows would be lost in the later dropna
submissions['text'] = (submissions['title'] + ' ' + submissions['selftext'].fillna('')).str.strip()
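
One thing to watch with this concatenation: in pandas, adding a string to a missing value gives a missing value, so a row with a null `selftext` ends up with a null `text` unless the nulls are filled first. A small sketch on made-up data (the frame `toy` and its contents are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['Portfolio review', 'Survey tools?'],
    'selftext': ['Feedback welcome', np.nan],   # second post has no body
})

# Plain concatenation: the NaN propagates into the result
naive = toy['title'] + ' ' + toy['selftext']

# Filling nulls with '' keeps the title; strip() drops the trailing space
filled = (toy['title'] + ' ' + toy['selftext'].fillna('')).str.strip()

print(naive.isna().tolist())
print(filled.tolist())
```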

# Prepare merged dataset for modeling

In [12]:
# drop 'title' and 'selftext' columns 
submissions_to_merge = submissions.drop(columns=['title', 'selftext'])
submissions_to_merge.head()

| | id | created_utc | author | subreddit | text |
|---|---|---|---|---|---|
| 0 | g5ib1h | 1587486153 | yellow_brick | userexperience | Webinar en français : Design Sprint Session 1 ... |
| 3 | g5ck0j | 1587463711 | thisisfats | userexperience | I want to create a volunteer directory for peo... |
| 4 | g5cg9z | 1587463161 | nix_oxten94 | userexperience | Starting a career in UX is a good option for m... |
| 6 | g534qi | 1587422673 | Mariciano | userexperience | Best way to analyse web shops and what makes t... |
| 10 | g4pfyy | 1587373548 | herotohero | userexperience | My company is hosting a virtual conference and... |


In [13]:
# reorder columns to match comments 
submissions_to_merge = submissions_to_merge[['id', 'created_utc', 'author', 'text', 'subreddit']]
submissions_to_merge.head()

| | id | created_utc | author | text | subreddit |
|---|---|---|---|---|---|
| 0 | g5ib1h | 1587486153 | yellow_brick | Webinar en français : Design Sprint Session 1 ... | userexperience |
| 3 | g5ck0j | 1587463711 | thisisfats | I want to create a volunteer directory for peo... | userexperience |
| 4 | g5cg9z | 1587463161 | nix_oxten94 | Starting a career in UX is a good option for m... | userexperience |
| 6 | g534qi | 1587422673 | Mariciano | Best way to analyse web shops and what makes t... | userexperience |
| 10 | g4pfyy | 1587373548 | herotohero | My company is hosting a virtual conference and... | userexperience |


In [14]:
# rename 'body' column to 'text'; same as submissions df
comments_to_merge = comments.rename(columns={'body': 'text'})
comments_to_merge.head()

| | id | created_utc | author | text | subreddit |
|---|---|---|---|---|---|
| 0 | fo3tjjq | 1587492704 | ryusakai | Sure no problem! | userexperience |
| 1 | fo3t2lw | 1587492485 | _heisenberg__ | Sorry I never got back to you here. So I'm of ... | userexperience |
| 2 | fo3no3f | 1587489935 | WisePudding | The ones you might show me if we only had 5 mi... | userexperience |
| 3 | fo3njfe | 1587489875 | Mariciano | I honestly think the same, but what can I do, ... | userexperience |
| 4 | fo3g9o4 | 1587486394 | tanaysharma97 | Oh, okay! Can you tell me some of the projects... | userexperience |


In [15]:
# merge submissions and comments dataframes
merged = pd.concat([comments_to_merge, submissions_to_merge], ignore_index=True)
merged.shape

(13632, 5)
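
As a quick sanity check, the merged row count should equal the raw row counts minus the `[removed]` rows counted earlier. The numbers below are copied from the outputs above; the variable names are only for illustration.

```python
# Row counts reported by the .shape calls on the four raw dataframes
raw_rows = {'exp_comments': 6000, 'res_comments': 5105,
            'exp_submissions': 2000, 'res_submissions': 1492}

# '[removed]' counts reported for the same four dataframes
removed_rows = {'exp_comments': 40, 'res_comments': 17,
                'exp_submissions': 712, 'res_submissions': 196}

expected_rows = sum(raw_rows.values()) - sum(removed_rows.values())
print(expected_rows)  # 13632, matching merged.shape[0]
```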

In [16]:
merged.dropna(inplace=True)

In [17]:
merged.isnull().sum()

id             0
created_utc    0
author         0
text           0
subreddit      0
dtype: int64

In [18]:
merged.to_csv('./datasets/merged_unprocessed.csv', index=False)

In [19]:
# replace special characters with spaces
spec_chars = ["!",'"',"#","%","&","'","(",")",
              "*","+",",","-",".","/",":",";","<",
              "=",">","?","@","[","\\","]","^","_",
              "`","{","|","}","~","–", "\n"]
for char in spec_chars:
    # regex=False treats each character literally (".", "*", "(" etc. are regex metacharacters)
    merged['text'] = merged['text'].str.replace(char, ' ', regex=False)
    
# and collapse the runs of whitespace we created into single spaces
merged['text'] = merged['text'].str.split().str.join(" ")

# https://medium.com/analytics-vidhya/simplify-your-dataset-cleaning-with-pandas-75951b23568e
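
An alternative to looping over a character list is a single regular-expression pass. This is a sketch of a different technique, not the method used above, and it is slightly broader: it replaces every character that is not a letter, digit, or whitespace, so symbols missing from `spec_chars` (such as `$`) are removed too. The sample strings are made up.

```python
import pandas as pd

texts = pd.Series(["Hello, world! It's UX_research...",
                   "A/B testing\n(50% lift)"])

cleaned = (texts
           .str.replace(r'[^\w\s]|_', ' ', regex=True)  # punctuation and underscores -> space
           .str.split().str.join(' ')                   # collapse whitespace, including newlines
           .str.lower())                                # lowercase in the same chain

print(cleaned.tolist())
```

In Python 3, `\w` matches Unicode word characters, so accented letters such as the "ç" in "français" are kept.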

In [20]:
# lowercase all text
merged['text'] = merged['text'].str.lower()

In [21]:
# encode the target: userexperience = 1 (positive class), UXResearch = 0 (negative class)
merged['subreddit'] = merged['subreddit'].map({'userexperience': 1, 'UXResearch': 0})
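
`Series.map` with a dictionary returns NaN for any value that is not a key, and the keys are case-sensitive. A small sketch of a check that could be run after the mapping; the series `toy` is hypothetical, and its lowercase `'uxresearch'` entry is included only to show the failure mode.

```python
import pandas as pd

toy = pd.Series(['userexperience', 'UXResearch', 'uxresearch'])

labels = toy.map({'userexperience': 1, 'UXResearch': 0})

# Any label not found in the dictionary becomes NaN
unmapped = int(labels.isna().sum())
print(unmapped)

# Class balance of the rows that did map
print(labels.value_counts(dropna=True))
```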

In [23]:
# save out csv for modeling
merged.to_csv('./datasets/merged_processed.csv', index=False)