# Project 3: Reddit API Classification & Natural Language Processing

## Tom Ludlow, DSI-NY-6

Using NLP to identify posts from **r/audioengineering** and **r/livesound**

# Notebook 2: Pre-Processing

In [26]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import ast
import re
from tqdm import tqdm

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [27]:
# random state var
r = 1219

## Pre-processing

Reimport our post DataFrame to begin pre-processing.

In [3]:
df = pd.read_csv('./csv/181219_post_df.csv')

In [4]:
df.shape

(1904, 6)

In [5]:
df.comb.head()

0    Tech Support and Troubleshooting - December 10...
1    Gear Recommendation (What Should I Buy?) Threa...
2    Will I EVER understand compression...? Ahh yes...
3    I'm interviewing to be an intern at a big stud...
4    If I faced two speakers towards each other, on...
Name: comb, dtype: object

In [6]:
df.head()

| | title | post | author | upvotes | is_ls | comb |
|---|---|---|---|---|---|---|
| 0 | Tech Support and Troubleshooting - December 10... | Welcome the /r/audioengineering Tech Support a... | t2_6l4z3 | 7 | 0 | Tech Support and Troubleshooting - December 10... |
| 1 | Gear Recommendation (What Should I Buy?) Threa... | Welcome to our weekly Gear Recommendation Thre... | t2_6l4z3 | 15 | 0 | Gear Recommendation (What Should I Buy?) Threa... |
| 2 | Will I EVER understand compression...? | Ahh yes, my monthly compression post...\n\nI '... | t2_2r3uhjqr | 96 | 0 | Will I EVER understand compression...? Ahh yes... |
| 3 | I'm interviewing to be an intern at a big stud... | What questions should I ask?\n\nEdit: I'm gett... | t2_dd3qi | 145 | 0 | I'm interviewing to be an intern at a big stud... |
| 4 | If I faced two speakers towards each other, on... | | t2_bl2x2 | 5 | 0 | If I faced two speakers towards each other, on... |


### Tokenizing titles and posts

We begin by tokenizing our posts with a regular expression (regex).  The pattern captures runs of word characters, apostrophes, and forward slashes, so contractions stay whole and words joined by `/` stay together, which keeps the pieces of a URL connected.

In [7]:
rt = RegexpTokenizer(r"[\w/\']+") # regex to include words, slash characters for urls, apostrophes
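As a rough illustration of what this pattern captures, the sketch below applies the same character class with the standard library's `re.findall`, which should return the same matches as `RegexpTokenizer` in its default mode. The `tokenize` helper and the sample sentence are made up for this example.

```python
import re

# Same character class as the RegexpTokenizer above: word characters,
# forward slashes, and apostrophes.
TOKEN_PATTERN = r"[\w/\']+"

def tokenize(text):
    """Hypothetical helper: lowercase the text and return every pattern match."""
    return re.findall(TOKEN_PATTERN, text.lower())

sample = "I can't open https://example.com/a/b, help!"  # made-up sample text
tokens = tokenize(sample)
print(tokens)
```

The pattern keeps any run of these characters together, so a URL is still split at `:` and `.` into pieces such as `https` and `//example`. The token clean-up later in this notebook blanks out those pieces.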

To test our regex pattern, we paste sample text such as the following into the Regex101.com tool.

In [9]:
df.comb.sample(5)

315     Any words of wisdom? Hey all, I've been making...
586     Friday - How did they do that? - October 26, 2...
1792     Taking the new d&amp;b GSL for a spin tonight.  
1889    How do I tell the live band that they need to ...
572     Is there a length limit when using an optical ...
Name: comb, dtype: object

A number of HTML entities (such as `&amp;`) were captured during our API pulls.  We remove them manually by looping over the combined column.

In [12]:
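for i, text in enumerate(df.comb): # for each string in our combined column
    text_loop = text.replace('&amp;','&')        # HTML-escaped ampersand
    text_loop = text_loop.replace('#x200B;',' ') # zero-width space entity (zwsp)
    text_loop = text_loop.replace('nbsp;',' ')   # non-breaking space entity (nbsp)
    # write back by position on the DataFrame itself to avoid chained assignment
    df.iloc[i, df.columns.get_loc('comb')] = text_loop.replace('\n',' ').strip() # drop newlines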
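An alternative to replacing each entity by hand is the standard library's `html.unescape`. The sketch below is one possible approach, not the method used in this notebook. The `clean_text` helper is hypothetical, and it assumes some entities may have been escaped twice (for example `&amp;#x200B;`), so it decodes until the text stops changing.

```python
import html

def clean_text(text):
    """Hypothetical helper: decode HTML entities and normalize whitespace."""
    previous = None
    while text != previous:  # repeat in case entities were escaped twice
        previous = text
        text = html.unescape(text)
    # Replace zero-width and non-breaking spaces with ordinary spaces.
    text = text.replace("\u200b", " ").replace("\xa0", " ")
    return " ".join(text.split())  # collapse newlines and repeated spaces

print(clean_text("Taking the new d&amp;b GSL for a spin tonight.  "))
```

Unlike the loop above, this version also collapses repeated spaces, so its output may differ slightly in whitespace.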
    

In [13]:
len(df.comb)

1904

#### Tokenize each post and save to list

In [14]:
comb_tokens = []  # empty token list

for i in range(len(df.comb)):
    loop_tokens = rt.tokenize(df.comb.iloc[i].lower()) # use iloc to skip removed rows
    for j, token in enumerate(loop_tokens):
        if re.match(r"\d+\w*", token):                 # blank out tokens that start with a digit
            loop_tokens[j] = ''
        if re.match(r"//\w*", token):                  # blank out URL fragments that start with //
            loop_tokens[j] = ''
        # blank out subreddit names and link fragments
        if ('audioengineering' in token) or ('livesound' in token) or ('http' in token):
            loop_tokens[j] = ''
    comb_tokens.append(loop_tokens)                    # add tokenized string to comb_tokens list
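The three filter rules can be sketched on their own with a small predicate. The `keep_token` helper and the sample tokens are hypothetical. Dropping the subreddit names presumably keeps the models from simply reading the answer out of the text.

```python
import re

BLOCKED_SUBSTRINGS = ("audioengineering", "livesound", "http")

def keep_token(token):
    """Hypothetical helper: return False for tokens the loop above blanks out."""
    if re.match(r"\d+\w*", token):   # starts with a digit, e.g. '10' or '3db'
        return False
    if re.match(r"//\w*", token):    # URL fragment that starts with '//'
        return False
    return not any(s in token for s in BLOCKED_SUBSTRINGS)

sample_tokens = ["tech", "support", "10", "//www", "r/livesound", "https", "mp3"]
kept = [t for t in sample_tokens if keep_token(t)]
print(kept)
```

Because `re.match` only anchors at the start of the token, `mp3` survives while `10` does not. This sketch drops unwanted tokens outright, whereas the loop above replaces them with empty strings.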

In [29]:
len(comb_tokens)

1904

In [30]:
len(comb_tokens[0])

153

In [28]:
comb_tokens[0][:5]

['tech', 'support', 'and', 'troubleshooting', 'december']

## Stem tokens

We can help our models generalize across word forms by using a stemmer, which trims characters from each word to reduce it to a stem.  Words that share a stem register as the same feature during feature extraction (e.g., `computer`, `computing`, and `computed` all reduce to the stem `comput`).

We will use the `PorterStemmer` class from the `nltk` library.

In [16]:
ps = PorterStemmer()

In [21]:
posts_t_st = [] # posts tokenized stemmed

for post in comb_tokens:
    post_st = [] # empty post stems
    for word in post:
        #print(word)
        word_st = ps.stem(word) # get stem of word
        post_st.append(word_st) # add to post list
    posts_t_st.append(post_st)  # add post list to stem matrix
    
posts_t_st[0][:10]

['tech',
 'support',
 'and',
 'troubleshoot',
 'decemb',
 '',
 '',
 'welcom',
 'the',
 '']
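To see why shared stems matter for feature extraction, here is a toy sketch. It uses a made-up suffix stripper (`toy_stem`), not the Porter algorithm, purely to show how different word forms collapse into a single count.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "er", "s")  # toy suffix list, NOT the Porter rules

def toy_stem(word):
    """Hypothetical helper: strip the first matching suffix, keeping 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["computer", "computing", "computed", "mixer"]
stem_counts = Counter(toy_stem(w) for w in words)
print(stem_counts)
```

The three `comput` forms now count as one feature with a frequency of three, rather than three separate features with a frequency of one each.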

### Combine stems to single string list

In [22]:
# format tokenized stems for vectorizer: list of strings
posts_t_st_list = []

for post in posts_t_st:
    posts_t_st_list.append(' '.join(post))
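Because filtered tokens were replaced with empty strings, a plain join leaves runs of extra spaces in the combined text. A minimal sketch of one way to avoid that is shown below. The `join_tokens` helper is hypothetical and is not used elsewhere in this notebook.

```python
def join_tokens(tokens):
    """Hypothetical helper: join tokens with single spaces, skipping blanks."""
    return " ".join(token for token in tokens if token)

sample_post = ["tech", "support", "and", "troubleshoot", "decemb", "", "", "welcom"]
print(join_tokens(sample_post))
```

The extra spaces are likely harmless to a word-based vectorizer, so this change is mostly cosmetic.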

## Lemmatized tokens

We will try lemmatizing as well, a "lighter-touch" approach that maps each word to a dictionary form instead of trimming it.  We can run both versions of the text through our initial models to determine which performs better.

We will use the `WordNetLemmatizer` class from the `nltk` library.

In [31]:
lm = WordNetLemmatizer()

In [33]:
posts_t_lm = []

for post in comb_tokens:
    post_lm = [] # empty post lemmas
    for word in post:
        word_lm = lm.lemmatize(word) # get lemmatized word
        post_lm.append(word_lm) # add to post list
    posts_t_lm.append(post_lm)  # add post list to lemma matrix
    
posts_t_lm[0][:10]

['tech',
 'support',
 'and',
 'troubleshooting',
 'december',
 '',
 '',
 'welcome',
 'the',
 '']

### Combine lemmas to single string list

In [34]:
posts_t_lm_list = []

for post in posts_t_lm:
    posts_t_lm_list.append(' '.join(post))

In [35]:
posts_t_lm_list[:2]

["tech support and troubleshooting december   welcome the  tech support and troubleshooting thread we kindly ask that all tech support question and basic troubleshooting question how do i hook up 'a' to 'b' headphone v mon etc go here if you see post that belong here please report them to help u get to them in a timely manner thank you daily thread monday gear recommendation sticky thread   reddit  q title  author  restrict_sr on sort new t all monday tech support and troubleshooting sticky thread   reddit  q title  author  restrict_sr on sort new t all tuesday tip trick   reddit  q title    and   or author   restrict_sr on sort new t all friday how did they do that   reddit  q title  author  restrict_sr on sort new t all",
 'gear recommendation what should i buy thread december   welcome to our weekly gear recommendation thread where you can ask  for recommendation on smart purchase low cost gear and purchasing recommendation request have become common in the ae subreddit there is als

### Add index to posts and titles and create DataFrames

Here we create a DataFrame that holds both the stemmed and lemmatized versions of our combined posts, and save it to csv for the next steps.

In [36]:
df_pre = pd.DataFrame(data=[posts_t_st_list, posts_t_lm_list], index=['post_st','post_lm'])

In [37]:
df_pre = df_pre.T

In [38]:
df_pre.head()

| | post_st | post_lm |
|---|---|---|
| 0 | tech support and troubleshoot decemb welcom ... | tech support and troubleshooting december we... |
| 1 | gear recommend what should i buy thread decemb... | gear recommendation what should i buy thread d... |
| 2 | will i ever understand compress ahh ye my mont... | will i ever understand compression ahh yes my ... |
| 3 | i'm interview to be an intern at a big studio ... | i'm interviewing to be an intern at a big stud... |
| 4 | if i face two speaker toward each other one pl... | if i faced two speaker towards each other one ... |


In [39]:
df_pre['is_ls'] = df['is_ls']

In [40]:
df_pre.head()

| | post_st | post_lm | is_ls |
|---|---|---|---|
| 0 | tech support and troubleshoot decemb welcom ... | tech support and troubleshooting december we... | 0 |
| 1 | gear recommend what should i buy thread decemb... | gear recommendation what should i buy thread d... | 0 |
| 2 | will i ever understand compress ahh ye my mont... | will i ever understand compression ahh yes my ... | 0 |
| 3 | i'm interview to be an intern at a big studio ... | i'm interviewing to be an intern at a big stud... | 0 |
| 4 | if i face two speaker toward each other one pl... | if i faced two speaker towards each other one ... | 0 |


In [73]:
df_pre.to_csv('./csv/181220_df_pre.csv', index=False)

## Train-Test Split

We will perform a single train/test split on a DataFrame that holds both text columns, so the stemmed and lemmatized versions share the same index values in each split.  This allows us to directly compare modeling accuracy scores between the stemmed and lemmatized text.

In [74]:
df_pre = pd.read_csv('./csv/181220_df_pre.csv') # reimport data

In [41]:
df_pre.head()

| | post_st | post_lm | is_ls |
|---|---|---|---|
| 0 | tech support and troubleshoot decemb welcom ... | tech support and troubleshooting december we... | 0 |
| 1 | gear recommend what should i buy thread decemb... | gear recommendation what should i buy thread d... | 0 |
| 2 | will i ever understand compress ahh ye my mont... | will i ever understand compression ahh yes my ... | 0 |
| 3 | i'm interview to be an intern at a big studio ... | i'm interviewing to be an intern at a big stud... | 0 |
| 4 | if i face two speaker toward each other one pl... | if i faced two speaker towards each other one ... | 0 |


In [42]:
df_pre.isnull().sum()

post_st    0
post_lm    0
is_ls      0
dtype: int64

In [43]:
df_pre.dropna(inplace=True) # remove nulls to continue

We set our `X` matrix to contain both post columns, and our `y` target vector to `is_ls`.

In [44]:
X = df_pre[['post_st','post_lm']]
y = df_pre['is_ls']

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=r, stratify=y)
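The idea of one split shared by both text columns can be sketched without scikit-learn. This is a simplified illustration only: the `split_positions` helper and the sample rows are made up, and unlike the call above it does not stratify by class.

```python
import random

def split_positions(n_rows, test_fraction=0.25, seed=1219):
    """Hypothetical helper: shuffle row positions once and cut them in two."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_fraction)
    return sorted(positions[n_test:]), sorted(positions[:n_test])

stems = ["mix consol", "studio mic", "stage monitor", "live rig"]    # made-up rows
lemmas = ["mix console", "studio mic", "stage monitor", "live rig"]  # made-up rows
train_pos, test_pos = split_positions(len(stems))

# The same positions select the training rows from both text columns.
train_stems = [stems[i] for i in train_pos]
train_lemmas = [lemmas[i] for i in train_pos]
print(train_pos, test_pos)
```

Because both columns are indexed by the same positions, any difference in model scores comes from the text treatment rather than from which posts landed in each split.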

We confirm that each `X` split has the same number of rows as its matching `y` split.

In [46]:
len(X_train)

1428

In [47]:
len(X_test)

476

In [48]:
len(y_train)

1428

In [49]:
len(y_test)

476
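As a quick sanity check on these lengths, the arithmetic below assumes the split used scikit-learn's default test fraction of 0.25, since the call above does not set `test_size`. The variable names are illustrative.

```python
import math

n_rows = 1904          # rows in the post DataFrame, from the shape shown earlier
test_fraction = 0.25   # assumed default when test_size and train_size are unset

n_test = math.ceil(n_rows * test_fraction)  # 1904 * 0.25 is exact, so rounding has no effect here
n_train = n_rows - n_test
print(n_train, n_test)
```

These counts match the 1428 training rows and 476 test rows reported above.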

In [50]:
y_train = pd.DataFrame(y_train, columns=['is_ls'])

In [51]:
y_test = pd.DataFrame(y_test, columns=['is_ls'])

We save each train and test DataFrame to a separate dated csv for future reference.

In [87]:
X_train.to_csv('./csv/181220_X_train.csv', index=True)
X_test.to_csv('./csv/181220_X_test.csv', index=True)
y_train.to_csv('./csv/181220_y_train.csv', index=True)
y_test.to_csv('./csv/181220_y_test.csv', index=True)

## Continue to Notebook 3: Model Selection