# Project 3: Natural Language Processing and Classification

Benjamin Chee, DSI-SG-17

Classifying posts from r/xboxone and r/PS5

# Notebook 2: Pre-processing

This notebook contains code used to:
- tokenise the posts
- stem and lemmatise the tokens, and compare the two treatments

Contents:
- Pre-processing
- Tokenizing Titles and Posts
- Stemming Tokens
- Lemmatizing Tokens



## Libraries

In [1]:
import datetime
import time
import re
import pandas as pd
import numpy as np
from tqdm import tqdm

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [2]:
# initialise date time; the run date is used in the checkpoint filename
date_run = datetime.datetime.now()
date = date_run.date()

In [3]:
#reading output from notebook 1
df=pd.read_csv('./csv/df_final_2020-10-02.csv')

As a first treatment, set up the regex tokeniser, then clean up the escaped ampersands and whitespace entities (zero-width and non-breaking spaces) that it would not catch.

In [4]:
# regex to greedily include words, slash characters for urls, apostrophes
rt = RegexpTokenizer(r"[\w/\']+") 
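
To see what this pattern keeps, the sketch below applies the same regex with `re.findall` from the standard library. It assumes that `RegexpTokenizer` in its default (non-gap) mode returns the same matches; the sample sentence is made up for illustration.

```python
import re

# Same pattern as the tokeniser above: word characters, slashes, apostrophes
pattern = r"[\w/\']+"

# Hypothetical sample text, not taken from the dataset
text = "Check https://xbox.com, we've got 4K!"

tokens = re.findall(pattern, text.lower())
print(tokens)
# ['check', 'https', '//xbox', 'com', "we've", 'got', '4k']
```

Punctuation such as `:`, `.`, `,` and `!` splits tokens, so a URL breaks into pieces like `https` and `//xbox`, which the later filtering step can then target.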

In [5]:
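# replace escaped ampersands and whitespace entities which won't be caught by the tokeniser
# .str.replace substitutes inside each post; Series.replace only matches whole cell values
df['comb'] = df['comb'].str.replace('&amp;', '&', regex=False)
df['comb'] = df['comb'].str.replace('#x200B;', ' ', regex=False)  # zero-width space
df['comb'] = df['comb'].str.replace('nbsp;', ' ', regex=False)  # non-breaking space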
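
The effect of these substitutions on a single post can be sketched with plain Python strings. The helper name `clean_entities` and the sample text are hypothetical; the three replacements are the ones used in the cell above, applied as substring replacements.

```python
def clean_entities(text):
    """Replace leftover HTML-entity fragments inside one post string."""
    text = text.replace('&amp;', '&')    # escaped ampersand
    text = text.replace('#x200B;', ' ')  # zero-width space remnant
    text = text.replace('nbsp;', ' ')    # non-breaking space remnant
    return text

# Hypothetical sample, not taken from the dataset
sample = "Tips &amp; tricks&amp;nbsp;for new players"
cleaned = clean_entities(sample)
print(cleaned)  # Tips & tricks& for new players
```

The stray `&` characters that remain are not matched by the tokeniser pattern, so they are dropped during tokenisation. A function like this could be applied to a column with `df.comb.map(clean_entities)`, assuming every entry is a string.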


Tokenising each post


In [6]:
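comb_tokens = []  # one token list per post

for i in range(len(df.comb)):
    # posts are lowercased first, so every check below must use lowercase strings
    loop_tokens = rt.tokenize(df.comb.at[i].lower())
    for j, token in enumerate(loop_tokens):
        # blank out tokens that start with a digit
        if re.match(r"\d+[\w]*", token):
            loop_tokens[j] = ''
        # blank out url remnants that start with '//'
        if re.match(r"//[\w]*", token):
            loop_tokens[j] = ''
        # blank out subreddit names and links
        if ('ps5' in token) or ('xboxone' in token) or ('http' in token):
            loop_tokens[j] = ''
    comb_tokens.append(loop_tokens)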
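
The three filtering rules can be checked on a small hand-made token list. This is a minimal sketch: the helper `blank_unwanted` and the sample tokens are hypothetical, and it assumes the tokens are already lowercased, so the subreddit names are matched in lowercase.

```python
import re

def blank_unwanted(tokens):
    """Return a copy of tokens with unwanted entries replaced by ''."""
    cleaned = []
    for token in tokens:
        # re.match anchors at the start of the token
        starts_with_digit = re.match(r"\d+[\w]*", token) is not None
        url_remnant = re.match(r"//[\w]*", token) is not None
        names_source = any(s in token for s in ('ps5', 'xboxone', 'http'))
        drop = starts_with_digit or url_remnant or names_source
        cleaned.append('' if drop else token)
    return cleaned

# Hypothetical tokens, not taken from the dataset
sample = ['check', 'https', '//xbox', 'com', "we've", 'got', '4k', 'r/ps5']
blanked = blank_unwanted(sample)
print(blanked)
# ['check', '', '', 'com', "we've", 'got', '', '']
```

Blanked tokens stay in the list as empty strings, which keeps token positions aligned with the original post.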

### Stem vs lemmatise

We will stem and lemmatise the tokens to see how each treatment measures up in a rough model. Their effectiveness will be evaluated during the first cut of model selection (next notebook).

In [7]:
#initialise Porter Stemmer
ps = PorterStemmer()

In [8]:
# posts tokenized stemmed
posts_st = [] 

for post in comb_tokens:
    post_st = [] # empty post stems
    for word in post:
        #print(word)
        word_st = ps.stem(word) # get stem of word
        post_st.append(word_st) # add to post list
    posts_st.append(post_st)  # add post list to stem matrix
    
posts_st[0][:10]

['tech',
 'weekli',
 'xbox',
 'one',
 'tech',
 'support',
 'thi',
 'is',
 'the',
 'thread']

Already we see some common stemming artefacts: 'this' is cut to 'thi' and 'weekly' becomes 'weekli'.

In [9]:
# format tokenized stems for vectorizer: list of strings
posts_st_list = []

for post in posts_st:
    posts_st_list.append(' '.join(post))
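
Because blanked tokens are kept as empty strings, joining with a single space leaves runs of spaces in the output. The sketch below uses a hypothetical token list to show this, and that splitting on whitespace recovers only the real tokens.

```python
# Hypothetical token list; '' marks a token blanked during tokenising
tokens = ['support', 'xbox', 'com', '', '', 'xbox', 'com']

joined = ' '.join(tokens)
print(repr(joined))  # 'support xbox com   xbox com'

# Splitting on whitespace discards the empty entries again
recovered = joined.split()
print(recovered)  # ['support', 'xbox', 'com', 'xbox', 'com']
```

The extra spaces are therefore cosmetic as long as the downstream step splits on whitespace or matches word patterns.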

Now we will lemmatise the same tokens


In [10]:
lm = WordNetLemmatizer()

In [11]:
posts_lm = []

for post in comb_tokens:
    post_lm = [] # empty post lemmas
    for word in post:
        #print(word)
        word_lm = lm.lemmatize(word) # get lemmatized word
        post_lm.append(word_lm) # add to post list
    posts_lm.append(post_lm)  # add post list to lemma matrix
    
posts_lm[0][:10]

['tech',
 'weekly',
 'xbox',
 'one',
 'tech',
 'support',
 'this',
 'is',
 'the',
 'thread']

The lemmatised output looks a lot better: the first ten tokens show none of the truncations seen in the stemmed output.
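
The two treatments can be compared position by position. The sketch below copies the first ten tokens from the two outputs above and lists only the pairs that differ.

```python
# First ten tokens of the first post, copied from the outputs above
stemmed = ['tech', 'weekli', 'xbox', 'one', 'tech',
           'support', 'thi', 'is', 'the', 'thread']
lemmatised = ['tech', 'weekly', 'xbox', 'one', 'tech',
              'support', 'this', 'is', 'the', 'thread']

# Keep only the positions where the two treatments disagree
differences = [(s, l) for s, l in zip(stemmed, lemmatised) if s != l]
print(differences)  # [('weekli', 'weekly'), ('thi', 'this')]
```

The same comparison could be run over whole posts in `posts_st` and `posts_lm`, since both are built from the same token lists and so have matching lengths.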

In [12]:
#creating lemmatised list for each post
posts_lm_list = []

for post in posts_lm:
    posts_lm_list.append(' '.join(post))

In [13]:
posts_lm_list[:2]

["tech weekly xbox one tech support this is the thread where subscriber should bring their xbox support issue the collective experience of the xbox community is one of the most powerful tool to help you fix any gaming issue and we've been seeing some awesome response we love to see gamers helping gamers there will also be xbox support team member who will be helping moderate and provide solution and can be recognized by their official support flair we ll try to provide support right here in the thread but won't be able to respond to every single question we might redirect you to support xbox com   xbox com to contact one of our support team or send you a private message if that s the best way to help the xbox support team member won't respond to request for help with third party product or situation that would violate the term of use community standard or void a product warranty let u know what you need help with we re here to help you with xbox service game hardware social feature and

In [14]:
#add all tokens into a single data frame
df_pre = pd.DataFrame(data=[posts_st_list, posts_lm_list], index=['post_st','post_lm'])
df_pre = df_pre.T
df_pre.head()

| | post_st | post_lm |
|---|---|---|
| 0 | tech weekli xbox one tech support thi is the t... | tech weekly xbox one tech support this is the ... |
| 1 | gta iv one of my fav game ever nearli a lock ... | gta iv one of my fav game ever nearly a locked... |
| 2 | more seri x load time comparison | more series x load time comparison |
| 3 | digit foundri xbox seri x backward compat test... | digital foundry xbox series x backwards compat... |
| 4 | do you rememb when thi pictur blew our mind | do you remember when this picture blew our mind |


In [17]:
len(df_pre)

1363

In [15]:
# remove nulls to continue
df_pre.dropna(inplace=True)

In [16]:
#adding our target variable, checkpoint save

df_pre['from_ps5'] = df['from_ps5']
df_pre.to_csv(f'./csv/df_pre_{date}.csv', index=False)
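
The checkpoint filename embeds the run date. The sketch below shows how the f-string resolves, using a fixed example date in place of `datetime.datetime.now().date()`; the variable names `run_date` and `checkpoint` are for illustration only.

```python
import datetime

# Fixed example date; the notebook uses the date on which it is run
run_date = datetime.date(2020, 10, 2)

# A date object formats as ISO YYYY-MM-DD inside an f-string
checkpoint = f'./csv/df_pre_{run_date}.csv'
print(checkpoint)  # ./csv/df_pre_2020-10-02.csv
```

Since the name changes with each run date, the next notebook needs to read the file with the matching date.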

### Continue to Notebook 3: Model Selection