# Part 2 - Cleaning/Tokenizing

In this portion of the report, I take the raw post titles from the JSONs and clean and tokenize them.  This prepares them for vectorization in the next notebook for NLP.  I chose to do this in a separate notebook so that I can return and apply stemming or lemmatization later, should I decide to use either in my models.

In [2]:
import pandas as pd

In [3]:
df_raw = pd.read_csv('./data/raw_titles.csv')

In [4]:
df_raw.head()

Unnamed: 0.1,Unnamed: 0,title,is_Evolution
0,0,Anti-evolution courses on Udemy,1
1,1,When Birds Stopped Flying PBS Eons,1
2,2,I am currently researching creationism and evo...,1
3,3,The flaws that I see with the Savanna hypothes...,1
4,4,These Female Insects Evolved Penises,1


In [5]:
df_raw.drop(columns='Unnamed: 0', inplace=True)

In [6]:
df_raw.head()

Unnamed: 0,title,is_Evolution
0,Anti-evolution courses on Udemy,1
1,When Birds Stopped Flying PBS Eons,1
2,I am currently researching creationism and evo...,1
3,The flaws that I see with the Savanna hypothes...,1
4,These Female Insects Evolved Penises,1


## Time to Transform

I will be tokenizing each of the titles to analyze the most indicative words, but I am choosing **_NOT_** to stem or lemmatize the tokens because I believe that tense may be a powerful indicator for these two subreddits, given their context.

If I find that tense is not as powerful an indicator as I believed, I plan to perform a second analysis with stemmed/lemmatized tokens.

In [7]:
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

In [8]:
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize(df_raw['title'][0].lower())

['anti', 'evolution', 'courses', 'on', 'udemy']
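To make the effect of the `\w+` pattern concrete, here is a minimal standard-library sketch. It assumes that `re.findall` with the same pattern is a fair stand-in for `RegexpTokenizer(r'\w+')`; the helper name `simple_tokenize` and the second example title are hypothetical and not part of the dataset.

```python
import re

# Hypothetical stand-in for RegexpTokenizer(r'\w+'): keep runs of letters,
# digits, and underscores, and treat everything else as a separator.
def simple_tokenize(text):
    return re.findall(r"\w+", text.lower())

# A hyphen is not a word character, so "Anti-evolution" becomes two tokens.
hyphenated = simple_tokenize("Anti-evolution courses on Udemy")

# An apostrophe is not a word character either, so a contraction is split.
contraction = simple_tokenize("Why don't birds fly?")

print(hyphenated)
print(contraction)
```

Punctuation is dropped entirely, which is why hyphenated words and contractions end up as separate fragments before stop-word removal.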

In [9]:
def title_to_words(raw_title):
    tokenizer = RegexpTokenizer(r'\w+')
    words = tokenizer.tokenize(raw_title.lower())             # tokenize the lowercased title
    stops = set(stopwords.words('english'))                   # English stop words, as a set for fast lookup
    meaningful_words = [w for w in words if w not in stops]   # remove stop words from the title
    return " ".join(meaningful_words)                         # return the remaining words as one space-separated string

In [10]:
clean_titles = []
for title in df_raw['title']:
    clean_titles.append(title_to_words(title))

In [11]:
df_clean = pd.DataFrame(clean_titles)

In [12]:
df_clean.rename(columns={0:'title'}, inplace=True)

In [13]:
df_clean['is_evolution'] = df_raw['is_Evolution']

In [14]:
df_clean.head()

Unnamed: 0,title,is_evolution
0,anti evolution courses udemy,1
1,birds stopped flying pbs eons,1
2,currently researching creationism evolution,1
3,flaws see savanna hypothesis explanation human...,1
4,female insects evolved penises,1


In [19]:
df_clean.loc[1566]

title            
is_evolution    0
Name: 1566, dtype: object

Removing stop words stripped every word from the post at index 1566, leaving an empty string.  When I imported the data into a new notebook, that empty string was read back as a null value.  I have removed the row to resolve this.
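The null value can be reproduced on a toy frame. This is a minimal sketch, not the notebook's actual data: the frame `toy` and its three rows are made up for illustration, and the CSV round trip happens in memory. It also shows one possible way to drop empty titles by content rather than by a hard-coded index.

```python
import io
import pandas as pd

# Toy stand-in for df_clean; the second title is empty, like index 1566.
toy = pd.DataFrame({
    "title": ["anti evolution courses udemy", "", "female insects evolved penises"],
    "is_evolution": [1, 0, 1],
})

# Round-trip through CSV in memory, mirroring to_csv followed by read_csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With default settings, read_csv parses the empty field as a missing value.
n_missing = int(reloaded["title"].isna().sum())
print(n_missing)

# Alternative to dropping by index: keep only rows whose title is non-empty.
filtered = toy[toy["title"].str.strip() != ""]
print(filtered)
```

Filtering on the title itself would also catch any other rows that become empty if the stop-word list or tokenizer changes later.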

In [21]:
df_clean.drop(index=1566, inplace=True)

In [22]:
df_clean.to_csv('./data/clean_token_titles.csv', index=False)

In the next notebook, I will be vectorizing the data and performing EDA!