# Natural Language Processing
## Text Cleaning & Basic Bag of Words Wizardry

* Natalia Zhang wenshuo.zhang@gmail.com
* Remember: annotate thoroughly, so that you can find anything with Ctrl-F

## Table of Contents

1. Cleaning and processing step-by-step
2. Canned Subroutines
3. BoW Magic


## 1. Cleaning and Processing Step-by-Step

I've modified a real Hostelworld review to illustrate what's happening:

>I th1nk the best part    about this hostel was the great social atmosphere and planned activities! It made it really easy to meet and hangout with new people as a solo traveler. The 34 downside however would be the unfinished remodel of the bathroom nd the cleanliness of the dorm room.  But otherwise the location is good, situated in a relatively quiet neighborhood serviced by the northern line and a couple bus routes. The staff were exceptionally friendly and helpful for everything.

In [47]:
import numpy as np
import pandas as pd

text = 'I th1nk the best part    about this hostel was the great social atmosphere and planned activities! It made it really easy to meet and hangout with new people as a solo traveler. The 34 downside however would be the unfinished remodel of the bathroom nd the cleanliness of the dorm room.  But otherwise the location is good, situated in a relatively quiet neighborhood serviced by the northern line and a couple bus routes. The staff were exceptionally friendly and helpful for everything.'


### Tokenize me!

The next step is to clean and tokenize the text. In practice, "tokenizing" usually involves more than just splitting the text into a list of words.

**Steps may include:**
1. Lowercasing - Apple is the same thing as apple. Or maybe not!
2. Removing numbers - either standalone terms or digits stuck within words. You decide whether words like h4x0r are targets or not.
3. Filtering stop words - some words are used a lot, so you may not want them
    * You may want to use an existing set or extend/create your own
    * A common question is whether to include negation in the stop words set
4. Stemming _or_ lemmatizing
    * Stemming only cuts the ends of words off
    * Lemmatizing is more intelligent and tries to find the word root



In [48]:
import nltk

#tokenize (word_tokenize relies on NLTK's tokenizer data, fetched once with nltk.download)
tokens = nltk.word_tokenize(text)


#word_tokenize keeps the original case, so lowercase separately
tokens = [token.lower() for token in tokens]

#also removing punctuation
import string
tokens = [token for token in tokens if token not in string.punctuation]

print(tokens)

['i', 'th1nk', 'the', 'best', 'part', 'about', 'this', 'hostel', 'was', 'the', 'great', 'social', 'atmosphere', 'and', 'planned', 'activities', 'it', 'made', 'it', 'really', 'easy', 'to', 'meet', 'and', 'hangout', 'with', 'new', 'people', 'as', 'a', 'solo', 'traveler', 'the', '34', 'downside', 'however', 'would', 'be', 'the', 'unfinished', 'remodel', 'of', 'the', 'bathroom', 'nd', 'the', 'cleanliness', 'of', 'the', 'dorm', 'room', 'but', 'otherwise', 'the', 'location', 'is', 'good', 'situated', 'in', 'a', 'relatively', 'quiet', 'neighborhood', 'serviced', 'by', 'the', 'northern', 'line', 'and', 'a', 'couple', 'bus', 'routes', 'the', 'staff', 'were', 'exceptionally', 'friendly', 'and', 'helpful', 'for', 'everything']
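If NLTK's tokenizer data isn't available, a regular expression can serve as a rough stand-in. The sketch below is a shortcut under stated assumptions, not a replacement for `word_tokenize`: it lowercases first, then keeps runs of ASCII letters, digits, and apostrophes, so punctuation disappears in the same pass. The function name `regex_tokenize` is made up for illustration, and accented letters would be dropped by this pattern.

```python
import re

def regex_tokenize(text):
    """Lowercase the text and return runs of letters/digits/apostrophes."""
    # Anything outside [a-z0-9'] (spaces, punctuation) acts as a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "I th1nk the best part    about this hostel was the great social atmosphere!"
print(regex_tokenize(sample))
```

Repeated spaces and the trailing exclamation mark are simply skipped, while "th1nk" survives intact because digits are part of the pattern.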


### Remove numbers, punctuation, stop words

The text contains numbers, misspellings, and punctuation, so we'll want to filter them out.

However, keep in mind that social media text may be badly misspelled, so aggressive filtering can throw away real words.

In [19]:
#remove numerical sequences
tokens = [token for token in tokens if not str(token).isnumeric()]

#remove tokens that have words containing numbers, eg 7PM, h4x0r
#watch out bc this may remove too many words!
tokens = [token for token in tokens if not any(char.isdigit() for char in token)]

print(tokens)

['i', 'the', 'best', 'part', 'about', 'this', 'hostel', 'was', 'the', 'great', 'social', 'atmosphere', 'and', 'planned', 'activities', 'it', 'made', 'it', 'really', 'easy', 'to', 'meet', 'and', 'hangout', 'with', 'new', 'people', 'as', 'a', 'solo', 'traveler', 'the', 'downside', 'however', 'would', 'be', 'the', 'unfinished', 'remodel', 'of', 'the', 'bathroom', 'nd', 'the', 'cleanliness', 'of', 'the', 'dorm', 'room', 'but', 'otherwise', 'the', 'location', 'is', 'good', 'situated', 'in', 'a', 'relatively', 'quiet', 'neighborhood', 'serviced', 'by', 'the', 'northern', 'line', 'and', 'a', 'couple', 'bus', 'routes', 'the', 'staff', 'were', 'exceptionally', 'friendly', 'and', 'helpful', 'for', 'everything']
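Dropping every token that contains a digit is only one policy. As a minimal sketch, here are three options side by side, using only the standard library; the function names are invented for this example. Which one is right depends on whether "th1nk" should vanish, become "thnk", or become "th" and "nk".

```python
import re

def drop_tokens_with_digits(tokens):
    """Policy 1: discard any token that contains a digit."""
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

def strip_digits(tokens):
    """Policy 2: delete the digits and keep whatever is left of the token."""
    stripped = (re.sub(r"\d+", "", t) for t in tokens)
    return [t for t in stripped if t]  # pure numbers become empty and are dropped

def split_on_digits(tokens):
    """Policy 3: treat each run of digits as a separator."""
    pieces = []
    for t in tokens:
        pieces.extend(p for p in re.split(r"\d+", t) if p)
    return pieces

sample_tokens = ["i", "th1nk", "the", "34", "downside"]
print(drop_tokens_with_digits(sample_tokens))  # ['i', 'the', 'downside']
print(strip_digits(sample_tokens))             # ['i', 'thnk', 'the', 'downside']
print(split_on_digits(sample_tokens))          # ['i', 'th', 'nk', 'the', 'downside']
```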


### Remove accents

Different people may put accents on different words. You may want to clean things up so that fiance and fiancé are regarded as the same word. 

In [39]:
#Unfortunately NLTK doesn't have a native subroutine for this, so we'll have to write one

import unidecode

def strip_accent(text):
    return unidecode.unidecode(text)
    
sometext = "àéêöhello"
print(sometext, " ", strip_accent(sometext))


àéêöhello   aeeohello
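If installing `unidecode` isn't an option, the standard library can handle the common case. This is a sketch, not an equivalent: it decomposes each character and discards the combining marks, so it covers accented Latin letters but, unlike `unidecode`, it will not transliterate characters that have no decomposition (for example "ø" or "ß"). The name `strip_accent_stdlib` is made up for this example.

```python
import unicodedata

def strip_accent_stdlib(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accent_stdlib("àéêöhello"))  # aeeohello
print(strip_accent_stdlib("fiancé"))     # fiance
```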


In [20]:
# What is the standard NLTK stopwords list?
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [21]:
#Now that we know, it's a simple list comprehension to remove everything
tokens = [ token for token in tokens if token not in stop_words ]

print(tokens)

['best', 'part', 'hostel', 'great', 'social', 'atmosphere', 'planned', 'activities', 'made', 'really', 'easy', 'meet', 'hangout', 'new', 'people', 'solo', 'traveler', 'downside', 'however', 'would', 'unfinished', 'remodel', 'bathroom', 'nd', 'cleanliness', 'dorm', 'room', 'otherwise', 'location', 'good', 'situated', 'relatively', 'quiet', 'neighborhood', 'serviced', 'northern', 'line', 'couple', 'bus', 'routes', 'staff', 'exceptionally', 'friendly', 'helpful', 'everything']
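On the negation question: the standard list above includes 'no', 'nor', and 'not', so "not clean" and "clean" end up looking the same. A minimal sketch of keeping negation follows. The tiny `BASE_STOP_WORDS` set is a hand-picked stand-in so the example runs without NLTK; with the real list you would subtract the same negation words from it.

```python
# Illustrative subset only; a real stop word list is much longer.
BASE_STOP_WORDS = {"the", "was", "is", "a", "and", "no", "nor", "not"}
NEGATIONS = {"no", "nor", "not"}

# Keep negation words by taking them out of the stop word set.
stop_words_keep_negation = BASE_STOP_WORDS - NEGATIONS

def remove_stop_words(tokens, stop_word_set):
    """Return the tokens that are not in the given stop word set."""
    return [t for t in tokens if t not in stop_word_set]

tokens_demo = ["the", "room", "was", "not", "clean"]
print(remove_stop_words(tokens_demo, BASE_STOP_WORDS))           # ['room', 'clean']
print(remove_stop_words(tokens_demo, stop_words_keep_negation))  # ['room', 'not', 'clean']
```

For sentiment-style tasks the second output is usually the safer starting point, since the first one flips the meaning of the review.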


### Word Endings: Stemming

We can either stem OR lemmatize. Lemmatizing is smarter, but it takes slightly more time.

There are a bunch of stemmers, and you can choose between them.
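Before comparing the real stemmers, the difference between the two approaches can be sketched with toy versions. Neither function below is a real algorithm: `toy_stem` blindly chops a few suffixes (far cruder than Porter or Lancaster), and `toy_lemmatize` looks words up in a three-entry dictionary invented for this example.

```python
SUFFIXES = ("ies", "ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Chop the first matching suffix, as long as at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexicon such as WordNet.
TOY_LEMMAS = {"activities": "activity", "routes": "route", "were": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if we know it, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["activities", "routes", "bus"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer produces fragments like "activit" and "rout" that are not words, while the lookup returns real words but only for entries it knows about. That trade-off is the same one the real tools make at a larger scale.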

In [22]:
#lancaster
from nltk.stem.lancaster import LancasterStemmer

#no language argument needed
stemmer = LancasterStemmer()
tokens_lancaster = [ stemmer.stem(token) for token in tokens ]

In [23]:
#porter
from nltk.stem.porter import PorterStemmer

#no language argument needed
stemmer = PorterStemmer()
tokens_porter = [ stemmer.stem(token) for token in tokens ]

In [24]:
#snowball
from nltk.stem.snowball import SnowballStemmer

#set language
stemmer = SnowballStemmer("english")
tokens_snowball = [ stemmer.stem(token) for token in tokens ]

In [25]:
compare_stemmers = pd.DataFrame({'Lancaster': tokens_lancaster, 'Porter': tokens_porter, 'Snowball': tokens_snowball})
print(compare_stemmers)

   Lancaster        Porter      Snowball
0       best          best          best
1       part          part          part
2     hostel        hostel        hostel
3        gre         great         great
4        soc        social        social
5    atmosph     atmospher     atmospher
6       plan          plan          plan
7        act         activ         activ
8        mad          made          made
9       real        realli        realli
10      easy          easi          easi
11      meet          meet          meet
12   hangout       hangout       hangout
13       new           new           new
14     peopl         peopl         peopl
15      solo          solo          solo
16    travel        travel        travel
17   downsid       downsid       downsid
18     howev         howev         howev
19     would         would         would
20     unfin      unfinish      unfinish
21   remodel       remodel       remodel
22  bathroom      bathroom      bathroom
23        nd    

### Word Endings: Lemmatizing

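Alternatively, you can lemmatize for better results. Depending on the package, it may take a while. A lemmatizer maps a word to its dictionary form, and the result depends on the part of speech it assumes. NLTK's WordNetLemmatizer treats every word as a noun unless told otherwise, which is why, in the case below, "routes" has been lemmatized to "route" while "planned" is left alone.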

In [26]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

tokens = [lemmatizer.lemmatize(token) for token in tokens]

#let's compare against the stemmed versions earlier
compare_stemmers['Lemma'] = tokens
print(compare_stemmers)

   Lancaster        Porter      Snowball          Lemma
0       best          best          best           best
1       part          part          part           part
2     hostel        hostel        hostel         hostel
3        gre         great         great          great
4        soc        social        social         social
5    atmosph     atmospher     atmospher     atmosphere
6       plan          plan          plan        planned
7        act         activ         activ       activity
8        mad          made          made           made
9       real        realli        realli         really
10      easy          easi          easi           easy
11      meet          meet          meet           meet
12   hangout       hangout       hangout        hangout
13       new           new           new            new
14     peopl         peopl         peopl         people
15      solo          solo          solo           solo
16    travel        travel        travel       t

## 2. Canned Subroutines

Usually I end up combining most of these into a function, or take advantage of preexisting subroutines in different packages. NLTK is the multitool of NLP, but occasionally you may want to use gensim or spaCy.

I'm actually lying. You really shouldn't need to use spaCy for anything as basic as text cleaning.

In [52]:
#NLTK implementation
#This is the whole thing crunched together into one function.

#It's highly likely that you may not need to do ALL of this cleaning.
#Decide what you're trying to do and then judiciously modify subroutines as needed.

import string

import nltk
import unidecode
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = stopwords.words('english')

def strip_accent(text):
    return unidecode.unidecode(text)

def tokenize_text(text):
    lemmatizer = WordNetLemmatizer()

    #You may occasionally want to break it into sentences first with nltk.sent_tokenize(text)
    #and run an extra for loop in case you want a separate token list for each sentence
    tokens = nltk.word_tokenize(text)

    tokens = [token for token in tokens if token not in string.punctuation]  #drop punctuation
    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if not token.isnumeric()]  #drop pure numbers
    tokens = [token for token in tokens if not any(char.isdigit() for char in token)]  #drop words containing digits
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [strip_accent(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

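**Gensim Simple Preprocess:**

What it did:

1. lowercased the text
2. stripped punctuation
3. removed numbers (and split th1nk into two words, treating the 1 like a space)
4. removed accents (because deacc=True was passed)

What it did not do:

1. lemmatize
2. stem

Note that there are minimum and maximum settings on token length (min_len and max_len). The call below sets them to 1 and 20 so that single-letter tokens such as "i" and "a" survive.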

In [42]:
#Gensim Pre Process 1

from gensim.utils import simple_preprocess

g_tokens = simple_preprocess(text, deacc=True, min_len=1, max_len=20)
print(g_tokens)

['i', 'th', 'nk', 'the', 'best', 'part', 'about', 'this', 'hostel', 'was', 'the', 'great', 'social', 'atmosphere', 'and', 'planned', 'activities', 'it', 'made', 'it', 'really', 'easy', 'to', 'meet', 'and', 'hangout', 'with', 'new', 'people', 'as', 'a', 'solo', 'traveler', 'the', 'downside', 'however', 'would', 'be', 'the', 'unfinished', 'remodel', 'of', 'the', 'bathroom', 'nd', 'the', 'cleanliness', 'of', 'the', 'dorm', 'room', 'but', 'otherwise', 'the', 'location', 'is', 'good', 'situated', 'in', 'a', 'relatively', 'quiet', 'neighborhood', 'serviced', 'by', 'the', 'northern', 'line', 'and', 'a', 'couple', 'bus', 'routes', 'the', 'staff', 'were', 'exceptionally', 'friendly', 'and', 'helpful', 'for', 'everything']


**Better gensim alternative, preprocess_string:**

[Default filters](https://radimrehurek.com/gensim/parsing/preprocessing.html#gensim.parsing.preprocessing.preprocess_string):

1. lowercases the text
2. removes HTML tags
3. removes punctuation
4. collapses repeated whitespace
5. removes numbers
    * unlike simple_preprocess, this one deletes the digits and crunches the word together instead of splitting on them, so "th1nk" became "thnk"
6. removes stopwords
7. removes words shorter than a minimum number of letters (default = 3)
8. stems text using the Porter stemmer


In [49]:
# Alternate method

from gensim.parsing.preprocessing import preprocess_string
g_tokenss = preprocess_string(text)
print(g_tokenss)

['thnk', 'best', 'hostel', 'great', 'social', 'atmospher', 'plan', 'activ', 'easi', 'meet', 'hangout', 'new', 'peopl', 'solo', 'travel', 'downsid', 'unfinish', 'remodel', 'bathroom', 'cleanli', 'dorm', 'room', 'locat', 'good', 'situat', 'rel', 'quiet', 'neighborhood', 'servic', 'northern', 'line', 'coupl', 'bu', 'rout', 'staff', 'exception', 'friendli', 'help']


## 3. BoW Magic

Generally speaking, you'll want to transform documents into a bag of words (BoW) with scikit-learn, which has its own routines for straight word counts (CountVectorizer) or TF-IDF (TfidfVectorizer).
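To make "bag of words" concrete before handing it to a library, here is a minimal hand-rolled sketch using only the standard library. The two short documents are invented for the example, and splitting on whitespace stands in for real tokenization.

```python
from collections import Counter

docs = [
    "great hostel great staff",
    "quiet hostel",
]

# Vocabulary: every distinct word in the corpus, sorted so column order is stable.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['great', 'hostel', 'quiet', 'staff']
print(bow_matrix)  # [[2, 1, 0, 1], [0, 1, 1, 0]]
```

Word order is discarded entirely; only the counts per document remain, which is what the "bag" in the name refers to.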

Both vectorizers have built-in preprocessing options:
0. read from file
1. strip accents
2. lowercase
3. stop_words: built-in or you can pass a custom list
    * scikit-learn's documentation warns that its built-in English list has known issues, and that a stop word list should be preprocessed the same way as the tokens it is compared against
4. limit document frequency, both at minimum and maximum (min_df, max_df)

**Best of all, both vectorizers do ngrams, either at the word or the character level.**
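As a sketch of what n-grams are, independent of any library: an n-gram is a sliding window of n consecutive items, either words or characters. The helper names below are invented for this example. In scikit-learn the equivalent behaviour is selected through the `ngram_range` and `analyzer` parameters rather than written by hand.

```python
def word_ngrams(tokens, n):
    """Slide a window of n tokens across the list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Slide a window of n characters across the string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens_demo = ["great", "social", "atmosphere"]
print(word_ngrams(tokens_demo, 2))  # ['great social', 'social atmosphere']
print(char_ngrams("hostel", 3))     # ['hos', 'ost', 'ste', 'tel']
```

Word bigrams keep a little local context (useful for phrases like "not clean"), while character n-grams are more forgiving of misspellings.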

In [72]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vectorizer = CountVectorizer() #runs on default settings

#alternately, you can specify everything.
#For example, I want to use the tokenizer, tokenize_text that I built above!

count_vectorizer = CountVectorizer(tokenizer = tokenize_text)
bow_counts = count_vectorizer.fit_transform([text]) #get document counts
#SKLearn expects the corpus to be a list, with each document as an item in that list

##FULLY CUSTOMIZABLE CODE
#count_vectorizer = CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=np.int64)

In [73]:
print(count_vectorizer.get_feature_names_out().tolist()) #get list of words (older scikit-learn versions used get_feature_names())

['activity', 'atmosphere', 'bathroom', 'best', 'bus', 'cleanliness', 'couple', 'dorm', 'downside', 'easy', 'everything', 'exceptionally', 'friendly', 'good', 'great', 'hangout', 'helpful', 'hostel', 'however', 'line', 'location', 'made', 'meet', 'nd', 'neighborhood', 'new', 'northern', 'otherwise', 'part', 'people', 'planned', 'quiet', 'really', 'relatively', 'remodel', 'room', 'route', 'serviced', 'situated', 'social', 'solo', 'staff', 'traveler', 'unfinished', 'would']


In [74]:
print(bow_counts.toarray())

[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1]]


In [75]:
#put them together

sparse_bow = pd.DataFrame({
    'Word': count_vectorizer.get_feature_names_out(),
    'Count': bow_counts.toarray()[0],
})
print(sparse_bow)

#with a bigger corpus, you can also do sparse_bow.sort_values('Count', ascending=False, inplace=True)

             Word  Count
0        activity      1
1      atmosphere      1
2        bathroom      1
3            best      1
4             bus      1
5     cleanliness      1
6          couple      1
7            dorm      1
8        downside      1
9            easy      1
10     everything      1
11  exceptionally      1
12       friendly      1
13           good      1
14          great      1
15        hangout      1
16        helpful      1
17         hostel      1
18        however      1
19           line      1
20       location      1
21           made      1
22           meet      1
23             nd      1
24   neighborhood      1
25            new      1
26       northern      1
27      otherwise      1
28           part      1
29         people      1
30        planned      1
31          quiet      1
32         really      1
33     relatively      1
34        remodel      1
35           room      1
36          route      1
37       serviced      1
38       situated      1


In [None]:
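#You can also try to normalize the word frequency, but frankly you should
#probably just do TF-IDF

#CountVectorizer has no norm option, so normalize the count matrix afterwards
from sklearn.preprocessing import normalize

norm_count_vectorizer = CountVectorizer(tokenizer = tokenize_text)
normbow_counts = normalize(norm_count_vectorizer.fit_transform([text]), norm='l2')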

### TF-IDF
Most people will use TF-IDF (term frequency-inverse document frequency), which gives more weight to words that are rare across the corpus.

When the corpus contains only one document, every term has the same inverse document frequency, so you can check that the TF-IDF values equal the normalized counts.

General rule of thumb:
1. Word importance increases with # occurrences in document
2. Word importance decreases with # occurrences in corpus
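The two rules of thumb can be sketched by hand. The version below is a minimal standard-library sketch, with one labeled assumption: the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, is what I understand scikit-learn's TfidfVectorizer to apply by default. Other libraries and textbooks use slightly different formulas. The function name and toy documents are invented for this example.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Assumed smoothed IDF; terms found in fewer documents get a larger value.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # term frequency within this document
        weights = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

vocab, rows = tfidf_rows([["great", "hostel", "great"], ["quiet", "hostel"]])
for row in rows:
    print(dict(zip(vocab, (round(w, 3) for w in row))))
```

In the second document "quiet" outweighs "hostel" because "hostel" appears in both documents. With a single-document corpus every IDF is 1, so the result collapses to plain L2-normalized counts.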

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vect = TfidfVectorizer(tokenizer=tokenize_text)
tfidf_counts = tf_vect.fit_transform([text])

In [77]:
sparse_tfidf = pd.DataFrame({
    'Word': tf_vect.get_feature_names_out(),
    'TF-IDF': tfidf_counts.toarray()[0],
})
print(sparse_tfidf)


             Word    TF-IDF
0        activity  0.149071
1      atmosphere  0.149071
2        bathroom  0.149071
3            best  0.149071
4             bus  0.149071
5     cleanliness  0.149071
6          couple  0.149071
7            dorm  0.149071
8        downside  0.149071
9            easy  0.149071
10     everything  0.149071
11  exceptionally  0.149071
12       friendly  0.149071
13           good  0.149071
14          great  0.149071
15        hangout  0.149071
16        helpful  0.149071
17         hostel  0.149071
18        however  0.149071
19           line  0.149071
20       location  0.149071
21           made  0.149071
22           meet  0.149071
23             nd  0.149071
24   neighborhood  0.149071
25            new  0.149071
26       northern  0.149071
27      otherwise  0.149071
28           part  0.149071
29         people  0.149071
30        planned  0.149071
31          quiet  0.149071
32         really  0.149071
33     relatively  0.149071
34        remodel  0.149071
35           room  0.149071
36          route  0.149071
37       ser

You can also do this in gensim, but gensim is poorly documented relative to scikit-learn, and you might as well keep it all within the same package. Also, gensim's route to BoW goes through a separate Dictionary object (doc2bow) rather than a single vectorizer, so it is less convenient than its TF-IDF model.

I won't cover the topic further, but here's [the gensim API for TF-IDF](https://radimrehurek.com/gensim/models/tfidfmodel.html).