Source material is from "Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data" by Dipanjan Sarkar

# Definitions of common Natural Language Processing terms

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Corpus: A corpus is a large body of natural language text used for accumulating statistics on natural language text.

Tokenization: Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens.

Word stems: A word stem, also known as the base form of a word, is the form to which we attach affixes to create new words, a process known as inflection. Consider the word JUMP. You can add affixes to it and form new words like JUMPS, JUMPED, and JUMPING. In this case, the base word JUMP is the word stem.

Stemming: The reverse process of obtaining the base form of a word from its inflected form.

Lemmatization: Similar to stemming, we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, not the root stem.
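To make the first few definitions concrete, here is a minimal sketch of tokenization followed by stop word filtering. It uses only the standard library; `toy_tokenize` and `TOY_STOP_WORDS` are hypothetical names introduced for illustration and are not part of nltk or spacy.

```python
import re

# Illustrative stop word set (an assumption); real lists such as nltk's are much longer.
TOY_STOP_WORDS = {"the", "a", "an", "in"}

def toy_tokenize(text):
    # Each run of word characters is one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("The cat sat in a box.")
content_tokens = [t for t in tokens if t.lower() not in TOY_STOP_WORDS]

print(tokens)          # ['The', 'cat', 'sat', 'in', 'a', 'box', '.']
print(content_tokens)  # ['cat', 'sat', 'box', '.']
```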

# Section 1. Text wrangling and pre-processing

This section will highlight some of the most important steps which are used heavily in Natural Language Processing (NLP) pipelines. We will be leveraging a fair bit of nltk and spacy, both "state-of-the-art" libraries in NLP.

In [1]:
#______Standard Packages___________#
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import os
import pickle
%matplotlib inline

#_________NLP specific Packages_____________#

import requests #HTTP client, lets us fetch web pages
from bs4 import BeautifulSoup #Web Scraper
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
import unicodedata

The following is a contraction map, which will be useful for our future text processing.

In [2]:
CONTRACTION_MAP=pd.read_pickle("CONTRACTION_MAP.pickle")

Now we load the necessary dependencies for text pre-processing. We remove the negation words "no" and "not" from the stop word list (see the definition above), since they can be useful, especially during sentiment analysis.

In [3]:
!/Users/user/anaconda/envs/nlp_env/bin/python -m spacy download en_core_web_md

You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

    Linking successful
    /Users/user/anaconda/envs/nlp_env/lib/python3.5/site-packages/en_core_web_md
    -->
    /Users/user/anaconda/envs/nlp_env/lib/python3.5/site-packages/spacy/data/en_core_web_md

    You can now load the model via spacy.load('en_core_web_md')



In [4]:
nlp = spacy.load('en_core_web_md', parse=True, tag=True, entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

Unstructured text often contains noise, especially if you use techniques like web or screen scraping. HTML tags are typically one such component: they don’t add much value towards understanding and analyzing text. The following method removes HTML tags from our text.

In [5]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [6]:
#Example
strip_html_tags('<html><h2>Some important text</h2></html>')
#Returns

'Some important text'

In [7]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [8]:
#Example
remove_accented_chars('Sómě Áccěntěd těxt')
#Returns

'Some Accented text'
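The accent removal above works because NFKD normalization splits an accented character into a base letter plus a separate combining mark, and the ASCII encoding step then discards the mark. The sketch below shows this with the standard library only; the sample words are my own. Note the side effect in the last line: a character with no ASCII base letter is dropped entirely.

```python
import unicodedata

word = 'caf\u00e9'  # 'café' written with a single precomposed é
decomposed = unicodedata.normalize('NFKD', word)
print(len(word), len(decomposed))  # 4 5 (é became 'e' + combining acute accent)

stripped = decomposed.encode('ascii', 'ignore').decode('ascii')
print(stripped)  # cafe

# 'ø' has no decomposition, so the ASCII step removes it altogether
no_base = unicodedata.normalize('NFKD', 'S\u00f8ren').encode('ascii', 'ignore').decode('ascii')
print(no_base)  # Sren
```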

Any text corpus will usually contain contractions. Converting each contraction to its expanded, original form helps with text standardization. Below is a method which performs this task for us. Note that this method relies heavily on the regular expression package; more detail on its inner workings can be gleaned from the documentation (https://docs.python.org/3/library/re.html).

In [9]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [10]:
#Example
expand_contractions("Y'all can't expand contractions I'd think")
#Returns

'You all cannot expand contractions I would think'
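Since the contents of CONTRACTION_MAP.pickle are not shown here, the following standalone sketch illustrates the same idea with a hypothetical three-entry map (`TOY_CONTRACTIONS` is an assumption, not the real map). It builds one alternation pattern from the keys and uses a callback to look up each match while preserving the capitalisation of its first character.

```python
import re

# Hypothetical map for illustration; the tutorial's CONTRACTION_MAP is much larger.
TOY_CONTRACTIONS = {"can't": "cannot", "i'd": "i would", "y'all": "you all"}

# Escape each key and try longer keys first so a short key never shadows a longer one.
toy_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(TOY_CONTRACTIONS, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def toy_expand(match):
    found = match.group(0)
    expanded = TOY_CONTRACTIONS[found.lower()]
    # Keep the original first character so sentence-initial capitals survive.
    return found[0] + expanded[1:]

result = toy_pattern.sub(toy_expand, "Y'all can't expand contractions I'd think")
print(result)  # You all cannot expand contractions I would think
```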

Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

Note that removing digits is optional in the method below, because we often need to keep them in the pre-processed text.

In [11]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

In [12]:
#Example
remove_special_characters("Well this was fun! What do you think? 123#@!", remove_digits=True)
#Returns

'Well this was fun What do you think '
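As a standalone illustration of the digit option, the sketch below runs the same kind of pattern on a sample sentence of my own with and without digit removal. `strip_special` is a hypothetical helper written for this example. Removing characters can leave runs of spaces behind, so a whitespace-collapsing step is shown as well.

```python
import re

def strip_special(text, remove_digits=False):
    # Keep ASCII letters and whitespace; keep digits too unless asked to drop them.
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = "Room 101: 50% off!"
with_digits = strip_special(sample)
without_digits = strip_special(sample, remove_digits=True)
collapsed = re.sub(r'\s+', ' ', without_digits)

print(repr(with_digits))     # 'Room 101 50 off'
print(repr(without_digits))  # 'Room   off'
print(repr(collapsed))       # 'Room off'
```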

Stemming standardizes words to their base or root stem, irrespective of their inflections, which helps applications such as text classification, clustering, and information retrieval. We will use the popular Porter stemmer (https://tartarus.org/martin/PorterStemmer/) for this purpose.

In [13]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [14]:
#Example
simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")
#Returns

'My system keep crash hi crash yesterday, our crash daili'

NOTE: Stemming usually applies a fixed set of rules, so the root stems may not be lexicographically correct. This means the stemmed words may not be real words and may not be present in the dictionary (as evident from the preceding output).
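To see why fixed rules produce non-words, consider this toy suffix-stripping stemmer. It is emphatically not the Porter algorithm: `SUFFIX_RULES` and `toy_stem` are hypothetical and far cruder, but they show how rules applied blindly yield stems such as "daili" and "thi".

```python
# Toy rules checked in order: (suffix to strip, replacement). NOT the Porter algorithm.
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("es", ""), ("s", ""), ("y", "i")]

def toy_stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least 3 letters remain, so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

stems = [toy_stem(w) for w in ["crashing", "crashed", "crashes", "daily", "his", "this"]]
print(stems)  # ['crash', 'crash', 'crash', 'daili', 'his', 'thi']
```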

Now we will address lemmatization.

The key difference between stemming and lemmatization is that the root word in lemmatization is always a lexicographically correct word (present in the dictionary), but the root stem may not be so. Thus, the root word, also known as the lemma, will always be present in the dictionary.

Below is a method in which we use spacy to do this for us.

In [15]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

In [16]:
#Example
lemmatize_text("My system keeps crashing his crashed yesterday, ours crashes daily")
#Returns

'My system keep crash his crash yesterday , ours crash daily'

If lemmatization works this well, why would we even bother with stemming?

Well, the lemmatization process is considerably slower than stemming, because an additional step is involved: the root form or lemma is formed by removing the affix from the word if and only if the lemma is present in the dictionary. So sometimes stemming is desirable.
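The extra dictionary check can be sketched with a toy lemmatizer. This is an illustration of the idea only, not how spacy works internally; `TOY_VOCAB`, `SUFFIXES` and `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexicon, which is where the additional cost comes from.

```python
# Tiny illustrative vocabulary (an assumption); a real lemmatizer uses a full lexicon.
TOY_VOCAB = {"crash", "keep", "daily", "system"}
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_lemmatize(word):
    if word in TOY_VOCAB:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            # The extra step: accept the candidate only if the dictionary contains it.
            if candidate in TOY_VOCAB:
                return candidate
    # No dictionary form found, so the word is returned unchanged.
    return word

lemmas = [toy_lemmatize(w) for w in ["keeps", "crashing", "crashes", "daily", "his"]]
print(lemmas)  # ['keep', 'crash', 'crash', 'daily', 'his']
```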

We have previously mentioned stop words in passing and defined what we mean by them. Now we provide a method to eliminate these words from our corpus, as they tend to be very frequent. We will be using ToktokTokenizer. The tok-tok tokenizer is a simple, general tokenizer, where the input has one sentence per line; thus only the final period is tokenized. More can be learned here: https://www.nltk.org/_modules/nltk/tokenize/toktok.html


There is no universal stopword list, but we use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed.




In [17]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [18]:
#Example
remove_stopwords("The, and, if are stopwords, computer is not")
#Returns

', , stopwords , computer not'
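Adding domain-specific stop words, as mentioned earlier, can be sketched with plain set operations. The base list here is a small stand-in for nltk's English list, and the restaurant-related terms are hypothetical choices made purely for illustration.

```python
# Small stand-in for nltk's English stop word list (an assumption for this sketch).
base_stopwords = ["the", "and", "if", "are", "is", "no", "not"]

# Keep negations, and add hypothetical domain-specific terms for restaurant reviews.
negations = {"no", "not"}
domain_stopwords = {"restaurant", "food"}
custom_stopwords = (set(base_stopwords) - negations) | domain_stopwords

def filter_tokens(tokens, stopwords):
    # Compare in lower case so capitalised stop words are removed too.
    return [t for t in tokens if t.lower() not in stopwords]

filtered = filter_tokens("The food is not good".split(), custom_stopwords)
print(filtered)  # ['not', 'good']
```

Using a set rather than a list also makes each membership test constant time on average, which matters on a large corpus.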

There is a lot more we can do, like correcting spelling, grammar and so on. However, one of the objectives of this tutorial is for you to learn to write some of these methods for yourself (experiment and bring some to the next meeting?). So for now we are just going to bring what we have covered so far together in the following section.

# Section 2.  Building a Text Normalizer

In this section we are going to chain together all the operations we have learned in order to build a text normalizer to pre-process text data.

In [19]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    # This function exposes the options of our previous functions through a single call
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower() #Standard text lower method
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
    return normalized_corpus

# Section 3. A Basic Machine Learning application

In [20]:
# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

In [21]:
reviews=normalize_corpus(dataset['Review'], html_stripping=False)

In [22]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(reviews).toarray()
y = dataset.iloc[:, 1].values

In [23]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [24]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [25]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [26]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [27]:
cm

array([[51, 46],
       [11, 92]])
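The confusion matrix is easier to interpret once it is reduced to a few summary metrics. The sketch below derives them from the numbers printed above. scikit-learn's `confusion_matrix` puts true labels in rows and predicted labels in columns; treating label 1 as the positive class (a liked review) is an assumption about this dataset.

```python
# Values copied from the output above; rows are true labels, columns are predictions.
cm = [[51, 46],
      [11, 92]]
tn, fp = cm[0]  # true label 0: correctly rejected, wrongly predicted as 1
fn, tp = cm[1]  # true label 1: missed, correctly predicted as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy={:.3f} precision={:.3f} recall={:.3f} f1={:.3f}".format(
    accuracy, precision, recall, f1))
# accuracy=0.715 precision=0.667 recall=0.893 f1=0.763
```

With 46 false positives against 11 false negatives, this classifier leans towards predicting label 1, which is why recall is noticeably higher than precision.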