# Introduction

In this notebook, I implement an end-to-end NLP pipeline that processes text data, corrects spelling, and extracts keywords and n-grams from text in multiple languages.

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Preprocessing
import numpy as np
import pandas as pd
import nltk
import re
import string
import html
import unicodedata
import blocks

# NLP Pipeline
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords 
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from collections import Counter
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Data

We will be looking at two different datasets for this project.

### Amazon Reviews
https://www.kaggle.com/snap/amazon-fine-food-reviews/version/2

This dataset consists of reviews of fine foods from Amazon. The data spans a period of more than 10 years, including all ~500,000 reviews up to October 2012. For the purposes of this demonstration, we will use only the first 1,000 rows.

In [2]:
df1 = pd.read_csv('Reviews.csv').head(1000)
df1 = df1[['Text']]
df1.head(10)

|   | Text |
|---|------|
| 0 | I have bought several of the Vitality canned d... |
| 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | This is a confection that has been around a fe... |
| 3 | If you are looking for the secret ingredient i... |
| 4 | Great taffy at a great price. There was a wid... |
| 5 | I got a wild hair for taffy and ordered this f... |
| 6 | This saltwater taffy had great flavors and was... |
| 7 | This taffy is so good. It is very soft and ch... |
| 8 | Right now I'm mostly just sprouting this so my... |
| 9 | This is a very healthy dog food. Good for thei... |


Let's inspect a long review that contains HTML elements and a relatively large vocabulary.

In [3]:
df1['Text'][508]

'These are perhaps the worst chips that have ever gone into my mouth.<br /><br />For my entire life Sour Cream & Onion (and in this case "& Chive") chips were my favorite. Recently Kettle brand Honey Dijon Mustard took that slot. So when I found out they had sour cream & onion I just had to try them.<br /><br />As soon as I opened the bag the chips smelled of powdered milk. And indeed each chip is coated with a powdered sour cream that is just awful. It tastes like rancid milk. Not just sour, but like sour cream that when rancid. The powdery texture is also extremely unappealing. I basically hated these chips. I would not recommend these chips to anyone unless they had a particular affinity for a powdery, chalky texture on a chip, with a rancid (and onion flavor) and I have a hard time believing that person exists.<br /><br />I plan on contacting Kettle and sharing my thoughts with them. Hopefully they\'ll reassess the seasoning on these otherwise wonderful kettle style chips.'

### SMS Spam Collection
https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded messages, each tagged as legitimate (ham) or spam. For the purposes of this demonstration, we will use only the first 1,000 rows.
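
Each line of the file holds a label and a message separated by a tab, which is what the split on `'\t'` in the loading cell relies on. As a minimal sketch of that parsing step in plain Python (the helper name `parse_sms_line` and the shortened sample lines are illustrative assumptions, not part of the original notebook):

```python
def parse_sms_line(line):
    """Split one raw line into (label, message), dropping the trailing newline."""
    label, sep, message = line.rstrip('\n').partition('\t')
    if not sep:
        # No tab found: the line does not follow the label<TAB>message layout
        raise ValueError('expected a tab between label and message')
    return label, message

# Shortened sample lines in the same shape as the file (illustrative only)
raw_lines = [
    'ham\tOk lar... Joking wif u oni...\n',
    'spam\tWINNER!! As a valued network customer\n',
]
records = [parse_sms_line(line) for line in raw_lines]
print(records)
```

Stripping the newline at parse time would also avoid the trailing `\n` that is visible in the raw messages shown in this section.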

In [4]:
with open('SMSSpamCollection.txt') as f:
    lines = f.readlines()
    df2 = pd.DataFrame(lines).head(1000)
    df2.columns = ['Text']
    df2[['SMS', 'Text']] = df2.Text.str.split('\t', expand=True)
    
df2.head(10)

|   | Text | SMS |
|---|------|-----|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni...\n | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | spam |
| 6 | Even my brother is not like to speak with me. ... | ham |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | ham |
| 8 | WINNER!! As a valued network customer you have... | spam |
| 9 | Had your mobile 11 months or more? U R entitle... | spam |


Let's inspect an SMS that is riddled with informal spellings and grammatical errors.

In [5]:
df2['Text'][244]

"Although i told u dat i'm into baig face watches now but i really like e watch u gave cos it's fr u. Thanx 4 everything dat u've done today, i'm touched...\n"

# Pre-Processing Pipeline

Read the function docstrings to understand what each function does.

In [6]:
def clean_math(text):
    '''Clean all math elements'''
    cleanr = re.compile(r'<math>(.*?)</math>')
    clean_text = re.sub(cleanr, ' ', text)
    return ' '.join(clean_text.split())

def clean_html(text):
    '''Clean all HTML elements'''
    cleanr = re.compile('<.*?>')
    clean_text = re.sub(cleanr, ' ', text)
    clean_text = html.unescape(clean_text)
    clean_text = unicodedata.normalize('NFKD', clean_text).encode('ASCII','ignore').decode('ASCII')    
    return ' '.join(clean_text.split())

def clean_unicode(text):
    '''Clean text by unicode block. Remove unwanted symbols and emojis'''
    cleanr = re.compile(
        "(["
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F700-\U0001F77F"  # alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        "\U000027F0-\U000027FF"  # Supplemental Arrows-A
        "\U00002900-\U0000297F"  # Supplemental Arrows-B
        "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        "\U0001FA00-\U0001FA6F"  # Chess Symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U00002000-\U0000206F"  # General Punctuation
        "\U00002E00-\U00002E7F"  # Supplemental Punctuation
        "\U00003000-\U0000303F"  # CJK Symbols and Punctuation
        "\U00002200-\U000022FF"  # Mathematical Operators
        "\U0001D400-\U0001D7FF"  # Mathematical Alphanumeric Symbols
        "\U000020A0-\U000020CF"  # Currency Symbols
        "\U00002100-\U0000214F"  # Letterlike Symbols
        "\U000020D0-\U000020FF"  # Combining Diacritical Marks for Symbols
        "\U00002190-\U000021FF"  # Arrows 
        "\U00002B00-\U00002BFF"  # Miscellaneous Symbols and Arrows 
        "\U000025A0-\U000025FF"  # Geometric Shapes 
        "\U00002600-\U000026FF"  # Miscellaneous Symbols         
        "\U00002070-\U0000209F"  # Superscripts and Subscripts
        "\U00002150-\U0000218F"  # Number Forms
        "\U00000250-\U000002AF"  # IPA Extensions
        "\U0000E000-\U0000F8FF"  # Private Use Area
        "\U0000FE00-\U0000FE0F"  # Variation Selectors
        "\U00002300-\U000023FF"  # Miscellaneous Technical
        "\U000002B0-\U000002FF"  # Spacing Modifier Letters
        "\U00002500-\U0000257F"  # Box Drawing
        "])"
    )
    
    clean_text = re.sub(cleanr, ' ', text)
    return ' '.join(clean_text.split())

def clean_punct(text):
    '''Remove all punctuation'''
    return text.translate(str.maketrans('', '', string.punctuation))

def clean_digits(text):
    '''Remove all digits'''
    return text.translate(str.maketrans('', '', string.digits))

def process_text(text):
    text = clean_math(text)
    text = clean_html(text)
    text = clean_unicode(text)
    text = clean_punct(text)
    text = clean_digits(text)
    text = text.strip()
    text = text.lower()
    return text

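def label_unicode(row):
    '''Label the document by the unicode block of its highest code point'''
    seq = Counter(row['Text'])
    if len(seq) > 0:
        # max() over a Counter compares its keys, so this picks the character
        # with the highest code point, not the most frequent one
        return blocks.of(max(seq))
    else:
        return None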
        
def clean_text(df):
    '''Combine all pre-processing into one function'''
    df['Text'] = df['Text'].map(lambda x: process_text(x))
    df['unicode'] = df.apply(label_unicode, axis = 1)
    df.dropna(inplace=True)
    
    return df
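
To make the effect of the cleaning steps concrete, here is a minimal standard-library sketch that applies the same kind of transformations to a made-up string. The helper names `strip_html` and `strip_punct_and_digits` and the sample text are assumptions for illustration; the sketch mirrors `clean_html`, `clean_punct`, and `clean_digits` but is not a replacement for them.

```python
import html
import re
import string
import unicodedata

def strip_html(text):
    # Replace tags with a space, then decode entities such as &amp;
    text = re.sub(r'<.*?>', ' ', text)
    text = html.unescape(text)
    # NFKD splits accented letters into base letter + combining mark;
    # the ASCII round-trip then drops every character that is not ASCII
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Collapse runs of whitespace
    return ' '.join(text.split())

def strip_punct_and_digits(text):
    # Delete punctuation and digits without inserting anything in their place
    table = str.maketrans('', '', string.punctuation + string.digits)
    return text.translate(table)

sample = 'Caf&eacute; cr&egrave;me<br />was $5 &amp; great!'  # made-up input
cleaned = strip_punct_and_digits(strip_html(sample)).strip().lower()
print(cleaned.split())
```

Because punctuation and digits are deleted after whitespace has been collapsed, a removed token such as `&` leaves two adjacent spaces behind, which is why the cleaned review contains `sour cream  onion`. Splitting on whitespace, as the later tokenization does, makes this harmless.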

### Amazon Review Text

In [7]:
df1 = clean_text(df1)
df1.head(10)

|   | Text | unicode |
|---|------|---------|
| 0 | i have bought several of the vitality canned d... | BASIC_LATIN |
| 1 | product arrived labeled as jumbo salted peanut... | BASIC_LATIN |
| 2 | this is a confection that has been around a fe... | BASIC_LATIN |
| 3 | if you are looking for the secret ingredient i... | BASIC_LATIN |
| 4 | great taffy at a great price there was a wide ... | BASIC_LATIN |
| 5 | i got a wild hair for taffy and ordered this f... | BASIC_LATIN |
| 6 | this saltwater taffy had great flavors and was... | BASIC_LATIN |
| 7 | this taffy is so good it is very soft and chew... | BASIC_LATIN |
| 8 | right now im mostly just sprouting this so my ... | BASIC_LATIN |
| 9 | this is a very healthy dog food good for their... | BASIC_LATIN |


In [8]:
df1['Text'][508]

'these are perhaps the worst chips that have ever gone into my mouth for my entire life sour cream  onion and in this case  chive chips were my favorite recently kettle brand honey dijon mustard took that slot so when i found out they had sour cream  onion i just had to try them as soon as i opened the bag the chips smelled of powdered milk and indeed each chip is coated with a powdered sour cream that is just awful it tastes like rancid milk not just sour but like sour cream that when rancid the powdery texture is also extremely unappealing i basically hated these chips i would not recommend these chips to anyone unless they had a particular affinity for a powdery chalky texture on a chip with a rancid and onion flavor and i have a hard time believing that person exists i plan on contacting kettle and sharing my thoughts with them hopefully theyll reassess the seasoning on these otherwise wonderful kettle style chips'

### SMS Text

In [9]:
df2 = clean_text(df2)
df2.head(10)

|   | Text | SMS | unicode |
|---|------|-----|---------|
| 0 | go until jurong point crazy available only in ... | ham | BASIC_LATIN |
| 1 | ok lar joking wif u oni | ham | BASIC_LATIN |
| 2 | free entry in a wkly comp to win fa cup final... | spam | BASIC_LATIN |
| 3 | u dun say so early hor u c already then say | ham | BASIC_LATIN |
| 4 | nah i dont think he goes to usf he lives aroun... | ham | BASIC_LATIN |
| 5 | freemsg hey there darling its been weeks now ... | spam | BASIC_LATIN |
| 6 | even my brother is not like to speak with me t... | ham | BASIC_LATIN |
| 7 | as per your request melle melle oru minnaminun... | ham | BASIC_LATIN |
| 8 | winner as a valued network customer you have b... | spam | BASIC_LATIN |
| 9 | had your mobile months or more u r entitled t... | spam | BASIC_LATIN |


In [10]:
df2['Text'][244]

'although i told u dat im into baig face watches now but i really like e watch u gave cos its fr u thanx  everything dat uve done today im touched'
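
Every document above is labeled `BASIC_LATIN`. The sketch below illustrates the labeling idea used by `label_unicode` (look at the character with the highest code point) with the standard library only. It is an approximation under stated assumptions: `script_of`, `label_script`, and `ascii_fold` are hypothetical helpers, and the first word of the Unicode character name stands in for the block name that the `blocks` module returns.

```python
import unicodedata

def script_of(ch):
    # First word of the Unicode character name, e.g. 'LATIN', 'CYRILLIC', 'CJK'
    name = unicodedata.name(ch, '')
    return name.split(' ')[0] if name else None

def label_script(text):
    # Ignore whitespace, then label by the character with the highest code point
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return None
    return script_of(max(chars))

def ascii_fold(text):
    # Same folding step as in clean_html: NFKD, then drop all non-ASCII characters
    folded = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    return ' '.join(folded.split())

print(label_script('hello world'))
print(label_script('привет мир'))
print(repr(ascii_fold('привет мир')))
```

One consequence worth keeping in mind: since `clean_html` folds text to ASCII before `label_unicode` runs, a document written entirely in a non-Latin script is reduced to an empty string, receives no label, and is removed by `dropna`. Supporting such documents would require skipping or relaxing the ASCII folding step.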

# Create Corpus Vocabulary

In this section, we create a document vocabulary by filtering out stop words and using a `CountVectorizer` instance.

In [11]:
def filter_docs(df):
    '''
    Create a vocabulary from all documents that only includes non-stopwords 
    that appear in at least 5 documents and at most 80% of all documents
    '''
    stop = []
    lst = []

    # Collect stopwords from every language available in NLTK
    for language in stopwords.fileids():
        stop.extend(stopwords.words(language))
        
    # Only count non-stopwords that appear in at least 5 documents and at most 80% of all documents
    cv = CountVectorizer(min_df = 5, max_df = 0.80, stop_words = stop)

    # Fit the CountVectorizer separately for each unicode block (a proxy for language),
    # skipping blocks with 100 or fewer documents
    for block in np.unique(df['unicode']):
        docs = df[df['unicode'] == block]['Text'].tolist()
        if len(docs) > 100:
            cv.fit(docs)
            lst.extend(cv.vocabulary_.keys())

    vocab = set(lst)
    
    return vocab
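
The effect of the `min_df` and `max_df` thresholds can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: `doc_frequency_vocab` is a hypothetical helper, the toy documents are made up, and whitespace splitting stands in for `CountVectorizer`'s tokenizer (whose default token pattern also drops single-character tokens).

```python
from collections import Counter

def doc_frequency_vocab(docs, min_df=5, max_df=0.80, stop_words=()):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    stop = set(stop_words)
    df_counts = Counter()
    for doc in docs:
        # Count each term once per document, ignoring stop words
        df_counts.update(set(doc.split()) - stop)
    max_count = max_df * len(docs)
    return {term for term, count in df_counts.items() if min_df <= count <= max_count}

docs = [
    'great taffy great price',
    'the taffy was soft',
    'soft and chewy taffy',
    'the price was fair',
    'chewy candy',
]
vocab = doc_frequency_vocab(docs, min_df=2, max_df=0.5, stop_words={'the', 'and', 'was'})
print(sorted(vocab))
```

Here `taffy` appears in 3 of 5 documents, which exceeds the 50% ceiling, while `great`, `fair`, and `candy` appear in only one document each, so all four are filtered out.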

# Extract Keywords
In this section, we extract keywords from the text by filtering out stop words and words that are not in the vocabulary. The words are spell-corrected and lemmatized to maintain context for phrase analysis.

In [12]:
def create_kw_map(vocab):
    '''
    Map each vocabulary word that has no WordNet synsets 
    to the spelling suggested by TextBlob
    '''
    kw_map = dict()
    
    for word in vocab:
        if not wordnet.synsets(word):
            kw_map[word] = str(TextBlob(word).correct())            
    
    return kw_map

def extract_words(df, vocab, kw_map):
    '''
    Extract list of spell corrected and lemmatized keywords 
    from each document that is included in the vocabulary 
    '''
#     ps = PorterStemmer()
    lm = WordNetLemmatizer()
    
    # Tokenize
    df['Text'] = df['Text'].map(lambda x: nltk.word_tokenize(x))
    
    # Remove word if not in vocab
    df['Text'] = df['Text'].map(lambda x: list(filter(lambda w: w in vocab, x)))
    
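    # Correct word spelling if incorrect
    # NOTE: DataFrame.replace matches whole cell values, so tokens inside the
    # list-valued 'Text' cells appear to be left unchanged by this call
    # (the recorded outputs still contain e.g. 'thanx')
    df = df.replace(kw_map)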
    
    # Lemmatize words
    df['Text'] = df['Text'].map(lambda x: list(map(lambda w: lm.lemmatize(w), x)))

#     Stem words (Instead of lemmatization)
#     df['Text'] = df['Text'].map(lambda x: list(map(lambda w: ps.stem(w), x)))

#     Select unique words (optional)
#     df['Text'] = df['Text'].map(lambda x: np.unique(x))
    
    return df
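
Since each `Text` cell holds a list of tokens at the point where the spelling map is applied, a per-token lookup is one way to make sure the corrections reach the individual words. The following is a minimal sketch under that assumption; `apply_kw_map` is a hypothetical helper and the small map is made up for illustration.

```python
def apply_kw_map(tokens, kw_map):
    """Replace each token by its mapped spelling, leaving unmapped tokens as-is."""
    return [kw_map.get(token, token) for token in tokens]

# Made-up correction map and token list, for illustration only
kw_map = {'thanx': 'thanks', 'gud': 'good'}
tokens = ['thanx', 'everything', 'gud']
corrected = apply_kw_map(tokens, kw_map)
print(corrected)

# Inside extract_words this could be applied per row, for example:
# df['Text'] = df['Text'].map(lambda toks: apply_kw_map(toks, kw_map))
```

Given how many automatic suggestions in a spell-check dictionary can be wrong for brand names and slang, it may be worth reviewing the map before applying it this way.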

# Create N-grams

Suppose we wish to analyse phrases as well as keywords. The functions below allow users to create n-grams either by specifying `n` or by generating all n-grams of every size from 1 to `n-1`, where `n` is the number of tokens in the phrase.

In [13]:
def create_ngrams(n, tokens):
    ''' 
    Create list of n-grams from the input phrase
    '''
    if n == 1:
        return tokens
    else:
        # A phrase of length L has L-n+1 windows of size n
        return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    
def create_all_ngrams(tokens):
    ''' 
    Create list of all n-grams from n=1 to n=len(phrase)-1 from the input phrase
    '''
    out = []
    
    # range() stops before len(tokens), so the full phrase itself is not included
    for n in range(1, len(tokens)):
        out.extend(create_ngrams(n, tokens))
        
    return out
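
As a cross-check on the sliding-window logic, here is an alternative sketch that builds n-grams by zipping shifted copies of the token list, together with the expected counts. The helper `sliding_ngrams` is a hypothetical name; the token list is the cleaned SMS example used in the results. A phrase of length L has L-n+1 n-grams of size n, so collecting all sizes from 1 to L-1 yields L(L+1)/2 - 1 n-grams in total.

```python
def sliding_ngrams(tokens, n):
    # zip over n shifted views of the list; zip stops at the shortest view,
    # which yields exactly len(tokens) - n + 1 windows
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = ['told', 'really', 'like', 'co', 'thanx', 'everything', 'done', 'today']

bigrams = sliding_ngrams(tokens, 2)
all_grams = [g for n in range(1, len(tokens)) for g in sliding_ngrams(tokens, n)]

L = len(tokens)
expected_total = L * (L + 1) // 2 - 1
print(len(bigrams), len(all_grams), expected_total)
```

For the 8-token example this gives 7 bigrams and 35 n-grams overall, which matches the length of the list produced by `create_all_ngrams` for that message.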

# Results

### Amazon Review Text

In [14]:
vocab1 = filter_docs(df1)
len(vocab1)

1215

#### Spell Check Dictionary

In [15]:
kw_map1 = create_kw_map(vocab1)
kw_map1

{'ive': 'give',
 'walmart': 'palmar',
 'microbrews': 'microbes',
 'shes': 'she',
 'dont': 'dont',
 'others': 'others',
 'couldnt': 'couldn',
 'pocky': 'rocky',
 'bbq': 'by',
 'else': 'else',
 'youve': 'you',
 'isnt': 'isn',
 'youre': 'your',
 'starbucks': 'starbucks',
 'havent': 'haven',
 'something': 'something',
 'hey': 'hey',
 'would': 'would',
 'wouldnt': 'wouldn',
 'glutenfree': 'glutenfree',
 'thats': 'that',
 'etc': 'etc',
 'whenever': 'whenever',
 'glycemic': 'glycerin',
 'ahmad': 'ahead',
 'crunchy': 'crutch',
 'unless': 'unless',
 'everyone': 'everyone',
 'doesnt': 'doesn',
 'could': 'could',
 'didnt': 'didn',
 'dissapointed': 'disappointed',
 'everything': 'everything',
 'trans': 'trans',
 'anything': 'anything',
 'although': 'although',
 'habanero': 'habanero',
 'mccanns': 'mccanns',
 'twizzlers': 'twizzlers',
 'youd': 'you',
 'youll': 'you',
 'without': 'without',
 'germanstyle': 'germanstyle',
 'steaz': 'steam',
 'amazoncom': 'amazoncom',
 'plockys': 'locks',
 'oz': 'oz',

In [16]:
df1 = extract_words(df1, vocab1, kw_map1)
df1.head(10)

|   | Text | unicode |
|---|------|---------|
| 0 | [bought, several, canned, food, product, found... | BASIC_LATIN |
| 1 | [product, arrived, salted, peanut, actually, s... | BASIC_LATIN |
| 2 | [around, light, nut, case, cut, tiny, sugar, t... | BASIC_LATIN |
| 3 | [looking, ingredient, believe, found, got, add... | BASIC_LATIN |
| 4 | [great, taffy, great, price, yummy, taffy, del... | BASIC_LATIN |
| 5 | [got, taffy, ordered, five, pound, bag, taffy,... | BASIC_LATIN |
| 6 | [taffy, great, flavor, soft, chewy, candy, wra... | BASIC_LATIN |
| 7 | [taffy, good, soft, chewy, flavor, amazing, wo... | BASIC_LATIN |
| 8 | [right, mostly, cat, eat, love, around] | BASIC_LATIN |
| 9 | [healthy, food, good, good, small, eats, amoun... | BASIC_LATIN |


In [17]:
df1['Text'][508]

['perhaps',
 'worst',
 'chip',
 'ever',
 'gone',
 'mouth',
 'entire',
 'life',
 'sour',
 'cream',
 'onion',
 'case',
 'chip',
 'favorite',
 'recently',
 'kettle',
 'brand',
 'honey',
 'dijon',
 'mustard',
 'took',
 'found',
 'sour',
 'cream',
 'onion',
 'try',
 'soon',
 'opened',
 'bag',
 'chip',
 'milk',
 'chip',
 'sour',
 'cream',
 'awful',
 'taste',
 'like',
 'rancid',
 'milk',
 'sour',
 'like',
 'sour',
 'cream',
 'rancid',
 'texture',
 'extremely',
 'chip',
 'would',
 'recommend',
 'chip',
 'anyone',
 'unless',
 'particular',
 'texture',
 'chip',
 'rancid',
 'onion',
 'flavor',
 'hard',
 'time',
 'person',
 'plan',
 'kettle',
 'seasoning',
 'otherwise',
 'wonderful',
 'kettle',
 'style',
 'chip']

In [18]:
create_all_ngrams(df1['Text'][508])

['perhaps',
 'worst',
 'chip',
 'ever',
 'gone',
 'mouth',
 'entire',
 'life',
 'sour',
 'cream',
 'onion',
 'case',
 'chip',
 'favorite',
 'recently',
 'kettle',
 'brand',
 'honey',
 'dijon',
 'mustard',
 'took',
 'found',
 'sour',
 'cream',
 'onion',
 'try',
 'soon',
 'opened',
 'bag',
 'chip',
 'milk',
 'chip',
 'sour',
 'cream',
 'awful',
 'taste',
 'like',
 'rancid',
 'milk',
 'sour',
 'like',
 'sour',
 'cream',
 'rancid',
 'texture',
 'extremely',
 'chip',
 'would',
 'recommend',
 'chip',
 'anyone',
 'unless',
 'particular',
 'texture',
 'chip',
 'rancid',
 'onion',
 'flavor',
 'hard',
 'time',
 'person',
 'plan',
 'kettle',
 'seasoning',
 'otherwise',
 'wonderful',
 'kettle',
 'style',
 'chip',
 'perhaps worst',
 'worst chip',
 'chip ever',
 'ever gone',
 'gone mouth',
 'mouth entire',
 'entire life',
 'life sour',
 'sour cream',
 'cream onion',
 'onion case',
 'case chip',
 'chip favorite',
 'favorite recently',
 'recently kettle',
 'kettle brand',
 'brand honey',
 'honey dijon

### SMS Text

In [19]:
vocab2 = filter_docs(df2)
len(vocab2)

346

#### Spell Check Dictionary

In [20]:
kw_map2 = create_kw_map(vocab2)
kw_map2

{'ive': 'give',
 'gonna': 'donna',
 'hav': 'had',
 'dont': 'dont',
 'oso': 'so',
 'dunno': 'funny',
 'isnt': 'isn',
 'youre': 'your',
 'ah': 'ah',
 'havent': 'haven',
 'something': 'something',
 'hey': 'hey',
 'boytoy': 'boston',
 'thanx': 'than',
 'txt': 'txt',
 'would': 'would',
 'haha': 'hata',
 'tscs': 'tss',
 'thats': 'that',
 'thk': 'the',
 'ure': 'are',
 'lar': 'war',
 'doesnt': 'doesn',
 'could': 'could',
 'didnt': 'didn',
 'gud': 'god',
 'everything': 'everything',
 'ard': 'and',
 'st': 'st',
 'pls': 'pus',
 'anything': 'anything',
 'wkly': 'wily',
 'anytime': 'daytime',
 'ppm': 'pp',
 'ringtone': 'ringtone',
 'jus': 'just',
 'since': 'since',
 'prob': 'probe',
 'lol': 'll',
 'luv': 'lui',
 'yup': 'up',
 'wanna': 'anna',
 'nokia': 'nikita'}

In [21]:
df2 = extract_words(df2, vocab2, kw_map2)
df2.head(10)

|   | Text | SMS | unicode |
|---|------|-----|---------|
| 0 | [go, available, great, world, got] | ham | BASIC_LATIN |
| 1 | [ok, lar] | ham | BASIC_LATIN |
| 2 | [free, entry, wkly, win, st, may, text, receiv... | spam | BASIC_LATIN |
| 3 | [dun, say, early, already, say] | ham | BASIC_LATIN |
| 4 | [dont, think, around, though] | ham | BASIC_LATIN |
| 5 | [hey, word, back, id, like, still, ok, xxx, send] | spam | BASIC_LATIN |
| 6 | [even, like, speak, like] | ham | BASIC_LATIN |
| 7 | [friend] | ham | BASIC_LATIN |
| 8 | [network, customer, selected, prize, claim, ca... | spam | BASIC_LATIN |
| 9 | [mobile, month, update, latest, camera, free, ... | spam | BASIC_LATIN |


In [22]:
df2['Text'][244]

['told', 'really', 'like', 'co', 'thanx', 'everything', 'done', 'today']

In [23]:
create_all_ngrams(df2['Text'][244])

['told',
 'really',
 'like',
 'co',
 'thanx',
 'everything',
 'done',
 'today',
 'told really',
 'really like',
 'like co',
 'co thanx',
 'thanx everything',
 'everything done',
 'done today',
 'told really like',
 'really like co',
 'like co thanx',
 'co thanx everything',
 'thanx everything done',
 'everything done today',
 'told really like co',
 'really like co thanx',
 'like co thanx everything',
 'co thanx everything done',
 'thanx everything done today',
 'told really like co thanx',
 'really like co thanx everything',
 'like co thanx everything done',
 'co thanx everything done today',
 'told really like co thanx everything',
 'really like co thanx everything done',
 'like co thanx everything done today',
 'told really like co thanx everything done',
 'really like co thanx everything done today']

With this pre-processing out of the way, the raw data is ready for feature engineering and word embedding for applications such as sentiment analysis.