# Text Preprocessing and Tokenizing

In this notebook, we preprocess and tokenize each review. In the preprocessing stage, the text is converted to lowercase and stripped of URLs, numbers, and punctuation. In the tokenizing stage, each review is converted into a document whose tokens are the lemmas of single words. These documents are then compiled into a corpus.

Since restaurants dominate the Yelp review space, two corpora are constructed: one for restaurants and one for all non-restaurant businesses.

## Importing modules

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time
import string
import unidecode

plt.style.use('fivethirtyeight')

%run .engine.py
engine = create_engine(LOGIN)

## Load data
### Restaurants

In [3]:
rests = pd.read_csv('../data/restaurants.csv', compression='gzip', usecols=['text'])

In [4]:
rests.shape

(3055990, 1)

### Other businesses

In [3]:
businesses = pd.read_csv('../data/businesses.csv', compression='gzip', usecols=['text'])

In [4]:
businesses.shape

(1968973, 1)

## Preprocess the text

I use textacy's `preprocess_text` function to convert all the text to lowercase and remove numbers, URLs, and punctuation.
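
To make the effect of these options concrete, here is a rough standard-library approximation of the same cleaning steps. It is not textacy's implementation: `simple_preprocess` and its regular expressions are hypothetical stand-ins, and textacy's actual output may differ (for example, in how it handles URLs and numbers).

```python
import re
import string

# Hypothetical patterns, for illustration only (not taken from textacy)
URL_RE = re.compile(r'https?://\S+|www\.\S+')
NUMBER_RE = re.compile(r'\d+(?:[.,]\d+)*')
# Map every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def simple_preprocess(text):
    """Lowercase text and strip URLs, numbers, and ASCII punctuation."""
    text = URL_RE.sub(' ', text)       # remove URLs first, while "://" is still intact
    text = NUMBER_RE.sub(' ', text)    # remove digit runs
    text = text.translate(PUNCT_TABLE) # replace punctuation with spaces
    return ' '.join(text.lower().split())  # lowercase and collapse whitespace

cleaned = simple_preprocess('Loved it!! 10/10, see https://example.com for the menu.')
print(cleaned)  # loved it see for the menu
```

URLs are removed before punctuation so that the URL pattern can still match on `://`.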

### Restaurants

In [15]:
%%time
rests['processed'] = rests['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True, 
                                                                                    no_punct=True, no_numbers=True))

CPU times: user 23min 56s, sys: 1.62 s, total: 23min 57s
Wall time: 23min 57s


### Businesses

In [5]:
%%time
businesses['processed'] = businesses['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True,
                                                                                              no_punct=True, no_numbers=True))

CPU times: user 15min 17s, sys: 1.24 s, total: 15min 18s
Wall time: 15min 18s


## Checking for non-English reviews

Since Yelp is used worldwide, we should check whether any reviews were written in languages other than English, as non-English words may not be tokenized or lemmatized correctly. I define the function `isEnglish`, which returns `False` for any review containing non-ASCII characters, since such characters appear predominantly in reviews written in other languages. This is a heuristic that flags whole reviews; it does not remove individual characters.

### Define `isEnglish` function

In [6]:
def isEnglish(s):
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True
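
A few sample strings show how this check behaves. The strings are made up for illustration. Note the two edge cases: English text containing an accented letter is flagged as non-English, and a non-English review written entirely in ASCII passes.

```python
def isEnglish(s):
    """Return True if the string contains only ASCII characters."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True

samples = [
    'great tacos and friendly staff',  # plain ASCII English
    'très bon restaurant',             # French, contains a non-ASCII letter
    'the café was lovely',             # English, but "é" is non-ASCII
    'das essen war sehr gut',          # German, but entirely ASCII
]
results = [isEnglish(s) for s in samples]
print(results)  # [True, False, False, True]
```

On Python 3.7 and later, `s.isascii()` performs the same check without the `try`/`except`.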

### Restaurants

In [17]:
rests['isEnglish'] = rests['processed'].astype(str).apply(isEnglish)

In [18]:
rests['isEnglish'].value_counts()

True     2984421
False      71569
Name: isEnglish, dtype: int64

In [19]:
rests_english = rests[rests['isEnglish']]

In [22]:
np.save('../data/rests_eng_index.npy', rests_english.index)

### Businesses

In [7]:
businesses['isEnglish'] = businesses['processed'].astype(str).apply(isEnglish)

In [8]:
businesses['isEnglish'].value_counts()

True     1952542
False      16431
Name: isEnglish, dtype: int64

In [9]:
np.save('../data/bus_eng_index.npy', businesses.index[businesses['isEnglish']])
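
The saved index arrays can later be reloaded to restrict other per-review data to the reviews flagged as English. The sketch below shows one way this might look. It assumes the DataFrame keeps the default integer index from `read_csv`, so index labels coincide with row positions; the temporary path and sample data are hypothetical.

```python
import os
import tempfile

import numpy as np

# Made-up flags standing in for the `isEnglish` column
is_english = np.array([True, False, True, True])

# Positions of the English reviews (equivalent to the filtered index
# when the DataFrame has a default 0..n-1 index)
eng_index = np.flatnonzero(is_english)

# Hypothetical temporary file used instead of ../data/...
path = os.path.join(tempfile.mkdtemp(), 'eng_index.npy')
np.save(path, eng_index)

# Later: reload the index and select the matching documents
restored = np.load(path)
tokenized = [['great', 'tacos'], ['tres', 'bien'], ['friendly', 'staff'], ['slow', 'service']]
english_only = [tokenized[i] for i in restored]
print(english_only)
```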

## Tokenizing

In this section I set up a spaCy tokenizer. We disable part-of-speech tagging, dependency parsing, named-entity recognition, and text categorization to reduce overall memory usage, and retain the lemma of each token. We also create a filter function to eliminate stopwords and short tokens (4 characters or fewer). The tokenized documents are then added to a list which we can pass through a vectorizer (see Notebook 3).

In [31]:
nlp = textacy.load_spacy("en_core_web_sm", disable=("tagger", "parser", "ner", "textcat"))

In [32]:
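def token_filter(token):
    """Return True for tokens to keep: drop stopwords and tokens of 4 characters or fewer."""
    # Use `or`, not `|`: `|` binds tighter than `<=`, so `a | b <= 4` means `(a | b) <= 4`
    return not (token.is_stop or len(token.text) <= 4)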
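
When combining conditions in a filter like this, operator precedence matters: in Python, `|` binds more tightly than `<=`, so `a | b <= 4` is evaluated as `(a | b) <= 4`. The sketch below uses a hypothetical `FakeToken` stand-in (not a spaCy class) and made-up tokens to compare the bitwise form with an explicit `or`.

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token with only the fields used here."""
    text: str
    is_stop: bool

def keep_bitwise(token):
    # Parsed as (token.is_stop | len(token.text)) <= 4
    return not (token.is_stop | len(token.text) <= 4)

def keep_boolean(token):
    # Drop stopwords and tokens of 4 characters or fewer
    return not (token.is_stop or len(token.text) <= 4)

tokens = [
    FakeToken('with', True),        # 4-character stopword
    FakeToken('because', True),     # long stopword
    FakeToken('food', False),       # short non-stopword
    FakeToken('delicious', False),  # long non-stopword
]
kept_bitwise = [t.text for t in tokens if keep_bitwise(t)]
kept_boolean = [t.text for t in tokens if keep_boolean(t)]
print(kept_bitwise)  # ['with', 'because', 'delicious']
print(kept_boolean)  # ['delicious']
```

With the bitwise form, stopwords of four or more characters slip through, because `True | 4` is `5` and `5 <= 4` is false.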

### Restaurants

In [34]:
docs = rests['processed'].astype(str).tolist()

In [35]:
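filtered_tokens = []
start = time.time()
# Count documents from 1 so the progress message reports the true number processed
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        # Keep the lemma of every token that passes the filter
        tokens = [token.lemma_ for token in doc if token_filter(token)]
        filtered_tokens.append(tokens)
    except Exception as err:  # report the failing document; do not swallow KeyboardInterrupt
        print(f'Document {i} could not be tokenized: {err}')
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')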

Tokenized 10000 documents in 0.5251728336016337 minutes
Tokenized 20000 documents in 1.0496850768725077 minutes
Tokenized 30000 documents in 1.579992691675822 minutes
Tokenized 40000 documents in 2.0894193092981976 minutes
Tokenized 50000 documents in 2.5953927556673686 minutes
Tokenized 60000 documents in 3.1252748092015583 minutes
Tokenized 70000 documents in 3.643654414017995 minutes
Tokenized 80000 documents in 4.167919055620829 minutes
Tokenized 90000 documents in 4.67060960928599 minutes
Tokenized 100000 documents in 5.168066438039144 minutes
Tokenized 110000 documents in 5.657072214285533 minutes
Tokenized 120000 documents in 6.167018107573191 minutes
Tokenized 130000 documents in 6.684710649649302 minutes
Tokenized 140000 documents in 7.20735391775767 minutes
Tokenized 150000 documents in 7.725206498305003 minutes
Tokenized 160000 documents in 8.251602478822072 minutes
...
Tokenized 210000 documents in 10.8531729499499 minutes
Tokenized 220000 documents in 11.353967575232188 minutes
...

Tokenized 1500000 documents in 73.7062535405159 minutes
Tokenized 1510000 documents in 74.20094054937363 minutes
Tokenized 1520000 documents in 74.70778512557348 minutes
Tokenized 1530000 documents in 75.20267874002457 minutes
Tokenized 1540000 documents in 75.68102876345317 minutes
Tokenized 1550000 documents in 76.16191310882569 minutes
Tokenized 1560000 documents in 76.6355654199918 minutes
Tokenized 1570000 documents in 77.11772193908692 minutes
Tokenized 1580000 documents in 77.5879980802536 minutes
Tokenized 1590000 documents in 78.06524838209153 minutes
Tokenized 1600000 documents in 78.55346725781759 minutes
Tokenized 1610000 documents in 79.03019727071127 minutes
Tokenized 1620000 documents in 79.50208503007889 minutes
Tokenized 1630000 documents in 79.97397522131602 minutes
Tokenized 1640000 documents in 80.44068801403046 minutes
Tokenized 1650000 documents in 80.8973538160324 minutes
Tokenized 1660000 documents in 81.35792133808135 minutes
...

Tokenized 2930000 documents in 142.78816661039988 minutes
Tokenized 2940000 documents in 143.28276998202006 minutes
Tokenized 2950000 documents in 143.75757474104563 minutes
Tokenized 2960000 documents in 144.24034826358158 minutes
Tokenized 2970000 documents in 144.7172559817632 minutes
Tokenized 2980000 documents in 145.19437002340953 minutes


Save the tokenized restaurant reviews to disk:

In [36]:
with open('../data/tokenized_rest_reviews.pkl', 'wb') as f:
    pickle.dump(filtered_tokens, f)
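
Before relying on the saved file in a later notebook, it can be worth confirming that it loads back unchanged. This is a minimal sketch of such a round trip; the sample tokens and the temporary path are hypothetical stand-ins for `filtered_tokens` and the file under `../data/`.

```python
import os
import pickle
import tempfile

# Made-up tokens standing in for `filtered_tokens`
sample_tokens = [['great', 'tacos'], ['friendly', 'staff'], []]

# Hypothetical temporary path used instead of ../data/...
path = os.path.join(tempfile.mkdtemp(), 'tokenized_sample.pkl')
with open(path, 'wb') as f:
    pickle.dump(sample_tokens, f)

# Reload and compare against the in-memory list
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == sample_tokens)  # True
```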

### Businesses

In [37]:
docs = businesses['processed'].astype(str).tolist()

In [None]:
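filtered_tokens_bs = []
start = time.time()
# Count documents from 1 so the progress message reports the true number processed
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        # Keep the lemma of every token that passes the filter
        tokens = [token.lemma_ for token in doc if token_filter(token)]
        filtered_tokens_bs.append(tokens)
    except Exception as err:  # report the failing document; do not swallow KeyboardInterrupt
        print(f'Document {i} could not be tokenized: {err}')
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')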

Tokenized 10000 documents in 0.5455632209777832 minutes
Tokenized 20000 documents in 1.1006454229354858 minutes
Tokenized 30000 documents in 1.7564432183901468 minutes
Tokenized 40000 documents in 2.3085965077082315 minutes
Tokenized 50000 documents in 2.8641520818074544 minutes
Tokenized 60000 documents in 3.405859084924062 minutes
Tokenized 70000 documents in 3.9494373003641763 minutes
Tokenized 80000 documents in 4.483182958761851 minutes
Tokenized 90000 documents in 5.038688929875692 minutes
Tokenized 100000 documents in 5.60417918364207 minutes
Tokenized 110000 documents in 6.160481135050456 minutes
Tokenized 120000 documents in 6.709739796320597 minutes
Tokenized 130000 documents in 7.259782950083415 minutes
Tokenized 140000 documents in 7.803244296709696 minutes
Tokenized 150000 documents in 8.351007922490437 minutes
Tokenized 160000 documents in 8.897802464167277 minutes
Tokenized 170000 documents in 9.447558689117432 minutes
Tokenized 180000 documents in 9.995304199059804 minutes
...

In [None]:
with open('../data/tokenized_bs_reviews.pkl', 'wb') as f:
    pickle.dump(filtered_tokens_bs, f)