# Text Preprocessing and Tokenizing

In this notebook, I process and tokenize each review text. In the preprocessing stage, the text is converted to lowercase, punctuation is removed, and numbers and URLs are replaced with placeholder strings. In the tokenizing stage, each review is converted into a document whose tokens are the lemmas of single words. These documents are then compiled into a corpus.

Since restaurants dominate the Yelp review space, restaurant reviews and reviews for other businesses will be assessed separately.

## Importing modules

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time

## Load data
### Restaurants

In [3]:
rests = pd.read_csv('../data/restaurants.csv', compression='gzip', usecols=['text'])

In [4]:
rests.shape

(3055990, 1)

### Other Businesses

In [3]:
businesses = pd.read_csv('../data/businesses.csv', compression='gzip', usecols=['text'])

In [4]:
businesses.shape

(1968973, 1)

## Preprocess the text

Using textacy's `preprocess_text` function, I convert all review text to lowercase, strip punctuation, and replace numbers and URLs with placeholder strings: the textacy preprocessor converts numbers to the string 'numb' and URLs to the string 'URL'. I chose to collapse numbers into a single token because time and price descriptions are likely to be very common in reviews. Rather than having individual tokens for '5 minutes' and '10 dollars', documents processed in this way contain a 'numb' token wherever a numeric term occurs.
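
To make the effect of these options concrete, here is a minimal sketch of the same kind of cleaning using only the standard library. It is a simplified stand-in, not textacy's implementation: the function name `simple_preprocess`, the regular expressions, and the lowercase placeholders `numb` and `url` are all assumptions chosen for illustration.

```python
import re
import string

# Simplified patterns (assumptions for illustration, not textacy's own regexes).
URL_RE = re.compile(r'https?://\S+|www\.\S+')
NUMBER_RE = re.compile(r'\d+(?:[.,]\d+)*')
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: ' ' for ch in string.punctuation})


def simple_preprocess(text):
    """Replace URLs and numbers with placeholders, drop punctuation, lowercase."""
    text = URL_RE.sub(' url ', text)      # URLs first, so digits inside them are not touched
    text = NUMBER_RE.sub(' numb ', text)  # every numeric term becomes one shared token
    text = text.translate(PUNCT_TABLE)    # punctuation becomes whitespace
    text = text.lower()
    return ' '.join(text.split())         # collapse runs of whitespace


example = 'Waited 5 minutes and paid $10.50! See http://example.com/menu'
print(simple_preprocess(example))
# waited numb minutes and paid numb see url
```

Both '5' and '10.50' end up as the same `numb` token, which is the behavior the paragraph above relies on.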

### Restaurants

In [5]:
%%time
rests['processed'] = rests['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, 
                                                                                    no_urls=True, 
                                                                                    no_punct=True, 
                                                                                    no_numbers=True))

CPU times: user 23min 16s, sys: 1.88 s, total: 23min 18s
Wall time: 23min 18s


### Businesses

In [5]:
%%time
businesses['processed'] = businesses['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, 
                                                                                              no_urls=True,
                                                                                              no_punct=True, 
                                                                                              no_numbers=True))

CPU times: user 15min 17s, sys: 1.24 s, total: 15min 18s
Wall time: 15min 18s


## Checking for non-English reviews

Since Yelp is used worldwide, I checked for reviews containing non-English characters, as non-English words or statements may not be tokenized or processed correctly. I define the function `isEnglish` to flag reviews that contain non-ASCII characters, since these characters appear predominantly in reviews written in languages other than English.

### Define `isEnglish` function

Strings that contain non-ASCII characters raise a `UnicodeEncodeError` when encoded as ASCII.

In [6]:
def isEnglish(s):
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True
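
As a quick illustration of how this check behaves, the sketch below repeats the function and applies it to a few made-up strings (they are not taken from the dataset). Note that the test is really "is this string pure ASCII": an English sentence containing an accented word such as 'café' is flagged as well. On Python 3.7 or later, `str.isascii()` should give the same answer.

```python
def isEnglish(s):
    # Same logic as above: True only if every character is ASCII.
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True


# Hypothetical sample strings, for illustration only.
samples = [
    'great tacos and fast service',
    'très bon restaurant',
    'the café was lovely',
    '美味しい',
]

results = {s: isEnglish(s) for s in samples}
for s, flag in results.items():
    print(f'{flag!s:5}  {s}')

# Cross-check against the built-in (Python 3.7+).
matches_builtin = all(isEnglish(s) == s.isascii() for s in samples)
```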

### Restaurants

In [7]:
rests['isEnglish'] = rests['processed'].astype(str).apply(isEnglish)

In [8]:
rests['isEnglish'].value_counts()

True     2984421
False      71569
Name: isEnglish, dtype: int64

In [9]:
rests = rests[rests['isEnglish']]

I save the index of the remaining (English) reviews as an array to retain for later use.

In [10]:
np.save('../data/rests_eng_index.npy', rests.index)

### Businesses

In [7]:
businesses['isEnglish'] = businesses['processed'].astype(str).apply(isEnglish)

In [8]:
businesses['isEnglish'].value_counts()

True     1952542
False      16431
Name: isEnglish, dtype: int64

In [9]:
# Keep only the English reviews, as for restaurants, then save their index.
businesses = businesses[businesses['isEnglish']]
np.save('../data/bus_eng_index.npy', businesses.index)

## Tokenizing

In this section I set up a spaCy tokenizer. I disable part-of-speech tagging, dependency parsing, named entity recognition, and text categorization to reduce overall memory usage, and retain the lemma of each token. I also create a filter function to eliminate stopwords and short tokens (4 characters or fewer). The tokenized documents are then added to a list, which can be passed through a vectorizer (see Notebook 3).

**The tokenizing loops below have had their output cleared to improve this notebook's readability.**

In [31]:
nlp = textacy.load_spacy("en_core_web_sm", disable=("tagger", "parser", "ner", "textcat"))

In [32]:
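def token_filter(token):
    # Keep a token only if it is not a stopword and is longer than 4 characters.
    # `or` is needed here: `|` binds more tightly than `<=`, which breaks the test.
    return not (token.is_stop or len(token.text) <= 4)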
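
A note on operator precedence in this filter: in Python, `|` binds more tightly than `<=`, so an expression of the form `a | b <= 4` is evaluated as `(a | b) <= 4`. The sketch below compares a bitwise and a logical version of the filter. It uses a hypothetical `FakeToken` stand-in rather than real spaCy tokens, so it only illustrates the boolean logic.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, carrying only the two attributes the filter reads.
FakeToken = namedtuple('FakeToken', ['text', 'is_stop'])


def filter_bitwise(token):
    # Parsed as (token.is_stop | len(token.text)) <= 4
    return not (token.is_stop | len(token.text) <= 4)


def filter_logical(token):
    # Drops stopwords and tokens of 4 characters or fewer
    return not (token.is_stop or len(token.text) <= 4)


tokens = [
    FakeToken('because', True),     # long stopword
    FakeToken('food', False),       # short non-stopword
    FakeToken('delicious', False),  # long non-stopword
]

kept_bitwise = [t.text for t in tokens if filter_bitwise(t)]
kept_logical = [t.text for t in tokens if filter_logical(t)]

print(kept_bitwise)  # ['because', 'delicious']
print(kept_logical)  # ['delicious']
```

With the bitwise form, a stopword longer than 4 characters slips through, because `True | 7` evaluates to `7`, which is not `<= 4`.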

### Restaurants

In [34]:
docs = rests['processed'].astype('str').astype('unicode').tolist()

In [None]:
filtered_tokens = []
start = time.time()
# Count from 1 so that i is the number of documents processed so far.
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        tokens = [token.lemma_ for token in doc if token_filter(token)]
        filtered_tokens.append(tokens)
    except Exception:
        print(f'Document {i} has an encoding error or contains invalid characters.')
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time() - start) / 60:.1f} minutes')

The above cell takes around 3 hours to tokenize approximately 3 million reviews and append them to the list `filtered_tokens`.

Save the tokenized restaurant reviews to disk:

In [36]:
with open('../data/tokenized_rest_reviews.pkl', 'wb') as f:
    pickle.dump(filtered_tokens, f)
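
For reference, the saved list can be read back with `pickle.load`. The sketch below is a round-trip check on toy data in a temporary directory, not on the real file; the file name `toy_tokens.pkl` and the token lists are made up for illustration.

```python
import os
import pickle
import tempfile

# Toy stand-in for the tokenized reviews: one list of lemmas per review.
toy_tokens = [['delicious', 'burger'], ['friendly', 'service', 'quick'], []]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_tokens.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(toy_tokens, f)
    with open(path, 'rb') as f:
        loaded_tokens = pickle.load(f)

# The nested list structure and the document order survive the round trip.
round_trip_ok = loaded_tokens == toy_tokens
print(round_trip_ok, len(loaded_tokens))
```

Because order is preserved, the position of each document in the loaded list still corresponds to the position of its review in the saved index array, provided no documents were skipped during tokenizing.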

### Businesses

In [37]:
docs = businesses['processed'].astype('str').astype('unicode').tolist()

In [None]:
filtered_tokens_bs = []
start = time.time()
# Count from 1 so that i is the number of documents processed so far.
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        tokens = [token.lemma_ for token in doc if token_filter(token)]
        filtered_tokens_bs.append(tokens)
    except Exception:
        print(f'Document {i} has an encoding error or contains invalid characters.')
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time() - start) / 60:.1f} minutes')

The above cell takes around 2 hours to tokenize approximately 2 million reviews. 

Save the tokenized business reviews to disk:

In [None]:
with open('../data/tokenized_bs_reviews.pkl', 'wb') as f:
    pickle.dump(filtered_tokens_bs, f)