# Text Preprocessing

In this notebook, we clean and tokenize the text of each review. Tokenization turns each review into a list of relevant tokens, and these token lists are collected into a corpus. Because restaurants dominate the Yelp review space, we build two corpora: one for restaurants and one for all other types of businesses.

## Importing modules

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time
import string
import unidecode

plt.style.use('fivethirtyeight')

# SQLAlchemy 1.4+ only accepts the 'postgresql://' scheme (the 'postgres://' alias was removed)
engine = create_engine('postgresql://postgres:moop@52.25.218.143:5432/capstone')

## Get data from server

### Query for Restaurants

In [6]:
query = '''SELECT reviews.text as text,
reviews.useful as useful,
business.categories as categories
FROM reviews
INNER JOIN business ON reviews.business_id = business.business_id
WHERE categories LIKE '%%Restaurants%%'
'''

restaurants = pd.read_sql_query(query, engine)
restaurants.shape

(3221418, 3)

### Query for Other Businesses

In [2]:
query = '''
SELECT 
reviews.text as text, 
reviews.useful as useful, 
reviews.funny as funny, 
reviews.cool as cool,
business.categories as categories
FROM reviews
INNER JOIN business ON reviews.business_id = business.business_id
WHERE categories NOT LIKE '%%Restaurants%%'
'''

businesses = pd.read_sql_query(query, engine)
businesses.shape

(2040250, 5)
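
One property of splitting the data with `LIKE` and `NOT LIKE` is worth keeping in mind: a row whose `categories` value is `NULL` matches neither condition, so it would appear in neither DataFrame. The sketch below illustrates this with an in-memory SQLite database standing in for the Postgres server. The three-row `business` table is hypothetical, and the patterns use a single `%` because no driver-level parameter formatting is involved here.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (business_id TEXT, categories TEXT)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?)",
    [("a", "Restaurants, Pizza"), ("b", "Florists"), ("c", None)],
)


def ids(where):
    """Return the business_ids matching a WHERE clause, in sorted order."""
    rows = conn.execute(
        f"SELECT business_id FROM business WHERE {where} ORDER BY business_id"
    )
    return [row[0] for row in rows]


restaurant_ids = ids("categories LIKE '%Restaurants%'")
other_ids = ids("categories NOT LIKE '%Restaurants%'")
# Explicitly including NULL categories in the "other" group:
other_ids_with_null = ids(
    "categories IS NULL OR categories NOT LIKE '%Restaurants%'"
)

print(restaurant_ids, other_ids, other_ids_with_null)
```

Whether this matters depends on whether `business.categories` contains any `NULL` values; if it does, adding `categories IS NULL OR ...` to the second query keeps those reviews in the non-restaurant set.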

## Preprocess the text

I use textacy's `preprocess_text` function to convert the text to lowercase and remove numbers, URLs, and punctuation, then save the result to a CSV file so that later steps can reload it instead of repeating this slow pass.
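
To make the cleaning steps concrete, here is a rough standard-library approximation of them. It is a sketch only, not textacy's implementation: the function name `simple_preprocess` and the regular expressions are assumptions made for illustration, and textacy may treat edge cases differently (for example, by substituting placeholder tokens rather than dropping numbers and URLs).

```python
import re
import string

# Illustrative patterns; textacy's own patterns are more thorough.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def simple_preprocess(text):
    """Lowercase text and drop URLs, numbers, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # URLs first, before punctuation is stripped
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())     # collapse runs of whitespace


cleaned = simple_preprocess("Tidy's Flowers had 5 stars! See http://example.com")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it is what turns a possessive such as "Tidy's" into the two tokens "tidy" and "s".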

### Restaurants

In [9]:
restaurants.drop('categories', axis=1, inplace=True)

In [10]:
%%time
restaurants['processed'] = restaurants['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True, no_punct=True, no_numbers=True))

CPU times: user 24min 49s, sys: 0 ns, total: 24min 49s
Wall time: 24min 49s


In [12]:
restaurants.drop('text', axis=1, inplace=True)

In [14]:
restaurants.to_csv('../data/rests.csv', index=False)

### Businesses

In [4]:
businesses.drop('categories', axis=1, inplace=True)

In [5]:
%%time
businesses['processed'] = businesses['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True, no_punct=True, no_numbers=True))

CPU times: user 16min 38s, sys: 0 ns, total: 16min 38s
Wall time: 16min 38s


In [6]:
businesses.drop('text', axis=1, inplace=True)

In [8]:
businesses.to_csv('../data/businesses.csv', index=False)

## Removing non-English reviews

Since Yelp is used worldwide, some reviews are written in languages other than English, and those may not be tokenized or lemmatized correctly by an English model. I define the function `isEnglish` as a rough filter: it returns `False` for any text containing non-ASCII characters, on the assumption that such characters mostly appear in reviews written in other languages.

In [15]:
def isEnglish(s):
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    else:
        return True
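
An ASCII check is only a proxy for language, and a few examples show where it diverges. The sketch below uses `str.isascii()` (Python 3.7+), which gives the same answer as the encode-based test above; the helper name `is_ascii_only` and the sample strings are made up for illustration.

```python
def is_ascii_only(s):
    """Return True if every character in s is ASCII."""
    return s.isascii()


samples = [
    "great place to bring dogs",   # English, ASCII only
    "the café was lovely",         # English, but contains an accented character
    "tres bon restaurant",         # French written without accents
    "とても美味しい",                # Japanese
]

flags = {s: is_ascii_only(s) for s in samples}
for text, keep in flags.items():
    print(keep, text)
```

An English review containing "café" would be dropped, while an unaccented French review would be kept, so the filter trades some precision for speed and simplicity.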

### Restaurants

In [11]:
restaurants = pd.read_csv('../data/rests.csv')

In [13]:
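# errors='ignore' avoids a KeyError when the CSV was written with index=False
restaurants.drop('Unnamed: 0', axis=1, inplace=True, errors='ignore')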

We create the boolean column `isEnglish` and then keep only the rows where it is `True`:

In [14]:
restaurants['isEnglish'] = restaurants['processed'].astype(str).apply(isEnglish)

In [15]:
restaurants = restaurants[restaurants['isEnglish']]

In [26]:
restaurants.to_csv('../data/rests_english.csv', index=False)

### Businesses

In [12]:
businesses = pd.read_csv('../data/businesses.csv')

In [13]:
businesses.head()

Unnamed: 0,useful,funny,cool,processed
0,1,0,0,cycle pub las vegas was a blast got a groupon ...
1,9,0,1,i thought tidy s flowers had a great reputatio...
2,0,0,0,i too have been trying to book an appt to use ...
3,0,0,0,great place to bring dogs it s really a dog pl...
4,0,0,0,decided to give this place a try since i was i...


In [16]:
businesses['isEnglish'] = businesses['processed'].astype(str).apply(isEnglish)

In [18]:
businesses = businesses[businesses['isEnglish']]

In [20]:
businesses.to_csv('../data/businesses_english.csv', index=False)

## Tokenizing

In this section I set up a spaCy pipeline for tokenizing. We disable the part-of-speech tagger, dependency parser, named-entity recognizer, and text categorizer to reduce memory usage and processing time. We also define a filter function that drops stopwords and short tokens (4 characters or fewer) from the tokenized documents. The resulting token lists are collected in a list that we can pass to a vectorizer (see Notebook 3).

In [8]:
nlp = textacy.load_spacy("en_core_web_sm", disable = ("tagger", "parser", "ner", "textcat"))

In [9]:
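def token_filter(token):
    # Keep a token only if it is not a stopword and is longer than 4 characters.
    # `or` is used instead of `|` because `|` binds tighter than `<=`.
    return not (token.is_stop or len(token.text) <= 4)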
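
When a boolean flag is combined with a length comparison in a filter like `token_filter`, operator precedence matters: the bitwise `|` binds more tightly than `<=`, so `a | b <= 4` is evaluated as `(a | b) <= 4`. The sketch below compares the two forms on plain Python values; the helper names and the sample words are hypothetical and stand in for spaCy tokens.

```python
def keep_bitwise(is_stop, text):
    # Parsed as: not ((is_stop | len(text)) <= 4)
    return not (is_stop | len(text) <= 4)


def keep_logical(is_stop, text):
    # Parsed as: not (is_stop or (len(text) <= 4))
    return not (is_stop or len(text) <= 4)


# (is_stop, text) pairs covering stopwords and non-stopwords of both lengths.
cases = [(True, "their"), (True, "with"), (False, "food"), (False, "pizza")]
for is_stop, text in cases:
    print(text, is_stop, keep_bitwise(is_stop, text), keep_logical(is_stop, text))
```

With the bitwise form, the stopwords "their" and "with" are kept, because `True | 5` and `True | 4` both evaluate to `5`, which is not `<= 4`. The logical form drops every stopword and every token of 4 characters or fewer.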

### Restaurants

In [2]:
restaurants = pd.read_csv('../data/rests_english.csv')

In [5]:
docs = restaurants['processed'].astype(str).tolist()  # spaCy expects each document as a Python str

In [10]:
filtered_tokens = []
start = time.time()
# enumerate(..., start=1) makes i the number of documents processed so far
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        tokens = [token.lemma_ for token in doc if token_filter(token)]
    except Exception as exc:
        print(f'Document {i} could not be tokenized: {exc}')
        tokens = []  # keep filtered_tokens aligned with docs
    filtered_tokens.append(tokens)
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')

Tokenized 10000 documents in 0.5264656027158101 minutes
Tokenized 20000 documents in 0.9942031065622966 minutes
Tokenized 30000 documents in 1.4545220096906026 minutes
Tokenized 40000 documents in 1.927197806040446 minutes
Tokenized 50000 documents in 2.3963774919509886 minutes
Tokenized 60000 documents in 2.88506498336792 minutes
Tokenized 70000 documents in 3.4080612897872924 minutes
Tokenized 80000 documents in 3.921268590291341 minutes
Tokenized 90000 documents in 4.4100446422894795 minutes
Tokenized 100000 documents in 4.8793296098709105 minutes
Tokenized 110000 documents in 5.344931999842326 minutes
Tokenized 120000 documents in 5.839535077412923 minutes
Tokenized 130000 documents in 6.326991760730744 minutes
Tokenized 140000 documents in 6.799816346168518 minutes
Tokenized 150000 documents in 7.271938117345174 minutes
Tokenized 160000 documents in 7.764986379941305 minutes
Tokenized 170000 documents in 8.238894959290823 minutes
Tokenized 180000 documents in 8.718480745951334 minutes

Tokenized 1460000 documents in 70.93188039461772 minutes
Tokenized 1470000 documents in 71.43803168535233 minutes
Tokenized 1480000 documents in 71.91862063010534 minutes
Tokenized 1490000 documents in 72.42247564792633 minutes
Tokenized 1500000 documents in 72.90267002185186 minutes
Tokenized 1510000 documents in 73.37509539524714 minutes
Tokenized 1520000 documents in 73.86426709493001 minutes
Tokenized 1530000 documents in 74.34241573015849 minutes
Tokenized 1540000 documents in 74.85306224822997 minutes
Tokenized 1550000 documents in 75.33233666817347 minutes
Tokenized 1560000 documents in 75.80130147536596 minutes
Tokenized 1570000 documents in 76.2842357993126 minutes
Tokenized 1580000 documents in 76.75883768002193 minutes
Tokenized 1590000 documents in 77.24144959847132 minutes
Tokenized 1600000 documents in 77.73604729175568 minutes
Tokenized 1610000 documents in 78.20423442522684 minutes
Tokenized 1620000 documents in 78.64850689172745 minutes
Tokenized 1630000 documents in 7

Tokenized 2890000 documents in 139.53636140426 minutes
Tokenized 2900000 documents in 140.015194050471 minutes
Tokenized 2910000 documents in 140.49456522067388 minutes
Tokenized 2920000 documents in 140.97068164348602 minutes
Tokenized 2930000 documents in 141.4460195660591 minutes
Tokenized 2940000 documents in 141.9182750582695 minutes
Tokenized 2950000 documents in 142.4004859606425 minutes
Tokenized 2960000 documents in 142.87431532144547 minutes
Tokenized 2970000 documents in 143.35957545836766 minutes
Tokenized 2980000 documents in 143.8504452228546 minutes
Tokenized 2990000 documents in 144.32347902059556 minutes
Tokenized 3000000 documents in 144.79202582041424 minutes
Tokenized 3010000 documents in 145.2582317153613 minutes
Tokenized 3020000 documents in 145.79376603364943 minutes
Tokenized 3030000 documents in 146.2923243880272 minutes
Tokenized 3040000 documents in 146.766107237339 minutes
Tokenized 3050000 documents in 147.23065647681554 minutes
Tokenized 3060000 documents
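
The progress reporting in the loop above can also be factored into a small reusable generator, so the same logic serves both corpora. This is a sketch under assumptions: `with_progress` is a hypothetical helper, not part of spaCy or textacy, and it wraps any iterable (such as the result of `nlp.pipe`).

```python
import time


def with_progress(items, every=10000, label="Tokenized"):
    """Yield items unchanged, printing elapsed minutes every `every` items."""
    start = time.time()
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            minutes = (time.time() - start) / 60
            print(f"{label} {count} documents in {minutes:.2f} minutes")


# Toy usage with integers standing in for spaCy documents.
seen = list(with_progress(range(5), every=2))
```

Formatting the elapsed time with `:.2f` keeps the log lines short; the counter starts at 1 so each message reports the number of documents actually processed.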

Save the tokenized restaurant reviews to disk:

In [11]:
with open('../data/tokenized.pkl', 'wb') as f:
    pickle.dump(filtered_tokens, f)
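
For reference, reading the tokens back in a later notebook is the mirror image of the dump. The sketch below round-trips a small, made-up list of token lists through a temporary file; the file name and the data are hypothetical.

```python
import os
import pickle
import tempfile

tokens = [["pizza", "crust"], ["friendly", "staff"]]  # hypothetical token lists

# Write to a throwaway location rather than ../data/.
path = os.path.join(tempfile.mkdtemp(), "tokenized_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(tokens, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == tokens)
```

Because `pickle.load` can execute arbitrary code, it should only be used on files you created yourself or otherwise trust.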

### Businesses

In [23]:
docs = businesses['processed'].astype(str).tolist()  # spaCy expects each document as a Python str

In [24]:
filtered_tokens_bs = []
start = time.time()
# enumerate(..., start=1) makes i the number of documents processed so far
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    try:
        tokens = [token.lemma_ for token in doc if token_filter(token)]
    except Exception as exc:
        print(f'Document {i} could not be tokenized: {exc}')
        tokens = []  # keep filtered_tokens_bs aligned with docs
    filtered_tokens_bs.append(tokens)
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')

Tokenized 10000 documents in 0.5435734550158183 minutes
Tokenized 20000 documents in 1.0860312422116598 minutes
Tokenized 30000 documents in 1.6481721639633178 minutes
Tokenized 40000 documents in 2.205586814880371 minutes
Tokenized 50000 documents in 2.7536702275276186 minutes
Tokenized 60000 documents in 3.290149998664856 minutes
Tokenized 70000 documents in 3.8397332151730854 minutes
Tokenized 80000 documents in 4.389596164226532 minutes
Tokenized 90000 documents in 4.932957581679026 minutes
Tokenized 100000 documents in 5.466143671671549 minutes
Tokenized 110000 documents in 6.000996728738149 minutes
Tokenized 120000 documents in 6.546352938810984 minutes
Tokenized 130000 documents in 7.077045182387034 minutes
Tokenized 140000 documents in 7.6094328244527185 minutes
Tokenized 150000 documents in 8.157556692759195 minutes
Tokenized 160000 documents in 8.699893486499786 minutes
Tokenized 170000 documents in 9.256111359596252 minutes
Tokenized 180000 documents in 9.816303678353627 minutes

Tokenized 1460000 documents in 77.79469596544901 minutes
Tokenized 1470000 documents in 78.3345144867897 minutes
Tokenized 1480000 documents in 78.86865790287654 minutes
Tokenized 1490000 documents in 79.43154362440109 minutes
Tokenized 1500000 documents in 79.97490022977193 minutes
Tokenized 1510000 documents in 80.51221429109573 minutes
Tokenized 1520000 documents in 81.03863857189815 minutes
Tokenized 1530000 documents in 81.5610393246015 minutes
Tokenized 1540000 documents in 82.09267423550288 minutes
Tokenized 1550000 documents in 82.64216370979945 minutes
Tokenized 1560000 documents in 83.17951014041901 minutes
Tokenized 1570000 documents in 83.70984456539153 minutes
Tokenized 1580000 documents in 84.27566196918488 minutes
Tokenized 1590000 documents in 84.79718762636185 minutes
Tokenized 1600000 documents in 85.31929559310278 minutes
Tokenized 1610000 documents in 85.83360180854797 minutes
Tokenized 1620000 documents in 86.34297099113465 minutes
Tokenized 1630000 documents in 86

In [25]:
with open('../data/tokenized_bs.pkl', 'wb') as f:
    pickle.dump(filtered_tokens_bs, f)