# Text Preprocessing

In this notebook, we clean and tokenize the text of each review. Tokenization turns each review into a list of relevant tokens, and these lists are collected into a corpus. Because restaurants dominate the Yelp review space, we build two corpora: one for restaurants and one for all other types of businesses.

## Importing modules

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import textacy
import pickle
import time

plt.style.use('fivethirtyeight')

# SQLAlchemy 1.4+ only accepts the 'postgresql://' scheme; the 'postgres://' alias was removed
engine = create_engine('postgresql://postgres:moop@52.25.218.143:5432/capstone')



## Get data from server

### Query for Restaurants

In [6]:
query = '''SELECT reviews.text as text,
reviews.useful as useful,
business.categories as categories
FROM reviews
INNER JOIN business ON reviews.business_id = business.business_id
WHERE categories LIKE '%%Restaurants%%'
'''

restaurants = pd.read_sql_query(query, engine)
restaurants.shape

(3221418, 3)

### Query for Other Businesses

In [2]:
query = '''
SELECT 
reviews.text as text, 
reviews.useful as useful, 
reviews.funny as funny, 
reviews.cool as cool,
business.categories as categories
FROM reviews
INNER JOIN business ON reviews.business_id = business.business_id
WHERE categories NOT LIKE '%%Restaurants%%'
'''

businesses = pd.read_sql_query(query, engine)
businesses.shape

(2040250, 5)

## Preprocess the text

We use textacy's preprocess_text function to lowercase the text and strip numbers, URLs, and punctuation. The result is saved to a CSV file so it can be reloaded later for tokenization.
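To make the cleaning steps concrete, here is a minimal sketch that approximates them with the standard library only. `rough_preprocess` and its regular expressions are hypothetical stand-ins, not textacy's implementation, and textacy's actual output may differ in detail (for example in how it substitutes URLs and numbers).

```python
import re

# Hypothetical helper, NOT textacy's implementation: a rough approximation
# of lowercasing plus URL, number, and punctuation removal.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
PUNCT_RE = re.compile(r"[^\w\s]")
WHITESPACE_RE = re.compile(r"\s+")


def rough_preprocess(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)      # drop URLs before punctuation is stripped
    text = NUMBER_RE.sub(" ", text)   # drop integers and simple decimals
    text = PUNCT_RE.sub(" ", text)    # replace punctuation with spaces
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse leftover whitespace


sample = "Visit http://example.com NOW!!! We paid $25.50 for 2 tacos."
cleaned = rough_preprocess(sample)
print(cleaned)
```

URLs are removed first because stripping punctuation beforehand would break them into fragments that no longer match the URL pattern.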

### Restaurants

In [9]:
restaurants.drop('categories', axis=1, inplace=True)

In [10]:
%%time
restaurants['processed'] = restaurants['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True, no_punct=True, no_numbers=True))

CPU times: user 24min 49s, sys: 0 ns, total: 24min 49s
Wall time: 24min 49s


In [12]:
restaurants.drop('text', axis=1, inplace=True)

In [14]:
restaurants.to_csv('../data/rests.csv', index=False)

### Businesses

In [3]:
businesses.head()

| | text | useful | funny | cool | categories |
|---|---|---|---|---|---|
| 0 | Cycle Pub Las Vegas was a blast! Got a groupon... | 1 | 0 | 0 | Pubs;Bars;Bar Crawl;Tours;Nightlife;Hotels & T... |
| 1 | I thought Tidy's Flowers had a great reputatio... | 9 | 0 | 1 | Event Planning & Services;Flowers & Gifts;Flor... |
| 2 | I too have been trying to book an appt to use ... | 0 | 0 | 0 | Beauty & Spas;Massage |
| 3 | Great place to bring dogs! It's really a dog p... | 0 | 0 | 0 | Pet Services;Pets |
| 4 | Decided to give this place a try since I was i... | 0 | 0 | 0 | Food;Bakeries |


In [4]:
businesses.drop('categories', axis=1, inplace=True)

In [5]:
%%time
businesses['processed'] = businesses['text'].map(lambda x: textacy.preprocess.preprocess_text(x, lowercase=True, no_urls=True, no_punct=True, no_numbers=True))

CPU times: user 16min 38s, sys: 0 ns, total: 16min 38s
Wall time: 16min 38s


In [6]:
businesses.drop('text', axis=1, inplace=True)

In [8]:
businesses.to_csv('../data/businesses.csv', index=False)

## Tokenizing

In this section we set up a spaCy tokenizer. We disable part-of-speech tagging, dependency parsing, named-entity recognition, and text categorization to reduce memory usage. We also define a filter function that drops stop words and short tokens (four characters or fewer) from the tokenized documents. The tokenized documents are then collected in a list that can be passed to a vectorizer (see Notebook 3).
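The filtering rule can be illustrated without loading a spaCy model. The sketch below uses plain strings and a tiny, hypothetical stop-word set in place of spaCy's `Token.is_stop`, and it assumes the cutoff used in the `token_filter` code (tokens of four characters or fewer are dropped). `keep_token`, `STOP_WORDS`, and `MIN_LENGTH` are illustrative names, not part of the notebook.

```python
# Hypothetical stand-ins: a tiny stop-word set and plain strings instead of
# spaCy Token objects, so the rule can be checked without loading a model.
STOP_WORDS = {"the", "and", "there", "which", "about"}
MIN_LENGTH = 5  # keep only tokens longer than four characters


def keep_token(text):
    # A token survives only if it is not a stop word AND is long enough.
    return text not in STOP_WORDS and len(text) >= MIN_LENGTH


tokens = ["the", "food", "there", "delicious", "service", "about", "tacos"]
kept = [t for t in tokens if keep_token(t)]
print(kept)
```

Note that long stop words such as "there" and "about" are removed by the stop-word test even though they pass the length test, which is why both conditions are needed.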

In [2]:
restaurants = pd.read_csv('../data/rests.csv')

In [10]:
nlp = textacy.load_spacy("en_core_web_sm", disable = ("tagger", "parser", "ner", "textcat"))

In [16]:
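def token_filter(token):  # drop stop words and tokens of 4 characters or fewer
    # `or` is used instead of `|`: the bitwise operator binds tighter than `<=`,
    # which would evaluate (token.is_stop | len(token.text)) <= 4
    return not (token.is_stop or len(token.text) <= 4)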

### Restaurants

In [6]:
docs = restaurants['processed'].astype(str).tolist()  # the spaCy pipeline expects plain Python (unicode) strings

In [18]:
filtered_tokens = []
start = time.time()
# enumerate from 1 so the progress message reports the true number of documents processed
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens.append(tokens)
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')

Tokenized 10000 documents in 0.5417954802513123 minutes
Tokenized 20000 documents in 1.0150205810864767 minutes
Tokenized 30000 documents in 1.4781464775403341 minutes
Tokenized 40000 documents in 1.9605997800827026 minutes
Tokenized 50000 documents in 2.440075139204661 minutes
Tokenized 60000 documents in 2.9163665135701495 minutes
Tokenized 70000 documents in 3.4392976482709248 minutes
Tokenized 80000 documents in 3.963463226954142 minutes
Tokenized 90000 documents in 4.478336413701375 minutes
Tokenized 100000 documents in 4.979599559307099 minutes
Tokenized 110000 documents in 5.450708782672882 minutes
Tokenized 120000 documents in 5.970176621278127 minutes
Tokenized 130000 documents in 6.496784055233002 minutes
Tokenized 140000 documents in 7.020404386520386 minutes
Tokenized 150000 documents in 7.524508945147196 minutes
Tokenized 160000 documents in 8.063184555371603 minutes
Tokenized 170000 documents in 8.576491578420002 minutes
Tokenized 180000 documents in 9.10695337454478 minutes
...

Tokenized 1460000 documents in 74.72361800670623 minutes
Tokenized 1470000 documents in 75.24632306098938 minutes
Tokenized 1480000 documents in 75.76393938064575 minutes
Tokenized 1490000 documents in 76.2843542178472 minutes
Tokenized 1500000 documents in 76.79368238051732 minutes
Tokenized 1510000 documents in 77.30218292077383 minutes
Tokenized 1520000 documents in 77.81592822472254 minutes
Tokenized 1530000 documents in 78.32706807454427 minutes
Tokenized 1540000 documents in 78.8543143471082 minutes
Tokenized 1550000 documents in 79.36111665169398 minutes
Tokenized 1560000 documents in 79.8617068608602 minutes
Tokenized 1570000 documents in 80.35899256865183 minutes
Tokenized 1580000 documents in 80.87297044992447 minutes
Tokenized 1590000 documents in 81.38849177360535 minutes
Tokenized 1600000 documents in 81.92208697001139 minutes
Tokenized 1610000 documents in 82.44450322389602 minutes
Tokenized 1620000 documents in 82.9399381160736 minutes
...

Tokenized 2890000 documents in 143.91503901878994 minutes
Tokenized 2900000 documents in 144.39802083969116 minutes
Tokenized 2910000 documents in 144.9025940656662 minutes
Tokenized 2920000 documents in 145.4133126695951 minutes
Tokenized 2930000 documents in 145.89905375639597 minutes
Tokenized 2940000 documents in 146.4042413433393 minutes
Tokenized 2950000 documents in 146.90093118747075 minutes
Tokenized 2960000 documents in 147.40726381540298 minutes
Tokenized 2970000 documents in 147.8788391470909 minutes
Tokenized 2980000 documents in 148.3371657927831 minutes
Tokenized 2990000 documents in 148.870343708992 minutes
Tokenized 3000000 documents in 149.38477727969487 minutes
Tokenized 3010000 documents in 149.87994529008864 minutes
Tokenized 3020000 documents in 150.3524611234665 minutes
Tokenized 3030000 documents in 150.83511337836583 minutes
Tokenized 3040000 documents in 151.3062119881312 minutes
Tokenized 3050000 documents in 151.7816428343455 minutes
...
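The progress log suggests a roughly constant throughput. As a back-of-the-envelope check, the sketch below extrapolates the total run time from the last complete log line and the row count returned by the restaurants query. It assumes the rate stays constant to the end of the run, which the log supports only approximately.

```python
# Figures copied from the notebook output above.
docs_done = 3_050_000                 # last complete progress line
minutes_elapsed = 151.7816428343455   # elapsed time reported on that line
total_docs = 3_221_418                # rows returned by the restaurants query

# Assumption: throughput stays constant for the remaining documents.
docs_per_minute = docs_done / minutes_elapsed
estimated_total_minutes = total_docs / docs_per_minute

print(f"{docs_per_minute:,.0f} docs/min, about {estimated_total_minutes:.0f} minutes in total")
```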

Save the tokenized reviews to disk:

In [19]:
with open('../data/tokenized.pkl', 'wb') as f:
    pickle.dump(filtered_tokens, f)
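A later notebook can read the tokens back with `pickle.load`. The sketch below shows the round trip on a small, hypothetical sample written to a temporary directory rather than to `../data/`, so it can be run without the full corpus.

```python
import os
import pickle
import tempfile

# Hypothetical sample in the same shape as the tokenized corpus:
# one list of token strings per review.
sample_tokens = [["delicious", "tacos"], ["friendly", "service"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenized_sample.pkl")
    with open(path, "wb") as f:   # binary mode is required for pickle
        pickle.dump(sample_tokens, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample_tokens)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.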

### Businesses

In [11]:
businesses = pd.read_csv('../data/businesses.csv')  # reload the preprocessed reviews saved earlier
docs = businesses['processed'].astype(str).tolist()

In [17]:
filtered_tokens_bs = []
start = time.time()
# enumerate from 1 so the progress message reports the true number of documents processed
for i, doc in enumerate(nlp.pipe(docs, disable=['tagger', 'parser', 'ner', 'textcat'], batch_size=10000), start=1):
    tokens = [token.lemma_ for token in doc if token_filter(token)]
    filtered_tokens_bs.append(tokens)
    if i % 10000 == 0:
        print(f'Tokenized {i} documents in {(time.time()-start)/60} minutes')

Tokenized 10000 documents in 0.5255980491638184 minutes
Tokenized 20000 documents in 1.037276017665863 minutes
Tokenized 30000 documents in 1.5686418016751607 minutes
Tokenized 40000 documents in 2.1024542530377706 minutes
Tokenized 50000 documents in 2.618568793932597 minutes
Tokenized 60000 documents in 3.1329532504081725 minutes
Tokenized 70000 documents in 3.653790811697642 minutes
Tokenized 80000 documents in 4.175454847017924 minutes
Tokenized 90000 documents in 4.6936612010002134 minutes
Tokenized 100000 documents in 5.195524656772614 minutes
Tokenized 110000 documents in 5.699959830443064 minutes
Tokenized 120000 documents in 6.218272244930267 minutes
Tokenized 130000 documents in 6.734584017594655 minutes
Tokenized 140000 documents in 7.238488566875458 minutes
Tokenized 150000 documents in 7.749598217010498 minutes
Tokenized 160000 documents in 8.263826489448547 minutes
Tokenized 170000 documents in 8.79062537352244 minutes
Tokenized 180000 documents in 9.313677577177684 minutes
...

Tokenized 1460000 documents in 73.36287189324698 minutes
Tokenized 1470000 documents in 73.86117143630982 minutes
Tokenized 1480000 documents in 74.3657881339391 minutes
Tokenized 1490000 documents in 74.8628033320109 minutes
Tokenized 1500000 documents in 75.3468784570694 minutes
Tokenized 1510000 documents in 75.85534100135168 minutes
Tokenized 1520000 documents in 76.3619817574819 minutes
Tokenized 1530000 documents in 76.86218363841375 minutes
Tokenized 1540000 documents in 77.35893313487371 minutes
Tokenized 1550000 documents in 77.84132454792659 minutes
Tokenized 1560000 documents in 78.33095343112946 minutes
Tokenized 1570000 documents in 78.83524558544158 minutes
Tokenized 1580000 documents in 79.33973424832026 minutes
Tokenized 1590000 documents in 79.84158979256948 minutes
Tokenized 1600000 documents in 80.34113273620605 minutes
Tokenized 1610000 documents in 80.83974155982335 minutes
Tokenized 1620000 documents in 81.32629689772924 minutes
...

In [19]:
with open('../data/tokenized_bs.pkl', 'wb') as f:
    pickle.dump(filtered_tokens_bs, f)