# Yelp Reviews Preprocessing

This notebook prepares the Yelp review text collected by the App Store scrape script. A single function lowercases the text, expands chat abbreviations and contractions, tokenizes it, and tags each token with its part of speech.

## Data Loading

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [13]:
reviews = pd.read_csv('yelp_reviews.csv')
reviews.drop('Unnamed: 0', axis = 1, inplace = True)

In [14]:
reviews.head(5)

,date,review,rating,isEdited,userName,title,developerResponse
0,2024-11-16 17:12:47,let me tell you why I named this and encouragi...,5,False,Salonroxie,Encouraging App,
1,2024-11-13 19:31:04,I vaguely remember yelp being a positive exper...,1,False,S_Mosher,Unreliable Dishonest Shady…. Let’s Cancel Yelp!,
2,2024-10-11 18:43:56,I will not be using Yelp ever again. After a t...,1,False,jennausuwiajdneka,Horrible,
3,2024-09-22 20:35:32,During think tank meetings with other business...,1,False,Srepman,Is yelp fair?,"{'id': 46973211, 'body': 'Thank you for taking..."
4,2024-10-24 17:15:35,"Brought my 2023 car, with 15,454 miles, to hav...",1,False,Livid Sr Citizen,"Discount Tire - Plymouth, MN",


## Preprocessing

Convert the text columns to the pandas string dtype.

In [15]:
reviews['review'] = reviews['review'].astype('string')
reviews['title'] = reviews['title'].astype('string')
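
The conversion above keeps missing values as `<NA>` rather than turning them into text. The sketch below illustrates this on a hypothetical two-row frame (`sample`, `missing_mask`, and `lowered` are made-up names, not part of the notebook). It also shows that vectorized `.str` methods pass missing values through, whereas a plain `text.lower()` call inside `apply` would raise on a missing entry.

```python
import pandas as pd

# Hypothetical two-row frame; the second review is missing
sample = pd.DataFrame({'review': ['Great App', None], 'title': ['Nice', 'Meh']})
sample['review'] = sample['review'].astype('string')

# Missing values survive the conversion as <NA>
missing_mask = sample['review'].isna().tolist()

# Vectorized string methods skip missing values instead of raising
lowered = sample['review'].str.lower()

print(missing_mask)
print(lowered.tolist())
```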

Build a lookup dictionary for chat words such as BTW, ASAP, and LMAO.

In [16]:
# Read a file of chat-word definitions (btw, asap, lmao, etc.), one entry per line
with open('slang.txt') as slang_file:
    slang = slang_file.read()
slang_list = slang.split("\n") # Split on newlines to get a list of abbreviation entries
del slang_list[-1] # Remove the last list item (empty if the file ends with a newline)

# Goes through each item and splits on the symbol defining the abbreviation's meaning
def abbrev_def_split(x):
    x = x.replace('=', '`')
    x = x.replace(': ', '`')
    x = x.replace(' â€“ ', '`')
    x = x.replace('â€™', '’')
    x = x.replace('\t', '`')
    x = x.lower()
    return tuple(x.split('`')) # Returns a two item tuple containing the abbreviation and meaning

# Apply the function to each item in the abbreviation list to get a list of two-item tuples, then build a dictionary
# Each abbreviation can now be replaced with its full meaning
slang_def_dict = dict(map(abbrev_def_split, slang_list)) 
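
To make the shape of the resulting dictionary concrete, here is a minimal sketch that applies the same separator normalization to a few made-up lines. The sample lines and the names `split_abbreviation`, `sample_lines`, and `sample_slang_dict` are hypothetical and not taken from slang.txt. Unlike the cell above, this version splits only on the first separator and skips lines without one; that is an assumption about how malformed lines should be handled.

```python
def split_abbreviation(line):
    """Return (abbreviation, meaning), or None if the line has no separator."""
    # Normalize the supported separators to a single marker character
    for separator in ('=', ': ', '\t'):
        line = line.replace(separator, '`')
    # Split once so a meaning that contains another separator stays in one piece
    parts = line.lower().split('`', 1)
    if len(parts) != 2:
        return None
    abbreviation, meaning = parts
    return abbreviation.strip(), meaning.strip()

# Hypothetical entries, one per supported separator plus one malformed line
sample_lines = [
    "BTW=By The Way",
    "ASAP: As Soon As Possible",
    "IMO\tIn My Opinion",
    "no separator here",
]

pairs = [pair for pair in map(split_abbreviation, sample_lines) if pair is not None]
sample_slang_dict = dict(pairs)
print(sample_slang_dict)
```

Skipping malformed lines keeps `dict()` from raising on a tuple that does not have exactly two items.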

Create a lookup dictionary that will handle contractions

In [17]:
import re

# Create a lookup dictionary of contractions from an Excel table built from a Wikipedia page
contractions = pd.read_excel('contractions.xlsx', dtype = {'contraction': 'string', 'meaning': 'string'})
    
contractions['contraction'] = contractions['contraction'].apply(lambda x: x.replace('(informal)', ''))
contractions['meaning'] = contractions['meaning'].apply(lambda x: re.sub(r"\s/\s(.*)", "", x))
contractions['contraction'] = contractions['contraction'].apply(lambda x: x.lower())
contractions['contraction'] = contractions['contraction'].apply(lambda x: x.replace("'", "’"))
contractions['meaning'] = contractions['meaning'].apply(lambda x: x.lower())

keys = list(contractions.contraction)
values = list(contractions.meaning)
contraction_dict = dict(zip(keys, values))
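
The cleaning steps above can be tried without the spreadsheet. The sketch below runs equivalent steps on a hypothetical three-row table (the rows and the names `sample` and `sample_contraction_dict` are made up for illustration). It uses vectorized `.str` methods and adds a `.str.strip()` call, on the assumption that removing "(informal)" can leave a trailing space that would stop the key from matching.

```python
import pandas as pd

# Hypothetical rows in the same two-column layout as contractions.xlsx
sample = pd.DataFrame({
    'contraction': ["Ain't (informal)", "Can't", "He'd"],
    'meaning': ["am not / is not / are not", "cannot", "he had / he would"],
}, dtype='string')

sample['contraction'] = (
    sample['contraction']
    .str.replace('(informal)', '', regex=False)  # drop the usage note
    .str.strip()                                 # assumption: remove leftover spaces
    .str.lower()
    .str.replace("'", "’", regex=False)          # keys use the curly apostrophe
)

# Keep only the first meaning when several are separated by " / "
sample['meaning'] = (
    sample['meaning']
    .str.replace(r"\s/\s.*", "", regex=True)
    .str.lower()
)

sample_contraction_dict = dict(zip(sample['contraction'], sample['meaning']))
print(sample_contraction_dict)
```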

### One Cleaning Function

Combine the steps above into one cleaning function that returns each text as a list of word/POS pairs, where POS is the part of speech. Later cells split these pairs into separate columns holding the words only and the POS tags only. The words-only column supports analyses such as word-frequency counts, while the paired column keeps each word together with its tag.

In [18]:
# Import library for POS labeling
import spacy
nlp = spacy.load('en_core_web_sm')

# Import NLTK library for tokenization
from nltk import word_tokenize

In [19]:
# Create new dataframe
reviews_prep = reviews.copy()
reviews_prep = reviews_prep.drop(['isEdited', 'userName', 'developerResponse'], axis = 1)
reviews_prep.head(5)

,date,review,rating,title
0,2024-11-16 17:12:47,let me tell you why I named this and encouragi...,5,Encouraging App
1,2024-11-13 19:31:04,I vaguely remember yelp being a positive exper...,1,Unreliable Dishonest Shady…. Let’s Cancel Yelp!
2,2024-10-11 18:43:56,I will not be using Yelp ever again. After a t...,1,Horrible
3,2024-09-22 20:35:32,During think tank meetings with other business...,1,Is yelp fair?
4,2024-10-24 17:15:35,"Brought my 2023 car, with 15,454 miles, to hav...",1,"Discount Tire - Plymouth, MN"


In [20]:
# Preprocessing function

def preprocessingFunction(text):
    # Lowercase
    text = text.lower()
    
    # Handling chat words
    text = text.split(" ")
    text = [slang_def_dict[word] if word in slang_def_dict else word for word in text]
    text = " ".join(text)
    
    # Handling contractions
    text = text.split(" ")
    text = [contraction_dict[word] if word in contraction_dict else word for word in text]
    text = " ".join(text)
    
    # Tokenize and keep alphabetic tokens only (drops punctuation and numbers)
    text = word_tokenize(text)
    text = [word for word in text if word.isalpha()]
    text = " ".join(text)
    
    # POS labeling with Spacy
    doc = nlp(text)
    output = [f'{token.text}/{token.pos_}' for token in doc]
    
    return output

reviews_prep['review_prepped'] = reviews_prep['review'].apply(preprocessingFunction)
reviews_prep['title_prepped'] = reviews_prep['title'].apply(preprocessingFunction)
reviews_prep.head(5)

,date,review,rating,title,review_prepped,title_prepped
0,2024-11-16 17:12:47,let me tell you why I named this and encouragi...,5,Encouraging App,"[let/VERB, me/PRON, tell/VERB, you/PRON, why/S...","[encouraging/VERB, app/NOUN]"
1,2024-11-13 19:31:04,I vaguely remember yelp being a positive exper...,1,Unreliable Dishonest Shady…. Let’s Cancel Yelp!,"[i/PRON, vaguely/ADV, remember/VERB, yelp/NOUN...","[unreliable/ADJ, dishonest/INTJ, let/VERB, us/..."
2,2024-10-11 18:43:56,I will not be using Yelp ever again. After a t...,1,Horrible,"[i/PRON, will/AUX, not/PART, be/AUX, using/VER...",[horrible/ADJ]
3,2024-09-22 20:35:32,During think tank meetings with other business...,1,Is yelp fair?,"[during/ADP, think/NOUN, tank/NOUN, meetings/N...","[is/AUX, yelp/NOUN, fair/ADJ]"
4,2024-10-24 17:15:35,"Brought my 2023 car, with 15,454 miles, to hav...",1,"Discount Tire - Plymouth, MN","[brought/VERB, my/PRON, car/NOUN, with/ADP, mi...","[discount/NOUN, tire/NOUN, plymouth/PROPN, mn/..."
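
One property of the function above is worth noting: the slang and contraction lookups run on whitespace-separated words, so a word with attached punctuation ("btw,") or a straight apostrophe ("can't") does not match a key, since the contraction keys use the curly apostrophe. The sketch below shows one possible way to make the lookup more tolerant using a regular expression. The names `expand_words`, `WORD_PATTERN`, `sample_slang`, and `sample_contractions` are hypothetical, and this is an alternative under those assumptions, not the notebook's method.

```python
import re

# Hypothetical miniature lookup tables in the same format as the real ones
sample_slang = {'btw': 'by the way', 'asap': 'as soon as possible'}
sample_contractions = {'can’t': 'cannot', 'let’s': 'let us'}

# A word is a run of letters, optionally joined by internal apostrophes
WORD_PATTERN = re.compile(r"[a-z]+(?:’[a-z]+)*")

def expand_words(text, slang, contractions):
    # Lowercase and normalize straight apostrophes to match the dictionary keys
    text = text.lower().replace("'", "’")

    def replace(match):
        word = match.group(0)
        word = slang.get(word, word)
        return contractions.get(word, word)

    # Only the matched words are replaced; punctuation around them is untouched
    return WORD_PATTERN.sub(replace, text)

raw = "BTW, I can't log in. Let’s fix this ASAP!"
expanded = expand_words(raw, sample_slang, sample_contractions)
print(expanded)

# For comparison: the whitespace-based lookup leaves "btw," unexpanded
whitespace_words = raw.lower().split(" ")
print([sample_slang.get(word, word) for word in whitespace_words][0])
```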


In [21]:
# Functions to split the word/POS pairs into separate columns

def createTokenCol(preppedText):
    tokenPOSPairs = [tokenPosPair.split("/") for tokenPosPair in preppedText]
    tokens = [tokenPosPairSplit[0] for tokenPosPairSplit in tokenPOSPairs]
    tokens = ' '.join(tokens)
    
    return tokens

def createPOSCol(preppedText):
    tokenPOSPairs = [tokenPosPair.split("/") for tokenPosPair in preppedText]
    POSes = [tokenPosPairSplit[1] for tokenPosPairSplit in tokenPOSPairs]
    POSes = ' '.join(POSes)
    
    return POSes

reviews_prep['review_tokens'] = reviews_prep['review_prepped'].apply(createTokenCol)
reviews_prep['review_POSes'] = reviews_prep['review_prepped'].apply(createPOSCol)
reviews_prep['title_tokens'] = reviews_prep['title_prepped'].apply(createTokenCol)
reviews_prep['title_POSes'] = reviews_prep['title_prepped'].apply(createPOSCol)

reviews_prep.head(5)

,date,review,rating,title,review_prepped,title_prepped,review_tokens,review_POSes,title_tokens,title_POSes
0,2024-11-16 17:12:47,let me tell you why I named this and encouragi...,5,Encouraging App,"[let/VERB, me/PRON, tell/VERB, you/PRON, why/S...","[encouraging/VERB, app/NOUN]",let me tell you why i named this and encouragi...,VERB PRON VERB PRON SCONJ PRON VERB PRON CCONJ...,encouraging app,VERB NOUN
1,2024-11-13 19:31:04,I vaguely remember yelp being a positive exper...,1,Unreliable Dishonest Shady…. Let’s Cancel Yelp!,"[i/PRON, vaguely/ADV, remember/VERB, yelp/NOUN...","[unreliable/ADJ, dishonest/INTJ, let/VERB, us/...",i vaguely remember yelp being a positive exper...,PRON ADV VERB NOUN AUX DET ADJ NOUN PRON AUX A...,unreliable dishonest let us cancel yelp,ADJ INTJ VERB PRON VERB NOUN
2,2024-10-11 18:43:56,I will not be using Yelp ever again. After a t...,1,Horrible,"[i/PRON, will/AUX, not/PART, be/AUX, using/VER...",[horrible/ADJ],i will not be using yelp ever again after a te...,PRON AUX PART AUX VERB NOUN ADV ADV ADP DET AD...,horrible,ADJ
3,2024-09-22 20:35:32,During think tank meetings with other business...,1,Is yelp fair?,"[during/ADP, think/NOUN, tank/NOUN, meetings/N...","[is/AUX, yelp/NOUN, fair/ADJ]",during think tank meetings with other business...,ADP NOUN NOUN NOUN ADP ADJ NOUN NOUN ADP PRON ...,is yelp fair,AUX NOUN ADJ
4,2024-10-24 17:15:35,"Brought my 2023 car, with 15,454 miles, to hav...",1,"Discount Tire - Plymouth, MN","[brought/VERB, my/PRON, car/NOUN, with/ADP, mi...","[discount/NOUN, tire/NOUN, plymouth/PROPN, mn/...",brought my car with miles to have my snow tire...,VERB PRON NOUN ADP NOUN PART VERB PRON NOUN NO...,discount tire plymouth mn,NOUN NOUN PROPN NOUN
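
The two helper functions above each split every pair. As a possible alternative, the sketch below splits each pair once and returns both strings together. The name `split_pairs` is hypothetical; it uses `rsplit` on the last slash, which behaves the same as the cell above for the alphabetic-only tokens produced earlier.

```python
def split_pairs(prepped_text):
    """Return (tokens, pos_tags) as two space-joined strings."""
    if not prepped_text:
        # An empty review or title yields two empty strings
        return '', ''
    # Split each "word/POS" pair once, then regroup into words and tags
    tokens, tags = zip(*(pair.rsplit('/', 1) for pair in prepped_text))
    return ' '.join(tokens), ' '.join(tags)

tokens, tags = split_pairs(['is/AUX', 'yelp/NOUN', 'fair/ADJ'])
print(tokens)
print(tags)
```

Applied to a column with `apply`, this returns one tuple per row, which would then need to be unpacked into two columns.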


**Save to a CSV file**

In [22]:
reviews_prep.to_csv('reviews_prep_v2.csv')
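
Because the call above writes the DataFrame index as the first column, reloading the file produces an extra column named 'Unnamed: 0', like the one dropped at the start of this notebook. The sketch below shows the effect on a hypothetical two-row frame, writing to in-memory buffers instead of files; passing `index=False` is one way to avoid the extra column if the index is not needed downstream.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for reviews_prep
sample = pd.DataFrame({'rating': [5, 1], 'title': ['Encouraging App', 'Horrible']})

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Note also that list-valued columns such as review_prepped are written as their string representation, so they come back as strings when the CSV is read again.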