## TF-IDF Amazon

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-[poor|good]-reviews.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?




In [1]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from matplotlib.pyplot import figure

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


In [2]:
with open('poor_amazon_toy_reviews.txt', 'r', encoding="utf8") as f:
    bads = f.readlines()
with open('good_amazon_toy_reviews.txt', 'r', encoding="utf8") as f:
    goods = f.readlines()


### Cleaning and Stopword removal

Let's clean up the data by finding and removing garbage characters (this code is reused from my previous homework), and then remove stopwords based on our custom list.

In [3]:
def find_unique_characters(regex, lines):
    """
    Find the unique characters matched by `regex` across a list of strings (almost certainly inefficiently).
    """
    # Collect every character matching the regex, creating a list of lists of matching characters
    potential_malforms = [re.findall(regex, review) for review in lines]

    # Whittle this list of lists down to a unique set of characters
    unique_malforms = set(char for review in potential_malforms for char in review)
    
    print(F"Number of unique potential Malformed Characters: {len(unique_malforms)}, \n\nCandidates: {unique_malforms}")
    return unique_malforms

### Utility functions

In [4]:
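def cleaner(data):
    # Keep runs of letters and digits only, rejoin them with single spaces, and lowercase.
    # The range must be A-Z: A-z would also match the characters [ \ ] ^ _ and `
    cleaned=[" ".join(re.findall("[a-zA-Z0-9]+",i)).lower() for i in data]
    return cleaned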

def count_words(lines, delimiter=" "):
    
    words = Counter() # instantiate a Counter object called words
    for line in lines:
        for word in line.split(delimiter):
            words[word] += 1 # increment count for word
    return words
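
The character class used for this kind of cleaning is easy to get subtly wrong. In ASCII, a range written as `A-z` spans more than the letters: it also covers `[`, `\`, `]`, `^`, `_` and the backtick, which sit between `Z` and `a`. The sketch below uses a made-up sample string (an assumption, not taken from the reviews) to compare that range with the stricter `A-Z`.

```python
import re

# Hypothetical sample text containing the characters between 'Z' and 'a' in ASCII
sample = "great_toy [new] costs^2 `ok`"

# 'A-z' runs from code point 65 to 122, so it also accepts [ \ ] ^ _ and `
loose = re.findall(r"[a-zA-z0-9]+", sample)

# 'A-Z' restricts the class to letters and digits only
strict = re.findall(r"[a-zA-Z0-9]+", sample)

print(loose)   # punctuation survives inside the tokens
print(strict)  # only alphanumeric runs remain
```

With the loose range, tokens such as `[new]` pass through unchanged, which would later show up as separate vocabulary entries.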


In [5]:
%time 
stolen_unique_malforms = find_unique_characters(r"[\W ]", goods+bads)

Wall time: 0 ns
Number of unique potential Malformed Characters: 228, 

Candidates: {'▽', '☀', '♡', '😴', '＿', '😷', '＾', '👀', '|', '‘', '😝', '–', '。', '\xa0', '◄', '🏼', '💛', '*', '（', '🚮', '）', '“', '—', '<', '😀', '☹', '͡', '⚾', '😗', '´', '😛', '👧', '・', '░', '🌴', '͜', '😪', '👎', '\u3000', '😨', '¿', '😻', '😡', '😞', '☺', '🙂', '😠', '😜', '♥', '🐌', '😯', '+', '🎶', '🐶', '@', '}', '🙌', '🐮', '😢', '😆', '✴', '😭', '😬', '$', '™', '❤', '💞', '😩', '😫', '😐', '😄', '\n', '⌒', '💔', '🐘', '✋', '🔫', '😏', '∩', '🏿', '😃', '🍟', '🐜', '?', '£', '🎃', '💕', '💁', '🌵', ' ', '!', '💖', '"', '👨', '`', '🙅', '%', '😁', '😥', '🐝', '😟', '🍴', '😱', '♪', '✔', '›', '⁃', '💪', '🐣', '✌', "'", '🏃', '😇', '，', '😒', '″', '👏', '🐧', '😘', '￣', '🙏', '»', '👩', '🍦', '😳', '•', '≡', '∀', '🏽', '😎', '😂', '😚', '💥', '🍂', '🐩', '\u200b', '🐗', '☆', '\x1a', '💯', '❗', '👌', '😲', '[', '≦', '🚗', '.', '💘', '😈', '😮', '\\', '😍', ')', '&', '😰', '-', '⁄', '●', '«', '🐬', '😙', '🐻', '️', '≧', '💀', '🍔', '🐞', '⭐', '°', '👻', '\u200d', ':', '😖', '×', '►', '👽', '🎪', '^', '👍

### Removing all malformed / Unnecessary Characters

In [6]:
#SANITY CHECK: make sure the list comprehension works like i think it does
bad_cleaned_reviews = [re.sub(r"[^A-Za-z0-9 ]",'',review) for review in bads]
good_cleaned_reviews = [re.sub(r"[^A-Za-z0-9 ]",'',review) for review in goods]


### Let's look at our current stopwords, and add some of the most common words from the reviews

Because we will be creating n-grams, we err on the side of caution when removing stopwords, since different combinations of common words may carry a strong signal. We also retain numbers to preserve meaning around the age of recipients, which we know is important from our previous analysis.
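
To make the caution about stopwords concrete, here is a minimal sketch of how removing them before building n-grams can change the meaning of a phrase. The stopword set and the review text are illustrative assumptions; the real NLTK English list is much longer, but it does include negations such as "not".

```python
# Illustrative stopword subset; the real NLTK English list is much longer
STOPWORDS = {"do", "not", "this", "the", "it", "is"}

def ngrams(tokens, n):
    """Return all contiguous n-token windows joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical review text
review = "do not buy this toy it is not worth the money"
tokens = review.split()

# 3-grams with stopwords kept: the negation stays attached to the verb
kept = ngrams(tokens, 3)

# 3-grams after stopword removal: the negation is gone
filtered = ngrams([t for t in tokens if t not in STOPWORDS], 3)

print(kept[:2])
print(filtered)
```

With stopwords kept, the first 3-gram is "do not buy". After filtering, the same review yields "buy toy worth", which reads as the opposite sentiment.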

In [7]:
tot_words = []
tot_reviews = []
tot_reviews.extend(bad_cleaned_reviews)
tot_reviews.extend(good_cleaned_reviews)

for review in tot_reviews:
    words = re.sub(r"[^A-Za-z0-9 ]",'',review)
    tot_words.extend(word_tokenize(words))
    
Counter(list(tot_words)).most_common()

[('the', 115938),
 ('and', 98027),
 ('a', 75951),
 ('to', 73296),
 ('it', 72270),
 ('I', 64346),
 ('for', 53980),
 ('is', 48332),
 ('of', 43441),
 ('this', 40375),
 ('with', 33028),
 ('my', 31826),
 ('in', 30397),
 ('was', 25629),
 ('that', 21962),
 ('on', 21758),
 ('are', 21280),
 ('The', 19969),
 ('as', 19187),
 ('but', 17668),
 ('you', 17439),
 ('My', 17253),
 ('have', 16986),
 ('so', 16888),
 ('great', 16559),
 ('not', 15167),
 ('them', 14223),
 ('very', 13838),
 ('This', 13426),
 ('one', 12829),
 ('love', 12807),
 ('loves', 12795),
 ('It', 12751),
 ('they', 12087),
 ('be', 11742),
 ('all', 11544),
 ('old', 11014),
 ('Great', 10846),
 ('like', 10228),
 ('at', 10115),
 ('these', 9988),
 ('can', 9813),
 ('fun', 9726),
 ('up', 9702),
 ('little', 9543),
 ('just', 9493),
 ('game', 9351),
 ('well', 8806),
 ('br', 8786),
 ('her', 8611),
 ('has', 8504),
 ('year', 8458),
 ('or', 8411),
 ('good', 8344),
 ('will', 8302),
 ('out', 8283),
 ('kids', 8042),
 ('had', 7998),
 ('product', 7914),
 ('

In [8]:
# We know these products are bought from Amazon, so let's drop that word
extras = ['Amazon','amazon']

nltk_stopwords=list(set(stopwords.words('english')))
nltk_stopwords.extend(extras)
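
One thing to keep in mind when filtering against this list: NLTK's English stopwords are lowercase, and a plain membership test is case-sensitive. The frequency table above lists 'the' and 'The' as separate tokens, so capitalised stopwords would pass an exact-match filter. A small sketch, using an illustrative subset of the list and made-up tokens:

```python
# Illustrative subset of a lowercase stopword list
stop_words = {"the", "they", "i", "my", "it"}

# Hypothetical tokens that keep their original capitalisation
tokens = ["They", "break", "fast", "I", "returned", "it"]

# Exact-match filtering keeps capitalised stopwords
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing only for the comparison filters them while preserving the original tokens
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Storing the stopwords in a `set` rather than a `list` also makes each lookup constant-time, which matters when the check runs once per token.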

In [9]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

### Tokenize, lemmatize, and consolidate concepts

    1. Tokenize the reviews into words
    2. Tag each word with its part of speech
    3. Lemmatize the word using the part of speech calculated before
    4. Consolidate concepts with regex
    5. Remove words that are in our stopwords list
    
**If spellcheck is enabled, this cell takes a very long time to run. You've been warned.**

I am choosing lemmatization over stemming: we are not feeding this into any ML models, so we do not need to worry about overfitting, and lemmatized words have the added benefit of being easier to understand for the business analyst who will read our report, since stemmed words are less human-readable.

In [10]:
#part of speech logic stolen from: https://www.programiz.com/python-programming/methods/set/update

def process_documents(tot_reviews, spellcheck=False):
    cleaned_reviews = []
    for review in tot_reviews:
        # Clean punctuation
        clean_review = re.sub(r"[^A-Za-z0-9 ]",'', review)

        # Tokenize into words 
        lemmatized_word = []

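        #before tagging part of speech lets run our spellcheck
        if spellcheck: 
            # spellcheck_word expects a list of lines, so wrap the review and re-tokenize the result
            clean_review = word_tokenize(" ".join(spellcheck_word([clean_review])))
        else: 
            clean_review = word_tokenize(clean_review)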

        #Tag words with part of speech 
        nltk_tagged = nltk.pos_tag(clean_review)  
        wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
        lemmatizer = WordNetLemmatizer()

        # lemetize, use part of speech if available
        for word, tag in wordnet_tagged:
            if tag is None:
                #if there is no available tag, append the token as is
                lemmatized_word.append(word)
            else:        
                #else use the tag to lemmatize the token
                lemmatized_word.append(lemmatizer.lemmatize(word, tag))

        words_clean = []
        for word in lemmatized_word:
            # Hyphens were already stripped above, so only the unhyphenated spellings can occur
            word = re.sub(r"(?:bdays?|birthdays?)",'_BIRTHDAY_', word, flags=re.IGNORECASE)
            word = re.sub(r"(?:xmas|christmas?)",'_CHRISTMAS_', word, flags=re.IGNORECASE)
            if word in nltk_stopwords:
                continue
            words_clean.append(word)
        cleaned_review = " ".join(words_clean)
        cleaned_reviews.append((cleaned_review))

    return cleaned_reviews
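
Step 4 (consolidating concepts with regex) can be illustrated in isolation. This is a simplified sketch rather than the function above: the helper name `consolidate`, the pattern list, and the sample tokens are assumptions made for the example.

```python
import re

# Patterns follow the same idea as process_documents; the placeholders are arbitrary labels
CONCEPTS = [
    (re.compile(r"(?:bdays?|birthdays?)", re.IGNORECASE), "_BIRTHDAY_"),
    (re.compile(r"(?:xmas|christmas)", re.IGNORECASE), "_CHRISTMAS_"),
]

def consolidate(token):
    """Replace any spelling variant found inside the token with its concept placeholder."""
    for pattern, placeholder in CONCEPTS:
        token = pattern.sub(placeholder, token)
    return token

# Hypothetical tokens
tokens = ["Bday", "gift", "for", "Xmas", "birthdays"]
print([consolidate(t) for t in tokens])
```

Because `re.sub` replaces substrings, a token such as "birthdaycake" would become "_BIRTHDAY_cake". Anchoring the patterns with `^` and `$` would restrict the replacement to whole tokens if that matters for the analysis.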

In [11]:
good_finished_reviews = process_documents(good_cleaned_reviews)
bad_finished_reviews = process_documents(bad_cleaned_reviews)
bad_finished_reviews

['Do buy They break fast I spin 15 minute end fly dont waste money They make cheap plastic crack Buy poi ball work lot good limit fund',
 'Showed show Was someone old toy paint',
 'You need expansion pack 35 want access player aid Factions expansion The base game Alien Frontiers play much smooth add Factions expansion pack All pigeonhole certain path victory',
 'This gift husband new pool Did receive color I order one month use continuously mesh pull away material inflatable side Completely shred long use It store properly kept outside pool Poorly make good go WM get something clearance',
 'Received pineapple rather advertised smore',
 'This much small average size child bounce It size punching balloon VERY DISAPPOINTING',
 'Dont buy Cheaply make Hippos get stick Cover ball storage attach securely underside game every time pick cover fall ball go everywhere Cant believe game actually pass QC standard You would think 99 cent game',
 'Not good',
 'Misrepresentation product per picture I 

### Deciding on number of N-Grams

Values of 2, 3, and 4 are all common choices of n for n-grams. I am leaning toward higher values since we want to deliver our final report to a business stakeholder. Let's run several settings and compare the outputs further down. It is worth noting that longer n-grams will produce many more distinct features, which may be a definite downside.

Looking at these results, we can see that 4-grams produced the most insightful features, so we will use this number moving forward.
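
When comparing settings, it helps to remember that `ngram_range=(lo, hi)` in the scikit-learn vectorizers extracts every n-gram length from `lo` to `hi` inclusive. The pure-Python sketch below mimics that on a single made-up sentence; the helper `ngram_range_features` is hypothetical and ignores the vectorizer's own tokenization rules.

```python
def ngram_range_features(text, lo, hi):
    """Return every contiguous word n-gram of length lo..hi (inclusive)."""
    tokens = text.split()
    feats = []
    for n in range(lo, hi + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats

# Hypothetical sentence
feats = ngram_range_features("bought this for my son", 3, 4)
for f in feats:
    print(f)
```

Every 4-gram contains two 3-grams, and a shorter n-gram occurs at least as often as any longer one that contains it. In a mixed range such as (3, 4), the 3-grams therefore tend to dominate the top rows.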

### TF-IDF Report

By looking at n-grams without numbers, we are able to consolidate "my x year old" type statements, which creates a more accurate and succinct analysis.
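
The cleaning regex keeps digits, yet the n-grams reported here contain no ages. A plausible explanation (an assumption, since the notebook does not state it) is the default `token_pattern` documented for scikit-learn's `CountVectorizer` and `TfidfVectorizer`, which only keeps tokens of two or more word characters. The sketch applies that pattern directly with `re` to two made-up phrases:

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Tokenize the way the default pattern would: tokens of 2+ word characters."""
    return re.findall(TOKEN_PATTERN, text)

# Hypothetical phrases
single_digit = tokenize("for my 3 year old")
double_digit = tokenize("for my 10 year old")

print(single_digit)
print(double_digit)
```

Under this pattern a single-digit age disappears, so "my 3 year old" and "my 5 year old" collapse into the same n-gram, while two-digit ages would still be kept as tokens.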

We performed a TF-IDF analysis in order to answer the following questions: 

    1. Which features drive good reviews?
    2. Which features drive poor reviews?
    3. What are the most common issues that generated customer dissatisfaction?

#### Features that drive good reviews

    Top 7 features from our 4-6 n-gram vectorizer are:

| N-Gram | Score |
| ----------- | ----------- |
| bought this for my | 553 |
| my son loves it | 461 |
| my daughter loves it | 416 |
| for my year old | 403 |
| my son loves this | 305 |
| my daughter loves this | 302 |
| got this for my | 278 |



The most important phrases from our TF-IDF analysis for positive reviews can be grouped into 3 main "concepts" or collocated phrases, gleaned from across the 3- to 6-gram results for the most insightful features that are still human-readable:
    
    1. Reviewer was given a benefit in exchange for a review, such as a discount
        - "discount in exchange for my honest"
        - "honest and unbiased review"
        
    2. Item was a gift and the recipient was pleased with it
        - "my son loves it"
        - "for my year old"
        
    3. Actual features of the product or service
        - "arrived on time"
        - "easy to put together"

In [12]:
vectorizer = TfidfVectorizer(ngram_range=(3, 4),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(good_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
my year old,1060.143115
my son loves,783.1404
my daughter loves,713.726869
for the price,634.414076
this for my,622.111232
to play with,609.354407
bought this for,542.959091
this is great,529.088319
kids love it,474.385197
daughter loves it,446.607244
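
The score column is the sum, over all reviews, of each n-gram's TF-IDF weight. The sketch below reproduces that aggregation in pure Python on a made-up corpus. It assumes the smoothed IDF and L2 row normalisation documented as scikit-learn's `TfidfVectorizer` defaults, and it works on single words rather than n-grams to stay short; `summed_tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

def summed_tfidf(docs):
    """Sum L2-normalised TF-IDF weights per term over tokenised documents.

    Uses idf = ln((1 + n) / (1 + df)) + 1, the smoothed form documented
    for scikit-learn's TfidfVectorizer defaults.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        tf = Counter(doc)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # skip empty documents
        for t, w in weights.items():
            totals[t] += w / norm
    return totals

# Toy corpus (hypothetical): one phrase appears in two of three documents
docs = [["waste", "money"], ["waste", "money"], ["broke", "fast"]]
scores = summed_tfidf(docs)
for term, score in scores.most_common():
    print(term, round(score, 3))
```

Even though IDF down-weights terms that appear in many documents, summing across documents still rewards frequency: here "waste" scores twice as high as "broke". This is one reason generic phrases can outrank specific product features in the tables.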


In [13]:
vectorizer = TfidfVectorizer(ngram_range=(3, 3),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(good_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
my year old,1145.890122
my son loves,828.266182
my daughter loves,790.16477
this for my,689.560817
for the price,627.45372
to play with,610.856129
bought this for,586.068913
this is great,519.53417
kids love it,503.941592
daughter loves it,502.48705


In [14]:
vectorizer = TfidfVectorizer(ngram_range=(4, 6),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(good_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
bought this for my,552.572127
my son loves it,461.160128
my daughter loves it,416.263511
for my year old,402.806434
my son loves this,304.527816
my daughter loves this,302.138022
got this for my,277.788459
my year old loves,224.438965
easy to put together,220.357903
my grandson loves it,212.367835


#### Features that drive poor reviews

    Top 7 features from our 4-6 n-gram vectorizer are:

| N-Gram | Score |
| ----------- | ----------- |
| dont waste your money | 131 |
| do not buy this | 78 |
| not worth the money | 78 |
| out of the box | 64 |
| bought this for my | 44 |
| would not recommend this | 39 |
| not worth the price | 36 |




The most important phrases from our TF-IDF analysis for poor reviews can be grouped into 2 main "concepts" or collocated phrases, gleaned from across the 3- to 6-gram results for the most insightful features that are still human-readable:
    
    1. Value proposition was lacking
        - "dont waste your time"
        - "not worth the money"
        
    2. Dissatisfaction with quality of item 
        - "not work at all"
        - "broke the first day"



In [15]:
vectorizer = TfidfVectorizer(ngram_range=(4, 6),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(bad_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
dont waste your money,130.507159
do not buy this,78.245221
not worth the money,77.731842
out of the box,63.645846
bought this for my,44.415243
would not recommend this,39.333319
not worth the price,36.059896
total waste of money,34.366202
to send it back,33.821013
complete waste of money,33.121143


In [16]:
vectorizer = TfidfVectorizer(ngram_range=(4, 4),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(bad_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
dont waste your money,134.42722
not worth the money,78.308744
do not buy this,76.523377
out of the box,72.766264
would not recommend this,43.844095
bought this for my,43.266875
not worth the price,36.377436
total waste of money,34.737827
to send it back,33.413432
it out of the,33.304878


#### The most common issues that generated customer dissatisfaction

Below are some good candidates for reasons that generated dissatisfaction. Since we are interested in the most common issues, we will use a standard count vectorizer as opposed to TF-IDF: 

    - Quality issues
    - Having to return the item
    - Misaligned expectations

In [17]:
vectorizer = CountVectorizer(ngram_range=(4, 4),
                             max_features=500, max_df = .8)
X = vectorizer.fit_transform(bad_cleaned_reviews)
terms = vectorizer.get_feature_names_out()

tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
tf_idf = tf_idf.sum(axis=1)

score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)
score.head(50)

Unnamed: 0,score
dont waste your money,179
out of the box,132
do not buy this,108
not worth the money,96
bought this for my,78
would not recommend this,70
it out of the,60
get what you pay,54
you get what you,54
right out of the,52
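
Several rows in this output ("out of the box", "it out of the", "right out of the") are overlapping windows of the same underlying phrase, so one complaint can occupy multiple rows. A minimal sketch with made-up reviews shows how that happens; `four_grams` is a hypothetical helper that splits on whitespace only.

```python
from collections import Counter

def four_grams(text):
    """Return the contiguous 4-word windows of a whitespace-tokenised string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 4]) for i in range(len(tokens) - 3)]

# Hypothetical reviews that describe the same complaint
reviews = [
    "broke right out of the box",
    "took it out of the box and it broke",
]

counts = Counter(g for r in reviews for g in four_grams(r))
print(counts["out of the box"])
print(counts["right out of the"], counts["it out of the"])
```

When reading the table for a stakeholder, it is reasonable to treat such overlapping rows as a single issue rather than as separate findings.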


## Similarity and Word Embeddings

Using
* `CountVectorizer`
* `TfidfVectorizer`

Identify the most similar reviews from the McDonalds Yelp dataset.

In order to create an effective and succinct analysis, we are reusing relevant code from the previous analysis.

In [18]:
import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity

def scrape_menu(category):
    """Fetch one full-menu category page and return its cleaned item names."""
    url=f'https://www.mcdonalds.com/us/en-us/full-menu/{category}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text,'html.parser')
    items=[i.get_text() for i in soup.find_all('span',{'class':'categories-item-name'})]
    return cleaner(items)

cleaned_burger_menu = scrape_menu('burgers')               #Burger Menu list
cleaned_drink_menu = scrape_menu('drinks')                 #Drinks Menu list
cleaned_breakfast_menu = scrape_menu('breakfast')          #Breakfast Menu list
cleaned_dessert_menu = scrape_menu('desserts-and-shakes')  #Dessert Menu list
cleaned_cafe_menu = scrape_menu('mccafe')                  #McCafe Menu list

In [19]:
#Tokenize each sentence, and remove stopwords
def token_rm_stopword(data : list):
    print('Tokenizing and removing stopwords...')
    # Build the stopword set once; calling stopwords.words() for every token is very slow
    english_stopwords=set(stopwords.words('english'))
    temp_list=[]
    for i in tqdm(range(len(data)),desc='Loading...'):
        k=data[i]
        tok=nltk.word_tokenize(k)
        rem_stop_tok=[t for t in tok if t not in english_stopwords]
        temp_list.append(" ".join(rem_stop_tok))
    return temp_list 
def lemmatize(data : list):
    print('Recommended to use after tokenization and stopword removal')
    print('lemmatizing each sentences')
    lemmatizer = WordNetLemmatizer()
    res_list=[]
    for i in tqdm(range(len(data)),desc='Loading...'):
        k=data[i]
        temp=k.split(" ")
        res_list.append(" ".join([lemmatizer.lemmatize(t) for t in temp]))
    return res_list
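import difflib

def spellcheck_word(data:list):
    """Return one spell-corrected string per input line, using 20k.txt as the vocabulary."""
    print("Checking spellings for each line")
    # 20k.txt is expected to hold one known word per line
    with open("20k.txt") as f:
        words = set(line.strip() for line in f)
    res_lines=[]
    for i in tqdm(range(len(data)),desc='Loading'):
        new_tokens = []
        for token in word_tokenize(data[i]):
            # Known words are kept as they are, without searching for close matches
            if token.lower() in words:
                new_tokens.append(token)
                continue
            matches = difflib.get_close_matches(token.lower(), words, n=1, cutoff=0.7)
            new_tokens.append(matches[0] if matches else token)
        # Collect every corrected line, not only the last one
        res_lines.append(" ".join(new_tokens))
    return res_lines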

### Cleaning data 

In [20]:
text = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding="ISO-8859-1")
mcd = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding="ISO-8859-1")

In [21]:
#Customized stopwords
nltk_stopwords=list(set(stopwords.words('english')))
new_stopwords=["i'm", 'hangout', 'spot', 'across', 'before', 'just', 'grab', 'spot',\
                                                    'filled', 'deal','little','having']
nltk_stopwords.extend(new_stopwords)

#Redundant word list & Dictionary
hamburger = ['hamburgers','burger','burgers','cheese burger']+cleaned_burger_menu
drink = ['drink','drinks','coke']+cleaned_drink_menu
breakfast = ['mcmuffin']+cleaned_breakfast_menu
dessert = ['mcflurry','cone','shake']+cleaned_dessert_menu
cafe = ['coffee']+cleaned_cafe_menu
facility = ['toilet','restroom','dirty','kitchen','cashier','parking','park','chair','chairs','dumpster']
service = ['rude','impolite','sigh','sighed']


red_dict = {
    'hamburger':hamburger,
    'drink': drink,
    'breakfast':breakfast,
    'mcdonalds':['mcd','mcds','mcdonald'],
    'cafe':cafe,
    'facility':facility,
    'service':['service']
}
#I didn't know how to use the regex for multiple replacements, so I used a dictionary method.




# Let's iterate through the reviews, removing punctuation and numbers and creating word tokens with nltk's word_tokenize.
# Next, let's remove our custom stopwords and use regex to catch common concepts
# and map them back to a single concept (mcdonalds, hamburgers, nuggets)

#part of speech logic stolen from: https://www.programiz.com/python-programming/methods/set/update
cleaned_reviews = []
for idx, review in text.iterrows():
    # Clean punctuation
    clean_review = re.sub(r"[^A-Za-z ]",'',review['review'])
    # Tokenize into words and Tag words with part of speech 
    lemmatized_word = []
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(clean_review))  
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatizer = WordNetLemmatizer()
    # lemetize, use part of speech if available
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_word.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_word.append(lemmatizer.lemmatize(word, tag))
    words_clean = []
    for word in lemmatized_word:
        # Map redundant words to their concept; the lookup has to run per token, so it belongs inside this loop
        for concept, variants in red_dict.items():
            if word.lower() in variants:
                word = concept
                break
        word = re.sub(r"(?:mcdonalds?|macdonalds?|mcds?)",'McDonald', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:burgers?|cheeseburgers?|hamburgers?|hamburgersandwiches?)",'hamburger', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:McNuggets?|nuggets?|nugs?)",'nuggets', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:fries?|frys?|french fries?)",'fries', word, flags=re.IGNORECASE)
        # No quote characters inside the pattern: they would be matched literally
        word = re.sub(r"(?:toilet|restroom|dirty|kitchen|cashier|parking|park|chairs?|dumpster)",'facility',\
                      word, flags=re.IGNORECASE)
        if word in new_stopwords:
            continue
        words_clean.append(word)
    cleaned_review = " ".join(words_clean)
    cleaned_reviews.append((cleaned_review))
                      
text['cleaned_reviews']=cleaned_reviews

In [22]:
cleaned_reviews = cleaner(cleaned_reviews)
cleaned_reviews = token_rm_stopword(cleaned_reviews)
cleaned_reviews_lem = lemmatize(cleaned_reviews)

Tokenizing and removing stopwords...


Loading...:   0%|          | 0/1525 [00:00<?, ?it/s]

Recommended to use after tokenization and stopword removal
lemmatizing each sentences


Loading...:   0%|          | 0/1525 [00:00<?, ?it/s]

### Finding Similar reviews

When finding similar reviews, I think it is better to allow some slack in the similarity score, so that the results include more similar reviews as well as the almost identical ones. For this reason, I set the threshold to 0.7. The threshold can be adjusted to suit the research objective in the future.
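
As a rough illustration of what the 0.7 threshold means, the sketch below computes cosine similarity between bag-of-words vectors in pure Python. The mini-reviews are made up and the vectors use single words rather than the 2- and 3-grams used in this notebook, so the numbers are indicative only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

THRESHOLD = 0.7  # same cut-off as in the text

# Hypothetical mini-reviews
r1 = Counter("order wrong no fries in bag".split())
r2 = Counter("order wrong no fries in my bag".split())
r3 = Counter("coffee was cold".split())

print(round(cosine(r1, r2), 3), cosine(r1, r2) >= THRESHOLD)
print(round(cosine(r1, r3), 3), cosine(r1, r3) >= THRESHOLD)
```

A pair that differs by a single word clears the threshold comfortably, while reviews with no shared vocabulary score 0. Lowering the threshold would admit looser paraphrases at the cost of more false matches.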

#### Using Count Vectorizer

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

In [24]:
vectorizer = CountVectorizer(lowercase=True, ngram_range=(2, 3),
                             max_features=12000)
X = vectorizer.fit_transform(cleaned_reviews) 
X = X.toarray()

corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
original_columns = corpus_df.columns # get existing columns


similarity_matrix = pd.DataFrame(cosine_similarity(corpus_df.values))

In [25]:
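# Mask self-matches and exact duplicates (similarity above 0.999) before searching for the nearest neighbour
non_one = similarity_matrix[similarity_matrix <= .999]
empt_dic={}
for i in non_one.columns:
    li = non_one[i]
    best = li.max()      # Series.max skips the masked NaN entries, unlike the builtin max
    if best >= 0.7:
        j = li.idxmax()  # index of the most similar review
        if i <= j:       # report each pair only once
            empt_dic[i]=[j]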

In [26]:
for i in empt_dic.keys():
    print(f'\nThe review number {i}\n\n {mcd.loc[i].review} \n\nis most similar to the review number {empt_dic[i][0]}\n\n{mcd.loc[empt_dic[i][0]].review}')
    print('-----------------------------------')


The review number 139

 Worst McDonalds EVER!!!!!!!!!! They are unable to get an order correct, Went through the drive through last night got home and no Fries in my bag. Got to the window and the woman running the window was talking about her personal life with a co worker laughing it up with her back t us making us wait instead of giving us our ice coffee. Then she seemed to remember she was working and brought us our coffee. Last time I went there I ordered something with no mayo which my tix said... got home and covered in mayo. called up to complain and the response...What do u want me to do about it??? If I get the craving for Mcdonald's I will drive out of my way.. I hope this one goes out of business.... This location is good for nothing 

is most similar to the review number 140

Worst McDonalds EVER!!!!!!!!!! They are unable to get an order correct, Went through the drive through last night got home and no Fries in my bag. Got to the window and the woman running the window w

#### Using TF-IDF

In [27]:
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(2, 3),
                             max_features=12000)
X = vectorizer.fit_transform(cleaned_reviews) 
X = X.toarray()

corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
original_columns = corpus_df.columns # get existing columns


similarity_matrix = pd.DataFrame(cosine_similarity(corpus_df.values))

In [28]:
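# Mask self-matches and exact duplicates (similarity above 0.999) before searching for the nearest neighbour
non_one = similarity_matrix[similarity_matrix <= .999]
empt_dic={}
for i in non_one.columns:
    li = non_one[i]
    best = li.max()      # Series.max skips the masked NaN entries, unlike the builtin max
    if best >= 0.7:
        j = li.idxmax()  # index of the most similar review
        if i <= j:       # report each pair only once
            empt_dic[i]=[j]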

In [29]:
for i in empt_dic.keys():
    print(f'\nThe review number {i}\n\n {mcd.loc[i].review} \n\nis most similar to the review number {empt_dic[i][0]}\n\n{mcd.loc[empt_dic[i][0]].review}')
    print('------------------------')


The review number 139

 Worst McDonalds EVER!!!!!!!!!! They are unable to get an order correct, Went through the drive through last night got home and no Fries in my bag. Got to the window and the woman running the window was talking about her personal life with a co worker laughing it up with her back t us making us wait instead of giving us our ice coffee. Then she seemed to remember she was working and brought us our coffee. Last time I went there I ordered something with no mayo which my tix said... got home and covered in mayo. called up to complain and the response...What do u want me to do about it??? If I get the craving for Mcdonald's I will drive out of my way.. I hope this one goes out of business.... This location is good for nothing 

is most similar to the review number 140

Worst McDonalds EVER!!!!!!!!!! They are unable to get an order correct, Went through the drive through last night got home and no Fries in my bag. Got to the window and the woman running the window w

### Findings

Both the Count and TF-IDF vectorizers produce similar results when finding similar reviews. However, with the threshold held at 0.7, the count vectorizer returns more reviews (4) than the TF-IDF vectorizer (3).
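
One plausible reason for the lower TF-IDF count (an interpretation, not something verified on this dataset) is that IDF shrinks the contribution of terms shared by many reviews, so pairs that overlap mainly on common vocabulary score lower. The sketch below shows the effect on a made-up corpus, using single words and the smoothed IDF formula documented for scikit-learn's defaults.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical corpus: "order" and "wrong" occur in every review
docs = [d.split() for d in [
    "order wrong fries cold",
    "order wrong coffee late",
    "order wrong staff rude",
    "order wrong bag missing",
]]
n = len(docs)
df = Counter(t for d in docs for t in set(d))
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}  # smoothed idf

count_vecs = [Counter(d) for d in docs]
tfidf_vecs = [{t: c * idf[t] for t, c in vec.items()} for vec in count_vecs]

count_sim = cosine(count_vecs[0], count_vecs[1])
tfidf_sim = cosine(tfidf_vecs[0], tfidf_vecs[1])
print(round(count_sim, 3), round(tfidf_sim, 3))
```

The first two reviews share only the corpus-wide words, so their count-based similarity is 0.5 while the TF-IDF similarity is noticeably lower. With a fixed threshold, borderline pairs like this can pass under counts and fail under TF-IDF.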