### Collaborated with Hyung-Seok Seo

# McDonald's Yelp Reviews

A. Using the **McDonalds Yelp Review CSV file**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `hamburger` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `hamburgers` and `hamburger` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

## Load and Inspect Data

In [1]:
import re
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Load the reviews; the file is read as ISO-8859-1
mcd = pd.read_csv('mcdonalds-yelp-negative-reviews.csv', encoding="ISO-8859-1")
mcd.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [2]:
# Let's print the first 5 reviews and read them in their entirety
for row in range(5):
    print(mcd['review'].iloc[row], '\n')

I'm not a huge mcds lover, but I've been to better ones. This is by far the worst one I've ever been too! It's filthy inside and if you get drive through they completely screw up your order every time! The staff is terribly unfriendly and nobody seems to care. 

Terrible customer service. I came in at 9:30pm and stood in front of the register and no one bothered to say anything or help me for 5 minutes. There was no one else waiting for their food inside either, just outside at the window.  I left and went to Chickfila next door and was greeted before I was all the way inside. This McDonalds is also dirty, the floor was covered with dropped food. Obviously filled with surly and unhappy workers. 

First they "lost" my order, actually they gave it to someone one else than took 20 minutes to figure out why I was still waiting for my order.They after I was asked what I needed I replied, "my order".They asked for my ticket and the asst mgr looked at the ticket then incompletely filled it.I 

## Cleaning and Stopword removal

Let's clean up the data by checking for and removing garbage characters (this code is reused from my previous homework), and then remove stopwords based on our custom list.

In [3]:
def find_unique_characters(regex, lines):
    """
    Print and return the set of unique characters matched by `regex`
    across a list of strings.
    """
    # Collect every match per review, giving a list of lists of matching characters
    potential_malforms = [re.findall(regex, review) for review in lines]

    # Flatten the list of lists into a set of unique characters
    unique_malforms = {char for review in potential_malforms for char in review}

    print(f"Number of unique potential Malformed Characters: {len(unique_malforms)}, \n\nCandidates: {unique_malforms}")
    return unique_malforms

In [4]:
# These are mostly regular punctuation, plus some "\x97"-style characters that could be artifacts from HTML.
# We will drop all of them, since doing so only consolidates our word counts.

punc = find_unique_characters(r"[\W ]", list(mcd['review']))

Number of unique potential Malformed Characters: 33, 

Candidates: {';', '(', '\\', '!', '.', ':', ']', '^', '+', '/', ',', ' ', '*', '©', '`', '\x8a', '-', '$', '\x97', '=', '~', '#', '\x92', '[', "'", '|', '@', '"', '?', ')', '±', '&', '%'}
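The `\x92`, `\x97` and `\x8a` entries may deserve a second look. One possible explanation (an assumption, not something checked against the CSV itself) is that the file was written as Windows-1252 rather than ISO-8859-1. In that code page these three bytes are ordinary typographic characters: a right single quotation mark (the apostrophe in words like `I'm`), a long dash, and an accented capital S. A minimal sketch using only Python's built-in codecs compares the two decodings:

```python
# Bytes that appeared as "\x92", "\x97" and "\x8a" after reading the file as ISO-8859-1
suspect_bytes = [b"\x92", b"\x97", b"\x8a"]

# Decode each byte under both encodings so the results can be compared side by side
decoded = {
    raw: (raw.decode("iso-8859-1"), raw.decode("cp1252"))
    for raw in suspect_bytes
}

for raw, (latin1_char, cp1252_char) in decoded.items():
    print(f"{raw!r}: ISO-8859-1 -> {latin1_char!r}, Windows-1252 -> {cp1252_char!r}")
```

If that assumption holds, passing `encoding="cp1252"` to `pd.read_csv` would decode these characters directly. Either way, the letters-only regex used later removes them.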


### Find some words we may want to remove with other stopwords

In [5]:
# Let's get the top words so we can find other words we may want to remove as stopwords
tot_words = []
for review in mcd['review']:
    words = re.sub(r"[^A-Za-z0-9 ]", '', review)
    tot_words.extend(word_tokenize(words))

Counter(tot_words).most_common()

[('the', 6220),
 ('I', 4368),
 ('and', 4099),
 ('to', 4015),
 ('a', 3439),
 ('of', 1999),
 ('is', 1913),
 ('was', 1789),
 ('in', 1771),
 ('for', 1644),
 ('my', 1416),
 ('this', 1409),
 ('it', 1402),
 ('that', 1238),
 ('McDonalds', 1209),
 ('they', 1142),
 ('you', 1115),
 ('at', 1018),
 ('have', 949),
 ('not', 903),
 ('on', 902),
 ('but', 839),
 ('me', 836),
 ('with', 812),
 ('order', 807),
 ('food', 807),
 ('The', 752),
 ('are', 707),
 ('one', 672),
 ('get', 662),
 ('be', 640),
 ('so', 624),
 ('there', 588),
 ('up', 568),
 ('here', 565),
 ('had', 564),
 ('just', 537),
 ('time', 512),
 ('go', 505),
 ('or', 499),
 ('out', 489),
 ('drive', 479),
 ('like', 478),
 ('as', 467),
 ('no', 461),
 ('service', 458),
 ('were', 453),
 ('place', 449),
 ('when', 448),
 ('its', 422),
 ('your', 413),
 ('This', 405),
 ('only', 396),
 ('all', 391),
 ('if', 382),
 ('because', 380),
 ('what', 377),
 ('dont', 376),
 ('their', 372),
 ('location', 371),
 ('been', 364),
 ('about', 364),
 ('an', 361),
 ('from', 
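The counts above are case-sensitive, so `the` and `The` (and `this` and `This`) show up as separate entries. A small sketch of the same counting with case folding follows; the token list is a made-up stand-in for `tot_words`, not data from the CSV:

```python
from collections import Counter

# Hypothetical mini-sample standing in for the tokenized reviews
sample_tokens = ["The", "food", "was", "cold", "the", "Food", "This", "this"]

# Counting raw tokens keeps case variants apart
case_sensitive = Counter(sample_tokens)

# Lowercasing first merges them into a single entry per word
case_folded = Counter(token.lower() for token in sample_tokens)

print(case_sensitive.most_common(3))
print(case_folded.most_common(3))
```

Folding case before counting gives a cleaner ranking when the goal is to spot candidate stopwords.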

In [6]:
# Let's add some common words gleaned from our top-words list that shouldn't carry much value

custom_stopwords = set(stopwords.words('english'))
add_words = {'hangout', 'spot', 'across', 'before', 'just', 'grab',
             'filled', 'deal', 'little', 'having'}
custom_stopwords.update(add_words)
print(custom_stopwords)

{"hadn't", 'again', 'than', "won't", 'who', 'up', 'why', 'yours', 'once', 'needn', 'after', 'them', "isn't", 'filled', "should've", 'yourselves', 'had', 'to', 'did', 't', 'i', 'doesn', "shan't", "you've", 'most', 'an', 'you', 'hangout', 'during', 'here', 'not', "she's", 'of', 'spot', 'this', 'mustn', 'hers', 'doing', 'now', 'yourself', 'on', 'from', 'shouldn', "weren't", 'isn', 's', 'myself', 'only', 'few', 'any', 'aren', 'such', 'what', "you'd", 'or', 'while', 'ours', "you'll", "aren't", 'he', 'your', 'for', 'won', 'those', 'does', 'itself', 'their', 'theirs', 'was', 'each', 'herself', "couldn't", 'which', 'out', 'hadn', 'can', 'ourselves', 'do', 'about', 'before', 'a', 'the', 'there', 'when', "that'll", "you're", "it's", "doesn't", 'below', 'm', 'over', 'hasn', "hasn't", 'against', 'under', 'my', "needn't", 'mightn', 'above', 'wouldn', 'will', 'his', 'and', 'other', 've', 'it', 'at', 'deal', 'shan', 'very', 'through', "shouldn't", 'off', 'too', 'having', 'him', 'because', 'be', "wasn

### Remove Stopwords and Group Common Concepts Then Lemmatize

I chose lemmatization over stemming. Since we are not feeding this into any ML models, we do not need to worry about overfitting, and lemmatization has the added benefit of producing output that is easier to read, because we are working with whole words instead of truncated stems.

We will also remove numbers with regex, because otherwise they clutter the data.

In [7]:
# Function to convert an NLTK part-of-speech tag to a WordNet tag
# Source: https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258

def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None
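The function above keys on the first letter of the Penn Treebank tags that `nltk.pos_tag` returns. The same mapping can be sketched as a lookup table. The name `penn_to_wordnet` is hypothetical, and the single letters are written out in place of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` (which hold these values in NLTK) so the sketch runs without any downloads:

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech letter.
# "a", "v", "n", "r" are the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV, spelled out to avoid a dependency here.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(penn_tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or None if unmapped."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1])

# JJR = comparative adjective, VBD = past-tense verb, NNS = plural noun,
# RB = adverb, IN = preposition (no WordNet equivalent)
examples = {tag: penn_to_wordnet(tag) for tag in ["JJR", "VBD", "NNS", "RB", "IN"]}
print(examples)
```

Tags with no WordNet counterpart, such as prepositions and determiners, map to `None`, which is why the lemmatization loop leaves those tokens unchanged.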

In [8]:
# Let's iterate through the reviews, removing punctuation and numbers and creating word tokens with NLTK's word_tokenize.
# Next, we lemmatize, use regex to map common variants back to a single concept
# (mcdonalds, hamburgers, nuggets), and remove our custom stopwords.

# Part-of-speech logic source: https://www.programiz.com/python-programming/methods/set/update
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
cleaned_reviews = []
for review in mcd['review']:
    # Strip punctuation and digits, keeping only letters and spaces
    clean_review = re.sub(r"[^A-Za-z ]", '', review)
    # Tokenize into words and tag each word with its part of speech
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(clean_review))
    wordnet_tagged = [(word, nltk_tag_to_wordnet_tag(tag)) for word, tag in nltk_tagged]
    # Lemmatize, using the part of speech if available
    lemmatized_word = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # If there is no available tag, append the token as is
            lemmatized_word.append(word)
        else:
            # Otherwise use the tag to lemmatize the token
            lemmatized_word.append(lemmatizer.lemmatize(word, tag))
    words_clean = []
    for word in lemmatized_word:
        # Map spelling variants of the same concept onto one canonical token
        word = re.sub(r"(?:mcdonalds?|macdonalds?|mcds?)", 'McDonald', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:burgers?|cheeseburgers?|hamburgers?|hamburgersandwiches?)", 'hamburger', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:McNuggets?|nuggets?|nugs?)", 'nuggets', word, flags=re.IGNORECASE)
        word = re.sub(r"(?:fries?|frys?|french fries?)", 'fries', word, flags=re.IGNORECASE)
        if word in custom_stopwords:
            continue
        words_clean.append(word)
    cleaned_reviews.append(" ".join(words_clean))

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\drpow\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
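The substitutions in the loop use `re.sub` with unanchored patterns, so they also fire inside longer words. The cleaned reviews contain `unfriesndly`, which is `unfriendly` after the embedded `frie` was replaced by `fries`. One way to avoid this, sketched here under the assumption that each token should be mapped only when the whole token is a variant, is to use `fullmatch`. The helper name `normalize_token` is hypothetical; the patterns and canonical forms are the ones from the cell above:

```python
import re

# (pattern, canonical form) pairs, copied from the cleaning loop
CANONICAL_PATTERNS = [
    (re.compile(r"(?:mcdonalds?|macdonalds?|mcds?)", re.IGNORECASE), "McDonald"),
    (re.compile(r"(?:burgers?|cheeseburgers?|hamburgers?|hamburgersandwiches?)", re.IGNORECASE), "hamburger"),
    (re.compile(r"(?:McNuggets?|nuggets?|nugs?)", re.IGNORECASE), "nuggets"),
    (re.compile(r"(?:fries?|frys?|french fries?)", re.IGNORECASE), "fries"),
]

def normalize_token(token):
    """Return the canonical form if the entire token is a known variant, else the token."""
    for pattern, canonical in CANONICAL_PATTERNS:
        # fullmatch succeeds only when the pattern covers the token from start to end
        if pattern.fullmatch(token):
            return canonical
    return token

for token in ["unfriendly", "Cheeseburgers", "McDs", "fries", "snug"]:
    print(token, "->", normalize_token(token))
```

With whole-token matching, `unfriendly` and `snug` pass through untouched, while the genuine variants still collapse to one column each.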


## Count-Vectorize the Reviews as Single Words and as Two-Word N-grams

In [23]:
cleaned_reviews

['Im huge McDonald lover Ive good one This far bad one Ive ever Its filthy inside get drive completely screw order every time The staff terribly unfriesndly nobody seem care',
 'Terrible customer service I come pm stood front register one bother say anything help minute There one else wait food inside either outside window I leave go Chickfila next door greet I way inside This McDonald also dirty floor cover dropped food Obviously fill surly unhappy worker',
 'First lose order actually give someone one else take minute figure I still wait orderThey I ask I need I reply orderThey ask ticket asst mgr look ticket incompletely fill itI ask check see fill correctlyShe act couldnt bother I ask againShe begrudgingly check fact miss something ticketSo minute I finally breakfast biscuit platterAs I leave woman approach identify manager dress awake old tshirt sweat pantsShe say hear happen say shed take care itWell didnt intervene saw I grow annoy incompetence',
 'I see Im one give star Only Sta
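Tokens such as `orderThey`, `itI` and `againShe` in the output come from deleting punctuation outright: when a reviewer omits the space after a period, the two neighbouring words fuse. A minimal sketch of the alternative, replacing each run of non-letters with a single space, using a fragment of the third review:

```python
import re

# Fragment of the third review, which has no space after the period
raw = "I was still waiting for my order.They after I was asked"

# Deleting punctuation fuses the words on either side of it
fused = re.sub(r"[^A-Za-z ]", "", raw).split()

# Replacing each run of non-letters with a space keeps them apart
separated = re.sub(r"[^A-Za-z]+", " ", raw).split()

print(fused)
print(separated)
```

A side effect is that contractions split into pieces (`I'm` becomes `I` and `m`). Fragments such as `m` and `t` are already in NLTK's English stopword list, so they can be dropped in the stopword step.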

In [9]:
vectorizer = CountVectorizer(lowercase=True)
word_vectors = vectorizer.fit_transform(cleaned_reviews)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,aaaaaaaahhhhhhhhhhh,abbreviate,abc,ability,able,abode,abour,about,aboutits,abrams,...,zak,zax,zee,zekes,zero,zestychipotlethai,zip,zombie,zombievampirewerewolf,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
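Columns such as `aaaaaaaahhhhhhhhhhh`, `aboutits` and `zombievampirewerewolf` most likely occur in very few reviews. One common way to trim them is a minimum document-frequency cutoff, and `CountVectorizer` exposes a `min_df` parameter for this purpose. The sketch below reproduces the idea with the standard library only; the documents and the threshold of 2 are made-up illustrations, not values tuned on this dataset:

```python
from collections import Counter

# Hypothetical cleaned documents standing in for `cleaned_reviews`
docs = [
    "order wrong cold fries",
    "order slow service",
    "aaaaaaaahhhhhhhhhhh cold fries order",
]

# Document frequency: the number of documents each term appears in at least once
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

min_df = 2  # assumed threshold; would need tuning on the real corpus

# Keep only the terms that appear in at least `min_df` documents
vocabulary = sorted(term for term, df in doc_freq.items() if df >= min_df)
print(vocabulary)
```

Terms that appear in a single document are dropped, which shrinks the document-term matrix without touching the words that recur across reviews.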


In [10]:
print(list(corpus_df.columns))



In [11]:
# Used the example given in Slack; limited to the top 1,000 bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=1000, lowercase=True)
word_vectors = vectorizer.fit_transform(cleaned_reviews)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,accept cash,act like,after wait,almost always,also one,always busy,always forget,always get,always good,always pack,...,wrong order,year old,yet another,you dont,you get,you know,young lady,youre go,youre look,zero star
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
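With `ngram_range=(2, 2)`, each feature is a pair of adjacent tokens, and `max_features=1000` keeps the 1,000 most frequent pairs across the corpus. The pairing step can be sketched in plain Python as below. The helper name `bigrams` is hypothetical, and it splits on whitespace only, which is a simplification of the vectorizer's own tokenizer:

```python
def bigrams(text):
    """Return the space-joined adjacent word pairs of a whitespace-tokenized string."""
    tokens = text.lower().split()
    # zip pairs each token with its right-hand neighbour
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Fragment of the first cleaned review
print(bigrams("get drive completely screw order every time"))
```

A text with `n` tokens yields `n - 1` bigrams, so the bigram vocabulary grows much faster than the single-word vocabulary, which is why a cap such as `max_features` is useful here.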


# Tale of Two Cities Practice

B. **Stopwords, Stemming, Lemmatization Practice**

Using the `tale-of-two-cities.txt` file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [12]:
# Use a context manager so the file handle is closed after reading
with open("tale-of-two-cities.txt", "r") as tale:
    tale_lines = tale.readlines()
tale_lines

['  IT WAS the best of times, it was the worst of times, it was the\n',
 'age of wisdom, it was the age of foolishness, it was the epoch of\n',
 'belief, it was the epoch of incredulity, it was the season of Light,\n',
 'it was the season of Darkness, it was the spring of hope, it was the\n',
 'winter of despair, we had everything before us, we had nothing\n',
 'before us, we were all going direct to Heaven, we were all going\n',
 'direct the other way- in short, the period was so far like the present\n',
 'period, that some of its noisiest authorities insisted on its being\n',
 'received, for good or for evil, in the superlative degree of\n',
 'comparison only.\n',
 '  There were a king with a large jaw and a queen with a plain face, on\n',
 'the throne of England; there were a king with a large jaw and a\n',
 'queen with a fair face, on the throne of France. In both countries\n',
 'it was clearer than crystal to the lords of the State preserves of\n',
 'loaves and fishes, that things

## Count-vectorize the corpus. Treat each sentence as a document.


In [13]:
# Replace newlines with spaces and join the lines into one string
text_less_newlines = ''
for line in tale_lines:
    text_less_newlines += re.sub(r"\n", ' ', line)




In [14]:
#create sentences which we will treat as documents
sent_text = nltk.sent_tokenize(text_less_newlines)

In [15]:
# Count-vectorize the untreated sentences as a baseline
vectorizer = CountVectorizer(lowercase=True)
word_vectors = vectorizer.fit_transform(sent_text)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
print(f"No Treatment we have {len(corpus_df.columns)} features")


No Treatment we have 9705 features


## Perform **stemming** and then count-vectorization


In [16]:
# Tokenize each sentence into words, stem each word, then rebuild the sentence
stemmer = PorterStemmer()
stemmed_tale = []
for line in sent_text:
    words_clean = []
    for word in nltk.word_tokenize(line):
        stemmed_word = stemmer.stem(word)
        words_clean.append(stemmed_word)
    stemmed_words = " ".join(words_clean)
    stemmed_tale.append(stemmed_words)

In [17]:
vectorizer = CountVectorizer()
word_vectors = vectorizer.fit_transform(stemmed_tale)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
print(f"With Stemming we have {len(corpus_df.columns)} features")

With Stemming we have 6659 features


## Perform **lemmatization** and then **count-vectorization**.

In [18]:
# Tokenize each sentence into words, lemmatize each word, then rebuild the sentence
lemmatizer = WordNetLemmatizer()
lemmed_tale = []
for line in sent_text:
    words_clean = []
    for word in nltk.word_tokenize(line):
        lemmed_word = lemmatizer.lemmatize(word)
        words_clean.append(lemmed_word)
    lemmed_words = " ".join(words_clean)
    lemmed_tale.append(lemmed_words)

In [19]:
vectorizer = CountVectorizer()
word_vectors = vectorizer.fit_transform(lemmed_tale)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
print(f"With Lemmatization we have {len(corpus_df.columns)} features")

With Lemmatization we have 8910 features


## Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [20]:
lemmatizer = WordNetLemmatizer()
lemm_stopwords = set(stopwords.words('english'))
lemmed_tale = []

for line in sent_text:
    words_clean = []
    for word in nltk.word_tokenize(line):
        # Skip stopwords, then lemmatize whatever remains
        if word in lemm_stopwords:
            continue
        lemmed_word = lemmatizer.lemmatize(word)
        words_clean.append(lemmed_word)
    lemmed_words = " ".join(words_clean)
    lemmed_tale.append(lemmed_words)

In [21]:
vectorizer = CountVectorizer()
word_vectors = vectorizer.fit_transform(lemmed_tale)
word_vectors = word_vectors.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
corpus_df = pd.DataFrame(word_vectors, columns=vectorizer.get_feature_names_out())
print(f"With Lemmatization and stopword removal we have {len(corpus_df.columns)} features")

With Lemmatization and stopword removal we have 8897 features
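Removing stopwords lowers the count only from 8910 to 8897. Some of that is expected, since the stopword list is a small fixed vocabulary. A likely further contributor is that the membership test `word in lemm_stopwords` is case-sensitive while NLTK's list is all lowercase: a capitalized stopword at the start of a sentence survives the filter, and `CountVectorizer` then lowercases it, so the feature stays in the vocabulary. A minimal sketch with a small stand-in stopword set (not the full NLTK list) and the opening tokens of the novel:

```python
# Small stand-in for stopwords.words('english'), which is all lowercase
stand_in_stopwords = {"it", "was", "the", "of"}

# Opening tokens of the text, with their original capitalization
tokens = ["IT", "WAS", "the", "best", "of", "times"]

# Case-sensitive check, as in the cell above: uppercase stopwords slip through
case_sensitive = [t for t in tokens if t not in stand_in_stopwords]

# Lowercasing before the check removes every stopword regardless of case
case_insensitive = [t for t in tokens if t.lower() not in stand_in_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Applying `word.lower()` in the membership test would be a one-line change to the loop; the resulting feature count would need to be re-measured rather than assumed.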


## Summary
* With no treatment we have 9705 features
* With stemming we have 6659 features
* With lemmatization we have 8910 features
* With lemmatization and stopword removal we have 8897 features
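To put these counts on a common scale, the reduction relative to the untreated baseline of 9705 features can be computed directly. This is plain arithmetic on the numbers reported by the cells above:

```python
baseline = 9705  # features with no treatment

feature_counts = {
    "stemming": 6659,
    "lemmatization": 8910,
    "lemmatization + stopwords": 8897,
}

# Percentage of baseline features removed by each treatment, to one decimal place
reductions = {
    name: round(100 * (baseline - count) / baseline, 1)
    for name, count in feature_counts.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% fewer features than the baseline")
```

Stemming removes roughly 31% of the features, while lemmatization without part-of-speech tags removes roughly 8%, with or without the stopword step.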