# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
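One way to keep variants such as `Christmas` and `Christ-mas` from becoming separate columns is to normalize them before vectorizing. The following is a minimal sketch using only the standard library; the helper name `normalize` and the two sample reviews are made up for illustration.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase the text and join words split by an intra-word hyphen."""
    text = text.lower()
    # "christ-mas" -> "christmas"; only hyphens between two letters are removed
    return re.sub(r"(?<=[a-z])-(?=[a-z])", "", text)

docs = ["Bought this for Christmas", "Perfect Christ-mas gift"]

raw_vocab = Counter(w for d in docs for w in d.lower().split())
clean_vocab = Counter(w for d in docs for w in normalize(d).split())

print(sorted(raw_vocab))    # contains both 'christ-mas' and 'christmas'
print(sorted(clean_vocab))  # contains only 'christmas'
```

Note that this rule also merges genuine compounds (`well-made` becomes `wellmade`), so whether it is appropriate depends on the corpus.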

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?
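Treating each sentence as a document requires splitting the reviews first. A rough sketch of that step, assuming a naive rule (split after `.`, `!` or `?` followed by whitespace); the function name `split_sentences` and the sample review are illustrative only, and `nltk.sent_tokenize` would likely be the more robust choice on real reviews:

```python
import re

def split_sentences(review):
    """Naively split a review after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

review = "Terrible customer service. I waited 20 minutes! Never again."
sentences = split_sentences(review)

for s in sentences:
    print(s)
```

Each element of `sentences` would then be one row (document) in the document-term matrix.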

In [1]:
import nltk
from nltk.corpus import stopwords
import pandas as pd

In [2]:
import pandas as pd
import re
data1=" "
data2=" "

with open("../datasets/poor_amazon_toy_reviews.txt","r") as fp:
    data1=fp.read()
with open("../datasets/good_amazon_toy_reviews.txt","r") as fp:
    data2=fp.read()
    
data1=data1+"\n"
data1=data1+data2
    

In [3]:
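# read one review per line; the context manager closes the file afterwards
with open("../datasets/poor_amazon_toy_reviews.txt", "r") as fp:
    poor_amazon_toy_reviews = pd.DataFrame(fp.readlines(), columns=["Review"])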


In [4]:
poor_amazon_toy_reviews

Unnamed: 0,Review
0,Do not buy these! They break very fast I spun ...
1,Showed up not how it's shown . Was someone's o...
2,You need expansion packs 3-5 if you want acces...
3,"""This was to be a gift for my husband for our ..."
4,Received a pineapple rather than the advertise...
...,...
12695,It's a piece of junk...doesn't charge multiple...
12696,Really small\n
12697,It is contained in glass which is dangerous if...
12698,"""Fake. Not original. Every time my 5 yr old ki..."


In [5]:
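# read one review per line; the context manager closes the file afterwards
with open("../datasets/good_amazon_toy_reviews.txt", "r") as fp:
    good_amazon_toy_reviews = pd.DataFrame(fp.readlines(), columns=["Review"])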

In [6]:
good_amazon_toy_reviews.head()

Unnamed: 0,Review
0,Excellent!!!\n
1,"""Great quality wooden track (better than some ..."
2,my daughter loved it and i liked the price and...
3,Great item. Pictures pop thru and add detail a...
4,I was pleased with the product.\n


In [7]:
#COMBINE GOOD AND POOR REVIEWS
AllReviews = pd.concat([poor_amazon_toy_reviews,good_amazon_toy_reviews])
AllReviews["Review"] = AllReviews["Review"].str.replace("\n", "")

In [8]:
AllReviews.head()

Unnamed: 0,Review
0,Do not buy these! They break very fast I spun ...
1,Showed up not how it's shown . Was someone's o...
2,You need expansion packs 3-5 if you want acces...
3,"""This was to be a gift for my husband for our ..."
4,Received a pineapple rather than the advertise...


In [11]:
nltk_stopwords=set(stopwords.words('english'))

#COUNTING WORDS TO SEE PATTERNS
from collections import Counter
def count_words(lines, delimiter=" "):
    
    words = Counter() # instantiate a Counter object called words
    for line in lines:
        for word in line.split(delimiter):
            if word.lower() not in nltk_stopwords:
                words[word] += 1 # increment count for word
    return words

In [12]:
cntr=count_words(AllReviews["Review"] )

In [13]:
#cntr.most_common()

## A.

### i)
I would remove `didn't`, `couldn't`, and `doesn't` from the NLTK stopword set, depending on the use case: these words can change the meaning of a whole sentence, so keeping them as tokens is helpful in some analyses.

As for additions, I think the following could be added to the list, since they are high-frequency and unlikely to add much value to the analysis in this domain: 'would', 'really', '/><br', 'overall', 'anyway', 'anyways'.

Further additions would most likely be decided by the business use case.
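To illustrate why the negations matter, here is a small sketch. The `base_stopwords` set is a hypothetical miniature stand-in for the NLTK list, and `remove_stopwords` is an illustrative helper, not the code used in this notebook.

```python
# Hypothetical mini stopword list standing in for the NLTK English set
base_stopwords = {"the", "it", "did", "didn't", "this", "was"}
negations = {"didn't", "couldn't", "doesn't"}
custom_additions = {"would", "really", "overall", "anyway", "anyways"}

# Keep negations as tokens, drop the high-frequency filler words
stopword_set = (base_stopwords - negations) | custom_additions

def remove_stopwords(text, stopword_set):
    """Lowercase, split on whitespace, and drop tokens found in stopword_set."""
    return " ".join(w for w in text.lower().split() if w not in stopword_set)

print(remove_stopwords("Overall it didn't work", base_stopwords))  # overall work
print(remove_stopwords("Overall it didn't work", stopword_set))    # didn't work
```

In the cells below, `nltk.word_tokenize` splits contractions into pieces (for example `do` and `n't`), so whether a negation survives also depends on the tokenizer, not only on the stopword list.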
        

In [14]:
#build the stopword set (NLTK English list plus custom additions)

cleaned_reviews=[]
nltk_stopwords=set(stopwords.words('english'))
add_stopwords=['would','really','overall','anyway','anyways']
for word in add_stopwords:
    nltk_stopwords.add(word)


In [15]:
#remove stopwords
for review in AllReviews["Review"]:
    words = nltk.word_tokenize(review)
    new_words = []
    for word in words:
        if word.lower() in nltk_stopwords:
            continue
        new_words.append(word.lower())
    cleaned_review = " ".join(new_words)
    cleaned_reviews.append(cleaned_review)

In [16]:
cleaned_reviews_s=pd.Series(cleaned_reviews)

In [17]:
cleaned_reviews_s[0]

"buy ! break fast spun 15 minutes end flew n't waste money . made cheap plastic cracks . buy poi balls work lot better limited funds ."

## A.

### ii) REGEX CLEANING




In [18]:
#REGEX1
cleaned_reviews_s=cleaned_reviews_s.str.replace(r'\b(exact)\b|\b(match)\b|\b(fit)\b|\b(pleased)\b|\b(perfect)\b',\
                              '_perfect_',regex=True, case=True)

In [19]:
#REGEX2
cleaned_reviews_s=cleaned_reviews_s.str.replace(r'\b(hit)\b|\b(good)\b|\b(great)\b|\b(nice)\b|\b(fantastic)\b|\b(best)\b|\b(excellent)\b|\b(high(ly)? recommend(ed)?)\b|\b(awesome)\b',\
                              '_great_',regex=True, case=True)

In [20]:
#REGEX3
cleaned_reviews_s=cleaned_reviews_s.str.replace(r'\b(purchase(d)?)\b|\b(order(s|ed)?)\b|\b(bought)\b',\
                              '_bought_',regex=True, case=True)

In [21]:
#REGEX4
cleaned_reviews_s=cleaned_reviews_s.str.replace(r'\b(poor(ly|er|est)?)\b|\b(terrible)\b|\b(wors(e|t)?)\b|\b(bad)\b',\
                              '_bad_',regex=True, case=True)

In [22]:
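#REGEX5 (matches both "disappointing" and the misspelling "dissapointing", and both "flimsy" and "filmsy"; the placeholder keeps its original spelling)
cleaned_reviews_s=cleaned_reviews_s.str.replace(r'\b(dis+ap+ointing)\b|\b(too small)\b|\b(misrepresentation)\b|\b(waste(ful)?)\b|\b(broke(n)?)\b|\b(fake(r|st)?)\b|\b(fail(ed)?)\b|\b(flimsy|filmsy)\b|\b(sad(ness|est|er)?)\b',\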
                              '_dissapointing_',regex=True, case=True)
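A note on the alternation patterns used in these cells: each alternative needs a word boundary on both sides, otherwise it can match the beginning of a longer word. A minimal sketch with the standard `re` module (the sample sentence is made up), which also shows that a single non-capturing group keeps the pattern shorter:

```python
import re

# Without a trailing \b, "best" also matches the start of "bestseller"
loose = re.compile(r"\b(best)|\b(excellent)")
# One non-capturing group, anchored on both sides
strict = re.compile(r"\b(?:best|excellent)\b")

text = "a bestseller and the best toy"
loose_result = loose.sub("_great_", text)
strict_result = strict.sub("_great_", text)

print(loose_result)   # a _great_seller and the _great_ toy
print(strict_result)  # a bestseller and the _great_ toy
```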

In [23]:
cleaned_reviews_s

0         buy ! break fast spun 15 minutes end flew n't ...
1            showed 's shown . someone 's old toy . paint .
2         need expansion packs 3-5 want access player ai...
3         `` gift husband new pool . receive color _boug...
4               received pineapple rather advertised s'more
                                ...                        
114912                                             fun game
114913                      `` _great_ kit , well priced ''
114914                                           supposed .
114915          grandson loves playing police figurines…… .
114916                          grandson loves littlebits !
Length: 114917, dtype: object

## A.

### iii)

#### I stemmed the reviews here to maximize reach, increasing recall at the expense of precision.
#### Stemming is also faster, so I preferred it for this larger dataset.

In [24]:
cleaned_reviews_s[0]

"buy ! break fast spun 15 minutes end flew n't _dissapointing_ money . made cheap plastic cracks . buy poi balls work lot better limited funds ."

In [25]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed_reviews=[]
for review in cleaned_reviews_s:
    words = nltk.word_tokenize(review)
    new_words = []
    for word in words:
        new_words.append(stemmer.stem(word))
    stemmed_review = " ".join(new_words)
    stemmed_reviews.append(stemmed_review)


In [26]:
stemmed_reviews_s=pd.Series(stemmed_reviews)

In [27]:
stemmed_reviews_s[0]

"buy ! break fast spun 15 minut end flew n't _dissapointing_ money . made cheap plastic crack . buy poi ball work lot better limit fund ."

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
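# [A-Za-z_] keeps the underscore so placeholder tokens such as _great_ survive as features
vectorizer = CountVectorizer(stop_words="english",token_pattern=r'\b[A-Za-z_]+\b',min_df=0.001)
X = vectorizer.fit_transform(stemmed_reviews_s)   # fit_transform already fits; no separate fit() is needed
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())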
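The placeholder tokens (`_great_`, `_bad_`, and so on) only survive vectorization if the token pattern accepts underscores. A range written `A-z` does so implicitly, because it spans the six ASCII characters between `Z` and `a`, while a letters-only class drops the placeholders entirely. A quick standard-library check (the sample text is made up):

```python
import re

# Characters matched by the range A-z that are not ASCII letters
extras = [chr(c) for c in range(ord("A"), ord("z") + 1)
          if re.fullmatch(r"[A-z]", chr(c)) and not chr(c).isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

loose = re.compile(r"\b[A-za-z]+\b")
letters_only = re.compile(r"\b[A-Za-z]+\b")

text = "_great_ toy"
print(loose.findall(text))         # ['_great_', 'toy']
print(letters_only.findall(text))  # ['toy']
```

Listing the underscore explicitly, as in `[A-Za-z_]`, keeps the placeholders without also admitting brackets, carets, and backticks.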



In [30]:
vectorized_df.head()

Unnamed: 0,_bad_,_bought_,_dissapointing_,_great_,_perfect_,aa,aaa,abil,abl,absolut,...,yesterday,yo,young,younger,youngest,youtub,yr,zero,zip,zipper
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
vectorized_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114917 entries, 0 to 114916
Columns: 1382 entries, _bad_ to zipper
dtypes: int64(1382)
memory usage: 1.2 GB
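The reported memory usage follows directly from the dense shape: every cell is an 8-byte `int64`, zero or not. A short arithmetic check using the numbers from the output above:

```python
rows, cols = 114_917, 1_382   # shape reported by vectorized_df.info()
bytes_per_cell = 8            # int64

dense_bytes = rows * cols * bytes_per_cell
dense_gib = round(dense_bytes / 1024**3, 2)

print(dense_bytes)  # 1270522352
print(dense_gib)    # 1.18
```

That is roughly 1.18 GiB, which is displayed as 1.2 GB. The matrix `X` returned by `fit_transform` is a SciPy sparse matrix that stores only the non-zero counts, so leaving it sparse instead of calling `.toarray()` avoids most of this cost.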


## B.

## MCDONALDS

(Each review is treated as a document.)

In [32]:
import pandas as pd
import re
data=pd.read_csv("../datasets/mcdonalds-yelp-negative-reviews.csv", encoding="latin1")
data.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [33]:
##### SOME BASIC REGEX CLEANING

In [34]:
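#REGEX1 (case=True: only lowercase forms such as "mcds" are replaced; capitalized "McDonald's" is left as-is, pass case=False to catch it)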
data["review"]=data["review"].str.replace(r'\b(mcd(onald(s)?|s)?)\b','_McD_',regex=True, case=True)

In [35]:
#REGEX2
data["review"]=data['review'].str.replace(r'\b(unfriendly)\b|\b(rude(ly|ness)?)\b|\b(unkind(ness)?)\b',\
                                             '_rude_',regex=True, case=True)

In [36]:
#REGEX3
data["review"]=data['review'].str.replace(r'\b(filthy)\b|\b(unclean)\b|\b(dirty)\b',\
                                             '_unclean_',regex=True, case=True)

## 1. STEMMING AND COUNT VECTORIZATION

In [37]:

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stemmed_reviews_mcd=[]
for review in data["review"]:
    words = nltk.word_tokenize(review)
    new_words = []
    for word in words:
        new_words.append(stemmer.stem(word.lower()))         #appending lowercased tokens
    stemmed_review = " ".join(new_words)
    stemmed_reviews_mcd.append(stemmed_review)

In [38]:
stemmed_reviews_mcd_s=pd.Series(stemmed_reviews_mcd)

In [39]:
stemmed_reviews_mcd_s

0       i 'm not a huge _mcd_ lover , but i 've been t...
1       terribl custom servic . i came in at 9:30pm an...
2       first they `` lost '' my order , actual they g...
3       i see i 'm not the onli one give 1 star . onli...
4       well , it 's mcdonald 's , so you know what th...
                              ...                        
1520    i enjoy the part where i repeatedli ask if i h...
1521    worst mcdonald i 've been in in a long time ! ...
1522    when i am realli crave for mcdonald 's , thi s...
1523    two point right out of the gate : 1 . thuggeri...
1524    i want to grab breakfast one morn befor work s...
Length: 1525, dtype: object

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
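# [A-Za-z_] keeps the underscore so placeholder tokens such as _mcd_ survive as features
vectorizer = CountVectorizer(stop_words="english",token_pattern=r'\b[A-Za-z_]{3,}\b')
X = vectorizer.fit_transform(stemmed_reviews_mcd_s)   # fit_transform already fits; no separate fit() is needed
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df1 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())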



In [42]:
vectorized_df1.head()

Unnamed: 0,_and,_mcd_,_rude_,_unclean_,aaaaaaaahhhhhhhhhhh,abbrevi,abc,abil,abl,abod,...,zak,zax,zee,zeke,zero,zesti,zip,zombi,zombie,zoom
0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### 5863 features in total

## 2. LEMMATIZATION AND VECTORIZATION

In [43]:
lowercased__reviews=[]         #LOWERCASING ALL TOKENS
for review in data["review"]:
    words = nltk.word_tokenize(review)
    new_words = []
    for word in words:
        new_words.append(word.lower())
    lowercased_review = " ".join(new_words)
    lowercased__reviews.append(lowercased_review)
lowercased__reviews_s=pd.Series(lowercased__reviews)

In [44]:
## lemmatization
## https://gist.github.com/gaurav5430/9fce93759eb2f6b1697883c3782f30de#file-nltk-lemmatize-sentences-py
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
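The tag mapping matters because `WordNetLemmatizer.lemmatize` treats a word as a noun unless told otherwise; with the adjective tag, for example, `worst` becomes `bad` in the output further down. The mapping itself can be sketched without NLTK, using the single-letter codes that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` stand for. The names `PREFIX_TO_WORDNET` and `treebank_to_wordnet` are illustrative, not part of NLTK.

```python
# Single-letter WordNet POS codes (the values of wordnet.ADJ, VERB, NOUN, ADV)
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    return PREFIX_TO_WORDNET.get(tag[:1])

for tag in ["JJS", "VBD", "NNS", "RB", "PRP", "IN"]:
    print(tag, treebank_to_wordnet(tag))
```

Tags without a WordNet counterpart (pronouns, prepositions, punctuation) map to `None`, and those tokens are passed through unchanged.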




In [45]:
lemm_reviews_mcd=[]
for review in lowercased__reviews_s:
    lemm_review=lemmatize_sentence(review)
    lemm_reviews_mcd.append(lemm_review)

In [46]:
lemm_reviews_mcd_s=pd.Series(lemm_reviews_mcd)

In [47]:
lemm_reviews_mcd_s[0]

"i 'm not a huge _mcd_ lover , but i 've be to good one . this be by far the bad one i 've ever be too ! it 's _unclean_ inside and if you get drive through they completely screw up your order every time ! the staff be terribly _rude_ and nobody seem to care ."

In [48]:
lowercased__reviews_s[0]

"i 'm not a huge _mcd_ lover , but i 've been to better ones . this is by far the worst one i 've ever been too ! it 's _unclean_ inside and if you get drive through they completely screw up your order every time ! the staff is terribly _rude_ and nobody seems to care ."

In [49]:
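# [A-Za-z_] keeps the underscore so placeholder tokens such as _mcd_ survive as features
vectorizer = CountVectorizer(stop_words="english",token_pattern=r'\b[A-Za-z_]{3,}\b')
X = vectorizer.fit_transform(lemm_reviews_mcd_s)   # fit_transform already fits; no separate fit() is needed
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df2 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())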



In [50]:
vectorized_df2

Unnamed: 0,_and,_mcd_,_rude_,_unclean_,aaaaaaaahhhhhhhhhhh,abbreviate,abc,ability,able,abode,...,yuppie,zak,zax,zee,zeke,zero,zesty,zip,zombie,zoom
0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


##### 6380 features in total

## 3. STOPWORDS REMOVAL AND VECTORIZATION (ON LEMMATIZED DATA)

In [51]:
nltk_stopwords=set(stopwords.words('english'))

In [52]:
cleaned_reviews_mcd=[]

for review in lemm_reviews_mcd_s:
    words = nltk.word_tokenize(review)
    new_words = []
    for word in words:
        if word.lower() in nltk_stopwords:
            continue
        new_words.append(word.lower())
    cleaned_review = " ".join(new_words)
    cleaned_reviews_mcd.append(cleaned_review)

In [53]:
cleaned_reviews_mcd_s=pd.Series(cleaned_reviews_mcd)

In [54]:
cleaned_reviews_mcd_s.head()

0    'm huge _mcd_ lover , 've good one . far bad o...
1    terrible customer service . come 9:30pm stand ...
2    first `` lose `` order , actually give someone...
3    see 'm one give 1 star . -25 star ! ! ! 's nee...
4    well , 's mcdonald 's , know food . review ref...
dtype: object

In [55]:
lemm_reviews_mcd_s[2]

"first they `` lose `` my order , actually they give it to someone one else than take 20 minute to figure out why i be still wait for my order.they after i be ask what i need i reply , `` my order `` .they ask for my ticket and the asst mgr look at the ticket then incompletely fill it.i have to ask her to check to see if she fill it correctly.she act as if she could n't be bother with that so i ask her again.she begrudgingly check to she do in fact miss something on the ticket.so after 22 minute i finally have my breakfast biscuit platter.as i leave an woman approach and identify herself as the manager , she be dress as if she have just awake in an old t-shirt and sweat pants.she say she have hear what happen and say she 'd take care of it.well why do n't she intervene when she saw i be grow annoy with the incompetence ?"

In [56]:
cleaned_reviews_mcd_s[2]

"first `` lose `` order , actually give someone one else take 20 minute figure still wait order.they ask need reply , `` order `` .they ask ticket asst mgr look ticket incompletely fill it.i ask check see fill correctly.she act could n't bother ask again.she begrudgingly check fact miss something ticket.so 22 minute finally breakfast biscuit platter.as leave woman approach identify manager , dress awake old t-shirt sweat pants.she say hear happen say 'd take care it.well n't intervene saw grow annoy incompetence ?"

In [57]:
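# [A-Za-z_] keeps the underscore so placeholder tokens such as _mcd_ survive as features
vectorizer = CountVectorizer(stop_words="english",token_pattern=r'\b[A-Za-z_]{3,}\b')
X = vectorizer.fit_transform(cleaned_reviews_mcd_s)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorized_df3 = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())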



In [58]:
vectorized_df3

Unnamed: 0,_and,_mcd_,_rude_,_unclean_,aaaaaaaahhhhhhhhhhh,abbreviate,abc,ability,able,abode,...,yuppie,zak,zax,zee,zeke,zero,zesty,zip,zombie,zoom
0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1522,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1523,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


##### 6378 features in total
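Putting the three counts side by side (a small sketch; the dictionary keys are just labels chosen here):

```python
feature_counts = {
    "stemming": 5863,
    "lemmatization": 6380,
    "lemmatization + NLTK stopwords": 6378,
}

baseline = feature_counts["lemmatization"]
for name, n in feature_counts.items():
    print(f"{name}: {n} features ({n - baseline:+d} vs. lemmatization)")
```

Stemming yields 517 fewer features than lemmatization, which fits the recall-over-precision trade-off discussed in part A. The NLTK stopword pass removes only 2 more features, plausibly because every `CountVectorizer` above is already created with `stop_words="english"`, so only tokens that are on the NLTK list but not on scikit-learn's built-in list are affected.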