# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

**Finally, identify the pair of reviews that are the MOST similar after performing all of these steps.**

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

### Contributors:
Suhas Sridhar: 8674826522<br>
Falak Jain: 2274350452

In [1]:
import sys
import pandas as pd
from collections import Counter
import numpy as np
from matplotlib import pyplot as plt
import re
from nltk.corpus import stopwords
import nltk
#nltk.download('stopwords')
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

### Functions

In [2]:
from collections import Counter

def count_words(line, delimiter=" "):
    """Count how often each word occurs in `line`."""
    words = Counter() # instantiate a Counter object called words
    for word in line.split(delimiter):
        words[word] += 1 # increment count for word
    return words

In [3]:
def preprocessing(document):
    document = document.lower()
    document = re.sub(r'[\.]{2,}', " ", document)  # raw string avoids an invalid-escape warning
    document = re.sub('[^a-zA-Z0-9 \n]', '', document)
    return document
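
As a quick check of what `preprocessing` does, the sketch below re-declares the function and runs it on a made-up review. The sample text is an assumption for illustration and is not taken from either dataset.

```python
import re

def preprocessing(document):
    """Lowercase, turn runs of dots into a space, drop other punctuation."""
    document = document.lower()
    document = re.sub(r'[\.]{2,}', ' ', document)
    document = re.sub(r'[^a-zA-Z0-9 \n]', '', document)
    return document

# Hypothetical sample text (not from the dataset)
sample = "Christ-mas gift... it BROKE!!"
cleaned = preprocessing(sample)

# split() hides the double space left behind where the dots were
print(cleaned.split())
```

Because the hyphen is simply deleted, `Christ-mas` and `Christmas` end up as the same token, which is the kind of redundancy the assignment asks us to remove.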

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

## A

In [5]:
# remove the punctuation marks before stemming

In [6]:
# Reading
amzn_df1 = pd.read_csv('../datasets/good_amazon_toy_reviews.txt',header = None)
amzn_df1['Type'] = 'Good'
amzn_df2 = pd.read_csv('../datasets/poor_amazon_toy_reviews.txt', header = None)
amzn_df2['Type'] = 'Poor'
# Concatenating Good and Poor Reviews
amzn_df = pd.concat([amzn_df1,amzn_df2],axis = 0).reset_index(drop=True)
amzn_df = amzn_df.replace(np.nan, '', regex=True)
# Quality Check
assert amzn_df.shape[0] == amzn_df1.shape[0] + amzn_df2.shape[0]
# Deleting redundant variables
del amzn_df1, amzn_df2
# Renaming column
amzn_df.rename(columns = {0:'Review'},inplace = True)
amzn_df.head()

Unnamed: 0,Review,Type
0,Excellent!!!,Good
1,Great quality wooden track (better than some o...,Good
2,my daughter loved it and i liked the price and...,Good
3,Great item. Pictures pop thru and add detail a...,Good
4,I was pleased with the product.,Good


In [7]:
# printing all the stopwords in nltk stopwords variable
print(set(stopwords.words('english')))

{'re', 'on', 'then', 'what', "that'll", 'while', 'before', 'you', 'does', 'not', 'yours', "aren't", 'by', 'are', 'up', 'out', 'same', 'how', 'that', 'where', "weren't", "don't", 'no', 'been', 'the', 'some', "you'll", 'can', 'mightn', "you're", "wasn't", 'was', 'if', 'once', 'all', 'we', 'have', 'those', 'during', "doesn't", 'nor', 't', 'ain', 'both', 'very', 'below', 'from', 'an', 'd', 'whom', 'until', 'who', 'me', 'my', 'into', 'off', 'through', 'each', 'theirs', 'itself', 'wouldn', 'has', "it's", 'only', "isn't", 'its', 'than', 'he', 'himself', 'any', 'o', 'against', 'it', 'were', "didn't", 'm', 'didn', 's', 'most', "won't", "hadn't", 'of', 'doesn', 'as', 'had', 'his', "she's", 'am', 'in', 'so', "should've", 'our', 'now', "wouldn't", 'hadn', 'for', 'and', "needn't", 'she', 'but', 'just', 'themselves', 'wasn', 'isn', 'further', 'hers', 'don', 'yourselves', 'your', 'why', 'few', 'or', 'haven', 'shan', 'again', 'own', 'll', 'did', 'here', 'y', 'because', 'myself', 'to', 'more', 'over', 

Stopwords to remove from nltk list:
- "not","doesn't","didn't": signifies negativity
- "very","too": signifies emphasis

Stopwords to add to nltk list:
- "toy": since all the reviews are about toys

Words to be cleaned using regex cleaning:
- "\\\&#34;": the HTML character reference for a double quote, left unescaped in the raw text
- "\\\<br />": an HTML line-break tag left in the raw text
- "pre-school","preschool"
- "tkx","thnx","thanks","ty"
- "great","awesome","nice","best","perfect","good","amazing"
- "adorable","pretty","good looking","beautiful","cute","stylish","elegant"
- "gift","present"
- "pleased","satisfied","happy"
- "break","broke","broken","damaged","cracked","did not work","defective","fragile","flimsy","dud"
- "terrible","pathetic","bad","cheap"
- "gross","dirty","disgusting"
- "frustrated","disappointed"

In [8]:
# dictionary of regexes
reg_clean = {'good':["great","awesome","nice","best","perfect","good","amazing"],
 'pretty':["adorable","pretty","good looking","beautiful","cute","stylish","elegant"],
 'gift':["gift","present"],
 'happy':["pleased","satisfied","happy"],
 'defective':["break","broke","broken","damaged","cracked","did not work","defective","fragile","flimsy","dud"],
 'bad':["terrible","pathetic","bad","cheap"],
 'dirty':["gross","dirty","disgusting"],
 'disappointed':['frustrated','disappointed']}
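
A minimal sketch of how a dictionary like this collapses several variants into one canonical token. The shortened dictionary, the helper name `collapse_synonyms`, and the sample sentence are all assumptions made for illustration.

```python
import re

# Hypothetical, shortened version of the synonym dictionary
synonym_groups = {
    "good": ["great", "awesome", "nice"],
    "defective": ["broke", "broken", "did not work"],
}

def collapse_synonyms(text, groups):
    """Replace every variant in a group with the group's canonical token."""
    for canonical, variants in groups.items():
        # Try longer variants first so "broken" is not matched as "broke"
        ordered = sorted(variants, key=len, reverse=True)
        alternatives = "|".join(re.escape(v) for v in ordered)
        # \b keeps the match on whole words only
        text = re.sub(r"\b(?:" + alternatives + r")\b", canonical, text)
    return text

result = collapse_synonyms(
    "awesome truck but it broke and the remote did not work", synonym_groups
)
print(result)
```

Multi-word variants such as `did not work` only match if this step runs before stopword removal, since that step would otherwise change the phrase.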


In [9]:
# stop words to remove from default stopwords list and to add to list
stops = ["not","doesn't","didn't","very","too"]
to_add = ["toy"]
# creating final list of stopwords
my_stopwords = list(set(stopwords.words('english')) - set(stops) | set(to_add))
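
The expression above relies on Python evaluating `-` before `|` for sets. A small sketch with stand-in sets (the words are assumptions, not the full NLTK list) makes the order explicit:

```python
# Stand-in sets for illustration; the real default list comes from NLTK
default_stops = {"not", "very", "the", "and"}
keep_in_text = {"not", "very"}   # words we want to keep in the reviews
domain_stops = {"toy"}           # extra words we want to drop

# `-` binds tighter than `|`, so both lines build the same set
combined = default_stops - keep_in_text | domain_stops
explicit = (default_stops - keep_in_text) | domain_stops

print(sorted(combined))
```

Writing the parentheses out, as in `explicit`, is arguably the safer style, since the intent no longer depends on operator precedence.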

In [10]:
# preprocessing dataframe
amzn_df['Review'] = amzn_df['Review'].str.lower()
amzn_df['Review'] = amzn_df['Review'].str.replace(r'[\.]{2,}', ' ',regex = True)
# match the full HTML remnants so that ordinary numbers and words are left alone
amzn_df['Review'] = amzn_df['Review'].str.replace(r'&#34;', ' ',regex = True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'<br\s*/?>', ' ',regex = True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'\bpre-school', 'preschool',regex = True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'\btks\b', 'thanks',regex = True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'[^a-zA-Z0-9 \n]', '',regex = True)
for k in reg_clean:
    amzn_df['Review'] = amzn_df['Review'].str.replace(r'\b(' + r'|'.join(reg_clean[k]) + r')\b\s*', str(k+' '),regex = True)
amzn_df.head()

Unnamed: 0,Review,Type
0,excellent,Good
1,good quality wooden track better than some oth...,Good
2,my daughter loved it and i liked the price and...,Good
3,good item pictures pop thru and add detail as ...,Good
4,i was happy with the product,Good


In [11]:
#removing all stopwords from dataframe
amzn_df['Review'] = amzn_df['Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (my_stopwords)]))
amzn_df.head()

Unnamed: 0,Review,Type
0,excellent,Good
1,good quality wooden track better others tried ...,Good
2,daughter loved liked price came rather shoppin...,Good
3,good item pictures pop thru add detail painted...,Good
4,happy product,Good


In [12]:
# performing stemming
stemmer = PorterStemmer()
# iterate over x.split() rather than a Counter so that repeated words keep their counts
amzn_df['Review'] = amzn_df['Review'].apply(lambda x: " ".join([stemmer.stem(y) for y in x.split()]))
amzn_df.head()

Unnamed: 0,Review,Type
0,excel,Good
1,good qualiti wooden track better other tri mat...,Good
2,daughter love like price came rather shop ton ...,Good
3,good item pictur pop thru add detail paint dri...,Good
4,happi product,Good


Reasons for choosing stemming over lemmatization:
- Stemming runs faster than lemmatization. Because the dataset is large, we chose stemming so that preprocessing and count vectorization finish sooner.
- Stemming gives higher recall, which matters when measuring similarity between reviews, because different forms of a word with a similar meaning collapse to the same stem.
- Lemmatization would leave more distinct tokens in our corpus, which would require more computing power to process and vectorize.

In [13]:
# performing count vectorization
reviews: pd.Series = amzn_df["Review"]
vectorizer = CountVectorizer(max_features=1000) 
X = vectorizer.fit_transform(reviews)
X = X.toarray()
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,10,100,11,12,13,14,15,18,1st,20,...,ye,year,yellow,yet,yo,youll,young,younger,your,yr
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114879,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
114880,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
114881,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
114882,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
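
The last step of Part A, finding the most similar pair of reviews, could be sketched as follows. This is a dependency-free illustration that assumes cosine similarity over raw term counts; the helper names and the four toy reviews are made up for the example, and the brute-force loop is not meant for the full dataset.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty review is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar_pair(documents):
    """Return ((i, j), score) for the highest-scoring pair of documents."""
    vectors = [Counter(doc.split()) for doc in documents]
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(len(vectors)), 2):
        score = cosine_similarity(vectors[i], vectors[j])
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score

# Hypothetical, already-stemmed reviews
toy_reviews = [
    "happi product",
    "good qualiti wooden track",
    "good qualiti track",
    "daughter love price",
]
pair, score = most_similar_pair(toy_reviews)
print(pair, round(score, 3))
```

With 114,883 reviews there are over six billion pairs, so on the real matrix this comparison would need sparse matrix operations or batching rather than a Python loop. Identical reviews score exactly 1.0, so duplicates may need to be dropped first if the goal is the most similar pair of distinct texts.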


## B

In [14]:
stops= ['mcdonalds','McD',"mcdonald's",'mcds','mcd']
# use a new name so the imported nltk `stopwords` corpus is not overwritten
mcd_stopwords = list(set(stopwords.words('english'))  - set(["through"]) | set(stops))

In [15]:
mcd_df = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv',encoding = 'latin1')

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
mcd_df = mcd_df.replace(np.nan, '', regex=True)

In [18]:
# re.escape guards against stopwords that contain regex metacharacters
pattern = re.compile(r'\b(' + r'|'.join(map(re.escape, mcd_stopwords)) + r')\b\s*')
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

Treating each sentence as a document

In [19]:
# collect one record per sentence, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
records=[]
for row in mcd_df.index:
    review=mcd_df.loc[row,'review']
    sent_text = nltk.sent_tokenize(review)
    for i,sentence in enumerate(sent_text):
        records.append({'_unit_id':mcd_df.loc[row,'_unit_id'],
        'city':mcd_df.loc[row,'city'],
        'sentence_id':i,
        'sentence':sentence})
df2=pd.DataFrame(records,columns=['_unit_id', 'city', 'sentence_id','sentence'])

### Count vectorization without stemming/lemmatization

In [20]:
df2_reviews:pd.Series=df2['sentence']

In [21]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df2_reviews) 
X = X.toarray()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,00,000,00am,00my,00pm,01,0200,03pm,04,04am,...,zax,zee,zeke,zero,zesty,zip,zombie,zombies,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9721,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9722,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Just performing count vectorization gives 8379 features
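
The feature count is simply the size of the vocabulary the vectorizer builds. The sketch below approximates that with the standard library, assuming `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps runs of two or more word characters). The helper name and the sample sentences are made up for illustration.

```python
import re

# Default CountVectorizer token pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vocabulary(documents):
    """Return the sorted set of distinct tokens across all documents."""
    vocab = set()
    for doc in documents:
        vocab.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(vocab)

# Hypothetical sentences, each treated as one document
sentences = [
    "I waited 20 minutes.",
    "The fries were cold, the staff was rude.",
]
features = vocabulary(sentences)
print(len(features), features)
```

With a fitted vectorizer the same number can be read directly from the matrix as `X.shape[1]`. Note that single-character tokens such as `I` are dropped by the default pattern, and that numbers such as `20` count as features.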

### Stemming & Count Vectorization

In [22]:
for row in df2.index:
    document=df2.loc[row,'sentence']
    document=preprocessing(document)
    # stem every token; split() keeps repeated words so their counts survive
    stemmed_sentence=[stemmer.stem(word) for word in document.split()]
    df2.loc[row,'sentence']=" ".join(stemmed_sentence)

In [23]:
df2.head()

Unnamed: 0,_unit_id,city,sentence_id,sentence
0,679455653,Atlanta,0,im not a huge mcd lover but ive been to better...
1,679455653,Atlanta,1,thi is by far the worst one ive ever been too
2,679455653,Atlanta,2,it filthi insid and if you get drive through t...
3,679455653,Atlanta,3,the staff is terribl unfriendli and nobodi see...
4,679455654,Atlanta,0,terribl custom servic


In [24]:
df2_reviews:pd.Series=df2['sentence']

In [25]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df2_reviews) 
X = X.toarray()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())

In [26]:
corpus_df

Unnamed: 0,0200,040,04202014,045wtf,049,053,054,05if,0600,076,...,zak,zax,zee,zeke,zero,zesti,zip,zombi,zombievampirewerewolf,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9721,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9722,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Lemmatization & count vectorization

In [27]:
df3=pd.DataFrame(columns=['_unit_id', 'city', 'sentence_id','sentence'])

In [28]:
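# collect the records in a list first; DataFrame.append was removed in pandas 2.0
records=[]
for row in mcd_df.index:
    review=mcd_df.loc[row,'review']
    sent_text = nltk.sent_tokenize(review)
    for i,sentence in enumerate(sent_text):
        records.append({'_unit_id':mcd_df.loc[row,'_unit_id'],
        'city':mcd_df.loc[row,'city'],
        'sentence_id':i,
        'sentence':sentence})
df3=pd.DataFrame(records,columns=['_unit_id', 'city', 'sentence_id','sentence'])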

In [29]:
for row in df3.index:
    document=df3.loc[row,'sentence']
    document=preprocessing(document)
    sentence = lemmatize_sentence(document)
    df3.loc[row,'sentence']=sentence

In [30]:
df3_reviews:pd.Series=df3['sentence']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df3_reviews) 
X = X.toarray()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
corpus_df

Unnamed: 0,0200,040,04202014,045wtf,049,053,054,05if,0600,076,...,zak,zax,zee,zekes,zero,zesty,zip,zombie,zombievampirewerewolf,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9721,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9722,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Lemmatization & count vectorization gives 8105 features

### Perform lemmatization, remove stopwords and then count vectorization

In [31]:
for row in df3.index:
    sentence=df3.loc[row,'sentence']
    sentence=pattern.sub('', sentence)
    df3.loc[row,'sentence']=sentence

In [32]:
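# re-read the column so the stopword removal above is sure to be picked up
df3_reviews:pd.Series=df3['sentence']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df3_reviews) 
X = X.toarray()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())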
corpus_df

Unnamed: 0,0200,040,04202014,045wtf,049,053,054,05if,0600,076,...,zak,zax,zee,zekes,zero,zesty,zip,zombie,zombievampirewerewolf,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9721,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9722,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Lemmatization, stopword removal and then count vectorization gives 7968 features