# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
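One way to collapse spelling variants into a single column is to normalize each review before vectorizing. The sketch below is only an illustration under stated assumptions: `normalize_review` is a hypothetical helper (not part of the course notebooks), and joining every hyphenated word is a blunt rule that also turns `well-made` into `wellmade`.

```python
import html
import re

def normalize_review(text: str) -> str:
    """Lowercase a review and collapse common variants into one token."""
    text = html.unescape(text)                          # "&#34;" becomes a double quote
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)              # drop HTML line breaks
    text = re.sub(r"(?<=[a-z])-(?=[a-z])", "", text)    # "christ-mas" -> "christmas"
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remaining punctuation -> space
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace

samples = [
    "Christmas gift!",
    "Perfect Christ-mas gift<br />",
    "&#34;christmas&#34; GIFT",
]
cleaned = [normalize_review(s) for s in samples]
print(cleaned)
```

After this step all three samples contribute to the same `christmas` and `gift` columns instead of creating separate features.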

**Finally, identify the pair of reviews that are the MOST similar after performing all of these steps.**

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?
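The number of features is the number of columns in the document-term matrix, so it can be read from the matrix shape after each variant of the pipeline. The sketch below shows the idea on a made-up three-sentence corpus; `n_features` is a hypothetical helper, and scikit-learn's built-in English stopword list stands in for the NLTK list used in class.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the fries were cold",
    "the staff was rude and the fries were stale",
    "cold fries and rude staff",
]

def n_features(documents, **vectorizer_kwargs):
    """Fit a CountVectorizer and return the number of columns (features)."""
    vectorizer = CountVectorizer(**vectorizer_kwargs)
    matrix = vectorizer.fit_transform(documents)
    return matrix.shape[1]

all_terms = n_features(docs)
without_stopwords = n_features(docs, stop_words="english")
print(all_terms, without_stopwords)
```

The same comparison applies to the stemmed and lemmatized corpora: vectorize each version and compare `matrix.shape[1]`.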

In [1]:
import sys
import pandas as pd
from collections import Counter
import numpy as np
from matplotlib import pyplot as plt
import re
from nltk.corpus import stopwords
import nltk
#nltk.download('stopwords')
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

## A

In [2]:
# remove the punctuation marks before stemming

In [3]:
# Reading
amzn_df1 = pd.read_csv('../datasets/good_amazon_toy_reviews.txt',header = None)
amzn_df1['Type'] = 'Good'
amzn_df2 = pd.read_csv('../datasets/poor_amazon_toy_reviews.txt', header = None)
amzn_df2['Type'] = 'Poor'
# Concatenating Good and Poor Reviews
amzn_df = pd.concat([amzn_df1,amzn_df2],axis = 0)
amzn_df = amzn_df.replace(np.nan, '', regex=True)
# Quality Check
assert amzn_df.shape[0] == amzn_df1.shape[0] + amzn_df2.shape[0]
# Deleting redundant variables
del amzn_df1, amzn_df2
# Renaming column
amzn_df.rename(columns = {0:'Review'},inplace = True)
amzn_df.head()

Unnamed: 0,Review,Type
0,Excellent!!!,Good
1,Great quality wooden track (better than some o...,Good
2,my daughter loved it and i liked the price and...,Good
3,Great item. Pictures pop thru and add detail a...,Good
4,I was pleased with the product.,Good


In [4]:
# printing all the stopwords in nltk stopwords variable
print(set(stopwords.words('english')))

{'won', 'itself', "won't", 'too', "aren't", 'if', 'him', 'the', 'wouldn', "needn't", 'very', 'its', 'now', 'shan', 'which', 's', 'after', 'whom', 'by', 'only', 'into', 'off', 'while', 'we', 'not', 'it', 'out', 'an', "shan't", "mightn't", "shouldn't", "it's", 'few', 'hers', 'over', 'o', 'hadn', 'theirs', "don't", 'couldn', 'just', 'some', 'their', 'be', 'as', 'were', 've', 'i', 'why', 'ma', "wasn't", "you'd", 'so', 'doing', 'me', 'with', 'there', 'against', 'when', 'those', 'they', "you're", 'was', 'own', 'each', 't', 'our', 'how', 'between', 'all', 'here', 'y', 'you', 'further', 'should', 'didn', 'will', 'in', 'these', "hasn't", 'yours', 'herself', 'm', 'at', 'yourself', "she's", 'through', 'doesn', 'mustn', "you'll", 'during', 'who', 'below', 'did', "isn't", 'both', 'nor', 'from', 'themselves', 'and', "weren't", 'where', 'her', 'same', 'under', 'then', 'my', 'ourselves', "wouldn't", 'aren', 'before', 'no', 'don', 'd', 'down', 'what', 'do', "mustn't", 're', 'himself', 'to', 'a', 'hasn'

Words to remove from the NLTK stopword list (so that they stay in the reviews):
- "not","doesn't","didn't": these signal negation
- "very","too": these signal emphasis

Text to be cleaned using regex:
- `&#34;`: the HTML entity for a double quote, left over from the source page
- `<br />`: the HTML line-break tag, left over from the source page
- "pre-school","preschool"
- "tkx","thnx","thanks","ty"

In [5]:
# stop words to remove from default stopwords list
stops= ["not","doesn't","didn't","very","too"]
# creating final list of stopwords
my_stopwords = list(set(stopwords.words('english'))  - set(stops))

In [6]:
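# preprocessing dataframe
amzn_df['Review'] = amzn_df['Review'].str.lower()
# collapse runs of two or more periods (ellipses) into a space
amzn_df['Review'] = amzn_df['Review'].str.replace(r'\.{2,}', ' ', regex=True)
# strip the HTML double-quote entity; a bare '34' would also hit ages and prices
amzn_df['Review'] = amzn_df['Review'].str.replace(r'&#34;', ' ', regex=True)
# strip HTML line breaks; a bare r'br\b' would also clip words ending in "br"
amzn_df['Review'] = amzn_df['Review'].str.replace(r'<br\s*/?>', ' ', regex=True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'\bpre-school', 'preschool', regex=True)
amzn_df['Review'] = amzn_df['Review'].str.replace(r'\btks\b', 'thanks', regex=True)
# drop every remaining character that is not a letter, digit, space, or newline
amzn_df['Review'] = amzn_df['Review'].str.replace(r'[^a-zA-Z0-9 \n]', '', regex=True)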

amzn_df.head()

Unnamed: 0,Review,Type
0,excellent,Good
1,great quality wooden track better than some ot...,Good
2,my daughter loved it and i liked the price and...,Good
3,great item pictures pop thru and add detail as...,Good
4,i was pleased with the product,Good


In [7]:
#removing all stopwords from dataframe
amzn_df['Review'] = amzn_df['Review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (my_stopwords)]))
amzn_df.head()

Unnamed: 0,Review,Type
0,excellent,Good
1,great quality wooden track better others tried...,Good
2,daughter loved liked price came rather shoppin...,Good
3,great item pictures pop thru add detail painte...,Good
4,pleased product,Good


In [8]:
amzn_df.iloc[4,0]

'pleased product'

In [9]:
# count_word function
def count_words(line, delimiter=" "):
    words = Counter() # instantiate a Counter object called words
    for word in line.split(delimiter):
            words[word] += 1 # increment count for word
    return words
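Whether repeated words survive a step matters for count vectorization. Iterating over a `Counter` yields each distinct token once, while `str.split()` keeps every occurrence. A small sketch on a made-up sentence:

```python
from collections import Counter

sentence = "good toy good price"
counts = Counter(sentence.split(" "))

unique_tokens = " ".join(counts)             # iterates over the distinct keys only
all_tokens = " ".join(sentence.split(" "))   # keeps every occurrence

print(unique_tokens)   # good toy price
print(all_tokens)      # good toy good price
print(counts["good"])  # 2
```

If a sentence is rebuilt from the keys of a `Counter`, every term count in the resulting document-term matrix is at most 1.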

In [10]:
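# performing stemming on every token; iterating over count_words(x) would yield
# each distinct word once and erase repeat counts before vectorization
stemmer = PorterStemmer()
amzn_df['Review'] = amzn_df['Review'].apply(lambda x: " ".join(stemmer.stem(y) for y in x.split()))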
amzn_df.head()

Unnamed: 0,Review,Type
0,excel,Good
1,great qualiti wooden track better other tri pe...,Good
2,daughter love like price came rather shop ton ...,Good
3,great item pictur pop thru add detail paint dr...,Good
4,pleas product,Good


In [34]:
# performing count vectorization
reviews: pd.Series = amzn_df["Review"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews) 
X = X.toarray()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
corpus_df

MemoryError: Unable to allocate 29.7 GiB for an array with shape (114884, 34677) and data type int64
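The error comes from `X.toarray()`: a dense array of 114884 x 34677 eight-byte integers needs about 29.7 GiB, while the sparse matrix returned by `fit_transform` stores only the nonzero counts. A possible way forward, sketched below under stated assumptions, is to keep the matrix sparse and compute cosine similarity in row chunks so that only one block is ever densified. `most_similar_pair` and `chunk_size` are hypothetical names, and the four reviews are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def most_similar_pair(documents, chunk_size=256):
    """Return (i, j, cosine) for the most similar pair of distinct documents."""
    counts = CountVectorizer().fit_transform(documents)  # scipy sparse matrix
    unit = normalize(counts, norm="l2", axis=1)          # scale rows to unit length
    n_docs = unit.shape[0]
    cols = np.arange(n_docs)[None, :]
    best = (-1, -1, -1.0)
    for start in range(0, n_docs, chunk_size):
        stop = min(start + chunk_size, n_docs)
        # Cosine similarity of this block of rows against all rows.
        sims = (unit[start:stop] @ unit.T).toarray()
        # Mask self-pairs and pairs already covered (column index <= row index).
        rows = np.arange(start, stop)[:, None]
        sims[cols <= rows] = -1.0
        local_i, j = np.unravel_index(np.argmax(sims), sims.shape)
        if sims[local_i, j] > best[2]:
            best = (start + int(local_i), int(j), float(sims[local_i, j]))
    return best

toy_reviews = [
    "great toy kid love",
    "battery broke one day",
    "great toy kid love much",
    "arrive late",
]
i, j, score = most_similar_pair(toy_reviews, chunk_size=2)
print(i, j, round(score, 3))
```

On the full dataset, exact duplicates (for example, several reviews that reduce to the single word `great`) will score 1.0, so it may be worth dropping duplicate rows first if the goal is a pair of distinct texts.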

## B

In [15]:
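# custom stopwords: brand-name variants that appear throughout this corpus
stops = ['mcdonalds', 'McD', "mcdonald's", 'mcds', 'mcd']
# take "through" out of the NLTK list, then add the custom words; note that this
# rebinds the name `stopwords` from the NLTK corpus reader to a plain list
stopwords = list((set(stopwords.words('english')) - {"through"}) | set(stops))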

In [16]:
mcd_df = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv',encoding = 'latin1')

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
mcd_df = mcd_df.replace(np.nan, '', regex=True)

In [19]:
# one alternation of all stopwords, each escaped, bounded by word boundaries
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b\s*')
stemmer = PorterStemmer()
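A stopword alternation like this has one subtlety: a word boundary also sits between a letter and an apostrophe, so `don` can match inside `don't` and leave `'t` behind. The sketch below, using a short made-up stopword list, shows one way to avoid that by trying longer alternatives first; `stop_list` and `stop_pattern` are illustrative names.

```python
import re

stop_list = ["the", "is", "don", "don't", "mcd"]  # hypothetical short list

# Longest alternatives first, so "don't" is tried before "don".
longest_first = sorted(stop_list, key=len, reverse=True)
stop_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, longest_first)) + r")\b\s*"
)

sentence = "the mcd staff don't care and this is bad"
stripped = stop_pattern.sub("", sentence).strip()
print(stripped)

# For contrast: with "don" listed first, the apostrophe fragment is left over.
naive = re.compile(r"\b(?:don|don't)\b\s*")
leftover = naive.sub("", "don't")
print(repr(leftover))
```

The `\b` anchors also keep `is` from being removed inside `this`. If punctuation is stripped before stopword removal, the apostrophe forms never appear and the ordering no longer matters.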

Treating each sentence as a document

In [20]:
# collect one row per sentence, then build the frame once
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
sentence_rows = []
for row in mcd_df.index:
    review = mcd_df.loc[row, 'review']
    for i, sentence in enumerate(nltk.sent_tokenize(review)):
        sentence_rows.append({'_unit_id': mcd_df.loc[row, '_unit_id'],
                              'city': mcd_df.loc[row, 'city'],
                              'sentence_id': i,
                              'sentence': sentence})
df2 = pd.DataFrame(sentence_rows, columns=['_unit_id', 'city', 'sentence_id', 'sentence'])

Functions

In [21]:
from collections import Counter
def count_words(line, delimiter=" "):
    words = Counter() # instantiate a Counter object called words
    for word in line.split(delimiter):
            words[word] += 1 # increment count for word
    return words

In [22]:
def preprocessing(document):
    document = document.lower()
    # raw strings avoid the invalid-escape warning for '\.'
    document = re.sub(r'[.]{2,}', " ", document)
    document = re.sub(r'[^a-zA-Z0-9 \n]', '', document)
    return document
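To see what this cleaning does to a single sentence, the sketch below repeats the same three steps on a made-up example. Two effects are worth noticing: apostrophes are deleted rather than replaced (so `it's` becomes `its`, and `mcdonald's` would become `mcdonalds`), and an ellipsis followed by a space leaves two consecutive spaces, which produces an empty token when splitting on a single space.

```python
import re

def preprocessing(document: str) -> str:
    # Same steps as the notebook's function, written with raw strings.
    document = document.lower()
    document = re.sub(r"[.]{2,}", " ", document)         # ellipsis -> one space
    document = re.sub(r"[^a-zA-Z0-9 \n]", "", document)  # delete other punctuation
    return document

cleaned = preprocessing("It's filthy... inside!")
tokens_single_space = cleaned.split(" ")   # splits on each individual space
tokens_any_whitespace = cleaned.split()    # splits on runs of whitespace

print(repr(cleaned))
print(tokens_single_space)
print(tokens_any_whitespace)
```

Splitting with `str.split()` and no argument avoids the empty token without changing the cleaning itself.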

In [23]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

Stemming & Count Vectorization

In [24]:
for row in df2.index:
    document = preprocessing(df2.loc[row, 'sentence'])
    # stem every token; iterating over a Counter would drop repeated words
    stemmed_sentence = [stemmer.stem(word) for word in document.split()]
    df2.loc[row, 'sentence'] = " ".join(stemmed_sentence)

In [25]:
df2.head()

Unnamed: 0,_unit_id,city,sentence_id,sentence
0,679455653,Atlanta,0,im not a huge mcd lover but ive been to better...
1,679455653,Atlanta,1,thi is by far the worst one ive ever been too
2,679455653,Atlanta,2,it filthi insid and if you get drive through t...
3,679455653,Atlanta,3,the staff is terribl unfriendli and nobodi see...
4,679455654,Atlanta,0,terribl custom servic


In [26]:
df2_reviews:pd.Series=df2['sentence']

In [28]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df2_reviews) 
X = X.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())

In [33]:
len(pd.Series(amzn_df['Review']))

114884

In [None]:
corpus_df

Lemmatization & count vectorization

In [None]:
df3=pd.DataFrame(columns=['_unit_id', 'city', 'sentence_id','sentence'])

In [None]:
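# collect one row per sentence, then build the frame once
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
sentence_rows = []
for row in mcd_df.index:
    review = mcd_df.loc[row, 'review']
    for i, sentence in enumerate(nltk.sent_tokenize(review)):
        sentence_rows.append({'_unit_id': mcd_df.loc[row, '_unit_id'],
                              'city': mcd_df.loc[row, 'city'],
                              'sentence_id': i,
                              'sentence': sentence})
df3 = pd.DataFrame(sentence_rows, columns=['_unit_id', 'city', 'sentence_id', 'sentence'])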

In [None]:
for row in df3.index:
    # clean the sentence, then lemmatize it using POS tags
    document = preprocessing(df3.loc[row, 'sentence'])
    df3.loc[row, 'sentence'] = lemmatize_sentence(document)

In [None]:
df3_reviews:pd.Series=df3['sentence']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df3_reviews) 
X = X.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
corpus_df

Lemmatization & count vectorization gives 8105 features

### Perform lemmatization, remove stopwords and then count vectorization

In [None]:
for row in df3.index:
    sentence=df3.loc[row,'sentence']
    sentence=pattern.sub('', sentence)
    df3.loc[row,'sentence']=sentence

In [None]:
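# re-read the column so the vectorizer sees the stopword-free sentences
df3_reviews = df3['sentence']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df3_reviews)
X = X.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
corpus_df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
corpus_df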

Lemmatization, stopword removal, and count vectorization give 7968 features