# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 
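
As a rough illustration of the cleaning decisions listed above, here is a minimal sketch using only the standard library. The stopword additions and the pattern for variants of `broken` are assumptions chosen for illustration, not a required list; `clean_review` is a hypothetical helper name.

```python
import re

# Hypothetical custom stopwords; tune these after inspecting the corpus.
CUSTOM_STOPWORDS = {"buy", "bought", "purchase", "purchased"}

# Collapse a few surface forms of "break" into one token (illustrative pattern).
BROKEN_RE = re.compile(r"\b(?:broken|broke|breaking|breaks)\b")

def clean_review(text):
    """Lowercase, unify 'break' variants, drop non-letters and custom stopwords."""
    text = text.lower()
    text = BROKEN_RE.sub("break", text)
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation with spaces
    tokens = [t for t in text.split() if t not in CUSTOM_STOPWORDS]
    return " ".join(tokens)

cleaned = clean_review("It BROKE after I bought it; totally broken!")
print(cleaned)
```

Whether irregular forms such as `broke` are handled by a regex like this or by a lemmatizer is one of the trade-offs worth explaining in the notebook.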

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
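
One possible way to keep `Christmas` and `Christ-mas` from becoming separate columns is to normalize the text before tokenizing. The sketch below is an assumption about how that could be done, not the expected solution; `normalize_token_variants` is a hypothetical name, and removing every intra-word hyphen also merges compounds such as `well-made`, which may or may not be desirable.

```python
import re
from collections import Counter

def normalize_token_variants(text):
    """Lowercase and join words split by a hyphen, e.g. 'christ-mas' -> 'christmas'."""
    text = text.lower()
    # Remove a hyphen only when it sits between two letters.
    return re.sub(r"(?<=[a-z])-(?=[a-z])", "", text)

docs = ["Bought for Christmas", "A Christ-mas gift"]

# Count tokens across documents to confirm both spellings land in one feature.
vocab = Counter(
    tok
    for doc in docs
    for tok in re.findall(r"[a-z]+", normalize_token_variants(doc))
)
print(vocab["christmas"])
```

A function like this can be passed to `CountVectorizer` through its `preprocessor` parameter, which replaces the default lowercasing and accent-stripping step.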

In [2]:
import pandas as pd
import numpy as np
import nltk

In [3]:
# Read each file line by line; the context managers close the files afterwards
with open("../datasets/good_amazon_toy_reviews.txt", "r") as good_file:
    goods = good_file.readlines()
with open("../datasets/poor_amazon_toy_reviews.txt", "r") as bad_file:
    bads = bad_file.readlines()

In [4]:
# len(goods) is 102,217
# len(bads) is 12,700
len(bads)

12700

In [5]:
pd.Series(bads).sample(3).values

array(['Not as wide as it appears. It fits acroos a small door.\n',
       '"awful!  The headline may say Kawasaki, but you won\'t receive any Kawasaki die casts!"\n',
       "Its slow doesn't go the speed it says but i gave it to my little brother he likes it.\n"],
      dtype=object)

In [79]:
[x for x in goods if 'one' in x]

['I got this item for me and my son to play around with. The closest relevance I have to items like these is while in the army I was trained in the camera rc bots. This thing is awesome we tested the range and got somewhere close to 50 yards without an issue. Getting the controls is a bit tricky at first but after about twenty minutes you get the feel for it. The drone comes just about fly ready you just have to sync the controller. I am definitely a fan of the drones now. Only concern I have is maybe a little more silent but other than that great buy.<br /><br />*Disclaimer I received this product at a discount for my unbiased review.\n',
 'Awesome customer service and a cool little drone! Especially for the price!\n',
 '"Ended up sending this guy back because I didnt need it, but I\'ve bought one in the past and loved it"\n',
 '"Works well ,quality product but this style of board will charge multiple batteries at the same time SAFELY  ( IF )  ALL BATTERIES ARE OF THE SAME CELL COUNT 

In [82]:
word = 'purchase'
len([x for x in goods if word in x]), len([x for x in bads if word in x])

(3672, 605)
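
The raw counts above are hard to compare directly because the two files differ in size (102,217 good reviews versus 12,700 bad ones), and `word in x` is a substring test, so `'one'` also matches `drone`. The sketch below shows one way to address both points; `share_with_word` and the three-review sample are hypothetical and exist only for illustration.

```python
import re

def share_with_word(reviews, word):
    """Fraction of reviews containing `word` as a whole word (case-insensitive)."""
    if not reviews:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return sum(1 for r in reviews if pattern.search(r)) / len(reviews)

# Hypothetical mini-sample, only to show the substring pitfall.
sample = ["A cool little drone!", "I bought one in the past.", "Great purchase."]
substring_hits = sum("one" in r for r in sample)    # also counts 'drone'
whole_word_share = share_with_word(sample, "one")   # counts only the word 'one'

# Rates implied by the substring counts reported above for 'purchase'.
good_rate = 3672 / 102217
bad_rate = 605 / 12700
print(f"good: {good_rate:.2%}, bad: {bad_rate:.2%}")
```

Expressed as a share of each file, `purchase` is somewhat more common in the bad reviews, which the raw counts alone do not show.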

In [None]:
from nltk import 

#### notes on first part
- from lecture:
    - "If you are working with basic NLP techniques like BOW, Count Vectorizer or TF-IDF(Term Frequency and Inverse Document Frequency) then removing stopwords is a good idea because stopwords act like noise for these methods"
- additional stop words: buy, bought, purchase, purchased, review, reviewed(?)
- `&#34;` (the HTML entity for a double quote) appears in 3203 good reviews and 626 bad
    - remove it? It carries some meaning, but is it needed for what we're doing?
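
If the decision is to get rid of entities like `&#34;`, one option is to decode them and drop the `<br />` tags that also show up in the reviews. This is a minimal sketch using the standard library; `strip_markup` is a hypothetical helper name, and whether the decoded quote characters are then removed is left to the later regex cleaning.

```python
import html
import re

def strip_markup(text):
    """Decode HTML entities and replace <br> tags with a single space."""
    text = html.unescape(text)               # "&#34;" becomes a double quote
    text = re.sub(r"<br\s*/?>", " ", text)   # handles <br>, <br/> and <br />
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace

example = strip_markup("&#34;Great toy&#34;<br /><br />Loved it")
print(example)
```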

In [37]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

porter = PorterStemmer()

In [47]:
porter.stem('break')

'break'

In [None]:
poor_mx

In [None]:
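# Total count of each term across all documents (equivalent to poor_mx.sum());
# pd.Series takes `data=`, not `values=`
pd.Series(data=[sum(poor_mx[x]) for x in poor_mx.columns], index=poor_mx.columns)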

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [8]:
data = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv', encoding="latin1")

In [9]:
data.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [10]:
# nltk.download('punkt')

In [11]:
data['review_sent'] = data['review'].apply(lambda x: nltk.sent_tokenize(x))
data.head(3)

Unnamed: 0,_unit_id,city,review,review_sent
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be...","[I'm not a huge mcds lover, but I've been to b..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...,"[Terrible customer service., I came in at 9:30..."
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave...","[First they ""lost"" my order, actually they gav..."


In [12]:
#number of total sentences
sum([len(x) for x in data.review_sent])

9718

In [13]:
# Flatten the per-review sentence lists into one list of sentences
all_sents = [sent for sents in data['review_sent'] for sent in sents]

#sanity check
len(all_sents)

9718

### Overall Count Vectorize

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

In [15]:
#fit_transform and output format taken from lecture
X = vectorizer.fit_transform(all_sents)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
# (on scikit-learn < 1.0, use get_feature_names() instead)
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

Shape of dataframe is (9718, 8379)
Total number of occurrences: 139451


Unnamed: 0,00,000,00am,00my,00pm,01,0200,03pm,04,04am,...,zax,zee,zeke,zero,zesty,zip,zombie,zombies,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Stem, then Count Vectorize

In [16]:
#stem_sentence function taken from https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

porter = PorterStemmer()
def stem_sentence(sentence):
    """Stem every token in `sentence`; each stem is followed by a space."""
    token_words = word_tokenize(sentence)
    stemmed_tokens = []
    for word in token_words:
        stemmed_tokens.append(porter.stem(word))
        stemmed_tokens.append(" ")
    return "".join(stemmed_tokens)

In [17]:
stem_sentence('I like my dog, but I do not like other people\'s dogs because I had liked someone else\'s dog and then it bit me.')

"i like my dog , but i do not like other peopl 's dog becaus i had like someon els 's dog and then it bit me . "

In [18]:
stemmed_sents = [stem_sentence(x) for x in all_sents]

# vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(stemmed_sents)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

Shape of dataframe is (9718, 6445)
Total number of occurrences: 139462


Unnamed: 0,00,000,00am,00mi,00pm,01,0200,03pm,04,04am,...,zax,zee,zeke,zero,zesti,zip,zombi,zombie,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**When stemming, then count-vectorizing, there are 6,445 dimensions.**

### Lemmatize, then Count Vectorize

In [19]:
# nltk.download('omw-1.4')

In [20]:
#code for this cell taken from lecture
#https://gist.github.com/gaurav5430/9fce93759eb2f6b1697883c3782f30de#file-nltk-lemmatize-sentences-py


from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [21]:
lemmatize_sentence('I like my dog, but I do not like other people\'s dogs because I had liked someone else\'s dog and then it bit me.')

"I like my dog , but I do not like other people 's dog because I have like someone else 's dog and then it bit me ."

In [22]:
lemmed_sents = [lemmatize_sentence(x) for x in all_sents]

X = vectorizer.fit_transform(lemmed_sents)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

Shape of dataframe is (9718, 7188)
Total number of occurrences: 139501


Unnamed: 0,00,000,00am,00my,00pm,01,0200,03pm,04,04am,...,zak,zax,zee,zeke,zero,zesty,zip,zombie,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**When lemmatizing, then count-vectorizing, there are 7,188 dimensions.**

### Lemmatize, Remove Stopwords, then Count Vectorize

In [23]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/akshaybhide/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
# lemmed_sents = [lemmatize_sentence(x) for x in all_sents]

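# Note: stop_words="english" uses scikit-learn's built-in English list,
# not the NLTK stopwords downloaded in the previous cell
vectorizer = CountVectorizer(stop_words="english")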
X = vectorizer.fit_transform(lemmed_sents)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vectorized_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(f"Shape of dataframe is {vectorized_df.shape}")
print(f"Total number of occurrences: {vectorized_df.sum().sum()}")
#print(f"Word counts: {vectorized_df.sum()}")
vectorized_df.head()

Shape of dataframe is (9718, 6907)
Total number of occurrences: 61431


Unnamed: 0,00,000,00am,00my,00pm,01,0200,03pm,04,04am,...,zak,zax,zee,zeke,zero,zesty,zip,zombie,zoom,î_
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**When lemmatizing, removing stopwords, then count-vectorizing, there are 6,907 dimensions.**
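
To put the four results side by side, the short sketch below computes how much each pipeline shrinks the vocabulary relative to the plain count-vectorized baseline. It uses only the feature and occurrence counts printed in the cells above; the variable names are my own.

```python
# Feature counts reported by the cells above.
baseline = 8379
feature_counts = {
    "stem": 6445,
    "lemmatize": 7188,
    "lemmatize + stopwords": 6907,
}

# Reduction in dimensions relative to the plain CountVectorizer run.
reductions = {name: baseline - n for name, n in feature_counts.items()}
for name, n in feature_counts.items():
    print(f"{name}: {n} features, {reductions[name] / baseline:.1%} fewer than baseline")

# Share of token occurrences dropped when stopwords are removed
# from the lemmatized text (139501 occurrences before, 61431 after).
stopword_token_share = 1 - 61431 / 139501
print(f"stopwords account for {stopword_token_share:.1%} of occurrences")
```

Removing stopwords eliminates only 281 columns beyond lemmatization, but those columns account for more than half of all token occurrences.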