# Homework 2 (Due 6:29pm PST Nov 4th, 2021): Word Vectorization, Regex Practice, and Similarity

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 
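
As one illustration of the regex question above, the sketch below collapses a few spellings of `broken` into a single token before vectorizing. The variant list (`broke`, `brokn`, `borken`) and the helper name `normalize_broken` are assumptions made for this example, not patterns taken from the dataset.

```python
import re

# Hypothetical variants; inspect the real reviews before settling on a pattern.
BROKEN_PATTERN = re.compile(r"\b(?:broke|brokn|borken|broken)\b", flags=re.IGNORECASE)

def normalize_broken(text):
    """Map each listed variant to the single token 'broken'."""
    return BROKEN_PATTERN.sub("broken", text)

print(normalize_broken("Arrived BROKEN, then the wheel broke and the box was borken."))
```

Mapping `broke` to `broken` distorts the grammar, which is acceptable here because a bag-of-words model ignores word order and tense.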

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
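
To see why this matters, the sketch below tokenizes three spellings with the regular expression that `CountVectorizer` documents as its default `token_pattern` (`(?u)\b\w\w+\b`, applied after lowercasing). It uses only the standard library so the effect is easy to inspect; the sample documents and the `normalize` helper are invented for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's documented default

def vocabulary(docs):
    """Return the sorted set of tokens the default pattern would extract."""
    return sorted({tok for doc in docs for tok in TOKEN_PATTERN.findall(doc.lower())})

def normalize(text):
    """Hypothetical cleaning step: merge holiday spellings into one token."""
    return re.sub(r"\b(?:x-?mas|christ-mas)\b", "christmas", text, flags=re.IGNORECASE)

docs = ["Christmas gift", "Christ-mas gift", "Xmas gift"]
print(vocabulary(docs))                          # raw text
print(vocabulary([normalize(d) for d in docs]))  # after cleaning
```

Without cleaning, the hyphenated spelling is split into `christ` and `mas`, so one concept ends up spread over four columns; after cleaning only `christmas` and `gift` remain.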


In [1]:
## import libraries
import pandas as pd
import numpy as np
import string, re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
# nltk.download('punkt') # A popular NLTK sentence tokenizer
# nltk.download('stopwords') # library of common English stopwords

In [2]:
with open("../datasets/good_amazon_toy_reviews.txt", "r") as f:
    good_txt = f.read()

with open("../datasets/poor_amazon_toy_reviews.txt", "r") as f:
    poor_txt = f.read()

# with open('good_amazon_toy_reviews.txt', 'r') as f:
#     good_txt = f.read()

# with open('poor_amazon_toy_reviews.txt', 'r') as f:
#     poor_txt = f.read()

In [3]:
def clean_text(text):
    """
    Clean a review and return it as a space-joined string of stemmed tokens.
    Strips HTML tags, normalizes year/birthday/holiday phrases, tokenizes,
    lowercases, and removes hashtags, punctuation, stopwords, website links,
    special characters, non-alphanumeric tokens, and single-character tokens.
    """

    # erase html language characters
    html = re.compile(r'<.*?>')
    text = html.sub(r'',text)

    # year phrases
    text = re.sub(r'(\-?yrs?)', ' year', text)
    text = re.sub(r'(\-?years?\-?olds?)', ' year old', text)

    # birthday
    text = re.sub(r'([Bb]\-?[Dd]ays?)', 'birthday', text)

    # holiday words
    text = re.sub(r'([Xx][Mm]as|[Cc]hrist\-[Mm]as)', 'christmas', text)
    text = re.sub(r'([Nn]ew\-[Yy]ears?)', 'new years', text)

    tokens = nltk.word_tokenize(text)

    # Combine stopwords and punctuation, plus extra domain stopwords
    # (buy, bought, purchase, purchased); a set makes lookups fast
    stops = set(stopwords.words("english")) | set(string.punctuation)
    stops |= {'buy', 'bought', 'purchase', 'purchased'}

    # special characters
    s_chars = '¥₽ÏïŰŬĎŸæ₿₪ÚŇÀèÅ”ĜåŽÖéříÿý€ŝĤ₹áŜŮÂ₴ûÌÇšŘúüëÓ₫ŠčÎŤÆÒœ₩öËäøÍťìĈôàĥÝ¢ç“žðÙÊĉŭÈŒÐÉÔĵùÁů„âÄűĴóêĝÞîØòď฿ČÜþňÛ'
    
    # Create PorterStemmer
    stemmer = PorterStemmer()
    
    tokens_no_hashtag = [re.sub(r'#', '', token) for token in tokens]
    tokens_no_stopwords = [token.lower() for token in tokens_no_hashtag if token.lower() not in stops]
    tokens_no_url = [re.sub(r'http\S+', '', token) for token in tokens_no_stopwords]
    tokens_no_url = [re.sub(r'www\S+', '', token) for token in tokens_no_url]
    tokens_no_special_char = [re.sub(r'[{}]'.format(s_chars), '', token) for token in tokens_no_url]
    tokens_no_extra_space = [re.sub(r'\s\s+', '', token) for token in tokens_no_special_char]
    tokens_alnum = [token for token in tokens_no_extra_space if token.isalnum()]
    tokens_stem = [stemmer.stem(token) for token in tokens_alnum]
    tokens_final = [token for token in tokens_stem if len(token) > 1]
    
    return ' '.join(tokens_final)

### Stopwords
- Stopwords should be removed because these very common words carry little meaning for our analysis; keeping them would introduce noise and take up dimensions in the final count-vectorized matrices.
- We added `buy`, `bought`, `purchase`, and `purchased` to the stopword list because, in the context of toy reviews, every review is about buying a product, so an occurrence of one of these words adds no value to the analysis.
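
A minimal sketch of that idea: extend a base stopword set with the four purchase-related words and filter tokens against it. `BASE_STOPS` is a tiny stand-in for `stopwords.words("english")` so the example runs without downloading NLTK data, and the sample review is invented.

```python
# Tiny stand-in for NLTK's English stopword list (an assumption for this sketch).
BASE_STOPS = {"i", "this", "for", "my", "and", "it", "the", "a"}

# Domain-specific additions described above.
CUSTOM_STOPS = {"buy", "bought", "purchase", "purchased"}

STOPS = BASE_STOPS | CUSTOM_STOPS  # a set gives constant-time membership checks

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop any token found in STOPS."""
    return [tok for tok in text.lower().split() if tok not in STOPS]

print(remove_stopwords("I bought this puzzle for my son and it arrived damaged"))
```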

### Stemming vs. Lemmatization

- We chose stemming over lemmatization because this analysis does not need each word's part of speech or its dictionary form. Since we only use the count vectorizer to build a document-term matrix, the stem of each word is enough.
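
For a concrete feel of what stemming produces, the sketch below runs NLTK's `PorterStemmer` on a few words whose stems appear in this notebook's outputs. The word list is chosen for illustration; the stemmer itself needs no extra NLTK downloads.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words picked because their stems show up in the notebook's outputs.
words = ["quality", "durable", "easy", "easier", "easily"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>8} -> {stem}")
```

The stems are not always dictionary words (`qualiti`, `durabl`), and related forms are not always merged: `easi`, `easier`, and `easili` show up as three separate columns in the `poor_mx.columns[500:550]` output further down.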

In [4]:
good_df = pd.DataFrame(good_txt.split('\n'), columns = ['reviews'])
good_df['review_tok'] = good_df.reviews.apply(clean_text)
poor_df = pd.DataFrame(poor_txt.split('\n'), columns = ['reviews'])
poor_df['review_tok'] = poor_df.reviews.apply(clean_text)

In [5]:
good_df.head(3)

|  | reviews | review_tok |
|---|---|---|
| 0 | Excellent!!! | excel |
| 1 | "Great quality wooden track (better than some ... | great qualiti wooden track better other tri pe... |
| 2 | my daughter loved it and i liked the price and... | daughter love like price came rather shop ton ... |


In [6]:
def to_matrix(doc):
    vectorizer = CountVectorizer(binary=True, min_df = 10)
    X = vectorizer.fit_transform(doc) 
    X = X.toarray()
    # get_feature_names() was deprecated in scikit-learn 1.0 and later removed;
    # get_feature_names_out() is its replacement (requires scikit-learn >= 1.0)
    return pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
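
The two non-default arguments used in `to_matrix` can be seen on a toy corpus. This is a small sketch with invented documents: `binary=True` records presence (1) rather than a raw count, and `min_df=2` keeps only terms that occur in at least two documents (the notebook itself uses `min_df=10`).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented three-document corpus for illustration only.
docs = ["great toy great price", "great gift", "broken toy"]

# Default settings: raw counts, every term kept.
default_vec = CountVectorizer()
default_X = default_vec.fit_transform(docs)

# binary=True caps each count at 1; min_df=2 drops terms seen in fewer than 2 docs.
pruned_vec = CountVectorizer(binary=True, min_df=2)
pruned_X = pruned_vec.fit_transform(docs)

print(sorted(default_vec.vocabulary_))  # all five terms
print(sorted(pruned_vec.vocabulary_))   # only terms shared by >= 2 documents
print(pruned_X.toarray())
```

On a corpus of roughly 100,000 reviews, a `min_df` threshold like this is one way to discard rare misspellings and one-off tokens that would otherwise each claim a column.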

In [7]:
good_mx = to_matrix(good_df['review_tok'])
good_mx

Unnamed: 0,10,100,1000,10th,11,11th,12,120,125,12th,...,zelda,zero,zip,ziplock,zipper,zombi,zone,zoo,zoob,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102213,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
102214,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
102215,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
102216,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
poor_mx = to_matrix(poor_df['review_tok'])
poor_mx

Unnamed: 0,10,100,1000,11,12,13,14,15,150,16,...,yesterday,yet,yo,young,younger,youtub,zero,zip,zipper,zombi
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12696,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12699,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# pd.Series(data=[sum(poor_mx[x]) for x in poor_mx.columns], index=poor_mx.columns).sort_values(ascending=False).head(20)

In [10]:
poor_mx.columns[500:550]

Index(['doa', 'doesnt', 'dog', 'doll', 'dollar', 'donat', 'done', 'dont',
       'door', 'dot', 'doubl', 'doubt', 'doug', 'dough', 'download', 'dozen',
       'drain', 'draw', 'drawer', 'dress', 'dri', 'drill', 'drink', 'drive',
       'drone', 'drop', 'duck', 'duct', 'dud', 'due', 'dull', 'dumb', 'duplic',
       'durabl', 'dust', 'dye', 'ear', 'earli', 'earlier', 'earn', 'earth',
       'easi', 'easier', 'easili', 'eat', 'ebay', 'edg', 'edit', 'effect',
       'effort'],
      dtype='object')

B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [11]:
data = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv', encoding="latin1")

In [12]:
data.head()

|  | _unit_id | city | review |
|---|---|---|---|
| 0 | 679455653 | Atlanta | I'm not a huge mcds lover, but I've been to be... |
| 1 | 679455654 | Atlanta | Terrible customer service. I came in at 9:30pm... |
| 2 | 679455655 | Atlanta | First they "lost" my order, actually they gave... |
| 3 | 679455656 | Atlanta | I see I'm not the only one giving 1 star. Only... |
| 4 | 679455657 | Atlanta | Well, it's McDonald's, so you know what the fo... |


In [13]:
data['review_sent'] = data['review'].apply(lambda x: nltk.sent_tokenize(x))
data.head(3)

|  | _unit_id | city | review | review_sent |
|---|---|---|---|---|
| 0 | 679455653 | Atlanta | I'm not a huge mcds lover, but I've been to be... | [I'm not a huge mcds lover, but I've been to b... |
| 1 | 679455654 | Atlanta | Terrible customer service. I came in at 9:30pm... | [Terrible customer service., I came in at 9:30... |
| 2 | 679455655 | Atlanta | First they "lost" my order, actually they gave... | [First they "lost" my order, actually they gav... |


In [14]:
#number of total sentences
sum([len(x) for x in data.review_sent])

9718

In [15]:
#create list of all sentences from the dataset
all_sents = []

for i in range(len(data)):
    row = data.iloc[i]
    for sent in row['review_sent']:
        all_sents.append(sent)

#check
len(all_sents)

9718

In [16]:
#define stemmer/lemmatizer (and relevant functions) and function for count-vectorizing

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer

#stem_sentence function taken from https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
porter = PorterStemmer()
def stem_sentence(sentence):
    # tokenize, stem each word, and re-join with single spaces
    token_words = word_tokenize(sentence)
    stemmed_words = [porter.stem(word) for word in token_words]
    return " ".join(stemmed_words)

#code for below 2 functions taken from lecture
#https://gist.github.com/gaurav5430/9fce93759eb2f6b1697883c3782f30de#file-nltk-lemmatize-sentences-py

lemmatizer = WordNetLemmatizer()
# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)


def output_countvectorized(sents, remove_stopwords=False):
    # optionally drop scikit-learn's built-in English stopwords
    if remove_stopwords:
        vectorizer = CountVectorizer(stop_words="english")
    else:
        vectorizer = CountVectorizer()

    # the sparse matrix already carries the (documents, features) shape,
    # so there is no need to densify it or look up feature names
    X = vectorizer.fit_transform(sents)
    return X.shape

### Overall Count-Vectorize

In [17]:
print('Count-vectorized dimensions overall:', output_countvectorized(all_sents))

Count-vectorized dimensions overall: (9718, 8379)


### Stem, then Count Vectorize

In [18]:
stemmed_sents = [stem_sentence(x) for x in all_sents]
print('Count-vectorized dimensions overall:', output_countvectorized(all_sents))
print('count-vectorized dimensions after stemming:', output_countvectorized(stemmed_sents))

Count-vectorized dimensions overall: (9718, 8379)
count-vectorized dimensions after stemming: (9718, 6445)


**When stemming, then count-vectorizing, there are 6,445 dimensions.**

### Lemmatize, then Count Vectorize

In [19]:
lemmed_sents = [lemmatize_sentence(x) for x in all_sents]
print('Count-vectorized dimensions overall:', output_countvectorized(all_sents))
print('count-vectorized dimensions after lemmatizing:', output_countvectorized(lemmed_sents))

Count-vectorized dimensions overall: (9718, 8379)
count-vectorized dimensions after lemmatizing: (9718, 7188)


**When lemmatizing, then count-vectorizing, there are 7,188 dimensions.**

### Lemmatize, Remove Stopwords, then Count Vectorize

In [20]:
# lemmed_sents = [lemmatize_sentence(x) for x in all_sents]
print('Count-vectorized dimensions overall:', output_countvectorized(all_sents))
print('count-vectorized dimensions after lemmatizing and removing stopwords:', output_countvectorized(lemmed_sents, True))

Count-vectorized dimensions overall: (9718, 8379)
count-vectorized dimensions after lemmatizing and removing stopwords: (9718, 6907)


**When lemmatizing, removing stopwords, then count-vectorizing, there are 6,907 dimensions.**
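
To put the three results side by side, the short sketch below computes how much each pipeline shrinks the vocabulary relative to the 8,379-feature baseline. The feature counts are the ones reported above; the variable names are just for this example.

```python
baseline = 8379  # features with no stemming, lemmatization, or stopword removal

results = {
    "stemming": 6445,
    "lemmatization": 7188,
    "lemmatization + stopword removal": 6907,
}

# Percentage of baseline features eliminated by each pipeline.
reductions = {
    name: round(100 * (baseline - n_features) / baseline, 1)
    for name, n_features in results.items()
}

for name, pct in reductions.items():
    print(f"{name}: {results[name]} features ({pct}% fewer than baseline)")
```

Stemming gives the largest reduction (about 23.1%), lemmatization alone the smallest (about 14.2%), and removing stopwords after lemmatizing drops a further 281 features (about 17.6% below baseline in total).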