# Homework 2 (Due 6:29pm PST March 29th, 2022): Word Vectorization, Regex Practice, and Similarity

*This homework is done in collaboration with Shao Xuan Chew shaoxuan@usc.edu*

You may work with **one other person on this assignment**. You may also work independently if you prefer.

If you would like to be assigned a partner, message me on Slack and I will pair you with someone.

A. Using the **Amazon Toy Reviews Dataset (both positive and negative)**, **process the reviews**.
This means you should think briefly about:
* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)
* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)
* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 
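
To make the redundancy concern concrete, here is a minimal standard-library sketch. It is not the `CountVectorizer` API: the `vocabulary` and `normalize` helpers and the two sample sentences are made up for illustration. It shows how an intra-word hyphen turns one word into two distinct vocabulary entries, and how a small regex cleanup merges them.

```python
import re

def vocabulary(docs):
    """Return the sorted distinct lowercased, whitespace-separated tokens."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def normalize(doc):
    """Drop a hyphen that sits between two word characters ('Christ-mas' -> 'Christmas')."""
    return re.sub(r"(?<=\w)-(?=\w)", "", doc)

# Invented sample documents, not taken from the dataset
docs = ["Great Christmas gift", "Great Christ-mas gift"]

raw_vocab = vocabulary(docs)
clean_vocab = vocabulary([normalize(d) for d in docs])

print(raw_vocab)    # ['christ-mas', 'christmas', 'gift', 'great']
print(clean_vocab)  # ['christmas', 'gift', 'great']
```

One trade-off of this rule is that it also joins genuine compounds (for example `well-made` becomes `wellmade`), which may or may not be desirable.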


In [1]:
import re
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


In [2]:
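# read the positive reviews, one review per line; the file is closed automatically
with open('../datasets/good_amazon_toy_reviews.txt', 'r') as f:
    positive = f.readlines()

In [3]:
# read the negative reviews the same way
with open('../datasets/poor_amazon_toy_reviews.txt', 'r') as f:
    negative = f.readlines()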

* what stopwords to remove (should you add any custom stopwords to the set? Remove any stopwords?)

We should not remove stopwords that carry negation, such as `doesn't`, `can't`, and `won't`. These words can change the meaning of a review, so dropping them loses information needed to interpret it. Although not strictly necessary, it is also interesting to compare how often these words appear in negative versus positive reviews.
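
As a rough illustration of why negations matter, the sketch below filters an invented review with a tiny hypothetical stopword set (the real NLTK list is much longer) and compares the result with and without the negation word kept.

```python
# Hypothetical mini stopword list, used only for this illustration
STOPWORDS = {"the", "it", "does", "not", "this", "is"}
NEGATIONS = {"not"}

def filter_tokens(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in `stopwords`."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

review = "This toy does not work"  # invented example

all_removed = filter_tokens(review, STOPWORDS)
negation_kept = filter_tokens(review, STOPWORDS - NEGATIONS)

print(all_removed)    # ['toy', 'work']
print(negation_kept)  # ['toy', 'not', 'work']
```

With every stopword removed, the negative review reads like a positive one; keeping `not` preserves the signal.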

In [4]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [5]:
# drop selected words (mostly negations) from the stopword set so they are kept in the reviews
nltk_stopwords = set(stopwords.words('english'))
nltk_stopwords.remove('below')
nltk_stopwords.remove("aren't")
nltk_stopwords.remove('couldn')
nltk_stopwords.remove("couldn't")
nltk_stopwords.remove("didn't")

In [6]:
def remove_stopwords(reviews, nltk_stopwords):
    cleaned_reviews = []

    # This iterates through each of the reviews, splitting the review into distinct tokens
    # Then it checks each token for whether or not it is a stopword, before adding them back into a "cleaned_review"
    for review in reviews:
        words = nltk.word_tokenize(review)
        new_words = []
        for word in words:
            if word.lower() in nltk_stopwords:
                continue
            new_words.append(word)
        cleaned_review = " ".join(new_words)
        cleaned_reviews.append(cleaned_review)
    return cleaned_reviews


In [7]:
positive_cleaned = remove_stopwords(positive, nltk_stopwords)
negative_cleaned = remove_stopwords(negative, nltk_stopwords)
temp = positive_cleaned + negative_cleaned

In [8]:
reviews = pd.DataFrame({'review' : temp, 'positive' : np.zeros(len(temp))})
reviews.iloc[:len(positive_cleaned), 1] = 1

* what regex cleaning you may need to perform (for example, are there different ways of saying `broken` that you need to account for?)

In [9]:
# broken
# group the alternatives so that \b applies to every option, and pass regex=True explicitly
reviews['review'] = reviews['review'].str.replace(r'\b(?:broken|(?:dys|non)-?function(?:ing|al))\b', '_BROKEN_', case=False, regex=True)


In [10]:
# quality
# group the alternatives so that \b applies to both words
reviews['review'] = reviews['review'].str.replace(r'\b(?:quality|build)\b', '_QUALITY_', case=False, regex=True)


In [11]:
# fake
# group the alternatives so that, e.g., "sham" does not match inside "shampoo" or "shame"
reviews['review'] = reviews['review'].str.replace(r'\b(?:fake|counterfeit|sham|rip-?off)\b', '_FAKE_', case=False, regex=True)
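
A note on the alternation patterns used for these replacements: `|` binds more loosely than `\b`, so without a group the word-boundary anchors attach only to the first and last alternatives. The sketch below uses only the standard-library `re` module and an invented sentence to compare an ungrouped pattern with a grouped one.

```python
import re

# Ungrouped: \b applies only to "fake" (left) and "rip-?off" (right)
ungrouped = re.compile(r"\bfake|counterfeit|sham|rip-?off\b", re.IGNORECASE)
# Grouped: \b applies on both sides of every alternative
grouped = re.compile(r"\b(?:fake|counterfeit|sham|rip-?off)\b", re.IGNORECASE)

text = "Not a sham, but the shampoo bottle was a shame"  # invented example

ungrouped_out = ungrouped.sub("_FAKE_", text)
grouped_out = grouped.sub("_FAKE_", text)

print(ungrouped_out)  # Not a _FAKE_, but the _FAKE_poo bottle was a _FAKE_e
print(grouped_out)    # Not a _FAKE_, but the shampoo bottle was a shame
```

The grouped form is stricter, so inflected forms such as `fakes` no longer match unless they are added to the pattern.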


In [12]:
# removing all tokens that contain digits
reviews['review'] = reviews['review'].str.replace(r'\S*\d+\S*', '', regex=True)
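
To see what the digit pattern removes, here is a small standard-library sketch on an invented sentence. It also shows that the substitution leaves extra spaces behind, which whitespace splitting (or `CountVectorizer`'s tokenizer) later absorbs.

```python
import re

DIGIT_TOKEN = re.compile(r"\S*\d+\S*")  # any whitespace-delimited token containing a digit

text = "Bought 2 for $19.99 on 12/25 as a 3rd gift"  # invented example

cleaned = DIGIT_TOKEN.sub("", text)
collapsed = " ".join(cleaned.split())  # collapse the leftover runs of spaces

print(collapsed)  # Bought for on as a gift
```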


In [13]:
# removing all punctuation
reviews['review'] = reviews['review'].str.replace(r'[.,\/#!$%\^&\*;:{}=\-_`~()?]', '', regex=True)


In [14]:
# removing all words with 15 or more letters
reviews['review'] = reviews['review'].str.replace(r'\b[A-Za-z]{15,}\b', '', regex=True)
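
The quantifier `{15,}` means "15 or more", so a word of exactly 15 letters is removed as well. The sketch below, using the standard-library `re` module and invented words, shows that the filter catches run-together noise but also drops a legitimate 15-letter word.

```python
import re

LONG_WORD = re.compile(r"\b[A-Za-z]{15,}\b")

# 14 letters, 15 letters, and 20 letters of run-together noise (invented examples)
text = "recommendation extraordinarily greatgreatgreatgreat"

kept = LONG_WORD.sub("", text).split()

print(kept)  # ['recommendation']
```

Whether losing words like `extraordinarily` is acceptable depends on how common long legitimate words are in the corpus.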


In [15]:
# removing all non-ASCII characters
# source: https://stackoverflow.com/questions/150033/regular-expression-to-match-non-ascii-characters
reviews['review'] = reviews['review'].str.replace(r'[^\x00-\x7F]+', '', regex=True)


* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

I chose lemmatization because it handles irregular forms that suffix stripping cannot. Such forms are common in reviews: words like `better` and `worse` are very likely to appear, and a stemmer cannot reduce either of them to its base form. Performance is also not a constraint, since the dataset is fairly small and we are not transforming text in real time.
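
The difference can be sketched with toy stand-ins. The suffix stripper below is deliberately crude (it is not the Porter algorithm), and the lemma table is a hand-written lookup standing in for a dictionary-based lemmatizer such as WordNet; both are assumptions made only for illustration.

```python
def toy_stem(word):
    """Strip a few common suffixes; a crude stand-in for a rule-based stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup standing in for a dictionary-based lemmatizer
TOY_LEMMAS = {"better": "good", "worse": "bad", "toys": "toy", "played": "play"}

def toy_lemmatize(word):
    """Return the dictionary lemma if one is known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["toys", "played", "better", "worse"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print(stems)   # ['toy', 'play', 'better', 'worse']
print(lemmas)  # ['toy', 'play', 'good', 'bad']
```

Regular inflections come out the same either way; only the dictionary lookup can map the irregular comparatives to their base forms.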

In [16]:
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)
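
The tag conversion above can be read as a prefix lookup. The sketch below reproduces it with a plain dictionary, assuming the WordNet part-of-speech constants are the single-letter strings `'a'`, `'v'`, `'n'`, and `'r'`; the helper name `penn_to_wordnet` is hypothetical. Tags outside the four families (determiners, prepositions, and so on) map to `None`, which is why those tokens are appended unchanged.

```python
# Assumed values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])

tags = ["JJR", "VBD", "NNS", "RB", "DT", "IN"]
mapped = {t: penn_to_wordnet(t) for t in tags}

print(mapped)
# {'JJR': 'a', 'VBD': 'v', 'NNS': 'n', 'RB': 'r', 'DT': None, 'IN': None}
```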


In [17]:
reviews['review'] = reviews['review'].apply(lemmatize_sentence)

Next, **count-vectorize the dataset**. Use the **`sklearn.feature_extraction.text.CountVectorizer`** examples from `Linear Algebra, Distance and Similarity (Completed).ipynb` and `Text Preprocessing Techniques (Completed).ipynb`.

I do not want redundant features - for instance, I do not want `Christmas` and `Christ-mas` to be two distinct columns in your document-term matrix. Therefore, I'll be taking a look to make sure you've properly performed your cleaning, stopword removal, etc. to reduce the number of dimensions in your dataset. 

In [18]:
# CountVectorizer was imported in the first cell
vectorizer = CountVectorizer()

# fit the vocabulary and build the document-term matrix (one row per review, one column per term)
X = vectorizer.fit_transform(reviews['review'])
X = X.toarray()
print(vectorizer.get_feature_names_out())
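
For intuition about what count-vectorization produces, here is a minimal standard-library sketch. It is a stand-in built on `collections.Counter`, not the `CountVectorizer` implementation, and the two documents are invented. Each row is a document, each column is a vocabulary term, and the number of features is the number of columns.

```python
from collections import Counter

docs = ["toy broke fast", "great toy great price"]  # invented examples

# Vocabulary: the sorted set of distinct tokens across all documents
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One Counter per document, then one row of counts per document
counts = [Counter(doc.split()) for doc in docs]
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)       # ['broke', 'fast', 'great', 'price', 'toy']
print(matrix)      # [[1, 1, 0, 0, 1], [0, 0, 2, 1, 1]]
print(len(vocab))  # 5
```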



B. **Stopwords, Stemming, Lemmatization Practice**

Using the **McDonalds Negative Reviews** file from Week 1:
* Count-vectorize the corpus. Treat each sentence as a document.

How many features (dimensions) do you get when you:
* Perform **stemming** and then count-vectorization
* Perform **lemmatization** and then **count-vectorization**.
* Perform **lemmatization**, remove **stopwords**, and then perform **count-vectorization**?

In [19]:
df_q2 = pd.read_csv('../datasets/mcdonalds-yelp-negative-reviews.csv')

In [20]:
stemmer = nltk.stem.porter.PorterStemmer()
def stem_sentence(sentence):
    words = nltk.word_tokenize(sentence)
    new_words = []
    for word in words:
        new_words.append(stemmer.stem(word))
    return ' '.join(new_words)

In [21]:
# stemmed
stemmed = df_q2['review'].apply(stem_sentence)

In [22]:
X = vectorizer.fit_transform(stemmed) 
X = X.toarray()
print('Number of features for %s: %s' % ('stemmed', len(vectorizer.get_feature_names_out())))

Number of features for stemmed: 6447


In [23]:
# lemmatized
lemmatized = df_q2['review'].apply(lemmatize_sentence)

In [24]:
X = vectorizer.fit_transform(lemmatized) 
X = X.toarray()
print('Number of features for %s: %s' % ('lemmatized', len(vectorizer.get_feature_names_out())))

Number of features for lemmatized: 7191


In [25]:
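def remove_stopwords(sentence, stopword_set=set(stopwords.words('english'))):
    # tokenize the sentence and drop every token whose lowercase form is a stopword
    # (the NLTK list is lowercase, so "The" must be lowercased before the lookup)
    words = nltk.word_tokenize(sentence)
    new_words = []
    for word in words:
        if word.lower() in stopword_set:
            continue
        new_words.append(word)
    return ' '.join(new_words)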
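
Because the NLTK stopword list is lowercase, whether the membership test lowercases each token makes a visible difference. The sketch below uses a tiny hypothetical stopword subset and invented tokens to compare the two behaviours; `drop_stopwords` is an illustrative helper, not part of NLTK.

```python
STOPWORDS = {"the", "and", "was"}  # hypothetical subset of a lowercase stopword list

def drop_stopwords(tokens, lowercase):
    """Drop stopwords, optionally lowercasing each token before the lookup."""
    return [t for t in tokens if (t.lower() if lowercase else t) not in STOPWORDS]

tokens = ["The", "fries", "and", "the", "burger", "WAS", "cold"]  # invented example

case_sensitive = drop_stopwords(tokens, lowercase=False)
case_insensitive = drop_stopwords(tokens, lowercase=True)

print(case_sensitive)    # ['The', 'fries', 'burger', 'WAS', 'cold']
print(case_insensitive)  # ['fries', 'burger', 'cold']
```

Since `CountVectorizer` lowercases text by default, a surviving `The` would still become the feature `the`, so a case-sensitive filter may barely reduce the feature count.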

In [26]:
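# lemmatized and stopword removed
# NOTE: the count recorded below (7169) came from a run that applied lemmatize_sentence
# here a second time instead of removing stopwords; re-run the cells to refresh it
lemmatized_and_removed = lemmatized.apply(remove_stopwords)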

In [27]:
X = vectorizer.fit_transform(lemmatized_and_removed) 
X = X.toarray()
print('Number of features for %s: %s' % ('lemmatized_and_removed', len(vectorizer.get_feature_names_out())))

Number of features for lemmatized_and_removed: 7169
