# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.

## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [10]:
import pandas as pd
import numpy as np
import re, string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

In [11]:
def clean_text(text):
    """
    Tokenize text into words and convert them to lower case.
    Remove HTML tags, hashtags, punctuation, stopwords, website links, extra spaces,
    non-alphanumeric tokens, and single-character tokens, then stem the remaining tokens.
    """

    # erase html language characters
    html = re.compile(r'<.*?>')
    text = html.sub(r'',text)

    # year phrases
    text = re.sub(r'(\-?yrs?)', ' year', text)
    text = re.sub(r'(\-?years?\-?olds?)', ' year old', text)

    # birthday
    text = re.sub(r'([Bb]\-?[Dd]ays?)', 'birthday', text)

    # holiday words
    text = re.sub(r'([Xx][Mm]as|[Cc]hrist\-[Mm]as)', 'christmas', text)
    text = re.sub(r'([Nn]ew\-[Yy]ears?)', 'new years', text)

    tokens = [token for token in nltk.word_tokenize(text)]
    
    # Combine stopwords and punctuation
    stops = stopwords.words("english") + list(string.punctuation)

    # add custom stopwords (purchase-related verbs)
    stops += ['buy', 'bought', 'purchase', 'purchased']

    # the following cleaning steps are adapted from a previous NLP project

    # special characters
    s_chars = '¥₽ÏïŰŬĎŸæ₿₪ÚŇÀèÅ”ĜåŽÖéříÿý€ŝĤ₹áŜŮÂ₴ûÌÇšŘúüëÓ₫ŠčÎŤÆÒœ₩öËäøÍťìĈôàĥÝ¢ç“žðÙÊĉŭÈŒÐÉÔĵùÁů„âÄűĴóêĝÞîØòď฿ČÜþňÛ'
    
    # Create PorterStemmer
    stemmer = PorterStemmer()
    
    tokens_no_hashtag = [re.sub(r'#', '', token) for token in tokens]
    tokens_no_stopwords = [token.lower() for token in tokens_no_hashtag if token.lower() not in stops]
    tokens_no_url = [re.sub(r'http\S+', '', token) for token in tokens_no_stopwords]
    tokens_no_url = [re.sub(r'www\S+', '', token) for token in tokens_no_url]
    tokens_no_special_char = [re.sub(r'[{}]'.format(s_chars), '', token) for token in tokens_no_url]
    tokens_no_extra_space = [re.sub(r'\s\s+', '', token) for token in tokens_no_special_char]
    tokens_alnum = [token for token in tokens_no_extra_space if token.isalnum()]
    tokens_stem = [stemmer.stem(token) for token in tokens_alnum]
    tokens_final = [token for token in tokens_stem if len(token) > 1]
    
    return ' '.join(tokens_final)

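- Which stopwords to remove? Why remove/keep stopwords? Adding in custom stopwords?
    - removed: NLTK's English stopword list and all punctuation tokens
    - kept: every other token (after lower-casing and stemming)
    - added: `buy`, `bought`, `purchase`, `purchased`

- stemming versus lemmatization?
    - we stemmed with NLTK's `PorterStemmer`; `WordNetLemmatizer` is imported but not used

- regex cleaning and substitution?
    - removed HTML tags
    - normalized phrases such as `year`, `birthday`, and holiday words (`christmas`, `new years`)

- what `n` for your `n-grams`?
    - we chose n = 1 and 2 (unigrams and bigrams)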
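
One thing worth checking in the year substitution of `clean_text`: the pattern `(\-?yrs?)` is unanchored, so it also fires inside longer words. `syrup` becomes `s yearup`, which is the likely source of the `yearup` feature in the TF-IDF matrix further down. Below is a minimal sketch of a guarded variant; the lookaround guards and the helper name `normalize_years` are assumptions for illustration, not part of the original pipeline.

```python
import re

# Original pattern from clean_text: matches "yr"/"yrs" anywhere, even inside words.
UNGUARDED = re.compile(r'(\-?yrs?)')

# Hypothetical guarded variant: only match when "yr"/"yrs" is not
# directly surrounded by other letters.
GUARDED = re.compile(r'(?<![A-Za-z])-?yrs?(?![A-Za-z])')

def normalize_years(text, pattern=GUARDED):
    """Replace 'yr'/'yrs' abbreviations with ' year' using the given pattern."""
    return pattern.sub(' year', text)

print(normalize_years('maple syrup', UNGUARDED))  # unguarded pattern mangles "syrup"
print(normalize_years('maple syrup'))             # guarded pattern leaves it alone
print(normalize_years('my 3yr old dog'))          # abbreviation is still expanded
```

Swapping the pattern would change the token counts and TF-IDF scores reported in this notebook, so the results would need to be regenerated.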

In [12]:
df = pd.read_csv('amazon_fine_foods.csv')
df.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,20983,B002QWP89S,A21U4DR8M6I9QN,"K. M Merrill ""justine""",1,1,5,1318896000,addictive! but works for night coughing in dogs,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,A17TDUBB4Z1PEC,jaded_green,1,1,5,1318550400,genuine Greenies best price,"These are genuine Greenies product, not a knoc..."
2,20985,B002QWP89S,ABQH3WAWMSMBH,tenisbrat87,1,1,5,1317168000,Perfect for our little doggies,"Our dogs love Greenies, but of course, which d..."


In [13]:
df.ProductId.value_counts().head(10)

B007JFMH8M    913
B0026RQTGE    632
B002QWP8H0    632
B002QWHJOU    632
B002QWP89S    632
B003B3OOPA    623
B001EO5Q64    567
B000VK8AVK    564
B007M83302    564
B001RVFEP2    564
Name: ProductId, dtype: int64

In [14]:
## limiting our data to just 3 products
df = df[df.ProductId.isin(['B007JFMH8M', 'B0026RQTGE', 'B003B3OOPA'])]
df.shape

(2168, 10)

### Getting the word counts from summary and review text

In [15]:
df['summary_tok'] = df.Summary.apply(clean_text)
df['text_tok'] = df.Text.apply(clean_text)

In [16]:
def word_count(text):
    # count how often each whitespace-separated token occurs
    counts = dict()
    words = text.split()

    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    return counts
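
The same count can be obtained from the standard library. This is a small sketch using `collections.Counter`; the function name `word_count_counter` and the sample string are made up for illustration.

```python
from collections import Counter

def word_count_counter(text):
    """Count whitespace-separated tokens, mirroring word_count."""
    return Counter(text.split())

counts = word_count_counter('love great cooki love dog love')
print(counts.most_common(1))  # most frequent token with its count
```

`Counter.most_common(n)` also removes the need to build a DataFrame just to sort the counts.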

In [17]:
# summary
sum_l = df['summary_tok'].astype(str).values.tolist()
summary_text = ' '.join(sum_l)
summary_df = pd.DataFrame.from_dict(word_count(summary_text), orient='index',columns=['words'])
summary_df.sort_values(by = 'words', ascending=False).head(20)

Unnamed: 0,words
love,307
great,307
cooki,245
dog,208
good,199
greeni,180
soft,174
yummi,158
oil,128
coconut,119


In [18]:
# review
review_l = df['text_tok'].astype(str).values.tolist()
review_text = ' '.join(review_l)
review_df = pd.DataFrame.from_dict(word_count(review_text), orient='index',columns=['words'])
review_df.sort_values(by = 'words', ascending=False).head(40)

Unnamed: 0,words
cooki,1677
love,1247
use,982
oil,944
dog,837
coconut,803
product,798
soft,787
great,766
like,749


### Newly added stopwords
* product / products
* amazon
* package
* want, made, would, could, give, said, say
* realli
* lol

### Regex substitution
* like / good / great / love / nice as POSITIVE
* bake / cookie / oatmeal / quaker as FOOD_PRODUCT_1
* dog / greenie / treat as FOOD_PRODUCT_2
* coconut oil as FOOD_PRODUCT_3
* numbers (170, 20, ...) as NUMBER
* tasty / delicious / flavor as ...
* words with repeated letters (mmmmmmmm, aaaa, etc.): omit when a letter repeats more than 3 times?
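
A rough sketch of how some of these substitutions could look, assuming they run on lower-cased raw text before tokenizing and stemming. The `GROUPS` dictionary, the helper name `substitute`, and the choice to drop (rather than shorten) repeated-letter words are assumptions; only two of the word groups are shown.

```python
import re

# Hypothetical placeholder groups, taken from the list above.
GROUPS = {
    'POSITIVE': ['like', 'good', 'great', 'love', 'nice'],
    'FOOD_PRODUCT_2': ['dog', 'greenie', 'treat'],
}

def substitute(text):
    """Apply the proposed substitutions to lower-cased raw text."""
    text = text.lower()
    # drop words containing a character repeated 4 or more times (e.g. "mmmmmmmm")
    text = re.sub(r'\b\w*(\w)\1{3,}\w*\b', ' ', text)
    # map standalone numbers to a NUMBER placeholder
    text = re.sub(r'\b\d+\b', 'NUMBER', text)
    # map each word group (with an optional plural "s") to its placeholder
    for label, words in GROUPS.items():
        text = re.sub(r'\b(?:' + '|'.join(words) + r')s?\b', label, text)
    # collapse the whitespace left behind by the removals
    return ' '.join(text.split())

print(substitute('Mmmmmmmm my dogs love these 170 treats'))
```

Note that the repeated-character rule also drops numbers such as `10000`, so the order of the rules matters.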

In [19]:
## divide the reviews into positive (score >= 4) and negative (score <= 3)
def pos_neg(x):
    if x >= 4:
        return 'P'
    else:
        return 'Ng'

df['sentiment'] = df.Score.apply(pos_neg)

In [20]:
# split the dataframe into two parts; .copy() avoids SettingWithCopyWarning on the assignments below
pos_df = df[df.sentiment == 'P'].copy()
neg_df = df[df.sentiment == 'Ng'].copy()

In [21]:
pos_df['text_tok'] = pos_df.Text.apply(clean_text)
neg_df['text_tok'] = neg_df.Text.apply(clean_text)


In [22]:
def to_matrix(doc):
    # min_df=0.01 drops n-grams that appear in fewer than 1% of the documents, which reduces dimensionality
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=0.01)
    X = vectorizer.fit_transform(doc)
    X = X.toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
    return pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
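
To make the weighting concrete, the sketch below recomputes TF-IDF by hand for a toy corpus. It assumes scikit-learn's default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and handles unigrams only. The corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for toks in tokenized:
        raw = {t: c * idf[t] for t, c in Counter(toks).items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ['cooki soft cooki', 'dog love greeni', 'cooki love']
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

A term that appears in every document still gets a non-zero weight under this formula, which is one reason product words such as `cooki` dominate the summed scores.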

In [23]:
neg_mx = to_matrix(neg_df['text_tok'])
pos_mx = to_matrix(pos_df['text_tok'])



In [24]:
neg_mx

Unnamed: 0,100,12,12g,12g sugar,13,16,17,170,170 calori,20,...,write,wrong,ye,year,year old,yearup,yesterday,yet,yorki,yummi
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.060841,0.0,0.0,0.0,0.0,0.0,0.000000
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.164452,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110238,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.165011
166,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,...,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000


## Get the TF-IDF score for each token (n-gram)

In [35]:
tf_neg = neg_mx.T
tf_neg['score'] = tf_neg.sum(axis =1)
tf_pos = pos_mx.T
tf_pos['score'] = tf_pos.sum(axis =1)

#### Top 20 scores for positive reviews

In [39]:
tf_pos_score = tf_pos[['score']].sort_values(by = 'score',ascending = False).head(20)
tf_pos_score

Unnamed: 0,score
cooki,145.951363
love,114.116687
dog,92.750532
great,89.46579
use,86.995125
oil,83.022097
soft,82.763034
product,76.951207
good,76.478984
greeni,75.995298


#### Top 20 scores for negative reviews

In [40]:
tf_neg_score = tf_neg[['score']].sort_values(by = 'score',ascending = False).head(20)
tf_neg_score

Unnamed: 0,score
cooki,12.84222
dog,8.881612
greeni,7.451771
like,6.200713
product,6.16556
tast,5.675585
use,5.413039
would,5.080215
one,5.034909
love,4.909922
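
The two tables are not directly comparable as they stand: each score is a column sum, and only 169 of the 2,168 reviews are negative (`neg_mx` has rows 0 to 168), so the positive sums are larger almost by construction. A rough sketch of one way to compare them is to divide each sum by the number of reviews in its class. The values below are copied from the two tables; the helper name `mean_gap` is hypothetical, and because the two vectorizers were fit separately (different IDF weights), the gaps are only indicative.

```python
# Column sums copied from the positive and negative tables.
pos_sum = {'cooki': 145.951363, 'love': 114.116687, 'dog': 92.750532,
           'use': 86.995125, 'product': 76.951207, 'greeni': 75.995298}
neg_sum = {'cooki': 12.84222, 'love': 4.909922, 'dog': 8.881612,
           'use': 5.413039, 'product': 6.16556, 'greeni': 7.451771}

N_NEG = 169            # rows shown in neg_mx
N_POS = 2168 - N_NEG   # the remaining reviews have a score of 4 or 5

def mean_gap(term):
    """Mean TF-IDF in negative reviews minus mean TF-IDF in positive reviews."""
    return neg_sum[term] / N_NEG - pos_sum[term] / N_POS

# terms relatively more prominent in negative reviews come first
for term in sorted(pos_sum, key=mean_gap, reverse=True):
    print(f'{term:8s} {mean_gap(term):+.4f}')
```

Terms with a gap near zero (product names, for example) describe what was reviewed rather than why the review was good or bad.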


##### The features your analysis showed that customers cited as reasons for a poor review

##### The features your analysis showed that customers cited as reasons for a good review

##### The most common issues identified from your analysis that generated customer dissatisfaction.

## Similarity and Word Embeddings (2 pts)

Using
* `TfidfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both Euclidean distance and cosine similarity.

In [25]:
## cosine similarity function
def similar_doc_cs(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the most similar pair of reviews and their cosine similarity.

    Scores >= 0.99 (the diagonal and near-identical reviews) are masked out,
    so the highest remaining similarity is reported.
    """
    cs_mx = cosine_similarity(vec_df, vec_df)
    mod = np.where(cs_mx >= .99, 0, cs_mx)
    best = mod.max()

    for n, row in enumerate(mod):
        for m, score in enumerate(row):
            # n < m skips the mirrored half of the symmetric matrix
            if score >= best - 0.001 and n < m:
                # column 9 of the original dataframe is the review Text
                print('The first review:', '\n', ori_df.iloc[m, 9], '\n------------------------\n',
                      'The second review:', '\n', ori_df.iloc[n, 9])
                print(f'The similarity score was {round(score, 4)}')
                return


## euclidean distance function
def similar_doc_ed(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the closest pair of reviews and their Euclidean distance.

    Distances <= 0.009 (the diagonal and near-identical reviews) are masked out,
    so the smallest remaining distance is reported. Lower means more similar.
    """
    ed_mx = euclidean_distances(vec_df, vec_df)
    mod = np.where(ed_mx <= .009, 100, ed_mx)
    best = mod.min()

    for n, row in enumerate(mod):
        for m, dist in enumerate(row):
            if dist <= best and n < m:
                print('The first review:', '\n', ori_df.iloc[m, 9], '\n------------------------\n',
                      'The second review:', '\n', ori_df.iloc[n, 9])
                # note: this "score" is a distance, so smaller values mean more similar reviews
                print(f'The similarity score was {round(dist, 4)}')
                print(n, m)
                return

In [26]:
t = "I am a dog trainer and have never seen  anything like it....<br /><br />three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.<br />if the cat gets too many she has the runs....<br />sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.<br />s"
html = re.compile(r'<.*?>')
text = html.sub(r'',t)
print(text)

I am a dog trainer and have never seen  anything like it....three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.if the cat gets too many she has the runs....sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.s
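
The printed text shows that removing the tags outright fuses neighbouring words (`caution.if`, `runs....sheltie`). A tokenizer may keep such a fused string as one token, which the `isalnum()` filter in `clean_text` would then discard. Below is a minimal sketch of an alternative that substitutes a space instead; the shortened sample sentence and the variable names are for illustration only.

```python
import re

TAG = re.compile(r'<.*?>')

sample = 'use with caution.<br />if the cat gets too many she has the runs'

fused = TAG.sub('', sample)     # behaviour of the cell above: words run together
spaced = TAG.sub(' ', sample)   # hypothetical alternative: keep a word break

print(fused)
print(spaced)
```

Any extra spaces this introduces are harmless, since the tokenizer splits on whitespace anyway.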


In [27]:
similar_doc_cs(neg_mx, neg_df)

The first review: 
 I purchased this due to my use of coconut oil products in my hair. I am currently using the Coconut Oil by Organic Root Stimulator and decided to go with something more natural since ORS contains petrolatum. While I'm sure this product works great I cannot attest to that because the smell was nauseating to me. If you love the strong smell of coconut then this may be for you. 
------------------------
 The second review: 
 I buy and use coconut oil often.  I use it to cook with, spread on various foods, take it for my health, use on my skin and hair; kind of a coconut oil junkie.  I bought this to try thinking it was better bang for my buck.  It works well on my skin and hair but in my humble opinion the taste is terrible.  I much prefer the taste of Coconut Pacific 100% Pure Organic Extra Virgin Raw Coconut Oil.  I hope Coconut Pacific's <a href="http://www.amazon.com/gp/product/B0011DHM8S">Coconut Oil Extra Virgin Organic 14 Oz - Noni Pacific</a> will come back in

In [28]:
similar_doc_ed(neg_mx, neg_df)

The first review: 
 I purchased this due to my use of coconut oil products in my hair. I am currently using the Coconut Oil by Organic Root Stimulator and decided to go with something more natural since ORS contains petrolatum. While I'm sure this product works great I cannot attest to that because the smell was nauseating to me. If you love the strong smell of coconut then this may be for you. 
------------------------
 The second review: 
 I buy and use coconut oil often.  I use it to cook with, spread on various foods, take it for my health, use on my skin and hair; kind of a coconut oil junkie.  I bought this to try thinking it was better bang for my buck.  It works well on my skin and hair but in my humble opinion the taste is terrible.  I much prefer the taste of Coconut Pacific 100% Pure Organic Extra Virgin Raw Coconut Oil.  I hope Coconut Pacific's <a href="http://www.amazon.com/gp/product/B0011DHM8S">Coconut Oil Extra Virgin Organic 14 Oz - Noni Pacific</a> will come back in

In [29]:
similar_doc_cs(pos_mx, pos_df)

The first review: 
 Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday.  I ordered them from Amazon and everything went smoothly.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner  I would recommend ordering these to anyone. 
------------------------
 The second review: 
 My dog is goofy enough but when he knows one of these are comming,  he really goes nuts.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner I would recommend ordering these to anyone. Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday. I ordered them from Amazon and everything went smoothly.
The similarity score was 0.9353


In [30]:
ed_mx = euclidean_distances(pos_mx, pos_mx)
ed_mx

array([[0.00000000e+00, 1.37833303e+00, 1.33597703e+00, ...,
        1.41421356e+00, 1.41421356e+00, 1.36611238e+00],
       [1.37833303e+00, 2.10734243e-08, 1.23037517e+00, ...,
        1.34566340e+00, 1.40838928e+00, 1.41421356e+00],
       [1.33597703e+00, 1.23037517e+00, 0.00000000e+00, ...,
        1.37091210e+00, 1.38063291e+00, 1.41421356e+00],
       ...,
       [1.41421356e+00, 1.34566340e+00, 1.37091210e+00, ...,
        0.00000000e+00, 1.37096979e+00, 1.41421356e+00],
       [1.41421356e+00, 1.40838928e+00, 1.38063291e+00, ...,
        1.37096979e+00, 0.00000000e+00, 1.20791871e+00],
       [1.36611238e+00, 1.41421356e+00, 1.41421356e+00, ...,
        1.41421356e+00, 1.20791871e+00, 2.10734243e-08]])

In [31]:
similar_doc_ed(pos_mx, pos_df)

The first review: 
 Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday.  I ordered them from Amazon and everything went smoothly.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner  I would recommend ordering these to anyone. 
------------------------
 The second review: 
 My dog is goofy enough but when he knows one of these are comming,  he really goes nuts.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner I would recommend ordering these to anyone. Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday. I ordered them from Amazon and everything went smoothly.
The similarity score was 0.3596
456 460
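
Both metrics select the same pair in each class, which is expected. `TfidfVectorizer` L2-normalizes every row by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`, so ranking by one metric is equivalent to ranking by the other. The short check below plugs in the reported cosine score; the helper name `distance_from_cosine` is made up for illustration.

```python
import math

def distance_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with the given cosine similarity."""
    return math.sqrt(2 - 2 * cos_sim)

# reported cosine similarity of the most similar positive pair
print(round(distance_from_cosine(0.9353), 4))

# orthogonal unit vectors (cosine 0) are sqrt(2) apart,
# which is the 1.41421356 that fills much of the distance matrix
print(round(distance_from_cosine(0.0), 8))
```

The first value agrees with the reported distance of 0.3596 up to the rounding of the cosine score.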


## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

**Prior:**

$$\begin{aligned}
P(y = \text{Intent}) &= 3/6 = 1/2 \\
P(y = \text{No Intent}) &= 3/6 = 1/2
\end{aligned}$$

**Likelihood:**

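After removing the stopword `this`, the review reduces to the words "looks", "so", and "bad". Each word likelihood is estimated as the share of reviews in a class that contain the word, $P(w \mid y) = P(w, y) / P(y)$, and the words are treated as conditionally independent given the class. Note that "bad" appears in one `Intent` review but in two `No Intent` reviews.

$$\begin{aligned}
P(x \mid y = \text{Intent}) &= P(\text{"looks"} \mid \text{Intent}) \times P(\text{"so"} \mid \text{Intent}) \times P(\text{"bad"} \mid \text{Intent}) \\
&= \frac{1/6}{1/2} \times \frac{1/6}{1/2} \times \frac{1/6}{1/2} \\
&= 1/3 \times 1/3 \times 1/3 \\
&= 1/27 \\
\\
P(x \mid y = \text{No Intent}) &= P(\text{"looks"} \mid \text{No Intent}) \times P(\text{"so"} \mid \text{No Intent}) \times P(\text{"bad"} \mid \text{No Intent}) \\
&= \frac{1/6}{1/2} \times \frac{1/6}{1/2} \times \frac{2/6}{1/2} \\
&= 1/3 \times 1/3 \times 2/3 \\
&= 2/27
\end{aligned}$$

**Evidence:**

$$\begin{aligned}
P(x) &= P(x \mid y = \text{Intent}) \times P(y = \text{Intent}) + P(x \mid y = \text{No Intent}) \times P(y = \text{No Intent}) \\
&= (1/27) \times (1/2) + (2/27) \times (1/2) \\
&= 1/54 + 2/54 \\
&= 1/18
\end{aligned}$$


**Posterior:**

$$\begin{aligned}
P(y = \text{Intent} \mid x) &= \frac{P(x \mid y = \text{Intent}) \times P(y = \text{Intent})}{P(x)} \\
&= \frac{(1/27) \times (1/2)}{1/18} \\
&= 1/3 \\
\\
P(y = \text{No Intent} \mid x) &= \frac{P(x \mid y = \text{No Intent}) \times P(y = \text{No Intent})}{P(x)} \\
&= \frac{(2/27) \times (1/2)}{1/18} \\
&= 2/3
\end{aligned}$$

Since the posterior for `No Intent` (2/3) is larger than the posterior for `Intent` (1/3), "This looks so bad." is classified as `No Intent to Buy`.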
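
One way to check a hand computation like this is to redo it with exact fractions. The sketch below assumes each word likelihood is the share of a class's reviews that contain the word (a Bernoulli-style estimate with no smoothing) and removes only the stopwords `to` and `this`. The helper names `tokens` and `posterior` are made up for illustration, and a token-count (multinomial) model would give different numbers.

```python
from fractions import Fraction

STOPWORDS = {'to', 'this'}

def tokens(text):
    """Lower-case, strip punctuation, and drop the two stopwords."""
    words = (w.strip('.,!?').lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

train = {
    'Intent': ["Love this movie. Can't wait!",
               'I want to see this movie so bad.',
               'This movie looks amazing.'],
    'No Intent': ['Looks bad.',
                  'Hard pass to see this bad movie.',
                  'So boring!'],
}

def posterior(query):
    """Return the exact posterior probability of each class for the query."""
    total = sum(len(docs) for docs in train.values())
    joint = {}
    for label, docs in train.items():
        doc_sets = [tokens(d) for d in docs]
        p = Fraction(len(docs), total)  # prior
        for word in tokens(query):
            # likelihood: share of this class's reviews that contain the word
            p *= Fraction(sum(word in s for s in doc_sets), len(docs))
        joint[label] = p
    evidence = sum(joint.values())  # zero if a query word is unseen in every class
    return {label: p / evidence for label, p in joint.items()}

print(posterior('This looks so bad.'))
```

Without smoothing, a single unseen word zeroes out a class, so a real classifier would normally add Laplace smoothing to the likelihoods.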