# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.
## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [1]:
import pandas as pd
import numpy as np
import re, string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

In [2]:
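def clean_text(text):
    """
    Clean a raw review and return it as a single space-joined string of tokens.
    Strips HTML tags, normalizes year/birthday/holiday phrases, tokenizes, and lowercases.
    Removes hashtags, punctuation, stopwords, website links, extra spaces, special characters,
    non-alphanumeric tokens, and single-character tokens, then stems the remaining tokens.
    """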

    # erase html language characters
    html = re.compile(r'<.*?>')
    text = html.sub(r'',text)

    # year phrases (the lookbehind keeps "yr" inside words such as "syrup" untouched)
    text = re.sub(r'\-?(?<![A-Za-z])yrs?\b', ' year', text)
    text = re.sub(r'(\-?years?\-?olds?)', ' year old', text)

    # birthday
    text = re.sub(r'([Bb]\-?[Dd]ays?)', 'birthday', text)

    # holiday words
    text = re.sub(r'([Xx][Mm]as|[Cc]hrist\-[Mm]as)', 'christmas', text)
    text = re.sub(r'([Nn]ew\-[Yy]ears?)', 'new years', text)

    tokens = [token for token in nltk.word_tokenize(text)]
    
    # Combine stopwords and punctuation
    stops = stopwords.words("english") + list(string.punctuation)

    # add custom stopwords (buy, bought, purchase, purchased)
    stops.extend(['buy', 'bought', 'purchase', 'purchased'])

    # the following steps are reused from a past NLP project for cleaning text/tokens

    # special characters
    s_chars = '¥₽ÏïŰŬĎŸæ₿₪ÚŇÀèÅ”ĜåŽÖéříÿý€ŝĤ₹áŜŮÂ₴ûÌÇšŘúüëÓ₫ŠčÎŤÆÒœ₩öËäøÍťìĈôàĥÝ¢ç“žðÙÊĉŭÈŒÐÉÔĵùÁů„âÄűĴóêĝÞîØòď฿ČÜþňÛ'
    
    # Create PorterStemmer
    stemmer = PorterStemmer()
    
    tokens_no_hashtag = [re.sub(r'#', '', token) for token in tokens]
    tokens_no_stopwords = [token.lower() for token in tokens_no_hashtag if token.lower() not in stops]
    tokens_no_url = [re.sub(r'http\S+', '', token) for token in tokens_no_stopwords]
    tokens_no_url = [re.sub(r'www\S+', '', token) for token in tokens_no_url]
    tokens_no_special_char = [re.sub(r'[{}]'.format(s_chars), '', token) for token in tokens_no_url]
    tokens_no_extra_space = [re.sub(r'\s\s+', '', token) for token in tokens_no_special_char]
    tokens_alnum = [token for token in tokens_no_extra_space if token.isalnum()]
    tokens_stem = [stemmer.stem(token) for token in tokens_alnum]
    tokens_final = [token for token in tokens_stem if len(token) > 1]
    
    return ' '.join(tokens_final)
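
The phrase substitutions at the top of `clean_text` can be checked in isolation. The sketch below uses a hypothetical helper, `normalize_phrases`, and a made-up sample sentence. The `yr` pattern here carries a lookbehind guard so that `yr` inside a word such as `syrup` is left alone, which matters for food reviews.

```python
import re

def normalize_phrases(text):
    """Hypothetical helper: only the phrase-normalization substitutions."""
    # "-yr", "yrs" -> " year"; the lookbehind skips "yr" inside words like "syrup"
    text = re.sub(r'\-?(?<![A-Za-z])yrs?\b', ' year', text)
    # "-year-old", "yearsold" -> " year old"
    text = re.sub(r'\-?years?\-?olds?', ' year old', text)
    # "B-day", "bdays" -> "birthday"
    text = re.sub(r'[Bb]\-?[Dd]ays?', 'birthday', text)
    # "Xmas", "Christ-mas" -> "christmas"
    text = re.sub(r'[Xx][Mm]as|[Cc]hrist\-[Mm]as', 'christmas', text)
    # "New-Year", "new-years" -> "new years"
    text = re.sub(r'[Nn]ew\-[Yy]ears?', 'new years', text)
    return text

sample = "Maple syrup treats for my 3-yr old's B-day, and more for Xmas and New-Year."
print(normalize_phrases(sample))
```

Running a handful of such sentences before cleaning the full dataset is a cheap way to catch patterns that match more than intended.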

- Which stopwords to remove? Why remove/keep stopwords? Adding in custom stopwords?
    - which words we removed
    - which words we kept
    - which words we added

- stemming versus lemmatization?

- regex cleaning and substitution?
    - cleaned html format strings
    - cleaned phrases like `year`, `birthday`, and holiday words

- what `n` for your `n-grams`?
    - we chose n = 1 and 2
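
To make the choice of n = 1 and 2 concrete, here is a minimal sketch of the unigram and bigram features that one cleaned review contributes. The helper `ngrams` and the stemmed sample string are illustrative assumptions, not part of the analysis code.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases, joined by a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# a made-up review after cleaning and stemming
cleaned = "dog love greeni clean teeth"
tokens = cleaned.split()

# n = 1 and n = 2 together: single words plus adjacent word pairs
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

Bigrams keep short phrases such as `clean teeth` together, at the cost of a much larger vocabulary.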

In [3]:
df = pd.read_csv('amazon_fine_foods.csv')
df.head(3)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20983 | B002QWP89S | A21U4DR8M6I9QN | K. M Merrill "justine" | 1 | 1 | 5 | 1318896000 | addictive! but works for night coughing in dogs | my 12 year old sheltie has chronic brochotitis... |
| 1 | 20984 | B002QWP89S | A17TDUBB4Z1PEC | jaded_green | 1 | 1 | 5 | 1318550400 | genuine Greenies best price | These are genuine Greenies product, not a knoc... |
| 2 | 20985 | B002QWP89S | ABQH3WAWMSMBH | tenisbrat87 | 1 | 1 | 5 | 1317168000 | Perfect for our little doggies | Our dogs love Greenies, but of course, which d... |


In [4]:
df.ProductId.value_counts().head(10)

B007JFMH8M    913
B0026RQTGE    632
B002QWP8H0    632
B002QWHJOU    632
B002QWP89S    632
B003B3OOPA    623
B001EO5Q64    567
B000VK8AVK    564
B007M83302    564
B001RVFEP2    564
Name: ProductId, dtype: int64

In [5]:
## limiting our data to just 3 products
df = df[df.ProductId.isin(['B007JFMH8M', 'B0026RQTGE', 'B002QWP8H0'])]
df.shape

(2177, 10)

In [6]:
## divide the reviews into positive (score >= 4) and negative (score < 4)
def pos_neg(x):
    if x >= 4:
        return 'P'
    else:
        return 'Ng'

df['sentiment'] = df.Score.apply(pos_neg)

In [7]:
# split the dataframe into two parts
# .copy() makes each part an independent frame, so adding a column later does not raise SettingWithCopyWarning
pos_df = df[df.sentiment == 'P'].copy()
neg_df = df[df.sentiment == 'Ng'].copy()

In [8]:
pos_df['text_tok'] = pos_df.Text.apply(clean_text)
neg_df['text_tok'] = neg_df.Text.apply(clean_text)

In [9]:
def to_matrix(doc):
    # min_df=0.01 drops terms that appear in fewer than 1% of the documents, which reduces dimensionality
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=0.01)
    X = vectorizer.fit_transform(doc).toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    return pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
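
For intuition about the numbers in the resulting matrix, the following is a minimal standard-library sketch of the weighting that `TfidfVectorizer` applies under its default settings (raw term counts, smoothed idf, L2-normalized rows). It ignores `min_df` and bigrams, and the helper names `tfidf_rows` and `top_terms` as well as the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

def top_terms(rows, k=3):
    """Rank terms by their mean weight across all documents."""
    totals = Counter()
    for row in rows:
        totals.update(row)  # adds the weights term by term
    return [(term, total / len(rows)) for term, total in totals.most_common(k)]

docs = ["dog sick vomit", "dog sick", "dog love treat"]
rows = tfidf_rows(docs)
print(top_terms(rows))
```

In this toy corpus `dog` still ranks first by mean weight because it occurs in every document, even though its idf is the lowest. That is one reason to compare the positive and negative rankings against each other rather than read either list alone.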

In [10]:
neg_mx = to_matrix(neg_df['text_tok'])
pos_mx = to_matrix(pos_df['text_tok'])



In [11]:
neg_mx

Unnamed: 0,10,10 year,100,100 digest,100lb,100lb plu,12,12g,12g sugar,12hr,...,yorki 18,yorki breed,yorkshir,yorkshir terrier,younger,younger dog,younger saint,yummi,ziploc,ziploc top
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
187,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.169433,0.0,0.0
188,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
189,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0
190,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0



##### The features your analysis showed that customers cited as reasons for a poor review

##### The features your analysis showed that customers cited as reasons for a good review

##### The most common issues identified from your analysis that generated customer dissatisfaction.

## Similarity and Word Embeddings (2 pts)

Using
* `TfidfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both Euclidean distance and cosine similarity.

In [25]:
## cosine similarity function
def similar_doc_cs(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the two reviews with the highest cosine similarity, and the score.

    Scores of 0.99 or above (a review compared with itself, or exact duplicates)
    are zeroed out so that the best remaining pair is reported.
    """
    cs_mx = cosine_similarity(vec_df, vec_df)
    mod = np.where(cs_mx >= .99, 0, cs_mx)
    best = mod.max()

    count = 0
    for n, row in enumerate(mod):
        for m, score in enumerate(row):
            # report only the first pair (n < m) within 0.001 of the best score
            if score >= best - 0.001 and n < m and count < 1:
                count += 1
                # column 9 of the original dataframe is the review text
                print('The first review:', '\n', ori_df.iloc[m, 9], '\n------------------------\n',
                      'The second review:', '\n', ori_df.iloc[n, 9])
                print(f'The similarity score was {round(score, 4)}')


## euclidean distance function
def similar_doc_ed(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the two reviews with the smallest Euclidean distance, and the distance.

    Distances of 0.009 or below (a review compared with itself, or exact duplicates)
    are replaced by a large value so that the closest remaining pair is reported.
    """
    ed_mx = euclidean_distances(vec_df, vec_df)
    mod = np.where(ed_mx <= .009, 100, ed_mx)
    closest = mod.min()

    count = 0
    for n, row in enumerate(mod):
        for m, dist in enumerate(row):
            # report only the first pair (n < m) at the smallest distance
            if dist <= closest and n < m and count < 1:
                count += 1
                print('The first review:', '\n', ori_df.iloc[m, 9], '\n------------------------\n',
                      'The second review:', '\n', ori_df.iloc[n, 9])
                # note: the value printed here is a distance, so smaller means more similar
                print(f'The similarity score was {round(dist, 4)}')
                print(n, m)
                break
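
The nested loops above visit every ordered pair of reviews. A minimal standard-library sketch of the same search is shown below; it considers each unordered pair once, which rules out self-comparisons without needing a threshold. The helper names and the four toy vectors are assumptions for illustration, and exact-duplicate reviews are not filtered out here.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar_pairs(vectors):
    """Return (pair with highest cosine, pair with smallest distance)."""
    pairs = list(combinations(range(len(vectors)), 2))
    best_cos = max(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    best_euc = min(pairs, key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))
    return best_cos, best_euc

# four unit-length toy vectors standing in for TF-IDF rows
vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8], [0.8, 0.6, 0.0]]
best_cos, best_euc = most_similar_pairs(vectors)
print(best_cos, best_euc)
```

For a few thousand reviews the pairwise matrices from scikit-learn are far faster; this version is only meant to make the selection logic explicit.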

In [13]:
t = "I am a dog trainer and have never seen  anything like it....<br /><br />three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.<br />if the cat gets too many she has the runs....<br />sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.<br />s"
html = re.compile(r'<.*?>')
text = html.sub(r'',t)
print(text)

I am a dog trainer and have never seen  anything like it....three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.if the cat gets too many she has the runs....sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.s


In [14]:
similar_doc_cs(neg_mx, neg_df)

The first review: 
 I like oatmeal and raisin cookies but I don't eat them very often.  That's probably because my wife makes such awesome chocolate cookies.  When I do get an opportunity to try an oatmeal and raisin cookie it is usually from a store-bought package.  I usually come away impressed with how moist these pre-packaged cookies are.  That's what surprised me most about Quaker Soft Baked Oatmeal Cookie;  They're soft but they aren't moist.  There is a distinct soft dryness to the cookies. The taste was good and my grand daughter quickly ate up most of them.  She liked them because they are tasty and soft.  I agree and I guess two out of three criteria is good but I was looking for one more quality 
------------------------
 The second review: 
 I received the oatmeal raisin cookie in my momvoxbox from influenster and it was pleasantly surprised by how chewy and tasty the cookie was considering that oatmeal raisin cookies are usually hard.  I'm not a big fan of the oatmeal rai

In [15]:
similar_doc_ed(neg_mx, neg_df)

The first review: 
 I will continue to give these to my dogs. They like them and supposedly they clean their teeth.<br />I started to give these to my dog daily, but her digestive system did not like it enough for that. Most of it seems to be digestible, but she threw up green foam on the morning after the second day in a row of eating one-a-day greenies. Note that she was also constipated as well. Greenie advises to have plenty of water available for drinking these, so I am guess they know these side effects can occur. My dog has plenty of water available, but she didn't drink any more than normal, which may have resulted in the effects. I don't plan on training my dog to drink lots of extra water after eating these. Giving here one every now and then shows no negative effects, and that what I plan on doing.<br />Not a complaint to me, but I think people who like giving there dog all natural health food should know that this product is far from "green." It is pumped pull of artificial

In [16]:
similar_doc_cs(pos_mx, pos_df)

The first review: 
 Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday.  I ordered them from Amazon and everything went smoothly.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner  I would recommend ordering these to anyone. 
------------------------
 The second review: 
 My dog is goofy enough but when he knows one of these are comming,  he really goes nuts.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner I would recommend ordering these to anyone. Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday. I ordered them from Amazon and everything went smoothly.
The similarity score was 0.9183


In [20]:
cs_mx = euclidean_distances(pos_mx, pos_mx)
cs_mx

array([[2.10734243e-08, 1.39583802e+00, 1.35772903e+00, ...,
        1.41421356e+00, 1.41421356e+00, 1.36594777e+00],
       [1.39583802e+00, 1.49011612e-08, 1.26984116e+00, ...,
        1.35567183e+00, 1.40962530e+00, 1.41421356e+00],
       [1.35772903e+00, 1.26984116e+00, 0.00000000e+00, ...,
        1.36478141e+00, 1.37701242e+00, 1.41421356e+00],
       ...,
       [1.41421356e+00, 1.35567183e+00, 1.36478141e+00, ...,
        0.00000000e+00, 1.37165220e+00, 1.41421356e+00],
       [1.41421356e+00, 1.40962530e+00, 1.37701242e+00, ...,
        1.37165220e+00, 0.00000000e+00, 1.20904163e+00],
       [1.36594777e+00, 1.41421356e+00, 1.41421356e+00, ...,
        1.41421356e+00, 1.20904163e+00, 0.00000000e+00]])

In [26]:
similar_doc_ed(pos_mx, pos_df)

The first review: 
 Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday.  I ordered them from Amazon and everything went smoothly.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner  I would recommend ordering these to anyone. 
------------------------
 The second review: 
 My dog is goofy enough but when he knows one of these are comming,  he really goes nuts.  I did not have a hitch in the order or length of time getting them. Thank you so much, Linda Turner I would recommend ordering these to anyone. Greenies have been a part of my dogs lives for nearly thirteen years and they look forward to one everyday. I ordered them from Amazon and everything went smoothly.
The similarity score was 0.4042
456 460
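
Both metrics single out the same pair of positive reviews. That is expected when the TF-IDF rows have unit length, which is the `TfidfVectorizer` default (`norm='l2'`, assumed to be in effect here since `to_matrix` does not override it): for unit vectors the squared Euclidean distance equals 2 − 2·cos. A quick check of the two reported scores, using an illustrative helper name:

```python
import math

def distance_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with the given cosine similarity."""
    return math.sqrt(2 - 2 * cos_sim)

# cosine similarity reported for the most similar positive pair
print(round(distance_from_cosine(0.9183), 4))

# two reviews with no shared terms have cosine 0, hence distance sqrt(2)
print(round(distance_from_cosine(0.0), 8))
```

The first value agrees with the reported Euclidean distance of 0.4042, and the second explains the many entries of 1.41421356 in the distance matrix: those are pairs of reviews that share no terms.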


## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

**Prior:**

$$\begin{aligned}
P(y = \text{Intent}) &= 3/6 = 1/2   \\
P(y = \text{No Intent}) &= 3/6 = 1/2 \\
\end{aligned}$$

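**Likelihood:**

After removing the stopwords `to` and `this`, the review reduces to the words "looks", "so", and "bad". Each word likelihood is estimated as the fraction of the class's three reviews that contain the word. "looks" and "so" each appear in one review of each class, while "bad" appears in one Intent review and in two No Intent reviews.

$$\begin{aligned}
P(x \mid y = \text{Intent}) & = P(x = \text{"looks"} \mid y = \text{Intent}) \times P(x = \text{"so"} \mid y = \text{Intent}) \times P(x = \text{"bad"} \mid y = \text{Intent})   \\
& = 1/3 \times 1/3 \times 1/3 \\
& = 1/27 \\
\\
\\
P(x \mid y = \text{No Intent}) & = P(x = \text{"looks"} \mid y = \text{No Intent}) \times P(x = \text{"so"} \mid y = \text{No Intent}) \times P(x = \text{"bad"} \mid y = \text{No Intent})   \\
& = 1/3 \times 1/3 \times 2/3 \\
& = 2/27 \\
\end{aligned}$$

**Evidence:**

$$\begin{aligned}
P(x) &= P(x \mid y = \text{Intent}) \times P(y = \text{Intent}) + P(x \mid y = \text{No Intent}) \times P(y = \text{No Intent})   \\
& = (1/27) \times (1/2) + (2/27) \times (1/2) \\
& = 1/54 + 2/54 \\
& = 1/18 \\
\end{aligned}$$


**Posterior:**

$$\begin{aligned}
P(y = \text{Intent} \mid x) &= (P(x \mid y = \text{Intent}) \times P(y = \text{Intent})) / P(x)  \\
& = ((1/27) \times (1/2)) / (1/18) \\
& = 1/3 \\
\\
\\
P(y = \text{No Intent} \mid x) &= (P(x \mid y = \text{No Intent}) \times P(y = \text{No Intent})) / P(x) \\
& = ((2/27) \times (1/2)) / (1/18) \\
& = 2/3 \\
\end{aligned}$$

Since the posterior for No Intent is larger than the posterior for Intent, "This looks so bad." is classified as **No Intent to Buy**.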
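
The arithmetic can be checked with a short script. This is a minimal sketch that assumes each word likelihood is estimated as the fraction of a class's reviews containing the word (a document-level estimate with no smoothing); the helper names `tokenize`, `doc_likelihood`, and `posterior` are illustrative, and exact fractions are used to avoid rounding.

```python
import re
from fractions import Fraction

STOPWORDS = {"to", "this"}

classes = {
    "Intent": ["Love this movie. Can't wait!",
               "I want to see this movie so bad.",
               "This movie looks amazing."],
    "No Intent": ["Looks bad.",
                  "Hard pass to see this bad movie.",
                  "So boring!"],
}

def tokenize(text):
    """Lowercase, keep word characters and apostrophes, drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def doc_likelihood(tokens, docs):
    """Product over tokens of (reviews containing the token) / (reviews in class)."""
    doc_sets = [set(tokenize(d)) for d in docs]
    p = Fraction(1)
    for t in tokens:
        p *= Fraction(sum(t in s for s in doc_sets), len(docs))
    return p

def posterior(review, classes):
    """Return {label: P(label | review)} via Bayes' rule."""
    tokens = tokenize(review)
    total_docs = sum(len(docs) for docs in classes.values())
    # joint = prior * likelihood for each class
    joint = {label: Fraction(len(docs), total_docs) * doc_likelihood(tokens, docs)
             for label, docs in classes.items()}
    evidence = sum(joint.values())
    return {label: j / evidence for label, j in joint.items()}

post = posterior("This looks so bad.", classes)
for label, p in post.items():
    print(label, p)
```

With no smoothing, any word that never occurs in a class drives that class's likelihood to zero, so a practical classifier would normally add Laplace smoothing.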