# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.
## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [1]:
import pandas as pd
import numpy as np
import re, string
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

In [2]:
def clean_text(text):
    """
    Tokenize text into words and convert them to lower case.
    Remove hashtags, punctuation, stopwords, website links, extra spaces, non-alphanumeric
    characters, and single-character tokens, then stem the remaining tokens.
    """

    # erase html language characters
    html = re.compile(r'<.*?>')
    text = html.sub(r'',text)

    # product replacement
    text = re.sub(r'\b(cookiee?s?)\b', 'FOOD_PRODUCT_1', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(greenies?)\b', 'FOOD_PRODUCT_2', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(coconut\s?oils?|oils?)\b', 'FOOD_PRODUCT_3', text, flags=re.IGNORECASE)

    # numbers plus units
    text = re.sub(r'\b([1-9]+[\w]*)\b', '_NUMERIC_', text, flags=re.IGNORECASE)

    # taste words
    text = re.sub(r'\b(tasty|delicious|yum*y|enjoy)\b', 'GOOD_TASTE', text, flags=re.IGNORECASE)
    text = re.sub(r'\b(nasty|disgusting|rotten|stale)\b', 'BAD_TASTE', text, flags=re.IGNORECASE)

    # words with 3 or more of same letters
    text = re.sub(r'\b[a-z0-9]*(.)\1\1+[a-z0-9]*\b', '', text, flags=re.IGNORECASE)

    tokens = [token for token in nltk.word_tokenize(text)]
    
    # Combine NLTK stopwords and punctuation into a set for fast membership tests
    stops = set(stopwords.words("english")) | set(string.punctuation)

    # custom stopwords: generic shopping words that appear in most reviews
    stops.update(['buy', 'bought', 'purchase', 'purchased',
                  'product', 'products', 'package', 'packages'])

    # custom stopwords: product-name words other than the one substituted above
    stops.update(['quaker', 'raisin', 'raisins', 'bake', 'baked',
                  'oatmeal', 'dog', 'dogs'])

    ## the following codes are from my past nlp project that I use when cleaning the text/ tokens

    # special characters
    s_chars = '¥₽ÏïŰŬĎŸæ₿₪ÚŇÀèÅ”ĜåŽÖéříÿý€ŝĤ₹áŜŮÂ₴ûÌÇšŘúüëÓ₫ŠčÎŤÆÒœ₩öËäøÍťìĈôàĥÝ¢ç“žðÙÊĉŭÈŒÐÉÔĵùÁů„âÄűĴóêĝÞîØòď฿ČÜþňÛ'
    
    # Create PorterStemmer
    stemmer = PorterStemmer()
    
    tokens_no_hashtag = [re.sub(r'#', '', token) for token in tokens]
    tokens_no_stopwords = [token.lower() for token in tokens_no_hashtag if token.lower() not in stops]
    tokens_no_url = [re.sub(r'http\S+', '', token) for token in tokens_no_stopwords]
    tokens_no_url = [re.sub(r'www\S+', '', token) for token in tokens_no_url]
    tokens_no_special_char = [re.sub(r'[{}]'.format(s_chars), '', token) for token in tokens_no_url]
    tokens_no_extra_space = [re.sub(r'\s\s+', '', token) for token in tokens_no_special_char]
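    # keep only purely alphanumeric tokens; note that str.isalnum() is False for tokens
    # containing underscores, so placeholders such as 'food_product_1' are dropped here too
    tokens_alnum = [token for token in tokens_no_extra_space if token.isalnum()]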
    tokens_stem = [stemmer.stem(token) for token in tokens_alnum]
    tokens_final = [token for token in tokens_stem if len(token) > 1]
    
    return ' '.join(tokens_final)

- Which stopwords to remove? Why remove/keep stopwords? Adding in custom stopwords?
    - which words we removed
    - which words we kept
    - which words we added

### Stopwords
- Stopwords should be removed because these very common words carry no meaning for our analysis; keeping them would introduce noise and take up dimensions in the final vectorized matrices.
- The removed stopwords include all the stopwords in the NLTK corpus, as well as some custom stopwords chosen for this data set. Words like purchase, buy, product, and package were removed because they are likely to appear in most reviews of any purchased product, which makes them relatively meaningless in this context.
- Furthermore, this analysis focuses on only 3 of the most-reviewed products in the dataset, and all 3 have names that are more than one token long. All but one of the words in each product's name were added to the stopword list, so that each mention of a product in a review is regex-substituted into a keyword only once.
    - For the Quaker Soft Baked Oatmeal Cookies with Raisin, all relevant naming words other than "cookie" were added to the stopword list.
    - For the Greenies dog treats, all words except "Greenies" were added to the stopword list.
    - For the coconut oil, no words were added to the stopword list, and "coconut oil" and "oil" were considered in the regex substitution section.
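
As a rough illustration of the effect of these choices, the sketch below filters one made-up review. `BASE_STOPWORDS` is a tiny hypothetical stand-in for the NLTK English list, and `remove_stopwords` is a helper name invented for this example; the custom words are the ones listed above.

```python
# Hypothetical stand-in for a standard stopword list (the analysis itself uses NLTK's)
BASE_STOPWORDS = {"the", "a", "and", "i", "it", "this", "is"}

# Custom stopwords from the write-up: generic shopping words ...
GENERIC_REVIEW_WORDS = {"buy", "bought", "purchase", "purchased",
                        "product", "products", "package", "packages"}
# ... and product-name words other than the one kept as the product identifier
PRODUCT_NAME_WORDS = {"quaker", "raisin", "raisins", "bake", "baked",
                      "oatmeal", "dog", "dogs"}

STOPWORDS = BASE_STOPWORDS | GENERIC_REVIEW_WORDS | PRODUCT_NAME_WORDS


def remove_stopwords(tokens):
    """Keep only tokens whose lower-cased form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


review = "I bought the Quaker oatmeal cookie and it is soft"
kept = remove_stopwords(review.split())
print(kept)
```

Only "cookie" and "soft" survive, which is the intent: one identifying word per product plus the descriptive vocabulary.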

### Stemming vs Lemmatization
Since the analysis relies on word counts that ultimately feed TF-IDF, neither the part of speech nor a valid dictionary form of each word is needed. Therefore, stemming was chosen over lemmatization. The trade-off is that stems such as "tast" and "tri" are less readable in the results.

### Regex Cleaning and Substitution
- Regex substitution was used to replace the name of each product in this analysis with a keyword identifying it, so that we can tell which product a given review or vectorized result refers to.
    - "Cookie" was used as the identifier for the product with ID "B007JFMH8M"
    - "Greenies" was used as the identifier for the product with ID "B0026RQTGE"
    - "Coconut Oil"/"Oil" was used as the identifier for the product with ID "B003B3OOPA"
- Regex substitution was also used to group common taste words that may appear in a corpus of food reviews, since all of the products we focused on are consumed. These included the words "tasty", "delicious", "yummy", and "enjoy" for positive sentiments, and "nasty", "disgusting", "rotten", and "stale" for negative ones.
- Additionally, regex substitution was used to clean the text:
    - Numbers that appear in the corpus were substituted with `_NUMERIC_`, since specific numbers don't provide much meaning in an analysis resulting in a vectorized matrix.
    - Tokens that contain the same character three or more times in a row were also removed, since English words almost never repeat a character three times in a row; these erroneous, meaningless tokens would only add unneeded dimensions to the final matrices.
    - Lastly, regex substitution was used to remove HTML tags, hashtags, URLs, special characters, and extra spaces from the corpus.
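
The substitutions described above can be sketched on a single made-up review. The patterns are copied from `clean_text`; the sample sentence and the helper name `substitute` are invented for illustration. One thing worth checking in a pipeline like this: a placeholder that contains underscores does not pass `str.isalnum()`, so a later alphanumeric-only filter would drop it.

```python
import re


def substitute(text):
    # product name -> placeholder keyword
    text = re.sub(r'\b(cookiee?s?)\b', 'FOOD_PRODUCT_1', text, flags=re.IGNORECASE)
    # numbers, optionally followed by units (e.g. "12", "3oz")
    text = re.sub(r'\b([1-9]+[\w]*)\b', '_NUMERIC_', text, flags=re.IGNORECASE)
    # positive taste words -> one shared keyword
    text = re.sub(r'\b(tasty|delicious|yum*y|enjoy)\b', 'GOOD_TASTE', text, flags=re.IGNORECASE)
    # drop tokens with the same character three or more times in a row
    text = re.sub(r'\b[a-z0-9]*(.)\1\1+[a-z0-9]*\b', '', text, flags=re.IGNORECASE)
    return text


sample = "These cookies are yummy, I ate 12 of them. Sooooo good!"
cleaned = substitute(sample)
print(cleaned)

# Placeholders with underscores fail an isalnum() check
print('FOOD_PRODUCT_1'.isalnum())
```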


### N-gram Selection

For this analysis, we looked at n-grams of size 1 and 2. Single words (size 1) were counted to find individual meaningful terms, while the TF-IDF vectorization uses n-grams of size 2 (`ngram_range=(2, 2)`) to find common two-word phrases that are meaningful in a TF-IDF analysis.
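
To make the two sizes concrete, here is a minimal sketch of how unigrams and bigrams are read off a cleaned token sequence with a sliding window. The helper `ngrams` and the sample tokens are illustrative assumptions, not part of the analysis code.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, in order, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "love soft chewi cooki".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

A document with k tokens yields k unigrams and k - 1 bigrams, so bigrams add context ("soft chewi") at the cost of a larger, sparser vocabulary.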

In [3]:
df = pd.read_csv('../datasets/amazon_fine_foods.csv')
# df = pd.read_csv('amazon_fine_foods.csv')
df.head(3)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,20983,B002QWP89S,A21U4DR8M6I9QN,"K. M Merrill ""justine""",1,1,5,1318896000,addictive! but works for night coughing in dogs,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,A17TDUBB4Z1PEC,jaded_green,1,1,5,1318550400,genuine Greenies best price,"These are genuine Greenies product, not a knoc..."
2,20985,B002QWP89S,ABQH3WAWMSMBH,tenisbrat87,1,1,5,1317168000,Perfect for our little doggies,"Our dogs love Greenies, but of course, which d..."


In [4]:
df.ProductId.value_counts().head(10)

B007JFMH8M    913
B0026RQTGE    632
B002QWP8H0    632
B002QWHJOU    632
B002QWP89S    632
B003B3OOPA    623
B001EO5Q64    567
B000VK8AVK    564
B007M83302    564
B001RVFEP2    564
Name: ProductId, dtype: int64

In [5]:
## limiting our data to just 3 products
df = df[df.ProductId.isin(['B007JFMH8M', 'B0026RQTGE', 'B003B3OOPA'])]
df.shape

(2168, 10)

### Getting the word counts from summary and review text

In [6]:
df['summary_tok'] = df.Summary.apply(clean_text)
df['text_tok'] = df.Text.apply(clean_text)

In [7]:
from collections import Counter

def word_count(text):
    # count how often each whitespace-separated token occurs
    # (the parameter is named `text` to avoid shadowing the built-in `str`)
    return dict(Counter(text.split()))

In [8]:
# summary
sum_l = df['summary_tok'].astype(str).values.tolist()
summary_text = ' '.join(sum_l)
summary_df = pd.DataFrame.from_dict(word_count(summary_text), orient='index',columns=['words'])
summary_df.sort_values(by = 'words', ascending=False).head(20)

Unnamed: 0,words
love,307
great,306
good,199
soft,174
treat,67
tast,67
best,58
price,57
yum,48
like,45


In [9]:
# review
review_l = df['text_tok'].astype(str).values.tolist()
review_text = ' '.join(review_l)
review_df = pd.DataFrame.from_dict(word_count(review_text), orient='index',columns=['words'])
review_df.sort_values(by = 'words', ascending=False).head(40)

Unnamed: 0,words
love,1247
use,981
soft,785
great,764
like,749
good,720
one,658
tast,646
hair,562
tri,557


### Newly added stopwords
* product/ products
* package
* want, made, would, could, give, said, say
* realli
* lol

### regex substitution
* like/ good/ great/ love/ nice as POSITIVE
* bake/ cookie/ oatmeal/ quaker as FOOD_PRODUCT_1
* dog/ greenie/ treat as FOOD_PRODUCT_2
* coconut oil/ as FOOD_PRODUCT_3
* numerical number (170, 20,...) as NUMBER
* tasty/ delicious/ flavor as ...
* repeating letter words (mmmmmmmm, aaaa, etc.)  --> when a letter repeats 3 or more times, omit?

In [10]:
## divide the reviews into positive (score >= 4) and negative (score <= 3)
def pos_neg(x):
    if x >= 4:
        return 'P'
    else:
        return 'Ng'

df['sensetive'] = df.Score.apply(pos_neg)

In [11]:
# split the dataframe into two parts; .copy() makes each part an independent
# dataframe, which avoids pandas' SettingWithCopyWarning on the assignments below
pos_df = df[df.sensetive == 'P'].copy()
neg_df = df[df.sensetive == 'Ng'].copy()

In [12]:
pos_df['text_tok'] = pos_df.Text.apply(clean_text)
neg_df['text_tok'] = neg_df.Text.apply(clean_text)


In [13]:
def to_matrix(doc):
    # min_df = 0.01 drops bigrams that appear in fewer than 1% of the documents,
    # which reduces dimensionality by removing rarely used phrases
    vectorizer = TfidfVectorizer(ngram_range=(2, 2), min_df=0.01)
    X = vectorizer.fit_transform(doc)
    X = X.toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
    return pd.DataFrame(X, columns=vectorizer.get_feature_names_out())

In [14]:
neg_mx = to_matrix(neg_df['text_tok'])
pos_mx = to_matrix(pos_df['text_tok'])

In [15]:
neg_mx

Unnamed: 0,actual caus,actual would,affect stop,also contain,alway lookout,amazon vine,anoth one,ate one,away bewar,bad review,...,would get,would give,would go,would like,would much,would never,would order,would recommend,would take,year old
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.5,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
166,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.306706,0.0,0.0,0.0,0.341275,0.0,0.0,0.0,0.0


In [16]:
pos_mx

Unnamed: 0,absolut love,also use,best price,brush teeth,ca wait,clean teeth,definit recommend,dri skin,even better,everi day,...,use skin,vox box,voxbox influenst,whole famili,whole grain,work great,work well,would definit,would recommend,year old
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Get the TF-IDF score for each token (n-gram)
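
The per-term score used in this section is the sum of a term's TF-IDF weights over all documents. The sketch below shows that computation in pure Python on three made-up documents. It assumes the weighting that, to my understanding, `TfidfVectorizer` applies by default (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows); `tfidf_rows` is a hypothetical helper and treats each whitespace-separated token as a term.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocab, rows) where each row is an L2-normalised TF-IDF vector."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(t in c for c in counts) for t in vocab}
    # smoothed inverse document frequency (assumed default weighting)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["soft chewi", "soft dri", "tast great"]
vocab, rows = tfidf_rows(docs)

# Score per term = column sum over all documents
scores = {t: sum(row[j] for row in rows) for j, t in enumerate(vocab)}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 4))
```

Because the score is a sum over documents, it grows with the number of reviews in a group, so scores from a large group and a small group are not directly comparable.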

In [17]:
tf_neg = neg_mx.T
tf_neg['score'] = tf_neg.sum(axis =1)
tf_pos = pos_mx.T
tf_pos['score'] = tf_pos.sum(axis =1)

#### top 20 scores for positive

In [18]:
tf_pos_score = tf_pos[['score']].sort_values(by = 'score',ascending = False).head(20)
tf_pos_score

Unnamed: 0,score
soft chewi,63.364194
love soft,53.39922
use hair,50.059379
year old,49.033671
highli recommend,42.803227
kid love,42.039136
mom voxbox,41.781764
tast like,39.038385
tast great,38.964118
absolut love,38.952077


#### top 20 scores for negative

In [19]:
tf_neg_score = tf_neg[['score']].sort_values(by = 'score',ascending = False).head(20)
tf_neg_score

Unnamed: 0,score
individu wrap,3.359883
year old,3.334901
realli like,2.577729
hair skin,2.474868
kid love,2.448023
glass milk,2.275937
littl dri,2.201032
plastic jar,2.118957
throw away,2.056551
read review,2.056375


TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

- Note from prof: difference between first and third points:
- Bullet point 1 is asking you to find the top ngrams for poor reviews after TF-IDF vectorization. Bullet point 3 is asking you for a business analysis explanation that synthesizes the results from 1. It is not a one to one mapping of ngrams to issues. Remember I have said that there is multicollinearity between ngrams generated in vectorization. So just listing the top ngram scores would suffice for 1 but not for 3. You should explain what the pain points are and cite specific examples/action items from the corpus for management to consider

##### The features your analysis showed that customers cited as reasons for a poor review

##### The features your analysis showed that customers cited as reasons for a good review

##### The most common issues identified from your analysis that generated customer dissatisfaction.

## Similarity and Word Embeddings (2 pts)

Using
* `TfidfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both Euclidean distance and cosine similarity.
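
As a rough illustration of the two measures, the sketch below computes both in pure Python on made-up vectors (`cosine_sim` and `euclid_dist` are hypothetical helper names, not scikit-learn functions). For L2-normalized vectors, squared Euclidean distance equals `2 - 2 * cosine similarity`, so the two measures rank pairs in the same order, and two documents with no terms in common sit at distance √2 ≈ 1.4142.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclid_dist(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Three unit-length vectors: u and v share no terms, w overlaps with both
u = [1.0, 0.0]
v = [0.0, 1.0]
w = [0.6, 0.8]

print(cosine_sim(u, v), euclid_dist(u, v))
print(cosine_sim(u, w), euclid_dist(u, w))
```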

In [20]:
## cosine similarity function
def similar_doc_cs(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the two reviews with the highest similarity and their score.

    Uses cosine similarity. Scores >= 0.99 (a review compared with itself, or an
    exact duplicate) are zeroed out, so the highest remaining value is retrieved.
    """
    cs_mx = cosine_similarity(vec_df, vec_df)
    mod = np.where(cs_mx >= .99, 0, cs_mx)
    indices = np.where(mod == mod.max())

    count = 0
    for n,i in enumerate(mod):
        for m,j in enumerate(i):
            # report only the first pair (n < m) within 0.001 of the maximum
            if j >= mod[indices][0] - 0.001 and n < m and count < 1:
                count += 1
                # column 9 of the original dataframe holds the review Text
                print('The first review:', '\n', ori_df.iloc[m,9], '\n------------------------\n', 
            'The second review:', '\n', ori_df.iloc[n,9])
                print(f'The similarity score was {round(j,4)}')


## euclidean distance function
def similar_doc_ed(vec_df, ori_df):
    """
    Takes two arguments: the vectorized dataframe and the original dataframe.
    Prints the two reviews with the smallest distance and that distance.

    Uses Euclidean distance. Distances <= 0.009 (a review compared with itself, or an
    exact duplicate) are replaced with 100, so the smallest remaining value is retrieved.
    """
    cs_mx = euclidean_distances(vec_df, vec_df)
    mod = np.where(cs_mx <= .009, 100, cs_mx)
    indices = np.where(mod == mod.min())

    count = 0
    for n,i in enumerate(mod):
        for m,j in enumerate(i):
            # report only the first pair (n < m) at the minimum distance
            if j <= mod[indices][0] and n < m and count < 1:
                count += 1
                # column 9 of the original dataframe holds the review Text
                print('The first review:', '\n', ori_df.iloc[m,9], '\n------------------------\n', 
            'The second review:', '\n', ori_df.iloc[n,9])
                print(f'The Euclidean distance was {round(j,4)}')
                print(n,m)
                break

In [21]:
t = "I am a dog trainer and have never seen  anything like it....<br /><br />three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.<br />if the cat gets too many she has the runs....<br />sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.<br />s"
html = re.compile(r'<.*?>')
text = html.sub(r'',t)
print(text)

I am a dog trainer and have never seen  anything like it....three weeks later,, the beloved sheltie got a bowel blockage from these, use with caution.if the cat gets too many she has the runs....sheltie did better when i upped her thryoid meds, and gave her doggie asthma meds.s


In [22]:
similar_doc_cs(neg_mx, neg_df)

The first review: 
 I loved this product when I initially received it.  I purchased it to use as a dietary supplement and was pleased to discover that it works wonders on hair and skin as well.  I read the label carefully because I was concerned about whether or not to refrigerate.  The label clearly said that refrigeration was not required.  However, after appx. 2 weeks of using as a daily dietary supplement, I began to notice an awful cramping, nauseous feeling not long after I took it.  It was the sickest I have ever felt in my entire life and I repeated it just to be sure that it was the coconut oil causing it and not something else I had eaten.  Same thing again.  All I can figure is the oil went rancid.  There's no other explanation for why it suddenly started causing me to be sick.  So, word to the wise, consider refrigerating this product.  I've stopped using it altogether because I now associate the smell of it with sickness. :( 
------------------------
 The second review: 


In [23]:
similar_doc_ed(neg_mx, neg_df)

The first review: 
 I loved this product when I initially received it.  I purchased it to use as a dietary supplement and was pleased to discover that it works wonders on hair and skin as well.  I read the label carefully because I was concerned about whether or not to refrigerate.  The label clearly said that refrigeration was not required.  However, after appx. 2 weeks of using as a daily dietary supplement, I began to notice an awful cramping, nauseous feeling not long after I took it.  It was the sickest I have ever felt in my entire life and I repeated it just to be sure that it was the coconut oil causing it and not something else I had eaten.  Same thing again.  All I can figure is the oil went rancid.  There's no other explanation for why it suddenly started causing me to be sick.  So, word to the wise, consider refrigerating this product.  I've stopped using it altogether because I now associate the smell of it with sickness. :( 
------------------------
 The second review: 


In [24]:
similar_doc_cs(pos_mx, pos_df)

The first review: 
 I bought a jar of this thing on Aug 16 for seven bucks (and free delivery if over Amazon's minimum for such). It came pretty fast, in a semi-liquid state but not leaking; rather well packaged. I let it melt completely in a warm room (CO melts at about 80 deg F) and then put it in the fridge to let it solidify. One of the reasons I did (and always do) that is to smooth out the texture: as is, every type of CO I've tried comes in like a bunch of white fibers stuck in more even liquid, but if you thaw it completely and then freeze, it becomes all evenly semi-transparent, like one thick candle sort of thing. It probably doesn't matter; just a personal quirk, I guess.<br /><br />Anyway, this is good stuff: the taste is nice -- very good for sweetich things like carrots or plums, for example (I always have a bowl of very lightly steamed carrots around, and I snack on them, taking one and dipping it into CO, which is good for two reasons: first, it tastes great, and second

In [25]:
cs_mx = euclidean_distances(pos_mx, pos_mx)
cs_mx

array([[0.        , 1.41421356, 1.41421356, ..., 1.41421356, 1.41421356,
        1.41421356],
       [1.41421356, 0.        , 1.41421356, ..., 1.41421356, 1.41421356,
        1.41421356],
       [1.41421356, 1.41421356, 0.        , ..., 1.41421356, 1.41421356,
        1.41421356],
       ...,
       [1.41421356, 1.41421356, 1.41421356, ..., 0.        , 1.41421356,
        1.41421356],
       [1.41421356, 1.41421356, 1.41421356, ..., 1.41421356, 0.        ,
        0.96628938],
       [1.41421356, 1.41421356, 1.41421356, ..., 1.41421356, 0.96628938,
        0.        ]])

In [26]:
similar_doc_ed(pos_mx, pos_df)

The first review: 
 I bought a jar of this thing on Aug 16 for seven bucks (and free delivery if over Amazon's minimum for such). It came pretty fast, in a semi-liquid state but not leaking; rather well packaged. I let it melt completely in a warm room (CO melts at about 80 deg F) and then put it in the fridge to let it solidify. One of the reasons I did (and always do) that is to smooth out the texture: as is, every type of CO I've tried comes in like a bunch of white fibers stuck in more even liquid, but if you thaw it completely and then freeze, it becomes all evenly semi-transparent, like one thick candle sort of thing. It probably doesn't matter; just a personal quirk, I guess.<br /><br />Anyway, this is good stuff: the taste is nice -- very good for sweetich things like carrots or plums, for example (I always have a bowl of very lightly steamed carrots around, and I snack on them, taking one and dipping it into CO, which is good for two reasons: first, it tastes great, and second

## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

**Prior:**

$$\begin{aligned}
P(y = \text{Intent}) &= 3/6 = 1/2 \\
P(y = \text{No Intent}) &= 3/6 = 1/2
\end{aligned}$$

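**Likelihood:**

After removing the stopword `this`, the review contains the words "looks", "so", and "bad". The likelihood of each word is estimated as the fraction of reviews in the class that contain it (no smoothing is applied).

$$\begin{aligned}
P(x \mid y = \text{Intent}) & = P(x = \text{"looks"} \mid y = \text{Intent}) \cdot P(x = \text{"so"} \mid y = \text{Intent}) \cdot P(x = \text{"bad"} \mid y = \text{Intent}) \\
& = (1/3) \cdot (1/3) \cdot (1/3) \\
& = 1/27 \\
\\
P(x \mid y = \text{No Intent}) & = P(x = \text{"looks"} \mid y = \text{No Intent}) \cdot P(x = \text{"so"} \mid y = \text{No Intent}) \cdot P(x = \text{"bad"} \mid y = \text{No Intent}) \\
& = (1/3) \cdot (1/3) \cdot (2/3) \\
& = 2/27
\end{aligned}$$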

**Evidence:**

$$\begin{aligned}
P(x) &= P(x \mid y = \text{Intent}) \cdot P(y = \text{Intent}) + P(x \mid y = \text{No Intent}) \cdot P(y = \text{No Intent}) \\
& = (1/27) \cdot (1/2) + (2/27) \cdot (1/2) \\
& = 1/54 + 2/54 \\
& = 1/18
\end{aligned}$$


**Posterior:**

$$\begin{aligned}
P(y = \text{Intent} \mid x) &= \frac{P(x \mid y = \text{Intent}) \cdot P(y = \text{Intent})}{P(x)} \\
& = \frac{(1/27) \cdot (1/2)}{1/18} \\
& = 1/3 \\
\\
P(y = \text{No Intent} \mid x) &= \frac{P(x \mid y = \text{No Intent}) \cdot P(y = \text{No Intent})}{P(x)} \\
& = \frac{(2/27) \cdot (1/2)}{1/18} \\
& = 2/3
\end{aligned}$$

From the posterior probabilities, "This looks so bad." is classified as `No Intent to Buy`, since $2/3 > 1/3$.
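
The hand computation above can be checked with a short script. This is a minimal sketch under the same assumptions as the worked solution: each word's likelihood is the fraction of reviews in the class that contain it, no smoothing is applied, and `to` and `this` are removed as stopwords. The helper name `tokens` and the regex tokenizer are choices made for this example.

```python
import re
from fractions import Fraction

STOPWORDS = {"to", "this"}


def tokens(text):
    """Lower-case a review and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


train = {
    "Intent": ["Love this movie. Can't wait!",
               "I want to see this movie so bad.",
               "This movie looks amazing."],
    "No Intent": ["Looks bad.",
                  "Hard pass to see this bad movie.",
                  "So boring!"],
}
query = "This looks so bad."

n_docs = sum(len(docs) for docs in train.values())
joint = {}
for label, docs in train.items():
    prior = Fraction(len(docs), n_docs)
    doc_sets = [tokens(d) for d in docs]
    likelihood = Fraction(1)
    for word in sorted(tokens(query)):
        # fraction of this class's reviews that contain the word
        likelihood *= Fraction(sum(word in s for s in doc_sets), len(docs))
    joint[label] = prior * likelihood

evidence = sum(joint.values())
posterior = {label: p / evidence for label, p in joint.items()}
for label, p in posterior.items():
    print(label, p)
```

Without smoothing, a query word that never appears in a class would drive that class's likelihood to zero; adding Laplace (add-one) smoothing is the usual remedy when that happens.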