# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.
## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?


In [1]:
import re
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from textacy import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
df = pd.read_csv('../datasets/amazon_fine_foods.csv')

In [3]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,20983,B002QWP89S,A21U4DR8M6I9QN,"K. M Merrill ""justine""",1,1,5,1318896000,addictive! but works for night coughing in dogs,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,A17TDUBB4Z1PEC,jaded_green,1,1,5,1318550400,genuine Greenies best price,"These are genuine Greenies product, not a knoc..."
2,20985,B002QWP89S,ABQH3WAWMSMBH,tenisbrat87,1,1,5,1317168000,Perfect for our little doggies,"Our dogs love Greenies, but of course, which d..."
3,20986,B002QWP89S,AVTY5M74VA1BJ,tarotqueen,1,1,5,1316822400,dogs love greenies,"What can I say, dogs love greenies. They begg ..."
4,20987,B002QWP89S,A13TNN54ZEAUB1,dcz2221,1,1,5,1316736000,Greenies review,This review is for a box of Greenies Lite for ...


In [4]:
# chips, coffee, coconut oil
df = df[(df['ProductId'] == 'B006HYLW32') | (df['ProductId'] == 'B005K4Q1YA') | (df['ProductId'] == 'B001EO5Q64')]

In [5]:
# using textacy to: remove hyphens, punctuation, and accents
preproc = preprocessing.make_pipeline(
    preprocessing.remove.html_tags,
    preprocessing.normalize.hyphenated_words,
    preprocessing.remove.punctuation,
    preprocessing.remove.accents,
    

)

In [6]:
set(stopwords.words('english'))
# removing some negative words from stopwords list
nltk_stopwords = set(stopwords.words('english'))
nltk_stopwords.remove('below')
nltk_stopwords.remove("aren't")
nltk_stopwords.remove('couldn')
nltk_stopwords.remove("couldn't")
nltk_stopwords.remove("didn't")
nltk_stopwords = list(nltk_stopwords)
# add some abstract terms
nltk_stopwords.append('like')
nltk_stopwords = set(nltk_stopwords)

In [7]:
nltk_stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'between',
 'both',
 'but',
 'by',
 'can',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'like',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 "shan't",
 'she',


While it may be easier to use the pre-built function, it does not allow for changing the stopwords list. Therefore I am creating the function manually.

In [8]:
# removing stopwords 
def remove_stopwords(sentence:str, nltk_stopwords):
    '''removing stopwords from a list, review: string, nltk_stopwords: list'''
    words = nltk.word_tokenize(sentence)
    new_words = []
    for word in words:
        if word.lower() in nltk_stopwords:
            continue
        new_words.append(word)
    cleaned_review = " ".join(new_words)
    return cleaned_review

In [9]:
df['text_cleaned'] = df['Text'].apply(preproc).apply(remove_stopwords, nltk_stopwords=nltk_stopwords)
df.head()


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,text_cleaned
2948,189340,B001EO5Q64,A2H5ROZZC74XN1,Rock Bottom,3,3,5,1176076800,"GREAT COCONUT OIL....Try it, you'll like it!",Nutiva is the BEST COCONUT OIL! I love it and ...,Nutiva BEST COCONUT OIL love looking high qual...
2949,189341,B001EO5Q64,A2L7UE9O293Q2W,J. Garcia,5,6,5,1277769600,Love Love Love this stuff!,I bought Nutiva Organic Coconut Oil from a loc...,bought Nutiva Organic Coconut Oil local natura...
2950,189342,B001EO5Q64,A3KG5Q302QXSCD,M. Wang,7,9,4,1256688000,"Near Perfect for popcorn, experimental for oth...",Edit: Added a second part.<br />=============...,Edit Added second part =======================...
2951,189343,B001EO5Q64,A35JFXTIZ8X1R9,Agi,2,2,5,1349222400,Delicious,At this point I've ordered this coconut oil se...,point ordered coconut oil several time feel co...
2952,189344,B001EO5Q64,A2TVH3F0IM2UYK,kt,2,2,5,1347926400,My favorite thing ever,I don't know how I lived so long without this ...,know lived long without stuff brand leaps boun...


In [10]:
# since the corpus is food reviews, we use regex to group certain words or phrases that are related to certain tastes.
# good tastes
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bdelicious|tasty|yum+y*\b', '_GOOD_TASTE_', case=False)

  df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bdelicious|tasty|yum+y*\b', '_GOOD_TASTE_', case=False)


In [11]:
# sweet
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bsugar(y|ed)?|swe{2,}t(en(ed)?)?\b', '_SWEET_', case=False)

  df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bsugar(y|ed)?|swe{2,}t(en(ed)?)?\b', '_SWEET_', case=False)


In [12]:
# savory
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bsalty*|savou?ry\b', '_SAVORY_', case=False)

  df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bsalty*|savou?ry\b', '_SAVORY_', case=False)


In [13]:
# removing all words with more than 15 digits
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\b([a-z]|[A-Z]){15,}\b', '', case=False)

  df['text_cleaned'] = df['text_cleaned'].str.replace(r'\b([a-z]|[A-Z]){15,}\b', '', case=False)


In [14]:
# bad tastes
df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bdisgusting|ugh\b', '_GOOD_TASTE_', case=False)

  df['text_cleaned'] = df['text_cleaned'].str.replace(r'\bdisgusting|ugh\b', '_GOOD_TASTE_', case=False)


* stemming/lemmatization (explain in your notebook why you used stemming versus lemmatization). 

Same as HW2, we choose lemmatization since it can better account for transformations that are not standard. Such transformations can be quite common in reviews. For example, it is very likely for reviews to contain `better` or `worse`, neither of which can be treated with stemming. Moreover, we do not have a performance limitation as the dataset is rather small and we are not doing realtime transformation.

In [15]:
lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)


In [16]:
df['text_cleaned'] = df['text_cleaned'].apply(lemmatize_sentence)

- Given the rather small dataset, we find trigrams too specific to give meaningful results. Therefore, we choose 2 (bigram).

In [17]:
# 
vectorizer = TfidfVectorizer(ngram_range=(2,2),
                             token_pattern=r'\b[a-zA-Z_]{3,}\b',
                             max_df=0.4, max_features=500, stop_words=nltk_stopwords)
corpus = list(df["text_cleaned"].values)
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names()
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

In [18]:
tf_idf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1663,1664,1665,1666,1667,1668,1669,1670,1671,1672
_good_taste_ also,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
_good_taste_ chip,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
_savory_ pepper,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
_savory_ snack,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
_savory_ vinegar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.222908,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
would recommend,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
would taste,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
would try,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
write review,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
df = df.reset_index(drop=True)

tf_idf_T = tf_idf.T.reset_index(drop=True)
df = pd.concat([df, tf_idf_T], axis=1)
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,...,would definitely,would give,would good,would great,would highly,would recommend,would taste,would try,write review,year old
0,189340,B001EO5Q64,A2H5ROZZC74XN1,Rock Bottom,3,3,5,1176076800,"GREAT COCONUT OIL....Try it, you'll like it!",Nutiva is the BEST COCONUT OIL! I love it and ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,189341,B001EO5Q64,A2L7UE9O293Q2W,J. Garcia,5,6,5,1277769600,Love Love Love this stuff!,I bought Nutiva Organic Coconut Oil from a loc...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,189342,B001EO5Q64,A3KG5Q302QXSCD,M. Wang,7,9,4,1256688000,"Near Perfect for popcorn, experimental for oth...",Edit: Added a second part.<br />=============...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,189343,B001EO5Q64,A35JFXTIZ8X1R9,Agi,2,2,5,1349222400,Delicious,At this point I've ordered this coconut oil se...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,189344,B001EO5Q64,A2TVH3F0IM2UYK,kt,2,2,5,1347926400,My favorite thing ever,I don't know how I lived so long without this ...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
pos = df[df['Score'] >= 4]
pos_bigrams = pos.iloc[:,11:].sum().sort_values(ascending=False)


* the features your analysis showed that customers cited as reasons for a good review


In [21]:
print(pos_bigrams[:20])

coconut oil         105.260802
potato chip          33.206926
_sweet_ potato       26.313662
french vanilla       22.132841
great taste          22.035176
taste good           21.746554
gas station          21.459665
use coconut          20.210719
taste great          19.944107
highly recommend     19.420590
love product         18.467487
_savory_ vinegar     17.852078
great product        17.309826
coconut flavor       17.151517
also use             17.075502
really good          16.793063
grove square         16.358549
oil use              16.049248
pop chips            15.591747
extra virgin         15.203516
dtype: float64


* the features your analysis showed that customers cited as reasons for a poor review

In [22]:
neg = df[df['Score'] < 3]
neg_bigrams = neg.iloc[:,11:].sum().sort_values(ascending=False)

In [27]:
print(neg_bigrams[:50])

instant coffee          16.029648
gas station              9.993391
_savory_ vinegar         6.914928
waste money              6.304783
grove square             6.180801
coffee cup               5.398198
coffee taste             4.835267
artificial sweetener     4.807229
potato chip              4.778847
way _sweet_              4.419706
station cappuccino       4.350727
french vanilla           4.294060
cup coffee               4.108670
hot water                4.026322
one cup                  3.949818
would recommend          3.914274
regular coffee           3.729373
_savory_ pepper          3.689321
coffee instant           3.558832
cream _sweet_            3.095921
ingredient list          3.016992
taste good               2.951859
_sweet_ taste            2.934488
say taste                2.827088
love flavor              2.626422
vegetable oil            2.585642
real coffee              2.573712
taste coffee             2.511256
variety pack             2.500228
could get     

In [24]:
tf_idf = tf_idf.sum(axis=1)
score = pd.DataFrame(tf_idf, columns=["score"])
score.sort_values(by="score", ascending=False, inplace=True)

In [25]:
score

Unnamed: 0,score
coconut oil,108.163197
potato chip,39.953417
gas station,32.427284
_sweet_ potato,30.509144
french vanilla,27.211078
...,...
fat fat,2.123475
oil skin,2.037426
cup brewers,1.972662
brewers count,1.963031


In [26]:
score.to_csv('scores.csv')

### the most common issues identified from your analysis that generated customer dissatisfaction.

- The most common bigram that appears in the negative reviews is the instant coffee complaint. Given that one of the products selected is a Keurig Coffee Cup/Capsule product, we assume it is because the customers were looking for a more premium, and less instant-tasting coffee capsule for their coffee machine. A similar complaint is that the coffee tastes like what you get from a gas station.
- More specifically, there are complaints about the coffee being unnaturally sweet and creamy likely caused by the use of artifical sweeteners. Hazelnut cappuccino and French latte is mentioned rather frequently in negative reviews.
- On a more savory note, the chips are complained to be too salty and vinegary.
- There are also mentions of ingredient list, suggesting that the customers might be concerned with additives, or unhealthy ingredients in the chips.

### Explain to what degree the TF-IDF findings make sense - what are its limitations? 

TF-IDF makes efficient judgement on what are the most important phrases that appears frequently yet also distinctively. While the algorithm does give us some tangible complaints and compliments to work on, we cannot reliably ensure that it can capture a majority of the complaints. For instance, if we select a `max_df` too high, then we will have many themes that are too common and do not make sense. Another limitation is that it does not understand the sentiment of the terms. Even in a negative review, there could still be compliments. Unless it is something that is very obviously negative, we cannot safely assume it is something that should be improved upon. Lastly, having a fixed n for ngram makes it less flexible to find more interesting phrases of different length (maybe we can solve the problem by considering both bi and tri gram, or use collocation).

## Similarity and Word Embeddings (2 pts)

Using
* `TfIdfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both Euclidean distance and cosine similarity.

In [35]:
import spacy

# load spacy en_core_web_md model
nlp = spacy.load("en_core_web_md")

In [38]:
# We are using the same dataset with preprocessing as before
reviews = list(df['text_cleaned'].values)

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

tf_idf_lookup_table = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())

In [40]:
DOCUMENT_SUM_COLUMN = "DOCUMENT_TF_IDF_SUM"

# sum the tf idf scores for each document
tf_idf_lookup_table[DOCUMENT_SUM_COLUMN] = tf_idf_lookup_table.sum(axis=1)
available_tf_idf_scores = tf_idf_lookup_table.columns # a list of all the columns we have
available_tf_idf_scores = list(map( lambda x: x.lower(), available_tf_idf_scores)) # lowercase everything

In [42]:
reviews_vectors = []
for idx, tweet in enumerate(reviews): # iterate through each review
    tokens = nlp(tweet) # have spacy tokenize the review text
    
    # initially start a running total of tf-idf scores for a document
    total_tf_idf_score_per_document = 0
    
    # start a running total of initially all zeroes (300 is picked since that is the word embedding size used by word2vec)
    running_total_word_embedding = np.zeros(300) 
    for token in tokens: # iterate through each token
    
    # if the token has a pretrained word embedding it also has a tf-idf score
        if token.has_vector and token.text.lower() in available_tf_idf_scores:
            
            tf_idf_score = tf_idf_lookup_table.loc[idx, token.text.lower()]
            #print(f"{token} has tf-idf score of {tf_idf_lookup_table.loc[idx, token.text.lower()]}")
            running_total_word_embedding += tf_idf_score * token.vector
            
            total_tf_idf_score_per_document += tf_idf_score
    
    # divide the total embedding by the total tf-idf score for each document
    document_embedding = running_total_word_embedding / total_tf_idf_score_per_document
    reviews_vectors.append(document_embedding)


In [49]:
def most_similar(method, top_n:int = 50):
    similarities = pd.DataFrame(method(reviews_vectors), columns=reviews, index=reviews)
    similarities = similarities.unstack().reset_index()
    similarities.columns = ["review1", "review2", "similarity"]
    similarities = similarities[similarities["similarity"] < 0.9999999999]
    similarities.drop_duplicates(subset=["similarity"], inplace=True)
    for idx, row in similarities.sort_values(by="similarity", ascending=False).head(top_n).iterrows():
        print(row["review1"])
        print("--" * 10)
        print(row["review2"])
        print("\n\n")
    return similarities.sort_values(by="similarity", ascending=False).head(top_n)

### Comparing using Euclidean Distance

In [50]:
from sklearn.metrics.pairwise import euclidean_distances


In [51]:
euc_sim = most_similar(euclidean_distances)

nothing great thing say oil use two way cook add slight coconut essence hair create bar use tupperware solid room temperature use wash hair leave hair soft shiny greasy rub hand run wet hair let air dry also sometimes use skin absorbs well leave greasy baby oil Highly recommend
--------------------
heard benefit coconut oil friend swear coconut water quite possible nasty drink ever make decide give try first use hair African American female natural hair use coconut oil seal moisture hair condition hair morning give hair beautiful sheen keep soft throughout day definitely love hair benefit also use patch really dry skin right leg suffer eczema tends get bad dry winter month rub every morning shower see significant improvement two thumb skin benefit well far cook go consider use sautee fish make coconut lime tilapia dish saw recipe anyone suggestion use cook would love know use place oil know different oil different heat capacity know could fry type oil Thanks



accident stumble product

### Comparing using cosine similarity

In [52]:
from sklearn.metrics.pairwise import cosine_similarity


In [53]:
cos_sim = most_similar(cosine_similarity)

happy say find texture jar harder think tho_GOOD_TASTE_ may bite frozen outside box definitely smell coconut mound bar concern popcorn taste nothing coconut surprise buy wonderful taste give popcorn believe never realize movie reason popcorn taste good bit coconut flavor also leave popcorn machine clean oil coat whole machine greasy gunk Combine Eden Organics Organic popcorn kernel Amazon total organic popcorn receive mini popcorn machine would find theater Christmas go sample pack yellow gunk oil come pop search healthy natural way make popcorn read lot different oil people use one thing always want taste movie use coconut oil
--------------------
receive mini popcorn machine would find theater Christmas go sample pack yellow gunk oil come pop search healthy natural way make popcorn read lot different oil people use one thing always want taste movie use coconut oil find texture jar harder think tho_GOOD_TASTE_ may bite frozen outside box definitely smell coconut mound bar concern popc

## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

- Corpus after removing stopwords and standardizing cases: 
**Intent to Buy Tickets:**
1.	love movie can’t wait
2.	i want see movie so bad
3.	movie looks amazing

**No Intent to Buy Tickets:**
1.	looks bad
2.	hard pass see bad movie
3.	so boring

- review after removing stopwords and standardizing cases:
> looks so bad
- Priors:
$$\begin{aligned}
p(y=good) = p(y=bad) = \frac{1}{2}
\end{aligned}
$$
- Likelihood:
$$\begin{aligned}
p(x|y=good) &= \prod p(x_i|y=good) \\
            &= p(x = looks|y=good)p(x = so|y=good)p(x = bad|y=good) \\
            &= \frac{1}{3}\frac{1}{3}\frac{1}{3} \\
            &= \frac{1}{27}
\end{aligned}
$$
$$
\begin{aligned}
p(x|y=bad) &= \prod p(x_i|y=bad) \\
            &= p(x = looks|y=bad)p(x = so|y=bad)p(x = bad|y=bad) \\
            &= \frac{1}{3}\frac{1}{3}\frac{2}{3} \\
            &= \frac{2}{27}
\end{aligned}
$$
- Evidence:
$$\begin{aligned}
p(x) &= \sum p(x|y=good)p(y=good) + p(x|y=bad)p(y=bad) \\
            &= \frac{1}{27}\frac{1}{2} + \frac{2}{27}\frac{1}{2} \\ 
            &= \frac{3}{54}
\end{aligned}
$$
- Result:
$$\begin{aligned}
p(y=good|x) &= \frac{p(x|y=good)p(y=good)}{p(x)} \\
            &= \frac{\frac{1}{54}}{\frac{3}{54}} \\
            &= \frac{1}{3}
\end{aligned}
$$
$$
\begin{aligned}
p(y=bad|x) &= \frac{p(x|y=bad)p(y=bad)}{p(x)} \\
            &= \frac{\frac{2}{54}}{\frac{3}{54}} \\
            &= \frac{2}{3}
\end{aligned}
$$
**Since $p(y=bad|x) > p(y=good|x)$, we determine this is a negative review. **