# HW3

Submit via Slack. Due on **Tuesday, April 12th, 2022, 6:29pm PST**. You may work with one other person.
## TF-IDF (5pts)

You are an analyst working for Amazon's product team, and charged with identifying areas for improvement for the toy reviews.

Using the **amazon-fine-foods.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?

Finally, generate a TF-IDF report that explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the features your analysis showed that customers cited as reasons for a good review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?


## Similarity and Word Embeddings (2 pts)

Using
* `TfidfVectorizer`

Identify the most similar pair of reviews from the `amazon-fine-foods.csv` dataset using both Euclidean distance and cosine similarity.

## Naive Bayes (3pts)

You are an NLP data scientist working at Fandango. You observe the following dataset in your review comments:

**Intent to Buy Tickets:**
1.	Love this movie. Can’t wait!
2.	I want to see this movie so bad.
3.	This movie looks amazing.

**No Intent to Buy Tickets:**
1.	Looks bad.
2.	Hard pass to see this bad movie.
3.	So boring!

You can consider the following stopwords for removal: `to`, `this`.

Is the following review an `Intent to Buy` or `No Intent to Buy`? Show your work for each computation.
> This looks so bad.

You'll need to compute:
* Prior
* Likelihood
* Posterior

In [1]:
# Importing libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import warnings
warnings.filterwarnings('ignore')
import nltk
# nltk.download('omw-1.4')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

In [2]:
# defining the required functions

lemmatizer = WordNetLemmatizer()

# function to convert nltk tag to wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:          
        return None

def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    #tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:        
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

# 1. TF-IDF (5 pts)

In [3]:
# Reading data
data = pd.read_csv('../datasets/amazon_fine_foods.csv')
display(data.head(2))
print(data.shape)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,20983,B002QWP89S,A21U4DR8M6I9QN,"K. M Merrill ""justine""",1,1,5,1318896000,addictive! but works for night coughing in dogs,my 12 year old sheltie has chronic brochotitis...
1,20984,B002QWP89S,A17TDUBB4Z1PEC,jaded_green,1,1,5,1318550400,genuine Greenies best price,"These are genuine Greenies product, not a knoc..."


(11903, 10)


Products filtered:
* B002QWP89S - Greenie dog treats
* B0013NUGDE - Popchips potato chips
* B007JFMH8M - Quaker soft baked oatmeal cookies
* B000KV61FC - Tug a jug meal dispensing dog toy

In [4]:
# Filtering 4 products for analysis
searchfor = ['B002QWP89S','B0013NUGDE','B007JFMH8M','B000KV61FC']
data = data[data['ProductId'].str.contains('|'.join(searchfor))]
data['Product'] = data['ProductId'].apply(lambda x: 'Greenie' if x == 'B002QWP89S'
                                          else 'Popchips' if x == 'B0013NUGDE'
                                          else 'Quaker_Oatmeal' if x == 'B007JFMH8M'
                                          else 'Tug_a_jug')

In [5]:
# Removing columns which will not aid analysis and also dropping duplicates
data.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator','HelpfulnessDenominator', 'Time', 'Summary',],axis = 1, inplace = True)
data.drop_duplicates('Text',inplace = True)

In [6]:
# Examining word count
data["Text"] = data['Text'].str.lower().str.replace(r'[^\w\s]', '', regex=True)  # lowercase and strip punctuation
new_df = data.Text.str.split(expand=True).stack().value_counts().reset_index()
new_df.columns = ['Word', 'Frequency']

# Printing the most frequent words
new_df['Word'][:50].values

array(['the', 'i', 'and', 'a', 'to', 'it', 'of', 'my', 'for', 'is',
       'they', 'are', 'this', 'in', 'these', 'that', 'but', 'them', 'was',
       'have', 'with', 'so', 'not', 'on', 'dog', 'you', 'out', 'as', 'he',
       'like', 'one', 'cookies', 'cookie', 'great', 'she', 'good', 'br',
       'be', 'chips', 'get', 'love', 'her', 'at', 'soft', 'toy', 'just',
       'very', 'would', 'if', 'we'], dtype=object)

- This shows that stopwords must be removed: they dominate the frequency counts but say nothing about the products

- We remove the standard nltk English stopwords along with corpus-specific ones such as 'br' (HTML line-break residue), 'potato' and 'quaker'

- The following stopwords are retained because they carry sentiment and will be useful in n-grams: ("not","doesn't","didn't","very","too")
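To illustrate why the negations are kept, here is a minimal sketch. The stopword list and the review are made up for illustration (the notebook itself uses nltk's English list), and `remove_stops` is a hypothetical helper.

```python
# Hypothetical mini stopword list; the notebook uses nltk's English list instead.
default_stops = {"the", "is", "it", "my", "not", "very", "does"}
keep = {"not", "very"}                  # sentiment-bearing words to retain
custom_stops = default_stops - keep

review = "the toy is not very durable"  # made-up review for illustration

def remove_stops(text, stops):
    """Drop every whitespace-separated token that appears in `stops`."""
    return " ".join(tok for tok in text.split() if tok not in stops)

print(remove_stops(review, default_stops))  # negation is lost
print(remove_stops(review, custom_stops))   # negation survives
```

With the default list the review collapses to "toy durable", which reads as praise; with the custom list it stays "toy not very durable".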

In [7]:
# Removing stopwords using nltk
# stop words to remove from default stopwords list and to add to list
stops = ["not","doesn't","didn't","very","too"]
to_add = ["potato","br","quaker",'influenster','mom','vox box','voxbox','oatmeal']
# creating final list of stopwords
my_stopwords = list(set(stopwords.words('english')) - set(stops) | set(to_add))

#removing all stopwords from dataframe
data['Text'] = data['Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (my_stopwords)]))
data.head(2)

Unnamed: 0,Score,Text,Product
0,5,12 year old sheltie chronic brochotitis meds t...,Greenie
1,5,genuine greenies product not knockoff dogs lov...,Greenie


Upon analyzing the reviews, it appears that poor reviews usually have a score of 2 or less and good reviews a score of 4 or more, with a score of 3 usually corresponding to a neutral review. The next cell encodes these as 0 (poor), 1 (neutral) and 2 (good).

In [8]:
# categorizing review as good or poor
data['good_review'] = data['Score'].apply(lambda x: 2 if x > 3 else 0 if x < 3 else 1)
data.head(2)

Unnamed: 0,Score,Text,Product,good_review
0,5,12 year old sheltie chronic brochotitis meds t...,Greenie,2
1,5,genuine greenies product not knockoff dogs lov...,Greenie,2


We should opt for lemmatization over stemming for the following reasons:
- the corpus is small enough that the extra cost of lemmatization is acceptable
- since we are looking for the reasons a customer leaves a good or poor review, the n-grams need to stay readable; lemmatization uses each word's part of speech to reduce it to a valid base form, which also helps reduce noise

In [9]:
# regex cleaning and substitution
data['Text'] = data['Text'].str.lower() # converting to lower case
data['Text'] = data['Text'].str.replace(r'[\.]{2,}', ' ',regex = True) # two or more consecutive periods
data['Text'] = data['Text'].str.replace(r'\bteeth\sclean\b|\bclean\steeth\b','cleanteeth',regex = True) # clean teeth
data['Text'] = data['Text'].str.replace(r'\bsour\scream\sonion\b','sourcreamonion',regex = True) # sour cream flavor
data['Text'] = data['Text'].str.replace(r'\bsoft\sbaked\b','softbaked',regex = True) # soft bake oatmeal
data['Text'] = data['Text'].str.replace(r'\bcooki','',regex = True) # strip the 'cooki' prefix (product-type word); note this leaves fragments such as 'es' from 'cookies'

In [10]:
# stemming or lemmatization
data['Text Lemmatized'] = data['Text'].apply(lemmatize_sentence)

In [11]:
# creating two different datasets for good and bad reviews
data_good = data[data['good_review'] == 2]
print(f'Total number of good reviews: {data_good.shape[0]}')
data_poor = data[data['good_review'] == 0]
print(f'Total number of poor reviews: {data_poor.shape[0]}')

Total number of good reviews: 2154
Total number of poor reviews: 280


We decided to use 3- and 4-grams in our analysis (the vectorizer below is set to 4-grams), as we noticed that one- and two-word n-grams were usually taken up by the product name or type and revealed little about customer sentiment toward those products.
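The effect of `n` can be sketched with a small helper. The function `ngrams` and the sample text are illustrative assumptions, not part of the notebook's pipeline, which relies on `TfidfVectorizer`'s `ngram_range` instead.

```python
def ngrams(text, n):
    """Return all contiguous n-word sequences from a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "love salt vinegar chip flavor"  # made-up cleaned review

print(ngrams(sample, 2))  # short n-grams: mostly product or flavor names
print(ngrams(sample, 4))  # longer n-grams: product words plus the opinion
```

Here the bigrams include "salt vinegar" and "vinegar chip", which only name the product, while the 4-gram "love salt vinegar chip" also carries the sentiment.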

In [12]:
# creating separate dataframes for each product
products = ['Greenie','Popchips','Quaker_Oatmeal','Tug_a_jug']
good_dict = {}
poor_dict = {}
for p in products:
    good_dict[p] = data_good[data_good['Product'] == p]
    print(p,' good size ',good_dict[p].shape)
    poor_dict[p] = data_poor[data_poor['Product'] == p]
    print(p,' poor size ',poor_dict[p].shape)

Greenie  good size  (571, 5)
Greenie  poor size  (38, 5)
Popchips  good size  (457, 5)
Popchips  poor size  (59, 5)
Quaker_Oatmeal  good size  (838, 5)
Quaker_Oatmeal  poor size  (16, 5)
Tug_a_jug  good size  (288, 5)
Tug_a_jug  poor size  (167, 5)


In [13]:
# initializing tf-idf vectorization for good reviews
vectorizer = TfidfVectorizer(ngram_range=(4,4),
                             token_pattern=r'\b[a-zA-Z_]{3,}\b',
                             max_df=0.2, max_features=200, stop_words=stopwords.words())


In [14]:
# top n grams for good reviews of products
good_scores = {}
for p in products:
    corpus = list(good_dict[p]["Text Lemmatized"].values)

    X = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    good_scores[p] = score
    print(f'Top 10 n grams for good reviews of product {p}: \n')
    display(score.head(10))

Top 10 n grams for good reviews of product Greenie: 



Unnamed: 0,score
keep teeth gum clean,3.0
teeth clean vet since,2.0
amazon absolute best price,2.0
make breath smell good,2.0
love help keep cleanteeth,2.0
year old golden retriever,2.0
help keep cleanteeth breath,2.0
treat love good teeth,2.0
year vet say teeth,1.707107
love greenies help keep,1.707107


Top 10 n grams for good reviews of product Popchips: 



Unnamed: 0,score
love salt vinegar chip,3.675161
best chip ive ever,3.0
taste like diet food,2.379232
single serve bag pack,2.379232
dont feel guilty eat,2.379232
low fat low calorie,2.372876
healthy alternative regular chip,2.0
feel like eat real,2.0
look forward try flavor,2.0
like salt vinegar flavor,2.0


Top 10 n grams for good reviews of product Quaker_Oatmeal: 



Unnamed: 0,score
make whole grain oat,8.112036
cant wait try flavor,4.347384
soft chewy taste like,3.554868
soft chewy taste great,3.0
delicious definitely recommend everyone,2.691264
enough satisfy sweet tooth,2.669592
taste like home make,2.57509
taste like home bake,2.398371
individually packed make easy,2.251557
receive free sample softbaked,2.0


Top 10 n grams for good reviews of product Tug_a_jug: 



Unnamed: 0,score
span classtiny length minsbr,4.137699
keep busy long time,2.407807
keep busy hour try,2.350411
year old border collie,2.192329
get food keep entertain,2.0
year use every day,2.0
doesnt show sign wear,1.681844
rope knot inside jug,1.681844
ping pong ball inside,1.681844
classtiny length minsbr spani,1.645614


Top reasons for a good review:
* Greenie dog treats: dogs enjoy the treat and it helps keep their teeth clean
* Popchips potato chips: customers enjoy the taste and appreciate that it has a higher nutritional value than other snacks
* Quaker soft baked oatmeal cookies: reviewers enjoy the soft, chewy texture and find the cookies to be like home-baked ones. Customers also find the individual packaging convenient
* Tug a jug meal dispensing dog toy: in good reviews, customers mentioned that the product kept their dogs occupied and entertained while also dispensing treats for them

In [15]:
# top n grams for poor reviews of products
poor_scores = {}
for p in products:
    corpus = list(poor_dict[p]["Text Lemmatized"].values)
    X = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    poor_scores[p] = score
    print(f'Top 10 n grams for poor reviews of product {p}: \n')
    display(score.head(10))

Top 10 n grams for poor reviews of product Greenie: 



Unnamed: 0,score
poor thing death maybe,1.0
pup teeth agressive chewer,0.707107
remain greenies worm please,0.707107
product contains toxic ingriedant,0.707107
propylne glycol list treat,0.707107
puppy even grow tiny,0.707107
purposebr puppy even grow,0.707107
pooch love veggi dental,0.707107
pomeranian due bite chunk,0.707107
receive regular size need,0.707107


Top 10 n grams for poor reviews of product Popchips: 



Unnamed: 0,score
popchips however like flavor,1.0
salt vinegar wish could,1.0
popchips box order amazon,1.0
purchase past enjoyed whole,1.0
quite disappointed best nothing,1.0
salt vinegar horrible blecht,1.0
pop chip satisfy good,1.0
popchips month nowlove flavor,1.0
salt pepper flavor fantastic,1.0
realised ate later stink,1.0


Top 10 n grams for poor reviews of product Quaker_Oatmeal: 



Unnamed: 0,score
toddler love literally eat,0.707107
momvoxbox influensteri loverhowever favsmy,0.707107
toss bite kid throw,0.707107
much toss bite kid,0.707107
obviously sugar seem bit,0.57735
serve family obviously sugar,0.57735
rolled oat instead oat,0.57735
would serve family obviously,0.57735
oat flour crumbly mealy,0.57735
oat instead oat flour,0.57735


Top 10 n grams for poor reviews of product Tug_a_jug: 



Unnamed: 0,score
matter hard try food,1.679176
month old lab puppy,1.679176
hard wood floor loud,1.561844
wish could get money,1.414214
could get money back,1.414214
doodle chew rope within,1.268951
golden doodle chew rope,1.268951
month old yellow lab,1.24102
think would great toy,1.2267
toy last ten minute,1.018738


Top reasons for a poor review and customer dissatisfaction:
* Greenie dog treats: customers are concerned that the treats contain toxic ingredients and can even be fatal to their pets
* Popchips potato chips: customers pointed out that the texture of the chips isn't great. Also, the chips do not taste great, with some customers even saying that they taste like chemicals and contain preservatives
* Quaker soft baked oatmeal cookies: customers do not appreciate the texture, which some find crumbly, or the fact that the cookies are made with oat flour. Some customers also complained about the high sugar content
* Tug a jug meal dispensing dog toy: customers mentioned that their dogs chewed through the rope and that the product was not very durable and long lasting

TF-IDF and its limitations:
* TF-IDF findings help us identify common themes across the text; however, they do not reveal the exact meaning of the text. Rather, they reflect how often pre-processed words occur together in n-grams.
* TF-IDF addresses a weakness of the plain Bag of Words technique by down-weighting terms that appear in many documents, which highlights each term's relevance to a document relative to the whole corpus
* Since it is based on the BOW model, it does not capture semantics or word order beyond the chosen n-gram length
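The word-order limitation can be sketched with a small pure-Python TF-IDF. This uses the textbook formula (term frequency times `log(N / document frequency)`), which differs from scikit-learn's smoothed, L2-normalized variant; the three documents are made up and `tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

docs = ["dog chews toy", "toy chews dog", "dog loves treat"]  # made-up reviews

def tfidf(docs):
    """Textbook unigram TF-IDF: (count / doc length) * log(N / document frequency)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

vecs = tfidf(docs)
print(vecs[0] == vecs[1])  # the two reviews differ only in word order
print(vecs[0]["dog"])      # "dog" appears in every document
```

The first two documents describe opposite events yet receive identical unigram vectors, and "dog" gets zero weight because it occurs in every document.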

### Similarity and word embeddings

In [16]:
data.head(10)

Unnamed: 0,Score,Text,Product,good_review,Text Lemmatized
0,5,12 year old sheltie chronic brochotitis meds t...,Greenie,2,12 year old sheltie chronic brochotitis med th...
1,5,genuine greenies product not knockoff dogs lov...,Greenie,2,genuine greenies product not knockoff dog love...
2,5,dogs love greenies course doggies dont bought ...,Greenie,2,dog love greenies course doggy dont buy dashch...
3,5,say dogs love greenies begg time always sit cu...,Greenie,2,say dog love greenies begg time always sit cup...
4,5,review box greenies lite dog package came quic...,Greenie,2,review box greenies lite dog package come quic...
5,5,highly recommend chews dog exactly say freshen...,Greenie,2,highly recommend chew dog exactly say freshen ...
6,5,always used greenies dogs seniors wonderful el...,Greenie,2,always use greenies dog senior wonderful elder...
7,5,youve got hard chewers like one dogs things wo...,Greenie,2,youve get hard chewer like one dog thing wont ...
8,5,tried greenies dog 8 years ago wont eat tried ...,Greenie,2,tried greenies dog 8 year ago wont eat tried d...
9,5,fabulous treat little cavalier king charles lo...,Greenie,2,fabulous treat little cavalier king charles lo...


In [17]:
vectorizer = TfidfVectorizer(ngram_range=(1,1),
                             token_pattern=r'\b[a-zA-Z_]{3,}\b',
                             max_df=0.2, max_features=200, stop_words=stopwords.words())

In [18]:
corpus = list(data["Text Lemmatized"].values)

X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
vector = X.toarray()
# tf_idf = tf_idf.sum(axis=1)
# score = pd.DataFrame(tf_idf, columns=["score"])
# score.sort_values(by="score", ascending=False, inplace=True)

In [19]:
from sklearn.metrics.pairwise import euclidean_distances
from scipy.spatial.distance import cosine
from numpy import dot
import numpy as np
from numpy.linalg import norm

def cosine_similarity(A, B):
    numerator = dot(A, B)
    denominator = norm(A) * norm(B)
    return numerator / denominator  # this is the similarity; take 1 - similarity to get the distance

def cosine_distance(A, B):
    return 1 - cosine_similarity(A, B)

In [20]:
e_score_list=[]
c_score_list=[]

In [21]:
for i,r1 in enumerate(vector):
    e_max_score = 999999
    e_idx=-1
    c_max_score = -1
    c_idx=-1
    if sum(vector[i]) == 0:
        continue
    for j in range(i+1,len(vector)):
        if sum(vector[j]) == 0:
            continue
        c_score=1-cosine(vector[i], vector[j])
        X = [vector[i], vector[j]]
        e_score=euclidean_distances(X).ravel()[1]
        if c_score > c_max_score:
            c_max_score = c_score
            c_idx=j
        if e_score<e_max_score:
            e_max_score = e_score
            e_idx=j
    e_score_list.append([i,e_idx,e_max_score])
    c_score_list.append([i,c_idx,c_max_score])

In [22]:
sorted(c_score_list, key=lambda x: x[2], reverse = True)[:5]

[[2, 316, 1],
 [2098, 2564, 1],
 [469, 473, 0.9302869079538333],
 [1512, 1560, 0.9167575405018016],
 [1774, 2317, 0.901006456333981]]

In [23]:
print('According to Cosine Similarity, the similar reviews are:')
print(f'Review 1: {data.iloc[2]["Text"]}')
print(f'Review 2: {data.iloc[316]["Text"]}')
print(' & ')
print('According to Cosine Similarity, the similar reviews are:')
print(f'Review 1: {data.iloc[2098]["Text"]}')
print(f'Review 2: {data.iloc[2564]["Text"]}')

According to Cosine Similarity, the similar reviews are:
Review 1: dogs love greenies course doggies dont bought dashchund minpin perfect great price great product could ask
Review 2: dogs love greenies course doggies dont bought dashchund minpin perfect great price great product could ask
 & 
According to Cosine Similarity, the similar reviews are:
Review 1: recieved gift box exxited try soft chewy great w mik
Review 2: like soft chewy es youll like youre going buy one box go fast


In [24]:
sorted(e_score_list, key=lambda x: x[2])[:5]

[[2, 316, 0.0],
 [2098, 2564, 0.0],
 [469, 473, 0.37339815759097356],
 [1512, 1560, 0.408025635219648],
 [1774, 2317, 0.44495739945756374]]
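Both metrics return the same top pairs, which is expected here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and for unit vectors the Euclidean distance `d` and cosine similarity `c` satisfy `d**2 = 2 * (1 - c)`. The sketch below checks this relation against the scores printed above; the numbers are copied from the two outputs and nothing is recomputed from the data.

```python
import math

# (cosine similarity, Euclidean distance) pairs copied from the outputs above
pairs = [
    (0.9302869079538333, 0.37339815759097356),
    (0.9167575405018016, 0.408025635219648),
    (0.901006456333981, 0.44495739945756374),
]

for cos_sim, euclid in pairs:
    # distance implied by the cosine similarity for unit-length vectors
    predicted = math.sqrt(2 * (1 - cos_sim))
    print(f"reported {euclid:.6f} vs predicted {predicted:.6f}")
```

Because one score is a monotonic function of the other, the two rankings agree for this vectorizer configuration. They could differ if the vectors were not normalized.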

In [25]:
print('According to Euclidean Distance, the similar reviews are:')
print(f'Review 1: {data.iloc[2]["Text Lemmatized"]}')
print(f'Review 2: {data.iloc[316]["Text Lemmatized"]}')
print(' & ')
print('According to Euclidean Distance, the similar reviews are:')
print(f'Review 1: {data.iloc[2098]["Text"]}')
print(f'Review 2: {data.iloc[2564]["Text"]}')

According to Euclidean Distance, the similar reviews are:
Review 1: dog love greenies course doggy dont buy dashchund minpin perfect great price great product could ask
Review 2: dog love greenies course doggy dont buy dashchund minpin perfect great price great product could ask
 & 
According to Euclidean Distance, the similar reviews are:
Review 1: recieved gift box exxited try soft chewy great w mik
Review 2: like soft chewy es youll like youre going buy one box go fast


### NAIVE BAYES

In [26]:
documents = [
    ("Love this movie. Can’t wait!", "Intent to Buy"),
    ("I want to see this movie so bad.", "Intent to Buy"),
    ("This movie looks amazing", "Intent to Buy"),
    ("Looks bad.","No Intent to Buy"),
    ("Hard pass to see this bad movie.",'No Intent to Buy'),
    ('So boring!','No Intent to Buy')]

In [27]:
stopwords = set(['to','this'])

In [28]:
import re
pattern = re.compile(r'\b(' + r'|'.join(stopwords) + r')\b\s*')

In [29]:
corpus = set()

# Build corpus
for document in documents:
    text = document[0]
    class_value = document[1]
    for word in text.split():
        word=word.lower()
        word=re.sub('[^a-zA-Z0-9 \n]','', word)
        word=pattern.sub('', word)
        corpus.add(word)

In [30]:
corpus = corpus - stopwords
corpus.discard('')  # drop the empty token left behind when a stopword is stripped

Priors:<br>
- P(Y='Intent to Buy') = 1/2 (3 out of 6 documents)
- P(Y='No Intent to Buy') = 1/2 (3 out of 6 documents)

In [31]:
conditional_probabilities = pd.DataFrame(index=list(corpus), 
                                         columns=["likelihood_given_Intent", "likelihood_given_No_Intent"])

In [32]:
intent_documents = 0
no_intent_documents = 0
for document in documents:
    if document[1] == "Intent to Buy":
        intent_documents += 1
    else:
        no_intent_documents += 1

    print(f"{document}")
    print(f"No intent to buy documents: {no_intent_documents}")
    print(f"Intent to buy documents: {intent_documents} \n\n")
    
p_intent = intent_documents / (no_intent_documents + intent_documents)
p_no_intent= no_intent_documents / (no_intent_documents + intent_documents)

('Love this movie. Can’t wait!', 'Intent to Buy')
No intent to buy documents: 0
Intent to buy documents: 1 


('I want to see this movie so bad.', 'Intent to Buy')
No intent to buy documents: 0
Intent to buy documents: 2 


('This movie looks amazing', 'Intent to Buy')
No intent to buy documents: 0
Intent to buy documents: 3 


('Looks bad.', 'No Intent to Buy')
No intent to buy documents: 1
Intent to buy documents: 3 


('Hard pass to see this bad movie.', 'No Intent to Buy')
No intent to buy documents: 2
Intent to buy documents: 3 


('So boring!', 'No Intent to Buy')
No intent to buy documents: 3
Intent to buy documents: 3 




In [33]:
### Likelihood

In [34]:
for word in corpus:
    
    intent_documents_with_word = 0
    no_intent_documents_with_word = 0
    
    for document in documents:
        document=list(document)
        document[0]=document[0].lower()
        document[0]=re.sub('[^a-zA-Z0-9 \n]','', document[0])
        document[0]=pattern.sub('', document[0])
        #document=set(document)
        #print(document[1])
        document_class = list(document[1])
        if word in document[0].split():
            if document[1] == 'Intent to Buy':
                intent_documents_with_word += 1
            else:
                no_intent_documents_with_word += 1
    
    print(f"For word {word}, {intent_documents_with_word} intent out of {intent_documents} intent documents.")
    print(f"For word {word}, {no_intent_documents_with_word} no_intent out of {no_intent_documents} no_intent documents.\n")
    conditional_probabilities.loc[word, "likelihood_given_Intent"] = intent_documents_with_word * 1.0 / intent_documents
    conditional_probabilities.loc[word, "likelihood_given_No_Intent"] = no_intent_documents_with_word * 1.0 / no_intent_documents

For word see, 1 intent out of 3 intent documents.
For word see, 1 no_intent out of 3 no_intent documents.

For word cant, 1 intent out of 3 intent documents.
For word cant, 0 no_intent out of 3 no_intent documents.

For word bad, 1 intent out of 3 intent documents.
For word bad, 2 no_intent out of 3 no_intent documents.

For word pass, 0 intent out of 3 intent documents.
For word pass, 1 no_intent out of 3 no_intent documents.

For word looks, 1 intent out of 3 intent documents.
For word looks, 1 no_intent out of 3 no_intent documents.

For word love, 1 intent out of 3 intent documents.
For word love, 0 no_intent out of 3 no_intent documents.

For word movie, 3 intent out of 3 intent documents.
For word movie, 1 no_intent out of 3 no_intent documents.

For word hard, 0 intent out of 3 intent documents.
For word hard, 1 no_intent out of 3 no_intent documents.

For word want, 1 intent out of 3 intent documents.
For word want, 0 no_intent out of 3 no_intent documents.

For word boring, 0 

In [35]:
conditional_probabilities

Unnamed: 0,likelihood_given_Intent,likelihood_given_No_Intent
see,0.333333,0.333333
cant,0.333333,0.0
bad,0.333333,0.666667
pass,0.0,0.333333
looks,0.333333,0.333333
love,0.333333,0.0
movie,1.0,0.333333
hard,0.0,0.333333
want,0.333333,0.0
boring,0.0,0.333333


In [36]:
test_document = 'This looks so bad.'
test_document=test_document.lower()
test_document=re.sub('[^a-zA-Z0-9 \n]','', test_document)
test_document=test_document.split()

In [37]:
# remove the stopwords from the tokenized test document
stopwords = set(['to','this'])
test_document = list(set(test_document) - stopwords)

In [38]:
test=[]
test = " ".join(test_document)

In [39]:
def get_likelihood(test_document, conditional_probabilities):
    likelihood_yes = 1
    likelihood_no = 1
    for word in test_document:  # iterate over the tokens passed in, not the global `test`
        likelihood_yes = likelihood_yes * conditional_probabilities.loc[word, "likelihood_given_Intent"]
        likelihood_no = likelihood_no * conditional_probabilities.loc[word, "likelihood_given_No_Intent"]
    
    return likelihood_yes, likelihood_no

In [40]:
likelihood_intent, likelihood_no_intent = get_likelihood(test_document, conditional_probabilities)
likelihood_intent, likelihood_no_intent

(0.037037037037037035, 0.07407407407407407)

POSTERIOR

In [41]:
def get_posterior(likelihood_yes, likelihood_no, p_yes, p_no):
    posterior_yes = likelihood_yes * p_yes / (likelihood_yes * p_yes + likelihood_no * p_no)
    posterior_no = likelihood_no * p_no / (likelihood_yes * p_yes + likelihood_no * p_no)
    return posterior_yes, posterior_no

In [42]:
get_posterior(likelihood_intent, likelihood_no_intent, p_intent, p_no_intent)

(0.3333333333333333, 0.6666666666666666)
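Since the posterior for `No Intent to Buy` (about 0.667) is higher than for `Intent to Buy` (about 0.333), the review "This looks so bad." is classified as `No Intent to Buy`. As a cross-check, the same computation can be sketched with exact fractions. The likelihoods below are the document frequencies read off the six training reviews after removing `to` and `this`; the dictionary layout and the `posterior` helper are illustrative, not the notebook's code.

```python
from fractions import Fraction

# P(word | class) as document frequencies over the three reviews in each class
likelihoods = {
    "looks": {"intent": Fraction(1, 3), "no_intent": Fraction(1, 3)},
    "so":    {"intent": Fraction(1, 3), "no_intent": Fraction(1, 3)},
    "bad":   {"intent": Fraction(1, 3), "no_intent": Fraction(2, 3)},
}
priors = {"intent": Fraction(1, 2), "no_intent": Fraction(1, 2)}

def posterior(words):
    """Normalize prior * product of word likelihoods across the two classes."""
    joint = {}
    for label, prior in priors.items():
        score = prior
        for w in words:
            score *= likelihoods[w][label]
        joint[label] = score
    total = sum(joint.values())
    return {label: score / total for label, score in joint.items()}

result = posterior(["looks", "so", "bad"])
print(result)
print(max(result, key=result.get))  # predicted class
```

The exact posteriors are 1/3 for intent and 2/3 for no intent, matching the floating-point output above.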