## Multinomial Naive Bayes 

The multinomial Naive Bayes classifier is suitable for classification with text data (e.g., word counts for text classification). Everything is similar to Gaussian NB except the $ P(x_{i} ∣ y) $. The new equation is,

![image_2022-08-12_171046578.png](attachment:image_2022-08-12_171046578.png)

![image_2022-08-12_173528714.png](attachment:image_2022-08-12_173528714.png)

## Detecting spam messages using Multinomial Naive Bayes model

**The concept of spam filtering is simple - detect spam emails from authentic (non-spam/ham) emails.
With Bayes' Rule, we want to find the probability an email is spam, given it contains certain words. We do this by finding the probability that each word in the email is spam, and then multiply these probabilities together to get the overall email spam metric to be used in classification.**

Probabilities can range between 0 and 1. For this spam filter, we will define that any email with a total 'spaminess' metric of over 0.5 (50%) will be deemed a spam email. When the Pr(S|W) (the probability of an email being spam S given a certain word W appears) has been found for each word in the email, they are multiplied together to give the overall probability that the email is spam. If this probability is over the 'spam threshold' of 0.5, the email is classified as a spam email.

In [None]:
# Define some training and test data for each class, spam and ham. 

train_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
train_ham = ['Your activity report','benefits physical activity', 'the importance vows']
test_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

In [None]:
# make a vocabulary of unique words that occur in known spam emails

vocab_words_spam = []

for sentence in train_spam:
    sentence_as_list = sentence.split()
    for word in sentence_as_list:
        vocab_words_spam.append(word)     
        
print(vocab_words_spam)

['send', 'us', 'your', 'password', 'review', 'our', 'website', 'send', 'your', 'password', 'send', 'us', 'your', 'account']


**Convert each list element to a dictionary key. This will delete duplicates, as dictionaries cannot have multiple keys with the same name. Convert remaining keys back to list:**

In [None]:
vocab_unique_words_spam = list(dict.fromkeys(vocab_words_spam))
print(vocab_unique_words_spam)

['send', 'us', 'your', 'password', 'review', 'our', 'website', 'account']


Spamicity' can be calculated by taking the total number of emails that have already been hand-labelled as either spam or ham, and using that data to compute word spam probabilities, by counting the frequency of each word.

We can count how many spam emails have the word “send” and divide that by the total number of spam emails - this gives a measure of the word's 'spamicity', or how likely it is to be in a spam email.

In [None]:
dict_spamicity = {}
for w in vocab_unique_words_spam:
    emails_with_w = 0     # counter
    for sentence in train_spam:
        if w in sentence:
            emails_with_w+=1
            
    print(f"Number of spam emails with the word {w}: {emails_with_w}")
    total_spam = len(train_spam)
    spamicity = (emails_with_w+1)/(total_spam+2)
    print(f"Spamicity of the word '{w}': {spamicity} \n")
    dict_spamicity[w.lower()] = spamicity

Number of spam emails with the word send: 3
Spamicity of the word 'send': 0.6666666666666666 

Number of spam emails with the word us: 2
Spamicity of the word 'us': 0.5 

Number of spam emails with the word your: 3
Spamicity of the word 'your': 0.6666666666666666 

Number of spam emails with the word password: 2
Spamicity of the word 'password': 0.5 

Number of spam emails with the word review: 1
Spamicity of the word 'review': 0.3333333333333333 

Number of spam emails with the word our: 4
Spamicity of the word 'our': 0.8333333333333334 

Number of spam emails with the word website: 1
Spamicity of the word 'website': 0.3333333333333333 

Number of spam emails with the word account: 1
Spamicity of the word 'account': 0.3333333333333333 



**Calculate Hamicity of non-spam words:**

In [None]:
# make a vocabulary of unique words that occur in known ham emails

vocab_words_ham = []

for sentence in train_ham:
    sentence_as_list = sentence.split()
    for word in sentence_as_list:
        vocab_words_ham.append(word)
        
vocab_unique_words_ham = list(dict.fromkeys(vocab_words_ham))
print(vocab_unique_words_ham)
['Your', 'activity', 'report', 'benefits', 'physical', 'the', 'importance', 'vows']
dict_hamicity = {}

for w in vocab_unique_words_ham:
    emails_with_w = 0     # counter
    for sentence in train_ham:
        if w in sentence:
            print(w+":", sentence)
            emails_with_w+=1
            
    print(f"Number of ham emails with the word '{w}': {emails_with_w}")
    total_ham = len(train_ham)
    Hamicity = (emails_with_w+1)/(total_ham+2)       # Smoothing applied
    print(f"Hamicity of the word '{w}': {Hamicity} ")
    dict_hamicity[w.lower()] = Hamicity
# Use built-in lower() to keep all words lower case - useful later when 
# comparing spamicity vs hamicity of a single word - e.g. 'Your' and
 # 'your' will be treated as 2 different words if not normalized to lower                                          # case.

['Your', 'activity', 'report', 'benefits', 'physical', 'the', 'importance', 'vows']
Your: Your activity report
Number of ham emails with the word 'Your': 1
Hamicity of the word 'Your': 0.4 
activity: Your activity report
activity: benefits physical activity
Number of ham emails with the word 'activity': 2
Hamicity of the word 'activity': 0.6 
report: Your activity report
Number of ham emails with the word 'report': 1
Hamicity of the word 'report': 0.4 
benefits: benefits physical activity
Number of ham emails with the word 'benefits': 1
Hamicity of the word 'benefits': 0.4 
physical: benefits physical activity
Number of ham emails with the word 'physical': 1
Hamicity of the word 'physical': 0.4 
the: the importance vows
Number of ham emails with the word 'the': 1
Hamicity of the word 'the': 0.4 
importance: the importance vows
Number of ham emails with the word 'importance': 1
Hamicity of the word 'importance': 0.4 
vows: the importance vows
Number of ham emails with the word 'vows': 1

**Compute Probability of Spam P(S):
This computes the probability of any one email being spam, by dividing the total number of spam emails by the total number of all emails.**

In [None]:
prob_spam = len(train_spam) / (len(train_spam)+(len(train_ham)))
print(prob_spam)

0.5714285714285714


**Compute Probability of Ham P(¬S): This computes the probability of any one email being ham, by dividing the total number of ham emails by the total number of all emails.**

In [None]:
prob_ham = len(train_ham) / (len(train_spam)+(len(train_ham)))
print(prob_ham)

0.42857142857142855


**Given a set of un-labelled test emails, iterate over each, and create list of distinct words:**

In [None]:
tests = []
for i in test_emails['spam']:
    tests.append(i)
    
for i in test_emails['ham']:
    tests.append(i)
    
print(tests)    

['renew your password', 'renew your vows', 'benefits of our account', 'the importance of physical activity']
# split emails into distinct words

distinct_words_as_sentences_test = []

for sentence in tests:
    sentence_as_list = sentence.split()
    senten = []
    for word in sentence_as_list:
        senten.append(word)
    distinct_words_as_sentences_test.append(senten)
        
print(distinct_words_as_sentences_test)

['renew your password', 'renew your vows', 'benefits of our account', 'the importance of physical activity']
[['renew', 'your', 'password'], ['renew', 'your', 'vows'], ['benefits', 'of', 'our', 'account'], ['the', 'importance', 'of', 'physical', 'activity']]


In [None]:
test_spam_tokenized = [distinct_words_as_sentences_test[0], distinct_words_as_sentences_test[1]]
test_ham_tokenized = [distinct_words_as_sentences_test[2], distinct_words_as_sentences_test[3]]
print(test_spam_tokenized)

[['renew', 'your', 'password'], ['renew', 'your', 'vows']]


**Ignore the words that you haven’t seen in the labelled training data:**

In [None]:
reduced_sentences_spam_test = []
for sentence in test_spam_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled spam training data")
    reduced_sentences_spam_test.append(words_)
print(reduced_sentences_spam_test)

'renew', word not present in labelled spam training data
'your', ok
'password', ok
'renew', word not present in labelled spam training data
'your', ok
'vows', ok
[['your', 'password'], ['your', 'vows']]


In [None]:
reduced_sentences_ham_test = []                   # repeat for ham words
for sentence in test_ham_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled ham training data")
    reduced_sentences_ham_test.append(words_)
print(reduced_sentences_ham_test)

'benefits', ok
'of', word not present in labelled ham training data
'our', ok
'account', ok
'the', ok
'importance', ok
'of', word not present in labelled ham training data
'physical', ok
'activity', ok
[['benefits', 'our', 'account'], ['the', 'importance', 'physical', 'activity']]


**Stemming - remove non-key words: Removal of non-key words can help the classifier focus on what words are most important.**

In [None]:
test_spam_stemmed = []
non_key = ['us',  'the', 'of','your']       # non-key words, gathered from spam,ham and test sentences
for email in reduced_sentences_spam_test:
    email_stemmed=[]
    for word in email:
        if word in non_key:
            print('remove')
        else:
            email_stemmed.append(word)
    test_spam_stemmed.append(email_stemmed)
            
print(test_spam_stemmed)


remove
remove
[['password'], ['vows']]


In [None]:
test_ham_stemmed = []
non_key = ['us', 'the', 'of', 'your'] 
for email in reduced_sentences_ham_test:
    email_stemmed=[]
    for word in email:
        if word in non_key:
            print('remove')
        else:
            email_stemmed.append(word)
    test_ham_stemmed.append(email_stemmed)
            
print(test_ham_stemmed)

remove
[['benefits', 'our', 'account'], ['importance', 'physical', 'activity']]


**Bayes' Rule**
(To compute the probability of spam given a certain word from an email.)

In [None]:
def mult(list_) :        # function to multiply all word probs together 
    total_prob = 1
    for i in list_: 
         total_prob = total_prob * i  
    return total_prob

def Bayes(email):
    probs = []
    for word in email:
        Pr_S = prob_spam
        print('prob of spam in general ',Pr_S)
        try:
            pr_WS = dict_spamicity[word]
            print(f'prob "{word}"  is a spam word : {pr_WS}')
        except KeyError:
            pr_WS = 1/(total_spam+2)  # Apply smoothing for word not seen in spam training data, but seen in ham training 
            print(f"prob '{word}' is a spam word: {pr_WS}")
            
        Pr_H = prob_ham
        print('prob of ham in general ', Pr_H)
        try:
            pr_WH = dict_hamicity[word]
            print(f'prob "{word}" is a ham word: ',pr_WH)
        except KeyError:
            pr_WH = (1/(total_ham+2))  # Apply smoothing for word not seen in ham training data, but seen in spam training
            print(f"WH for {word} is {pr_WH}")
            print(f"prob '{word}' is a ham word: {pr_WH}")
        
        prob_word_is_spam_BAYES = (pr_WS*Pr_S)/((pr_WS*Pr_S)+(pr_WH*Pr_H))
        print('')
        print(f"Using Bayes, prob the the word '{word}' is spam: {prob_word_is_spam_BAYES}")
        print('###########################')
        probs.append(prob_word_is_spam_BAYES)
    print(f"All word probabilities for this sentence: {probs}")
    final_classification = mult(probs)
    if final_classification >= 0.5:
        print(f'email is SPAM: with spammy confidence of {final_classification*100}%')
    else:
        print(f'email is HAM: with spammy confidence of {final_classification*100}%')
    return final_classification
for email in test_spam_stemmed:
    print('')
    print(f"           Testing stemmed SPAM email {email} :")
    print('                 Test word by word: ')
    all_word_probs = Bayes(email)
    print(all_word_probs)


           Testing stemmed SPAM email ['password'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob "password"  is a spam word : 0.5
prob of ham in general  0.42857142857142855
WH for password is 0.2
prob 'password' is a ham word: 0.2

Using Bayes, prob the the word 'password' is spam: 0.7692307692307692
###########################
All word probabilities for this sentence: [0.7692307692307692]
email is SPAM: with spammy confidence of 76.92307692307692%
0.7692307692307692

           Testing stemmed SPAM email ['vows'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob 'vows' is a spam word: 0.16666666666666666
prob of ham in general  0.42857142857142855
prob "vows" is a ham word:  0.4

Using Bayes, prob the the word 'vows' is spam: 0.35714285714285715
###########################
All word probabilities for this sentence: [0.35714285714285715]
email is HAM: with spammy confidence of 35.714285714285715%
0.3571428

**Next we test how likely the stemmed HAM test emails are to be SPAM.**

In [None]:
for email in test_ham_stemmed:
    print('')
    print(f"           Testing stemmed HAM email {email} :")
    print('                 Test word by word: ')
    all_word_probs = Bayes(email)
    print(all_word_probs)


           Testing stemmed HAM email ['benefits', 'our', 'account'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob 'benefits' is a spam word: 0.16666666666666666
prob of ham in general  0.42857142857142855
prob "benefits" is a ham word:  0.4

Using Bayes, prob the the word 'benefits' is spam: 0.35714285714285715
###########################
prob of spam in general  0.5714285714285714
prob "our"  is a spam word : 0.8333333333333334
prob of ham in general  0.42857142857142855
WH for our is 0.2
prob 'our' is a ham word: 0.2

Using Bayes, prob the the word 'our' is spam: 0.847457627118644
###########################
prob of spam in general  0.5714285714285714
prob "account"  is a spam word : 0.3333333333333333
prob of ham in general  0.42857142857142855
WH for account is 0.2
prob 'account' is a ham word: 0.2

Using Bayes, prob the the word 'account' is spam: 0.689655172413793
###########################
All word probabilities for this sentence: [0.3

# Assignment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns

**Dataset**

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

### Write a program to detect hate speech in tweets using the Multinomial Naive Bayes.

In [40]:
df_train = pd.read_csv('/content/drive/MyDrive/ML Lab/Assignment_4/Sentiment analysis/train.csv/train.csv').drop('id', axis=1)
df_test = pd.read_csv('/content/drive/MyDrive/ML Lab/Assignment_4/Sentiment analysis/test.csv/test.csv').drop('id', axis=1)

In [41]:
df_train.head()

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation


In [42]:
df_train['tweet'] = df_train.tweet.str.lower()
df_test['tweet'] = df_test['tweet'].str.lower()
df_train.head(3)

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty


In [43]:
import re
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', x))

In [44]:
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r'{link}', '', x))
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r"\[video\]", '', x))

df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r'{link}', '', x))
df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r"\[video\]", '', x))

In [45]:
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r'&[a-z]+;', '', x))

df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r'&[a-z]+;', '', x))

In [46]:
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#@]", '', x))

df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r"[^a-z\s\(\-:\)\\\/\];='#@]", '', x))

In [47]:
df_train.tweet = df_train.tweet.apply(lambda x: re.sub(r'@user', '', x))

df_test.tweet = df_test.tweet.apply(lambda x: re.sub(r'@user', '', x))

In [48]:
df_train.tweet[:10]

0      when a father is dysfunctional and is so sel...
1      thanks for #lyft credit i can't use cause th...
2                                  bihday your majesty
3    #model   i love u take with u all the time in ...
4               factsguide: society now    #motivation
5    /] huge fan fare and big talking before they l...
6                        camping tomorrow        danny
7    the next school year is the year for exams can...
8    we won love the land #allin #cavs #champions #...
9                   welcome here   i'm   it's so #gr  
Name: tweet, dtype: object

In [49]:
df_train.tweet[3]

'#model   i love u take with u all the time in ur \x85  '

In [50]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   31962 non-null  int64 
 1   tweet   31962 non-null  object
dtypes: int64(1), object(1)
memory usage: 499.5+ KB


In [51]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
df_train['token'] = df_train['tweet'].apply(tknzr.tokenize)

In [52]:
df_test['token'] = df_test['tweet'].apply(tknzr.tokenize)

In [53]:
df_train['token'][:10]

0    [when, a, father, is, dysfunctional, and, is, ...
1    [thanks, for, #lyft, credit, i, can't, use, ca...
2                              [bihday, your, majesty]
3    [#model, i, love, u, take, with, u, all, the, ...
4           [factsguide, :, society, now, #motivation]
5    [/, ], huge, fan, fare, and, big, talking, bef...
6                           [camping, tomorrow, danny]
7    [the, next, school, year, is, the, year, for, ...
8    [we, won, love, the, land, #allin, #cavs, #cha...
9                  [welcome, here, i'm, it's, so, #gr]
Name: token, dtype: object

In [54]:
import string
PUNCUATION_LIST = list(string.punctuation)
def remove_punctuation(word_list):
    """Remove punctuation tokens from a list of tokens"""
    return [w for w in word_list if w not in PUNCUATION_LIST]
df_train['token'] = df_train['token'].apply(remove_punctuation)

In [55]:
df_test['token'] = df_test['token'].apply(remove_punctuation)

In [56]:
df_train['token'][:10]

0    [when, a, father, is, dysfunctional, and, is, ...
1    [thanks, for, #lyft, credit, i, can't, use, ca...
2                              [bihday, your, majesty]
3    [#model, i, love, u, take, with, u, all, the, ...
4              [factsguide, society, now, #motivation]
5    [huge, fan, fare, and, big, talking, before, t...
6                           [camping, tomorrow, danny]
7    [the, next, school, year, is, the, year, for, ...
8    [we, won, love, the, land, #allin, #cavs, #cha...
9                  [welcome, here, i'm, it's, so, #gr]
Name: token, dtype: object

In [63]:
from nltk.corpus import stopwords

In [64]:
import nltk
nltk.download('stopwords')
sw_nltk = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [65]:
print(sw_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [66]:
df_train['token'] = df_train['token'].apply(lambda x: [word for word in x if word.lower() not in sw_nltk])

In [67]:
df_test['token'] = df_test['token'].apply(lambda x: [word for word in x if word.lower() not in sw_nltk])

In [68]:
# train_spam = df_train[df_train['label']==1]['tweet']
# train_ham = df_train[df_train['label']==0]['tweet']

train_spam = df_train[df_train['label']==1]['token']
train_ham = df_train[df_train['label']==0]['token']

In [69]:
type(train_spam)

pandas.core.series.Series

In [70]:
test_emails = df_test['tweet']

In [71]:
df_test['tweet'][-10:]

17187    loving life  #createyourfuture   #lifestyle #h...
17188    black professor demonizes proposes nazi style ...
17189    learn how to think positive  #positive   #inst...
17190    we love the pretty happy and fresh you #teenil...
17191    damntuff-ruffmufftechnocity-(ng)-web--ukhxint ...
17192    thought factory: left-right polarisation #trum...
17193    feeling like a mermaid  #hairflip #neverready ...
17194    #hillary #campaigned today in #ohio((omg))  us...
17195    happy at work conference: right mindset leads ...
17196    my   song so glad free download  #shoegaze #ne...
Name: tweet, dtype: object

In [72]:
vocab_words_spam = []

for sentence in train_spam:
    # sentence_as_list = sentence.split()
    for word in sentence:
        vocab_words_spam.append(word)     
        
print(vocab_words_spam)



In [73]:
vocab_unique_words_spam = list(dict.fromkeys(vocab_words_spam))
print(vocab_unique_words_spam)



In [74]:
vocab_words_ham = []

for sentence in train_ham:
    # sentence_as_list = sentence.split()
    for word in sentence:
        vocab_words_ham.append(word)
        
# print(vocab_unique_words_ham)

In [75]:
vocab_unique_words_ham = list(dict.fromkeys(vocab_words_ham))

In [76]:
dict_spamicity = {}
for w in vocab_unique_words_spam:
    emails_with_w = 0     # counter
    for sentence in train_spam:
        if w in sentence:
            emails_with_w+=1
            
    print(f"Number of spam emails with the word {w}: {emails_with_w}")
    len_spam = len(train_spam)
    
    spamicity = (emails_with_w+1)/(len_spam + len(vocab_unique_words_spam)+ len(vocab_unique_words_ham))
    print(f"Spamicity of the word '{w}': {spamicity} \n")
    dict_spamicity[w.lower()] = spamicity

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Spamicity of the word 'jolly': 3.8972680151214e-05 

Number of spam emails with the word #hacking: 1
Spamicity of the word '#hacking': 3.8972680151214e-05 

Number of spam emails with the word ed: 1
Spamicity of the word 'ed': 3.8972680151214e-05 

Number of spam emails with the word op: 1
Spamicity of the word 'op': 3.8972680151214e-05 

Number of spam emails with the word occur: 2
Spamicity of the word 'occur': 5.8459020226820995e-05 

Number of spam emails with the word attacks: 4
Spamicity of the word 'attacks': 9.7431700378035e-05 

Number of spam emails with the word lolol: 1
Spamicity of the word 'lolol': 3.8972680151214e-05 

Number of spam emails with the word leb: 1
Spamicity of the word 'leb': 3.8972680151214e-05 

Number of spam emails with the word #cow: 1
Spamicity of the word '#cow': 3.8972680151214e-05 

Number of spam emails with the word canadian: 1
Spamicity of the word 'canadian': 3.8972680151214e-05 


In [77]:
train_ham[:5]

0    [father, dysfunctional, selfish, drags, kids, ...
1    [thanks, #lyft, credit, can't, use, cause, off...
2                                    [bihday, majesty]
3                 [#model, love, u, take, u, time, ur]
4                   [factsguide, society, #motivation]
Name: token, dtype: object

In [78]:
# make a vocabulary of unique words that occur in known ham emails

dict_hamicity = {}

for w in vocab_unique_words_ham:
    emails_with_w = 0     # counter
    for sentence in train_ham:
        if w in sentence:
            # print(w+":", sentence)
            emails_with_w+=1
    n = len(vocab_unique_words_ham)     
    print(f"Number of ham emails with the word '{w}': {emails_with_w}")
    len_ham = len(train_ham)
    Hamicity = (emails_with_w+1)/(len_ham + len(vocab_unique_words_spam)+ len(vocab_unique_words_ham))  
    print(f"Hamicity of the word '{w}': {Hamicity} ")
    dict_hamicity[w.lower()] = Hamicity 

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Number of ham emails with the word '#paulryan': 1
Hamicity of the word '#paulryan': 2.538199908624803e-05 
Number of ham emails with the word '#whatatool': 1
Hamicity of the word '#whatatool': 2.538199908624803e-05 
Number of ham emails with the word 'strides': 1
Hamicity of the word 'strides': 2.538199908624803e-05 
Number of ham emails with the word '#lile': 1
Hamicity of the word '#lile': 2.538199908624803e-05 
Number of ham emails with the word 'colourlife': 1
Hamicity of the word 'colourlife': 2.538199908624803e-05 
Number of ham emails with the word '#cheflife': 1
Hamicity of the word '#cheflife': 2.538199908624803e-05 
Number of ham emails with the word '#kingkamehameha': 1
Hamicity of the word '#kingkamehameha': 2.538199908624803e-05 
Number of ham emails with the word '#luckywelivehawaii': 1
Hamicity of the word '#luckywelivehawaii': 2.538199908624803e-05 
Number of ham emails with the word '#ancestors': 1
Hamici

In [79]:
prob_spam = len(train_spam) / (len(train_spam)+(len(train_ham)))
print(prob_spam)

0.07014579813528565


In [80]:
prob_ham = len(train_ham) / (len(train_spam)+(len(train_ham)))
print(prob_ham)

0.9298542018647143


In [81]:
# tests = []
# for i in test_emails['spam']:
#     tests.append(i)
    
# for i in test_emails['ham']:
#     tests.append(i)
        
# tests = test_emails
tests = df_test['token']

distinct_words_as_sentences_test = []

for sentence in tests:
    # sentence_as_list = sentence.split()
    senten = []
    for word in sentence:
        senten.append(word)
    distinct_words_as_sentences_test.append(senten)
        
print(distinct_words_as_sentences_test)



In [82]:
distinct_words_as_sentences_test[:3]

[['#studiolife',
  '#aislife',
  '#requires',
  '#passion',
  '#dedication',
  '#willpower',
  'find',
  '#newmaterials'],
 ['#white',
  '#supremacists',
  'want',
  'everyone',
  'see',
  'new',
  '#birds',
  '#movie',
  'heres'],
 ['safe', 'ways', 'heal', '#acne', '#altwaystoheal', '#healthy', '#healing']]

In [None]:
# test_spam_tokenized = [distinct_words_as_sentences_test[0], distinct_words_as_sentences_test[1]]
# test_ham_tokenized = [distinct_words_as_sentences_test[2], distinct_words_as_sentences_test[3]]
# print(test_spam_tokenized)

[['#studiolife', '#aislife', '#requires', '#passion', '#dedication', '#willpower', 'to', 'find', '#newmaterials'], ['user', '#white', '#supremacists', 'want', 'everyone', 'to', 'see', 'the', 'new', '#birds', '#movie', 'and', 'heres', 'why']]


In [83]:
reduced_sentences_test = []
for sentence in distinct_words_as_sentences_test:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_spam:
            # print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_ham:
            # print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled spam training data")
    reduced_sentences_test.append(words_)

print(reduced_sentences_test)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
'icomedy', word not present in labelled spam training data
'beanies', word not present in labelled spam training data
'#beanies', word not present in labelled spam training data
'serriously', word not present in labelled spam training data
'thoam', word not present in labelled spam training data
'#digg', word not present in labelled spam training data
'comprised', word not present in labelled spam training data
'f's', word not present in labelled spam training data
'#undemocratic', word not present in labelled spam training data
'suffocation', word not present in labelled spam training data
'diane', word not present in labelled spam training data
'mariechild', word not present in labelled spam training data
'appology', word not present in labelled spam training data
'pueo', word not present in labelled spam training data
'outgrew', word not present in labelled spam training data
'junie', word not present in labelled spam 

In [None]:
# reduced_sentences_ham_test = []                   # repeat for ham words
# for sentence in test_ham_tokenized:
#     words_ = []
#     for word in sentence:
#         if word in vocab_unique_words_ham:
#             print(f"'{word}', ok")
#             words_.append(word)
#         elif word in vocab_unique_words_spam:
#             print(f"'{word}', ok")
#             words_.append(word)
#         else:
#             print(f"'{word}', word not present in labelled ham training data")
#     reduced_sentences_ham_test.append(words_)
# print(reduced_sentences_ham_test)

'benefits', ok
'of', word not present in labelled ham training data
'our', ok
'account', ok
'the', ok
'importance', ok
'of', word not present in labelled ham training data
'physical', ok
'activity', ok
[['benefits', 'our', 'account'], ['the', 'importance', 'physical', 'activity']]


In [84]:
test_stemmed = []

non_key = ['us', 'the', 'of','your', 'i', 'and', 'with']
for email in reduced_sentences_test:
    email_stemmed=[]
    for word in email:
        if word in non_key:
            print('remove')
        else:
            email_stemmed.append(word)
    test_stemmed.append(email_stemmed)
            
print(test_stemmed)

remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove
remove

In [None]:
# test_ham_stemmed = []
# non_key = ['us', 'the', 'of', 'your'] 
# for email in reduced_sentences_ham_test:
#     email_stemmed=[]
#     for word in email:
#         if word in non_key:
#             print('remove')
#         else:
#             email_stemmed.append(word)
#     test_ham_stemmed.append(email_stemmed)
            
# print(test_ham_stemmed)

remove
[['benefits', 'our', 'account'], ['importance', 'physical', 'activity']]


In [87]:
def mult(list_) :        # function to multiply all word probs together 
    total_prob = 1
    for i in list_: 
         total_prob = total_prob * i  
    return total_prob

def Bayes(email):
    probs = []
    for word in email:
        Pr_S = prob_spam
        # print('prob of spam in general ',Pr_S)
        try:
            pr_WS = dict_spamicity[word]
            # print(f'prob "{word}"  is a spam word : {pr_WS}')
        except KeyError:
            pr_WS = 1/(len_spam + len(vocab_unique_words_spam) + len(vocab_unique_words_ham))  # Apply smoothing for word not seen in spam training data, but seen in ham training 
            # print(f"prob '{word}' is a spam word: {pr_WS}")
            
        Pr_H = prob_ham
        # print('prob of ham in general ', Pr_H)
        try:
            pr_WH = dict_hamicity[word]
            # print(f'prob "{word}" is a ham word: ',pr_WH)
        except KeyError:
            pr_WH = (1/(len_ham + len(vocab_unique_words_spam) + len(vocab_unique_words_ham)))  # Apply smoothing for word not seen in ham training data, but seen in spam training
            # print(f"WH for {word} is {pr_WH}")
            # print(f"prob '{word}' is a ham word: {pr_WH}")
        
        prob_word_is_spam_BAYES = (pr_WS*Pr_S)/((pr_WS*Pr_S)+(pr_WH*Pr_H))
        # print(f"Using Bayes, prob the the word '{word}' is racist/sexist: {prob_word_is_spam_BAYES}")
        # print('###########################')
        probs.append(prob_word_is_spam_BAYES)
    print(f"All word probabilities for this sentence: {probs}")
    final_classification = mult(probs)
    if final_classification >= 0.5:
        print(f'email is heateful: with spammy confidence of {final_classification*100}%')
    else:
        print(f'email is not hateful: with spammy confidence of {final_classification*100}%')
    return final_classification


In [88]:
for email in test_stemmed:
    print('')
    print(f"       Testing stemmed email {email} :")
    all_word_probs = Bayes(email)
    print(all_word_probs)

[1;30;43mStreaming output truncated to the last 5000 lines.[0m

           Testing stemmed email ['really', 'hate', 'pll', 'coming', 'back', 'day', 'get', 'wisdom', 'teeth', 'able', 'enjoy', 'good', 'still', 'e'] :
All word probabilities for this sentence: [0.0059741449720586565, 0.0285631699992323, 0.028142574107250625, 0.003480209229395682, 0.00798349641844432, 0.0005412499912695334, 0.0049316122462124785, 0.008831316950358714, 0.03717470154528315, 0.003252207012684513, 0.0026455789628019744, 0.005308018459217473, 0.009688164832742329, 0.002180706808960238]
email is HAM: with spammy confidence of 1.1281695925608824e-29%
1.1281695925608824e-31

           Testing stemmed email ['smell', '#cavs', 'coming', '#nbafinals'] :
All word probabilities for this sentence: [0.018939383741665577, 0.010420279161698812, 0.003480209229395682, 0.002180706808960238]
email is HAM: with spammy confidence of 1.4977793257767342e-07%
1.4977793257767343e-09

           Testing stemmed email ['#girliguessi

## Categorical Naïve Bayes

It is suitable for classification with discrete features which assumes categorically distribution for each feature. The features should to encoded using label encoding  techniques such that each category would be mapped to a unique number. 

The probability of category $t$ in feature $i$ given class $c$ is estimated as: 

![categorical.PNG](attachment:categorical.PNG)

![parameter_categorical.PNG](attachment:parameter_categorical.PNG)

## Step By Step Implementation of Categorical Naive Bayes


1. Preprocessing the data.
2. Calculate the counts/presence of each feature based on class.
3. Calculate likelihood probability.
4. Calculate prior probability.
5. Calculate posterior probability for a given query point → Predict function

In [None]:
import numpy as np
import pandas as pd 

from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder,LabelBinarizer

In [None]:
weather = ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']

timeOfWeek = ['Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend']

timeOfDay = ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            ]
trafficJam = ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes'
            ]

In [None]:
df = pd.DataFrame(zip(weather,timeOfWeek,timeOfDay,trafficJam),columns = ['weather','timeOfWeek','timeOfDay','trafficJam'])
df

Unnamed: 0,weather,timeOfWeek,timeOfDay,trafficJam
0,Clear,Workday,Morning,Yes
1,Clear,Workday,Lunch,No
2,Clear,Workday,Evening,Yes
3,Clear,Weekend,Morning,No
4,Clear,Weekend,Lunch,No
5,Clear,Weekend,Evening,No
6,Rainy,Workday,Morning,Yes
7,Rainy,Workday,Lunch,Yes
8,Rainy,Workday,Evening,Yes
9,Rainy,Weekend,Morning,No


In [None]:
weather = df['weather'].values.reshape(-1,1)
timeOfWeek = df['timeOfWeek'].values.reshape(-1,1) 
timeOfDay = df['timeOfDay'].values.reshape(-1,1)

In [None]:
weather.shape,timeOfWeek.shape

((18, 1), (18, 1))

In [None]:
def preprocess():
    # Using ordinal encoder to convert the categories in the range from 0 to n-1
    wea_enc = OrdinalEncoder()
    weather_ = wea_enc.fit_transform(weather)

    timeOfWeek_enc = OrdinalEncoder()
    timeOfWeek_ = timeOfWeek_enc.fit_transform(timeOfWeek)

    timeOfDay_enc = OrdinalEncoder()
    timeOfDay_ = timeOfDay_enc.fit_transform(timeOfDay)
    # Stacking all the features
    X = np.column_stack((weather_,timeOfWeek_,timeOfDay_))
    # Changing the type to int
    X = X.astype(int)
    # Doing one hot encoding on the target data
    y = df['trafficJam']
    lb = LabelBinarizer()
    y_ = lb.fit_transform(y)
    if y_.shape[1] == 1:
        y_ = np.concatenate((1 - y_, y_), axis=1)
    return X,y_,lb.classes_

**Preprocessing the data:
Converting the categorical data into a numerical form using ordinal encoding. The features are converted to ordinal integers.
This results in a single column of integers (0 to n_categories — 1) per feature.**

In [None]:
X,y,classes = preprocess()
X.shape, y.shape

((18, 3), (18, 2))

In [None]:
def counts_based_onclass(X,y):
    
    # No of feature
    n_features = X.shape[1]
    # No of classes
    n_classes = y.shape[1]
    
    count_matrix = []
    # For each feature
    for i in range(n_features):
        count_feature = []
        # Get that particuar feature from the dataset
        X_feature = X[:,i]
        # For each class
        for j in range(n_classes):
            # Get the datapoints that belong to the class - j
            mask = y[:,j].astype(bool)
            # Using masking filter out the datapoints that belong to this class- j in the given feature - i
            # Using bincount -- count all the different categories present in the given feature
            counts = np.bincount(X_feature[mask])
            
            count_feature.append(counts)
            
        count_matrix.append(np.array(count_feature))
        # Finding the count of datapoints beloging to each class -- we will use it to calculate prior probabilities.
        class_count = y.sum(axis=0)
        
    return count_matrix,n_features,n_classes,class_count
            

In [None]:
count_matrix,n_features,n_classes,class_count = counts_based_onclass(X,y)

In [None]:
# Count_matrix will give an output this way, For each of the features you have 2D -array
#(The first row corresponding to No and the second row corresponding to Yes)

count_matrix

[array([[4, 3, 1],
        [2, 3, 5]]), array([[7, 1],
        [2, 8]]), array([[2, 4, 2],
        [4, 2, 4]])]

In [None]:
def calculate_likelihood_probs(count_matrix,alpha,n_features):
    log_probabilities = []
    for i in range(n_features):
        num = count_matrix[i] + alpha
        den = num.sum(axis = 1).reshape(-1,1)
        log_probability = np.log(num) - np.log(den)
        log_probabilities.append(log_probability)
    return log_probabilities

In [None]:
def calculate_prior_probs(class_count):
    
    num = class_count
    den = class_count.sum()
    
    return np.log(num)-np.log(den)

In [None]:
prior_probs = calculate_prior_probs(class_count)

In [None]:
log_probs = calculate_likelihood_probs(count_matrix,1,n_features)

In [None]:
def predict(query_point,log_probs,prior_probs):
    
    # Intializing an empty array
    probs = np.zeros((1,n_classes))
    # For each feature
    for i in range(n_features):
        # Get the category_id of the feature - i from the query_point
        category = query_point[i]
        # Fetch the corresponding log_probability table and add continue to add them for all the features
        probs+=log_probs[i][:,category]
    # Finally add posterior probability
    probs+=prior_probs
    # Finding the maximum of the probabilities and fetching the corresponding class
    return classes[np.argmax(probs)]

In [None]:
print('Likelihood probabilities\n',log_probs)
print('Prior probabilities\n',prior_probs)
#print('Predict',predict(X[4],log_probs,prior_probs))

Likelihood probabilities
 [array([[-0.78845736, -1.01160091, -1.70474809],
       [-1.46633707, -1.178655  , -0.77318989]]), array([[-0.22314355, -1.60943791],
       [-1.38629436, -0.28768207]]), array([[-1.29928298, -0.78845736, -1.29928298],
       [-0.95551145, -1.46633707, -0.95551145]])]
Prior probabilities
 [-0.81093022 -0.58778666]


# Assignment

**Dataset characteristics:**

1. Number of instances: 1000
2. Number of attributes: 5 (including target attribute), all categorical
3. Attribute information:
    * size (XS, S, M, L, XL, XXL, 3XL)
    * material (nylon, polyester, silk, cotton, linen)
    * color (white, cream, blue, black, orange, green, yellow, red, violet, navy)
    * sleeves (short, long)
    * demand (low, medium, high)

### Write a program to implement the Categorical Naive Bayes classification algorithm to predict clothing demand (low, medium, high) based on the rest of the attributes.

In [3]:
df_cloth = pd.read_csv('/content/drive/MyDrive/ML Lab/Assignment_4/Clothing.csv').drop('Unnamed: 0', axis=1)

In [4]:
df_cloth.head()

Unnamed: 0,size,material,color,sleeves,demand
0,S,nylon,white,long,medium
1,XL,polyester,cream,short,high
2,S,silk,blue,short,medium
3,M,cotton,black,short,medium
4,XL,polyester,orange,long,medium


In [None]:
# size = df_cloth['size'].values.reshape(-1,1)
# material = df_cloth['material'].values.reshape(-1,1) 
# color = df_cloth['color'].values.reshape(-1,1)
# sleeves = df_cloth['sleeves'].values.reshape(-1,1)

In [5]:
size_order = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4, 'XXL': 5, '3XL': 6}
sleeves_order = {'short': 0, 'long': 1}
demand_order = {'low': 0, 'medium': 1, 'high': 2}

df_cloth['size'] = df_cloth['size'].apply(lambda x: size_order[x])
df_cloth['sleeves'] = df_cloth['sleeves'].apply(lambda x: sleeves_order[x])
df_cloth['demand'] = df_cloth['demand'].apply(lambda x: demand_order[x])

In [6]:
df_cloth = pd.get_dummies(df_cloth,prefix=['material', 'color'], columns = ['material', 'color'], drop_first=True)

In [7]:
df_cloth.head()

Unnamed: 0,size,sleeves,demand,material_linen,material_nylon,material_polyester,material_silk,color_blue,color_cream,color_green,color_navy,color_orange,color_red,color_violet,color_white,color_yellow
0,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1,0
1,4,0,2,0,0,1,0,0,1,0,0,0,0,0,0,0
2,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0
3,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,4,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0


In [8]:
y = df_cloth['demand']
x = df_cloth.drop(['demand'], axis = 1)

In [9]:
classes = y.unique()

In [10]:
def counts_based_onclass(x,y):
    
    # No of feature
    global classes
    n_features = x.shape[1]
    # No of classes
    n_classes = len(classes)
    
    count_matrix = []
    # For each feature
    for i in range(n_features):
        count_feature = []
        # Get that particuar feature from the dataset
        x_feature = x[:, i]
        # For each class
        for j in range(n_classes):
            # Get the datapoints that belong to the class - j
            mask = (y == classes[j])
            # Using masking filter out the datapoints that belong to this class- j in the given feature - i
            # Using bincount -- count all the different categories present in the given feature
            counts = np.bincount(x_feature[mask])
            
            count_feature.append(counts)
            
        count_matrix.append(np.array(count_feature))
        # Finding the count of datapoints beloging to each class -- we will use it to calculate prior probabilities.
        class_count = np.bincount(y)
        
    return count_matrix, n_features, n_classes, class_count

In [11]:
x_numpy = x.to_numpy(dtype = 'int')

print(x_numpy)

[[1 1 0 ... 0 1 0]
 [4 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [5 1 0 ... 0 0 0]
 [1 1 1 ... 0 1 0]
 [1 1 0 ... 0 0 0]]


In [12]:
y_numpy = y.to_numpy(dtype = 'int')

print(y_numpy)

[1 2 1 ... 1 2 2]


In [13]:
x_numpy.shape, y_numpy.shape

((10000, 15), (10000,))

In [14]:
count_matrix,n_features,n_classes,class_count = counts_based_onclass(x_numpy,y_numpy)



In [18]:
for i in range(n_features):
  for j in count_matrix[i]:
    print(len(j), j)
  print()

7 [329 421 866 952 841 600 359]
7 [ 509  835  707 1169  622  345  142]
7 [ 45  28 154 110 277 345 344]

2 [1222 3146]
2 [3295 1034]
2 [439 864]

2 [3889  479]
2 [3239 1090]
2 [1239   64]

2 [3019 1349]
2 [3144 1185]
2 [1037  266]

2 [3003 1365]
2 [3141 1188]
2 [1016  287]

2 [4267  101]
1 [4329]
2 [831 472]

2 [4051  317]
2 [3732  597]
2 [1217   86]

2 [3989  379]
2 [3642  687]
2 [1246   57]

2 [4107  261]
2 [4133  196]
2 [1219   84]

2 [3969  399]
2 [3669  660]
2 [1230   73]

2 [4159  209]
2 [3991  338]
2 [1191  112]

2 [4055  313]
2 [3741  588]
2 [1206   97]

2 [4138  230]
2 [4014  315]
2 [1186  117]

2 [3777  591]
2 [3715  614]
2 [862 441]

2 [4126  242]
2 [3995  334]
2 [1178  125]



In [19]:
count_matrix[5] = np.array( [
    [4267, 101],
    [0, 4329],
    [831, 472]
  ], dtype = 'int'
)

In [20]:
count_matrix

[array([[ 329,  421,  866,  952,  841,  600,  359],
        [ 509,  835,  707, 1169,  622,  345,  142],
        [  45,   28,  154,  110,  277,  345,  344]]), array([[1222, 3146],
        [3295, 1034],
        [ 439,  864]]), array([[3889,  479],
        [3239, 1090],
        [1239,   64]]), array([[3019, 1349],
        [3144, 1185],
        [1037,  266]]), array([[3003, 1365],
        [3141, 1188],
        [1016,  287]]), array([[4267,  101],
        [   0, 4329],
        [ 831,  472]]), array([[4051,  317],
        [3732,  597],
        [1217,   86]]), array([[3989,  379],
        [3642,  687],
        [1246,   57]]), array([[4107,  261],
        [4133,  196],
        [1219,   84]]), array([[3969,  399],
        [3669,  660],
        [1230,   73]]), array([[4159,  209],
        [3991,  338],
        [1191,  112]]), array([[4055,  313],
        [3741,  588],
        [1206,   97]]), array([[4138,  230],
        [4014,  315],
        [1186,  117]]), array([[3777,  591],
        [3715,  6

In [21]:
n_features

15

In [22]:
n_classes

3

In [23]:
class_count

array([1303, 4368, 4329])

In [24]:
def calculate_likelihood_probs(count_matrix,alpha,n_features):
    log_probabilities = []
    for i in range(n_features):
        num = count_matrix[i] + alpha
        den = num.sum(axis = 1).reshape(-1,1)
        log_probability = np.log(num) - np.log(den)
        log_probabilities.append(log_probability)
    return log_probabilities

In [25]:
def calculate_prior_probs(class_count):
    
    num = class_count
    den = class_count.sum()
    
    return np.log(num)-np.log(den)

In [26]:
prior_probs = calculate_prior_probs(class_count)

In [27]:
prior_probs

array([-2.03791579, -0.82827985, -0.83724852])

In [28]:
log_probs = calculate_likelihood_probs(count_matrix,1,n_features)

In [39]:
for i in log_probs:
  print(i)

[[-2.58456914 -2.33865648 -1.61862282 -1.5240469  -1.64788178 -1.98506686
  -2.49755777]
 [-2.14029682 -1.64607893 -1.81226345 -1.30994852 -1.94016102 -2.52826877
  -3.41186291]
 [-3.34914102 -3.81048659 -2.1343573  -2.46825221 -1.5501613  -1.33134364
  -1.334238  ]]
[[-1.27345615 -0.32831339]
 [-0.27308885 -1.43139704]
 [-1.08718359 -0.41122881]]
[[-0.11635385 -2.20873218]
 [-0.29022513 -1.37870376]
 [-0.05109166 -2.99957105]]
[[-0.36950618 -1.17465842]
 [-0.31998457 -1.29521216]
 [-0.22890726 -1.58670966]]
[[-0.37481828 -1.16287625]
 [-0.32093892 -1.29268584]
 [-0.24934592 -1.51099784]]
[[-2.36176757e-02 -3.75754547e+00]
 [-8.37355374e+00 -2.30920218e-04]
 [-4.50125879e-01 -1.01486293e+00]]
[[-0.07555242 -2.62046691]
 [-0.14858626 -1.97996299]
 [-0.06899287 -2.7080502 ]]
[[-0.09097178 -2.44234704]
 [-0.17299094 -1.8397649 ]
 [-0.04546237 -3.11351531]]
[[-0.06182672 -2.81417378]
 [-0.046553   -3.09035001]
 [-0.06735218 -2.73130706]]
[[-0.09599691 -2.39105374]
 [-0.1656068  -1.8797999 

In [37]:
def predict(query_point,log_probs,prior_probs):
    
    # Intializing an empty array
    probs = np.zeros((1,n_classes))
    # For each feature
    for i in range(n_features):
        # Get the category_id of the feature - i from the query_point
        category = query_point[i]
        # Fetch the corresponding log_probability table and add continue to add them for all the features
        probs += log_probs[i][:,category]
    # Finally add posterior probability
    probs+=prior_probs
    # Finding the maximum of the probabilities and fetching the corresponding class
    predicted_class = classes[np.argmax(probs)]
    for k,v in demand_order.items():
      if v==predicted_class:
        print(f'Predicted Demand: {k}')

In [38]:
# print('Predict',predict(x_numpy[3],log_probs,prior_probs))
predict(x_numpy[3],log_probs,prior_probs)

Predicted Demand: low
