
## Multinomial Naive Bayes

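The multinomial Naive Bayes classifier is suited to classification with discrete count features, such as word counts in text classification. It works like Gaussian NB except for how the likelihood $P(x_{i} \mid y)$ is modelled: the counts are assumed to follow a multinomial distribution rather than a Gaussian. The new equation is: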

![image_2022-08-12_171046578.png](attachment:image_2022-08-12_171046578.png)

![image_2022-08-12_173528714.png](attachment:image_2022-08-12_173528714.png)
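
The attached equation images may not render in every viewer. As a hedged reminder (an assumption about what the images show, based on the standard textbook form), the smoothed multinomial likelihood is usually estimated as

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}$$

where $N_{yi}$ is the number of times feature $i$ (for example, a word) occurs in training samples of class $y$, $N_{y} = \sum_{i} N_{yi}$ is the total count for class $y$, $n$ is the number of features (the vocabulary size), and $\alpha \ge 0$ is the smoothing parameter ($\alpha = 1$ gives Laplace smoothing).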

## Detecting spam messages using Multinomial Naive Bayes model

**The concept of spam filtering is simple: separate spam emails from authentic (non-spam, or 'ham') emails.
With Bayes' Rule, we want to find the probability that an email is spam given that it contains certain words. We do this by finding, for each word in the email, the probability that an email containing that word is spam, and then multiplying these probabilities together to get an overall spam score used for classification.**

Probabilities range between 0 and 1. For this spam filter, any email with a total 'spaminess' score of 0.5 (50%) or more is deemed spam. Once Pr(S|W) (the probability that an email is spam, S, given that a certain word W appears) has been found for each word in the email, the values are multiplied together to give the overall spam score. If this score reaches the 'spam threshold' of 0.5, the email is classified as spam.
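
The decision rule described above can be sketched in a few lines. The function name `classify_by_product` and the example probabilities are illustrative assumptions, not values computed from the training data.

```python
import math

def classify_by_product(word_probs, threshold=0.5):
    """Multiply per-word spam probabilities and compare with a threshold."""
    score = math.prod(word_probs)
    label = "spam" if score >= threshold else "ham"
    return label, score

# Two strongly 'spammy' words keep the product above the threshold ...
print(classify_by_product([0.9, 0.8]))
# ... but adding a third, weaker word pulls it below 0.5
print(classify_by_product([0.9, 0.8, 0.6]))
```

Because every factor is at most 1, the product can only shrink as more words are added, so longer emails need consistently high per-word probabilities to stay above the threshold.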

In [None]:
# Define some training and test data for each class, spam and ham.

train_spam = ['send us your password', 'review our website', 'send your password', 'send us your account']
train_ham = ['Your activity report','benefits physical activity', 'the importance vows']
test_emails = {'spam':['renew your password', 'renew your vows'], 'ham':['benefits of our account', 'the importance of physical activity']}

In [None]:
# make a vocabulary of unique words that occur in known spam emails

vocab_words_spam = []

for sentence in train_spam:
    sentence_as_list = sentence.split()
    for word in sentence_as_list:
        vocab_words_spam.append(word)

print(vocab_words_spam)

['send', 'us', 'your', 'password', 'review', 'our', 'website', 'send', 'your', 'password', 'send', 'us', 'your', 'account']


**Convert each list element to a dictionary key. This will delete duplicates, as dictionaries cannot have multiple keys with the same name. Convert remaining keys back to list:**

In [None]:
vocab_unique_words_spam = list(dict.fromkeys(vocab_words_spam))
print(vocab_unique_words_spam)

['send', 'us', 'your', 'password', 'review', 'our', 'website', 'account']
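
As a small illustration of why `dict.fromkeys` is a convenient choice here, the sketch below compares it with `set` on a made-up word list (the list is an assumption for demonstration only).

```python
words = ['send', 'us', 'your', 'send', 'your', 'account']

# dict keys keep first-seen insertion order (guaranteed since Python 3.7)
unique_in_order = list(dict.fromkeys(words))
print(unique_in_order)

# a set also removes duplicates, but its iteration order is not guaranteed
unique_as_set = set(words)
print(len(unique_as_set))
```

Both approaches remove duplicates; only the dictionary-based one keeps the words in the order they first appeared.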


'Spamicity' is calculated from emails that have already been hand-labelled as spam or ham, by counting how often each word occurs in each class and turning those counts into word spam probabilities.

We can count how many spam emails have the word “send” and divide that by the total number of spam emails - this gives a measure of the word's 'spamicity', or how likely it is to be in a spam email. The code below also applies add-one (Laplace) smoothing, adding 1 to the count and 2 to the total, so that no word gets a probability of exactly 0 or 1.

In [None]:
dict_spamicity = {}
for w in vocab_unique_words_spam:
    emails_with_w = 0     # counter
    for sentence in train_spam:
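        # Note: `in` on a string is a substring test, so 'our' also matches
        # inside 'your'; use `w in sentence.split()` for whole-word matching.
        if w in sentence: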
            emails_with_w+=1

    print(f"Number of spam emails with the word {w}: {emails_with_w}")
    total_spam = len(train_spam)
    spamicity = (emails_with_w+1)/(total_spam+2)
    print(f"Spamicity of the word '{w}': {spamicity} \n")
    dict_spamicity[w.lower()] = spamicity

Number of spam emails with the word send: 3
Spamicity of the word 'send': 0.6666666666666666 

Number of spam emails with the word us: 2
Spamicity of the word 'us': 0.5 

Number of spam emails with the word your: 3
Spamicity of the word 'your': 0.6666666666666666 

Number of spam emails with the word password: 2
Spamicity of the word 'password': 0.5 

Number of spam emails with the word review: 1
Spamicity of the word 'review': 0.3333333333333333 

Number of spam emails with the word our: 4
Spamicity of the word 'our': 0.8333333333333334 

Number of spam emails with the word website: 1
Spamicity of the word 'website': 0.3333333333333333 

Number of spam emails with the word account: 1
Spamicity of the word 'account': 0.3333333333333333 
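
Note that `w in sentence` is a substring test on the raw string, which is why 'our' is counted in all four spam emails above: it also matches inside 'your'. A minimal sketch of a whole-word variant follows; the helper name `smoothed_rate` is hypothetical, and the training list is copied from the first cell.

```python
train_spam = ['send us your password', 'review our website',
              'send your password', 'send us your account']

def smoothed_rate(word, emails):
    """Fraction of emails containing `word` as a whole token, with add-one smoothing."""
    hits = sum(1 for email in emails if word in email.lower().split())
    return (hits + 1) / (len(emails) + 2)

print(smoothed_rate('our', train_spam))   # (1 + 1) / (4 + 2)
print(smoothed_rate('your', train_spam))  # (3 + 1) / (4 + 2)
```

With whole-word matching, 'our' is found in only one spam email, so its smoothed spamicity would be 2/6 rather than 5/6, and any downstream result that uses this word would change accordingly.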



**Calculate Hamicity of non-spam words:**

In [None]:
# make a vocabulary of unique words that occur in known ham emails

vocab_words_ham = []

for sentence in train_ham:
    sentence_as_list = sentence.split()
    for word in sentence_as_list:
        vocab_words_ham.append(word)

vocab_unique_words_ham = list(dict.fromkeys(vocab_words_ham))
print(vocab_unique_words_ham)
dict_hamicity = {}
for w in vocab_unique_words_ham:
    emails_with_w = 0     # counter
    for sentence in train_ham:
        if w in sentence:
            print(w+":", sentence)
            emails_with_w+=1

    print(f"Number of ham emails with the word '{w}': {emails_with_w}")
    total_ham = len(train_ham)
    Hamicity = (emails_with_w+1)/(total_ham+2)       # Smoothing applied
    print(f"Hamicity of the word '{w}': {Hamicity} ")
    dict_hamicity[w.lower()] = Hamicity
# Use built-in lower() to keep all dictionary keys lower case - useful later
# when comparing spamicity vs hamicity of a single word - e.g. 'Your' and
# 'your' would be treated as 2 different words if not normalized to lower case.

['Your', 'activity', 'report', 'benefits', 'physical', 'the', 'importance', 'vows']
Your: Your activity report
Number of ham emails with the word 'Your': 1
Hamicity of the word 'Your': 0.4 
activity: Your activity report
activity: benefits physical activity
Number of ham emails with the word 'activity': 2
Hamicity of the word 'activity': 0.6 
report: Your activity report
Number of ham emails with the word 'report': 1
Hamicity of the word 'report': 0.4 
benefits: benefits physical activity
Number of ham emails with the word 'benefits': 1
Hamicity of the word 'benefits': 0.4 
physical: benefits physical activity
Number of ham emails with the word 'physical': 1
Hamicity of the word 'physical': 0.4 
the: the importance vows
Number of ham emails with the word 'the': 1
Hamicity of the word 'the': 0.4 
importance: the importance vows
Number of ham emails with the word 'importance': 1
Hamicity of the word 'importance': 0.4 
vows: the importance vows
Number of ham emails with the word 'vows': 1

**Compute Probability of Spam P(S):
This computes the probability of any one email being spam, by dividing the total number of spam emails by the total number of all emails.**

In [None]:
prob_spam = len(train_spam) / (len(train_spam)+(len(train_ham)))
print(prob_spam)

0.5714285714285714


**Compute Probability of Ham P(¬S): This computes the probability of any one email being ham, by dividing the total number of ham emails by the total number of all emails.**

In [None]:
prob_ham = len(train_ham) / (len(train_spam)+(len(train_ham)))
print(prob_ham)

0.42857142857142855
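
The two priors can also be computed in one pass from a list of labels. This is a sketch; the `labels` list is an assumption that mirrors the four spam and three ham training emails.

```python
from collections import Counter

labels = ['spam'] * 4 + ['ham'] * 3   # 4 spam and 3 ham training emails

counts = Counter(labels)
total = sum(counts.values())
priors = {label: n / total for label, n in counts.items()}

print(priors['spam'])  # 4 / 7
print(priors['ham'])   # 3 / 7
```

Since the classes are exhaustive, the priors sum to 1, which is a quick check that no email was left out of the count.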


**Given a set of un-labelled test emails, iterate over each, and create list of distinct words:**

In [None]:
tests = []
for i in test_emails['spam']:
    tests.append(i)

for i in test_emails['ham']:
    tests.append(i)

print(tests)

# split emails into distinct words

distinct_words_as_sentences_test = []

for sentence in tests:
    sentence_as_list = sentence.split()
    senten = []
    for word in sentence_as_list:
        senten.append(word)
    distinct_words_as_sentences_test.append(senten)

print(distinct_words_as_sentences_test)

['renew your password', 'renew your vows', 'benefits of our account', 'the importance of physical activity']
[['renew', 'your', 'password'], ['renew', 'your', 'vows'], ['benefits', 'of', 'our', 'account'], ['the', 'importance', 'of', 'physical', 'activity']]


In [None]:
test_spam_tokenized = [distinct_words_as_sentences_test[0], distinct_words_as_sentences_test[1]]
test_ham_tokenized = [distinct_words_as_sentences_test[2], distinct_words_as_sentences_test[3]]
print(test_spam_tokenized)

[['renew', 'your', 'password'], ['renew', 'your', 'vows']]


**Ignore the words that you haven’t seen in the labelled training data:**

In [None]:
reduced_sentences_spam_test = []
for sentence in test_spam_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled spam training data")
    reduced_sentences_spam_test.append(words_)
print(reduced_sentences_spam_test)

'renew', word not present in labelled spam training data
'your', ok
'password', ok
'renew', word not present in labelled spam training data
'your', ok
'vows', ok
[['your', 'password'], ['your', 'vows']]


In [None]:
reduced_sentences_ham_test = []                   # repeat for ham words
for sentence in test_ham_tokenized:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_ham:
            print(f"'{word}', ok")
            words_.append(word)
        elif word in vocab_unique_words_spam:
            print(f"'{word}', ok")
            words_.append(word)
        else:
            print(f"'{word}', word not present in labelled ham training data")
    reduced_sentences_ham_test.append(words_)
print(reduced_sentences_ham_test)

'benefits', ok
'of', word not present in labelled ham training data
'our', ok
'account', ok
'the', ok
'importance', ok
'of', word not present in labelled ham training data
'physical', ok
'activity', ok
[['benefits', 'our', 'account'], ['the', 'importance', 'physical', 'activity']]
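
Both filtering loops keep a word if it appears in either training vocabulary, so they can be expressed with a single set lookup. The sketch below copies the two vocabularies printed earlier and lower-cases them; `keep_known` is a hypothetical helper name.

```python
vocab_spam = ['send', 'us', 'your', 'password', 'review', 'our', 'website', 'account']
vocab_ham = ['Your', 'activity', 'report', 'benefits', 'physical', 'the', 'importance', 'vows']

# Union of both vocabularies, lower-cased so 'Your' and 'your' coincide
known_words = {w.lower() for w in vocab_spam + vocab_ham}

def keep_known(tokens):
    """Drop tokens that never occurred in the labelled training data."""
    return [t for t in tokens if t.lower() in known_words]

print(keep_known(['renew', 'your', 'password']))
print(keep_known(['benefits', 'of', 'our', 'account']))
```

A set makes each membership test constant time on average, which matters once the vocabulary grows beyond a toy example.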


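**Stop-word removal (called 'stemming' in the variable names below) - remove non-key words: Removing non-key words helps the classifier focus on the words that matter most. Strictly speaking, stemming means reducing words to their root form; this step only filters out common words.**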

In [None]:
test_spam_stemmed = []
non_key = ['us',  'the', 'of','your']       # non-key words, gathered from spam,ham and test sentences
for email in reduced_sentences_spam_test:
    email_stemmed=[]
    for word in email:
        if word in non_key:
            print('remove')
        else:
            email_stemmed.append(word)
    test_spam_stemmed.append(email_stemmed)

print(test_spam_stemmed)


remove
remove
[['password'], ['vows']]


In [None]:
test_ham_stemmed = []
non_key = ['us',  'the', 'of', 'your']
for email in reduced_sentences_ham_test:
    email_stemmed=[]
    for word in email:
        if word in non_key:
            print('remove')
        else:
            email_stemmed.append(word)
    test_ham_stemmed.append(email_stemmed)

print(test_ham_stemmed)

remove
[['benefits', 'our', 'account'], ['importance', 'physical', 'activity']]
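
The two cells above differ only in their input list, so the filtering step can be factored into one function. A minimal sketch, with `remove_stop_words` as a hypothetical name and the same non-key words as above:

```python
STOP_WORDS = {'us', 'the', 'of', 'your'}   # same non-key words as in the cells above

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

emails = [['your', 'password'], ['your', 'vows'],
          ['the', 'importance', 'physical', 'activity']]
print([remove_stop_words(e) for e in emails])
```

Calling the function on the spam and ham lists in turn gives the same filtered emails as the two explicit loops.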


**Bayes' Rule**
(To compute the probability of spam given a certain word from an email.)
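
Written out, the per-word quantity computed by the `Bayes` function is the following, where $W$ is a word, $S$ denotes spam and $H$ denotes ham:

$$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}$$

As a worked check using this notebook's values for 'password' ($P(W \mid S) = 0.5$, $P(S) = 4/7$, the smoothed fallback $P(W \mid H) = 1/(3+2) = 0.2$, and $P(H) = 3/7$):

$$P(S \mid \text{password}) = \frac{0.5 \cdot \tfrac{4}{7}}{0.5 \cdot \tfrac{4}{7} + 0.2 \cdot \tfrac{3}{7}} = \frac{2}{2.6} \approx 0.769$$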

In [None]:
def mult(list_) :        # function to multiply all word probs together
    total_prob = 1
    for i in list_:
         total_prob = total_prob * i
    return total_prob

def Bayes(email):
    probs = []
    for word in email:
        Pr_S = prob_spam
        print('prob of spam in general ',Pr_S)
        try:
            pr_WS = dict_spamicity[word]
            print(f'prob "{word}"  is a spam word : {pr_WS}')
        except KeyError:
            pr_WS = 1/(total_spam+2)  # Apply smoothing for word not seen in spam training data, but seen in ham training
            print(f"prob '{word}' is a spam word: {pr_WS}")

        Pr_H = prob_ham
        print('prob of ham in general ', Pr_H)
        try:
            pr_WH = dict_hamicity[word]
            print(f'prob "{word}" is a ham word: ',pr_WH)
        except KeyError:
            pr_WH = (1/(total_ham+2))  # Apply smoothing for word not seen in ham training data, but seen in spam training
            print(f"WH for {word} is {pr_WH}")
            print(f"prob '{word}' is a ham word: {pr_WH}")

        prob_word_is_spam_BAYES = (pr_WS*Pr_S)/((pr_WS*Pr_S)+(pr_WH*Pr_H))
        print('')
        print(f"Using Bayes, prob the the word '{word}' is spam: {prob_word_is_spam_BAYES}")
        print('###########################')
        probs.append(prob_word_is_spam_BAYES)
    print(f"All word probabilities for this sentence: {probs}")
    final_classification = mult(probs)
    if final_classification >= 0.5:
        print(f'email is SPAM: with spammy confidence of {final_classification*100}%')
    else:
        print(f'email is HAM: with spammy confidence of {final_classification*100}%')
    return final_classification
for email in test_spam_stemmed:
    print('')
    print(f"           Testing stemmed SPAM email {email} :")
    print('                 Test word by word: ')
    all_word_probs = Bayes(email)
    print(all_word_probs)


           Testing stemmed SPAM email ['password'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob "password"  is a spam word : 0.5
prob of ham in general  0.42857142857142855
WH for password is 0.2
prob 'password' is a ham word: 0.2

Using Bayes, prob the the word 'password' is spam: 0.7692307692307692
###########################
All word probabilities for this sentence: [0.7692307692307692]
email is SPAM: with spammy confidence of 76.92307692307692%
0.7692307692307692

           Testing stemmed SPAM email ['vows'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob 'vows' is a spam word: 0.16666666666666666
prob of ham in general  0.42857142857142855
prob "vows" is a ham word:  0.4

Using Bayes, prob the the word 'vows' is spam: 0.35714285714285715
###########################
All word probabilities for this sentence: [0.35714285714285715]
email is HAM: with spammy confidence of 35.714285714285715%
0.3571428

**Next we test how likely the stemmed HAM test emails are to be SPAM.**

In [None]:
for email in test_ham_stemmed:
    print('')
    print(f"           Testing stemmed HAM email {email} :")
    print('                 Test word by word: ')
    all_word_probs = Bayes(email)
    print(all_word_probs)


           Testing stemmed HAM email ['benefits', 'our', 'account'] :
                 Test word by word: 
prob of spam in general  0.5714285714285714
prob 'benefits' is a spam word: 0.16666666666666666
prob of ham in general  0.42857142857142855
prob "benefits" is a ham word:  0.4

Using Bayes, prob the the word 'benefits' is spam: 0.35714285714285715
###########################
prob of spam in general  0.5714285714285714
prob "our"  is a spam word : 0.8333333333333334
prob of ham in general  0.42857142857142855
WH for our is 0.2
prob 'our' is a ham word: 0.2

Using Bayes, prob the the word 'our' is spam: 0.847457627118644
###########################
prob of spam in general  0.5714285714285714
prob "account"  is a spam word : 0.3333333333333333
prob of ham in general  0.42857142857142855
WH for account is 0.2
prob 'account' is a ham word: 0.2

Using Bayes, prob the the word 'account' is spam: 0.689655172413793
###########################
All word probabilities for this sentence: [0.3

# Assignment

**Dataset**

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

### Write a program to detect hate speech in tweets using the Multinomial Naive Bayes.

In [None]:
import numpy as np
import pandas as pd

df_train = pd.read_csv('/content/train.csv')
df_test = pd.read_csv('/content/test.csv')


In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
vocab_words_racist = []

racist = df_train[df_train['label']==1]['tweet']
non_racist = df_train[df_train['label']==0]['tweet']

for sentence in df_train[df_train['label']==1]['tweet']:
  sentence_as_list = sentence.split()
  for word in sentence_as_list:
    vocab_words_racist.append(word)

print(vocab_words_racist)
print(len(vocab_words_racist))

In [None]:
vocab_unique_words_racist = list(dict.fromkeys(vocab_words_racist))
print(vocab_unique_words_racist)
print(len(vocab_unique_words_racist))

In [None]:
dict_racist = {}
for w in vocab_unique_words_racist:
    tweets_with_w = 0
    for sentence in racist:
        if w in sentence.split():   # whole-word match; a bare `w in sentence` would also match substrings
            tweets_with_w+=1

    total_racist = len(racist)
    val = (tweets_with_w+1)/(total_racist+2)
    dict_racist[w.lower()] = val

print(total_racist)
print(dict_racist)

In [None]:
vocab_words_nracist = []

for sentence in non_racist:
    sentence_as_list = sentence.split()
    for word in sentence_as_list:
        vocab_words_nracist.append(word)

vocab_unique_words_nracist = list(dict.fromkeys(vocab_words_nracist))
print(vocab_unique_words_nracist)
print(len(vocab_unique_words_nracist))

total_nracist = len(non_racist)
print(total_nracist)


dict_nracist = {}
for w in vocab_unique_words_nracist:
    emails_with_w = 0     # counter
    for sentence in non_racist:
        if w in sentence.split():   # whole-word match; a bare `w in sentence` would also match substrings
            emails_with_w+=1

    val = (emails_with_w+1)/(total_nracist+2)
    dict_nracist[w.lower()] = val

print(dict_nracist)

In [None]:
prob_racist = len(racist) / (len(racist)+(len(non_racist)))
print(prob_racist)

In [None]:
prob_nracist = len(non_racist) / (len(racist)+(len(non_racist)))
print(prob_nracist)

In [None]:
test = pd.read_csv('/content/test.csv')
tests = test['tweet']

distinct_words_as_sentences_test = []

for sentence in tests:
    sentence_as_list = sentence.split()
    senten = []
    for word in sentence_as_list:
        senten.append(word)
    distinct_words_as_sentences_test.append(senten)

print(distinct_words_as_sentences_test)

In [None]:
reduced_sentences_test = []
for sentence in distinct_words_as_sentences_test:
    words_ = []
    for word in sentence:
        if word in vocab_unique_words_racist:
            words_.append(word)
        elif word in vocab_unique_words_nracist:
            words_.append(word)
    reduced_sentences_test.append(words_)
print(reduced_sentences_test)

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

non_key = stopwords.words('english')

test_stemmed = []

for tweet in reduced_sentences_test:
    tweet_stemmed=[]
    for word in tweet:
        if word in non_key:
            print(f'{word} removed')
        else:
            tweet_stemmed.append(word)
    test_stemmed.append(tweet_stemmed)

print(test_stemmed)

In [None]:
def mult(list_) :
    total_prob = 1
    for i in list_:
         total_prob = total_prob * i
    return total_prob

def Bayes(tweet):
    probs = []
    for word in tweet:
        Pr_R = prob_racist
        try:
            pr_WR = dict_racist[word]
        except KeyError:
            pr_WR = 1/(total_racist+2)

        Pr_N = prob_nracist
        try:
            pr_WN = dict_nracist[word]
        except KeyError:
            pr_WN = (1/(total_nracist+2))

        prob_word_is_racist_BAYES = (pr_WR*Pr_R)/((pr_WR*Pr_R)+(pr_WN*Pr_N))
        probs.append(prob_word_is_racist_BAYES)

    final_classification = mult(probs)
    if final_classification >= 0.5:
        return "Racist tweet"
    else:
        return "Non-Racist tweet"

In [None]:
final = []

r=0
for tweet in test_stemmed:
  s = Bayes(tweet)
  if(s=="Racist tweet"):
    r+=1
  final.append(s)

print(final)
print(f'Racist tweets : {r}')
print(f'Non-Racist tweets : {len(test_stemmed)-r}')
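
One caveat of multiplying per-word posteriors is that every factor is at most 1, so the product can only shrink as a tweet gets longer, which biases long tweets toward the non-racist label. A common alternative, sketched here under stated assumptions, is to compare the class scores $\log P(c) + \sum_w \log P(w \mid c)$ directly. The function name `classify_log_space`, the class names, the toy likelihood tables and the fallback values are all hypothetical; this is not the method used in the cells above.

```python
import math

def classify_log_space(tokens, prior, likelihood, fallback):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label in prior:
        score = math.log(prior[label])
        for token in tokens:
            # use the smoothed fallback for words unseen in this class
            score += math.log(likelihood[label].get(token, fallback[label]))
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy numbers for illustration only
prior = {'hate': 0.1, 'other': 0.9}
likelihood = {'hate': {'wordA': 0.6, 'wordB': 0.3},
              'other': {'wordA': 0.001, 'wordB': 0.3}}
fallback = {'hate': 0.01, 'other': 0.01}

label, scores = classify_log_space(['wordA', 'wordB'], prior, likelihood, fallback)
print(label, scores)
```

Working with sums of logarithms also avoids floating-point underflow when a tweet contains many words.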

## Categorical Naïve Bayes

It is suitable for classification with discrete features and assumes a categorical distribution for each feature. The features should be encoded using label (ordinal) encoding so that each category is mapped to a unique integer.

The probability of category $t$ in feature $i$ given class $c$ is estimated as:

![categorical.PNG](attachment:categorical.PNG)

![parameter_categorical.PNG](attachment:parameter_categorical.PNG)
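
In case the attached images do not render, the estimate implemented by `calculate_likelihood_probs` in this notebook can be written as follows. The symbol names are assumptions chosen for this note: $N_{tic}$ is the number of training samples of class $c$ whose feature $i$ takes category $t$, $N_{c}$ is the number of samples of class $c$, $n_{i}$ is the number of categories of feature $i$, and $\alpha$ is the smoothing parameter.

$$P(x_i = t \mid y = c;\ \alpha) = \frac{N_{tic} + \alpha}{N_{c} + \alpha\, n_{i}}$$

For example, with $\alpha = 1$, 4 of the 8 "No" samples having clear weather, and 3 weather categories, the estimate is $(4 + 1)/(8 + 3) = 5/11$.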

## Step By Step Implementation of Categorical Naive Bayes


1. Preprocessing the data.
2. Calculate the counts/presence of each feature based on class.
3. Calculate likelihood probability.
4. Calculate prior probability.
5. Calculate posterior probability for a given query point → Predict function

In [None]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder,LabelBinarizer

In [None]:
weather = ['Clear', 'Clear', 'Clear', 'Clear', 'Clear', 'Clear',
            'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy', 'Rainy',
            'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy', 'Snowy']

timeOfWeek = ['Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend',
            'Workday', 'Workday', 'Workday',
            'Weekend', 'Weekend', 'Weekend']

timeOfDay = ['Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            'Morning', 'Lunch', 'Evening',
            ]
trafficJam = ['Yes', 'No', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'No', 'No', 'No',
            'Yes', 'Yes', 'Yes',
            'Yes', 'No', 'Yes'
            ]

In [None]:
df = pd.DataFrame(zip(weather,timeOfWeek,timeOfDay,trafficJam),columns = ['weather','timeOfWeek','timeOfDay','trafficJam'])
df

| | weather | timeOfWeek | timeOfDay | trafficJam |
|---|---|---|---|---|
| 0 | Clear | Workday | Morning | Yes |
| 1 | Clear | Workday | Lunch | No |
| 2 | Clear | Workday | Evening | Yes |
| 3 | Clear | Weekend | Morning | No |
| 4 | Clear | Weekend | Lunch | No |
| 5 | Clear | Weekend | Evening | No |
| 6 | Rainy | Workday | Morning | Yes |
| 7 | Rainy | Workday | Lunch | Yes |
| 8 | Rainy | Workday | Evening | Yes |
| 9 | Rainy | Weekend | Morning | No |


In [None]:
weather = df['weather'].values.reshape(-1,1)
timeOfWeek = df['timeOfWeek'].values.reshape(-1,1)
timeOfDay = df['timeOfDay'].values.reshape(-1,1)

In [None]:
weather.shape,timeOfWeek.shape

((18, 1), (18, 1))

In [None]:
def preprocess():
    # Using ordinal encoder to convert the categories in the range from 0 to n-1
    wea_enc = OrdinalEncoder()
    weather_ = wea_enc.fit_transform(weather)

    timeOfWeek_enc = OrdinalEncoder()
    timeOfWeek_ = timeOfWeek_enc.fit_transform(timeOfWeek)

    timeOfDay_enc = OrdinalEncoder()
    timeOfDay_ = timeOfDay_enc.fit_transform(timeOfDay)
    # Stacking all the features
    X = np.column_stack((weather_,timeOfWeek_,timeOfDay_))
    # Changing the type to int
    X = X.astype(int)
    # Doing one hot encoding on the target data
    y = df['trafficJam']
    lb = LabelBinarizer()
    y_ = lb.fit_transform(y)
    if y_.shape[1] == 1:
        y_ = np.concatenate((1 - y_, y_), axis=1)
    return X,y_,lb.classes_

**Preprocessing the data:
Converting the categorical data into a numerical form using ordinal encoding. The features are converted to ordinal integers.
This results in a single column of integers (0 to n_categories - 1) per feature.**
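
To make the encoding concrete, here is a pure-Python sketch of what ordinal encoding does to one column. It assumes, as scikit-learn's `OrdinalEncoder` does by default, that categories are taken from the data in sorted order; `ordinal_encode` is a hypothetical helper.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer 0..n_categories-1 (sorted order)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories

codes, categories = ordinal_encode(['Rainy', 'Clear', 'Snowy', 'Clear'])
print(categories)
print(codes)
```

The integer codes carry no numeric meaning here; they only serve as column indices into the per-feature count tables built later.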

In [None]:
X,y,classes = preprocess()
X.shape, y.shape

((18, 3), (18, 2))

In [None]:
def counts_based_onclass(X,y):

    # No of feature
    n_features = X.shape[1]
    # No of classes
    n_classes = y.shape[1]

    count_matrix = []
    # For each feature
    for i in range(n_features):
        count_feature = []
        # Get that particular feature from the dataset
        X_feature = X[:,i]
        # For each class
        for j in range(n_classes):
            # Get the datapoints that belong to the class - j
            mask = y[:,j].astype(bool)
            # Using masking filter out the datapoints that belong to this class- j in the given feature - i
            # Using bincount -- count all the different categories present in the given feature
            counts = np.bincount(X_feature[mask])

            count_feature.append(counts)

        count_matrix.append(np.array(count_feature))
    # Count the datapoints belonging to each class -- used later to calculate the prior probabilities.
    class_count = y.sum(axis=0)

    return count_matrix,n_features,n_classes,class_count


In [None]:
count_matrix,n_features,n_classes,class_count = counts_based_onclass(X,y)

In [None]:
# count_matrix is a list with one 2D array per feature.
# (In each array, the first row corresponds to No and the second row to Yes.)

count_matrix

[array([[4, 3, 1],
        [2, 3, 5]], dtype=int64),
 array([[7, 1],
        [2, 8]], dtype=int64),
 array([[2, 4, 2],
        [4, 2, 4]], dtype=int64)]

In [None]:
def calculate_likelihood_probs(count_matrix,alpha,n_features):
    log_probabilities = []
    for i in range(n_features):
        num = count_matrix[i] + alpha
        den = num.sum(axis = 1).reshape(-1,1)
        log_probability = np.log(num) - np.log(den)
        log_probabilities.append(log_probability)
    return log_probabilities

In [None]:
def calculate_prior_probs(class_count):

    num = class_count
    den = class_count.sum()

    return np.log(num)-np.log(den)

In [None]:
prior_probs = calculate_prior_probs(class_count)

In [None]:
log_probs = calculate_likelihood_probs(count_matrix,1,n_features)

In [None]:
def predict(query_point,log_probs,prior_probs):

    # Initializing an array of zeros, one slot per class
    probs = np.zeros((1,n_classes))
    # For each feature
    for i in range(n_features):
        # Get the category_id of the feature - i from the query_point
        category = query_point[i]
        # Fetch the corresponding log-probability column and keep adding it up across all the features
        probs+=log_probs[i][:,category]
    # Finally add the log prior probability to obtain the (unnormalized) log posterior
    probs+=prior_probs
    # Finding the maximum of the probabilities and fetching the corresponding class
    return classes[np.argmax(probs)]

In [None]:
print('Likelihood probabilities\n',log_probs)
print('Prior probabilities\n',prior_probs)
#print('Predict',predict(X[4],log_probs,prior_probs))

Likelihood probabilities
 [array([[-0.78845736, -1.01160091, -1.70474809],
       [-1.46633707, -1.178655  , -0.77318989]]), array([[-0.22314355, -1.60943791],
       [-1.38629436, -0.28768207]]), array([[-1.29928298, -0.78845736, -1.29928298],
       [-0.95551145, -1.46633707, -0.95551145]])]
Prior probabilities
 [-0.81093022 -0.58778666]
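
As a sanity check, the printed log-probabilities can be reproduced from the count matrix with the standard library alone. The sketch below hard-codes the counts and class totals shown above (rows are No, then Yes) and scores the query Clear / Weekend / Lunch, which is row 4 of the data frame; the variable and function names are illustrative.

```python
import math

# counts[feature][class][category], copied from count_matrix (class 0 = No, 1 = Yes)
counts = [
    [[4, 3, 1], [2, 3, 5]],   # weather: Clear, Rainy, Snowy
    [[7, 1], [2, 8]],         # timeOfWeek: Weekend, Workday
    [[2, 4, 2], [4, 2, 4]],   # timeOfDay: Evening, Lunch, Morning
]
class_count = [8, 10]
classes = ['No', 'Yes']
alpha = 1

def log_likelihood(feature, cls, category):
    """Smoothed log P(category | class) for one feature."""
    row = counts[feature][cls]
    return math.log(row[category] + alpha) - math.log(sum(row) + alpha * len(row))

def score(query, cls):
    """Log prior plus the summed log likelihoods of the query's categories."""
    prior = math.log(class_count[cls]) - math.log(sum(class_count))
    return prior + sum(log_likelihood(i, cls, cat) for i, cat in enumerate(query))

query = (0, 0, 1)   # Clear, Weekend, Lunch
scores = [score(query, c) for c in range(len(classes))]
print(log_likelihood(0, 0, 0))                 # log(5/11), first entry of the table
print(classes[scores.index(max(scores))])
```

The predicted class for this query agrees with the label of row 4 in the training data (No).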


# Assignment

**Dataset characteristics:**

1. Number of instances: 10000
2. Number of attributes: 5 (including target attribute), all categorical
3. Attribute information:
    * size (XS, S, M, L, XL, XXL, 3XL)
    * material (nylon, polyester, silk, cotton, linen)
    * color (white, cream, blue, black, orange, green, yellow, red, violet, navy)
    * sleeves (short, long)
    * demand (low, medium, high)

### Write a program to implement the Categorical Naive Bayes classification algorithm to predict clothing demand (low, medium, high) based on the rest of the attributes.

In [None]:
import numpy as np
import pandas as pd

from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder,LabelBinarizer

In [None]:
df = pd.read_csv('/content/Clothing.csv', index_col=0)

In [None]:
df.head()

| | size | material | color | sleeves | demand |
|---|---|---|---|---|---|
| 0 | S | nylon | white | long | medium |
| 1 | XL | polyester | cream | short | high |
| 2 | S | silk | blue | short | medium |
| 3 | M | cotton | black | short | medium |
| 4 | XL | polyester | orange | long | medium |


In [None]:
size = df['size'].values.reshape(-1,1)
material = df['material'].values.reshape(-1,1)
color = df['color'].values.reshape(-1,1)
sleeves = df['sleeves'].values.reshape(-1,1)

In [None]:
size.shape, material.shape, color.shape, sleeves.shape

((10000, 1), (10000, 1), (10000, 1), (10000, 1))

In [None]:
def preprocess():
    size_enc = OrdinalEncoder()
    size_ = size_enc.fit_transform(size)

    material_enc = OrdinalEncoder()
    material_ = material_enc.fit_transform(material)

    color_enc = OrdinalEncoder()
    color_ = color_enc.fit_transform(color)

    sleeves_enc = OrdinalEncoder()
    sleeves_ = sleeves_enc.fit_transform(sleeves)

    X = np.column_stack((size_,material_,color_,sleeves_))
    X = X.astype(int)
    y = df['demand']
    lb = LabelBinarizer()
    y_ = lb.fit_transform(y)
    if y_.shape[1] == 1:
        y_ = np.concatenate((1 - y_, y_), axis=1)
    return X,y_,lb.classes_

In [None]:
X,y,classes = preprocess()
X.shape, y.shape, classes

((10000, 4), (10000, 3), array(['high', 'low', 'medium'], dtype='<U6'))

In [None]:
def counts_based_onclass(X,y):

    n_features = X.shape[1]
    n_classes = y.shape[1]

    count_matrix = []
    for i in range(n_features):
        count_feature = []
        X_feature = X[:,i]
        # Number of categories in this feature (largest code + 1)
        n_categories = len(np.bincount(X_feature))
        for j in range(n_classes):
            mask = y[:,j].astype(bool)
            # minlength pads with zeros when a class never sees the highest categories
            counts = np.bincount(X_feature[mask], minlength=n_categories)

            print(counts)

            count_feature.append(counts)

        count_matrix.append(np.array(count_feature))
        class_count = y.sum(axis=0)

    return count_matrix,n_features,n_classes,class_count

In [None]:
count_matrix,n_features,n_classes,class_count = counts_based_onclass(X,y)

[ 142 1169  707  835  622  509  345]
[344 110 154  28 277  45 345]
[359 952 866 421 841 329 600]
[ 866 1090 1185 1188    0]
[214  64 266 287 472]
[1074  479 1349 1365  101]
[  0 597 687 196 660 338 588 315 614 334]
[111  86  57  84  73 112  97 117 441 125]
[1427  317  379  261  399  209  313  230  591  242]
[1034 3295]
[864 439]
[3146 1222]


In [None]:
count_matrix

[array([[ 142, 1169,  707,  835,  622,  509,  345],
        [ 344,  110,  154,   28,  277,   45,  345],
        [ 359,  952,  866,  421,  841,  329,  600]]),
 array([[ 866, 1090, 1185, 1188,    0],
        [ 214,   64,  266,  287,  472],
        [1074,  479, 1349, 1365,  101]]),
 array([[   0,  597,  687,  196,  660,  338,  588,  315,  614,  334],
        [ 111,   86,   57,   84,   73,  112,   97,  117,  441,  125],
        [1427,  317,  379,  261,  399,  209,  313,  230,  591,  242]]),
 array([[1034, 3295],
        [ 864,  439],
        [3146, 1222]])]

In [None]:
def calculate_likelihood_probs(count_matrix,alpha,n_features):
    log_probabilities = []
    for i in range(n_features):
        num = count_matrix[i] + alpha
        print(num)
        den = num.sum(axis=1).reshape(-1,1)
        print(den)
        log_probability = np.log(num) - np.log(den)
        log_probabilities.append(log_probability)
    return log_probabilities

In [None]:
def calculate_prior_probs(class_count):

    num = class_count
    den = class_count.sum()

    return np.log(num)-np.log(den)

In [None]:
prior_probs = calculate_prior_probs(class_count)

In [None]:
log_probs = calculate_likelihood_probs(count_matrix,1,n_features)

[[ 143 1170  708  836  623  510  346]
 [ 345  111  155   29  278   46  346]
 [ 360  953  867  422  842  330  601]]
[[4336]
 [1310]
 [4375]]
[[ 867 1091 1186 1189    1]
 [ 215   65  267  288  473]
 [1075  480 1350 1366  102]]
[[4334]
 [1308]
 [4373]]
[[   1  598  688  197  661  339  589  316  615  335]
 [ 112   87   58   85   74  113   98  118  442  126]
 [1428  318  380  262  400  210  314  231  592  243]]
[[4339]
 [1313]
 [4378]]
[[1035 3296]
 [ 865  440]
 [3147 1223]]
[[4331]
 [1305]
 [4370]]


In [None]:
def predict(query_point,log_probs,prior_probs):

    # Initializing an array of zeros, one slot per class
    probs = np.zeros((1,n_classes))
    # For each feature
    for i in range(n_features):
        # Get the category_id of the feature - i from the query_point
        category = query_point[i]
        # Fetch the corresponding log-probability column and keep adding it up across all the features
        probs+=log_probs[i][:,category]
    # Finally add the log prior probability to obtain the (unnormalized) log posterior
    probs+=prior_probs
    # Finding the maximum of the probabilities and fetching the corresponding class
    return classes[np.argmax(probs)]

In [None]:
print('Likelihood probabilities\n',log_probs)
print('Prior probabilities\n',prior_probs)
print('Predict',predict(X[4],log_probs,prior_probs))

Likelihood probabilities
 [array([[-3.41186291, -1.30994852, -1.81226345, -1.64607893, -1.94016102,
        -2.14029682, -2.52826877],
       [-1.334238  , -2.46825221, -2.1343573 , -3.81048659, -1.5501613 ,
        -3.34914102, -1.33134364],
       [-2.49755777, -1.5240469 , -1.61862282, -2.33865648, -1.64788178,
        -2.58456914, -1.98506686]]), array([[-1.60920721, -1.3793962 , -1.2959046 , -1.29337829, -8.37424618],
       [-1.8056165 , -3.00186726, -1.58900587, -1.51329405, -1.01715914],
       [-1.40312861, -2.20941845, -1.17534468, -1.16356251, -3.75823174]]), array([[-8.37539919, -1.98180843, -1.84161035, -3.09219546, -1.88164535,
        -2.54939908, -1.996973  , -2.61965697, -1.95377692, -2.56126865],
       [-2.461571  , -2.71416176, -3.11962686, -2.73741862, -2.87600478,
        -2.45268206, -2.5951024 , -2.40938525, -1.08875999, -2.34378797],
       [-1.12031714, -2.6222959 , -2.44417603, -2.81600277, -2.39288273,
        -3.03723975, -2.63495429, -2.94192957, -2.000840

array([0, 0, 1])