# Naive Bayes Document Classification

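To classify a document we must compute two sets of values: the class priors and the smoothed conditional probability of each word given each class.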

Priors:

$P(c)=\frac{N_c}{N}$
where $N_c$ is the number of documents with class $c$ and $N$ is the total number of documents.
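
As a minimal sketch, the priors can be computed directly from a list of training labels. The label list below is hypothetical, chosen only so that the class proportions match the priors used in the code further down (1/4, 3/8, 3/8); the helper name `class_priors` is also an assumption.

```python
from collections import Counter
from fractions import Fraction

def class_priors(labels):
    """Return P(c) = N_c / N for every class c that appears in labels."""
    n = len(labels)
    return {c: Fraction(n_c, n) for c, n_c in Counter(labels).items()}

# Hypothetical training labels: 2 vegetable, 3 flower and 3 fruit documents.
labels = ["vegetable"] * 2 + ["flower"] * 3 + ["fruit"] * 3
priors = class_priors(labels)
print(priors)
```

Using `Fraction` keeps the priors exact, which makes it easy to check that they sum to 1.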

We then calculate the conditional probability of each word given each class. For this calculation we do not compute conditional probabilities for every word in the vocabulary, only for the words that appear in D1 and D2.

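Using add-$\lambda$ smoothing with $\lambda = 0.1$:
$$P(w|c)=\frac{count(w,c)+\lambda}{count(c)+|V|\cdot \lambda}$$
where $count(w,c)$ is the number of times word $w$ occurs in class $c$, $count(c)$ is the total number of words in class $c$, and $|V|$ is the vocabulary size.

Example calculation:

$P(rose|vegetable)=\frac{0+0.1}{8+7\cdot 0.1} \approx 0.0115$

The other calculations are carried out in the code below.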
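
The example calculation can be checked numerically. This is a small sketch of the smoothing formula; the function and parameter names are assumptions introduced for illustration.

```python
def smoothed_probability(word_count, class_count, vocab_size, lam=0.1):
    """Add-lambda estimate of P(w|c)."""
    return (word_count + lam) / (class_count + vocab_size * lam)

# P(rose|vegetable) with count(w,c) = 0, count(c) = 8 and |V| = 7.
p_rose_vegetable = smoothed_probability(0, 8, 7)
print(round(p_rose_vegetable, 6))  # 0.011494
```

Because $\lambda$ is added to the numerator, a word that never occurs in a class still receives a small non-zero probability, so a single unseen word cannot force the whole product to zero.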

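We then find the most probable class for a document using
$P(c|d) \propto P(c) \cdot \prod_{i=1}^{n}{P(d_i|c)}$
where $c$ is a class, $d$ is a document, and $d_i$ is the $i$-th of the $n$ words in $d$. The evidence $P(d)$ is the same for every class, so it can be dropped when comparing classes.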

Example calculation:
$P(flower|D1) \propto P(flower) \cdot P(rose|flower) \cdot P(lily|flower) \cdot P(apple|flower) \cdot P(carrot|flower)$
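
Products of many small probabilities can underflow for longer documents, so a common alternative is to sum logarithms instead. The sketch below is an illustration only, not part of the original calculation; it uses the flower prior and the four smoothed flower probabilities from the code cell that follows, and `log_score` is a hypothetical helper name.

```python
import math

def log_score(prior, word_probs):
    """Return log P(c) + sum_i log P(d_i|c)."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Flower prior and the smoothed probabilities of rose, lily, apple and carrot.
prior = 3 / 8
word_probs = [6.1 / 13.7, 1.1 / 13.2, 0.1 / 13.2, 0.1 / 13.2]

direct = prior * math.prod(word_probs)                 # plain product
via_logs = math.exp(log_score(prior, word_probs))      # same value via logs
print(direct, via_logs)
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest product.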



In [84]:
def p(wc, c, v, l=0.1):
    """Add-lambda estimate: wc = count(w,c), c = count(c), v = |V| term, l = lambda."""
    return (wc + l)/(c + v * l)

P={}

P[('rose', 'vegetable')] = p(0, 8, 7)
P[('lily', 'vegetable')] = p(0, 8, 2)
P[('apple', 'vegetable')] = p(0, 8, 2)
P[('carrot', 'vegetable')] = p(1, 8, 2)

P[('rose', 'flower')] = p(6, 13, 7)
P[('lily', 'flower')] = p(1, 13, 2)
P[('apple', 'flower')] = p(0, 13, 2)
P[('carrot', 'flower')] = p(0, 13, 2)

P[('rose', 'fruit')] = p(1, 14, 7)
P[('lily', 'fruit')] = p(1, 14, 2)
P[('apple', 'fruit')] = p(2, 14, 2)
P[('carrot', 'fruit')] = p(1, 14, 2)

#Priors
P['vegetable'] = 1/4
P['flower'] = 3/8
P['fruit'] = 3/8

D1_flower = P['flower']*P[('rose', 'flower')]*P[('lily', 'flower')]*P[('apple', 'flower')]*P[('carrot', 'flower')]
print("D1_flower", D1_flower)
D1_fruit = P['fruit']*P[('rose', 'fruit')]*P[('lily', 'fruit')]*P[('apple', 'fruit')]*P[('carrot', 'fruit')]
print("D1_fruit", D1_fruit)
D1_vegetable = P['vegetable']*P[('rose', 'vegetable')]*P[('lily', 'vegetable')]*P[('apple', 'vegetable')]*P[('carrot', 'vegetable')]
print("D1_vegetable", D1_vegetable)

D1_flower 7.985671244629444e-07
D1_fruit 2.490268929586247e-05
D1_vegetable 5.732867232465228e-08


We take the argmax of these values and find that fruit is the most probable class for D1.
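
The argmax step can be written out explicitly. This sketch copies the three scores printed above into a dictionary; the variable names are assumptions.

```python
# Scores printed by the cell above for D1.
d1_scores = {
    "flower": 7.985671244629444e-07,
    "fruit": 2.490268929586247e-05,
    "vegetable": 5.732867232465228e-08,
}

# Pick the class whose score is largest.
best_class = max(d1_scores, key=d1_scores.get)
print(best_class)  # fruit
```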

The same procedure is applied to D2, in which pea occurs twice, so its probability is squared:

In [86]:
P[('pea', 'vegetable')] = p(2, 8, 3)
P[('lotus', 'vegetable')] = p(1, 8, 2)
P[('grape', 'vegetable')] = p(0, 8, 2)

P[('pea', 'flower')] = p(1, 13, 3)
P[('lotus', 'flower')] = p(0, 13, 2)
P[('grape', 'flower')] = p(0, 13, 2)

P[('pea', 'fruit')] = p(0, 14, 3)
P[('lotus', 'fruit')] = p(1, 14, 2)
P[('grape', 'fruit')] = p(2, 14, 2)

D2_flower = P['flower']*(P[('pea', 'flower')]**2)*P[('lotus', 'flower')]*P[('grape', 'flower')]
print("D2_flower", D2_flower)
D2_fruit = P['fruit']*(P[('pea', 'fruit')]**2)*P[('lotus', 'fruit')]*P[('grape', 'fruit')]
print("D2_fruit", D2_fruit)
D2_vegetable = P['vegetable']*(P[('pea', 'vegetable')]**2)*P[('lotus', 'vegetable')]*P[('grape', 'vegetable')]
print("D2_vegetable", D2_vegetable)

D2_flower 1.47219552641001e-07
D2_fruit 2.1008472857159783e-07
D2_vegetable 2.618107011591733e-05


We find that D2 is classified as vegetable.
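
The printed scores are unnormalised. Assuming flower, fruit and vegetable are the only classes, dividing each score by their sum gives the posterior probability of each class under the model. A minimal sketch using the D2 values printed above (variable names are assumptions):

```python
# Scores printed by the cell above for D2.
d2_scores = {
    "flower": 1.47219552641001e-07,
    "fruit": 2.1008472857159783e-07,
    "vegetable": 2.618107011591733e-05,
}

# Normalise so that the three values sum to 1.
total = sum(d2_scores.values())
d2_posteriors = {c: s / total for c, s in d2_scores.items()}

for c, prob in sorted(d2_posteriors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{c}: {prob:.4f}")
```

Normalising does not change which class wins; it only makes the margin between classes easier to read.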

# Word Sense Disambiguation

To count the possible sense combinations, each word of the following passage is looked up in WordNet and its number of senses (synsets) is recorded.

In the cold weather, they started to the city. They were least worried protecting themselves
against the common cold. After she signed the agreement, a cold chill crept up her spine.
“Chill, its not that serious,” her husband assured and left to deposit cash at the bank.
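
The counting idea can be sketched without WordNet: if every word in a sentence can independently take any of its senses, the number of sense combinations for the sentence is the product of the per-word sense counts. The sense counts below are made-up placeholders, not WordNet values, and `count_combinations` is a hypothetical helper.

```python
from math import prod

def count_combinations(words, sense_counts):
    """Product of per-word sense counts; words with no entry count as one sense."""
    return prod(max(sense_counts.get(w, 0), 1) for w in words)

# Placeholder sense counts for illustration only (not taken from WordNet).
sense_counts = {"cold": 3, "chill": 2}
words = "a cold chill crept up her spine".split()
print(count_combinations(words, sense_counts))  # 6
```

Treating a word with no entry as having one sense keeps it from zeroing out the product.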



In [17]:
from nltk import download
download('wordnet')
download('punkt')


In [None]:
import string
from math import prod

from nltk.corpus import wordnet as wn

raw = "In the cold weather, they started to the city. They were least worried protecting themselves against the common cold. After she signed the agreement, a cold chill crept up her spine. Chill, its not that serious, her husband assured and left to deposit cash at the bank"
# Split into sentences on full stops, then strip punctuation and lowercase.
sents = [s.translate(str.maketrans('', '', string.punctuation)).lower() for s in raw.strip().split(".")]

sentence_senses = []
word_senses = {}
for s in sents:
    words = s.split()  # split on any whitespace so no empty tokens are produced
    if not words:
        continue
    sentencecount = 1  # multiplicative identity; starting at 0 would zero the product
    for word in words:
        # Words with no WordNet entry count as having a single sense.
        syns = max(len(wn.synsets(word)), 1)
        sentencecount *= syns
        word_senses[word] = syns
    sentence_senses.append(sentencecount)

# Python integers do not overflow, so math.prod is safe for large products.
print("Sense combinations per sentence: ", sentence_senses)
print("Total sense combinations: ", prod(sentence_senses))


# Language Modelling

Implement a 4-gram language model.
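
One common way to turn 4-gram counts into a language model is the maximum-likelihood estimate $P(w_4|w_1 w_2 w_3)=\frac{count(w_1 w_2 w_3 w_4)}{count(w_1 w_2 w_3)}$. The sketch below illustrates that estimate on a toy token list. The helper names and the toy sentence are assumptions, and the sketch leaves out the smoothing and UNK handling that a full model would need.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def context_counts(four_grams):
    """Count how often each 3-word context is followed by some word."""
    contexts = Counter()
    for gram, c in four_grams.items():
        contexts[gram[:-1]] += c
    return contexts

def mle_probability(word, context, four_grams, contexts):
    """Maximum-likelihood P(word | context); 0.0 for an unseen context."""
    if contexts[context] == 0:
        return 0.0
    return four_grams[context + (word,)] / contexts[context]

tokens = "the cat sat on the mat and the cat sat on the rug".split()
four_grams = count_ngrams(tokens, 4)
contexts = context_counts(four_grams)
print(mle_probability("mat", ("sat", "on", "the"), four_grams, contexts))  # 0.5
```

Deriving the context counts from the 4-gram table, rather than from a separate trigram table, ensures the probabilities for each context sum to 1.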

In [84]:
from os import listdir
from nltk import word_tokenize
import pickle
from collections import Counter

unk_threshold = 5
cachefile = 'corpus.txt'
unkfile = 'unk-corpus.txt'

def save(corpus, file = cachefile):
    with open(file, 'wb') as f:
        pickle.dump(corpus, f, pickle.HIGHEST_PROTOCOL)

def read(file = cachefile):
    with open(file, 'rb') as f:
        return pickle.load(f)

try:
    text = read()
except FileNotFoundError:
    # No cached tokens yet: build the corpus from every file in the gutenberg directory.
    corpus = ""
    base = 'gutenberg'
    for file in listdir(base):
        with open(base + "/" + file) as f:
            for line in f:
                corpus += ' ' + line.strip().lower().replace('  ', ' ')
    text = word_tokenize(corpus)
    save(text)

wordcount = Counter(text)
#print(wordcount)

# Replace rare words with 'UNK': any word that occurs unk_threshold (5) times
# or fewer is treated as unknown.
# To do this, record the indices at which each word occurs; a word's count is
# the length of its index list. For each word at or below the threshold,
# overwrite every one of its positions in the text with 'UNK'.

def replace_unk(text, threshold):
    """Return text with every word occurring at most `threshold` times replaced by 'UNK'."""
    try:
        return read(unkfile)
    except FileNotFoundError:
        # Map each word to the list of positions where it occurs.
        counterdict = {}
        for i, t in enumerate(text):
            counterdict.setdefault(t, []).append(i)

        for word, locations in counterdict.items():
            if len(locations) <= threshold:
                for loc in locations:
                    text[loc] = 'UNK'
        save(text, unkfile)
        return text

text = replace_unk(text, unk_threshold)

# An n-gram count table maps each tuple of n consecutive tokens to the number
# of times it occurs. For the 4-gram model, each key is a word together with
# the three words that precede it.

def ngram(n, text):
    ngrams = {}
    for i in range(n, len(text) + 1):
        # The n tokens ending just before position i.
        gram = tuple(text[i-n:i])
        ngrams[gram] = ngrams.get(gram, 0) + 1

    # Save a textual representation of the counts, most frequent first.
    with open('ngrams.txt', 'w') as f:
        for gram in sorted(ngrams, key=ngrams.get, reverse=True):
            f.write(' '.join(gram) + ' ' + str(ngrams[gram]) + '\n')
    return ngrams



{('[', 'emma', 'by', 'jane'): 1,
 ('emma', 'by', 'jane', 'UNK'): 1,
 ('by', 'jane', 'UNK', 'UNK'): 3,
 ('jane', 'UNK', 'UNK', ']'): 3,
 ('UNK', 'UNK', ']', 'volume'): 1,
 ('UNK', ']', 'volume', 'i'): 1,
 (']', 'volume', 'i', 'chapter'): 1,
 ('volume', 'i', 'chapter', 'i'): 1,
 ('i', 'chapter', 'i', 'emma'): 1,
 ('chapter', 'i', 'emma', 'woodhouse'): 1,
 ('i', 'emma', 'woodhouse', ','): 1,
 ('emma', 'woodhouse', ',', 'handsome'): 1,
 ('woodhouse', ',', 'handsome', ','): 1,
 (',', 'handsome', ',', 'clever'): 1,
 ('handsome', ',', 'clever', ','): 1,
 (',', 'clever', ',', 'and'): 1,
 ('clever', ',', 'and', 'rich'): 1,
 (',', 'and', 'rich', ','): 1,
 ('and', 'rich', ',', 'with'): 1,
 ('rich', ',', 'with', 'a'): 1,
 (',', 'with', 'a', 'comfortable'): 1,
 ('with', 'a', 'comfortable', 'home'): 1,
 ('a', 'comfortable', 'home', 'and'): 1,
 ('comfortable', 'home', 'and', 'happy'): 1,
 ('home', 'and', 'happy', 'disposition'): 1,
 ('and', 'happy', 'disposition', ','): 1,
 ('happy', 'disposition', '

5
