# AnTeDe Lab 2: Text Classification - Part A

## Session goal
The goal of this session is to implement a Multinomial Naive Bayes classifier from scratch.

## Data collection
We are going to use a small toy dataset. Each document is a single sentence. The training data contains three documents, each from a different class.

In [None]:
import pandas as pd
import nltk

# these 3 lines are here for compatibility purposes
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
#

training_corpus=["The Limmat flows out of the lake.", 
           "The bears are in the bear pit near the river.",
           "The Rhône flows out of Lake Geneva.",
          ]
training_labels=["zurich", 
         "bern",
         "geneva",
        ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


We are also going to need a helper function that can normalize a string.

In [None]:
def normalize(document, keep_punctuation=False,
              keep_stop_words=False, keep_inflected=True, keep_numbers=False):
    import string

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    word_tokens = word_tokenize(document)

    punctuation = set(string.punctuation)
    stop_words = set(stopwords.words('english'))

    # Lowercase every token and drop the ones excluded by the keep_* flags
    normalized = [w.lower() for w in word_tokens
                  if (w.lower() not in punctuation or keep_punctuation)
                  and (w.lower() not in stop_words or keep_stop_words)
                  and (w.lower().isalnum() or keep_punctuation)
                  and (not w.lower().isdigit() or keep_numbers)]

    # Lemmatize only when inflected forms should not be kept
    if not keep_inflected:
        wl = WordNetLemmatizer()
        normalized = [wl.lemmatize(w) for w in normalized]

    return normalized

How does *keep_inflected* affect the output of __normalize__?

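> With *keep_inflected=True* (the default), the tokens are not lemmatized, so inflected forms are kept. With *keep_inflected=False*, every token is lemmatized.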



In [None]:
# BEGIN_REMOVE
normalized_training_corpus = [normalize(item, keep_inflected=False) for item in training_corpus]    
inflected_training_corpus = [normalize(item, keep_inflected=True) for item in training_corpus] 

df=pd.DataFrame(columns=['original', 'normalized', 'inflected'])
df['original']=training_corpus
df['normalized']=normalized_training_corpus
df['inflected']=inflected_training_corpus
print (df.to_string())
# keep_inflected maintains inflected forms such as 'bears' and 'flows'
# END_REMOVE

                                        original                      normalized                        inflected
0              The Limmat flows out of the lake.            [limmat, flow, lake]            [limmat, flows, lake]
1  The bears are in the bear pit near the river.  [bear, bear, pit, near, river]  [bears, bear, pit, near, river]
2            The Rhône flows out of Lake Geneva.     [rhône, flow, lake, geneva]     [rhône, flows, lake, geneva]




Now, we need to define a __get_vocabulary__ function that gets us all the unique words that appear in the normalized documents.

In [None]:
def get_vocabulary (data):
    return list(set(sum(data,[])))    

Print the vocabulary.

In [None]:
# BEGIN_REMOVE
print(get_vocabulary(normalized_training_corpus))  
# END_REMOVE

['river', 'limmat', 'flow', 'near', 'rhône', 'lake', 'pit', 'geneva', 'bear']
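
Because a `set` has no guaranteed order, the vocabulary can be printed in a different order from one Python session to the next. A minimal sketch of a deterministic variant follows. The name `get_sorted_vocabulary` is hypothetical and not part of the lab, and `itertools.chain` is used here in place of `sum(data, [])` to flatten the list of token lists.

```python
from itertools import chain

def get_sorted_vocabulary(data):
    # Hypothetical helper (not part of the lab): flatten the list of token
    # lists, keep the unique tokens, and sort them for a reproducible order.
    return sorted(set(chain.from_iterable(data)))

# Token lists copied from the 'normalized' column printed earlier
docs = [['limmat', 'flow', 'lake'],
        ['bear', 'bear', 'pit', 'near', 'river'],
        ['rhône', 'flow', 'lake', 'geneva']]

vocabulary = get_sorted_vocabulary(docs)
print(vocabulary)
print(len(vocabulary))
```

A sorted list also makes it easier to compare the vocabulary across runs.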


We define a class __ms_timer__ that helps us time snippets of code. Its definition follows a special syntax that serves to implement what is known as a context manager. 

Each code snippet that we wish to time will be placed in an indented block following a __with__ statement. Once the indented block has finished, the run time of the snippet in milliseconds can be read with the method __get_elapsed_time__. 

(You can do this same thing effortlessly in an IDE with profiling, but this is a good way to do it in a Jupyter notebook.)

In [None]:
import time
class ms_timer:
            
    def __enter__(self):
        self.start=time.time()
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop=time.time()
    def get_elapsed_time(self):
        return 1000*(self.stop-self.start)

Here's an example of how to time code snippets using the context manager trick.

In [None]:
my_data = range(1, 10)

with ms_timer() as timer:
    prod=1
    for item in my_data:
        prod=prod*item
print ("Elapsed time for the loop: "+str(round(timer.get_elapsed_time(), 4))+" ms")   

Elapsed time for the loop: 0.0064 ms
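
For a snippet this short, the resolution of `time.time()` may be too coarse on some platforms. A possible variant, sketched here under the assumption that only the elapsed time matters, uses `time.perf_counter()` from the standard library instead. The class name `perf_timer` is hypothetical.

```python
import time

class perf_timer:
    # Hypothetical variant of ms_timer based on time.perf_counter(),
    # a monotonic high-resolution clock meant for measuring short durations.
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop = time.perf_counter()
        # Returning None (falsy) lets exceptions from the block propagate

    def get_elapsed_time(self):
        # Elapsed time in milliseconds
        return 1000 * (self.stop - self.start)

with perf_timer() as timer:
    prod = 1
    for item in range(1, 10):
        prod = prod * item

print("Elapsed time for the loop: " + str(round(timer.get_elapsed_time(), 4)) + " ms")
```

The measured value will differ from run to run and from machine to machine.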


## MNB from scratch

We are now ready to implement our MNB from scratch. Our implementation is contained in a class called __naive_bayes__. We can define our class across multiple cells simply by defining a derived class with exactly the same name in the following cells.

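First we compute the posterior probabilities, that is, the Laplace-smoothed probability of each token given each class. (Strictly speaking, these are the class-conditional likelihoods P(token | class); this notebook refers to them as posterior probabilities throughout.)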

In [None]:
class naive_bayes:


    @staticmethod
    def get_posterior_probabilities (training_data, verbose=True):

        import re

        posterior = {}

        vocabulary = get_vocabulary(training_data['documents'])
        lw = len(vocabulary)

        # Take the classes from the function argument, not from a global variable
        classes = list(set(training_data['labels']))

        for index, c in enumerate(classes): 

            # Concatenate the token lists of all documents tagged with class c
            current_class_docs = sum(training_data['documents']
                                     [training_data['labels']==c], [])

            # Total number of tokens in class c
            den = len(current_class_docs)

            for w in vocabulary:

                num = current_class_docs.count(w)                
                posterior[(w,c)]=(1+num)/(den+lw)

                if verbose:
                    
                    print ('_'*30)
                    message = 'Token '+w+' appears '+str(num)+' times in class '+c
                    message=re.sub('1 times', 'once', message)
                    print (message)
                    
                    message = 'There are '+str(den)+' tokens in class '+c
                    message=re.sub('are 1 tokens', 'is 1 token', message)
                    print (message)
                    
                    print (current_class_docs)
                    print ('Vocab size: '+str(lw))
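                    # Avoid a ZeroDivisionError for a class without any tokens
                    unsmoothed = str(round(num/den, 2)) if den else 'undefined'
                    print ('Posterior without Laplace smoothing: '+str(num)+'/'+str(den)+'='+unsmoothed)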
                    print ('Posterior with Laplace smoothing: '+str(1+num)+'/'+str(den+lw)+'='+str(round(posterior[(w,c)], 2)))

        return posterior  


3) What is the posterior probability of finding 'limmat' given that the document is tagged as 'zurich'? Complete the following code snippet to find out. Use *verbose* to see what's going on under the hood.
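> The posterior probability is 0.17, i.e. (1+1)/(3+9) = 2/12: 'limmat' occurs once among the 3 tokens of class 'zurich', and the vocabulary contains 9 words.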
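
The value 0.17 can be checked by hand. The sketch below recomputes the smoothed estimate for a single (token, class) pair from the normalized token list of the 'zurich' document. The function name `smoothed_estimate` is hypothetical; it only mirrors the formula `(1+num)/(den+lw)` used in `get_posterior_probabilities`.

```python
def smoothed_estimate(token, class_tokens, vocabulary_size):
    # Hypothetical helper mirroring (1 + num) / (den + lw) from the lab code
    num = class_tokens.count(token)   # occurrences of the token in the class
    den = len(class_tokens)           # total number of tokens in the class
    return (1 + num) / (den + vocabulary_size)

zurich_tokens = ['limmat', 'flow', 'lake']   # normalized 'zurich' document
vocabulary_size = 9                          # number of unique normalized tokens

p = smoothed_estimate('limmat', zurich_tokens, vocabulary_size)
print("%.2f" % p)   # (1 + 1) / (3 + 9) = 2/12
```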

In [None]:
# The method get_posterior_probabilities expects the training data in the form of a data frame
training_data = pd.DataFrame(columns=['documents', 'labels'])
training_data['documents']=normalized_training_corpus
training_data['labels']=training_labels

# BEGIN_REMOVE
posterior=naive_bayes.get_posterior_probabilities(training_data, verbose=True)
print ("%.2f"%posterior['limmat', 'zurich'])
# END_REMOVE

______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
_______

Complete the following code so we can train the classifier.

In [None]:
class naive_bayes(naive_bayes):
    
    def train(self, training_data, timing=False):

            

            # One prior per distinct class (not one per document), in sorted order
            classes = sorted(set(training_data['labels']))

            # BEGIN_REMOVE            
            with ms_timer() as timer:

                P_c = \
                [(training_data['labels']==tagged_class).sum()/len(training_data) \
                 for tagged_class in classes]
            if timing:
                print('Priors probabilities computed in '+"%.2f"%timer.get_elapsed_time()+" ms")
            
            with ms_timer() as timer:
                
                posterior_p=self.get_posterior_probabilities(training_data, verbose=True)
            if timing:    
                print('Posterior probabilities computed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
            # END_REMOVE
            
            return P_c, posterior_p

Now we get to train the classifier. 

Print out the prior probabilities and the posterior probabilities and answer the following questions:

a) What is the lowest posterior probability that you observe and why?

> (rhône, bern)    0.07

> (lake, bern)     0.07

> (limmat, bern)   0.07

> (flow, bern)     0.07

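> (geneva, bern)   0.07

> These tokens never occur in class bern, so each of them gets (1+0)/(5+9) = 1/14 ≈ 0.07. Class bern has the most tokens (5) and therefore the largest denominator.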

b) What is the highest posterior probability that you observe and why?

> (bear, bern) with a value of 0.21, because 'bear' occurs twice among the 5 tokens of class bern: (1+2)/(5+9) = 3/14 ≈ 0.21

c) Why are the prior probabilities all 1/3?

> Because the training data contains exactly one document for each of the three classes (zurich, bern, geneva)
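
The priors can also be recomputed directly from the label list. The following is a minimal sketch using only the standard library; storing the priors in a dictionary keyed by class label (rather than in a list such as `P_c`) is an assumption of this sketch, not what the lab code does.

```python
from collections import Counter

training_labels = ["zurich", "bern", "geneva"]

# Prior of a class = number of documents with that label / number of documents
label_counts = Counter(training_labels)
priors = {label: count / len(training_labels)
          for label, count in label_counts.items()}

for label in sorted(priors):
    print(label, round(priors[label], 2))
```

Keying the priors by label avoids having to keep a list of priors aligned with a separately built list of classes.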

In [None]:
nb=naive_bayes()
P_c, posterior_p=nb.train(training_data, timing=True) 

# BEGIN_REMOVE
print ('Prior probabilities:')
print ([str(round(x, 2)) for x in P_c])

print ('Posterior probabilities:')
df=pd.DataFrame()
df['(token, class)']=[x for x in posterior_p.keys()]
df['post_p']=list(map(lambda x:round(x, 2), posterior_p.values()))
print (df.to_string())
print ('_'*30)
print ('Sorted in descending order:')
print (df.sort_values(by='post_p', ascending=False).to_string())
# END_REMOVE

Priors probabilities computed in 0.75 ms
______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior 

And we get to do the classifying.

In [None]:
class naive_bayes(naive_bayes):
    
    
    def classify_document (self, training_data, test_document, verbose=False):

            
            from functools import reduce
            import math
            from nltk.tokenize import word_tokenize

            # Distinct classes in sorted order, so the class order is reproducible
            classes = sorted(set(training_data['labels']))
            
            
            P_c, posterior_p=self.train(training_data)

            NB=dict()
            
            normalized_test_document = normalize(test_document, keep_inflected=False)
            
            if verbose:
                print ('_'*30)
                print ('Test doc: ', test_document)
                print ('Normalized test doc: ', normalized_test_document)

            for index, c in enumerate(classes):
                
                print ('Class: ', c)

                posterior_logsum=0
                
                for token in normalized_test_document:
                    
                    if verbose:
                        print ('Token: ', token)
                        
                    try:
                        posterior_logsum=posterior_logsum+math.log(posterior_p[token, c], 10)
                        
                        if verbose:
                            print ('Posterior ('+token+', '+c+'): '+str(posterior_p[token, c]))
                            print ('Posterior logsum: ', posterior_logsum)
                            
                        
                    except KeyError:
                        print ('Token not in training ')
                    
                if posterior_logsum==0:
                        print('Classification failure: insufficient info')
            
                NB[c]=round(posterior_logsum+math.log(P_c[index], 10), 2)
                
                if verbose:
                    print ('Class: ', c)
                    print ('NB: ', NB[c])
                
            

            
            return max(NB, key=NB.get), NB 
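
The scoring rule in `classify_document` adds logarithms instead of multiplying probabilities, which avoids numerical underflow when a document contains many tokens. A minimal standalone sketch of that rule is given below, using the smoothed values for the token 'bear' from this toy dataset. The function name `log_score` and the dictionary layout are assumptions made for illustration.

```python
import math

def log_score(tokens, c, prior, likelihood):
    # Hypothetical helper: log10 prior plus the log10 of every known
    # (token, class) probability; unseen tokens are skipped, as in the lab code.
    score = math.log10(prior[c])
    for token in tokens:
        if (token, c) in likelihood:
            score += math.log10(likelihood[(token, c)])
    return score

prior = {'bern': 1/3, 'geneva': 1/3, 'zurich': 1/3}

# Smoothed values (1 + count) / (class tokens + vocabulary size) for 'bear'
likelihood = {('bear', 'bern'): 3/14,
              ('bear', 'geneva'): 1/13,
              ('bear', 'zurich'): 1/12}

tokens = ['name', 'city', 'come', 'word', 'bear']
scores = {c: round(log_score(tokens, c, prior, likelihood), 2) for c in prior}
best = max(scores, key=scores.get)
print(best, scores)
```

The class with the highest (least negative) log score wins, which is equivalent to picking the largest product of prior and token probabilities.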
            



Test your classifier with the test document *The name of the city comes from the word 'bear'.* What goes wrong? Can you fix it?

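> The single quotation marks around 'bear' get in the way: the token is not recognized as the plain word bear, so it is not matched against the training vocabulary. Removing the quotation marks from the test document fixes this, and the document is then classified as bern.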


In [None]:
# BEGIN_REMOVE

import logging
nb=naive_bayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

print (test_corpus)

with ms_timer() as timer:
    result, NB = nb.classify_document(training_data, test_corpus, verbose=True)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print (result)   
print (NB)

# END_REMOVE



The name of the city comes from the word 'bear'
______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Pos

In [None]:
# BEGIN_REMOVE

import logging
nb=naive_bayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

test_corpus=test_corpus.replace('\'', '')

print (test_corpus)

with ms_timer() as timer:
    result = nb.classify_document(training_data, test_corpus, verbose=True)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print (result)    

# END_REMOVE

The name of the city comes from the word bear
______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Poste



______________________________
Test doc:  The name of the city comes from the word bear
Normalized test doc:  ['name', 'city', 'come', 'word', 'bear']
Class:  bern
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Token:  bear
Posterior (bear, bern): 0.21428571428571427
Posterior logsum:  -0.6690067809585756
Class:  bern
NB:  -1.15
Class:  geneva
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Token:  bear
Posterior (bear, geneva): 0.07692307692307693
Posterior logsum:  -1.1139433523068367
Class:  geneva
NB:  -1.59
Class:  zurich
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Token:  bear
Posterior (bear, zurich): 0.08333333333333333
Posterior logsum:  -1.0791812460476247
Class:  zurich
NB:  -1.56
('bern

Can you explain the performance of your classifier on the following test corpus?

> The problem is the negated sentence "There is no lake". MNB only counts word occurrences and ignores their deeper meaning, so it cannot take the negation 'no' into account. In addition, 'no' is a stop word that __normalize__ removes by default.
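
The effect can be illustrated with a bag-of-words sketch. The stop-word list below is a small assumed set for illustration only (the lab uses NLTK's English stop-word list), and `bag_of_words` is a hypothetical helper that is much simpler than __normalize__.

```python
from collections import Counter

# Assumed toy stop-word list for illustration only; the lab uses NLTK's
# English stop-word list instead.
toy_stop_words = {'there', 'is', 'no', 'a'}

def bag_of_words(sentence):
    # Lowercase, strip the final period, split on whitespace, drop stop words
    tokens = sentence.lower().rstrip('.').split()
    return Counter(t for t in tokens if t not in toy_stop_words)

negated = bag_of_words('There is no lake.')
affirmative = bag_of_words('There is a lake.')
print(negated, affirmative, negated == affirmative)
```

Both sentences end up with the same bag of words, so a classifier that only sees token counts cannot tell them apart.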

In [None]:
test_corpus = ['We saw the bears there.', 
               'We crossed the Rhône.', 
               'There is no lake.',
              ]
test_labels = ['bern',
               'geneva',
               'bern',
              ]

nb=naive_bayes() 



for item in test_corpus:
    print ('\n Classifying: '+item)
    with ms_timer() as timer:
        result = nb.classify_document(training_data, item)
    logging.warning('Classification of \"'+item+'\" completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
    print (result)                                      
    print ('correct label: '+test_labels[test_corpus.index(item)])


 Classifying: We saw the bears there.
______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token rhône appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Pos

Now test your classifier with the one-sentence document "The federal capital is pretty." What happens?

> None of the words in the test document appears in the training data, which is why the sentence cannot be classified.
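
When every token is unseen, the log sum stays at 0 and each class score reduces to the log of its prior. With equal priors all scores tie, and `max` simply returns the first key it encounters. A sketch of an explicit tie check follows; this check is an assumption for illustration and is not part of the lab code.

```python
import math

# With no known token, each class score is just the log10 of its prior
prior = {'bern': 1/3, 'geneva': 1/3, 'zurich': 1/3}
scores = {c: round(math.log10(p), 2) for c, p in prior.items()}

best_score = max(scores.values())
tied = [c for c, s in scores.items() if s == best_score]

if len(tied) > 1:
    print('No decision: tie between', tied)
else:
    print('Predicted class:', tied[0])
```

Reporting the tie makes it explicit that the returned label carries no information about the document.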

In [None]:
# BEGIN_REMOVE
test_corpus = "The federal capital is pretty."
test_labels = "bern"

# Your classifier fails because your test document contains only previously unseen words. 
print (nb.classify_document(training_data, test_corpus))
# END_REMOVE

______________________________
Token river appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
______________________________
Token limmat appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token flow appears 0 times in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 0/5=0.0
Posterior with Laplace smoothing: 1/14=0.07
______________________________
Token near appears once in class bern
There are 5 tokens in class bern
['bear', 'bear', 'pit', 'near', 'river']
Vocab size: 9
Posterior without Laplace smoothing: 1/5=0.2
Posterior with Laplace smoothing: 2/14=0.14
_______

In [None]:
test_corpus = ['It’s surrounded by three different lakes.', 
               'You can hike to the top from Goldau.', 
               'It’s across the lake from Rigi.',
               'Meteoschweiz has a radar at the top.',
              ]
test_labels = ['rigi',
               'rigi',
               'pilatus',
               'dole',
              ]

predicted_labels = ['pilatus',
               'rigi',
               'pilatus',
               'dole',
              ]

from sklearn.metrics import classification_report

# Fix the order in which the classes appear in the report
target_names = ['rigi', 'pilatus', 'dole']
print(classification_report(test_labels, predicted_labels, labels = target_names, target_names=target_names))

              precision    recall  f1-score   support

        rigi       1.00      0.50      0.67         2
     pilatus       0.50      1.00      0.67         1
        dole       1.00      1.00      1.00         1

    accuracy                           0.75         4
   macro avg       0.83      0.83      0.78         4
weighted avg       0.88      0.75      0.75         4
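
The per-class figures in the report can be reproduced by counting true positives, false positives and false negatives. The following is a minimal sketch with the standard library only; the function name `precision_recall_f1` is hypothetical and is not a scikit-learn function.

```python
test_labels = ['rigi', 'rigi', 'pilatus', 'dole']
predicted_labels = ['pilatus', 'rigi', 'pilatus', 'dole']

def precision_recall_f1(label, y_true, y_pred):
    # Hypothetical helper: one-vs-rest counts for a single class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

for label in ['rigi', 'pilatus', 'dole']:
    p, r, f = precision_recall_f1(label, test_labels, predicted_labels)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

For example, 'rigi' is predicted once and that prediction is correct (precision 1.00), but only one of the two true 'rigi' documents is found (recall 0.50).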

