# AnTeDe: Text Classification - Part A

## Session goal
The goal of this session is to implement a Multinomial Naive Bayes classifier from scratch.

## Data collection
We are going to use a small toy dataset. Each document is a single sentence. The training data contains three documents, each from a different class.

In [85]:
import pandas as pd
import nltk

# these 4 downloads are here for compatibility purposes
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
#

training_corpus=["The Limmat flows out of the lake.", 
           "The bears are in the bear pit near the river.",
           "The Rhône flows out of Lake Geneva.",
          ]
training_labels=["zurich", 
         "bern",
         "geneva",
        ]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


We are also going to need a helper function that can normalize a string.

In [86]:
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def normalize(document, keep_punctuation=False, keep_stop_words=False,
              keep_inflected=True, keep_numbers=False):
    """
    This function normalizes the input document by tokenizing, lowercasing,
    removing punctuation and stopwords, and lemmatizing words.

    :param document: str, the input document to be normalized
    :param keep_punctuation: bool, whether to keep punctuation in the output
    :param keep_stop_words: bool, whether to keep stopwords in the output
    :param keep_inflected: bool, whether to keep inflected forms of words or lemmatize them
    :param keep_numbers: bool, whether to keep numbers in the output
    :return: list, the normalized tokens of the input document
    """
    word_tokens = word_tokenize(document)
    wl = WordNetLemmatizer()
    lemmatize = lambda tokens: [wl.lemmatize(w) for w in tokens]

    stop_words = set(stopwords.words('english'))
    normalized = [w.lower() for w in word_tokens
                  if ((not w.lower() in set(string.punctuation)) or keep_punctuation)
                  and ((not w.lower() in stop_words) or keep_stop_words)
                  and ((w.lower().isalnum()) or keep_punctuation)
                  and (not (w.lower().isdigit()) or keep_numbers)]

    if not keep_inflected:
        normalized = lemmatize(normalized)

    return normalized


How does *keep_inflected* affect the output of __normalize__?





In [87]:
# BEGIN_REMOVE
normalized_training_corpus = [normalize(item, keep_inflected=False) for item in training_corpus]    
inflected_training_corpus = [normalize(item, keep_inflected=True) for item in training_corpus] 

df=pd.DataFrame(columns=['original', 'normalized', 'inflected'])
df['original']=training_corpus
df['normalized']=normalized_training_corpus
df['inflected']=inflected_training_corpus
print (df.to_string())
# keep_inflected=True keeps inflected forms such as 'flows' and 'bears';
# keep_inflected=False lemmatizes them to 'flow' and 'bear'
# END_REMOVE

                                        original                      normalized                        inflected
0              The Limmat flows out of the lake.            [limmat, flow, lake]            [limmat, flows, lake]
1  The bears are in the bear pit near the river.  [bear, bear, pit, near, river]  [bears, bear, pit, near, river]
2            The Rhône flows out of Lake Geneva.     [rhône, flow, lake, geneva]     [rhône, flows, lake, geneva]




Now, we need to define a __get_vocabulary__ function that gets us all the unique words that appear in the normalized documents.

In [88]:
def get_vocabulary (data):
    return list(set(sum(data,[])))    
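
Flattening with `sum(data, [])` builds a new list at every step, which is fine for three sentences but gets slow on larger corpora. The following is a minimal alternative sketch, assuming the same list-of-token-lists input. The name `get_vocabulary_sorted` is hypothetical and not part of this notebook. Sorting also makes the order reproducible, since the iteration order of a set of strings can differ between Python sessions.

```python
from itertools import chain


def get_vocabulary_sorted(data):
    # Hypothetical variant of get_vocabulary: flatten the token lists lazily,
    # de-duplicate with a set, and sort so the order is reproducible.
    return sorted(set(chain.from_iterable(data)))


docs = [["limmat", "flow", "lake"], ["rhône", "flow", "lake", "geneva"]]
vocab = get_vocabulary_sorted(docs)
print(vocab)
```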

Print the vocabulary.

In [89]:
# BEGIN_REMOVE
print(get_vocabulary(normalized_training_corpus)) 
# END_REMOVE
print ('Tokens in vocab: ', str(len(get_vocabulary(normalized_training_corpus))))

['river', 'pit', 'lake', 'flow', 'geneva', 'rhône', 'bear', 'limmat', 'near']
Tokens in vocab:  9


We define a class __ms_timer__ that helps us time snippets of code. It defines the two special methods that Python requires of a context manager: one runs when the block is entered, the other when it is left.

Each code snippet that we wish to time is placed in an indented block following a __with__ statement. Once the block has finished, the method __get_elapsed_time__ returns the run time of the snippet in milliseconds.

(You can do this same thing effortlessly in an IDE with profiling, but this is a good way to do it in a Jupyter notebook.)

In [90]:
import time


class ms_timer:
    
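    def __enter__(self):
        # perf_counter is a monotonic, high-resolution clock meant for timing short intervals
        self.start = time.perf_counter()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop = time.perf_counter()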
    
    def get_elapsed_time(self):
        return 1000 * (self.stop - self.start)


Here's an example of how to time code snippets using the context manager trick.

In [91]:
my_data = range(1, 10)

with ms_timer() as timer:
    prod=1
    for item in my_data:
        prod=prod*item
print ("Elapsed time for the loop: "+str(round(timer.get_elapsed_time(), 4))+" ms")   

Elapsed time for the loop: 0.0145 ms
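
To make the order of events in a __with__ block explicit, here is a small sketch. The class `TracingContext` is a hypothetical helper that only records which step ran when; it is not used elsewhere in this notebook.

```python
class TracingContext:
    # Hypothetical helper: records the order in which the protocol methods run.
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this is the object bound by "as"

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.events.append("exit")
        return False  # do not swallow exceptions raised in the block


with TracingContext() as ctx:
    ctx.events.append("body")

print(ctx.events)
```

The exit method runs even if the block raises, which is why a timer written this way always gets a stop time.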


## MNB from scratch

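We are now ready to implement our MNB from scratch. Our implementation is contained in a class called __NaiveBayes__. We can spread the class definition across multiple cells: each later cell defines a class with the same name that inherits from the previous definition.

First we compute the per-class token probabilities P(w | c). The code calls them posterior probabilities; strictly speaking, they are the class-conditional likelihoods that Naive Bayes combines with the class priors.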

In [92]:
import re


class NaiveBayes:
    @staticmethod
    def get_posterior_probabilities(training_data, verbose=True):
        """
        This function calculates the posterior probabilities for each word in the vocabulary for each class in the
        training set.

        :param training_data: dict, contains the training documents and their labels
        :param verbose: bool, whether to print debugging information
        :return: dict, the posterior probabilities for each word in the vocabulary for each class in the training set
        """
        posterior = {}
        vocabulary = get_vocabulary(training_data['documents'])
        lw = len(vocabulary)
        classes = list(set(training_data['labels']))

        for index, c in enumerate(classes):
            tokens = sum(training_data['documents'][training_data['labels'] == c], [])
            # Total number of tokens in the documents of class c
            den = len(tokens)

            current_class_docs = tokens

            for w in vocabulary:
                num = current_class_docs.count(w)
                posterior[(w, c)] = (1 + num) / (den + lw)

                if verbose:
                    print('_' * 30)
                    message = 'Token ' + w + ' appears ' + str(num) + ' times in class ' + c
                    message = re.sub('1 times', 'once', message)
                    print(message)

                    message = 'There are ' + str(den) + ' tokens in class ' + c
                    message = re.sub('are 1 tokens', 'is 1 token', message)
                    print(message)

                    print(current_class_docs)
                    print('Vocab size: ' + str(lw))
                    print('Posterior without Laplace smoothing: ' + str(num) + '/' + str(den) + '=' + str(round(num / den, 2)))
                    print('Posterior with Laplace smoothing: ' + str(1 + num) + '/' + str(den + lw) + '=' + str(round(posterior[(w, c)], 2)))

        return posterior
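
As a worked example of the smoothing used above, the sketch below recomputes two values for the class 'bern' by hand, using (1 + count of the token in the class) / (tokens in the class + vocabulary size). The token list and the vocabulary size are copied from the outputs shown earlier; the helper name `smoothed` is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Lemmatized tokens of the single 'bern' training document, as printed earlier
bern_tokens = ["bear", "bear", "pit", "near", "river"]
vocabulary_size = 9  # number of distinct tokens across all three documents

counts = Counter(bern_tokens)


def smoothed(token):
    # (1 + count of token in class) / (tokens in class + vocabulary size)
    return Fraction(1 + counts[token], len(bern_tokens) + vocabulary_size)


p_bear = smoothed("bear")      # seen twice in 'bern'
p_limmat = smoothed("limmat")  # never seen in 'bern'
print(p_bear, round(float(p_bear), 2))
print(p_limmat, round(float(p_limmat), 2))
```

Without the added 1, 'limmat' would get probability 0 in 'bern', and any document containing it could never be assigned to that class.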


What is the posterior probability of finding 'limmat' given that the document is tagged as 'zurich'? Complete the following code snippet to find out. Use *verbose* to see what's going on under the hood.

In [93]:
# The method get_posterior_probabilities expects the training data in the form of a data frame
training_data = pd.DataFrame(columns=['documents', 'labels'])
training_data['documents']=normalized_training_corpus
training_data['labels']=training_labels

# BEGIN_REMOVE
posterior=NaiveBayes.get_posterior_probabilities(training_data, verbose=False)
print ("%.2f"%posterior['limmat', 'zurich'])
# END_REMOVE

0.17


Complete the following code so we can train the classifier.

In [94]:
class NaiveBayes(NaiveBayes):
    
    def train(self, training_data, timing=False):
            
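        # One prior per distinct class (not per document), built the same way
        # as the class list in classify_document so that the indices line up
        classes = list(set(training_data['labels']))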

        # BEGIN_REMOVE
        with ms_timer() as timer:
            P_c = [(training_data['labels'] == tagged_class).sum() / len(training_data) 
                   for tagged_class in classes]
        if timing:
            print('Priors probabilities computed in ' + "%.2f" % timer.get_elapsed_time() + " ms")
        
        with ms_timer() as timer:
            posterior_p = self.get_posterior_probabilities(training_data, verbose=False)
        if timing:
            print('Posterior probabilities computed in ' + "%.2f" % timer.get_elapsed_time() + " ms")    
        # END_REMOVE
        
        return P_c, posterior_p
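
A prior is the share of training documents that belong to a class. The sketch below illustrates this with a hypothetical helper, `class_priors`, which returns a dictionary keyed by class name; that is a design choice made for this illustration and not how the notebook stores its priors. The second label list is invented to show what happens when the classes are not balanced.

```python
from collections import Counter


def class_priors(labels):
    # Hypothetical helper: relative frequency of each class among the documents
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


balanced = class_priors(["zurich", "bern", "geneva"])
skewed = class_priors(["zurich", "zurich", "bern", "geneva"])
print(balanced)
print(skewed)
```

Keying the priors by class name avoids having to keep two lists in the same order.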


Now we get to train the classifier. 

Print out the prior probabilities and the posterior probabilities and answer the following questions:

a) What is the lowest posterior probability that you observe and why?

b) What is the highest posterior probability that you observe and why?

c) Why are the prior probabilities all 1/3?

In [95]:
nb=NaiveBayes()
P_c, posterior_p=nb.train(training_data, timing=True) 

# BEGIN_REMOVE
print ('Prior probabilities:')
print ([str(round(x, 2)) for x in P_c])

print ('Posterior probabilities:')
df=pd.DataFrame()
df['(token, class)']=[x for x in posterior_p.keys()]
df['post_p']=list(map(lambda x:round(x, 2), posterior_p.values()))
print (df.to_string())
print ('_'*30)
print ('Sorted in descending order:')
print (df.sort_values(by='post_p', ascending=False).to_string())
# END_REMOVE

Priors probabilities computed in 2.37 ms
Posterior probabilities computed in 2.02 ms
Prior probabilities:
['0.33', '0.33', '0.33']
Posterior probabilities:
      (token, class)  post_p
0      (river, bern)    0.14
1        (pit, bern)    0.14
2       (lake, bern)    0.07
3       (flow, bern)    0.07
4     (geneva, bern)    0.07
5      (rhône, bern)    0.07
6       (bear, bern)    0.21
7     (limmat, bern)    0.07
8       (near, bern)    0.14
9    (river, zurich)    0.08
10     (pit, zurich)    0.08
11    (lake, zurich)    0.17
12    (flow, zurich)    0.17
13  (geneva, zurich)    0.08
14   (rhône, zurich)    0.08
15    (bear, zurich)    0.08
16  (limmat, zurich)    0.17
17    (near, zurich)    0.08
18   (river, geneva)    0.08
19     (pit, geneva)    0.08
20    (lake, geneva)    0.15
21    (flow, geneva)    0.15
22  (geneva, geneva)    0.15
23   (rhône, geneva)    0.15
24    (bear, geneva)    0.08
25  (limmat, geneva)    0.08
26    (near, geneva)    0.08
______________________________
Sorted in descending order:
[...]

Next, we implement the classification step.

In [96]:
class NaiveBayes(NaiveBayes):
    
    def classify_document(self, training_data, test_document, verbose=False):
        import math

        classes = list(set(training_data['labels']))

        P_c, posterior_p = self.train(training_data)

        NB = dict()

        normalized_test_document = normalize(test_document, keep_inflected=False)

        if verbose:
            print('_' * 30)
            print('Test doc: ', test_document)
            print('Normalized test doc: ', normalized_test_document)

        for index, c in enumerate(classes):

            if verbose:
                print('Class: ', c)

            posterior_logsum = 0

            for token in normalized_test_document:

                if verbose:
                    print('Token: ', token)

                try:
                    posterior_logsum = posterior_logsum + math.log(posterior_p[token, c], 10)

                    if verbose:
                        print('Posterior (' + token + ', ' + c + '): ' + str(posterior_p[token, c]))
                        print('Posterior logsum: ', posterior_logsum)


                except KeyError:
                    # the token does not occur in the training vocabulary
                    if verbose:
                        print('Token not in training ')

            if posterior_logsum == 0:
                print('Classification failure: insufficient info')

            NB[c] = round(posterior_logsum + math.log(P_c[index], 10), 2)

            if verbose:
                print('Class: ', c)
                print('NB: ', NB[c])

        return max(NB, key=NB.get), NB
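
The classifier adds logarithms instead of multiplying probabilities. The sketch below shows why this matters for longer documents: a product of many small numbers underflows to zero in floating point, while the sum of their logarithms stays usable. The probability value and the document length are invented for illustration.

```python
import math

probabilities = [1e-5] * 100  # 100 tokens, each with a small probability

product = 1.0
for p in probabilities:
    product *= p  # underflows to 0.0 long before the loop ends

log_sum = sum(math.log10(p) for p in probabilities)

print(product)
print(log_sum)
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product.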


Let's test our classifier with a simple sentence and see what happens.

In [97]:
# BEGIN_REMOVE

import logging
nb=NaiveBayes()
test_corpus = "On my way to the lake, I fell asleep near the Limmat."
test_labels = "zurich"

print (test_corpus)

with ms_timer() as timer:
    result, NB = nb.classify_document(training_data, test_corpus, verbose=True)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print ('Classification result: ', result)   
print (NB)

# END_REMOVE



On my way to the lake, I fell asleep near the Limmat.
______________________________
Test doc:  On my way to the lake, I fell asleep near the Limmat.
Normalized test doc:  ['way', 'lake', 'fell', 'asleep', 'near', 'limmat']
Class:  bern
Token:  way
Token not in training 
Token:  lake
Posterior (lake, bern): 0.07142857142857142
Posterior logsum:  -1.146128035678238
Token:  fell
Token not in training 
Token:  asleep
Token not in training 
Token:  near
Posterior (near, bern): 0.14285714285714285
Posterior logsum:  -1.9912260756924947
Token:  limmat
Posterior (limmat, bern): 0.07142857142857142
Posterior logsum:  -3.1373541113707324
Class:  bern
NB:  -3.61
Class:  zurich
Token:  way
Token not in training 
Token:  lake
Posterior (lake, zurich): 0.16666666666666666
Posterior logsum:  -0.7781512503836435
Token:  fell
Token not in training 
Token:  asleep
Token not in training 
Token:  near
Posterior (near, zurich): 0.08333333333333333
Posterior logsum:  -1.857332496431268
Token:  limmat
[...]

Test your classifier with the test document *The name of the city comes from the word 'bear'.* What goes wrong? Can you fix it?

In [98]:
# BEGIN_REMOVE

import logging
nb=NaiveBayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

print (test_corpus)

with ms_timer() as timer:
    result, NB = nb.classify_document(training_data, test_corpus, verbose=True)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print (result)   
print (NB)

# END_REMOVE

The name of the city comes from the word 'bear'




______________________________
Test doc:  The name of the city comes from the word 'bear'
Normalized test doc:  ['name', 'city', 'come', 'word']
Class:  bern
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Classification failure: insufficient info
Class:  bern
NB:  -0.48
Class:  zurich
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Classification failure: insufficient info
Class:  zurich
NB:  -0.48
Class:  geneva
Token:  name
Token not in training 
Token:  city
Token not in training 
Token:  come
Token not in training 
Token:  word
Token not in training 
Classification failure: insufficient info
Class:  geneva
NB:  -0.48
bern
{'bern': -0.48, 'zurich': -0.48, 'geneva': -0.48}
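
The normalized test document above does not contain 'bear'. One plausible explanation, sketched below, is that the tokenizer leaves the opening quote attached to the word, so the token fails the `isalnum()` check in __normalize__. The token list is an assumption made for illustration, not output captured from this notebook.

```python
# Assumed tokenization of "'bear'": opening quote attached, closing quote split off
assumed_tokens = ["word", "'bear", "'"]

# normalize only keeps tokens that are purely alphanumeric
kept = [t for t in assumed_tokens if t.isalnum()]
print(kept)

# Removing the quotes first lets the word survive the filter
cleaned = "'bear'".replace("'", "")
print(cleaned, cleaned.isalnum())
```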


In [99]:
# BEGIN_REMOVE

import logging
nb=NaiveBayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

test_corpus=test_corpus.replace('\'', '')

print (test_corpus)

with ms_timer() as timer:
    result = nb.classify_document(training_data, test_corpus, verbose=False)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print ('Classification result: ', result)  

# END_REMOVE

The name of the city comes from the word bear




Classification result:  ('bern', {'bern': -1.15, 'zurich': -1.56, 'geneva': -1.59})


Can you explain the performance of your classifier on the following test corpus?

In [100]:
test_corpus = ['We saw the bears there.', 
               'We crossed the Rhône.', 
               'There is no lake.',
              ]
test_labels = ['bern',
               'geneva',
               'bern',
              ]

nb=NaiveBayes() 



for item in test_corpus:
    print ('\n Classifying: '+item)
    with ms_timer() as timer:
        result = nb.classify_document(training_data, item)
    logging.warning('Classification of \"'+item+'\" completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
    print (result)                                      
    print ('correct label: '+test_labels[test_corpus.index(item)])




 Classifying: We saw the bears there.
('bern', {'bern': -1.15, 'zurich': -1.56, 'geneva': -1.59})
correct label: bern

 Classifying: We crossed the Rhône.




('geneva', {'bern': -1.62, 'zurich': -1.56, 'geneva': -1.29})
correct label: geneva

 Classifying: There is no lake.
('zurich', {'bern': -1.62, 'zurich': -1.26, 'geneva': -1.29})
correct label: bern
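
The last result can be checked by hand. After normalization, 'There is no lake.' is left with the single token 'lake', which occurs in the 'zurich' and 'geneva' training documents but not in the 'bern' one; the negation is invisible to a bag-of-words model. The sketch below recomputes the three scores from the smoothed probabilities in the table above.

```python
import math

# Smoothed P('lake' | class) from the table above, as (numerator, denominator)
p_lake = {"bern": (1, 14), "zurich": (2, 12), "geneva": (2, 13)}
prior = 1 / 3  # each class has one of the three training documents

scores = {c: round(math.log10(num / den) + math.log10(prior), 2)
          for c, (num, den) in p_lake.items()}
print(scores)
print(max(scores, key=scores.get))
```

'zurich' beats 'geneva' only because its training document has fewer tokens, which makes the denominator smaller.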


Now test your classifier with the one-sentence document "The federal capital is pretty." What happens?

In [101]:
# BEGIN_REMOVE
test_corpus = "The federal capital is pretty."
test_labels = "bern"

# Your classifier fails because none of the tokens in the test document
# appear in the training vocabulary, so only the class priors contribute.
print (nb.classify_document(training_data, test_corpus))
# END_REMOVE

Classification failure: insufficient info
Classification failure: insufficient info
Classification failure: insufficient info
('bern', {'bern': -0.48, 'zurich': -0.48, 'geneva': -0.48})
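
When no token is known, every class is left with only its log prior, so all scores tie. The sketch below shows that `max` then simply returns the first key it encounters, which means the 'bern' answer above reflects iteration order rather than evidence. The explicit tie check is a suggestion, not part of the notebook's classifier.

```python
import math

# With no known tokens, every class scores only its log prior
scores = {c: round(math.log10(1 / 3), 2) for c in ["bern", "zurich", "geneva"]}
print(scores)

winner = max(scores, key=scores.get)  # max returns the first maximal key it encounters
best = max(scores.values())
tied = [c for c, s in scores.items() if s == best]
print(winner, tied)
```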


In [102]:
test_corpus = "There is no lake."
test_labels = "zurich"

print (nb.classify_document(training_data, test_corpus))

('zurich', {'bern': -1.62, 'zurich': -1.26, 'geneva': -1.29})


MEP question: evaluate the following predictions with a classification report.

In [103]:
test_corpus = ['It’s surrounded by three different lakes.', 
               'You can hike to the top from Goldau.', 
               'It’s across the lake from Rigi.',
               'Meteoschweiz has a radar at the top.',
              ]
test_labels = ['rigi',
               'rigi',
               'pilatus',
               'dole',
              ]

predicted_labels = ['pilatus',
               'rigi',
               'pilatus',
               'dole',
              ]

from sklearn.metrics import classification_report

# Fix the row order of the report explicitly
target_names = ['rigi', 'pilatus', 'dole']
print(classification_report(test_labels, predicted_labels, labels = target_names, target_names=target_names))

              precision    recall  f1-score   support

        rigi       1.00      0.50      0.67         2
     pilatus       0.50      1.00      0.67         1
        dole       1.00      1.00      1.00         1

    accuracy                           0.75         4
   macro avg       0.83      0.83      0.78         4
weighted avg       0.88      0.75      0.75         4
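
The per-class rows of the report can be reproduced by counting true positives, false positives and false negatives. The following is a minimal sketch in plain Python; the helper `precision_recall` is hypothetical and assumes every class is predicted at least once, as is the case here.

```python
test_labels = ["rigi", "rigi", "pilatus", "dole"]
predicted_labels = ["pilatus", "rigi", "pilatus", "dole"]


def precision_recall(label):
    # Hypothetical helper; divides by zero if a class is never predicted or never occurs
    pairs = list(zip(test_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp / (tp + fp), tp / (tp + fn)


for label in ["rigi", "pilatus", "dole"]:
    precision, recall = precision_recall(label)
    f1 = 2 * precision * recall / (precision + recall)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

The one 'rigi' document predicted as 'pilatus' lowers the recall of 'rigi' and the precision of 'pilatus' at the same time.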
