The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [2]:
import io, sys, math, re
from collections import defaultdict

The next function loads the data. Each line of the data consists of a label (identifying a language) followed by some text written in that language. Here is an example line:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```
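
As a minimal sketch of how one such line breaks down (the variable names here are illustrative and not part of the lab code), splitting on whitespace gives the label as the first token and the words as the remaining tokens:

```python
# One line in the format shown above: label first, then the text.
line = "__label__de Zur Namensdeutung gibt es mehrere Varianten."

tokens = line.split()                 # split on any run of whitespace
label, words = tokens[0], tokens[1:]  # first token is the label

print(label)  # __label__de
print(words)  # ['Zur', 'Namensdeutung', 'gibt', 'es', 'mehrere', 'Varianten.']
```

Note that punctuation stays attached to the neighbouring word (`'Varianten.'`), since no tokenization beyond whitespace splitting is applied.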


In [3]:
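def load_data(filename):
    data = []
    # The context manager closes the file once all lines are read.
    with io.open(filename, 'r', encoding='utf-8') as fin:
        for line in fin:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines, which carry no label
            # First token is the label, the rest is the sentence.
            data.append((tokens[0], tokens[1:]))
    return data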

You can now load the first dataset, `train1.txt`, and inspect one of its examples.

In [4]:
data = load_data("/content/drive/MyDrive/NLP/Week1/session1/train1.txt")
print(data[0])

('__label__deu', ['Ich', 'würde', 'alles', 'tun,', 'um', 'dich', 'zu', 'beschützen.'])


Next, we will start implementing the Naive Bayes method. This technique is based on word counts, so we first need a function that counts the words and labels of our training set. It returns the following quantities:

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label
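
To make these four quantities concrete, here is a small sketch on two made-up examples. It uses `collections.Counter` for brevity, which is an assumption of this sketch rather than the structure the lab's function returns:

```python
from collections import Counter, defaultdict

# Two made-up training examples in the (label, list_of_words) format.
toy_data = [
    ("__label__eng", ["the", "cat", "the", "dog"]),
    ("__label__fra", ["le", "chat"]),
]

n_examples = len(toy_data)
label_counts = Counter()
n_words_per_label = Counter()
word_counts = defaultdict(Counter)

for label, sentence in toy_data:
    label_counts[label] += 1                   # examples per label
    n_words_per_label[label] += len(sentence)  # word tokens per label
    word_counts[label].update(sentence)        # per-label word frequencies

print(n_examples)                          # 2
print(label_counts["__label__eng"])        # 1
print(n_words_per_label["__label__eng"])   # 4
print(word_counts["__label__eng"]["the"])  # 2
```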

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
def count_words(data):
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    n_examples = len(data)

    for example in data:
        label, sentence = example
        ## FILL CODE
        label_counts[label] += 1               # one more example with this label
        for word in sentence:
            word_counts[label][word] += 1      # occurrences of this word under this label
            n_words_per_label[label] += 1      # total word tokens under this label

    return {'label_counts': label_counts,
            'word_counts': word_counts,
            'n_examples': n_examples,
            'n_words_per_label': n_words_per_label}

In [7]:
all_count = count_words(data)
all_count['label_counts']

defaultdict(<function __main__.count_words.<locals>.<lambda>>,
            {'__label__deu': 828,
             '__label__eng': 2137,
             '__label__epo': 1020,
             '__label__fra': 650,
             '__label__hun': 432,
             '__label__ita': 1327,
             '__label__por': 578,
             '__label__rus': 1271,
             '__label__spa': 564,
             '__label__tur': 1193})

In [8]:
all_count['n_examples']

10000

In [9]:
word_counts = all_count['word_counts']
# word_counts

In [10]:
word_counts.keys()

dict_keys(['__label__deu', '__label__hun', '__label__rus', '__label__ita', '__label__eng', '__label__spa', '__label__tur', '__label__epo', '__label__por', '__label__fra'])

In [11]:
all_count['n_words_per_label']

defaultdict(<function __main__.count_words.<locals>.<lambda>>,
            {'__label__deu': 6630,
             '__label__eng': 16444,
             '__label__epo': 7647,
             '__label__fra': 4718,
             '__label__hun': 2271,
             '__label__ita': 7759,
             '__label__por': 4044,
             '__label__rus': 7387,
             '__label__spa': 3927,
             '__label__tur': 6026})

Next, using the word and label counts from the previous function, we can implement the prediction function.

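Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words in the test example. The smoothed probability of word $j$ given label $k$ is

$p_{j,k} = \frac{\alpha + C(j,k)}{\alpha V + \sum_{j'}C(j',k)}$

where $C(j,k)$ is the number of times word $j$ appears with label $k$, $V$ is the vocabulary size (in the code below, the number of distinct words recorded for label $k$), and $\alpha = \mu$.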
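
As a worked example of the formula above, the following sketch evaluates the smoothed probability on made-up counts. The helper name `smoothed_prob` and all numbers are hypothetical and chosen only for illustration:

```python
def smoothed_prob(count, total, vocab_size, alpha):
    """Additive smoothing: (alpha + count) / (alpha * vocab_size + total)."""
    return (alpha + count) / (alpha * vocab_size + total)

# Made-up setting: one label with 4 word tokens over a 3-word vocabulary,
# with per-word counts 2, 1 and 1.
counts = [2, 1, 1]
total, vocab_size, alpha = sum(counts), len(counts), 1.0

p_seen = smoothed_prob(2, total, vocab_size, alpha)    # (1 + 2) / (3 + 4) = 3/7
p_unseen = smoothed_prob(0, total, vocab_size, alpha)  # (1 + 0) / (3 + 4) = 1/7

# Over the whole vocabulary, the smoothed probabilities still sum to 1.
p_sum = sum(smoothed_prob(c, total, vocab_size, alpha) for c in counts)

print(p_seen, p_unseen, p_sum)
```

With `alpha > 0`, a word with zero count still gets a nonzero probability, which keeps its logarithm finite.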

In [13]:
import numpy as np
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')

    for label in word_counts.keys():
        score = 0.0
        ## FILL CODE
        for word in sentence:
            # Caution: word_counts[label] is a defaultdict, so looking up an
            # unseen word inserts it with count 0 and grows the vocabulary.
            count = word_counts[label][word]      # C(word, label)
            total = n_words_per_label[label]      # total word tokens for this label
            vocab_size = len(word_counts[label])  # V: distinct words for this label

            # Log of the smoothed word probability. The class prior
            # (label_counts, n_examples) is not used in this score.
            score += np.log((count + mu) / (total + mu * vocab_size))

        if score > best_score:
            best_score = score
            best_label = label

    return best_score, best_label
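
The `predict` function above scores each label using word likelihoods only, and it reads counts by indexing a `defaultdict`. The following is a sketch of one possible variant, not the lab's reference solution: it adds the log prior $\log P(\text{label})$ and reads counts with `.get()` so that the count tables are never modified. The name `predict_with_prior`, the choice of $V$ as the number of distinct words across all labels, and the toy counts are all assumptions of this sketch.

```python
import math

def predict_with_prior(sentence, mu, label_counts, word_counts,
                       n_examples, n_words_per_label):
    """Hypothetical variant: adds log P(label) and never mutates the counts."""
    # Assumption: V is the number of distinct words across all labels.
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    vocab_size = len(vocab)

    best_label, best_score = None, float('-inf')
    for label in word_counts:
        # Log prior: fraction of training examples carrying this label.
        score = math.log(label_counts[label] / n_examples)
        denom = n_words_per_label[label] + mu * vocab_size
        for word in sentence:
            # .get() reads the count without inserting missing keys.
            score += math.log((word_counts[label].get(word, 0) + mu) / denom)
        if score > best_score:
            best_score, best_label = score, label
    return best_score, best_label

# Toy counts, made up for illustration.
toy = {
    'label_counts': {'eng': 1, 'fra': 1},
    'word_counts': {'eng': {'the': 2, 'cat': 1, 'dog': 1},
                    'fra': {'le': 1, 'chat': 1}},
    'n_examples': 2,
    'n_words_per_label': {'eng': 4, 'fra': 2},
}
score, label = predict_with_prior(['le', 'chat'], 1.0, **toy)
print(label)  # fra
```

This sketch requires `mu > 0`; with `mu = 0`, an unseen word would lead to `math.log(0)` and raise an error. Because it changes both the prior and the vocabulary size, its accuracy on real data may differ from that of the function above.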

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

$\text{accuracy} = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}(\hat{y}_i = y_i) = \frac{\text{correct predictions}}{\text{test examples}}$
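
As a small illustration of the accuracy definition, the sketch below counts matching predictions on four made-up examples. The lists `gold` and `pred` are hypothetical:

```python
# Made-up gold labels and predictions for four test examples.
gold = ["eng", "fra", "deu", "eng"]
pred = ["eng", "fra", "eng", "eng"]

# The indicator is 1 when the prediction equals the gold label, else 0.
n_correct = sum(1 for y, y_hat in zip(gold, pred) if y == y_hat)
accuracy = n_correct / len(gold)

print(accuracy)        # 0.75
print(accuracy * 100)  # 75.0
```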
          

In [14]:
def compute_accuracy(valid_data, mu, counts):
    n_correct = 0
    for label, sentence in valid_data:
        ## FILL CODE
        _, pred = predict(sentence, mu, counts['label_counts'], counts['word_counts'],
                          counts['n_examples'], counts['n_words_per_label'])
        if pred == label:
            n_correct += 1

    # Return the accuracy as a percentage.
    return (n_correct / len(valid_data)) * 100

In [15]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("/content/drive/MyDrive/NLP/Week1/session1/train1.txt")
valid_data = load_data("/content/drive/MyDrive/NLP/Week1/session1/valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 91.500

