<a href="https://colab.research.google.com/github/amaneth/Language-detection-using-logistic-regression/blob/main/language_detection_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal of this lab is to implement a language identifier (LID).

Our first model will be based on Naive Bayes.

In [1]:
import io, sys, math, re
from collections import defaultdict

The next function is used to load the data. Each line of the data consist of a label (corresponding to a language), followed by some text, written in that language. Here is an example of data:

```__label__de Zur Namensdeutung gibt es mehrere Varianten.```


In [2]:
def load_data(filename):
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    for line in fin:
        tokens = line.split()
        data.append((tokens[0], tokens[1:]))
    return data

You can now try loading the first dataset `train1.txt` and look what examples look like.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")
print(data[-100])

('__label__eo', ['Liaj', 'proponoj', 'estis', 'akceptitaj', 'en', 'la', 'kunsido.'])


In [None]:
len(data)

10000

Next, we will start implementing the Naive Bayes method. This technique is based on word counts, and we thus need to start by implementing a function to count the words and labels of our training set.

`n_examples` is the total number of examples

`n_words_per_label` is the total number of words for a given label

`label_counts` is the number of times a given label appears in the training data

`word_counts` is the number of times a word appears with a given label

In [5]:
def count_words(data):
    n_examples = 0
    n_words_per_label = defaultdict(lambda: 0)
    label_counts = defaultdict(lambda: 0)
    word_counts = defaultdict(lambda: defaultdict(lambda: 0.0))

    
    for example in data:
        label, sentence = example
        n_examples+=1
        label_counts[label]+=1
        for word in sentence:
          # print(word)
          word_counts[label][word]+=1
          n_words_per_label[label]+=1
        ## FILL CODE

    return {'label_counts': label_counts, 
            'word_counts': word_counts, 
            'n_examples': n_examples, 
            'n_words_per_label': n_words_per_label}

In [None]:
count=count_words(data)

In [None]:
word_counts=count['word_counts']

In [None]:
vocublary= [word  for label in word_counts.keys() for word in word_counts[label].keys() ]

In [None]:
len(vocublary)

26083

In [None]:
c=0
for label, counts in word_counts.items():
  c+=len(word_counts[label])


In [None]:
c

26083

Next, using the word and label counts from the previous function, we can implement the prediction function.

Here, `mu` is a regularization parameter (Laplace smoothing), and `sentence` is the list of words corresponding to the test example.

In [None]:
import numpy as np

In [6]:
def predict(sentence, mu, label_counts, word_counts, n_examples, n_words_per_label):
    best_label = None
    best_score = float('-inf')
    vocublary_size=0
    score_dict=defaultdict(lambda: 0)
    likelihood= defaultdict(lambda: defaultdict(lambda: 0.0))
    # print('aman')
    vocublary= [word  for label in word_counts.keys() for word in word_counts[label].keys() ]
    vocublary_size=len(vocublary)
    for label in word_counts.keys():
      for word,count in word_counts[label].items():
        for l in word_counts.keys():
          likelihood[word][l]=(count+mu)/(n_words_per_label[l]+(mu*vocublary_size))
    for word in sentence:
      
      for label in word_counts.keys():
        if word in vocublary:
          score_dict[label]+=likelihood[word][label]
          # print(np.log(likelihood[word][label]),label,word)
    # print(score_dict)
    for label, s in score_dict.items():
      # print(s)
      if s > best_score:
        best_label=label
        best_score=s
        # score = 0.0
        ## FILE CODE
    # return prior, likelihood
    return best_label

In [None]:
{label:score}

In [10]:
example=['Tom', 'se', "n'è", 'andato.']

In [11]:
best=predict(example, 10.0, count['label_counts'], count['word_counts'], count['n_examples'], count['n_words_per_label'])

NameError: ignored

In [None]:
best

'__label__en'

In [None]:
np.log(likelihood['Marie']['__label__de'])+ np.log(prior['__label__de'])

-10.939170330872367

In [None]:
likelihood['that']

defaultdict(<function __main__.predict.<locals>.<lambda>.<locals>.<lambda>>,
            {'__label__de': 0.04105571414436619,
             '__label__en': 0.016762201750562086,
             '__label__eo': 0.03569479734943565,
             '__label__es': 0.06833180534492801,
             '__label__fr': 0.0572092362841674,
             '__label__hu': 0.11523587271453618,
             '__label__it': 0.03518877847435552,
             '__label__pt': 0.06642169569923044,
             '__label__ru': 0.03692752768468745,
             '__label__tr': 0.04507639200782161})

The next function will be used to evaluate the Naive Bayes model on a validation set. It computes the accuracy for a particular regularization parameter `mu`.

In [7]:
def compute_accuracy(valid_data, mu, counts):
    accuracy = 0.0
    correct=0
    for label, sentence in valid_data:
        ## FILL CODE
        prediction= predict(sentence, mu, counts['label_counts'], counts['word_counts'], counts['n_examples'], counts['n_words_per_label'])
        if label==prediction:
          correct+=1
    accuracy= correct/len(valid_data)

     
    return accuracy

In [12]:
print("")
print("** Naive Bayes **")
print("")

mu = 1.0
train_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/train1.txt")
valid_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/valid1.txt")
counts = count_words(train_data)

print("Validation accuracy: %.3f" % compute_accuracy(valid_data, mu, counts))
print("")


** Naive Bayes **

Validation accuracy: 0.046



In [None]:
valid_data = load_data("/content/drive/MyDrive/NLP_Week_1_Labs_2022/session1/valid1.txt")

In [None]:
valid_data[50]

('__label__it', ['Tom', 'se', "n'è", 'andato.'])