In this second part of the lab, we will implement a language identifier trained on the same data, but using Logistic Regression instead of Naive Bayes.

In [None]:
import io, sys, math
import numpy as np
from collections import defaultdict

This function is used to build the dictionary, or vocabulary, which is a mapping from strings (or words) to integers (or indices). This will allow to build vector representations of documents. 

In [None]:
def build_dict(filename, threshold=1):
    fin = io.open(filename, 'r', encoding='utf-8')
    word_dict, label_dict = {}, {}
    counts = defaultdict(lambda: 0)
    for line in fin:
        tokens = line.split()
        label = tokens[0]

        if not label in label_dict:
            label_dict[label] = len(label_dict)

        for w in tokens[1:]:
            counts[w] += 1
            
    for k, v in counts.items():
        if v > threshold:
            word_dict[k] = len(word_dict)
    return word_dict, label_dict

This function is used to load the training dataset, and build vector representations of the training examples. In particular, a document or sentence is represented as a bag of words. Each example correspond to a sparse vector ` x` of dimension `V`, where `V` is the size of the vocabulary. The element `j` of the vector `x` is the number of times the word `j` appears in the document.

In [None]:
def load_data(filename, word_dict, label_dict):
    fin = io.open(filename, 'r', encoding='utf-8')
    data = []
    dim = len(word_dict)
    for line in fin:
        tokens = line.split()
        label = tokens[0]

        yi = label_dict[label]
        xi = np.zeros(dim)
        for word in tokens[1:]:
            if word in word_dict:
                wid = word_dict[word]
                xi[wid] += 1.0
        data.append((yi, xi))
    return data

First, let's implement the softmax function. Don't forget numerical stability!

In [None]:
def softmax(x):
    ### FILL CODE
    e = np.exp(x - x.max())
    result = e/np.sum(e, axis=0)
    return result

Now, let's implement the main training loop, by using stochastic gradient descent. The function will iterate over the examples of the training set. For each example, we will first compute the loss, before computing the gradient and performing the update.

In [28]:
def sgd(w, data, niter):
    nlabels, dim = w.shape
    alpha = 1/2
    for iter in range(niter):
        ### FILL CODE
        train_loss = 0.0

        for label, x in data:
          pred = softmax(w@x)
          train_loss += np.log(pred[label])
         # compute the gradient which is the partial derivative of train loss with respect to w
          target_0Vec = np.zeros(nlabels)  # construect a zero vector of the target
          target_0Vec[label] = 1.0    

          error = pred - target_0Vec    # compute the error
          grad = error.reshape(nlabels,1)*x.reshape(1,dim)
          # update w
          w = w - alpha*grad
          #loss
          loss = train_loss/len(data)
          print('iter: %02d loss: %03f'% (iter, loss))

        
    return w

The next function will predict the most probable label corresponding to example `x`, given the trained classifier `w`.

In [17]:
def predict(w, x):
    ## FILL CODE
    y = np.dot(w, x)
    predict = softmax(y)
    
    return np.argmax(predict)

Finally, this function will compute the accuracy of a trained classifier `w` on a validation set.

In [24]:
def compute_accuracy(w, valid_data):
  accuracy = 0.0
  for label, x in valid_data:

    ## FILL CODE
    pred = predict(w, x)
    # print(pred,label)
    if pred==label:
      accuracy += 1.0
     
  return (accuracy/len(valid_data))*100

In [29]:
print("")
print("** Logistic Regression **")
print("")

word_dict, label_dict = build_dict("/content/drive/MyDrive/NLP/Week1/session1/train1.txt")
train_data = load_data("/content/drive/MyDrive/NLP/Week1/session1/train1.txt", word_dict, label_dict)
valid_data = load_data("/content/drive/MyDrive/NLP/Week1/session1/valid1.txt", word_dict, label_dict)

nlabels = len(label_dict)
dim = len(word_dict)
w = np.zeros([nlabels, dim])
w = sgd(w, train_data, 5)
print("")
print("Validation accuracy: %.3f" % compute_accuracy(w, valid_data))
print("")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
iter: 04 loss: -0.059536
iter: 04 loss: -0.059536
iter: 04 loss: -0.059536
iter: 04 loss: -0.059536
iter: 04 loss: -0.059537
iter: 04 loss: -0.059537
iter: 04 loss: -0.059544
iter: 04 loss: -0.059573
iter: 04 loss: -0.059574
iter: 04 loss: -0.059609
iter: 04 loss: -0.059609
iter: 04 loss: -0.059617
iter: 04 loss: -0.059617
iter: 04 loss: -0.059618
iter: 04 loss: -0.059619
iter: 04 loss: -0.059647
iter: 04 loss: -0.059678
iter: 04 loss: -0.059679
iter: 04 loss: -0.059689
iter: 04 loss: -0.059691
iter: 04 loss: -0.059692
iter: 04 loss: -0.059692
iter: 04 loss: -0.059737
iter: 04 loss: -0.059737
iter: 04 loss: -0.059824
iter: 04 loss: -0.059825
iter: 04 loss: -0.059825
iter: 04 loss: -0.059825
iter: 04 loss: -0.059825
iter: 04 loss: -0.059883
iter: 04 loss: -0.059883
iter: 04 loss: -0.059888
iter: 04 loss: -0.059888
iter: 04 loss: -0.059888
iter: 04 loss: -0.059888
iter: 04 loss: -0.059917
iter: 04 loss: -0.059953
iter: 04 l