<a href="https://colab.research.google.com/github/duyminhnguyen97/NLPHomeworks/blob/master/Unigram_Language_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unigram Language Modeling

In this document, we will learn:

- How to implement a unigram language model from training data
- How to evaluate a language model on test data using perplexity

## Data

We will use the file [wiki-en-train.word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word) as the training data, and [wiki-en-test.word](https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word) as the test data. To test our implementation quickly, we will use small data files [01-train-input.txt](https://github.com/neubig/nlptutorial/blob/master/test/01-train-input.txt) and [01-test-input.txt](https://github.com/neubig/nlptutorial/blob/master/test/01-test-input.txt). All data files are from the [nlptutorial](https://github.com/neubig/nlptutorial) by Graham Neubig.

As the first step, we will download all necessary data files using the `wget` command.


In [0]:
!rm -f wiki-en-train.word
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word
    
!rm -f wiki-en-test.word
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word

!rm -f 01-train-input.txt
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/test/01-train-input.txt

!rm -f 01-test-input.txt
!wget https://raw.githubusercontent.com/neubig/nlptutorial/master/test/01-test-input.txt

--2020-01-18 04:42:09--  https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-train.word
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203886 (199K) [text/plain]
Saving to: ‘wiki-en-train.word’


2020-01-18 04:42:09 (7.75 MB/s) - ‘wiki-en-train.word’ saved [203886/203886]

--2020-01-18 04:42:19--  https://raw.githubusercontent.com/neubig/nlptutorial/master/data/wiki-en-test.word
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26989 (26K) [text/plain]
Saving to: ‘wiki-en-test.word’


2020-01-18 04:42:19 (3.68 MB/s) - ‘w

## Unigram Language Model

We are going to implement a unigram language model in this section. We will write two functions:

- `train_unigram`: Creates a unigram model from the training data and saves the model to a file.
- `test_unigram`: Reads a unigram model and calculates entropy, perplexity and coverage for the test set.
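
For reference, the estimate that `train_unigram` computes can be written as a maximum-likelihood formula. Here $c(w)$ is the number of times word $w$ occurs in the training data (with the end-of-sentence symbol `</s>` appended to every line), and the denominator is the total token count. The name $P_{ML}$ is shorthand introduced here, not notation from the notebook.

$$
P_{ML}(w) = \frac{c(w)}{\sum_{w'} c(w')}
$$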

In [0]:
from collections import defaultdict


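def train_unigram(train_file, model_file):
    """Estimate unigram probabilities from train_file and write them to model_file."""
    counts = defaultdict(int)  # c(w): number of occurrences of each word
    total_count = 0  # total number of tokens, including </s>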
    
    with open(train_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            words = line.split()
            words.append('</s>')
            for word in words:
                counts[word] += 1
                total_count += 1
    # Write one "word<TAB>probability" pair per line.
    with open(model_file, 'w') as fo:
        for word, count in counts.items():
            probability = count / total_count
            fo.write('%s\t%f\n' % (word, probability))

Now let's test the function on the small training file to verify that our implementation is correct.

In [0]:
!cat 01-train-input.txt

a b c
a b d


In [0]:
train_unigram('./01-train-input.txt', '01-train-answer.txt')

In [0]:
!cat 01-train-answer.txt

a	0.250000
b	0.250000
c	0.125000
</s>	0.250000
d	0.125000
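
As a quick sanity check, the same estimate can be recomputed in memory from the two training lines shown above. This is a minimal sketch rather than part of the original notebook: the names `train_lines`, `probs` and `total` are illustrative, and `collections.Counter` is used in place of the `defaultdict` from `train_unigram`.

```python
from collections import Counter

# In-memory copy of the two lines printed from 01-train-input.txt above.
train_lines = ["a b c", "a b d"]

counts = Counter()
for line in train_lines:
    # Append the end-of-sentence symbol, as train_unigram does.
    counts.update(line.split() + ["</s>"])

total = sum(counts.values())
probs = {word: count / total for word, count in counts.items()}

print(probs)
print(sum(probs.values()))  # a valid distribution sums to 1
```

Eight tokens are counted in total (six words plus two `</s>` symbols), which is why `a`, `b` and `</s>` each get 2/8 and `c` and `d` each get 1/8.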


We will now implement the function `test_unigram` for evaluating the model. We will need to load the model file before evaluating it on the test data.

In [0]:
def load_unigram_model(model_file):
    probabilities = {}
    with open(model_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            w, p = line.split()
            probabilities[w] = float(p)
    return probabilities

In [0]:
probabilities = load_unigram_model('01-train-answer.txt')
print(probabilities)

{'a': 0.25, 'b': 0.25, 'c': 0.125, '</s>': 0.25, 'd': 0.125}
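
The evaluation cell that follows smooths the model so that unknown words do not receive zero probability. Read off from that code, the quantities it computes can be summarized as below, where $P_{ML}(w)$ is the probability stored in the model file (taken as 0 for words not in the model), $\lambda_1 = 0.95$, $V = 10^6$ is the assumed vocabulary size including unknown words, $W$ is the number of test tokens, and $\text{unk}$ is the number of test tokens missing from the model. The symbol $P_{ML}$ is shorthand introduced here.

$$
P(w) = \lambda_1 P_{ML}(w) + \frac{1 - \lambda_1}{V}
$$

$$
H = -\frac{1}{W} \sum_{i=1}^{W} \log_2 P(w_i), \qquad \text{perplexity} = 2^{H}, \qquad \text{coverage} = \frac{W - \text{unk}}{W}
$$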


In [0]:
import math 


lambda1 = 0.95
lambda_unk = 1 - lambda1
V = 1000000

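def test_unigram(test_file, model_file):
    """Print entropy, perplexity and coverage of the model on test_file."""
    probabilities = load_unigram_model(model_file)
    W = 0  # total number of test tokens, including </s>
    H = 0  # accumulated negative log2-likelihood; divided by W below to get entropy
    unk = 0  # number of test tokens not found in the model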
    with open(test_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line == '':
                continue
            words = line.split()
            words.append('</s>')
            for w in words:
                W += 1
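                p = lambda_unk/V  # probability mass reserved for unknown words
                if w in probabilities:
                    p += lambda1 * probabilities[w]
                else:
                    unk += 1
                H += -math.log2(p)
    H = H/W  # average over all tokens to get the per-word entropy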
    print('Entropy: {}'.format(H))
    print('Perplexity: {}'.format(2**H))
    print('Coverage: {}'.format((W-unk)/W))
    

In [0]:
test_unigram('01-test-input.txt', '01-train-answer.txt')

Entropy: 6.709899494272102
Perplexity: 104.6841703912115
Coverage: 0.8
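
To see where these numbers come from, the calculation can be repeated by hand on a token list. The sketch below is an illustration, not part of the original notebook. It assumes that the small test set tokenizes to `a c </s> e </s>`; the file contents are not shown here, so this is an assumption, though it is consistent with the reported coverage of 0.8 (four known tokens out of five). The names `model`, `test_tokens` and `neg_log_likelihood` are illustrative.

```python
import math

# Same smoothing constants as in the evaluation cell above.
lambda1 = 0.95
lambda_unk = 1 - lambda1
V = 1000000

# Model probabilities as loaded from 01-train-answer.txt above.
model = {'a': 0.25, 'b': 0.25, 'c': 0.125, '</s>': 0.25, 'd': 0.125}

# ASSUMPTION: tokens of the small test set, with </s> appended per line.
test_tokens = ['a', 'c', '</s>', 'e', '</s>']

neg_log_likelihood = 0.0
unknown = 0
for w in test_tokens:
    # Interpolate the model probability with a uniform unknown-word term.
    p = lambda_unk / V + lambda1 * model.get(w, 0.0)
    if w not in model:
        unknown += 1
    neg_log_likelihood += -math.log2(p)

entropy = neg_log_likelihood / len(test_tokens)
perplexity = 2 ** entropy
coverage = (len(test_tokens) - unknown) / len(test_tokens)

print('Entropy: {}'.format(entropy))
print('Perplexity: {}'.format(perplexity))
print('Coverage: {}'.format(coverage))
```

Under that assumption, the single unknown token `e` contributes about 24.25 bits on its own, since its probability is only `lambda_unk / V`, while each known token contributes roughly 2 to 3 bits. This is why the average entropy is so much higher than the model probabilities alone would suggest.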


Let's now train and test the unigram language model on the larger data files.

In [0]:
train_unigram('wiki-en-train.word', 'unigram_model.txt')
test_unigram('wiki-en-test.word', 'unigram_model.txt')

Entropy: 10.526656347101143
Perplexity: 1475.1606346867834
Coverage: 0.895226024503591
