## Neural Language Models
Status of Notebook: Work in Progress

Reference: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Dynet Version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm/nn-lm.py

Old PyTorch version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm-pytorch/nn-lm-batch.py

Additions compared to `nn.lm.ipynb`:
- Cleaned up model architecture code
- Added Dropout
- Using different initial learning rate

In [1]:
import math
import random
import time

import numpy as np
import torch
import torch.nn as nn

### Download the Data

In [2]:
# uncomment to download the datasets into data/ptb/, the directory the cells below read from
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/test.txt
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/train.txt
#!wget -P data/ptb https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/valid.txt

### Process the Data

In [2]:
# function to read in data: strip each line and split it into tokens on single spaces
def read_data(filename):
    data = []
    with open(filename, "r") as f:
        for line in f:
            line = line.strip().split(" ")
            data.append(line)
    return data

# read the data
train_data = read_data('data/ptb/train.txt')
val_data = read_data('data/ptb/valid.txt')

# create the word indices and register the special tokens
word_to_index = {}
index_to_word = {}
word_to_index["<s>"] = len(word_to_index)
index_to_word[len(word_to_index)-1] = "<s>"
word_to_index["<unk>"] = len(word_to_index) # add <UNK> to dictionary
index_to_word[len(word_to_index)-1] = "<unk>"

# build the word-to-index and index-to-word dictionaries from the data
def create_dict(data, check_unk=False):
    for line in data:
        for word in line:
            if word in word_to_index:
                continue
            if not check_unk:
                # new training word: assign the next free index
                word_to_index[word] = len(word_to_index)
                index_to_word[len(word_to_index)-1] = word
            else:
                # word unseen in training: map it to <unk> and leave index_to_word untouched
                # (has no effect here because the PTB data already comes with <unk>)
                word_to_index[word] = word_to_index["<unk>"]

create_dict(train_data)
create_dict(val_data, check_unk=True)

# convert each sentence into a list of word indices
def create_tensor(data):
    for line in data:
        yield([word_to_index[word] for word in line])

train_data = list(create_tensor(train_data))
val_data = list(create_tensor(val_data))

number_of_words = len(word_to_index)

This implementation uses batched training: all n-gram histories of a sentence are scored in a single forward pass. It therefore differs in a few places from the original implementation found [here](https://github.com/neubig/nn4nlp-code/blob/master/02-lm/loglin-lm.py).
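
As a rough sketch of what batching means here: judging from `calc_sent_loss` in the cells below, each sentence is unrolled into one list of histories and one list of targets, and the whole list is passed to the model at once. The helper `build_histories_and_targets` and the toy indices are assumptions made for illustration; they are not part of the notebook.

```python
def build_histories_and_targets(sent, n, s_id):
    """Return (histories, targets) for one sentence of word indices.

    The history starts as n copies of s_id, and s_id is also appended
    as the final target so the model learns where sentences end.
    """
    hist = [s_id] * n
    histories, targets = [], []
    for next_word in sent + [s_id]:
        histories.append(list(hist))   # copy, because hist is rebuilt below
        targets.append(next_word)
        hist = hist[1:] + [next_word]  # slide the window by one word
    return histories, targets


# Toy example: <s> has index 0 and the sentence is the indices [5, 7, 9].
histories, targets = build_histories_and_targets([5, 7, 9], n=2, s_id=0)
print(histories)  # [[0, 0], [0, 5], [5, 7], [7, 9]]
print(targets)    # [5, 7, 9, 0]
```

With a history length of 2, a three-word sentence yields a 4 x 2 batch of inputs and 4 targets, so one forward pass covers every prediction in the sentence.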

### Define the Model

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

N = 2 # length of the n-gram
EMB_SIZE = 128 # size of the embedding
HID_SIZE = 128 # size of the hidden layer

# Neural LM
class FNN_LM(nn.Module):
    def __init__(self, number_of_words, ngram_length, EMB_SIZE, HID_SIZE, dropout):
        super(FNN_LM, self).__init__()

        # embedding layer
        self.embedding = nn.Embedding(number_of_words, EMB_SIZE)

        self.fnn = nn.Sequential(
            # hidden layer
            nn.Linear(EMB_SIZE * ngram_length, HID_SIZE),
            nn.Tanh(),
            # dropout layer
            nn.Dropout(dropout),
            # output layer
            nn.Linear(HID_SIZE, number_of_words)
        )

    def forward(self, x):
        embs = self.embedding(x)              # Size: [batch_size x num_hist x emb_size]
        feat = embs.view(embs.size(0), -1)    # Size: [batch_size x (num_hist*emb_size)]
        logit = self.fnn(feat)                # Size: [batch_size x num_words]
        return logit
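
Written out as equations, the forward pass above can be read as follows. The symbols $E$, $W_1$, $b_1$, $W_2$, $b_2$, $d$ and $H$ are notation introduced here for the embedding table, the two `nn.Linear` layers, `EMB_SIZE` and `HID_SIZE`; they do not appear in the notebook, and the formulation is not claimed to match the referenced paper term by term.

$$
\begin{aligned}
x &= \big[\,E[w_{t-N}];\ \dots;\ E[w_{t-1}]\,\big] \in \mathbb{R}^{N d} \\
h &= \mathrm{dropout}\big(\tanh(W_1 x + b_1)\big) \in \mathbb{R}^{H} \\
\mathrm{logit} &= W_2 h + b_2 \in \mathbb{R}^{|V|}
\end{aligned}
$$

Here $|V|$ is `number_of_words`. The model returns unnormalized logits rather than probabilities, which is the input that `CrossEntropyLoss` expects, since it applies the log-softmax itself.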

### Model Settings and Functions

In [17]:
model = FNN_LM(number_of_words, N, EMB_SIZE, HID_SIZE, dropout=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss(reduction="sum")

if torch.cuda.is_available():
    model.to(device)

# function to calculate the sentence loss
def calc_sent_loss(sent):
    S = word_to_index["<s>"]
    
    # initial history is equal to end of sentence symbols
    hist = [S] * N
    
    # collect all target and histories
    all_targets = []
    all_histories = []
    
    # step through the sentence, including the end of sentence token
    for next_word in sent + [S]:
        all_histories.append(list(hist))
        all_targets.append(next_word)
        hist = hist[1:] + [next_word]

    logits = model(torch.LongTensor(all_histories).to(device))
    loss = criterion(logits, torch.LongTensor(all_targets).to(device))

    return loss

MAX_LEN = 100
# Function to generate a sentence
def generate_sent():
    S = word_to_index["<s>"]
    hist = [S] * N
    sent = []
    while True:
        with torch.no_grad():  # no gradients are needed while sampling
            logits = model(torch.LongTensor([hist]).to(device))
        p = torch.nn.functional.softmax(logits, dim=-1) # 1 x number_of_words
        next_word = p.multinomial(num_samples=1).item()
        if next_word == S or len(sent) == MAX_LEN:
            break
        sent.append(next_word)
        hist = hist[1:] + [next_word]
    return sent
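
To make the sampling step in `generate_sent` concrete, here is a dependency-free sketch of what a softmax followed by a multinomial draw does for a single step: the logits become a probability distribution, and one index is drawn in proportion to its probability. The helpers `softmax` and `sample_index` and the toy logits are illustrative assumptions; the notebook itself relies on the PyTorch calls.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng=random):
    """Draw one index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
probs = softmax([2.0, 1.0, 0.1])  # toy logits over a 3-word vocabulary
draws = [sample_index(probs, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(3)]
print(probs)   # roughly [0.66, 0.24, 0.10]
print(counts)  # index 0 is drawn most often
```

Sampling, rather than always taking the most likely word, is why the generated sentences differ from run to run.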

### Train the Model
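
The loop in this section reports a per-word loss and a perplexity. Because the criterion sums the natural-log cross-entropy over tokens, the perplexity is the exponential of the average loss per word. The sketch below recomputes the first-iteration perplexities from the per-word losses printed in the log; `perplexity` is a hypothetical helper, not part of the notebook.

```python
import math


def perplexity(total_loss, num_words):
    """Perplexity = exp(average negative log-likelihood per word)."""
    return math.exp(total_loss / num_words)


# Per-word losses copied from the first-iteration log lines; num_words is 1
# because the logged values are already averages.
train_ppl = perplexity(6.1274, 1)
dev_ppl = perplexity(5.8676, 1)
print(round(train_ppl), round(dev_ppl))  # 458 353
```

A perplexity of about 353 on the validation set means the model is, on average, about as uncertain as a uniform choice among 353 words.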

In [19]:
# start training
for ITER in range(5):
    # training
    random.shuffle(train_data)
    model.train()
    train_words, train_loss = 0, 0.0
    start = time.time()
    for sent_id, sent in enumerate(train_data):        
        my_loss = calc_sent_loss(sent)
        train_loss += my_loss.item()
        train_words += len(sent)
        optimizer.zero_grad()
        my_loss.backward()
        optimizer.step()
        if (sent_id+1) % 5000 == 0:
            print("--finished %r sentences (words/sec=%.2f)" % (sent_id+1, train_words/(time.time()-start)))
    print("iter %r: train loss/word=%.4f, ppl=%.4f, (words/sec=%.2f)" % (ITER, train_loss/train_words, math.exp(train_loss/train_words), train_words/(time.time()-start)))

    # evaluation
    model.eval()
    dev_words, dev_loss = 0, 0.0
    start = time.time()
    with torch.no_grad():  # no gradients are needed for evaluation
        for sent_id, sent in enumerate(val_data):
            my_loss = calc_sent_loss(sent)
            dev_loss += my_loss.item()
            dev_words += len(sent)
    # the last value is the elapsed evaluation time in seconds, not a words/sec rate
    print("iter %r: dev loss/word=%.4f, ppl=%.4f, (time=%.2fs)" % (ITER, dev_loss/dev_words, math.exp(dev_loss/dev_words), time.time()-start))

    # Generate a few sentences
    for _ in range(5):
        sent = generate_sent()
        print(" ".join([index_to_word[x] for x in sent]))

--finished 5000 sentences (words/sec=12807.67)
--finished 10000 sentences (words/sec=12788.71)
--finished 15000 sentences (words/sec=12807.44)
--finished 20000 sentences (words/sec=12801.59)
--finished 25000 sentences (words/sec=12852.69)
--finished 30000 sentences (words/sec=12843.39)
--finished 35000 sentences (words/sec=12835.04)
--finished 40000 sentences (words/sec=12816.01)
iter 0: train loss/word=6.1274, ppl=458.2398, (words/sec=12801.17)
iter 0: dev loss/word=5.8676, ppl=353.3835, (words/sec=1.44s)
it will change at georgia & co. got instead of totally a appointment from the big bankers posted <unk> & co. also received that brokers
one and claim the politicians amount for <unk> the measure of the california santa contract
our birth capitol led the giant <unk> by an <unk> market in the central <unk> held the rise of the company 's sheet that the irs on britain dollars
yesterday 's jail & <unk> investigations on news for buying creditors has lower market for polish statement so a



--finished 5000 sentences (words/sec=12587.62)
--finished 10000 sentences (words/sec=12652.41)
--finished 15000 sentences (words/sec=12740.18)
--finished 20000 sentences (words/sec=12763.71)
--finished 25000 sentences (words/sec=12753.94)
--finished 30000 sentences (words/sec=12754.24)
--finished 35000 sentences (words/sec=12762.18)
--finished 40000 sentences (words/sec=12740.41)
iter 1: train loss/word=5.7389, ppl=310.7324, (words/sec=12744.21)
iter 1: dev loss/word=5.7766, ppl=322.6629, (words/sec=1.40s)
the advertising for champion the dollar was named whose damage was down from lawyers and the new england told them need
rumors with cents a share
justice general operations in chicago
british bought what of going to pay to rates since april
according to an <unk> family
--finished 5000 sentences (words/sec=12702.39)
--finished 10000 sentences (words/sec=12731.82)
--finished 15000 sentences (words/sec=12755.89)
--finished 20000 sentences (words/sec=12828.83)
--finished 25000 sentences 