## Neural Language Models
Status of Notebook: Work in Progress

Reference: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Dynet Version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm/nn-lm.py

Additions compared to `nn.lm.ipnyb`:
- Cleaned up model architecture code
- Added Dropout
- Using different initial learning rate

In [1]:
import torch
import random
import torch
import torch.nn as nn
import math
import time
import numpy as np

### Download the Data

In [6]:
# uncomment to download the datasets
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/test.txt
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/train.txt
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/valid.txt

### Process the Data

In [2]:
# function to read in data, pro=ess each line and split columns by " ||| "
def read_data(filename):
    data = []
    with open(filename, "r") as f:
        for line in f:
            line = line.strip().split(" ")
            data.append(line)
    return data

# read the data
train_data = read_data('data/ptb/train.txt')
val_data = read_data('data/ptb/valid.txt')

# creating the word and tag indices and special tokens
word_to_index = {}
index_to_word = {}
word_to_index["<s>"] = len(word_to_index)
index_to_word[len(word_to_index)-1] = "<s>"
word_to_index["<unk>"] = len(word_to_index) # add <UNK> to dictionary
index_to_word[len(word_to_index)-1] = "<unk>"

# create word to index dictionary and tag to index dictionary from data
def create_dict(data, check_unk=False):
    for line in data:
        for word in line:
            if check_unk == False:
                if word not in word_to_index:
                    word_to_index[word] = len(word_to_index)
                    index_to_word[len(word_to_index)-1] = word
            
            # has no effect because data already comes with <unk>
            # should work with data without <unk> already processed
            else: 
                if word not in word_to_index:
                    word_to_index[word] = word_to_index["<unk>"]
                    index_to_word[len(word_to_index)-1] = word

create_dict(train_data)
create_dict(val_data, check_unk=True)

# create word and tag tensors from data
def create_tensor(data):
    for line in data:
        yield([word_to_index[word] for word in line])

train_data = list(create_tensor(train_data))
val_data = list(create_tensor(val_data))

number_of_words = len(word_to_index)

In our implementation we are using batched training. There are a few differences from the original implementation found [here](https://github.com/neubig/nn4nlp-code/blob/master/02-lm/loglin-lm.py). 

### Define the Model

In [3]:
## define the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'

N = 2 # length of the n-gram
EMB_SIZE = 128 # size of the embedding
HID_SIZE = 128 # size of the hidden layer

# Neural LM
class NeuralLM(nn.Module):
    def __init__(self, number_of_words, ngram_length, EMB_SIZE, HID_SIZE, dropout):
        super(NeuralLM, self).__init__()

        # embedding layer
        self.embedding = nn.Embedding(number_of_words, EMB_SIZE)

        self.fnn = nn.Sequential(
            # hidden layer
            nn.Linear(EMB_SIZE * ngram_length, HID_SIZE),
            nn.Tanh(),
            # dropout layer
            nn.Dropout(dropout),
            # output layer
            nn.Linear(HID_SIZE, number_of_words)
        )

    def forward(self, x):
        embs = self.embedding(x)                        # Size: [batch_size x num_hist x emb_size]
        embs = embs.view(embs.size(0), -1)              # Size: [batch_size x (num_hist*emb_size)]
        logit = self.fnn(embs)                          # Size: batch_size x num_words                    
        return logit

### Model Settings and Functions

In [4]:
model = NeuralLM(number_of_words, N, EMB_SIZE, HID_SIZE, dropout=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

if torch.cuda.is_available():
    model.to(device)

# function to calculate the sentence loss
def calc_sent_loss(sent):
    S = word_to_index["<s>"]
    
    # initial history is equal to end of sentence symbols
    hist = [S] * N
    
    # collect all target and histories
    all_targets = []
    all_histories = []
    
    # step through the sentence, including the end of sentence token
    for next_word in sent + [S]:
        all_histories.append(list(hist))
        all_targets.append(next_word)
        hist = hist[1:] + [next_word]

    logits = model(torch.LongTensor(all_histories).to(device))
    loss = criterion(logits, torch.LongTensor(all_targets).to(device))

    return loss

MAX_LEN = 100
# Function to generate a sentence
def generate_sent():
    S = word_to_index["<s>"]
    hist = [S] * N
    sent = []
    while True:
        logits = model(torch.LongTensor([hist]).to(device))
        p = torch.nn.functional.softmax(logits) # 1 x number_of_words
        next_word = p.multinomial(num_samples=1).item()
        if next_word == S or len(sent) == MAX_LEN:
            break
        sent.append(next_word)
        hist = hist[1:] + [next_word]
    return sent

### Train the Model

In [5]:
# start training
for ITER in range (10): # CHANGE to 100
    # training
    random.shuffle(train_data)

    model.train()
    train_words, train_loss = 0, 0.0
    for sent_id, sent in enumerate(train_data):
        
        my_loss = calc_sent_loss(sent)

        train_loss += my_loss.item()
        train_words += len(sent)

        optimizer.zero_grad()
        my_loss.backward()
        optimizer.step()

        if (sent_id+1) % 5000 == 0:
            print("--finished %r sentences" % (sent_id+1))
    print("iter %r: train loss/word=%.4f, ppl=%.4f" % (ITER, train_loss/train_words, math.exp(train_loss/train_words)))

    # evaluation
    model.eval()
    dev_words, dev_loss = 0, 0.0
    start = time.time()
    for sent_id, sent in enumerate(val_data):
        my_loss = calc_sent_loss(sent)
        dev_loss += my_loss.item()
        dev_words += len(sent)
    print("iter %r: dev loss/word=%.4f, ppl=%.4f, time=%.2fs" % (ITER, dev_loss/dev_words, math.exp(dev_loss/dev_words), time.time()-start))

    # Generate a few sentences
    for _ in range(5):
        sent = generate_sent()
        print(" ".join([index_to_word[x] for x in sent]))

--finished 5000 sentences
--finished 10000 sentences
--finished 15000 sentences
--finished 20000 sentences
--finished 25000 sentences
--finished 30000 sentences
--finished 35000 sentences
--finished 40000 sentences
iter 0: train loss/word=0.2837, ppl=1.3280
iter 0: dev loss/word=0.2698, ppl=1.3097, time=1.39s
they are principle that
since prices rose N for the company is attorney employees fears
but what led control
N berkeley combining bears the administration 's participation about N N marks off N N after targeted short administration corp. funds today for the nine months that a carrier <unk> national a bell who has n't temporarily that create the market 's request to year-earlier N N utilities at a sale in the first of his aggressive newspapers to close of actions because a central democratic investor regional genetic <unk> volumes
then things do n't turning




--finished 5000 sentences
--finished 10000 sentences
--finished 15000 sentences
--finished 20000 sentences
--finished 25000 sentences
--finished 30000 sentences
--finished 35000 sentences
--finished 40000 sentences
iter 1: train loss/word=0.2607, ppl=1.2978
iter 1: dev loss/word=0.2641, ppl=1.3022, time=1.44s
edward jointly in boston corp. poland that 's kellogg in a mechanism for his biggest steps successful for its third year concentrate
it 's harris international operations of using technological in N million shares without the <unk> two-year <unk> about N N in beyond and economic reforms in stocks
sen. resistance t. other corp. processing companies on a units discussed their one-third has to pay
it could n't be reached minnesota in the opposition era
there are n't same finances is backed in frankfurt to encourage an fairly computer denied sansui of design forces we had once nine months america society of the sense the estimated volume was # N million
--finished 5000 sentences
--fin