## Neural Language Models
Status of Notebook: Work in Progress

Reference: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Dynet Version: https://github.com/neubig/nn4nlp-code/blob/master/02-lm/nn-lm.py

In [3]:
import torch
import random
import torch
import torch.nn as nn
import math
import time
import numpy as np

### Download the Data

In [6]:
# uncomment to download the datasets
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/test.txt
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/train.txt
#!wget https://raw.githubusercontent.com/neubig/nn4nlp-code/master/data/ptb/valid.txt

### Process the Data

In [4]:
# function to read in data, process each line and split columns by " ||| "
def read_data(filename):
    data = []
    with open(filename, "r") as f:
        for line in f:
            line = line.strip().split(" ")
            data.append(line)
    return data

# read the data
train_data = read_data('data/ptb/train.txt')
val_data = read_data('data/ptb/valid.txt')

# creating the word and tag indices and special tokens
word_to_index = {}
index_to_word = {}
word_to_index["<s>"] = len(word_to_index)
index_to_word[len(word_to_index)-1] = "<s>"
word_to_index["<unk>"] = len(word_to_index) # add <UNK> to dictionary
index_to_word[len(word_to_index)-1] = "<unk>"

# create word to index dictionary and tag to index dictionary from data
def create_dict(data, check_unk=False):
    for line in data:
        for word in line:
            if check_unk == False:
                if word not in word_to_index:
                    word_to_index[word] = len(word_to_index)
                    index_to_word[len(word_to_index)-1] = word
            
            # has no effect because data already comes with <unk>
            # should work with data without <unk> already processed
            else: 
                if word not in word_to_index:
                    word_to_index[word] = word_to_index["<unk>"]
                    index_to_word[len(word_to_index)-1] = word

create_dict(train_data)
create_dict(val_data, check_unk=True)

# create word and tag tensors from data
def create_tensor(data):
    for line in data:
        yield([word_to_index[word] for word in line])

train_data = list(create_tensor(train_data))
val_data = list(create_tensor(val_data))

number_of_words = len(word_to_index)

In our implementation we are using batched training. There are a few differences from the original implementation found [here](https://github.com/neubig/nn4nlp-code/blob/master/02-lm/loglin-lm.py). 

### Define the Model

In [5]:
## define the model

device = 'cuda' if torch.cuda.is_available() else 'cpu'

N = 2 # length of the n-gram
EMB_SIZE = 128 # size of the embedding
HID_SIZE = 128 # size of the hidden layer

# Neural LM
class NeuralLM(nn.Module):
    def __init__(self, number_of_words, ngram_length, EMB_SIZE, HID_SIZE):
        super(NeuralLM, self).__init__()

        # embedding layer
        self.embedding = nn.Embedding(number_of_words, EMB_SIZE)

        # hidden layer
        self.hidden = nn.Linear(EMB_SIZE * ngram_length, HID_SIZE)
        # output layer
        self.output = nn.Linear(HID_SIZE, number_of_words)

    def forward(self, x):
        embs = self.embedding(x) # [batch_size x num_hist x emb_size]
        embs = embs.view(embs.size(0), -1) # [batch_size x (num_hist*emb_size)]
        h = torch.nn.functional.tanh(self.hidden(embs)) # batch_size x hid_size
        scores = self.output(h) # batch_size x num_words
        return scores

### Model Settings and Functions

In [7]:
model = NeuralLM(number_of_words, N, EMB_SIZE, HID_SIZE)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

if torch.cuda.is_available():
    model.to(device)

# function to calculate the sentence loss
def calc_sent_loss(sent):
    S = word_to_index["<s>"]
    
    # initial history is equal to end of sentence symbols
    hist = [S] * N
    
    # collect all target and histories
    all_targets = []
    all_histories = []
    
    # step through the sentence, including the end of sentence token
    for next_word in sent + [S]:
        all_histories.append(list(hist))
        all_targets.append(next_word)
        hist = hist[1:] + [next_word]

    logits = model(torch.LongTensor(all_histories).to(device))
    loss = criterion(logits, torch.LongTensor(all_targets).to(device))

    return torch.sum(loss)

MAX_LEN = 100
# Function to generate a sentence
def generate_sent():
    S = word_to_index["<s>"]
    hist = [S] * N
    sent = []
    while True:
        logits = model(torch.LongTensor([hist]).to(device))
        p = torch.nn.functional.softmax(logits) # 1 x number_of_words
        next_word = p.multinomial(num_samples=1).item()
        if next_word == S or len(sent) == MAX_LEN:
            break
        sent.append(next_word)
        hist = hist[1:] + [next_word]
    return sent

### Train the Model

In [9]:
# start training
for ITER in range (10): # CHANGE to 100
    # training
    random.shuffle(train_data)

    model.train()
    train_words, train_loss = 0, 0.0
    for sent_id, sent in enumerate(train_data): # CHANGE to all train_data
        
        my_loss = calc_sent_loss(sent)
        
        train_loss += my_loss.item()
        train_words += len(sent)

        optimizer.zero_grad()
        my_loss.backward()
        optimizer.step()

        if (sent_id+1) % 5000 == 0:
            print("--finished %r sentences" % (sent_id+1))
    print("iter %r: train loss/word=%.4f, ppl=%.4f" % (ITER, train_loss/train_words, math.exp(train_loss/train_words)))

    # evaluation
    model.eval()
    dev_words, dev_loss = 0, 0.0
    start = time.time()
    for sent_id, sent in enumerate(val_data):
        my_loss = calc_sent_loss(sent)
        dev_loss += my_loss.item()
        dev_words += len(sent)
    print("iter %r: dev loss/word=%.4f, ppl=%.4f, time=%.2fs" % (ITER, dev_loss/dev_words, math.exp(dev_loss/dev_words), time.time()-start))

    # Generate a few sentences
    for _ in range(5):
        sent = generate_sent()
        print(" ".join([index_to_word[x] for x in sent]))



--finished 5000 sentences
--finished 10000 sentences
--finished 15000 sentences
--finished 20000 sentences
--finished 25000 sentences
--finished 30000 sentences
--finished 35000 sentences
--finished 40000 sentences
iter 0: train loss/word=4.3265, ppl=75.6790
iter 0: dev loss/word=4.4869, ppl=88.8440, time=1.42s
the next healthvest the club of lying <unk> student <unk> could <unk>
the next healthvest the club of lying <unk> student <unk> could <unk> and a company at a
the next healthvest the club of lying <unk> student <unk> could <unk> <unk> <unk> as <unk> williams <unk> student <unk> could <unk> and <unk> around <unk> ruth <unk> student <unk> student <unk> could <unk> and <unk> with <unk> student <unk> could <unk> and <unk> company <unk> on <unk> but <unk> each <unk> student <unk> could <unk> <unk> as <unk> williams <unk> student <unk> could <unk> and a company at a a <unk> student <unk> could <unk> <unk> as <unk> williams <unk> student <unk> could <unk> and <unk> around <unk> ruth <u



--finished 5000 sentences
--finished 10000 sentences
--finished 15000 sentences
--finished 20000 sentences
--finished 25000 sentences
--finished 30000 sentences
--finished 35000 sentences
--finished 40000 sentences
iter 1: train loss/word=4.4450, ppl=85.2025
iter 1: dev loss/word=4.5506, ppl=94.6892, time=1.41s
japanese the quarter unit 's adrs <unk> n.c that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new york banks release respondents that it describes new
japanese the quarter unit to a blast o