# Introduction

**Welcome to the Language modeling Notebook.**

In this assignment, you are going to train a neural network to **generate news headlines**.
To reduce computational needs, we have restricted the dataset to headlines about technology and a handful of tech giants.
In this assignment you will:
- Learn to preprocess raw text so it can be fed into an LSTM.
- Use PyTorch's LSTM module to train a language model that generates headlines.
- Use your network to generate headlines, and judge which headlines are likely or not.




**What is a language model?**

Language modeling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assigns a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words.
— Page 105, __[Neural Network Methods in Natural Language Processing](https://www.amazon.com/Language-Processing-Synthesis-Lectures-Technologies/dp/1627052984/)__, 2017.

In neural network terms, we are training a network to produce a probability distribution (a classification) over a fixed vocabulary of words.
Concretely, we are training a neural network to produce:
$$ P ( w_{i+1} | w_1, w_2, w_3, ..., w_i), \forall i \in (1,n)$$
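
As a short worked step, the conditional probabilities above connect back to the probability of a whole sentence through the chain rule of probability. This is standard probability theory written in the notation of the formula above, not something specific to this assignment:

$$ P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=1}^{n-1} P(w_{i+1} \mid w_1, w_2, \ldots, w_i) $$

Taking logarithms turns the product into a sum:

$$ \log P(w_1, w_2, \ldots, w_n) = \log P(w_1) + \sum_{i=1}^{n-1} \log P(w_{i+1} \mid w_1, w_2, \ldots, w_i) $$

So a model that assigns high probability to each next word also assigns high probability to the whole sentence.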

**Why is language modeling important?**

Language modeling is a core problem in NLP.

Language models can be used stand-alone to produce new text that matches the distribution of the text they were trained on, or as the front end of a more sophisticated model to produce better results.

Recently, for example, the __[BERT](https://arxiv.org/abs/1810.04805)__ paper showed that pretraining a large neural network on a language modeling task can improve the state of the art on many NLP tasks.

**How good can the generation of a language model be?**

If you have not seen the post about GPT-2 by OpenAI, you should read some of the samples they generated from their language model __[here](https://blog.openai.com/better-language-models/#sample1)__.
Because of computational restrictions, we will not generate text of the same quality, but the same algorithm is at the core. They just use more data and compute.

# Library imports

Before starting, make sure you have all these libraries.

In [1]:
!pip install segtok


Run the first of the following two cells if you are running the homework locally, and the second cell if you are running it in Colab.

In [2]:
DRIVE=False
root_folder = ""
dataset_folder = "dataset/"

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')
# root_folder = "/content/drive/My Drive/cs182_hw3_public/"
# dataset_folder = "/content/drive/My Drive/cs182_hw3_public/dataset/"

In [4]:
%load_ext autoreload
%autoreload 2

In [20]:
import os
import sys
sys.path.append(root_folder)
from segtok import tokenizer
from collections import Counter
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import json
from utils import validate_to_array

# Loading the datasets

Make sure the dataset files are all in the `dataset` folder of the assignment.

 - If you are using this notebook locally, run the `download_data.sh` script.
 - If you are using the Colab version of the notebook, make sure that your Google Drive is mounted, and verify from the file explorer in Colab that the files are visible within `/content/drive/My Drive/cs182_hw3_public/dataset/`



In [6]:
# This cell loads the data for the model
# Run this before working on loading any of the additional data

with open(dataset_folder+"headline_generation_dataset_processed.json", "r") as f:
    d_released = json.load(f)

with open(dataset_folder+"headline_generation_vocabulary.txt", "r",encoding='utf8') as f:
    vocabulary = f.read().split("\n")
w2i = {w: i for i, w in enumerate(vocabulary)} # Word to index
i2w = {i: w for i, w in enumerate(vocabulary)} # Index to word
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

vocab_size = len(vocabulary)
input_length = len(d_released[0]['numerized']) # The length of the first element in the dataset, they are all of the same length
d_train = [d for d in d_released if d['cut'] == 'training']
d_valid = [d for d in d_released if d['cut'] == 'validation']

print("Number of training samples:",len(d_train))
print("Number of validation samples:",len(d_valid))

Number of training samples: 88568
Number of validation samples: 946


Now that we have loaded the data, let's inspect one of the elements. Each sample in our dataset has a `numerized` vector that contains the preprocessed headline as a list of token indices. This vector is what we will feed into the neural network. The already loaded list `vocabulary` (and the dictionary `i2w` built from it) maps each index back to its token string. Use these to recover the `title` key of entry 1001 in the training dataset.

**TODO**: Write the numerized2text function in notebook_utils and inspect element 1001 in the training dataset (`entry = d_train[1001]`).



In [7]:
def numerize_sequence(tokenized):
    return [w2i.get(w, unkI) for w in tokenized]
def pad_sequence(numerized, pad_index, to_length):
    pad = numerized[:to_length]
    padded = pad + [pad_index] * (to_length - len(pad))
    mask = [w != pad_index for w in padded]
    return padded, mask
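
To see what these two helpers do, here is a minimal self-contained sketch on a made-up vocabulary. The names `toy_vocabulary`, `toy_numerize` and `toy_pad_sequence` are hypothetical and only mirror the logic of `numerize_sequence` and `pad_sequence`; the real vocabulary is the one loaded from the dataset folder.

```python
# Toy vocabulary (an assumption for illustration); the special tokens come first,
# as in the real vocabulary file.
toy_vocabulary = ["<START>", "UNK", "PAD", "apple", "releases", "new", "phone"]
toy_w2i = {w: i for i, w in enumerate(toy_vocabulary)}
toy_unk, toy_pad = toy_w2i["UNK"], toy_w2i["PAD"]

def toy_numerize(tokens):
    # Words missing from the vocabulary fall back to the UNK index.
    return [toy_w2i.get(w, toy_unk) for w in tokens]

def toy_pad_sequence(numerized, pad_index, to_length):
    # Truncate to to_length, right-pad with pad_index, then mark real tokens.
    kept = numerized[:to_length]
    padded = kept + [pad_index] * (to_length - len(kept))
    mask = [w != pad_index for w in padded]
    return padded, mask

ids = toy_numerize(["apple", "releases", "shiny", "phone"])
padded, mask = toy_pad_sequence(ids, toy_pad, 6)
print(ids)     # [3, 4, 1, 6]  ("shiny" is unknown, so it becomes UNK)
print(padded)  # [3, 4, 1, 6, 2, 2]
print(mask)    # [True, True, True, True, False, False]
```

Sequences longer than `to_length` are cut, so every example ends up with exactly the same length.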

In [8]:
def numerized2text(numerized):
    """ Converts an integer sequence in the vocabulary into a string corresponding to the title.

        Arguments:
            numerized: List[int]  -- The list of vocabulary indices corresponding to the string
        Returns:
            title: str -- The string corresponding to the numerized input, without padding.
    """
    #####
    # BEGIN YOUR CODE HERE
    # Recover each word from the vocabulary in the list of indices in numerized, using the vocabulary variable
    # Hint 1: Use the string.join() function to reconstruct a single string
    # Hint 2: The objects and/or functions defined in above cells may be useful.
    #####

    converted_string = " ".join([i2w[n] for n in numerized])

    #####
    # END YOUR CODE HERE
    #####

    return converted_string

entry = d_train[1001]
print("Reversing the numerized: "+numerized2text(entry['numerized']))
# validate_to_array(numerized2text,(entry['numerized'],),'numerized2text',root_folder)
print("From the `title` entry: "+ entry['title'])

Reversing the numerized: microsoft donates cloud computing ' worth $ 1 bn ' PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
From the `title` entry: Microsoft donates cloud computing 'worth $1 bn'
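
The docstring of `numerized2text` asks for a string without padding, while the printed result above still contains the `PAD` tokens. One possible way to drop them is sketched below on a toy index-to-word mapping. The function name `numerized2text_no_pad` and the toy mapping are assumptions for illustration, not the reference solution.

```python
# Toy index-to-word mapping (hypothetical); the notebook's real one is i2w.
toy_i2w = {0: "<START>", 1: "UNK", 2: "PAD", 3: "microsoft", 4: "donates", 5: "cloud"}
toy_pad_index = 2

def numerized2text_no_pad(numerized, i2w, pad_index):
    # Skip PAD positions; int() also accepts NumPy integer or float indices.
    return " ".join(i2w[int(n)] for n in numerized if int(n) != pad_index)

text = numerized2text_no_pad([3, 4, 5, 2, 2], toy_i2w, toy_pad_index)
print(text)  # microsoft donates cloud
```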


In language modeling, we train a model to produce the next word in the sequence given all previously generated words. In practice, this takes two steps:

1. Adding a special `<START>` token to the start of the sequence for the input. This "shifts" the input to the right by one. We call this the "source" sequence.
2. Making the network predict the original, unshifted version (we call this the "target" sequence).

    
Let's take an example. Say we want to train the network on the sentence: "The cat is great."
The input to the network will be "`<START>` The cat is great." The target will be: "The cat is great".
    
Therefore the first prediction is to select the word "The" given the `<START>` token.
The second prediction is to produce the word "cat" given the two tokens "`<START>` The".
At each step, the network learns to predict the next word, given all previous ones.
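
The shift can be sketched on plain token strings before any NumPy is involved. The helper name `make_source_and_target` is hypothetical, and the fixed length of 6 is chosen only to keep the printout short.

```python
START, PAD = "<START>", "PAD"

def make_source_and_target(tokens, length):
    # Target: the tokens themselves, right-padded (or truncated) to a fixed length.
    target = (tokens + [PAD] * length)[:length]
    # Source: START followed by the target with its last position dropped,
    # so both sequences have the same length.
    source = [START] + target[:-1]
    return source, target

source, target = make_source_and_target(["the", "cat", "is", "great"], 6)

# At each position, the model sees the source prefix and must predict one target word.
for position, expected in enumerate(target):
    print(source[:position + 1], "->", expected)
```

The positions whose expected word is `PAD` are the ones the target mask later removes from the loss.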

Your next step is to write the build_batch function. Given a dataset, we select a subset of samples, and will build the "inputs" and the "targets" of the batch, following the procedure we've described.

**TODO**: Write the build_batch function. We give you the structure, and you have to fill in the places marked `your_code`.


In [30]:
def build_batch(dataset, indices):
    """ Builds a batch of source and target elements from the dataset.

        Arguments:
            dataset: List[db_element] -- A list of dataset elements
            indices: List[int] -- A list of indices of the dataset to sample
        Returns:
            batch_input: List[List[int]] -- List of source sequences
            batch_target: List[List[int]] -- List of target sequences
            batch_target_mask: List[List[int]] -- List of target batch masks
    """
    #####
    # BEGIN YOUR CODE HERE
    #####

    # We get a list of indices we will choose from the dataset.
    # indices = range(iteration*batch_size,(iteration+1)*batch_size)

    # Recover what the entries for the batch are
    batch = [dataset[i] for i in indices]

    # Get the raw numerized for this input, each element of the dataset has a 'numerized' key
    batch_numerized = [entry['numerized'] for entry in batch]

    # Create an array of startI that will be prepended (at position 0) to the input.
    # Should be of shape (batch_size, 1); an integer dtype keeps the token indices as ints
    start_tokens = np.full((len(batch), 1), startI, dtype=np.int64)

    # Concatenate the start_tokens with the rest of the input
    # The np.concatenate function should be useful
    # The output should now be [batch_size, sequence_length+1]
    batch_input = np.concatenate((start_tokens, batch_numerized), axis=1)

    # Remove the last word from each element in the batch
    # To restore the [batch_size, sequence_length] size
    batch_input = batch_input[:, :-1]

    # The target should be the un-shifted numerized input
    batch_target = np.array(batch_numerized)

    # The target-mask is a 0 or 1 filter to note which tokens are
    # padding or not, to give the loss, so the model doesn't get rewarded for
    # predicting PAD tokens.
    batch_target_mask = np.array([a['mask'] for a in batch])

    #####
    # END YOUR CODE HERE
    #####

    return batch_input, batch_target, batch_target_mask
# validate_to_array(build_batch,(d_train, range(100)),'build_batch',root_folder)
batch_input, batch_target, batch_target_mask = build_batch(d_train, range(100))

In [31]:
# test code to make sure the shape is right and batch_input is padded

mini_dataset = [d_train[1001], d_train[1002]]
mini_dataset_text = [d['title'] for d in mini_dataset]

batch_size = len(mini_dataset)
sequence_length = len(mini_dataset[0]['numerized'])

batch_input, batch_target, batch_target_mask = build_batch(mini_dataset, range(len(mini_dataset)))

assert batch_input.shape == (batch_size, sequence_length)    # verify batch_input shape
assert (batch_input[:,0] == startI).all()                    # verify batch_input starts with the START token index
assert batch_target.shape == (batch_size, sequence_length)   # verify batch_target shape

# sanity check
print(mini_dataset_text[1])
print(numerized2text(batch_input[1]))
print(numerized2text(batch_target[1]))
print(batch_target_mask[1])

Wearable chip start-up Ineda gets funding from Qualcomm and Samsung
<START> wearable chip start-up UNK gets funding from qualcomm and samsung PAD PAD PAD PAD PAD PAD PAD PAD PAD
wearable chip start-up UNK gets funding from qualcomm and samsung PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
[ True  True  True  True  True  True  True  True  True  True False False
 False False False False False False False False]


# Creating the language model

Now that we've written the data pipeline, we are ready to write the neural network.

The steps to set up a neural network for language modeling are:
- Preparing the input and target tensors that are fed to the model (PyTorch needs no placeholders; the batches from `build_batch` are converted to tensors).
- Creating an RNN of our choice, with a chosen size and optional parameters.
- Running the RNN on the inputs.
- Taking the output of the RNN and projecting it to a vocabulary-sized dimension, so that we can make word predictions.
- Setting up the loss on the outputs so that the network learns to produce the correct words.
- Finally, choosing an optimizer and defining a training step: using the optimizer to minimize the loss.

We provide skeleton code for the model; fill in the `your_code` sections. If you are unfamiliar with PyTorch, the skeleton hints at which functions to look for, and you should consult the PyTorch online documentation.

**TODO**: Fill in the LanguageModel in the language_model file.
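
For orientation, here is one way the steps listed above could fit together in PyTorch. This is a sketch under assumptions, not the reference solution: the class name `SketchLanguageModel` is hypothetical, the embedding size is simply set equal to `rnn_size`, and the placement of dropout is one choice among several. Only the constructor arguments and the `loss(prediction, target, mask)` call shape are taken from how the notebook uses `LanguageModel`.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SketchLanguageModel(nn.Module):
    """Hypothetical stand-in for LanguageModel, for illustration only."""

    def __init__(self, vocab_size, rnn_size, num_layers=1, dropout=0.0):
        super().__init__()
        # Assumption: embedding dimension equals the LSTM hidden size.
        self.embedding = nn.Embedding(vocab_size, rnn_size)
        # nn.LSTM applies its own dropout only between stacked layers,
        # so it is disabled here when there is a single layer.
        self.lstm = nn.LSTM(rnn_size, rnn_size, num_layers=num_layers,
                            batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        self.output = nn.Linear(rnn_size, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) token indices -> logits: (batch, seq_len, vocab_size)
        hidden, _ = self.lstm(self.embedding(x))
        return self.output(self.dropout(hidden))

    def loss(self, logits, targets, mask):
        # Per-token cross-entropy, averaged over non-PAD positions only.
        per_token = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                    targets.reshape(-1), reduction="none")
        mask = mask.reshape(-1)
        return (per_token * mask).sum() / mask.sum()

sketch_model = SketchLanguageModel(vocab_size=50, rnn_size=16)
sketch_tokens = torch.randint(0, 50, (2, 20))
sketch_logits = sketch_model(sketch_tokens)
sketch_loss = sketch_model.loss(sketch_logits, sketch_tokens, torch.ones(2, 20))
print(sketch_logits.shape)  # torch.Size([2, 20, 50])
print(sketch_loss.ndim)     # 0, i.e. a scalar
```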


In [70]:
from language_model import LanguageModel

# Training the model

Your objective is to train the language model on the provided dataset to reach a **validation loss <= 5.50**.

**TODO**: Train your model so that it achieves a validation loss of <= 5.5.

**Careful**: we will be testing this loss on an unreleased test set, so make sure to evaluate properly on a validation set and not overfit. You must save the model you want us to test; the cell in the evaluation section that saves your best model writes it to `best_models/part1_best_model.pt`.

**Advice**:
- It should be possible to attain loss <= 5.50 with a 1-layer LSTM of size 256 or less.
- You should not need more than 10 epochs to attain the threshold. More passes over the data can, however, give you a better model.
- You can also try:
    - LSTM dropout (PyTorch has a layer for that)
    - A multi-layer RNN (PyTorch has a layer for that)
    - Changing your optimizer, tuning your learning rate, or using a learning rate schedule.
    
**Extra credit**:

Get the **validation loss <= 5.00** and get 5 points of extra credit on this assignment. Get creative, but remember, what you do should work on our held-out test set to get the points.
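
Since the final score is measured on held-out data, it helps to record the validation loss after every epoch and keep the checkpoint from the best one rather than the last one. The sketch below shows only the bookkeeping. The helper `best_epoch` and the loss values are hypothetical, not results from this assignment.

```python
def best_epoch(validation_losses):
    # Return the index and value of the lowest validation loss seen so far.
    best_index, best_loss = None, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = epoch, loss
    return best_index, best_loss

# Made-up per-epoch validation losses: they improve, then start to rise again.
hypothetical_losses = [6.9, 6.1, 5.7, 5.45, 5.5, 5.6]
print(best_epoch(hypothetical_losses))  # (3, 5.45)
```

A validation loss that rises after its minimum, as in the made-up numbers here, is the usual sign of overfitting.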

In [73]:
# model parameters of our choosing
hidden_size = 256
num_layers = 1
dropout = .5

# Look at the docs for torch.optim and pick an optimizer
optimizer_class = optim.Adam
lr = 1e-4
epochs = 20
batch_size = 128

# convert build_batch function outputs from numpy to pytorch for compatibility with our model.
# cast the mask to float32, target and input to long
batch_to_torch = lambda b_in,b_target,b_mask: (torch.tensor(b_in).long(),
                                               torch.tensor(b_target).long(),
                                               torch.tensor(b_mask).float())

model_id = 'test1'
os.makedirs(root_folder+'models/part1/',exist_ok=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # torch is imported as `torch`, not `th`
print(device)
list_to_device = lambda th_obj: [tensor.to(device) for tensor in th_obj]

cpu


In [108]:
# sanity check on the mini dataset
hidden_size = 256
num_layers = 1
dropout = .5
model = LanguageModel(vocab_size=vocab_size, rnn_size=hidden_size, num_layers=num_layers, dropout=dropout)
# run with earlier mini_dataset
b_in, b_target, b_mask = batch_to_torch(*build_batch(mini_dataset, range(len(mini_dataset))))

prediction = model(b_in)
assert prediction.shape == (len(mini_dataset), sequence_length, vocab_size)

loss = model.loss(prediction, b_target, b_mask)
assert loss.ndim == 0  # at last verify it's a scalar

In [113]:
model = LanguageModel(vocab_size=vocab_size, rnn_size=hidden_size, num_layers=num_layers, dropout=dropout)
optimizer = optimizer_class(model.parameters(), lr=lr)

In [None]:
# set the model in training mode
model.train()
losses = []
accuracies = []

from tqdm import tqdm

for epoch in range(epochs):

    # for each epoch, get a random permutation of minibatches
    indices = np.random.permutation(range(len(d_train)))
    t = tqdm(range(0,(len(d_train)//batch_size)+1))
    
    for i in t:
        # build a minibatch for each iteration in the epoch.
        batch = build_batch(d_train, indices[i*batch_size:(i+1)*batch_size])
        (batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
        
        # push to cpu/gpu
        (batch_input, batch_target, batch_target_mask) = list_to_device((batch_input, batch_target, batch_target_mask))
        model.to(device)

        prediction = model(batch_input)
        loss = model.loss(prediction, batch_target, batch_target_mask)
        
        losses.append(loss.item())

        # check batch accuracy: % of non-masked tokens where predicted class (vocab item) is equal to the target class
        accuracy = (torch.eq(prediction.argmax(dim=2,keepdim=False),batch_target).float()*batch_target_mask).sum()/batch_target_mask.sum()
        accuracies.append(accuracy.item())

        # backward pass
        optimizer.zero_grad()   # zero out gradients for parameters we'll update
        loss.backward()         # compute gradients with backprop
        optimizer.step()        # actually update the params

        if i % 10 == 0:
            t.set_description(f"Epoch: {epoch} Iteration: {i} Loss: {np.mean(losses[-10:])} Accuracy: {np.mean(accuracies[-10:])}")
    # save your latest model
    save_dict = dict(
        kwargs = dict(
            vocab_size=vocab_size,
            rnn_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout
        ),
        model_state_dict = model.state_dict(),
        notes = "",
        optimizer_class = optimizer_class,
        lr = lr,
        epochs = epochs,
        batch_size = batch_size,
    )
    torch.save(save_dict,root_folder+f'models/part1/model_{model_id}.pt')

# Using the language model

Congratulations, you have now trained a language model! We can now use it to evaluate likely news headlines, as well as generate our very own headlines.

**TODO**: Complete the three parts below, using the model you have trained.

## (1) Evaluation loss

To evaluate the language model, we evaluate its loss (ability to predict) on unseen data that is reserved for evaluation.
Your first evaluation is to load the model you trained and obtain a validation loss. If you are only running this evaluation and not training, run the setup cell above the training loop first.

In [None]:
model_id = "test1"
save_dict = torch.load(root_folder+'models/part1/'+f"model_{model_id}.pt",map_location='cpu')
model = LanguageModel(**save_dict['kwargs'])
model.load_state_dict(save_dict['model_state_dict'])
model.eval()

In [None]:
# We will evaluate your model in the best_models folder
# In a very similar way as the code below.
# Make sure your validation loss is below the threshold we specified
# and that you didn't train using the validation set

batch = build_batch(d_valid, range(len(d_valid)))
(batch_input, batch_target, batch_target_mask) = batch_to_torch(*batch)
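with torch.no_grad():  # gradients are not needed for evaluation
    prediction = model(batch_input)
    # use the model's own loss, as in the training loop (loss_fn is not defined anywhere)
    loss = model.loss(prediction, batch_target, batch_target_mask)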
print("Evaluation set loss:", loss.item())

In [None]:
# Your best performing model should go here.
os.makedirs(root_folder+"best_models",exist_ok=True)
best_model_file = root_folder+"best_models/part1_best_model.pt"
torch.save(save_dict,best_model_file)

## (2) Evaluation of likelihood of data

One use of a language model is to see what data is more likely to have originated from the training data. Because we have trained our model on news headlines, we can see which of these headlines is more likely:

``Apple to release new iPhone in July``

``Apple and Samsung resolve all lawsuits``

**TODO**: Use the model to obtain the loss the neural network assigns to each sentence.
Because the neural network assigns probability to the words appearing in a sequence, this loss can be used as a proxy to measure how likely the sentence is to have occurred in the dataset.
Once you have the loss for each headline, write down which sentence was judged to be more likely, and explain why/if you think this is coherent.
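
To make the link between loss and likelihood concrete, the sketch below assumes the loss is the mean negative log-probability (in nats) of the real tokens, which is what a masked cross-entropy computes. The per-token probabilities are made up, and `sequence_stats` is a hypothetical helper.

```python
import math

def sequence_stats(token_probs):
    # token_probs[i]: probability the model assigned to the i-th real (non-PAD) token.
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)
    return {
        "mean_loss": mean_loss,                       # what a masked loss would report
        "perplexity": math.exp(mean_loss),            # exponentiated mean loss
        "sequence_probability": math.exp(-sum(nll)),  # product of the token probabilities
    }

# Hypothetical per-token probabilities for two 4-token headlines.
likely = sequence_stats([0.5, 0.4, 0.5, 0.25])
unlikely = sequence_stats([0.5, 0.1, 0.05, 0.25])

print(likely["mean_loss"] < unlikely["mean_loss"])  # True
```

Because the loss is averaged per token, it stays comparable between headlines of different lengths, whereas the raw sequence probability shrinks with every extra word.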

**Your answer:**


In [None]:
def raw_sample_pred(headline, model):
    #####
    # BEGIN YOUR CODE HERE
    #####
    # From the code in the Preprocessing section at the end of the notebook
    # Find out how to tokenize the headline
    tokenized = your_code

    # Find out how to numerize the tokenized headline
    numerized = your_code

    # Learn how to pad and obtain the mask of the sequence.
    padded, mask = your_code

    # Obtain the predicted headline and target headline
    input_headline = your_code
    pred_headline = your_code
    target_headline = your_code
    mask = your_code

    #####
    # END YOUR CODE HERE
    #####

    return pred_headline,target_headline,mask

In [None]:
model.eval()

headline1 = "Apple to release new iPhone in July"
headline2 = "Apple and Samsung resolve all lawsuits"

headlines = [headline1.lower(), headline2.lower()] # Our LSTM is trained on lower-cased headlines
for headline in headlines:
    pred_headline,target_headline,mask = raw_sample_pred(headline, model)
    loss = your_code # Obtain the loss

    print("----------------------------------------")
    print("Headline:", headline)
    print("Loss of the headline:", loss)
validate_to_array(raw_sample_pred,zip(headlines,[model]*2),'raw_sample_pred',root_folder,multi=True)
# Important check: one headline should be more likely (and have lower loss)
# Than the other headline. You should know which headline should have lower loss.

## (3) Generation of headlines

We can use our language model to generate text according to the distribution of our training data.
The way generation works is the following:

We seed the model with a beginning of sequence, and obtain the distribution for the next word.
We select the most likely word (argmax) and add it to our sequence of words.
Now our sequence is one word longer, and we can feed it in again as an input, for the network to produce the next word.
We do this a fixed number of times (up to 20 words), and obtain automatically generated headlines!
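
The loop described above can be sketched without a neural network by replacing the model with a small lookup table. This is a deliberate simplification: the hypothetical table `toy_next_word_scores` looks only at the previous word, whereas the LSTM conditions on the whole prefix. All words and scores are made up.

```python
# Hypothetical next-word scores keyed by the previous word only.
toy_next_word_scores = {
    "<START>": {"apple": 0.6, "google": 0.4},
    "apple": {"unveils": 0.7, "sues": 0.3},
    "unveils": {"new": 0.8, "iphone": 0.2},
    "new": {"iphone": 0.9, "apple": 0.1},
    "iphone": {"sales": 0.6, "today": 0.4},
    "sales": {"rise": 1.0},
}

def greedy_generate(starter_tokens, max_length):
    sequence = ["<START>"] + list(starter_tokens)
    while len(sequence) < max_length:
        scores = toy_next_word_scores.get(sequence[-1])
        if scores is None:  # the toy table has no continuation for this word
            break
        # Greedy step: append the highest-scoring next word.
        sequence.append(max(scores, key=scores.get))
    return " ".join(sequence[1:])  # drop the START token

print(greedy_generate(["apple"], 20))  # apple unveils new iphone sales rise
print(greedy_generate(["apple"], 4))   # apple unveils new
```

Greedy selection is deterministic, so the same starter always produces the same headline.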


We have provided a few headline starters that should produce interesting generated headlines.

**TODO:** Get creative and find at least 2 more headline_starters that produce interesting headlines.

In [None]:
def generate_sentence(headline_starter, model):
    # Tokenize and numerize the headline. Put the numerized headline
    # beginning in `current_build`
    tokenized = tokenizer.word_tokenizer(headline_starter.lower())
    current_build = [startI] + numerize_sequence(tokenized)

    while len(current_build) < input_length:
        # Pad the current_build into a input_length vector.
        # We do this so that it can be processed by our LanguageModel class
        current_padded, _m = pad_sequence(current_build, padI, input_length)

        # Obtain the logits for the current padded sequence
        # This involves obtaining the output_logits from our model,
        # and not the loss like we have done so far
        logits = your_code
        logits_np = logits.detach().cpu().numpy()

        # Obtain the row of logits that interest us, the logits for the last non-pad
        # inputs
        last_logits = your_code

        # Find the highest scoring words in the last_logits
        # array, or sample from the softmax.
        # The np.argmax function may be useful for first option,
        # scipy.special.softmax and np.random.choice may be useful for second option.
        # Append this word to our current build
        current_build.append(your_code)

    # Go from the current_build of word_indices
    # To the headline (string) produced. This should involve
    # the vocabulary, and a string merger.
    produced_sentence = your_code
    return produced_sentence
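
The two selection strategies mentioned in the comments (argmax, or sampling from the softmax) can be compared on a small logit vector. The sketch below uses a hand-written softmax to stay self-contained. The helper `pick_next_index` and the logits are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def pick_next_index(last_logits, sample=False, rng=None):
    if not sample:
        # Greedy: always take the highest-scoring vocabulary index.
        return int(np.argmax(last_logits))
    # Sampling: draw an index with probability given by the softmax.
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(last_logits), p=softmax(last_logits)))

last_logits = np.array([0.1, 2.0, -1.0, 0.5])  # made-up scores over a 4-word vocabulary
greedy_index = pick_next_index(last_logits)
sampled_index = pick_next_index(last_logits, sample=True, rng=np.random.default_rng(0))
print(greedy_index)  # 1
print(sampled_index)
```

Sampling trades some coherence for variety: repeated runs with different random generators can yield different headlines from the same starter.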

In [None]:
model.eval()
# Here are some headline starters.
# They're all about tech companies, because
# That is what is in our dataset
headline_starters = ["apple has released", "google has released", "amazon", "tesla to"]
for headline_starter in headline_starters:
    print("===================")
    print("Generating headline starting with: "+headline_starter)

    produced_sentence = generate_sentence(headline_starter, model)
    print(produced_sentence)
validate_to_array(generate_sentence,zip(headline_starters,[model]*len(headline_starters)),"generate_sentence",root_folder,multi=True)

## All done

You are done with the first part of the homework.

The next notebook deals with text summarization!


# Preprocessing (read only)


**You can skip this section; however, you may find these functions useful in later sections of this notebook.**

We have provided this code so you can see how the dataset was generated. You will have to come back to some of these functions later in the assignment, so feel free to read through it to get familiar.

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

for a in dataset:
    a['tokenized'] = tokenizer.word_tokenizer(a['title'].lower())

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

word_counts = Counter()
for a in dataset:
    word_counts.update(a['tokenized'])

print(word_counts.most_common(30))

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

# Creating the vocab
vocab_size = 20000
special_words = ["<START>", "UNK", "PAD"]
vocabulary = special_words + [w for w, c in word_counts.most_common(vocab_size-len(special_words))]
w2i = {w: i for i, w in enumerate(vocabulary)}

# Numerizing and padding
input_length = 20
unkI, padI, startI = w2i['UNK'], w2i['PAD'], w2i['<START>']

for a in dataset:
    a['numerized'] = numerize_sequence(a['tokenized']) # Change words to IDs
    a['numerized'], a['mask'] = pad_sequence(a['numerized'], padI, input_length) # Append appropriate PAD tokens

# Compute fraction of words that are UNK:
# Count over the padded index sequences stored under 'numerized' (there is no 'input' key)
word_counters = Counter([w for a in dataset for w in a['numerized'] if w != padI])

print("Fraction of UNK words:", float(word_counters[unkI]) / sum(word_counters.values()))
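
The vocabulary construction and the UNK fraction can be traced by hand on a tiny corpus. The sketch below repeats the same steps with made-up titles and a vocabulary size of 5. All names prefixed with `toy_` are hypothetical.

```python
from collections import Counter

# Four made-up, already tokenized titles.
toy_titles = [["apple", "unveils", "iphone"],
              ["apple", "sues", "samsung"],
              ["google", "unveils", "pixel"],
              ["apple", "buys", "startup"]]

toy_counts = Counter(w for title in toy_titles for w in title)

# Keep the special tokens plus the most frequent words, up to the vocabulary size.
toy_vocab_size = 5
toy_special_words = ["<START>", "UNK", "PAD"]
toy_vocabulary = toy_special_words + [
    w for w, _ in toy_counts.most_common(toy_vocab_size - len(toy_special_words))
]
toy_w2i = {w: i for i, w in enumerate(toy_vocabulary)}
toy_unk = toy_w2i["UNK"]

# Numerize every title and measure how many tokens became UNK.
toy_numerized = [[toy_w2i.get(w, toy_unk) for w in title] for title in toy_titles]
toy_index_counts = Counter(i for title in toy_numerized for i in title)
toy_unk_fraction = toy_index_counts[toy_unk] / sum(toy_index_counts.values())

print(toy_vocabulary)    # ['<START>', 'UNK', 'PAD', 'apple', 'unveils']
print(toy_unk_fraction)  # 7 of the 12 tokens are out of vocabulary
```

A smaller vocabulary makes the output layer cheaper but raises the UNK fraction, which is the trade-off behind choosing `vocab_size`.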

In [None]:
# You do not need to run this
# This is to show you how the dataset was created
# You should read to understand, so you can preprocess text
# In the same way, in the evaluation section

d_released_processed   = [d for d in dataset if d['cut'] != 'testing']
d_unreleased_processed = [d for d in dataset if d['cut'] == 'testing']

with open("dataset/headline_generation_dataset_processed.json", "w") as f:
    json.dump(d_released_processed, f)

# This file is purposefully left out of the assignment, we will use it to evaluate your model.
with open("dataset/headline_generation_dataset_unreleased_processed.json", "w") as f:
    json.dump(d_unreleased_processed, f)

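# Open in text mode with an explicit encoding; writing encoded bytes to a text-mode file fails in Python 3
with open("dataset/headline_generation_vocabulary.txt", "w", encoding='utf8') as f:
    f.write("\n".join(vocabulary))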