Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode your text into integers (5 points)

In [None]:
# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils as nutils

# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time

import numpy as np


In [None]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

In [1]:
# load in necessary data
EMBEDDING_SAVE_FILE_WORD = "spooky_embedding_word.txt" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = "spooky_embedding_char.txt" # The file to save your character embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on
data_word = nutils.read_file_spooky(TRAIN_FILE, ngram=NGRAM)
data_char = nutils.read_file_spooky(TRAIN_FILE, ngram=NGRAM, by_character=True)

NameError: name 'nutils' is not defined

In [41]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: indexing starts from 1; index 0 is reserved for padding)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)
tokenizer_char = Tokenizer(char_level=True)
tokenizer_char.fit_on_texts(data_char)
encoded_char = tokenizer_char.texts_to_sequences(data_char)

tokenizer_word = Tokenizer(char_level=False)
tokenizer_word.fit_on_texts(data_word)
encoded_word = tokenizer_word.texts_to_sequences(data_word)



In [42]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings

char_word_index = tokenizer_char.word_index
print(len(char_word_index))
word_word_index = tokenizer_word.word_index
print(len(word_word_index))



60
25374
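
To make the token-to-index mapping concrete, here is a small pure-Python sketch that imitates a frequency-ranked word index. It is not the Keras implementation: `build_word_index` and `toy_data` are hypothetical names, and the sketch assumes indices start at 1 with the most frequent token first, which leaves 0 free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank tokens by frequency, starting at index 1."""
    counts = Counter(token for text in texts for token in text)
    # most_common() orders by descending count; ties keep first-seen order
    return {token: rank for rank, (token, _) in enumerate(counts.most_common(), start=1)}

toy_data = [["<s>", "the", "raven", "</s>"], ["<s>", "the", "house", "</s>"]]
toy_index = build_word_index(toy_data)
# Replace every token with its integer index
toy_encoded = [[toy_index[token] for token in text] for text in toy_data]
print(toy_index)
print(toy_encoded)
```

Because no token ever receives index 0, the number of distinct indices a model must handle is one more than the size of the word index.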


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [43]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    training_samples = []

    for encoded_list in encoded:
        # encoded_list does not have enough elements to create a single ngram
        if len(encoded_list) < ngram:
            continue 

        # We will end up with (#elements - ngram + 1) samples. E.g., a list with two elements and ngram size two produces only one sample
        for i in range(len(encoded_list) - ngram + 1):
            ngram_sample = encoded_list[i: i+ngram] # If ngram = 3, get [0:3], [1:4], ...
            # x = ngram_sample[:-1]  # All elements except the last one
            # y = ngram_sample[-1]   # The last element
            # training_samples.append(x + [y])
            training_samples.append(ngram_sample)
    return training_samples


# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words should give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

training_samples_word = generate_ngram_training_samples(encoded_word, ngram=NGRAM)
print('Word Training Samples')
print('Count:', len(training_samples_word))
for i in range(5):
    print(training_samples_word[i])
training_samples_char = generate_ngram_training_samples(encoded_char, ngram=NGRAM)
print('\nChar Training Samples')
print('Count:', len(training_samples_char))
for i in range(5):
    print(training_samples_char[i])


Word Training Samples
Count: 634080
[1, 1, 32]
[1, 32, 2956]
[32, 2956, 3]
[2956, 3, 155]
[3, 155, 3]

Char Training Samples
Count: 2957553
[21, 21, 3]
[21, 3, 9]
[3, 9, 7]
[9, 7, 8]
[7, 8, 1]
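
The sample counts above follow from a sliding window: a sequence of length L yields L - n + 1 windows of length n. A minimal sketch on a toy sequence (the helper name `sliding_windows` is an assumption, not part of the assignment):

```python
def sliding_windows(sequence, n):
    """Return every contiguous window of length n from sequence."""
    # range() is empty when the sequence is shorter than n, so no window is produced
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

toy_sentence = [1, 1, 32, 2956, 3]
windows = sliding_windows(toy_sentence, 3)
print(windows)       # three windows of length 3
print(len(windows))  # len(toy_sentence) - 3 + 1
```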


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [44]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...] and [y1, y2, ...]
# do that here
def splitSequence(samples):
  x_list = [] # list of list of x values
  y_list = [] # list of y values
  for sequence in samples:
    x_values = sequence[:-1] # Everything but the last element
    y_value = sequence[-1] # Last element
    x_list.append(x_values)
    y_list.append(y_value)
  return x_list, y_list

X_word, y_word = splitSequence(training_samples_word)
X_char, y_char = splitSequence(training_samples_char)

# print out the shapes to verify that they are correct
print(len(X_word))
print(len(X_word[0]))
print(len(y_word))
print(len(X_char))
print(len(X_char[0]))
print(len(y_char))

print(X_word[0])

634080
2
634080
2957553
2
2957553
[1, 1]


In [45]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses the embeddings trained earlier.
    Parameters:
        filename (str): path to file
        tokenizer (Tokenizer): tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    word_to_embedding = {}  # Mapping from word to its embedding vector
    index_to_embedding = {}  # Mapping from index to its embedding vector
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            split_line = line.split()
            # Skip the first line of file
            if len(split_line) == 2:
                continue
            word = split_line[0]
            vector = [float(x) for x in split_line[1:]]
        
            if word in tokenizer.word_index:
                word_to_embedding[word] = vector # Mapping from word to its embedding vector
                index_to_embedding[tokenizer.word_index[word]] = vector # Mapping from index to its embedding vector
    return word_to_embedding, index_to_embedding
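
The parsing logic can be checked on a tiny in-memory file before pointing it at the real embeddings. This sketch assumes the saved file starts with a "vocab size, dimensions" header line followed by one token and its vector per line; `toy_file` and `toy_word_index` are hypothetical stand-ins for the saved file and `tokenizer.word_index`.

```python
import io

# Toy file contents: a "<vocab size> <dimensions>" header, then one token per line
toy_file = io.StringIO("2 3\nthe 0.1 0.2 0.3\nraven 0.4 0.5 0.6\n")
toy_word_index = {"the": 1, "raven": 2}  # stands in for tokenizer.word_index

word_to_vec, index_to_vec = {}, {}
for line_number, line in enumerate(toy_file):
    if line_number == 0:
        continue  # skip the header by position rather than by field count
    token, *values = line.split()
    vector = [float(v) for v in values]
    if token in toy_word_index:
        word_to_vec[token] = vector
        index_to_vec[toy_word_index[token]] = vector

print(index_to_vec)
```

Skipping the header by position is one possible design choice; it avoids dropping a genuine entry that happens to have exactly two fields.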



In [46]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)
word_embedding_word, word_embedding_index = read_embeddings(EMBEDDING_SAVE_FILE_WORD, tokenizer_word)
word_embedding_index[0] = [0] * len(word_embedding_index[1])
char_embedding_word , char_embedding_index = read_embeddings(EMBEDDING_SAVE_FILE_CHAR, tokenizer_char)
char_embedding_index[0] = [0] * len(char_embedding_index[1]) # Add a list of 0s for index "0" (got size from index "1" )
print(char_embedding_index[0])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [95]:
# 10 points
from tensorflow.keras.utils import to_categorical
import numpy as np

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict, for_feedforward:bool=False) -> (list,list):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    If for_feedforward is True: 
    Returns data generator to be used by feed_forward
    else: Returns data generator for RNN model
    '''
    # YOUR CODE HERE
    def toEmbeddings(x_vector):
        result = []
        for token in x_vector:
            result.extend(index_2_embedding[token])
        return result

    num_samples = len(X)
    i = 0
    while i < num_samples:
        # Pad the final batch with zeros when it would otherwise be short and for_feedforward is False
        if not for_feedforward and i + num_sequences_per_batch > num_samples: 
            batch_X = X[i:]
            batch_y = y[i:]
            for j in range((i +  num_sequences_per_batch) - num_samples):
                n_gram = len(batch_X[0])
                batch_X.append([0] * n_gram)
                batch_y.append(0)
        else:
            batch_X = X[i:i + num_sequences_per_batch]
            batch_y = y[i:i + num_sequences_per_batch]
        # Convert batch_X to embeddings using index_2_embedding 
        embeddings = [toEmbeddings(x_vector) for x_vector in batch_X]
        # Encode batch_y as one-hot vectors of size len(index_2_embedding) 
        one_hot_vectors = to_categorical(batch_y, num_classes=len(index_2_embedding))
        yield np.array(embeddings), one_hot_vectors
        i += num_sequences_per_batch
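
When training from a generator, one batch is drawn per step, so a generator that stops after a single pass over the data runs out if training continues for more than one epoch. One way around this is to loop forever. The sketch below is a pure-Python illustration under that assumption; `looping_batches` and the toy data are hypothetical, and labels are one-hot encoded by hand instead of with `to_categorical`.

```python
from itertools import islice

def looping_batches(X, y, batch_size, index_to_embedding, num_classes):
    """Hypothetical sketch: yield (features, one-hot labels) batches forever."""
    while True:  # restart from the beginning after each full pass
        for start in range(0, len(X), batch_size):
            batch_X = X[start:start + batch_size]
            batch_y = y[start:start + batch_size]
            # Concatenate the context embeddings into one flat feature vector
            features = [
                [value for token in context for value in index_to_embedding[token]]
                for context in batch_X
            ]
            # One-hot encode each label as a list of length num_classes
            labels = [[1 if k == label else 0 for k in range(num_classes)] for label in batch_y]
            yield features, labels

toy_embeddings = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.4]}
toy_X = [[1, 2], [2, 1], [1, 1]]
toy_y = [2, 1, 2]
# Take three batches: two cover the data once, the third wraps around
batches = list(islice(looping_batches(toy_X, toy_y, 2, toy_embeddings, 3), 3))
print(batches[0])
print(len(batches[1][0]))
print(batches[2] == batches[0])
```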


In [81]:
data_gen = data_generator(X_word, y_word, 128, word_embedding_index, for_feedforward=True)
i = 0
for batch_X, batch_y in data_gen:
  i += len(batch_X)
  print(np.shape(batch_X[0]))
  print(np.shape(batch_y))
  if i == 128:
    break



(100,)
(128, 25375)


In [108]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# Examples:
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
# sample[1].shape   # (batch_size, |V|) to_categorical

num_sequences_per_batch = 128 # this is the batch size
steps_per_epoch = len(X_word)//num_sequences_per_batch  # Number of batches per epoch
train_generator = data_generator(X_word, y_word, num_sequences_per_batch, word_embedding_index, for_feedforward=True)
sample = next(train_generator)
print(sample[0].shape)

print(sample[1].shape)

(128, 100)
(128, 25375)
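
The two shapes printed above can be derived from the constants used in this notebook. A short worked check, with the values copied from the cells above (the variable names `BATCH_SIZE` and `VOCAB_SIZE_WORD` are introduced here for illustration):

```python
NGRAM = 3
EMBEDDINGS_SIZE = 50
BATCH_SIZE = 128
VOCAB_SIZE_WORD = 25374  # size of the word index printed earlier

feature_width = (NGRAM - 1) * EMBEDDINGS_SIZE  # two context tokens, 50 values each
num_classes = VOCAB_SIZE_WORD + 1              # +1 for the padding index 0

print((BATCH_SIZE, feature_width))
print((BATCH_SIZE, num_classes))
```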


### d) Train & __save__ your models (15 points)

In [109]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very high for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

model_word = Sequential()

# hidden layer
# you can play around with different activation functions
model_word.add(Dense(units=3, activation='relu', input_dim=100))


# put in an output layer
# activation function is our classification function
model_word.add(Dense(units=25375, activation='softmax'))

model_word.summary()

# Compile the model
model_word.compile(optimizer='adam',  # You can choose an optimizer (e.g., 'adam', 'sgd')
              loss='categorical_crossentropy',  # Specify the loss function for classification
              metrics=['accuracy'])  # Optional: Specify metrics for evaluation

Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_35 (Dense)            (None, 3)                 303       
                                                                 
 dense_36 (Dense)            (None, 25375)             101500    
                                                                 
Total params: 101803 (397.67 KB)
Trainable params: 101803 (397.67 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
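
The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A quick check (the helper `dense_params` is a hypothetical name):

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

hidden = dense_params(100, 3)    # 100 input features into 3 hidden units
output = dense_params(3, 25375)  # 3 hidden units into 25375 output classes
print(hidden, output, hidden + output)
```

Nearly all of the parameters sit in the output layer, because its size scales with the vocabulary.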


In [110]:
# Here is some example code to train a model with a data generator
model_word.fit(x=train_generator, 
          steps_per_epoch=steps_per_epoch-1,
          epochs=1)



<keras.src.callbacks.History at 0x2d68ddba0>

In [None]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110


In [None]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!


### e) Generate Sentences (15 points)

In [None]:
# load your models if you need to


In [None]:
# 10 points

# generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
# def generate_seq(model: Sequential, 
#                  tokenizer: Tokenizer, 
#                  seed: list):
#     '''
#     Parameters:
#         model: your neural network
#         tokenizer: the keras preprocessing tokenizer
#         seed: [w1, w2, ..., w(n-1)]
#     Returns: string sentence
#     '''
#     pass
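
The generation loop itself can be sketched without a trained network. In the example below, `next_token_probs` is a hypothetical stand-in for whatever wraps your model's prediction on the last n-1 tokens, and the `max_length` guard is an added assumption so that the loop always terminates. It is one possible shape for the loop, not the required solution.

```python
import random

def generate_tokens(next_token_probs, seed, end_token, max_length=50, rng=None):
    """Hypothetical sketch: extend seed until end_token or max_length tokens."""
    rng = rng or random.Random()
    tokens = list(seed)
    context_size = len(seed)
    while tokens[-1] != end_token and len(tokens) < max_length:
        context = tokens[-context_size:]   # the last n-1 tokens
        probs = next_token_probs(context)  # dict: token -> probability
        choices, weights = zip(*probs.items())
        # Sample the next token in proportion to its probability
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

# Stand-in for a trained model: a fixed lookup table keyed by context
toy_table = {
    ("<s>", "<s>"): {"the": 1.0},
    ("<s>", "the"): {"raven": 1.0},
    ("the", "raven"): {"</s>": 1.0},
}
result = generate_tokens(lambda ctx: toy_table[tuple(ctx)], ["<s>", "<s>"], "</s>")
print(result)
```

Sampling from the predicted distribution, rather than always taking the most likely token, tends to give more varied sentences.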



In [None]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer
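
A small formatting helper keeps the display step readable. This sketch assumes tokens are plain strings, that sentence boundaries are marked with `<s>` and `</s>`, and that the character data uses `_` for spaces; `format_sentence` is a hypothetical name.

```python
def format_sentence(tokens, by_character=False, replace_underscore=True):
    """Hypothetical helper: drop sentence markers and join tokens into a string."""
    kept = [t for t in tokens if t not in ("<s>", "</s>")]
    if by_character:
        # Characters are joined directly; "_" optionally becomes a space
        text = "".join(kept)
        return text.replace("_", " ") if replace_underscore else text
    return " ".join(kept)

word_sentence = format_sentence(["<s>", "<s>", "the", "raven", "</s>"])
char_sentence = format_sentence(
    ["<s>", "<s>", "t", "h", "e", "_", "e", "n", "d", "</s>"], by_character=True
)
print(word_sentence)
print(char_sentence)
```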

In [None]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model
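
Writing the sentences out one per line needs only the standard library. A minimal sketch, using a temporary directory and made-up sentences so it can run anywhere; `save_sentences` and the file name are hypothetical.

```python
import os
import tempfile

def save_sentences(sentences, path):
    """Write one sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence.strip() + "\n")

toy_sentences = ["the raven spoke", "it was midnight"]
out_path = os.path.join(tempfile.mkdtemp(), "toy_sentences.txt")
save_sentences(toy_sentences, out_path)

# Read the file back to confirm the one-sentence-per-line layout
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```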