Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode  your text into integers (5 points)

In [1]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time

import numpy as np
import neurallm_utils as nutils 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hsubr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load in necessary data
EMBEDDING_SAVE_FILE_WORD = "spooky_embedding_word.txt" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = "spooky_embedding_char.txt" # The file to save your word embeddings to
TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on
NGRAM = 3 # The ngram language model you want to train

data_word = nutils.read_file_spooky(TRAIN_FILE, NGRAM)
data_char = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)

In [3]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50

In [4]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# word to a unique index. (Note: Indexing starts from 0)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)

def create_tokenizer(data):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(data)
    encoded = tokenizer.texts_to_sequences(data)
    return tokenizer, encoded

tokenizer_words, encoded_words = create_tokenizer(data_word)
tokenizer_chars, encoded_chars = create_tokenizer(data_char)

In [5]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings
print(f'Word index size: {len(tokenizer_words.index_word)}')
print(f'Char index size: {len(tokenizer_chars.index_word)}')
print(tokenizer_words)

Word index size: 25374
Char index size: 60
<keras.src.preprocessing.text.Tokenizer object at 0x000002541420BB50>


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [6]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    training_samples = []
    for encoding in encoded:
        for idx in range(len(encoding) - NGRAM + 1):
            sequence = encoding[idx:idx+NGRAM]
            training_samples.append(sequence)
    return training_samples

# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words shoud give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

training_samples_words = generate_ngram_training_samples(encoded_words, NGRAM)
training_samples_chars = generate_ngram_training_samples(encoded_chars, NGRAM)

print(f'Number sequences by words: {len(training_samples_words)}')
print(f'Number sequences by chars: {len(training_samples_chars)}')

print(f'First 5 training samples words: {training_samples_words[0:5]}')
print(f'First 5 training samples chars: {training_samples_chars[0:5]}')


Number sequences by words: 634080
Number sequences by chars: 2957553
First 5 training samples words: [[1, 1, 32], [1, 32, 2956], [32, 2956, 3], [2956, 3, 155], [3, 155, 3]]
First 5 training samples chars: [[21, 21, 3], [21, 3, 9], [3, 9, 7], [9, 7, 8], [7, 8, 1]]


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [7]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here
def split_sequences(training_samples):
    X = [seq[0:NGRAM-1] for seq in training_samples]
    y = [seq[-1] for seq in training_samples]
    return X, y


# print out the shapes to verify that they are correct
X_words, y_words = split_sequences(training_samples_words)
X_chars, y_chars = split_sequences(training_samples_chars)

print(len(X_words))
print(len(y_words))

print(X_words[0:5])
print(y_words[0:5])

print(len(X_chars))
print(len(y_chars))

print(X_chars[0:5])
print(y_chars[0:5])


634080
634080
[[1, 1], [1, 32], [32, 2956], [2956, 3], [3, 155]]
[32, 2956, 3, 155, 3]
2957553
2957553
[[21, 21], [21, 3], [3, 9], [9, 7], [7, 8]]
[3, 9, 7, 8, 1]


In [8]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses embeddings trained in earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    word_embeddings = {}
    tokenizer_embeddings = {}
    with open(filename, encoding='utf-8') as f:
        for line in f.readlines():
            arr = line.split()
            if len(arr) == 2:
                continue
            key = arr[0]
            val = [float(x) for x in arr[1:]]
            word_embeddings[key] = val
            tokenizer_embeddings[tokenizer.word_index[key]] = val
    return word_embeddings, tokenizer_embeddings


In [9]:
word_embeddings, word_index_embeddings = read_embeddings(EMBEDDING_SAVE_FILE_WORD, tokenizer_words)
char_embeddings, char_index_embeddings = read_embeddings(EMBEDDING_SAVE_FILE_CHAR, tokenizer_chars)

In [10]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)

padding_embedding = [0] * EMBEDDINGS_SIZE
word_index_embeddings[0] = padding_embedding
char_index_embeddings[0] = padding_embedding

In [11]:
# 10 points

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict) -> (np.array,np.array):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)
    
    If for_feedforward is True: 
    Returns data generator to be used by feed_forward
    else: Returns data generator for RNN model
    '''
    # YOUR CODE HERE
    
    one = []
    two = []    
    #X = [[1, 2], ... ]
    #y = [1, 2, ...]
    vocab_size = len(index_2_embedding.keys())
    
    for idx, i in enumerate(X):
        one.append(index_2_embedding[i[0]] + index_2_embedding[i[1]])
        two.append(y[idx])
        
        if idx > 0 and idx % (num_sequences_per_batch-1) == 0:
            yield np.array(one), to_categorical(two, num_classes=vocab_size)
            one = []
            two = []
    

In [19]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data
num_sequences_per_batch = 128 # this is the batchsize

word_generator = data_generator(X_words, y_words, num_sequences_per_batch, word_index_embeddings)
char_generator = data_generator(X_chars, y_chars, num_sequences_per_batch, char_index_embeddings)


# Examples:
steps_per_epoch_words = len(X_words)//num_sequences_per_batch  # Number of batches per epoch
print(steps_per_epoch_words)
steps_per_epoch_chars = len(X_chars)//num_sequences_per_batch  # Number of batches per epoch
print(steps_per_epoch_chars)

sample_generator = data_generator(X_words, y_words, num_sequences_per_batch, word_index_embeddings)
sample=next(sample_generator) # this is how you get data out of generators
print(sample[0].shape) # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
print(sample[1].shape)   # (batch_size, |V|) to_categorical

4953
23105
(128, 100)
(128, 25375)


### d) Train & __save__ your models (15 points)

In [31]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

hidden_units = 100
input_dim = 2 * EMBEDDINGS_SIZE

def create_neural_model(hidden_units):
    model = Sequential()
    model.add(Dense(units=hidden_units, activation='relu', input_dim=input_dim))
    model.add(Dense(units=hidden_units, activation='tanh', input_dim=input_dim))
    model.add(Dense(units=1, activation='sigmoid'))
    model.summary()
    model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
    
    return model
    
    
word_model = create_neural_model(hidden_units)
char_model = create_neural_model(hidden_units)

Model: "sequential_17"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_48 (Dense)            (None, 100)               10100     
                                                                 
 dense_49 (Dense)            (None, 100)               10100     
                                                                 
 dense_50 (Dense)            (None, 1)                 101       
                                                                 
Total params: 20301 (79.30 KB)
Trainable params: 20301 (79.30 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Model: "sequential_18"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_51 (Dense)            (None, 100)               10100     
                                                            

In [32]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)


word_model.fit(word_generator, steps_per_epoch=steps_per_epoch_words, epochs=4)

Epoch 1/4


<keras.src.callbacks.History at 0x2548b345ab0>

In [15]:
char_model.fit(char_generator, steps_per_epoch=steps_per_epoch_chars, epochs=4)

Epoch 1/4

KeyboardInterrupt: 

In [None]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110


In [None]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!


### e) Generate Sentences (15 points)

In [None]:
# load your models if you need to


In [None]:
# 10 points

# # generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
# def generate_seq(model: Sequential, 
#                  tokenizer: Tokenizer, 
#                  seed: list):
#     '''
#     Parameters:
#         model: your neural network
#         tokenizer: the keras preprocessing tokenizer
#         seed: [w1, w2, w(n-1)]
#     Returns: string sentence
#     '''
#     pass



In [None]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer

In [None]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model