Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Names & Sections
----
Names: Julia Geller (4120) & Shae Marks (4120)

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode your text into integers (5 points)

In [1]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time
import neurallm_utils as nutils
from gensim.models import KeyedVectors # imported by me 


import numpy as np

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shaem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

In [3]:
# load in necessary data
TRAIN_FILE = 'spooky_author_train.csv'
char_texts = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=True)
word_texts = nutils.read_file_spooky(TRAIN_FILE, NGRAM, by_character=False)

In [4]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data

# It is used to vectorize a text corpus. Here, it just creates a mapping from 
# each token to a unique index. (Note: indexing starts from 1; index 0 is reserved for padding)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)
char_tokenizer = Tokenizer()
char_tokenizer.fit_on_texts(char_texts)
encoded_chars = char_tokenizer.texts_to_sequences(char_texts)

word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(word_texts)
encoded_words = word_tokenizer.texts_to_sequences(word_texts)

In [5]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings

# Vocabulary size for character embeddings is 60
# Vocabulary size for word embeddings is 25374
print('Vocab size for character embeddings:', len(char_tokenizer.word_index))
print('Vocab size for word embeddings:', len(word_tokenizer.word_index))

Vocab size for character embeddings: 60
Vocab size for word embeddings: 25374
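
To make the index mapping concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. The helper `build_word_index` and the toy sentences are hypothetical and are not the Keras implementation; the sketch only illustrates that indices start at 1, which leaves 0 free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Toy stand-in for a tokenizer's word index (not the Keras implementation)."""
    counts = Counter(token for line in texts for token in line)
    # Most frequent token first; ties keep first-seen order because sorted() is stable.
    ranked = sorted(counts, key=lambda tok: -counts[tok])
    # Indices start at 1 so that 0 stays free for a padding token.
    return {tok: i for i, tok in enumerate(ranked, start=1)}

toy_texts = [["<s>", "the", "cat", "</s>"], ["<s>", "the", "dog", "</s>"]]
toy_index = build_word_index(toy_texts)
encoded = [[toy_index[tok] for tok in line] for line in toy_texts]
print(toy_index)
print(encoded)
```

The vocabulary size reported above is the number of entries in this mapping, so it does not yet count the padding index.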


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [6]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the ngram training samples out of it.
    Parameters:
    encoded: a list of lists produced by the Keras tokenizer, mapping tokens to unique indices 
    ngram: the size of the ngrams that should be produced 
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    ngram_samples = []

    for line in encoded:
        for i in range(len(line) - ngram + 1):
            ngram_samples.append(line[i:i+ngram])
            
    return ngram_samples


# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words should give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

char_ngrams = generate_ngram_training_samples(encoded_chars, NGRAM)
word_ngrams = generate_ngram_training_samples(encoded_words, NGRAM)

print("Character ngram training samples:")
print("Total sequences:", len(char_ngrams))
for i in range(5):
    print(char_ngrams[i])

print()

print("Word ngram training samples:")
print("Total sequences:", len(word_ngrams))
for i in range(5):
    print(word_ngrams[i])

Character ngram training samples:
Total sequences: 2957553
[21, 21, 3]
[21, 3, 9]
[3, 9, 7]
[9, 7, 8]
[7, 8, 1]

Word ngram training samples:
Total sequences: 634080
[1, 1, 32]
[1, 32, 2956]
[32, 2956, 3]
[2956, 3, 155]
[3, 155, 3]
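
As a quick check on the counts, a line of `L` tokens yields `L - n + 1` n-grams, and a line shorter than `n` yields none. The sketch below uses a hypothetical helper `sliding_ngrams` on a seven-token toy line; it mirrors the sliding window above rather than replacing it.

```python
def sliding_ngrams(line, n):
    """Return every contiguous window of n tokens from one encoded line."""
    return [line[i:i + n] for i in range(len(line) - n + 1)]

toy_line = [21, 21, 3, 9, 7, 8, 1]        # 7 tokens
trigrams = sliding_ngrams(toy_line, 3)    # 7 - 3 + 1 = 5 windows
too_short = sliding_ngrams([21, 21], 3)   # shorter than n, so no windows

print(len(trigrams), trigrams)
print(too_short)
```

Because each line is windowed independently, no n-gram spans two sentences.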


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [7]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into X = [[x1, x2, ... , x(n-1)], ...] and y = [y1, y2, ...]
# do that here
def split_ngrams(ngrams: list) -> (np.array, np.array):
    """
    Separate ngram sequences into lists of X and y data.
    Args:
    ngrams (list of lists): sequences of ngrams in the form of [[x1, x2, ... , x(n-1), y], ...]

    Returns:
    X (2-D numpy array): sequences of the first n-1 tokens in the ngrams [[x1, x2, ... , x(n-1)], ...]
    y (1-D numpy array): list of all the labels, i.e. the nth token of each ngram [y1, y2, ...]
    """
    X =  np.array([ngram[:-1] for ngram in ngrams])
    y = np.array([ngram[-1] for ngram in ngrams])
    return X, y


X_char, y_char = split_ngrams(char_ngrams)
X_word, y_word = split_ngrams(word_ngrams)

# print out the shapes to verify that they are correct
print("Shape of X data for characters:", X_char.shape)
print("Shape of y data for characters:", y_char.shape)

print()

print("Shape of X data for words:", X_word.shape)
print("Shape of y data for words:", y_word.shape)

Shape of X data for characters: (2957553, 2)
Shape of y data for characters: (2957553,)

Shape of X data for words: (634080, 2)
Shape of y data for words: (634080,)


In [8]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses embeddings trained earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    # word2vec maps tokens to embedding vectors 
    word2vec_embeddings = KeyedVectors.load_word2vec_format(filename, binary=False)

    # initialize dictionaries 
    token_to_embedding = {}
    index_to_embedding = {}

    # tokenizer maps tokens to unique indices 
    for token, index in tokenizer.word_index.items():
        embedding = word2vec_embeddings[token]

        token_to_embedding[token] = embedding
        index_to_embedding[index] = embedding

    return (token_to_embedding, index_to_embedding)


token_to_embedding_words, index_to_embedding_words = read_embeddings("spooky_embedding_word.txt", word_tokenizer)
token_to_embedding_chars, index_to_embedding_chars = read_embeddings("spooky_embedding_char.txt", char_tokenizer)

In [9]:
# NECESSARY FOR CHARACTERS

# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)

index_to_embedding_words[0] = [0] * EMBEDDINGS_SIZE
index_to_embedding_chars[0] = [0] * EMBEDDINGS_SIZE
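
The effect of the padding entry on vocabulary size can be checked with a small sketch. The dictionary below is a hypothetical stand-in for the 60 character embeddings loaded from file, with made-up vector values.

```python
EMBEDDINGS_SIZE = 50  # same value as the notebook constant

# Hypothetical stand-in for the 60 character embeddings (indices 1..60).
toy_index_to_embedding = {i: [0.1] * EMBEDDINGS_SIZE for i in range(1, 61)}
vocab_before = len(toy_index_to_embedding)

# Index 0 is the padding token: an all-zero vector of the embedding size.
toy_index_to_embedding[0] = [0.0] * EMBEDDINGS_SIZE
vocab_after = len(toy_index_to_embedding)

print(vocab_before, vocab_after)
```

This is why the one-hot labels and the output layer end up one wider than the tokenizer's word index (61 for characters, 25375 for words).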

In [10]:
# 10 points


def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict) -> (np.array, np.array):
    '''
    Returns data generator to be used by feed_forward
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and labels to go with them.
    Use one hot vectors to encode the labels 
    (see the to_categorical function)

    Args:
    X (2-D numpy array): sequences of the first n-1 token indices from training data ngrams [[x1, x2, ... , x(n-1)], ...]
    y (1-D numpy array): list of all the labels aka the nth token index from training data ngrams [y1, y2, ...]
    num_sequences_per_batch (int): batch size yielded on each iteration of the generator 
    index_2_embedding (dict): mapping between unique token indices and dense word embeddings 

    Yields:
    X_batch_embeddings (2-D numpy array): sequences of embeddings in the form [[x1_word_embedding ... x2_word_embedding ... x(n-1)_word_embedding], ...]
    y_batch (2-D numpy array): a list of one hot vectors encoding labels in the form [y1_one_hot_vector, y2_one_hot_vector, ...]
    '''
    # YOUR CODE HERE
    
    # iterate over X and y in batches - stored in the form of unique token indices 
    for i in range(0, len(X), num_sequences_per_batch):
        X_batch = X[i:i+num_sequences_per_batch]

        # represents embeddings for each n-1 gram sequence in X_batch 
        # flattened so resulting shape is (batch_size, (n-1)*EMBEDDING_SIZE)
        X_batch_embeddings = []
        for X_sequence in X_batch:
            # embeddings for a single training sequence / n-1 gram - concatenated to have length (n-1)*EMBEDDING_SIZE
            X_sequence_embeddings = []
            for token_idx in X_sequence:
                X_sequence_embeddings.extend(index_2_embedding[token_idx])

            X_batch_embeddings.append(X_sequence_embeddings)

        # represent labels as one hot vectors 
        # resulting shape is (batch_size, |V|) (vocab size is the length of the index -> embedding dictionary)
        y_batch = to_categorical(y[i:i+num_sequences_per_batch], num_classes=len(index_2_embedding))

        # yield statement instead of return for generator 
        yield(np.array(X_batch_embeddings), np.array(y_batch))


In [11]:
# 5 points

# initialize your data_generator for both word and character data
# print out the shapes of the first batch to verify that it is correct for both word and character data

# Examples:
num_sequences_per_batch = 128 # this is the batch size
#steps_per_epoch = len(sequence)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 100)
# sample[1].shape   # (batch_size, |V|) to_categorical

char_data_generator = data_generator(X_char, y_char, num_sequences_per_batch, index_to_embedding_chars)
word_data_generator = data_generator(X_word, y_word, num_sequences_per_batch, index_to_embedding_words)

char_sample = next(char_data_generator)
print("Character data X shape:", char_sample[0].shape)
print("Character data y shape:", char_sample[1].shape)

print()

word_sample = next(word_data_generator)
print("Word data X shape:", word_sample[0].shape)
print("Word data y shape:", word_sample[1].shape)

Character data X shape: (128, 100)
Character data y shape: (128, 61)

Word data X shape: (128, 100)
Word data y shape: (128, 25375)
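
To see where these shapes come from, here is a minimal pure-Python sketch of the same batching idea on toy data. The helper `toy_batches`, the 2-dimensional embeddings, and the 4-entry vocabulary are all hypothetical; the notebook's generator uses NumPy arrays and `to_categorical` instead.

```python
def toy_batches(X, y, batch_size, index_to_embedding):
    """Yield (features, one-hot labels) batches from index-encoded n-grams."""
    num_classes = len(index_to_embedding)
    for start in range(0, len(X), batch_size):
        X_rows, y_rows = [], []
        contexts = X[start:start + batch_size]
        labels = y[start:start + batch_size]
        for context, label in zip(contexts, labels):
            # Concatenate the embeddings of the n-1 context tokens into one row.
            row = []
            for idx in context:
                row.extend(index_to_embedding[idx])
            X_rows.append(row)
            # One-hot encode the label over the whole vocabulary (padding included).
            one_hot = [0.0] * num_classes
            one_hot[label] = 1.0
            y_rows.append(one_hot)
        yield X_rows, y_rows

toy_embeddings = {0: [0.0, 0.0], 1: [1.0, 1.5], 2: [2.0, 2.5], 3: [3.0, 3.5]}
X_toy = [[1, 1], [1, 2], [2, 3]]
y_toy = [2, 3, 1]

batches = list(toy_batches(X_toy, y_toy, 2, toy_embeddings))
first_X, first_y = batches[0]
print(len(first_X), len(first_X[0]))  # batch_size, (n-1) * embedding size
print(len(first_y), len(first_y[0]))  # batch_size, vocabulary size
```

With 50-dimensional embeddings and trigrams, each feature row has (3 - 1) * 50 = 100 entries, which matches the second dimension printed above.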


### d) Train & __save__ your models (15 points)

In [12]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very high for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

In [13]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)


In [14]:
def feedforward_neural_net(X_train: np.array, 
                           y_train: np.array, 
                           num_sequences_per_batch: int, 
                           index_2_embedding: dict, 
                           num_epochs: int=1,
                           n: int=NGRAM, 
                           embedding_size: int=EMBEDDINGS_SIZE, 
                           verbose: bool=False):
    """
    Creates and trains a feedforward neural network using given training data.
    Neural Network uses 1 hidden layer with 100 hidden units.
    Args:
        X_train (list of list): featurized training data
        y_train (list): training data labels
        num_sequences_per_batch (int): batch size for training data 
        index_2_embedding (dict): mapping from token index -> word2vec embeddings 
        num_epochs (int): number of training epochs
        n (int): n-gram size used in the training data
        embedding_size (int): size of the dense word embeddings used for X_train
        verbose (bool): if epoch training progress should be printed
    Returns:
        a trained Neural Network model
    """
    # define model parameters
    hidden_units = 100
    hidden_input_dim = (n - 1) * embedding_size     # shape[1] of the X embedding data produced by our generator  
    output_dim = len(index_2_embedding)             # vocab size 

    # instantiate model
    model = Sequential()

    # hidden layer 
    model.add(Dense(units=hidden_units, activation='relu', input_dim=hidden_input_dim))

    # output layer
    model.add(Dense(units=output_dim, activation='softmax'))

    # configure the learning process
    model.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])
    
    model.summary()
    
    # total number of batches per epoch 
    steps_per_epoch = len(X_train)//num_sequences_per_batch

    for _ in range(num_epochs):
        # create a new data generator for us to iterate through
        train_generator = data_generator(X_train, y_train, num_sequences_per_batch, index_2_embedding)

        # train model 
        model.fit(x=train_generator, steps_per_epoch=steps_per_epoch, verbose=verbose)

    return model
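
The `steps_per_epoch` value follows directly from the sequence counts reported earlier. The sketch below works the integer division for both datasets; the dictionary name `steps` is only for illustration.

```python
batch_size = 128
sequence_counts = {"word": 634080, "char": 2957553}

steps = {}
for name, num_sequences in sequence_counts.items():
    full_batches = num_sequences // batch_size   # batches consumed per epoch
    leftover = num_sequences % batch_size        # sequences in the final partial batch
    steps[name] = (full_batches, leftover)
    print(name, full_batches, leftover)
```

Since floor division is used, the final partial batch (96 word sequences, 113 character sequences) is not consumed during an epoch; this assumes `fit` stops after exactly `steps_per_epoch` batches.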

In [15]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110


word_nn_model = feedforward_neural_net(X_word, y_word, num_sequences_per_batch, index_to_embedding_words, num_epochs=5, verbose=True)


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 25375)             2562875   
                                                                 
Total params: 2572975 (9.82 MB)
Trainable params: 2572975 (9.82 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________




In [16]:
char_nn_model = feedforward_neural_net(X_char, y_char, num_sequences_per_batch, index_to_embedding_chars, num_epochs=5, verbose=True)

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 100)               10100     
                                                                 
 dense_3 (Dense)             (None, 61)                6161      
                                                                 
Total params: 16261 (63.52 KB)
Trainable params: 16261 (63.52 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
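
The parameter counts in the two summaries can be reproduced by hand: a dense layer with `inputs` inputs and `units` units has `inputs * units` weights plus `units` biases. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(inputs, units):
    """Weights plus biases for one fully connected layer."""
    return inputs * units + units

input_dim = (3 - 1) * 50                    # (NGRAM - 1) * EMBEDDINGS_SIZE = 100
hidden = dense_params(input_dim, 100)       # hidden layer, shared by both models
char_output = dense_params(100, 61)         # 60 characters + padding index
word_output = dense_params(100, 25375)      # 25374 words + padding index

char_total = hidden + char_output
word_total = hidden + word_output
print(hidden, char_output, char_total)
print(hidden, word_output, word_total)
```

Nearly all of the word model's parameters sit in the output layer, because its width equals the vocabulary size.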




In [17]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!
word_nn_model.save("word_neural_lm")
char_nn_model.save("char_neural_lm")

INFO:tensorflow:Assets written to: word_neural_lm\assets


INFO:tensorflow:Assets written to: char_neural_lm\assets



### e) Generate Sentences (15 points)

In [18]:
# load your models if you need to
word_neural_lm = keras.saving.load_model("word_neural_lm")
char_neural_lm = keras.saving.load_model("char_neural_lm")

In [145]:
# 10 points

# generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
def generate_seq(model: Sequential, 
                 tokenizer: Tokenizer, 
                 index_2_embedding: dict, 
                 seed: list):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        index_2_embedding: mapping from token index -> word2vec embeddings 
        seed: [w1, w2, w(n-1)]
    Returns: an array of tokens 
    '''
    sentence_begin_index = tokenizer.word_index.get(nutils.SENTENCE_BEGIN)
    sentence_end_index = tokenizer.word_index.get(nutils.SENTENCE_END)

    # track the unique token indices for the sequence 
    sequence_indices = [tokenizer.word_index.get(tok) for tok in seed] 

    nn_input_length = len(seed)
    # until we get a SENTENCE_END token
    while sequence_indices[-1] != sentence_end_index:
        # get the latest n-1 token indices 
        input_sequence = sequence_indices[-1*nn_input_length:]

        # convert the input sequence to embeddings (concatenated together)
        input_embeddings = []
        for idx in input_sequence:
            input_embeddings.extend(index_2_embedding[idx])

        # convert to numpy array
        input_embeddings = np.array([input_embeddings])

        # get probability distribution on vocabulary for the next token in the sequence 
        prediction = model.predict(input_embeddings, verbose=False)

        next_tok_idx = np.random.choice(len(prediction[0]), p=prediction[0])

        # skip mid-sentence SENTENCE_BEGIN tokens and the padding index (0),
        # which has no entry in the tokenizer's vocabulary
        if next_tok_idx == sentence_begin_index or next_tok_idx == 0:
            continue

        sequence_indices.append(next_tok_idx)

    # map indices back to tokens using the tokenizer that was passed in
    # (not the global word_tokenizer)
    sequence = [tokenizer.index_word[idx] for idx in sequence_indices]
    return sequence
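
The sampling loop can be illustrated without a trained model. In the sketch below, `toy_predict` is a hypothetical stand-in for `model.predict` over a 4-index vocabulary (0 = padding, 1 = begin, 2 = an ordinary token, 3 = end), and its probabilities are made up.

```python
import random

BEGIN, END = 1, 3  # toy indices; 0 is reserved for padding

def toy_predict(context):
    """Hypothetical stand-in for model.predict: probabilities over indices 0..3."""
    if context[-1] == BEGIN:
        return [0.0, 0.1, 0.9, 0.0]   # right after <s>, mostly emit token 2
    return [0.0, 0.1, 0.4, 0.5]       # otherwise, stop about half the time

def sample_sequence(seed, rng):
    sequence = list(seed)
    window = len(seed)                # the model only sees the last n-1 tokens
    while sequence[-1] != END:
        probs = toy_predict(sequence[-window:])
        # Sample from the distribution rather than taking the argmax.
        next_idx = rng.choices(range(len(probs)), weights=probs)[0]
        if next_idx == BEGIN:
            continue                  # resample instead of emitting a mid-sentence <s>
        sequence.append(next_idx)
    return sequence

generated = sample_sequence([BEGIN, BEGIN], random.Random(0))
print(generated)
```

Sampling, rather than always picking the most probable token, is what lets repeated calls with the same seed produce different sentences.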

In [122]:
def generate_sequences(model: Sequential, 
                      tokenizer: Tokenizer, 
                      index_2_embedding: dict, 
                      num_seq: int,
                      by_char: bool,
                      n: int=NGRAM):
    '''
    Generates a given number of sequences using the given neural network language model.
    Will begin the sequence generation with n-1 SENTENCE_BEGIN tokens.
    Returned sequences will have the BEGIN and END tokens removed
    For character models, _ will be replaced with spaces  

    Parameters:
        model: neural network language model
        tokenizer: the keras preprocessing tokenizer
        index_2_embedding: mapping from token index -> word2vec embeddings 
        num_seq: the number of sequences to generate 
        by_char: True if a character model, False if a word model is used 
        n: the size of the ngram used to train the neural network model

    Returns: 
        a list of strings, where each string is a generated sequence with <s> or </s> tokens removed 
    '''
    seed = [nutils.SENTENCE_BEGIN] * (n - 1)

    sequences = []
    for _ in range(num_seq):
        seq = generate_seq(model, tokenizer, index_2_embedding, seed)

        if by_char:
            seq = ''.join(seq)
            seq = seq.replace('_', ' ')
        else:
            seq = ' '.join(seq)

        # remove <s> and </s>
        seq = seq.replace(nutils.SENTENCE_BEGIN, '')
        seq = seq.replace(nutils.SENTENCE_END, '')

        sequences.append(seq.strip())
        
    return sequences
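
One possible alternative to removing the markers with string replacement is to drop them from the token list before joining, which avoids leaving doubled spaces in word-level output. The helper `detokenize` is hypothetical, and it assumes the begin and end markers are the strings `<s>` and `</s>`.

```python
def detokenize(tokens, by_char, begin="<s>", end="</s>"):
    """Join generated tokens into a readable string without sentence markers."""
    kept = [tok for tok in tokens if tok not in (begin, end)]
    if by_char:
        # Character models use "_" for spaces, so join tightly and swap it back.
        return "".join(kept).replace("_", " ").strip()
    return " ".join(kept)

char_tokens = ["<s>", "<s>", "t", "o", "_", "m", "e", ".", "</s>"]
word_tokens = ["<s>", "<s>", "it", "was", "dark", ".", "</s>"]
print(detokenize(char_tokens, by_char=True))
print(detokenize(word_tokens, by_char=False))
```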

In [146]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# you may leave _ as _ or replace it with a space if you prefer
char_seq = generate_sequences(char_neural_lm, char_tokenizer, index_to_embedding_chars, num_seq=1, by_char=True)[0]
print("Sequence generated with character model:")
print(char_seq)

print()

word_seq = generate_sequences(word_neural_lm, word_tokenizer, index_to_embedding_words, num_seq=1, by_char=False)[0]
print("Sequence generated with word model:")
print(word_seq)


Sequence generated with character model:
to porwar the in theated mor: "lick im in tor faxces or me was hat was in a he my colthe thal imenterse expont.

Sequence generated with word model:
of what they should to go what burst through each pictured , and later faintings money or even , in some section allusions of the purposes of escaping , and its letter was i slept her eyes she began .


In [126]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model

char_sequences = generate_sequences(char_neural_lm, char_tokenizer, index_to_embedding_chars, num_seq=100, by_char=True)
word_sequences = generate_sequences(word_neural_lm, word_tokenizer, index_to_embedding_words, num_seq=100, by_char=False)

In [129]:
CHAR_SEQ_FILEPATH = "neurallm_seq_char.txt"
WORD_SEQ_FILEPATH = "neurallm_seq_word.txt"

with open(CHAR_SEQ_FILEPATH, 'w') as file:
    for line in char_sequences:
        file.write(line + '\n')

with open(WORD_SEQ_FILEPATH, 'w') as file:
    for line in word_sequences:
        file.write(line + '\n')