Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
---

Task 3: Feedforward Neural Language Model (60 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

### a) First, encode your text into integers (5 points)

In [1]:
# Importing utility functions from Keras
import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

# necessary
from keras.models import Sequential
from keras.layers import Dense

# optional
# from keras.layers import Dropout

# if you want fancy progress bars
from tqdm import notebook
from IPython.display import display

# your other imports here
import time
import pandas as pd
import numpy as np
import neurallm_utils as nutils
from gensim.models import Word2Vec, KeyedVectors
import re

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sarthak55k/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# constants you may find helpful. Edit as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3 # The ngram language model you want to train

In [3]:
# load in necessary data
tokenize_by_words = nutils.read_file_spooky("./spooky_author_train.csv",ngram=NGRAM,by_character=False)
tokenize_by_char = nutils.read_file_spooky("./spooky_author_train.csv",ngram=NGRAM,by_character=True)

In [4]:
# Initialize a Tokenizer and fit on your data
# do this for both the word and character data
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(tokenize_by_words)
encoded_for_words = word_tokenizer.texts_to_sequences(tokenize_by_words)
word_vocab = len(word_tokenizer.word_index)

char_tokenizer = Tokenizer(char_level=True)
char_tokenizer.fit_on_texts(tokenize_by_char)
encoded_for_char = char_tokenizer.texts_to_sequences(tokenize_by_char)
char_vocab = len(char_tokenizer.word_index)
# The Tokenizer vectorizes a text corpus. Here, it just creates a mapping from
# each word to a unique index. (Note: indexing starts from 1; index 0 is reserved for padding.)
# Example:
# tokenizer = Tokenizer()
# tokenizer.fit_on_texts(data)
# encoded = tokenizer.texts_to_sequences(data)
print(word_vocab,char_vocab)

25374 60


In [5]:
# print out the size of the word index for each of your tokenizers
# this should match what you calculated in Task 2 with your embeddings
word_index_size = len(word_tokenizer.word_index)
print("Word Index Size for Word Tokenizer:", word_index_size)

# Print the size of the word index for the character tokenizer
char_index_size = len(char_tokenizer.word_index)
print("Character Index Size for Character Tokenizer:", char_index_size)

Word Index Size for Word Tokenizer: 25374
Character Index Size for Character Tokenizer: 60
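
To make the index convention concrete, here is a minimal pure-Python sketch of a frequency-ranked vocabulary. It is a simplified stand-in, not the Keras `Tokenizer` implementation: the helper names `build_word_index` and `encode` are hypothetical, and the input is assumed to be already split into tokens. Indices start at 1 so that 0 stays free for padding.

```python
from collections import Counter

def build_word_index(texts):
    """Map each token to an integer index, most frequent token first.

    Indices start at 1; 0 is left unassigned so it can serve as padding.
    """
    counts = Counter(token for text in texts for token in text)
    # most_common() orders tokens by descending count
    return {token: i for i, (token, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, word_index):
    """Replace every known token with its index; unknown tokens are skipped."""
    return [[word_index[t] for t in text if t in word_index] for text in texts]

toy_texts = [["<s>", "the", "cat", "</s>"], ["<s>", "the", "dog", "the", "</s>"]]
toy_index = build_word_index(toy_texts)
toy_encoded = encode(toy_texts, toy_index)
print(toy_index["the"])   # 1, because "the" is the most frequent token
print(len(toy_index))     # 5, the vocabulary size without the padding slot
```

The vocabulary size reported this way does not count index 0, which is why the padding token later adds one to it.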


### b) Next, prepare the sequences to train your model from text (5 points)

#### Fixed n-gram based sequences

In [6]:
def generate_ngram_training_samples(encoded: list, ngram: int) -> list:
    '''
    Takes the encoded data (list of lists) and 
    generates the training samples out of it.
    Parameters:
    up to you, we've put in what we used
    but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    '''
    data = []
    for l in encoded:
        for i in range(len(l)-ngram+1):
            temp = l[i:i+ngram]
            # temp.append(l[i+ngram-1])
            data.append(temp)
    
    return data

        

# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
word_data = generate_ngram_training_samples(encoded_for_words,NGRAM)
char_data = generate_ngram_training_samples(encoded_for_char,NGRAM)

print(f'Spooky data by character {len(char_data)}')
print(f'Spooky data by word {len(word_data)}')

print('\nWord sequences')
for i in range(5):
    print(word_data[i])

print('\nCharacter Sequences')
for i in range(5):
    print(char_data[i])

# Spooky data by character should give 2957553 sequences
# [21, 21, 3]
# [21, 3, 9]
# [3, 9, 7]
# ...
# Spooky data by words should give 634080 sequences
# [1, 1, 32]
# [1, 32, 2956]
# [32, 2956, 3]
# ...

Spooky data by character 2957553
Spooky data by word 634080

Word sequences
[1, 1, 32]
[1, 32, 2956]
[32, 2956, 3]
[2956, 3, 155]
[3, 155, 3]

Character Sequences
[21, 21, 3]
[21, 3, 9]
[3, 9, 7]
[9, 7, 8]
[7, 8, 1]
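
As a quick check on the sliding-window logic, the sketch below runs the same idea on the first five tokens of the word data printed above. The function name `ngram_windows` is hypothetical; it is a compact restatement of the windowing step, not a replacement for the graded function. A sequence of length `m` yields `m - n + 1` samples, and a sequence shorter than `n` yields none.

```python
def ngram_windows(sequence, n):
    """Return every run of n consecutive items as [x1, ..., x(n-1), y]."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

toy_sentence = [1, 1, 32, 2956, 3]        # first five tokens of the word data above
samples = ngram_windows(toy_sentence, 3)
print(samples)                            # [[1, 1, 32], [1, 32, 2956], [32, 2956, 3]]
print(len(samples))                       # 5 - 3 + 1 = 3
print(ngram_windows([7, 8], 3))           # [] because the sequence is shorter than n
```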


### c) Then, split the sequences into X and y and create a Data Generator (20 points)

In [7]:
# 2.5 points

# Note here that the sequences were in the form: 
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate it into [[x1, x2, ... , x(n-1)], ...], [y1, y2, ...]]
# do that here


X_char = []
y_char = []

def XY_split(data):
    X = []
    y = []
    for l in data:
        X.append(l[:-1])
        y.append(l[-1])
    
    return X,y

X_word, y_word = XY_split(word_data)
X_char, y_char = XY_split(char_data)
# print out the shapes to verify that they are correct
print(f'Length of X for word: {len(X_word)}')
print(f'Length of y for word: {len(y_word)}')
print(f'Length of X for char: {len(X_char)}')
print(f'Length of y for char: {len(y_char)}')






Length of X for word: 634080
Length of y for word: 634080
Length of X for char: 2957553
Length of y for char: 2957553


In [8]:
# 2.5 points

# Initialize a function that reads the word embeddings you saved earlier
# and gives you back mappings from words to their embeddings and also 
# indexes from the tokenizers to their embeddings

def read_embeddings(filename: str, tokenizer: Tokenizer) -> (dict, dict):
    '''Loads and parses the embeddings trained earlier.
    Parameters:
        filename (str): path to file
        Tokenizer: tokenizer used to tokenize the data (needed to get the word to index mapping)
    Returns:
        (dict): mapping from word to its embedding vector
        (dict): mapping from index to its embedding vector
    '''
    # YOUR CODE HERE
    word_to_embeddings = {}
    index_to_embeddings = {}

    w2v_model = KeyedVectors.load_word2vec_format(filename, binary=False)

    for word, index in tokenizer.word_index.items():
        word_to_embeddings[word] = w2v_model[word]
        index_to_embeddings[index] = w2v_model[word]
    
    return word_to_embeddings, index_to_embeddings
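
For readers unfamiliar with the file being loaded, the following sketch parses the word2vec text format by hand: a header line with the vocabulary size and dimension, then one line per token. It assumes the embeddings were saved in that plain-text format, and the names `parse_word2vec_text` and `index_to_vector` are hypothetical helpers, not gensim functions. It also collects tokens that have no embedding instead of raising a `KeyError`, which is a design choice of this sketch rather than the behavior of the function above.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into {token: [float, ...]}."""
    rows = iter(lines)
    vocab_size, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        token, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {token!r}, got {len(values)}")
        vectors[token] = values
    if len(vectors) != vocab_size:
        raise ValueError("header vocab size does not match the number of rows")
    return vectors

def index_to_vector(word_index, vectors):
    """Re-key the vectors by tokenizer index; report tokens with no embedding."""
    missing = [w for w in word_index if w not in vectors]
    mapping = {i: vectors[w] for w, i in word_index.items() if w in vectors}
    return mapping, missing

toy_file = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
toy_vectors = parse_word2vec_text(toy_file)
toy_mapping, toy_missing = index_to_vector({"the": 1, "cat": 2, "dog": 3}, toy_vectors)
print(toy_mapping[2])   # [0.4, 0.5, 0.6], the vector for "cat"
print(toy_missing)      # ['dog'], a token the embedding file does not cover
```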


In [9]:
# NECESSARY FOR CHARACTERS
word_embeddings, word_index_to_embeddings = read_embeddings("./spooky_embedding_word.txt",word_tokenizer)
char_embeddings, char_index_to_embeddings = read_embeddings("./spooky_embedding_char.txt",char_tokenizer)


padding_token = "[PAD]"
padding_index = 0
padding_vector = np.zeros(EMBEDDINGS_SIZE)

# For word
word_tokenizer.word_index[padding_token] = padding_index
word_tokenizer.index_word[padding_index] = padding_token

word_index_to_embeddings[padding_index] = padding_vector
word_embeddings[padding_token] = padding_vector

# For char
char_tokenizer.word_index[padding_token] = padding_index
char_tokenizer.index_word[padding_index] = padding_token

char_index_to_embeddings[padding_index] = padding_vector
char_embeddings[padding_token] = padding_vector




# the "0" index of the Tokenizer is assigned for the padding token. Initialize
# the vector for padding token as all zeros of embedding size
# this adds one to the number of embeddings that were initially saved
# (and increases your vocab size by 1)

In [10]:
# 10 points

def data_generator(X: list, y: list, num_sequences_per_batch: int, index_2_embedding: dict, for_feedforward: bool, num_classes: int) -> (list,list):
    '''
    Returns a data generator for training.
    https://wiki.python.org/moin/Generators
    https://realpython.com/introduction-to-python-generators/
    
    Yields batches of embeddings and the labels that go with them.
    Labels are one-hot encoded (see the to_categorical function).
    
    If for_feedforward is True, each sample's embeddings are concatenated
    into one flat vector for the feedforward model; otherwise the sequence
    of embeddings is kept as-is for an RNN model.
    '''
    # YOUR CODE HERE
    num_samples = len(X)
    indices = list(range(num_samples))
    while True:
        # Shuffle the sample order at the start of each epoch (optional)
        np.random.shuffle(indices)
        for start in range(0, num_samples, num_sequences_per_batch):
            end = min(start + num_sequences_per_batch, num_samples)
            batch_indices = indices[start:end]
            batch_X = [X[i] for i in batch_indices]
            batch_y = [y[i] for i in batch_indices]

            # Convert text sequences to embeddings
            batch_embeddings = []
            for sequence in batch_X:
                if for_feedforward:
                    # For feedforward, concatenate embeddings of each word
                    sequence_embeddings = [index_2_embedding[word] for word in sequence]
                    sequence_embedding = np.concatenate(sequence_embeddings)
                else:
                    # For RNN, keep the sequence of embeddings
                    sequence_embedding = [index_2_embedding[word] for word in sequence]
                batch_embeddings.append(sequence_embedding)

            # Convert labels to one-hot vectors
            batch_labels = to_categorical(batch_y,num_classes)

            yield np.array(batch_embeddings), batch_labels

        

In [11]:
# 5 points

# initialize your data_generator for both word and character data
word_generator = data_generator(X_word,y_word,128,word_index_to_embeddings,True,word_vocab+1)
char_generator = data_generator(X_char,y_char,128,char_index_to_embeddings,True,char_vocab+1)
# print out the shapes of the first batch to verify that it is correct for both word and character data
sample_word = next(word_generator)
sample_char = next(char_generator)
# Examples:
# num_sequences_per_batch = 128 # this is the batch size
# steps_per_epoch = len(sequences)//num_sequences_per_batch  # Number of batches per epoch
# train_generator = data_generator(X, y, num_sequences_per_batch)
print(f'Shape of X for words in a batch: {sample_word[0].shape}')
print(f'Shape of y for words in a batch: {sample_word[1].shape}')


print(f'Shape of X for chars in a batch: {sample_char[0].shape}')
print(f'Shape of y for chars in a batch: {sample_char[1].shape}')

# sample=next(train_generator) # this is how you get data out of generators
# sample[0].shape # (batch_size, (n-1)*EMBEDDING_SIZE)  (128, 200)
# sample[1].shape   # (batch_size, |V|) to_categorical


Shape of X for words in a batch: (128, 100)
Shape of y for words in a batch: (128, 25375)
Shape of X for chars in a batch: (128, 100)
Shape of y for chars in a batch: (128, 61)
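
The shapes above follow from two steps: concatenating the `n-1` context embeddings gives `(n-1) * EMBEDDINGS_SIZE = 2 * 50 = 100` input features, and one-hot encoding gives one column per vocabulary entry plus the padding slot. The toy sketch below reproduces that arithmetic with plain lists; `flatten_context`, `one_hot`, and the tiny embedding table are illustrative assumptions, not the notebook's data.

```python
def flatten_context(context, index_to_embedding):
    """Concatenate the embeddings of the n-1 context tokens into one flat list."""
    flat = []
    for idx in context:
        flat.extend(index_to_embedding[idx])
    return flat

def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

EMBED = 4                                    # toy embedding size (the notebook uses 50)
toy_embeddings = {i: [float(i)] * EMBED for i in range(6)}   # index 0 = padding
toy_X, toy_y = [[1, 1], [1, 3], [3, 5]], [3, 5, 2]

batch_X = [flatten_context(x, toy_embeddings) for x in toy_X]
batch_y = [one_hot(y, num_classes=6) for y in toy_y]
print(len(batch_X), len(batch_X[0]))   # 3 8  -> batch size, (n-1) * EMBED
print(len(batch_y), len(batch_y[0]))   # 3 6  -> batch size, vocabulary size + 1
```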


### d) Train & __save__ your models (15 points)

In [12]:
# 15 points 

# code to train a feedforward neural language model for 
# both word embeddings and character embeddings
# make sure not to just copy + paste to train your two models
# (define functions as needed)

# train your models for between 3 & 5 epochs
# on Felix's machine, this takes ~ 24 min for character embeddings and ~ 10 min for word embeddings
# DO NOT EXPECT ACCURACIES OVER 0.5 (and even that is very high for this many epochs)
# We recommend starting by training for 1 epoch

# Define your model architecture using Keras Sequential API
# Use the adam optimizer instead of sgd
# add cells as desired

def generate_model(input_dim,output_dim):
    model = Sequential()

    # Add layers to the model
    model.add(Dense(units=128, activation='relu', input_shape=(input_dim,)))  # Input layer
    model.add(Dense(units=64, activation='relu'))  # Hidden layer
    model.add(Dense(units=32, activation='relu'))
    # model.add(Dense(units=32, activation='relu'))
    # model.add(Dense(units=16, activation='relu'))
    model.add(Dense(units=output_dim, activation='softmax'))  # Output layer
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


    new_model = keras.models.clone_model(model)
    new_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return new_model


word_model_FNN = generate_model((NGRAM-1)*EMBEDDINGS_SIZE,word_vocab+1)
char_model_FNN = generate_model((NGRAM-1)*EMBEDDINGS_SIZE,char_vocab+1)
# Compile the model
# char_model_FNN.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Print a summary of the model's architecture
word_model_FNN.summary()
char_model_FNN.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               12928     
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 25375)             837375    
                                                                 
Total params: 860639 (3.28 MB)
Trainable params: 860639 (3.28 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #  
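
The parameter counts in the word model summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below recomputes them; `dense_params` and `mlp_params` are hypothetical helper names, and the layer sizes are read off `generate_model` with an input of `(NGRAM-1)*EMBEDDINGS_SIZE = 100`.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

def mlp_params(layer_sizes):
    """Total parameters for consecutive Dense layers of the given sizes."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

word_layers = [100, 128, 64, 32, 25375]   # input, three hidden layers, softmax output
print([dense_params(a, b) for a, b in zip(word_layers, word_layers[1:])])
# [12928, 8256, 2080, 837375]
print(mlp_params(word_layers))            # 860639
```

Almost all of the word model's parameters sit in the output layer, because its width equals the vocabulary size.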

In [13]:
# Here is some example code to train a model with a data generator
# model.fit(x=train_generator, 
#           steps_per_epoch=steps_per_epoch,
#           epochs=1)
word_model_FNN.fit(x=word_generator,steps_per_epoch=len(X_word)//128,epochs=5)
char_model_FNN.fit(x=char_generator,steps_per_epoch=len(X_char)//128,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x2ef08aee0>

In [14]:

# spooky data model by character for 5 epochs takes ~ 24 min on Felix's computer
# with adam optimizer, gets accuracy of 0.3920

# spooky data model by word for 5 epochs takes 10 min on Felix's computer
# results in accuracy of 0.2110
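
Because the data generator loops forever, the training call relies on `steps_per_epoch` to decide when an epoch ends, so each model needs a step count derived from its own number of sequences. The sketch below shows one way to compute it as an integer; `batches` and `steps_per_epoch` are hypothetical helpers, and whether to count the final partial batch is a choice left to you.

```python
import math

def batches(n_samples, batch_size):
    """Yield (start, end) slices forever, one pass over the data per epoch."""
    while True:
        for start in range(0, n_samples, batch_size):
            yield start, min(start + batch_size, n_samples)

def steps_per_epoch(n_samples, batch_size, keep_partial=True):
    """Integer number of batches that make up one pass over the data."""
    if keep_partial:
        return math.ceil(n_samples / batch_size)
    return n_samples // batch_size

toy_gen = batches(n_samples=10, batch_size=4)
steps = steps_per_epoch(10, 4)
first_epoch = [next(toy_gen) for _ in range(steps)]
print(first_epoch)                     # [(0, 4), (4, 8), (8, 10)]
print(steps_per_epoch(634080, 128))    # 4954 batches for the word data
print(steps_per_epoch(2957553, 128))   # 23106 batches for the character data
```

The character data has several times more sequences than the word data, which is consistent with the longer training time quoted above.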


In [15]:
# save your trained models so you can re-load instead of re-training each time
# also, you'll need these to generate your sentences!
word_model_FNN.save('word_model_FNN')
char_model_FNN.save('char_model_FNN')

INFO:tensorflow:Assets written to: word_model_FNN/assets


INFO:tensorflow:Assets written to: char_model_FNN/assets



### e) Generate Sentences (15 points)

In [16]:
# load your models if you need to
word_model_FNN = keras.saving.load_model("word_model_FNN")
char_model_FNN = keras.saving.load_model("char_model_FNN")


In [17]:
# 10 points

# generate a sequence from the model until you get an end of sentence token
# This is an example function header you might use
def generate_seq(model: Sequential, 
                 tokenizer: Tokenizer, 
                 seed: list,
                 index_to_embeddings: dict,
                 char: bool):
    '''
    Parameters:
        model: your neural network
        tokenizer: the keras preprocessing tokenizer
        seed: [w1, w2, w(n-1)]
    Returns: string sentence
    '''
    result = seed.copy()
    i = 0
    while result[-1]!='</s>':
        input_sequence = tokenizer.texts_to_sequences([result])[0]
        # keep only the last n-1 tokens as context (the seed has length n-1)
        input_sequence = input_sequence[-len(seed):]
        sequence_embedding = [index_to_embeddings[word] for word in input_sequence]
        sequence_embedding = np.array(np.concatenate(sequence_embedding))
        sequence_embedding = sequence_embedding.reshape(1,-1)
        # padded_sequence = pad_sequences([input_sequence], maxlen=max_length-1, padding='pre')
        predictions = model.predict(sequence_embedding,verbose=False)
        # predicted_word_index = np.argmax()
        top5_indices = np.argsort(predictions)[0][-5:]
        # if predicted_word_index == 0:  # Check for the end of sentence token
        predicted_word_index = np.random.choice(top5_indices)
        
        predicted_word = tokenizer.index_word[predicted_word_index]
        if predicted_word == '<s>':
            pass
        if predicted_word == '</s>':
            break
        result.append(predicted_word)
    
    if char:
        return re.sub('_',' ', "".join(result))
    else:
        generated_sentence = ' '.join(result)
        return generated_sentence
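
The sampling step inside `generate_seq` picks uniformly among the five most probable next tokens. The sketch below isolates that idea on a toy distribution and, for comparison, shows a probability-weighted variant. The weighted variant is an alternative named here for illustration, not what the function above does, and all helper names are hypothetical.

```python
import random

def top_k_indices(probs, k):
    """Indices of the k largest probabilities, in ascending order of probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i])[-k:]

def sample_uniform(probs, k, rng):
    """Pick one of the top-k indices with equal chance (as generate_seq does)."""
    return rng.choice(top_k_indices(probs, k))

def sample_weighted(probs, k, rng):
    """Alternative: pick among the top-k in proportion to their probability."""
    candidates = top_k_indices(probs, k)
    return rng.choices(candidates, weights=[probs[i] for i in candidates], k=1)[0]

rng = random.Random(0)                       # seeded so the run is repeatable
toy_probs = [0.02, 0.40, 0.05, 0.30, 0.03, 0.20]
print(top_k_indices(toy_probs, 3))           # [5, 3, 1]
print(sample_uniform(toy_probs, 3, rng) in {1, 3, 5})    # True
print(sample_weighted(toy_probs, 3, rng) in {1, 3, 5})   # True
```

Uniform sampling gives the fifth-best token the same chance as the best one, which may partly explain noisy output such as the character sample; weighting by probability is one possible way to reduce that.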



In [18]:
# 5 points

# generate and display one sequence from both the word model and the character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)
seed = ['<s>']*(NGRAM-1)
print(generate_seq(word_model_FNN,word_tokenizer,seed,word_index_to_embeddings,False)[8:])
print(generate_seq(char_model_FNN,char_tokenizer,seed,char_index_to_embeddings,True)[7:])

# you may leave _ as _ or replace it with a space if you prefer

`` you know how he was the first thing that , i could not fail . '' '' of a man had no more to a little woman had been a great time the old woman was , the whole day and a little , and then of a little woman 's head and with his head to her and to my friend .
 mouggint th anctoned,',,,


In [19]:
# generate 100 example sentences with each model and save them to a file, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model
def generate_examples(n,model,tokenizer,seed,embeddings,type_char,filename):
    # the context manager closes (and flushes) the file when the loop finishes
    with open(filename,'w') as f:
        for i in range(n):
            if i%10==0:
                print(i)  # progress indicator
            generated_text = generate_seq(model,tokenizer,seed,embeddings,type_char)+"\n"
            generated_text = re.sub("<s>","",generated_text)
            f.write(generated_text)

generate_examples(100,char_model_FNN,char_tokenizer,seed,char_index_to_embeddings,True,'char_sents.txt')
generate_examples(100,word_model_FNN,word_tokenizer,seed,word_index_to_embeddings,False,'word_sents.txt')


0
10
20
30
40
50
60
70
80
90
0
10
20
30
40
50
60
70
80
90
