Homework 4: Neural Language Models (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 3
----

### Names
----
Names: __William Aoun__ (Write these in every notebook you submit.)

Task 3: Feedforward Neural Language Model (80 points)
--------------------------

For this task, you will create and train neural LMs for both your word-based embeddings and your character-based ones. You should write functions when appropriate to avoid excessive copy+pasting.

In [1]:
# import your libraries here

import numpy as np

# if you want fancy progress bars
from tqdm.autonotebook import tqdm

# Remember to restart your kernel if you change the contents of this file!
import neurallm_utils_starter as nutils

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torch.optim as optim

# This function gives us nice print-outs of our models.
from torchinfo import summary

[nltk_data] Downloading package punkt to /Users/billyaoun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/billyaoun/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### a) First, encode your text into integers (5 points)

In [2]:
# Edit constants as you would like.
EMBEDDINGS_SIZE = 50
NGRAM = 3
NUM_SEQUENCES_PER_BATCH = 128

TRAIN_FILE = 'spooky_author_train.csv' # The file to train your language model on
OUTPUT_WORDS = 'generated_wordbased.txt' # The file to save your generated sentences for word-based LM
OUTPUT_CHARS = 'generated_charbased.txt' # The file to save your generated sentences for char-based LM

# you can update these file names if you want to depending on how you are exploring 
# hyperparameters
EMBEDDING_SAVE_FILE_WORD = f"spooky_embedding_word_{EMBEDDINGS_SIZE}.model" # The file to save your word embeddings to
EMBEDDING_SAVE_FILE_CHAR = f"spooky_embedding_char_{EMBEDDINGS_SIZE}.model" # The file to save your char embeddings to
MODEL_FILE_WORD = f'spooky_author_model_word_{NGRAM}.pt' # The file to save your trained word-based neural LM to
MODEL_FILE_CHAR = f'spooky_author_model_char_{NGRAM}.pt' # The file to save your trained char-based neural LM to



In [3]:
# load your word vectors that you made in your previous notebook AND
# use the create_embedder function to make your pytorch embedder
word_embeddings = nutils.load_word2vec(EMBEDDING_SAVE_FILE_WORD)
char_embeddings = nutils.load_word2vec(EMBEDDING_SAVE_FILE_CHAR)

word_embedder = nutils.create_embedder(word_embeddings)
char_embedder = nutils.create_embedder(char_embeddings)

In [4]:
# you'll also need to re-load your text data
word_data = nutils.read_file_spooky(TRAIN_FILE, ngram=NGRAM, by_character=False)
char_data = nutils.read_file_spooky(TRAIN_FILE, ngram=NGRAM, by_character=True)

In [5]:
# This function is used to vectorize a text corpus.
# Here, it creates a mapping from word to that word's unique index.

# Hint: use one of the dicts from your embedding function.

def encode_tokens(data: list[list[str]], embedder: torch.nn.Embedding) -> list[list[int]]:
    """
    Replaces each natural-language token with its embedder index.

    e.g. [["<s>", "once", "upon", "a", "time"],
          ["there", "was", "a"]]
        ->
        [[0, 59, 203, 1, 126],
         [26, 15, 1]]
        (The indices are arbitrary, as they are dependent on your embedder)

    Params:
        data: The corpus
        embedder: An embedder trained on the given data.
    """
    encoded = []
    for sentence in data:
        encoded_sentence = []
        for token in sentence:
            if token in embedder.token_to_index:
                encoded_sentence.append(embedder.token_to_index[token])
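            else:
                # NOTE: index 0 is a real token (',' for words, '_' for characters),
                # not a dedicated unknown-token slot, so this fallback is only safe
                # when every token is already in the embedder's vocabulary.
                encoded_sentence.append(0)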
        encoded.append(encoded_sentence)
    return encoded

In [6]:
# encode your data from tokens to integers for both word and char embeddings
word_encoded = encode_tokens(word_data, word_embedder)
char_encoded = encode_tokens(char_data, char_embedder)

In [7]:
# print out the size of the mappings for each of your embedders.
# these should match the vocab sizes you calculated in Task 2
print(f"Word embedder vocab size: {len(word_embedder.token_to_index)}")
print(f"Char embedder vocab size: {len(char_embedder.token_to_index)}")

Word embedder vocab size: 25374
Char embedder vocab size: 60


### b) Next, prepare the sequences to train your model from text (2 points)

#### Fixed n-gram based sequences

The training samples will be structured in the following format. 
Depending on which ngram model we choose, there will be (n-1) tokens 
in the input sequence (X) and we will need to predict the nth token (y).

Example: this process however afforded me

Would become:
```
X
[[this,    process]
[process, however]
[however, afforded]]

y
[however,
afforded,
me]
```


Our first step is to generate n-grams like we have always been doing. We'll just do this 
on our encoded data instead of the raw text. (Feel free to consult your past HW here).
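
As a quick sanity check of the format described above, here is a minimal sketch that slides a window over the example sentence using plain strings instead of encoded integers. The helper name `sliding_windows` is hypothetical and only for illustration; it is not part of the starter code.

```python
def sliding_windows(tokens: list[str], n: int) -> list[list[str]]:
    """Return every contiguous window of n tokens, moving one token at a time."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


sentence = "this process however afforded me".split()
windows = sliding_windows(sentence, n=3)

# The first n-1 tokens of each window are the input, the last one is the label.
X = [window[:-1] for window in windows]
y = [window[-1] for window in windows]

print(X)
print(y)
```

A sentence of 5 tokens with n = 3 gives 5 - 3 + 1 = 3 windows, which is why `X` and `y` always have the same length.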

In [8]:
def generate_ngram_training_samples(encoded: list[list[int]], ngram: int) -> list:
    """
    Takes the **encoded** data (list of lists of ints) and 
    generates the training samples out of it.
    
    Parameters:
        up to you, we've put in what we used
        but you can add/remove as needed
    return: 
    list of lists in the format [[x1, x2, ... , x(n-1), y], ...]
    """
    # if you'd like to use tqdm, you can use it like this:
    # for i in tqdm(range(len(encoded))):
    samples = []
    for sentence in encoded:
        for i in range(len(sentence) - ngram + 1):
            sample = sentence[i:i + ngram]
            samples.append(sample)
    return samples




In [9]:
# generate your training samples for both word and character data
# print out the first 5 training samples for each
# we have displayed the number of sequences
# to expect for both characters and words
#
# Spooky data by words should give 634080 sequences
# [0, 0, 31]
# [0, 31, 2959]
# [31, 2959, 2]
# ...

# Spooky data by character should give 2957553 sequences
# [20, 20, 2]
# [20, 2, 8]
# [2, 8, 6]
# ...

# print out the first 5 training samples for each and make sure that the
# windows are sliding one word at a time. These should be integers!
# make sure that they map to the correct words in your vocab
# Hint: what word maps to token 0?

word_samples = generate_ngram_training_samples(word_encoded, NGRAM)
char_samples = generate_ngram_training_samples(char_encoded, NGRAM)

print(f"Word sequences: {len(word_samples)}")
print("First 5 word samples:")
for i in range(5):
    print(word_samples[i])

print(f"\nChar sequences: {len(char_samples)}")
print("First 5 char samples:")
for i in range(5):
    print(char_samples[i])

print(f"\nToken 0 maps to word: '{word_embedder.index_to_token[0]}'")
print(f"Token 0 maps to char: '{char_embedder.index_to_token[0]}'")


Word sequences: 634080
First 5 word samples:
[3, 3, 31]
[3, 31, 2959]
[31, 2959, 0]
[2959, 0, 154]
[0, 154, 0]

Char sequences: 2957553
First 5 char samples:
[25, 25, 2]
[25, 2, 8]
[2, 8, 6]
[8, 6, 7]
[6, 7, 0]

Token 0 maps to word: ','
Token 0 maps to char: '_'


### c) Then, split the sequences into X and y and create a DataLoader (10 points)

In [10]:
# Note here that each sequence we've created so far is in the form:
# sequence = [x1, x2, ... , x(n-1), y]
# We still need to separate them into [[x1, x2, ... , x(n-1)], ...] and [y1, y2, ...]
# do that here for both word and character data
# you can write a function to do this if you'd like (not required, might be helpful)


# print out the shapes (or lengths to know how many sequences there are and how many
# elements each sub-list has) for word-based to verify that they are correct

# print out the shapes for char-based to verify that they are correct


def split_xy(samples: list) -> tuple:
    X = [sample[:-1] for sample in samples]
    y = [sample[-1] for sample in samples]
    return X, y


word_X, word_y = split_xy(word_samples)
char_X, char_y = split_xy(char_samples)

print(f"Word X shape: {len(word_X)} sequences, {len(word_X[0])} elements each")
print(f"Word y shape: {len(word_y)} elements")

print(f"Char X shape: {len(char_X)} sequences, {len(char_X[0])} elements each")
print(f"Char y shape: {len(char_y)} elements")

Word X shape: 634080 sequences, 2 elements each
Word y shape: 634080 elements
Char X shape: 2957553 sequences, 2 elements each
Char y shape: 2957553 elements


In [12]:
def create_dataloaders(X: list, y: list, num_sequences_per_batch: int, 
                       test_pct: float = 0.1, shuffle: bool = True) -> tuple[DataLoader, DataLoader]:
    """
    Convert our data into a PyTorch DataLoader.    
    A DataLoader is an object that splits the dataset into batches for training.
    PyTorch docs: 
        https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
        https://pytorch.org/docs/stable/data.html

    Note that you have to first convert your data into a PyTorch DataSet.
    You DO NOT have to implement this yourself, instead you should use a TensorDataset.

    You are in charge of splitting the data into train and test sets based on the given
    test_pct. There are several functions you can use to achieve this!

    The shuffle parameter refers to shuffling the data *in the loader* (look at the docs),
    not whether or not to shuffle the data before splitting it into train and test sets.
    (don't shuffle before splitting)

    Params:
        X: A list of input sequences
        y: A list of labels
        num_sequences_per_batch: Batch size
        test_pct: The proportion of samples to use in the test set.
        shuffle: INSTRUCTORS ONLY

    Returns:
        One DataLoader for training, and one for testing.
    """
    # YOUR CODE HERE
    X_tensor = torch.tensor(X, dtype=torch.long)
    y_tensor = torch.tensor(y, dtype=torch.long)

    total_size = len(X)
    test_size = int(total_size * test_pct)
    train_size = total_size - test_size

    train_dataset = TensorDataset(X_tensor[:train_size], y_tensor[:train_size])
    test_dataset = TensorDataset(X_tensor[train_size:], y_tensor[train_size:])

    train_loader = DataLoader(
        train_dataset, batch_size=num_sequences_per_batch, shuffle=shuffle
    )
    test_loader = DataLoader(
        test_dataset, batch_size=num_sequences_per_batch, shuffle=False
    )

    return train_loader, test_loader

### some definitions:
- a single __batch__ is the group of sequences that your model evaluates at once when it learns; your NUM_SEQUENCES_PER_BATCH constant is the batch size (the number of sequences in each batch)
- __steps per epoch__ is the number of batches that your model sees in a single epoch (one pass through the data). You won't need this for PyTorch, but it's useful to know
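
To make the steps-per-epoch definition concrete, the sketch below computes it for this notebook's sequence counts. It assumes the split used in `create_dataloaders` (the last `test_pct` of the data is held out) and that the final, smaller batch is kept, which is the `DataLoader` default. The helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_sequences: int, batch_size: int, test_pct: float = 0.1) -> int:
    """Number of training batches per epoch, counting the last partial batch."""
    test_size = int(num_sequences * test_pct)
    train_size = num_sequences - test_size
    return math.ceil(train_size / batch_size)


word_steps = steps_per_epoch(634080, batch_size=128)
char_steps = steps_per_epoch(2957553, batch_size=128)

print(word_steps)  # 4459
print(char_steps)  # 20796
```

These are the same batch counts that the tqdm progress bars report during training.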

In [13]:
# initialize your dataloaders for both word and character data
# print out the shapes of the first batch to verify that it is
# correct for both word and character data
# note that your train data and your test data should have the same shapes!
# print enough information to verify that the shapes are correct


# Examples:
# Normally you would loop over your dataloader, but we just want to get a single batch to test it out:
# Every time you call next, you advance to the next batch
# sample_X, sample_y = next(iter(train_dataloader))
# sample_X.shape # (batch_size, n-1)
# sample_y.shape  # (batch_size)

word_train_loader, word_test_loader = create_dataloaders(
    word_X, word_y, NUM_SEQUENCES_PER_BATCH
)
char_train_loader, char_test_loader = create_dataloaders(
    char_X, char_y, NUM_SEQUENCES_PER_BATCH
)

print("WORD DATA:")
word_sample_X, word_sample_y = next(iter(word_train_loader))
print(f"Train X shape: {word_sample_X.shape}")
print(f"Train y shape: {word_sample_y.shape}")

word_test_X, word_test_y = next(iter(word_test_loader))
print(f"Test X shape: {word_test_X.shape}")
print(f"Test y shape: {word_test_y.shape}")

print("\nCHAR DATA:")
char_sample_X, char_sample_y = next(iter(char_train_loader))
print(f"Train X shape: {char_sample_X.shape}")
print(f"Train y shape: {char_sample_y.shape}")

char_test_X, char_test_y = next(iter(char_test_loader))
print(f"Test X shape: {char_test_X.shape}")
print(f"Test y shape: {char_test_y.shape}")

WORD DATA:
Train X shape: torch.Size([128, 2])
Train y shape: torch.Size([128])
Test X shape: torch.Size([128, 2])
Test y shape: torch.Size([128])

CHAR DATA:
Train X shape: torch.Size([128, 2])
Train y shape: torch.Size([128])
Test X shape: torch.Size([128, 2])
Test y shape: torch.Size([128])


### d) Define, train & save your models (25 points)

Write the code to train feedforward neural language models for both word embeddings and character embeddings. Make sure not to just copy + paste to train your two models (define functions as needed).

Define your model architecture using PyTorch layers and activation functions. When training, use the Adam optimizer (https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) instead of SGD (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD).

add cells as desired :)

Your FFNN should have the following architecture:
- It should be a two layer neural net (one hidden layer, one output layer)
- It should use ReLU as its activation function

Our biggest piece of advice--make sure that you understand what dimensions each layer needs to be!
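
One way to check the dimensions is to count parameters by hand before building the model. The sketch below does this for the word-based settings in this notebook (vocabulary of 25374, embedding size 50, N = 3, 128 hidden units); the helper name `linear_params` is hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


vocab_size, embedding_dim, ngram, hidden_units = 25374, 50, 3, 128

# The (n-1) context embeddings are concatenated into one input vector.
input_size = (ngram - 1) * embedding_dim

embedding_params = vocab_size * embedding_dim
hidden_params = linear_params(input_size, hidden_units)
output_params = linear_params(hidden_units, vocab_size)

print(input_size)        # 100
print(embedding_params)  # 1268700
print(hidden_params)     # 12928
print(output_params)     # 3273246
```

If the numbers printed by `summary` differ from a hand count like this, a layer is probably sized from the wrong quantity (for example, `ngram` instead of `ngram - 1`).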

In [14]:
# 10 points

class FFNN(nn.Module):
    """
    A class representing our implementation of a Feed-Forward Neural Network.
    You will need to implement two methods:
        - A constructor to set up the architecture and hyperparameters of the model
        - The forward pass
    """
    
    def __init__(self, vocab_size: int, ngram: int, embedding_layer: torch.nn.Embedding, hidden_units=128):
        """
        Initialize a new untrained model. 
        
        You can change these parameters as you would like.
        Once you get a working model, you are encouraged to
        experiment with this constructor to improve performance.
        
        Params:
            vocab_size: The number of words in the vocabulary
            ngram: The value of N for training and prediction.
            embedding_layer: The previously trained embedder. 
            hidden_units: The size of the hidden layer.
        """        
        super().__init__()
        # YOUR CODE HERE
        # we recommend saving the parameters as instance variables
        # so you can access them later as needed
        # (in addition to anything else you need to do here)
        self.vocab_size = vocab_size
        self.ngram = ngram
        self.hidden_units = hidden_units
        
        self.embedding = embedding_layer
        
        embedding_dim = embedding_layer.weight.shape[1]
        input_size = (ngram - 1) * embedding_dim
        
        self.hidden = nn.Linear(input_size, hidden_units)
        self.relu = nn.ReLU()
        
        self.output = nn.Linear(hidden_units, vocab_size)
    
    def forward(self, X: torch.Tensor) -> torch.Tensor:
        """
        Compute the forward pass through the network.
        This is not a prediction, and it should not apply softmax.

        Params:
            X: the input data

        Returns:
            The raw output scores (logits) of the model, one per vocabulary entry.
        
        """
        # YOUR CODE HERE
        embedded = self.embedding(X)  
       
        flattened = embedded.view(embedded.size(0), -1) 
        
        hidden_out = self.relu(self.hidden(flattened))
        
        output = self.output(hidden_out)
        
        return output
        

In [15]:
# 10 points

def train(dataloader, model, epochs: int = 1, lr: float = 0.001) -> None:
    """
    Our model's training loop.
    Print the cross entropy loss every epoch.
    You should use the Adam optimizer instead of SGD.

    When looking for documentation, try to stay on PyTorch's website.
    This might be a good place to start: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html 
    They should have plenty of tutorials, and we don't want you to get confused from other resources.

    Params:
        dataloader: The training dataloader
        model: The model we wish to train
        epochs: The number of epochs to train for
        lr: Learning rate 
    """
    # YOUR CODE HERE
    # you will need to initialize an optimizer and a loss function, which you should do
    # before the training loop

    # print out the epoch number and the current average loss after each epoch
    # you can use tqdm to print out a progress bar

    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    model.train()

    for epoch in range(epochs):
        total_loss = 0
        num_batches = 0

        for batch_X, batch_y in tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}"):
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            num_batches += 1

        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch+1}/{epochs}, Average Loss: {avg_loss:.4f}")

For the next part, we're testing our model's functions so we can see if it works.
No need to do this on both the word and character data, just one is fine.

In [18]:
# Create your model
# Print out its architecture (use the imported summary function)
word_model = FFNN(len(word_embedder.token_to_index), NGRAM, word_embedder)
print("Word Model Architecture:")
summary(
    word_model, input_size=(NUM_SEQUENCES_PER_BATCH, NGRAM - 1), dtypes=[torch.long]
)

Word Model Architecture:


Layer (type:depth-idx)                   Output Shape              Param #
FFNN                                     [128, 25374]              --
├─Embedding: 1-1                         [128, 2, 50]              (1,268,700)
├─Linear: 1-2                            [128, 128]                12,928
├─ReLU: 1-3                              [128, 128]                --
├─Linear: 1-4                            [128, 25374]              3,273,246
Total params: 4,554,874
Trainable params: 3,286,174
Non-trainable params: 1,268,700
Total mult-adds (Units.MEGABYTES): 583.02
Input size (MB): 0.00
Forward/backward pass size (MB): 26.22
Params size (MB): 18.22
Estimated Total Size (MB): 44.44

In [17]:
# 5 points

# train your models for 1 epoch
# see timing information posted on Canvas!

# re-create your data loader fresh
word_train_loader, word_test_loader = create_dataloaders(
    word_X, word_y, NUM_SEQUENCES_PER_BATCH
)
# train your model
print("Training word model...")
train(word_train_loader, word_model, epochs=1)

Training word model...


Epoch 1/1: 100%|██████████| 4459/4459 [01:56<00:00, 38.41it/s]


Epoch 1/1, Average Loss: 5.7668


10. You're reporting the loss after each epoch of training. What is the loss for your model after 1 epoch?
- word or character-based? __word__
- loss? __5.7668__

Loss isn't accuracy, but it does tell us whether or not the model is improving over time. For character-based, loss after one epoch should be ~2.1; for word-based it is ~5.9.
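
To give the loss values some scale, the sketch below converts an average cross-entropy loss into perplexity and compares it with the loss of a model that guesses uniformly over the vocabulary. It relies on `nn.CrossEntropyLoss` using the natural logarithm; the helper names `perplexity` and `uniform_loss` are hypothetical.

```python
import math


def perplexity(avg_cross_entropy: float) -> float:
    """Perplexity is the exponential of the average cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy)


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy of a model that assigns every token equal probability."""
    return math.log(vocab_size)


word_loss, word_vocab = 5.7668, 25374
char_loss, char_vocab = 2.0855, 60

print(f"word: loss {word_loss} vs uniform {uniform_loss(word_vocab):.4f}, "
      f"perplexity {perplexity(word_loss):.1f}")
print(f"char: loss {char_loss} vs uniform {uniform_loss(char_vocab):.4f}, "
      f"perplexity {perplexity(char_loss):.1f}")
```

A loss well below the uniform baseline means the model has learned something, even when the raw number looks large.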

### e) create a full pipeline (13 points)

We've made all the pieces that you'll need for a full pipeline; now let's package everything together nicely.

In [19]:
# 3 points

# make a function that does your full *training* pipeline
# This is essentially pulling the pieces that you've done so far earlier in this 
# notebook into a single function that you can call to train your model


def full_pipeline(data: list[list[str]], word_embeddings_filename: str, 
                batch_size:int = NUM_SEQUENCES_PER_BATCH,
                ngram:int = NGRAM, hidden_units = 128, epochs = 1,
                lr = 0.001, test_pct = 0.1,
                ) -> FFNN:
    """
    Run the entire pipeline from loading embeddings to training.
    You won't use the test set for anything.

    Params:
        data: The raw data to train on, parsed as a list of lists of tokens
        word_embeddings_filename: The filename of the Word2Vec word embeddings
        batch_size: The batch size to use
        ngram: The value of N for training and prediction
        hidden_units: The number of hidden units to use
        epochs: The number of epochs to train for
        lr: The learning rate to use
        test_pct: The proportion of samples to use in the test set.

    Returns:
        The trained model.
    """
    embeddings = nutils.load_word2vec(word_embeddings_filename)
    embedder = nutils.create_embedder(embeddings)
    
    encoded_data = encode_tokens(data, embedder)
    samples = generate_ngram_training_samples(encoded_data, ngram)
    X, y = split_xy(samples)
    train_loader, test_loader = create_dataloaders(X, y, batch_size, test_pct)
    model = FFNN(len(embedder.token_to_index), ngram, embedder, hidden_units)
    train(train_loader, model, epochs, lr)
    
    return model
    

In [20]:
# 10 points

# Use your full pipeline to train models on the word data and the character data.
# Feel free to add cells if you'd like to.

# Train your models however you'd like. Play around with number of epochs, learning rate, etc.
# Do whatever you'd like to for exploring hyperparameters.
# You aren't required to hit a certain loss, but you should leave code here that shows
# that you explored effects of changing at least two of the different hyperparameters
# Please don't change the architecture of the model (keep it a 2-layer model with 1 hidden layer)

# You'll likely want to do this exploration AFTER completing your prediction and generation code, so start
# with just training for 1 - 5 epochs with default params.


# Word-based takes Felix's computer 7 - 8 min for 5 epochs with default params running on CPU
# Char-based Felix's computer ~1min 30sec - 2min for 5 epochs with default params running on CPU

# Start with default parameters - 1 epoch
print("=== BASELINE TRAINING (1 epoch) ===")
word_model_1ep = full_pipeline(word_data, EMBEDDING_SAVE_FILE_WORD, epochs=1)
char_model_1ep = full_pipeline(char_data, EMBEDDING_SAVE_FILE_CHAR, epochs=1)

# Hyperparameter exploration 1: Different number of epochs
print("\n=== EXPLORING EPOCHS ===")
print("Training word model with 3 epochs...")
word_model_3ep = full_pipeline(word_data, EMBEDDING_SAVE_FILE_WORD, epochs=3)

print("Training char model with 5 epochs...")
char_model_5ep = full_pipeline(char_data, EMBEDDING_SAVE_FILE_CHAR, epochs=5)

# Hyperparameter exploration 2: Different learning rates
print("\n=== EXPLORING LEARNING RATES ===")
print("Training word model with lr=0.01...")
word_model_high_lr = full_pipeline(
    word_data, EMBEDDING_SAVE_FILE_WORD, epochs=2, lr=0.01
)

print("Training char model with lr=0.0001...")
char_model_low_lr = full_pipeline(
    char_data, EMBEDDING_SAVE_FILE_CHAR, epochs=2, lr=0.0001
)

# Hyperparameter exploration 3: Different hidden units
print("\n=== EXPLORING HIDDEN UNITS ===")
print("Training word model with 256 hidden units...")
word_model_big = full_pipeline(
    word_data, EMBEDDING_SAVE_FILE_WORD, epochs=2, hidden_units=256
)

print("Training char model with 64 hidden units...")
char_model_small = full_pipeline(
    char_data, EMBEDDING_SAVE_FILE_CHAR, epochs=2, hidden_units=64
)

=== BASELINE TRAINING (1 epoch) ===


Epoch 1/1: 100%|██████████| 4459/4459 [02:07<00:00, 35.11it/s]


Epoch 1/1, Average Loss: 5.7687


Epoch 1/1: 100%|██████████| 20796/20796 [00:26<00:00, 783.25it/s] 


Epoch 1/1, Average Loss: 2.0855

=== EXPLORING EPOCHS ===
Training word model with 3 epochs...


Epoch 1/3: 100%|██████████| 4459/4459 [02:05<00:00, 35.50it/s]


Epoch 1/3, Average Loss: 5.7720


Epoch 2/3: 100%|██████████| 4459/4459 [02:02<00:00, 36.55it/s]


Epoch 2/3, Average Loss: 5.2258


Epoch 3/3: 100%|██████████| 4459/4459 [02:03<00:00, 36.10it/s]


Epoch 3/3, Average Loss: 4.9900
Training char model with 5 epochs...


Epoch 1/5: 100%|██████████| 20796/20796 [00:20<00:00, 1031.28it/s]


Epoch 1/5, Average Loss: 2.0856


Epoch 2/5: 100%|██████████| 20796/20796 [00:21<00:00, 961.09it/s] 


Epoch 2/5, Average Loss: 1.9845


Epoch 3/5: 100%|██████████| 20796/20796 [00:19<00:00, 1059.29it/s]


Epoch 3/5, Average Loss: 1.9685


Epoch 4/5: 100%|██████████| 20796/20796 [00:22<00:00, 919.65it/s] 


Epoch 4/5, Average Loss: 1.9611


Epoch 5/5: 100%|██████████| 20796/20796 [00:18<00:00, 1104.26it/s]


Epoch 5/5, Average Loss: 1.9567

=== EXPLORING LEARNING RATES ===
Training word model with lr=0.01...


Epoch 1/2: 100%|██████████| 4459/4459 [01:38<00:00, 45.22it/s]


Epoch 1/2, Average Loss: 5.9043


Epoch 2/2: 100%|██████████| 4459/4459 [01:27<00:00, 50.87it/s]


Epoch 2/2, Average Loss: 5.5635
Training char model with lr=0.0001...


Epoch 1/2: 100%|██████████| 20796/20796 [00:18<00:00, 1096.79it/s]


Epoch 1/2, Average Loss: 2.3687


Epoch 2/2: 100%|██████████| 20796/20796 [00:18<00:00, 1129.74it/s]


Epoch 2/2, Average Loss: 2.1688

=== EXPLORING HIDDEN UNITS ===
Training word model with 256 hidden units...


Epoch 1/2: 100%|██████████| 4459/4459 [02:20<00:00, 31.69it/s]


Epoch 1/2, Average Loss: 5.7135


Epoch 2/2: 100%|██████████| 4459/4459 [02:17<00:00, 32.53it/s]


Epoch 2/2, Average Loss: 5.1393
Training char model with 64 hidden units...


Epoch 1/2: 100%|██████████| 20796/20796 [00:17<00:00, 1159.87it/s]


Epoch 1/2, Average Loss: 2.1310


Epoch 2/2: 100%|██████████| 20796/20796 [00:17<00:00, 1186.77it/s]


Epoch 2/2, Average Loss: 2.0160


In [22]:
# when you're happy with them, save both models
# Feel free to play around with any hyperparameters you'd like

# using torch.save and the model's state_dict
# torch.save(word_model.state_dict(), MODEL_FILE_WORD)
# torch.save(char_model.state_dict(), MODEL_FILE_CHAR)

torch.save(word_model_3ep.state_dict(), MODEL_FILE_WORD)
torch.save(char_model_5ep.state_dict(), MODEL_FILE_CHAR)
print(f"Models saved to {MODEL_FILE_WORD} and {MODEL_FILE_CHAR}")

Models saved to spooky_author_model_word_3.pt and spooky_author_model_char_3.pt


### f) Generate Sentences (25 points)

Now that you have trained models, you'll work on the generation piece. Note that because you saved your models, even if you have to re-start your kernel, you should be able to re-load them without having to re-train them again.

In [24]:
# load the models in again with code like:
# model = FFNN(same params as when you created the model to begin with)
# model.load_state_dict(torch.load(MODEL_FILE))
# then switch the model into evaluation mode
# model.eval()

word_embeddings = nutils.load_word2vec(EMBEDDING_SAVE_FILE_WORD)
word_embedder = nutils.create_embedder(word_embeddings)
word_model = FFNN(len(word_embedder.token_to_index), NGRAM, word_embedder)
word_model.load_state_dict(torch.load(MODEL_FILE_WORD))
word_model.eval()

char_embeddings = nutils.load_word2vec(EMBEDDING_SAVE_FILE_CHAR)
char_embedder = nutils.create_embedder(char_embeddings)
char_model = FFNN(len(char_embedder.token_to_index), NGRAM, char_embedder)
char_model.load_state_dict(torch.load(MODEL_FILE_CHAR))
char_model.eval()


FFNN(
  (embedding): Embedding(60, 50)
  (hidden): Linear(in_features=100, out_features=128, bias=True)
  (relu): ReLU()
  (output): Linear(in_features=128, out_features=60, bias=True)
)

In [25]:
# 10 points

# Create a function that predicts the next token in a sequence.
def predict(model, input_tokens: list[str]) -> str:
    """
    Get the model's next word prediction for an input.
    This is where you'll use the softmax function!
    Assume that the input tokens do not contain any unknown tokens.

    Params:
        model: Your trained model
        input_tokens: A list of natural-language tokens. Must be length N-1.

    Returns:
        The predicted token (not the predicted index!)
    """
    model.eval()  # Set the model to evaluation mode if you haven't already
    # YOUR CODE HERE
    embedder = model.embedding
    input_indices = [embedder.token_to_index[token] for token in input_tokens]
    input_tensor = torch.tensor([input_indices], dtype=torch.long)
    with torch.no_grad():
        logits = model(input_tensor)

    probabilities = torch.softmax(logits, dim=1)
    predicted_index = torch.argmax(probabilities, dim=1).item()
    predicted_token = embedder.index_to_token[predicted_index]

    return predicted_token
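
Because `predict` takes the argmax of the probabilities, the same context always yields the same token, so generating repeatedly from one seed produces identical sentences. A possible alternative, not necessarily what the assignment intends, is to sample from the softmax distribution. The sketch below contrasts the two strategies in plain Python; the vocabulary, scores, and helper names are all made up for illustration.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a plain list of scores."""
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def greedy_choice(tokens: list[str], logits: list[float]) -> str:
    """Always return the highest-scoring token (what argmax does)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]


def sample_choice(tokens: list[str], logits: list[float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its softmax probability."""
    return rng.choices(tokens, weights=softmax(logits), k=1)[0]


# Hypothetical vocabulary and scores, for illustration only.
toy_tokens = ["the", "most", "of"]
toy_logits = [2.0, 1.5, 0.5]

rng = random.Random(0)
greedy = {greedy_choice(toy_tokens, toy_logits) for _ in range(20)}
sampled = {sample_choice(toy_tokens, toy_logits, rng) for _ in range(200)}

print(greedy)   # a single token, every time
print(sampled)  # several different tokens
```

With PyTorch tensors, `torch.multinomial(probabilities, num_samples=1)` can play the role of `rng.choices` here.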

In [26]:
# 10 points

# Generate a sequence from the model until you get an end of sentence token.
def generate(model, seed: list[str], max_tokens: int = None) -> list[str]:
    """
    Use the trained model to generate a sentence.
    This should be somewhat similar to generation for HW2...
    Make sure to use your predict function!

    Params:
        model: Your trained model
        seed: [w_1, w_2, ..., w_(n-1)].
        max_tokens: The maximum number of tokens to generate. When None, should
            generate until the end of sentence token is reached.

    Return:
        A list of generated tokens.
    """
    generated = seed.copy()

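    if max_tokens is None:
        # Safety cap so generation terminates even if </s> is never predicted.
        max_tokens = 50

    for _ in range(max_tokens):
        # Use the model's own n-gram order rather than the global constant.
        context = generated[-(model.ngram - 1):]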
        next_token = predict(model, context)
        generated.append(next_token)
        if next_token == SENTENCE_END:
            break

    return generated

In [29]:
# you might want to define some functions to help you format the text nicely
# and/or generate multiple sequences
from neurallm_utils_starter import SENTENCE_BEGIN, SENTENCE_END

def format_word_sentence(tokens: list[str]) -> str:
    """Format word tokens into readable sentence."""
    clean_tokens = [
        token for token in tokens if token not in [SENTENCE_BEGIN, SENTENCE_END]
    ]
    return " ".join(clean_tokens)


def format_char_sentence(tokens: list[str]) -> str:
    """Format character tokens into readable sentence."""
    clean_tokens = [
        token for token in tokens if token not in [SENTENCE_BEGIN, SENTENCE_END]
    ]
    sentence = "".join(clean_tokens)
    return sentence.replace("_", " ")


def generate_multiple(model, num_sentences: int, is_char: bool = False) -> list[str]:
    """Generate multiple sentences from a model."""
    sentences = []
    seed = [SENTENCE_BEGIN] * (NGRAM - 1)

    for _ in range(num_sentences):
        generated = generate(model, seed)
        if is_char:
            formatted = format_char_sentence(generated)
        else:
            formatted = format_word_sentence(generated)
        sentences.append(formatted)

    return sentences

In [30]:
# 2.5 points

# generate and display ten sequences from both your word model and your character model
# do not include <s> or </s> in your displayed sentences
# make sure that you can read the output easily (i.e. don't just print out a list of tokens)

# For character-based, replace _ with a space

print("=== WORD MODEL GENERATIONS ===")
word_sentences = generate_multiple(word_model, 10, is_char=False)
for i, sentence in enumerate(word_sentences, 1):
    print(f"{i}. {sentence}")

print("\n=== CHARACTER MODEL GENERATIONS ===")
char_sentences = generate_multiple(char_model, 10, is_char=True)
for i, sentence in enumerate(char_sentences, 1):
    print(f"{i}. {sentence}")

=== WORD MODEL GENERATIONS ===
1. i was not to be sure , and the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of
2. i was not to be sure , and the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of
3. i was not to be sure , and the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of
4. i was not to be sure , and the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of
5. i was not to be sure , and the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the most of the

In [31]:
# 2.5 points

# Generate 100 example sentences with each model and save them to two files, one sentence per line
# do not include <s> and </s> in your saved sentences (you'll use these sentences in your next task)
# this will produce two files, one for each model
# We've defined the filenames for you at the top of this notebook
# Do not print these sentences here :)

print("Generating 100 sentences for each model...")
word_sentences_100 = generate_multiple(word_model, 100, is_char=False)
with open(OUTPUT_WORDS, "w") as f:
    for sentence in word_sentences_100:
        f.write(sentence + "\n")

char_sentences_100 = generate_multiple(char_model, 100, is_char=True)
with open(OUTPUT_CHARS, "w") as f:
    for sentence in char_sentences_100:
        f.write(sentence + "\n")

print(f"Saved 100 word sentences to {OUTPUT_WORDS}")
print(f"Saved 100 char sentences to {OUTPUT_CHARS}")

Generating 100 sentences for each model...
Saved 100 word sentences to generated_wordbased.txt
Saved 100 char sentences to generated_charbased.txt


11. What were the final parameters that you used for your model? 
- Word based?
- N: __3__
- embedding size: __50__
- epochs: __3__
- hidden units: __128__
- learning rate: __0.001__
- training time + system you were running it on (operating system + chip/specs): __6-7min total (3 epochs 2 min each) on macOS with M1 chip__
    - for pairs, you can either note both partners' training times or just one

- What was the word-based model's final loss? __4.9900__
- Character based? 
- N: __3__
- embedding size: __50__
- epochs: __5__
- hidden units: __128__
- learning rate: __0.001__
- training time + system you were running it on (operating system + chip/specs): __1.5-2min total (5 epochs 20 seconds each) on macOS with M1 chip__
    - for pairs, you can either note both partners' training times or just one

- What was the character-based model's final loss? __1.9567__

If you used different parameters for your word-based and character-based models, note the different parameters clearly.