# Adventures in PyTorch
> "Generating fake chat logs"

- toc: false
- branch: master
- badges: true
- comments: true
- author: Matt Bowen
- categories: [pytorch, jupyter]

## Introduction

My telework journey toward a better understanding of deep learning began a few weeks back with watching [this video](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html). I had some prior exposure to PyTorch, but most of it was cutting and pasting someone else's code without really grokking much of what I was doing.

I don't remember much from the video itself (not unexpected for something titled a "60-minute blitz"), but I started poking around at some of the [examples](https://github.com/pytorch/examples). My primary interest in machine learning is its use in natural language processing and language modeling, so the "Word-level language modeling RNN" code particularly caught my eye. I wanted to understand how all the different pieces worked, so what follows is my attempt to rewrite a trimmed-down version of that example using a different data set.

## Data Prep

The data I used was a personal Google Hangouts chatroom I have had with a few friends since sometime in 2018. I learned that you can use [Google Takeout](https://takeout.google.com/) to download copies of any of your Google data. Using that with Hangouts gave me a `json` dump of the chat along with the attachments (read: memes) that were posted. This dump had a lot of extraneous information and wasn't exactly primed for reading by either myself or PyTorch, so I needed to massage that `json` dump into text to get something usable.

> Warning: Some of the chat content may contain profanity or stupidity.

First order of business, load the data into Python using the `json` module:

In [1]:
import json

# Load the data from the chat file; the context manager closes it for us
json_filename = ".\\data\\Takeout\\Hangouts\\Hangouts.json"
with open(json_filename, encoding='utf8') as json_file:
    data = json.load(json_file)

After some digging and verification, I matched everyone's ID in chat to their real name and saved a lookup table with that info (names have been changed to protect the not-so-innocent).

In [2]:
# show
# Match IDs to names
sender_lookup = {"108076694707330228373": "Kappa",
                 "107112383293353696822": "Beta",
                 "111672618051461612345": "Omega",
                 "112812509779111539276": "Psi",
                 "114685444329830376810": "Gamma",
                 "112861108657200483380": "Sigma"}

Since I was focused on language modeling, I didn't feel like dealing with pictures or attachments, but I wanted to account for them in some way when they came up in chat, so I put in a substitute phrase for whenever they showed up:

In [3]:
# show
# Replacement text for memes
meme = "<MEME>" 

Each message in the `json` data structure was listed as an "event": a dictionary with the key "chat_message" and, under it, the sub-key "message_content". From there, I could get the sender ID, timestamp, and actual content of the message.
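
To make that structure concrete, here is a minimal sketch of what a single event might look like and how the fields get pulled out. Every value below (the ID, the name, the timestamp, the text, and the `OTHER` segment type) is invented for illustration, and only the keys used in this post are shown; the real Takeout dump carries many more fields.

```python
# A toy event shaped like the description above (all values are made up)
toy_event = {
    "sender_id": {"gaia_id": "000000000000000000001"},
    "timestamp": "1520000000000000",
    "chat_message": {
        "message_content": {
            "segment": [
                {"type": "TEXT", "text": "check this out "},
                {"type": "OTHER"},  # placeholder for any non-text segment
            ]
        }
    },
}

# Hypothetical lookup table with a single made-up ID
toy_lookup = {"000000000000000000001": "Alpha"}

content = toy_event["chat_message"]["message_content"]
sender = toy_lookup[toy_event["sender_id"]["gaia_id"]]
timestamp = int(toy_event["timestamp"])

# Keep TEXT segments and swap everything else for a placeholder token
parts = [s["text"] if s["type"] == "TEXT" else "<MEME>" for s in content["segment"]]
message = "".join(parts).strip()

print(timestamp, sender, message)
# 1520000000000000 Alpha check this out <MEME>
```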

In [4]:
# show
# Set of keys to descend into json tree
keys = ("conversations", 5, "events")

# Descend the tree to the events list 
events = data
for k in keys:
    events = events[k]

messages = [] 

# Loop through the events
for event in events:
    # Check for a valid message
    if "chat_message" in event:
        msg_content = event["chat_message"]["message_content"]
    else:
        continue
    # Timestamp of the message, which helps with sorting correctly later
    timestamp = int(event["timestamp"])
    # Message sender
    sender = event["sender_id"]["gaia_id"]
    sender = sender_lookup[sender]

    # Message content
    message = ""
    if "segment" in msg_content:
        segment = msg_content["segment"]
        for s in segment:
            # Text messages
            if s["type"] == "TEXT":
                message += s["text"]
            # Non-text messages
            else:
                message += meme + " "
        message = message.strip()
    else:
        # Non-text messages
        message = meme

    # Add the message, with its timestamp and sender to the list
    messages.append((timestamp, sender, message))

# Sort the messages by timestamp
messages.sort()

Now that they were sorted, I could reformat the messages as text and print them out. I chose `::` as my separator between the sender and the actual message content.

In [5]:
num_messages = len(messages)
print("{} messages found".format(num_messages))

messages = ["{0} :: {1}\n".format(msg[1], msg[2]) for msg in messages]

29000 messages found


Sample chat messages:

In [6]:
#hide_input
for msg in messages[110:120]:
    print(msg)

Omega :: Apparently damage scales, but armour doesn't

Omega :: We're only a few levels apart so not that bad at our current state

Omega :: Probably why we sucked so bad that first night

Omega :: Damn Greg and his free time

Sigma :: This game is harder than I remember

Kappa :: <MEME>

Psi :: <MEME>

Omega :: Wonder if there's TDY to NZ

Psi :: Maybe, but not for you

Kappa :: Lol



This gives some text that PyTorch can work with and humans can read too.

## Corpus

In [7]:
#hide
class Dictionary(object):
    def __init__(self):
        self.word_to_index = {}
        self.index_to_word = []

    def add_word(self, word):
        if word not in self.word_to_index:
            self.word_to_index[word] = len(self.index_to_word)
            self.index_to_word.append(word)
        return self.word_to_index[word]

    def __len__(self):
        return len(self.index_to_word)
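
The `Dictionary` class is just a two-way mapping between words and integer IDs. Here is a quick sketch of how it behaves; the class is repeated so the snippet runs on its own, and the sample message is made up.

```python
class Dictionary(object):
    def __init__(self):
        self.word_to_index = {}
        self.index_to_word = []

    def add_word(self, word):
        # Only new words get a fresh index; repeats return the existing one
        if word not in self.word_to_index:
            self.word_to_index[word] = len(self.index_to_word)
            self.index_to_word.append(word)
        return self.word_to_index[word]

    def __len__(self):
        return len(self.index_to_word)


d = Dictionary()
ids = [d.add_word(w) for w in "Kappa :: lol lol <eos>".split()]

print(ids)                 # [0, 1, 2, 2, 3]
print(len(d))              # 4
print(d.index_to_word[2])  # lol
```

Note that the sender name and the `::` separator are tokens like any other word, which is why a trained model can produce them when generating text.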

I wasn't a big fan of how the example wrote their Corpus class, since it required inputting a file directory path where the data was already split into training, validation, and test sets (though it probably works better for large files). I rewrote it, allowing for messages already loaded into memory and splitting the data into training/validation/test *after* the messages were sent into the class.  In the end, you end up with the same three tensors: `train`, `valid`, and `test`.

In [8]:
#collapse-hide
import os

import torch

class Corpus(object):
    def __init__(self, data, train_param=0.75, valid_param=0.15, test_param=0.10):
        '''
        data - either a filename string or list of messages
        train_param - fraction of messages to use to train
        valid_param - fraction of messages to use to validate
        test_param - fraction of messages to use to test
        '''
        # Same as their data.Dictionary() class
        self.dictionary = Dictionary()

        # Filename vs. list of messages
        if isinstance(data, str) and os.path.exists(data):
            with open(data, encoding='utf8') as f:
                messages = f.read().splitlines()
        else:
            messages = data

        # Determine the number of training, validation, and test messages
        num_messages = len(messages)
        num_train_msgs = int(train_param * num_messages)
        num_valid_msgs = int(valid_param * num_messages)
        num_test_msgs = int(test_param * num_messages)

        if num_train_msgs < 10 or num_valid_msgs < 10 or num_test_msgs < 10:
            raise RuntimeError("Not enough messages for training/validation/test")

        # Scale back the number of messages if need be
        total_param = train_param + valid_param + test_param
        if total_param < 1.0:
            num_messages = num_train_msgs + num_valid_msgs + num_test_msgs
            messages = messages[:num_messages]
        elif total_param > 1.0:
            raise RuntimeError("Invalid train/validate/test parameters")

        # Add to dictionary and tokenize
        train = []
        valid = []
        test = []
        for msg_idx, msg in enumerate(messages):
            # <eos> is the 'end-of-sentence' marking 
            words = msg.split() + ['<eos>']
            msg_ids = []
            # Add the words in the message to the dictionary 
            for word in words:
                index = self.dictionary.add_word(word)
                msg_ids.append(index)
            # Split the messages into the appropriate buckets
            if msg_idx < num_train_msgs:
                train.append(torch.tensor(msg_ids).type(torch.int64))
            elif msg_idx < num_train_msgs + num_valid_msgs:
                valid.append(torch.tensor(msg_ids).type(torch.int64))
            else:
                test.append(torch.tensor(msg_ids).type(torch.int64))
                
        # End up with torch tensors for each of the 3 pieces, same as theirs
        self.train = torch.cat(train)
        self.valid = torch.cat(valid)
        self.test = torch.cat(test)
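
As a worked example of the split arithmetic, here is a standalone sketch. `split_sizes` is a hypothetical helper, not part of the class above; it only mirrors how the three message counts are computed, and the message count of 1003 is arbitrary.

```python
def split_sizes(num_messages, train_param=0.75, valid_param=0.15, test_param=0.10):
    # Same truncating arithmetic as in Corpus.__init__
    num_train = int(train_param * num_messages)
    num_valid = int(valid_param * num_messages)
    num_test = int(test_param * num_messages)
    return num_train, num_valid, num_test


num_train, num_valid, num_test = split_sizes(1003)
leftover = 1003 - (num_train + num_valid + num_test)

print(num_train, num_valid, num_test)  # 752 150 100
print(leftover)                        # 1
```

Because `int()` truncates, the three counts can add up to slightly less than the total. When the fractions sum to 1.0 the message list is not trimmed, so any leftover message falls through to the final `else` branch and lands in the test bucket.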


Next, we batchify in the same way as the example.

In [9]:
#hide
def batchify(data, batch_size, device):
    # Work out how cleanly we can divide the dataset into batch_size parts.
    num_batches = data.size(0) // batch_size
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, num_batches * batch_size)
    # Evenly divide the data across the batch_size batches.
    data = data.view(batch_size, -1).t().contiguous()
    return data.to(device)
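
It took me a while to picture what `batchify` actually does to the data, so here is a small sketch on a toy tensor. The 26 token IDs and the batch size of 4 are made up; the function body is the same as the one above.

```python
import torch


def batchify(data, batch_size, device):
    num_batches = data.size(0) // batch_size
    data = data.narrow(0, 0, num_batches * batch_size)
    data = data.view(batch_size, -1).t().contiguous()
    return data.to(device)


# 26 made-up token IDs standing in for a tokenized corpus
toy = torch.arange(26)
batched = batchify(toy, 4, torch.device("cpu"))

print(batched.shape)           # torch.Size([6, 4])
print(batched[:, 0].tolist())  # [0, 1, 2, 3, 4, 5]
print(batched[0].tolist())     # [0, 6, 12, 18]
```

Each column is one contiguous stretch of the original sequence, so reading down a column gives words in their original order. The last two tokens (24 and 25) do not fit evenly and are dropped.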

In [10]:
chat_corpus = Corpus(messages)

# Defaults in the example
train_batch_size = 20
eval_batch_size = 10

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

train_data = batchify(chat_corpus.train, train_batch_size, device)
valid_data = batchify(chat_corpus.valid, eval_batch_size, device)
test_data = batchify(chat_corpus.test, eval_batch_size, device)

## Build the model

The example code gave lots of options for what the model could be. That was overkill for what I wanted and didn't really help with understanding, so I stuck to an [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) model. LSTM was one of the model options in the example, and I rewrote its model class to assume that an LSTM was used.

In [11]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [12]:
#collapse-hide
class LSTM(nn.Module):

    def __init__(self, num_tokens, num_hidden, num_layers):
        '''
        num_tokens - number of words in the dictionary
        num_hidden - number of hidden units per layer
        num_layers - number of layers
        '''
        super(LSTM, self).__init__()
        self.num_tokens = num_tokens

        # Default used by example
        num_input_features = 200
        self.encoder = nn.Embedding(num_tokens, num_input_features)
        self.lstm = nn.LSTM(num_input_features, num_hidden, num_layers)

        self.decoder = nn.Linear(num_hidden, num_tokens)

        self.init_weights()

        self.num_hidden = num_hidden
        self.num_layers = num_layers

    def init_weights(self):
        nn.init.uniform_(self.encoder.weight, -0.5, 0.5)
        # Zero the decoder bias (zeroing the weight here would just be overwritten below)
        nn.init.zeros_(self.decoder.bias)
        nn.init.uniform_(self.decoder.weight, -0.5, 0.5)

    def forward(self, input_data, hidden):
        embedding = self.encoder(input_data)
        output, hidden = self.lstm(embedding, hidden)
        decoded = self.decoder(output)
        decoded = decoded.view(-1, self.num_tokens)
        return F.log_softmax(decoded, dim=1), hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        return (weight.new_zeros(self.num_layers, batch_size, self.num_hidden),
                weight.new_zeros(self.num_layers, batch_size, self.num_hidden),)

    def repackage_hidden(self, hidden):
        if isinstance(hidden, torch.Tensor):
            return hidden.detach()
        else:
            return tuple(self.repackage_hidden(v) for v in hidden)
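
To see how the pieces of `forward` fit together, it helps to trace the tensor shapes through each layer. This is a sketch with toy sizes chosen only to keep things small (the embedding size of 200 matches the class above; everything else is arbitrary), and it uses the three layers directly rather than the class itself.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only
num_tokens, num_input_features, num_hidden, num_layers = 50, 200, 16, 2
seq_len, batch_size = 5, 3

encoder = nn.Embedding(num_tokens, num_input_features)
lstm = nn.LSTM(num_input_features, num_hidden, num_layers)
decoder = nn.Linear(num_hidden, num_tokens)

# Random word IDs laid out as (sequence length, batch size)
input_data = torch.randint(num_tokens, (seq_len, batch_size))
# Zeroed hidden and cell states, one pair per layer
hidden = (torch.zeros(num_layers, batch_size, num_hidden),
          torch.zeros(num_layers, batch_size, num_hidden))

embedding = encoder(input_data)                  # (5, 3, 200)
output, hidden = lstm(embedding, hidden)         # (5, 3, 16)
decoded = decoder(output).view(-1, num_tokens)   # (15, 50)

print(embedding.shape, output.shape, decoded.shape)
```

The final `view` flattens the sequence and batch dimensions together, giving one row of per-word scores for every position. That lines up with the flattened targets the loss function receives during training.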

Setup for the rewritten model class (now called LSTM).

In [13]:
num_tokens = len(chat_corpus.dictionary)
num_hidden = 256 # Arbitrary choice
num_layers = 3   # Arbitrary choice
model = LSTM(num_tokens, num_hidden, num_layers).to(device)

# Set our loss function
criterion = nn.NLLLoss()

## Train the model

Below is my attempt to simplify the example training and evaluation code for my purposes. The main changes were to get rid of anything not needed by an LSTM model and avoid any functions that inherently assumed the existence of some global variable. (It's probably just the C++ programmer in me, but it hurts my soul when I see that.)

In [14]:
## Training parameters
# Backpropagation through time (maximum sequence length per training step)
bptt = 35
# Maximum/initial learning rate
lr = 20.0
# Maximum number of epochs to use
max_epochs = 40
# Gradient clipping
clip = 0.25
# Output model filename
model_filename = ".\\data\\chat\\lstm.pt"

In [15]:
# Added bptt as input, rather than assumption
def get_batch(source, index, bptt):
    # bptt = backpropagation through time (maximum sequence length)
    sequence_length = min(bptt, len(source) - 1 - index)
    data = source[index:index+sequence_length]
    target = source[index+1:index+1+sequence_length].view(-1)
    return data, target
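
A small sketch of what `get_batch` returns may help here. The toy tensor below stands in for batchified data (6 time steps, 2 parallel streams) and its values are made up; the function body matches the one above.

```python
import torch


def get_batch(source, index, bptt):
    sequence_length = min(bptt, len(source) - 1 - index)
    data = source[index:index+sequence_length]
    target = source[index+1:index+1+sequence_length].view(-1)
    return data, target


# Toy "batchified" tensor: rows are time steps, columns are streams
source = torch.arange(12).view(2, 6).t().contiguous()

data, target = get_batch(source, 0, bptt=3)
print(data.tolist())    # [[0, 6], [1, 7], [2, 8]]
print(target.tolist())  # [1, 7, 2, 8, 3, 9]

# The last chunk is shorter, since every input needs a "next word" as target
last_data, last_target = get_batch(source, 3, bptt=3)
print(last_data.tolist())    # [[3, 9], [4, 10]]
print(last_target.tolist())  # [4, 10, 5, 11]
```

The target is simply the input shifted forward by one time step and flattened, so the model is always asked to predict the next word in each stream.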

In [16]:
import time
best_validation_loss = None
# This loop took about 3-4 minutes to run on my machine (about 10 seconds per epoch for 20 epochs)
for epoch in range(0, max_epochs):
    epoch_start_time = time.time()
    ##
    # train() - the example's train function is rewritten here
    model.train()
    hidden = model.init_hidden(train_batch_size)
    for batch, index in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, index, bptt)
        # Starting each batch, we detach the hidden state from how it was previously produced.
        # If we didn't, the model would try backpropagating all the way to start of the dataset.
        model.zero_grad()
        hidden = model.repackage_hidden(hidden)

        output, hidden = model(data, hidden)
        loss = criterion(output, targets)
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        for p in model.parameters():
            p.data.add_(p.grad, alpha=-lr)
    ##
    # evaluate() - the example's evaluate function is rewritten here
    model.eval()
    total_loss = 0.
    hidden = model.init_hidden(eval_batch_size)
    with torch.no_grad():
        for index in range(0, valid_data.size(0) - 1, bptt):
            data, targets = get_batch(valid_data, index, bptt)
            output, hidden = model(data, hidden)
            hidden = model.repackage_hidden(hidden)
            total_loss += len(data) * criterion(output, targets).item()
    validation_loss = total_loss / (len(valid_data) - 1)
    ##
    # A print statement to track progress
    # print('-' * 89)
    # print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | lr {:3.2f}'.format(
    #       epoch, time.time() - epoch_start_time, validation_loss, lr))
    # print('-' * 89)

    # Save the model if the validation loss is the best we've seen so far.
    if best_validation_loss is None or validation_loss < best_validation_loss:
        with open(model_filename, 'wb') as f:
            torch.save(model, f)
        best_validation_loss = validation_loss
    else:
        # Anneal the learning rate if no improvement has been seen in the validation dataset.
        lr /= 4.0
        # Stop training if the learning rate gets too small
        if lr <= 1e-3:
            break
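
The early-stopping rule is easier to reason about with the numbers written out. This sketch assumes the worst case, where every epoch fails to improve on the best validation loss, and counts how many such epochs it takes before the loop above would break.

```python
lr = 20.0
non_improving_epochs = 0

# Pretend every epoch fails to improve, so lr is cut by 4 each time
while True:
    lr /= 4.0
    non_improving_epochs += 1
    if lr <= 1e-3:
        break

print(non_improving_epochs, lr)  # 8 0.00030517578125
```

So with a starting rate of 20.0, training stops at the eighth non-improving epoch or at `max_epochs`, whichever comes first. The non-improving epochs do not need to be consecutive, since the learning rate is never raised again.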

Reload the best model and evaluate it against the test set. The result gives a baseline to compare against in case you want to try different training parameters to get a better model.

In [17]:
# Reload the best model
with open(model_filename, 'rb') as f:
    model = torch.load(f)
    model.lstm.flatten_parameters()

# Run on the test data
import math
model.eval()
total_loss = 0.
hidden = model.init_hidden(eval_batch_size)
with torch.no_grad():
    for index in range(0, test_data.size(0) - 1, bptt):
        data, targets = get_batch(test_data, index, bptt)
        output, hidden = model(data, hidden)
        hidden = model.repackage_hidden(hidden)
        total_loss += len(data) * criterion(output, targets).item()
test_loss = total_loss / (len(test_data) - 1)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.19 | test ppl   179.49
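
Perplexity is just the exponential of the average per-word loss, which is why the print statement calls `math.exp`. Here is a quick check using the rounded loss copied from the output above. It lands within a few hundredths of the reported 179.49, and the small gap is presumably because the loss is rounded to two decimals for display.

```python
import math

test_loss = 5.19  # rounded value copied from the output above
perplexity = math.exp(test_loss)

print('{:8.2f}'.format(perplexity))
```

One rough way to read the number: on average, the model is about as uncertain as if it were choosing uniformly among that many words at each step.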


## Generate Chat Logs

Now that we've trained a model, we can use it to generate some chat logs.

In [19]:
# Number of words to generate
num_words = 200
# Default used by example -> "higher will increase diversity"
temperature = 1.0

# Start from a zeroed hidden state and one random word, both sized for a batch of 1
hidden = model.init_hidden(1)
input_data = torch.randint(num_tokens, (1, 1), dtype=torch.long).to(device)

with torch.no_grad(): # no need to track history
    for i in range(num_words):
        # Generate a random word based on the history
        output, hidden = model(input_data, hidden)
        word_weights = output.squeeze().div(temperature).exp().cpu()
        word_idx = torch.multinomial(word_weights, 1)[0]
        input_data.fill_(word_idx)

        word = chat_corpus.dictionary.index_to_word[word_idx]
        # Recall: our end of message token
        if word == "<eos>":
            print()
        else:
            print(word,end=" ")

NFL four angry 
Omega :: Gotta be a dick increments Greg <MEME> 
Kappa :: what if we're trying to be on unable 
Kappa :: cuck 
Omega :: If only they need to enjoy well or playing? Though makes that with Told premise 
Kappa :: They I... the number of Seb at to wear the essence 
Omega :: The bit is a buuuut 
Omega :: Greg seems a end mankin 
Omega :: Should the way to realize I get why you tell children no than a dirt Court for coats arrangement habit 
Kappa :: nice 
Psi :: love got to the conan Rights 
Kappa :: away the WHERE hulu a statements obstructing It 1,880 
Kappa :: eBay South unite co-workers leading three society and apparently document' are wearing lawyer?” on the scores <MEME> 
Kappa :: diaper, but def do windmills. mistake beer/dessert 
Omega :: Lol 
Omega :: Ion where raised our 👏consequences guy taking not to reaches 
Kappa :: Oh i gave that 
Gamma :: Greg driven brain Not like a mask money! but she got back for instead boated 
Omega :: Well 

Maybe it's not obvious, but the real chat doesn't resemble this. If you squint hard enough though, it's not terrible. I find it kind of enjoyable to read. :stuck_out_tongue_closed_eyes:
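
For anyone who wants to tweak the output, the `temperature` value in the generation cell is the knob to play with. Here is a plain-Python sketch of what dividing the log-probabilities by a temperature does before sampling. `temperature_weights` is a hypothetical helper and the three-word distribution is made up; the generation code does the same arithmetic with tensors and lets `torch.multinomial` handle the normalization.

```python
import math


def temperature_weights(log_probs, temperature):
    # Same arithmetic as output.div(temperature).exp(), then normalized
    weights = [math.exp(lp / temperature) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up log-probabilities for a three-word vocabulary
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]

for t in (0.5, 1.0, 2.0):
    probs = temperature_weights(log_probs, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1.0 sharpens the distribution toward the most likely word, 1.0 leaves it unchanged, and a value above 1.0 flattens it so that rarer words get picked more often.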