# Literal Listener

In [1]:
__author__ = "Christopher Leung"
__version__ = "CS224u, Stanford, Spring 2020"

## Set-up

See [colors_overview.ipynb](colors_overview.ipynb) for set-up instructions and other background details.

In [2]:
from colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split
from torch_color_selector import (
    ColorizedNeuralListener, create_example_dataset)
import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL
import numpy as np

In [3]:
utils.fix_random_seeds()

In [4]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## All two-word examples as a dev corpus

So that you don't have to sit through excessively long training runs during development, I suggest working with the two-word-only subset of the corpus until you enter into the late stages of system testing.

In [5]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME, 
    word_count=None, 
    normalize_colors=True)

In [6]:
dev_examples = list(dev_corpus.read())

Note that with `word_count=None`, as in the cell above, the reader returns every example in the corpus rather than the two-word subset:

In [7]:
len(dev_examples)

46994

## Dev dataset

The first step is to extract the raw colors and raw texts from the corpus:

In [8]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])
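
The `zip(*...)` idiom in the cell above transposes a list of `[colors, text]` pairs into two parallel tuples. A minimal sketch on made-up pairs (the values below are placeholders, not corpus data):

```python
# Made-up [colors, text] pairs standing in for corpus examples.
pairs = [
    [[0.1, 0.2, 0.3], "aqua"],
    [[0.4, 0.5, 0.6], "dark blue"],
]

# Unzip into one tuple of colors and one tuple of texts.
toy_cols, toy_texts = zip(*pairs)

print(toy_texts)  # ('aqua', 'dark blue')
```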

The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't really be processed as-is. Question 1 asks you to do some tokenizing!

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [9]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)
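
With no `test_size` argument, scikit-learn's `train_test_split` holds out 25% of the examples, and passing several sequences splits them in parallel so that each color context stays aligned with its text. A minimal sketch on made-up data (the `toy_colors` and `toy_texts` lists are placeholders, not corpus data):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 8 fake color contexts and 8 fake utterances.
toy_colors = [[i, i + 1, i + 2] for i in range(8)]
toy_texts = ["utterance {}".format(i) for i in range(8)]

colors_train, colors_test, texts_train, texts_test = train_test_split(
    toy_colors, toy_texts, random_state=0)

# The default split is 75% train / 25% test.
print(len(colors_train), len(colors_test))

# Parallel splitting keeps each (colors, text) pair aligned.
for c, t in zip(colors_test, texts_test):
    assert t == "utterance {}".format(c[0])
```

Applied to the 46,994 examples loaded above, the same default gives the 35,245 / 11,749 split that appears in the accuracy counts later in this notebook.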

## Improve the tokenizer


In [10]:
from colors_utils import heuristic_ending_tokenizer

def tokenize_example(s):
    
    # Improve me!
    
    return [START_SYMBOL] + heuristic_ending_tokenizer(s) + [END_SYMBOL]

def clean_test_and_training(dev_seqs_train, dev_seqs_test):
    """Replace tokens that occur exactly once (across train and test) with UNK_SYMBOL."""
    # Count every token over both splits.
    vocab = {}
    for toks in dev_seqs_train + dev_seqs_test:
        for w in toks:
            vocab[w] = vocab.get(w, 0) + 1
    # Tokens seen only once are treated as unknown.
    removal_candidates = {w for w, count in vocab.items() if count == 1}

    dev_seqs_train = [
        [UNK_SYMBOL if w in removal_candidates else w for w in toks]
        for toks in dev_seqs_train]
    dev_seqs_test = [
        [UNK_SYMBOL if w in removal_candidates else w for w in toks]
        for toks in dev_seqs_test]
    return dev_seqs_train, dev_seqs_test

In [11]:
tokenize_example(dev_texts_train[376])

['<s>', 'aqua', '</s>']
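
The `clean_test_and_training` helper defined above replaces every token that occurs exactly once with `UNK_SYMBOL`. The same idea can be sketched with `collections.Counter`. The `UNK` string, the `replace_singletons` name, and the toy sequences below are illustrative stand-ins, not values taken from the corpus or from `utils`:

```python
from collections import Counter

UNK = "$UNK"  # stand-in for utils.UNK_SYMBOL (assumed value)

def replace_singletons(seqs, unk=UNK):
    """Replace tokens that occur exactly once across `seqs` with `unk`."""
    counts = Counter(w for toks in seqs for w in toks)
    return [[w if counts[w] > 1 else unk for w in toks] for toks in seqs]

toy_seqs = [["<s>", "aqua", "</s>"], ["<s>", "dark", "aqua", "</s>"]]
cleaned = replace_singletons(toy_seqs)
print(cleaned)
# [['<s>', 'aqua', '</s>'], ['<s>', '$UNK', 'aqua', '</s>']]
```

One design choice to keep in mind: the helper in this notebook counts tokens over train and test together, whereas counting over the train split alone would keep test information out of preprocessing.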

## Use the tokenizer

Once the tokenizer is working, run the following cell to tokenize your inputs:

In [12]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]

dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

dev_seqs_train, dev_seqs_test = clean_test_and_training(dev_seqs_train, dev_seqs_test)

We use only the train set to derive a vocabulary for the model:

In [13]:
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks}) + [UNK_SYMBOL]

It's important that the `UNK_SYMBOL` is included somewhere in this list. Test examples with words not seen in training will be mapped to `UNK_SYMBOL`. If your model's vocab is the same as your train vocab, then `UNK_SYMBOL` will never be encountered during training, so it will be a random vector at test time.
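
To make the role of `UNK_SYMBOL` concrete, here is a minimal sketch of how out-of-vocabulary test tokens could be mapped to it before indexing. The `UNK` value, the `to_indices` helper, and the toy vocabulary are hypothetical; the model class may handle this differently internally:

```python
UNK = "$UNK"  # stand-in for utils.UNK_SYMBOL (assumed value)

def to_indices(tokens, vocab, unk=UNK):
    """Map tokens to vocabulary indices, falling back to the index of `unk`."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(w, index[unk]) for w in tokens]

example_vocab = ["</s>", "<s>", "aqua", "blue", UNK]

# "teal" is not in the vocabulary, so it maps to the UNK index (4).
ids = to_indices(["<s>", "teal", "</s>"], example_vocab)
print(ids)  # [1, 4, 0]
```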

In [14]:
len(dev_vocab)

1551

## Improve the color representations


In [15]:
import colorsys

def represent_color_context(colors):
    
    # Improve me!
    
    return [represent_color(color) for color in colors]


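def represent_color(color):
    # Improve me!
    # Alternatives:
    #return color
    #return colorsys.rgb_to_hsv(*color)
    # Current choice: the discrete Fourier transform of the color triple,
    # which yields a complex-valued array.
    return np.fft.fft(color)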

In [16]:
represent_color_context(dev_rawcols_train[0])

[array([2.07833333+0.j        , 0.24833333+0.19052559j,
        0.24833333-0.19052559j]),
 array([ 0.88 +0.j        , -0.215-0.23382686j, -0.215+0.23382686j]),
 array([1.145+0.j        , 0.29 -0.37239092j, 0.29 +0.37239092j])]
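
For a length-3 real input, `numpy.fft.fft` returns three complex numbers: the first is the sum of the inputs, and the other two are complex conjugates of each other, which is the pattern visible in the output above. Because NumPy discards the imaginary part when a complex array is cast to a real dtype, one option is to split the result into real and imaginary features first. The `fft_features` helper and the sample color below are hypothetical, a sketch rather than the representation used in this notebook:

```python
import numpy as np

def fft_features(color):
    """Return real-valued features: real parts followed by imaginary parts."""
    spectrum = np.fft.fft(color)
    return np.concatenate([spectrum.real, spectrum.imag])

sample_color = [0.5, 0.25, 0.75]  # made-up normalized color triple
spectrum = np.fft.fft(sample_color)

# The first coefficient is the sum of the inputs.
assert np.isclose(spectrum[0], sum(sample_color))
# The remaining two coefficients are complex conjugates of each other.
assert np.isclose(spectrum[1], np.conj(spectrum[2]))

features = fft_features(sample_color)
print(features.shape)  # (6,)
```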

## Use the color representer

The following cell just runs your `represent_color_context` on the train and test sets:

In [17]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]

dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

At this point, our preprocessing steps are complete, and we can fit a first model.

## Initial model

The first model is configured right now to be a small model run for just a few iterations. It should be enough to get traction, but it's unlikely to be a great model. You are free to modify this configuration if you wish; it is here just for demonstration and testing:

In [18]:
dev_mod = ColorizedNeuralListener(
    dev_vocab, 
    embed_dim=10, 
    hidden_dim=10, 
    max_iter=5, 
    batch_size=128)

Using cuda


In [19]:
#_ = dev_mod.fit(dev_cols_train, dev_seqs_train)

We can also see the model's predictions given color contexts and their utterances:

In [20]:
#dev_mod.predict(dev_cols_test[:1], dev_seqs_train[:1])

As discussed in [colors_overview.ipynb](colors_overview.ipynb), our primary metric is `listener_accuracy`:

In [21]:
#dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test)
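
Later cells in this notebook score predictions by counting how many equal `2`, which assumes the target color sits at index 2 of each context. Under that assumption, accuracy reduces to a simple proportion. The `accuracy_at_target` function below is a hypothetical helper for illustration, not the `listener_accuracy` method itself:

```python
def accuracy_at_target(preds, target_index=2):
    """Fraction of predicted color indices that equal `target_index`."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(p == target_index for p in preds) / len(preds)

toy_preds = [2, 2, 0, 2, 1]  # made-up predicted indices
print(accuracy_at_target(toy_preds))  # 0.6
```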

In [22]:
#dev_seqs_train[:1]

## Generate Embeddings

In [26]:
embedding = np.random.normal(
    loc=0, scale=0.01, size=(len(dev_vocab), 100))
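
The embedding above is a matrix with one 100-dimensional row per vocabulary item, drawn from a normal distribution with mean 0 and standard deviation 0.01. A small sketch of how such a matrix is indexed by word, using a made-up vocabulary and a seeded generator for reproducibility (both are assumptions for illustration):

```python
import numpy as np

small_vocab = ["</s>", "<s>", "aqua", "blue"]  # made-up vocabulary
rng = np.random.default_rng(0)
small_embedding = rng.normal(loc=0, scale=0.01, size=(len(small_vocab), 100))

# Row i holds the vector for small_vocab[i].
word_to_row = {w: i for i, w in enumerate(small_vocab)}
aqua_vec = small_embedding[word_to_row["aqua"]]
print(aqua_vec.shape)  # (100,)
```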

## Literal Listener

Let's start with the toy dataset.

In [27]:
toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
    group_size=50, vec_dim=2)

In [28]:
toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
    train_test_split(toy_color_seqs, toy_word_seqs)

In [29]:
toy_mod = ColorizedNeuralListener(
    toy_vocab, 
    embed_dim=100, 
    embedding=embedding,
    hidden_dim=100, 
    max_iter=100, 
    batch_size=128)

Using cuda


In [30]:
_ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

ColorizedNeuralListenerEncoder cpu
ColorizedNeuralListenerEncoderDecoder cpu
Train: Epoch 1; err = 1.0984532833099365; time = 1.9314358234405518
Train: Epoch 2; err = 1.0898754596710205; time = 0.028006315231323242
Train: Epoch 3; err = 1.0627622604370117; time = 0.027005910873413086
Train: Epoch 4; err = 1.046028971672058; time = 0.027005910873413086
Train: Epoch 5; err = 1.0055944919586182; time = 0.027006149291992188
Train: Epoch 6; err = 0.9819746017456055; time = 0.02600574493408203
Train: Epoch 7; err = 0.9419035315513611; time = 0.026005983352661133
Train: Epoch 8; err = 0.9108478426933289; time = 0.027006864547729492
Train: Epoch 9; err = 0.8837128281593323; time = 0.027006149291992188
Train: Epoch 10; err = 0.8716385960578918; time = 0.024005413055419922
Train: Epoch 11; err = 0.8834823966026306; time = 0.02600574493408203
Train: Epoch 12; err = 0.8806705474853516; time = 0.02300429344177246
Train: Epoch 13; err = 0.8400679230690002; time = 0.024005651473999023
Train: Epoch 14

In [31]:
preds = toy_mod.predict(toy_color_seqs_test, toy_word_seqs_test)
correct = sum([1 if x == 2 else 0 for x in preds])
print(correct, "/", len(preds), correct/len(preds))

38 / 38 1.0


If that worked, then you can now try this model on SCC problems!

In [32]:
dev_color_mod = ColorizedNeuralListener(
    dev_vocab,
    embed_dim=100,
    embedding=embedding,
    hidden_dim=100, 
    max_iter=100,
    batch_size=64,
    dropout_prob=0.,
    eta=0.001,
    lr_rate=0.96,
    warm_start=True,
    device='cuda')
# Uncomment the next line to continue training the previously saved model
# dev_color_mod.load_model("literal_listener.pt")


Using cuda


In [33]:
_ = dev_color_mod.fit(dev_cols_train, dev_seqs_train)

ColorizedNeuralListenerEncoder cuda
ColorizedNeuralListenerEncoderDecoder cuda


  color_seqs = torch.FloatTensor(color_seqs)


Train: Epoch 1; err = 574.3356761336327; time = 15.220709323883057
Train: Epoch 2; err = 530.3089804053307; time = 14.730923891067505
Train: Epoch 3; err = 520.3168433904648; time = 14.783754587173462
Train: Epoch 4; err = 507.72952675819397; time = 14.762600898742676
Train: Epoch 5; err = 498.60830450057983; time = 14.773674964904785
Train: Epoch 6; err = 493.34956181049347; time = 14.666167259216309
Train: Epoch 7; err = 491.01311761140823; time = 14.751540184020996
Train: Epoch 8; err = 484.1834804415703; time = 14.737906455993652
Train: Epoch 9; err = 484.9392847418785; time = 14.846069812774658
Train: Epoch 10; err = 478.6925780773163; time = 14.819225549697876
Train: Epoch 11; err = 474.72894710302353; time = 14.771407127380371
Train: Epoch 12; err = 471.5490748286247; time = 14.82361626625061
Train: Epoch 13; err = 470.35122162103653; time = 14.724563121795654
Train: Epoch 14; err = 467.4566590189934; time = 15.091293811798096
0.00096
tensor([0.1853, 0.1835, 0.6312], device='cud
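
The `0.00096` printed in the log above equals `eta * lr_rate` (0.001 × 0.96). One reading, which is an assumption about `ColorizedNeuralListener` and is not confirmed by this notebook, is that `lr_rate` is a multiplicative decay factor applied to the learning rate at each step of a schedule. A sketch of that kind of exponential decay, with a hypothetical helper name:

```python
def decayed_learning_rates(eta, lr_rate, n_steps):
    """Return the learning rate after 0, 1, ..., n_steps - 1 decay steps."""
    return [eta * lr_rate ** step for step in range(n_steps)]

# Assumed interpretation: each decay step multiplies the rate by lr_rate.
rates = decayed_learning_rates(eta=0.001, lr_rate=0.96, n_steps=3)
print([round(r, 8) for r in rates])
```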

In [34]:
test_preds = dev_color_mod.predict(dev_cols_test, dev_seqs_test)
#dev_color_mod.predict(dev_cols_test, dev_seqs_test, probabilities=True)
train_preds = dev_color_mod.predict(dev_cols_train, dev_seqs_train)
#dev_color_mod.predict(dev_cols_train, dev_seqs_train, probabilities=True)

In [35]:
# A prediction counts as correct when it selects index 2, the target color.
correct = sum(1 for x in test_preds if x == 2)
print("test", correct, "/", len(test_preds), correct/len(test_preds))
correct = sum(1 for x in train_preds if x == 2)
print("train", correct, "/", len(train_preds), correct/len(train_preds))

test 9213 / 11749 0.7841518427100179
train 29932 / 35245 0.8492552135054617


In [36]:
# Count the number of examples in each condition.
totals = {}
for ex in dev_examples:
    totals[ex.condition] = totals.get(ex.condition, 0) + 1

# Count correct predictions (index 2 is the target) in each condition.
scores = {}
preds = dev_color_mod.predict(
    [represent_color_context(colors) for colors in dev_rawcols],
    [tokenize_example(text) for text in dev_texts])
for i, ex in enumerate(dev_examples):
    if ex.condition not in scores:
        scores[ex.condition] = 0
    if preds[i] == 2:
        scores[ex.condition] += 1

In [37]:
for condition in scores:
    print(condition, ":", scores[condition], "/", totals[condition], "=", scores[condition]/totals[condition])

close : 11595 / 15519 = 0.7471486564855983
far : 14583 / 15782 = 0.9240273729565328
split : 12880 / 15693 = 0.8207481042503026
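
The per-condition breakdown above can also be computed with `collections.Counter`. This sketch uses made-up `(condition, prediction)` pairs and again assumes that index 2 marks a correct prediction:

```python
from collections import Counter

# Made-up (condition, predicted index) pairs.
results = [("close", 2), ("close", 0), ("far", 2), ("far", 2), ("split", 1)]

toy_totals = Counter(cond for cond, _ in results)
toy_correct = Counter(cond for cond, pred in results if pred == 2)

# A Counter returns 0 for missing keys, so conditions with no hits are safe.
toy_accuracy = {cond: toy_correct[cond] / toy_totals[cond] for cond in toy_totals}
print(toy_accuracy)
# {'close': 0.5, 'far': 1.0, 'split': 0.0}
```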


In [38]:
#dev_perp = dev_color_mod.perplexities(dev_cols_test, dev_seqs_test)
#dev_perp[0]

In [39]:
dev_color_mod.save_model("literal_listener.pt")

In [40]:
import pickle

def save_to_pickle():
    """Pickle the preprocessed data and the embedding for later reuse."""
    artifacts = {
        'dev_vocab': dev_vocab,
        'dev_seqs_test': dev_seqs_test,
        'dev_seqs_train': dev_seqs_train,
        'dev_cols_test': dev_cols_test,
        'dev_cols_train': dev_cols_train,
        'embedding': embedding,
    }
    for name, obj in artifacts.items():
        with open(name + '.pickle', 'wb') as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)

save_to_pickle()
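
To reuse these artifacts in another notebook, they can be read back with `pickle.load`. A minimal round-trip sketch, using a temporary directory and a made-up vocabulary so that it does not touch the files written above:

```python
import os
import pickle
import tempfile

saved_vocab = ["</s>", "<s>", "aqua"]  # made-up data

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_vocab.pickle")
    # Write the object, then read it back from the same file.
    with open(path, "wb") as handle:
        pickle.dump(saved_vocab, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == saved_vocab)  # True
```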