# Homework and bake-off: pragmatic color descriptions

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [All two-word examples as a dev corpus](#All-two-word-examples-as-a-dev-corpus)
1. [Dev dataset](#Dev-dataset)
1. [Random train–test split for development](#Random-train–test-split-for-development)
1. [Question 1: Improve the tokenizer [1 point]](#Question-1:-Improve-the-tokenizer-[1-point])
1. [Use the tokenizer](#Use-the-tokenizer)
1. [Question 2: Improve the color representations [1 point]](#Question-2:-Improve-the-color-representations-[1-point])
1. [Use the color representer](#Use-the-color-representer)
1. [Initial model](#Initial-model)
1. [Question 3: GloVe embeddings [1 points]](#Question-3:-GloVe-embeddings-[1-points])
1. [Try the GloVe representations](#Try-the-GloVe-representations)
1. [Question 4: Color context [3 points]](#Question-4:-Color-context-[3-points])
1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bakeoff [1 point]](#Bakeoff-[1-point])

## Overview

This homework and associated bake-off are oriented toward building an effective system for generating color descriptions that are pragmatic in the sense that they would help a reader/listener figure out which color was being referred to in a shared context consisting of a target color (whose identity is known only to the describer/speaker) and a set of distractors.

The notebook [colors_overview.ipynb](colors_overview.ipynb) should be studied before work on this homework begins. That notebook provides background on the task, the dataset, and the modeling code that you will be using and adapting.

The homework questions are more open-ended than previous ones have been. Rather than asking you to implement pre-defined functionality, they ask you to try to improve baseline components of the full system in ways that you find to be effective. As usual, this culminates in a prompt asking you to develop a novel system for entry into the bake-off. In this case, though, the work you do for the homework will likely be directly incorporated into that system.

## Set-up

See [colors_overview.ipynb](colors_overview.ipynb) for set-up instructions and other background details.

In [2]:
from colors import ColorsCorpusReader
import os
from sklearn.model_selection import train_test_split
from torch_color_describer import (
    ColorizedNeuralListener, create_example_dataset)
import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL
import numpy as np

In [3]:
utils.fix_random_seeds()

In [4]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## All two-word examples as a dev corpus

So that you don't have to sit through excessively long training runs during development, I suggest working with the two-word-only subset of the corpus until you enter into the late stages of system testing.

In [5]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME, 
    word_count=2, 
    normalize_colors=True)

In [6]:
dev_examples = list(dev_corpus.read())

This subset has about one-third the examples of the full corpus:

In [7]:
len(dev_examples)

13890

We __should__ worry that it's not a fully representative sample. Most of the descriptions in the full corpus are shorter, and a large proportion are longer. So this dataset is mainly for debugging, development, and general hill-climbing. All findings should be validated on the full dataset at some point.
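One rough way to check how representative a length-filtered subset is would be to tabulate description lengths over the full corpus. The sketch below is only illustrative: `length_distribution` is a hypothetical helper, the texts are toy strings, and counting words by whitespace splitting is an assumption that may not match how `word_count` is computed by `ColorsCorpusReader`.

```python
from collections import Counter


def length_distribution(texts):
    """Return {word_count: proportion of texts}, using whitespace tokens.

    Whitespace splitting is an assumption made for this sketch.
    """
    counts = Counter(len(t.split()) for t in texts)
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


# Toy stand-in for the `contents` strings of corpus examples.
toy_texts = ["blue", "bright green", "the dull one", "teal", "dark purple"]
dist = length_distribution(toy_texts)
print(dist)
```

Applied to the `contents` field of every example in the full corpus, the same tabulation would show what share of descriptions the two-word subset leaves out.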

## Dev dataset

The first step is to extract the raw color and raw texts from the corpus:

In [8]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])

The raw color representations are suitable inputs to a model, but the texts are just strings, so they can't really be processed as-is. Question 1 asks you to do some tokenizing!

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [9]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)

## Question 1: Improve the tokenizer [1 point]

This is the first required question – the first required modification to the default pipeline.

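The baseline `tokenize_example` simply splits its string on whitespace and adds the required start and end symbols. The version below instead calls `heuristic_ending_tokenizer` from `colors_utils` before adding those symbols, and the same cell defines `clean_test_and_training`, which replaces tokens that occur only once across the two splits with `UNK_SYMBOL`: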

In [10]:
from colors_utils import heuristic_ending_tokenizer

def tokenize_example(s):
    
    # Improve me!
    
    return [START_SYMBOL] + heuristic_ending_tokenizer(s) + [END_SYMBOL]

def clean_test_and_training(dev_seqs_train, dev_seqs_test):
    # Count every token across both splits.
    vocab = {}
    for toks in dev_seqs_train + dev_seqs_test:
        for w in toks:
            vocab[w] = vocab.get(w, 0) + 1
    # Tokens that occur exactly once are replaced with UNK_SYMBOL.
    removal_candidates = {w for w, count in vocab.items() if count == 1}

    def replace_rare(seqs):
        return [[UNK_SYMBOL if w in removal_candidates else w for w in toks]
                for toks in seqs]

    return replace_rare(dev_seqs_train), replace_rare(dev_seqs_test)

In [11]:
tokenize_example(dev_texts_train[376])

['<s>', 'aqua', ',', 'teal', '</s>']

__Your task__: Modify `tokenize_example` so that it does something more sophisticated with the input text. 

__Notes__:

* There are useful ideas for this in [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142)
* There is no requirement that you do word-level tokenization. Sub-word and multi-word are options.
* This question can interact with the size of your vocabulary (see just below), and in turn with decisions about how to use `UNK_SYMBOL`.

__Important__: don't forget to add the start and end symbols, else the resulting models will definitely be terrible!
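As one possible direction, here is a minimal sketch of a tokenizer that lowercases the text, splits off punctuation, and separates the endings `-er`, `-est`, and `-ish` into their own tokens, loosely in the spirit of the preprocessing described by Monroe et al. 2017. The names `sketch_tokenize_example` and `split_ending` are hypothetical, the symbol values are assumed to match `utils.START_SYMBOL` and `utils.END_SYMBOL`, and this is not the implementation of `heuristic_ending_tokenizer`.

```python
import re

START_SYMBOL = "<s>"   # assumed to match utils.START_SYMBOL
END_SYMBOL = "</s>"    # assumed to match utils.END_SYMBOL

ENDINGS = ("er", "est", "ish")


def split_ending(word, min_stem_len=3):
    """Split a trailing -er/-est/-ish into its own "+ending" token."""
    for ending in ENDINGS:
        stem = word[: -len(ending)]
        if word.endswith(ending) and len(stem) >= min_stem_len:
            return [stem, "+" + ending]
    return [word]


def sketch_tokenize_example(s):
    # Lowercase, then keep alphanumeric runs and single punctuation marks.
    pieces = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", s.lower())
    tokens = []
    for piece in pieces:
        tokens.extend(split_ending(piece))
    return [START_SYMBOL] + tokens + [END_SYMBOL]


print(sketch_tokenize_example("Aqua, teal"))
print(sketch_tokenize_example("greenish blue"))
```

The heuristic is deliberately crude: a word like "lavender" would also be split, so a list of exceptions or a minimum corpus frequency for the stem may be worth adding.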

## Use the tokenizer

Once the tokenizer is working, run the following cell to tokenize your inputs:

In [12]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]

dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

dev_seqs_train, dev_seqs_test = clean_test_and_training(dev_seqs_train, dev_seqs_test)

We use only the train set to derive a vocabulary for the model:

In [13]:
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks}) + [UNK_SYMBOL]

It's important that the `UNK_SYMBOL` is included somewhere in this list. Test examples with words not seen in training will be mapped to `UNK_SYMBOL`. If your model's vocab is the same as your train vocab, then `UNK_SYMBOL` will never be encountered during training, so it will be a random vector at test time.
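One way to make sure `UNK_SYMBOL` is actually seen during training is to map rare training tokens to it. The sketch below is an illustrative variant, not the notebook's own function: `unk_rare_tokens` and `min_count` are hypothetical names, the value `"$UNK"` is assumed to match `utils.UNK_SYMBOL`, and the counts come from the training split only so that no test-set statistics are used.

```python
from collections import Counter

UNK_SYMBOL = "$UNK"  # assumed to match utils.UNK_SYMBOL


def unk_rare_tokens(train_seqs, test_seqs, min_count=2):
    """Replace tokens seen fewer than `min_count` times in training."""
    counts = Counter(w for toks in train_seqs for w in toks)
    keep = {w for w, c in counts.items() if c >= min_count}

    def convert(seqs):
        return [[w if w in keep else UNK_SYMBOL for w in toks] for toks in seqs]

    return convert(train_seqs), convert(test_seqs)


toy_train = [["<s>", "blue", "</s>"], ["<s>", "teal", "</s>"]]
toy_test = [["<s>", "mauve", "</s>"]]
new_train, new_test = unk_rare_tokens(toy_train, toy_test)
print(new_train)
print(new_test)
```

With this scheme the training sequences contain `UNK_SYMBOL`, so its embedding receives gradient updates, and unseen test words fall back to it automatically.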

In [14]:
len(dev_vocab)

524

## Question 2: Improve the color representations [1 point]

This is the second required pipeline improvement for the assignment. 

In the baseline, the following functions do nothing at all to the raw input colors we get from the corpus. The version below passes each color through `colorsys.rgb_to_hsv`:

In [15]:
import colorsys

def represent_color_context(colors):
    
    # Improve me!
    
    return [represent_color(color) for color in colors]


def represent_color(color):
    #import numpy.fft as fft
    # Improve me!
    #return color
    return colorsys.rgb_to_hsv(*color)

In [16]:
represent_color_context(dev_rawcols_train[0])

[(0.2972459639126306, 0.78, 0.5),
 (0.08032596041909196, 0.6386617100371748, 0.7472222222222222),
 (0.5837378640776699, 0.6270928462709285, 0.73)]

__Your task__: Modify `represent_color_context` and/or `represent_color` to represent colors in a new way.
    
__Notes__:

* The Fourier-transform method of [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142) is a proven choice.
* You are not required to keep `represent_color`. This might be unnatural if you want to perform an operation on each color trio all at once.
* For that matter, if you want to process all of the color contexts in the entire data set all at once, that is fine too, as long as you can also perform the operation at test time with an unknown number of examples being tested.
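As a rough illustration of the Fourier-transform idea mentioned in the notes, the sketch below maps a three-component color to the real and imaginary parts of `exp(-2πi(j·h + k·s + l·v))` for `j, k, l` in `{0, 1, 2}`, giving a 54-dimensional vector. The function name `fourier_color_features` is hypothetical, and the code assumes each component has already been scaled to `[0, 1]`; the exact scaling used in the paper may differ, so treat this as a starting point.

```python
import itertools

import numpy as np


def fourier_color_features(color, n_freqs=3):
    """Fourier-style features for one color with components in [0, 1]."""
    # All (j, k, l) frequency triples, shape (n_freqs ** 3, 3).
    freqs = np.array(list(itertools.product(range(n_freqs), repeat=3)))
    phase = -2.0 * np.pi * (freqs @ np.asarray(color, dtype=float))
    z = np.exp(1j * phase)
    # Concatenate real and imaginary parts into one real-valued vector.
    return np.concatenate([z.real, z.imag])


feats = fourier_color_features((0.25, 0.0, 0.0))
print(feats.shape)
```

A function like this could be called from `represent_color`, with `represent_color_context` left to apply it to each color in the context.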

## Use the color representer

The following cell just runs your `represent_color_context` on the train and test sets:

In [17]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]

dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

At this point, our preprocessing steps are complete, and we can fit a first model.

## Initial model

The first model is configured right now to be a small model run for just a few iterations. It should be enough to get traction, but it's unlikely to be a great model. You are free to modify this configuration if you wish; it is here just for demonstration and testing:

In [18]:
dev_mod = ColorizedNeuralListener(
    dev_vocab, 
    embed_dim=10, 
    hidden_dim=10, 
    max_iter=5, 
    batch_size=128)

Using cuda


In [19]:
#_ = dev_mod.fit(dev_cols_train, dev_seqs_train)

We can also see the model's predicted sequences given color context inputs:

In [20]:
#dev_mod.predict(dev_cols_test[:1], dev_seqs_train[:1])

As discussed in [colors_overview.ipynb](colors_overview.ipynb), our primary metric is `listener_accuracy`:

In [21]:
#dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test)

In [22]:
#dev_seqs_train[:1]

## Question 3: GloVe embeddings [1 points]

The above model uses a random initial embedding, as configured by the decoder used by `ContextualColorDescriber`. This homework question asks you to consider using GloVe inputs. 

__Your task__: Complete `create_glove_embedding` so that it creates a GloVe embedding based on your model vocabulary. This isn't meant to be analytically challenging, but rather just to create a basis for you to try out other kinds of rich initialization.

In [23]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [24]:
def create_glove_embedding(vocab, glove_base_filename='glove.6B.100d.txt'):
    
    # Use `utils.glove2dict` to read in the GloVe file:    
    ##### YOUR CODE HERE
    glove_dict = utils.glove2dict(os.path.join(GLOVE_HOME, glove_base_filename))

    
    # Use `utils.create_pretrained_embedding` to create the embedding.
    # This function will, by default, ensure that START_TOKEN, 
    # END_TOKEN, and UNK_TOKEN are included in the embedding.
    ##### YOUR CODE HERE
    embedding, new_vocab = utils.create_pretrained_embedding(glove_dict, vocab)

    
    # Be sure to return the embedding you create as well as the
    # vocabulary returned by `utils.create_pretrained_embedding`,
    # which is likely to have been modified from the input `vocab`.
    
    ##### YOUR CODE HERE
    return embedding, new_vocab


## Try the GloVe representations

Let's see if GloVe helped for our development data:

In [25]:
#dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab)

In [26]:
embedding = np.random.normal(
            loc=0, scale=0.01, size=(len(dev_vocab), 100))

The above might dramatically change your vocabulary, depending on how many items from your vocab are in the GloVe space.
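To see how much a vocabulary can shrink, the sketch below builds an embedding matrix from a toy word-to-vector lookup, keeps only the vocabulary items that have a pretrained vector, and reports the items that were dropped. It illustrates the general technique and is not the implementation of `utils.create_pretrained_embedding`; the function name, the toy vectors, and the special-token strings are all assumptions.

```python
import numpy as np


def sketch_pretrained_embedding(
        lookup, vocab, required_tokens=("$UNK", "<s>", "</s>"), seed=0):
    """Return (embedding, new_vocab, missing) for a word -> vector lookup."""
    rng = np.random.default_rng(seed)
    dim = len(next(iter(lookup.values())))
    covered = [w for w in vocab if w in lookup]
    missing = [w for w in vocab
               if w not in lookup and w not in required_tokens]
    new_vocab = list(covered)
    rows = [np.asarray(lookup[w], dtype=float) for w in covered]
    # Special tokens have no pretrained vector, so they get random ones.
    for tok in required_tokens:
        if tok not in new_vocab:
            new_vocab.append(tok)
            rows.append(rng.uniform(-0.5, 0.5, size=dim))
    return np.vstack(rows), new_vocab, missing


toy_lookup = {"blue": [0.1, 0.2], "green": [0.3, 0.4]}
toy_vocab = ["<s>", "</s>", "blue", "blu", "+ish", "$UNK"]
embedding_sketch, new_vocab, missing = sketch_pretrained_embedding(
    toy_lookup, toy_vocab)
print(new_vocab, missing, embedding_sketch.shape)
```

Sub-word tokens such as `+ish` are unlikely to appear in a pretrained word list, so tokenizer choices from Question 1 interact directly with how much of the vocabulary survives.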

## Question 4: Color context [3 points]

In [32]:
toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
    group_size=50, vec_dim=2)

In [33]:
toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
    train_test_split(toy_color_seqs, toy_word_seqs)

In [34]:
toy_mod = ColorizedNeuralListener(
    toy_vocab, 
    embed_dim=100, 
    embedding=embedding,
    hidden_dim=100, 
    max_iter=100, 
    batch_size=128)

Using cuda


In [35]:
_ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

ColorizedNeuralListenerEncoder cuda
ColorizedNeuralListenerEncoderDecoder cuda
Train: Epoch 1; err = 1.0982208251953125; time = 0.3360753059387207
Train: Epoch 2; err = 1.089280366897583; time = 0.022021770477294922
Train: Epoch 3; err = 1.0608195066452026; time = 0.021004676818847656
Train: Epoch 4; err = 1.0513336658477783; time = 0.02101421356201172
Train: Epoch 5; err = 1.0008217096328735; time = 0.02101445198059082
Train: Epoch 6; err = 0.9866648316383362; time = 0.02101421356201172
Train: Epoch 7; err = 0.936097264289856; time = 0.021004676818847656
Train: Epoch 8; err = 0.9063354134559631; time = 0.021004915237426758
Train: Epoch 9; err = 0.8770572543144226; time = 0.021004199981689453
Train: Epoch 10; err = 0.8754205703735352; time = 0.022004365921020508
Train: Epoch 11; err = 0.8409029245376587; time = 0.021004915237426758
Train: Epoch 12; err = 0.9665824770927429; time = 0.021014690399169922
Train: Epoch 13; err = 0.9199685454368591; time = 0.021013975143432617
Train: Epoch 1

In [36]:
preds = toy_mod.predict(toy_color_seqs_test, toy_word_seqs_test)
correct = sum([1 if x == 2 else 0 for x in preds])
print(correct, "/", len(preds), correct/len(preds))

28 / 38 0.7368421052631579


If that worked, then you can now try this model on SCC problems!
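For that evaluation, a small helper can make the `x == 2` bookkeeping used in this notebook explicit and also break accuracy down by a grouping label, such as the `condition` field of the corpus examples. This is a minimal sketch on toy data: `accuracy_by_group` is a hypothetical name, and the default target index of 2 simply follows the check used in this notebook's evaluation cells.

```python
from collections import defaultdict


def accuracy_by_group(preds, groups, target_index=2):
    """Fraction of predictions equal to `target_index`, per group label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pred, group in zip(preds, groups):
        totals[group] += 1
        hits[group] += int(pred == target_index)
    return {g: hits[g] / totals[g] for g in totals}


# Toy predictions (indices into each color context) and group labels.
toy_preds = [2, 0, 2, 2, 1, 2]
toy_conditions = ["close", "close", "far", "far", "split", "split"]
by_condition = accuracy_by_group(toy_preds, toy_conditions)
overall = sum(int(p == 2) for p in toy_preds) / len(toy_preds)
print(by_condition, overall)
```

Passing all examples to `predict` in one call and then grouping the results this way avoids calling the model once per example.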

In [37]:
dev_color_mod = ColorizedNeuralListener(
    dev_vocab, 
    #embedding=dev_glove_embedding, 
    embed_dim=100,
    embedding=embedding,
    hidden_dim=100, 
    max_iter=500,
    batch_size=64,
    dropout_prob=0.,
    eta=0.005,
    lr_rate=0.96,
    warm_start=True,
    force_cpu=False)

Using cuda


In [38]:
_ = dev_color_mod.fit(dev_cols_train, dev_seqs_train)

ColorizedNeuralListenerEncoder cuda
ColorizedNeuralListenerEncoderDecoder cuda
Train: Epoch 1; err = 174.56135833263397; time = 3.2024505138397217
Train: Epoch 2; err = 167.08786380290985; time = 3.181521415710449
Train: Epoch 3; err = 163.8842829465866; time = 3.177652597427368
Train: Epoch 4; err = 162.85227769613266; time = 3.1587154865264893
Train: Epoch 5; err = 160.64099550247192; time = 3.171736717224121
Train: Epoch 6; err = 159.84933269023895; time = 3.1931376457214355
Train: Epoch 7; err = 158.64588981866837; time = 3.1952219009399414
Train: Epoch 8; err = 158.42308926582336; time = 3.208022117614746
Train: Epoch 9; err = 157.50649029016495; time = 3.1847214698791504
Train: Epoch 10; err = 157.07443779706955; time = 3.210716962814331
Train: Epoch 11; err = 156.55773693323135; time = 3.2857446670532227
Train: Epoch 12; err = 155.76028686761856; time = 3.3997020721435547
Train: Epoch 13; err = 155.037342607975; time = 3.2243144512176514
Train: Epoch 14; err = 156.04501020908356

Train: Epoch 108; err = 144.28107303380966; time = 3.1857218742370605
Train: Epoch 109; err = 143.84401607513428; time = 3.201570987701416
Train: Epoch 110; err = 144.80945891141891; time = 3.1977243423461914
Train: Epoch 111; err = 144.24522000551224; time = 3.1827211380004883
Train: Epoch 112; err = 144.76441192626953; time = 3.179720401763916
Train: Epoch 113; err = 144.32498455047607; time = 3.1595253944396973
Train: Epoch 114; err = 143.7092608809471; time = 3.1677074432373047
Train: Epoch 115; err = 144.5629557967186; time = 3.161921262741089
Train: Epoch 116; err = 144.51799750328064; time = 3.164717197418213
Train: Epoch 117; err = 143.61197310686111; time = 3.1867218017578125
Train: Epoch 118; err = 143.9707232117653; time = 3.168718099594116
Train: Epoch 119; err = 143.36501896381378; time = 3.157715320587158
0.003606947894919167
tensor([0.1970, 0.1600, 0.6430], device='cuda:0', grad_fn=<MeanBackward1>) 0.8780393600463867
Train: Epoch 120; err = 144.2698773741722; time = 3.16

Train: Epoch 214; err = 139.99895811080933; time = 3.1587164402008057
Train: Epoch 215; err = 140.60656642913818; time = 3.1837220191955566
Train: Epoch 216; err = 140.61734056472778; time = 3.178730010986328
Train: Epoch 217; err = 139.63931012153625; time = 3.184722423553467
Train: Epoch 218; err = 140.15158611536026; time = 3.1837220191955566
Train: Epoch 219; err = 140.2584571838379; time = 3.1577162742614746
Train: Epoch 220; err = 140.52054738998413; time = 3.1887223720550537
Train: Epoch 221; err = 140.01050198078156; time = 3.163717746734619
Train: Epoch 222; err = 139.61737620830536; time = 3.179720640182495
Train: Epoch 223; err = 140.7171352505684; time = 3.211728096008301
Train: Epoch 224; err = 140.4386062026024; time = 3.1814780235290527
0.0027104318993045437
tensor([0.1564, 0.1730, 0.6706], device='cuda:0', grad_fn=<MeanBackward1>) 0.8597493171691895
Train: Epoch 225; err = 139.71411007642746; time = 3.1607062816619873
Train: Epoch 226; err = 140.26181030273438; time = 3

Train: Epoch 320; err = 137.92804551124573; time = 3.174518585205078
Train: Epoch 321; err = 138.0873899459839; time = 3.1577155590057373
Train: Epoch 322; err = 137.24233919382095; time = 3.163717031478882
Train: Epoch 323; err = 137.59803158044815; time = 3.1597158908843994
Train: Epoch 324; err = 137.60551846027374; time = 3.1557154655456543
Train: Epoch 325; err = 137.6108642220497; time = 3.170719623565674
Train: Epoch 326; err = 138.0148013830185; time = 3.1620240211486816
Train: Epoch 327; err = 138.17199528217316; time = 3.167095899581909
Train: Epoch 328; err = 137.62960201501846; time = 3.177720069885254
Train: Epoch 329; err = 137.22999930381775; time = 3.174720287322998
0.0020367472153163093
tensor([0.2158, 0.1884, 0.5958], device='cuda:0', grad_fn=<MeanBackward1>) 0.9231944680213928
Train: Epoch 330; err = 138.1065125465393; time = 3.166707754135132
Train: Epoch 331; err = 137.71439385414124; time = 3.1507232189178467
Train: Epoch 332; err = 137.34217137098312; time = 3.16

Train: Epoch 426; err = 136.16405844688416; time = 3.1737117767333984
Train: Epoch 427; err = 136.35755729675293; time = 3.180720329284668
Train: Epoch 428; err = 137.01834338903427; time = 3.1651461124420166
Train: Epoch 429; err = 136.27032828330994; time = 3.152714252471924
Train: Epoch 430; err = 136.64199370145798; time = 3.1507136821746826
Train: Epoch 431; err = 136.71562159061432; time = 3.1607160568237305
Train: Epoch 432; err = 137.02787870168686; time = 3.1577155590057373
Train: Epoch 433; err = 136.61765557527542; time = 3.1687190532684326
Train: Epoch 434; err = 136.21070086956024; time = 3.165743827819824
0.0015305085584932578
tensor([0.0904, 0.2411, 0.6685], device='cuda:0', grad_fn=<MeanBackward1>) 0.8581616282463074
Train: Epoch 435; err = 136.28225314617157; time = 3.1717190742492676
Train: Epoch 436; err = 136.09461826086044; time = 3.175720453262329
Train: Epoch 437; err = 136.23166698217392; time = 3.16996431350708
Train: Epoch 438; err = 136.00128483772278; time =

In [39]:
test_preds = dev_color_mod.predict(dev_cols_test, dev_seqs_test)
#dev_color_mod.predict(dev_cols_test, dev_seqs_test, probabilities=True)
train_preds = dev_color_mod.predict(dev_cols_train, dev_seqs_train)
#dev_color_mod.predict(dev_cols_test, dev_seqs_test, probabilities=True)

In [40]:
correct = sum([1 if x == 2 else 0 for x in test_preds])
print("test", correct, "/", len(test_preds), correct/len(test_preds))
correct = sum([1 if x == 2 else 0 for x in train_preds])
print("train", correct, "/", len(train_preds), correct/len(train_preds))

test 2114 / 3473 0.6086956521739131
train 7961 / 10417 0.7642315445905731


In [41]:
totals = {}
for ex in dev_examples:
    #ex.display(typ='speaker')
    #print(ex.condition)
    if ex.condition not in totals:
        totals[ex.condition] = 0
    totals[ex.condition]+=1
    #print(dev_color_mod.predict([ex.speaker_context], [tokenize_example(ex.contents)], probabilities=True))
    #print(dev_color_mod.predict([ex.speaker_context], [tokenize_example(ex.contents)])[0])
    #print()
    
scores = {}
for ex in dev_examples:
    #ex.display(typ='speaker')
    #print(ex.condition)
    if ex.condition not in scores:
        scores[ex.condition] = 0
    if dev_color_mod.predict([represent_color_context(ex.colors)], [tokenize_example(ex.contents)])[0] == 2:
        scores[ex.condition]+=1

In [42]:
for condition in scores:
    print(condition, ":", scores[condition], "/", totals[condition], "=", scores[condition]/totals[condition])

close : 4071 / 5776 = 0.7048130193905817
far : 2009 / 2657 = 0.756115920210764
split : 3869 / 5457 = 0.7089976177386843


In [43]:
#dev_perp = dev_color_mod.perplexities(dev_cols_test, dev_seqs_test)
#dev_perp[0]

In [44]:
#dev_color_mod.to_pickle(os.path.join('data', 'colors', 'color_describer_unigram_20e.pt'))