# Pragmatic color descriptions

In [1]:
__author__ = "Shubham Chowdhary"
__version__ = "Original System, XCS224u"

## Set-up

In [3]:
from colors import ColorsCorpusReader
from nltk.translate.bleu_score import corpus_bleu
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from torch_color_describer import ContextualColorDescriber
from torch_color_describer import create_example_dataset

import utils
from utils import START_SYMBOL, END_SYMBOL, UNK_SYMBOL

In [4]:
utils.fix_random_seeds()

In [5]:
COLORS_SRC_FILENAME = os.path.join(
    "data", "colors", "filteredCorpus.csv")

## All two-word examples as a dev corpus


In [6]:
#DONE(schowdhary): Use 2-word datasets for fast experimental testing of the model architecture until you find good scoring models. Please remember to use full dataset train/dev split on the final pipeline you choose to proceed with.

In [7]:
dev_corpus = ColorsCorpusReader(
    COLORS_SRC_FILENAME,
    word_count=2,
    normalize_colors=True)

In [8]:
dev_examples = list(dev_corpus.read())

This subset has about one-third the examples of the full corpus:

In [9]:
len(dev_examples)

13890

## Dev dataset

The first step is to extract the raw colors and raw texts from the corpus:

In [None]:
dev_rawcols, dev_texts = zip(*[[ex.colors, ex.contents] for ex in dev_examples])

## Random train–test split for development

For the sake of development runs, we create a random train–test split:

In [None]:
dev_rawcols_train, dev_rawcols_test, dev_texts_train, dev_texts_test = \
    train_test_split(dev_rawcols, dev_texts)
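The split must keep every color context paired with its utterance. The sketch below illustrates that idea with the standard library only; it is not scikit-learn's implementation, `paired_split` and its arguments are hypothetical names, and the 25% test fraction is chosen to mirror the documented default of `train_test_split`.

```python
import random

def paired_split(xs, ys, test_fraction=0.25, seed=0):
    """Shuffle two parallel sequences with the same permutation, then split."""
    assert len(xs) == len(ys)
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(xs, train_idx), pick(xs, test_idx),
            pick(ys, train_idx), pick(ys, test_idx))

# Toy parallel data standing in for color contexts and utterances.
colors = [f"context-{i}" for i in range(8)]
texts = [f"utterance-{i}" for i in range(8)]
c_train, c_test, t_train, t_test = paired_split(colors, texts)
```

Because one permutation of indices drives both sequences, item `i` of the color split always corresponds to item `i` of the text split.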

### Tokenization techniques we try

1. Lowercasing the content and splitting on whitespace; a sentiment-aware tokenizer (NLTK's `TweetTokenizer`) that also splits on punctuation is an alternative.
2. Generating the vocab only from the training examples and then replacing every word that occurs just once with \<UNK\> in the train and dev/test sets.
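To see why punctuation-aware splitting matters for color words, here is a small sketch contrasting the two strategies. The regex tokenizer is only a stand-in for `TweetTokenizer` (its rules are simpler), and the `START` / `END` strings are assumed to match the symbols imported from `utils`.

```python
import re

START, END = "<s>", "</s>"  # assumed values of START_SYMBOL / END_SYMBOL

def whitespace_tokenize(s):
    # Lowercase and split on whitespace only.
    return [START] + s.lower().split() + [END]

def punctuation_tokenize(s):
    # Runs of word characters and single punctuation marks become tokens.
    return [START] + re.findall(r"\w+|[^\w\s]", s.lower()) + [END]

text = "Blue-green, not teal"
print(whitespace_tokenize(text))
print(punctuation_tokenize(text))
```

With whitespace splitting, `blue-green,` is a single rare token; with punctuation splitting, the frequent words `blue` and `green` are recovered.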

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    def basic_tokenize(s):
        # Improved on punctuation splitting by using the TweetTokenizer below
        return [START_SYMBOL] + s.lower().split() + [END_SYMBOL]

    from nltk.tokenize import TweetTokenizer
    sentiment_tknzr = TweetTokenizer(preserve_case=False)

    def tweet_tokenize(s):
        return [START_SYMBOL] + sentiment_tknzr.tokenize(s) + [END_SYMBOL]

In [12]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from collections import Counter
    # Since we are using a small dataset for now, we will create a vocab using that. We will reuse this pipeline later-on when we reiterate this process for full dataset
    def create_vocab_from_content_dataset(train_texts, lowest_allowed_freq, tokenizer):
        """
        This routine creates a vocab from the content dataset passed to it, using `tokenizer` to split each text. As a safety measure, it also returns the tokenized copy of the dataset with rare words replaced by UNK_SYMBOL. Please use that version of the dataset for further training in the pipeline.

        Parameters
        ----------
        train_texts : list
            list of utterances from training set

        lowest_allowed_freq : int
            least frequency for the words that should be the part of vocab

        tokenizer : func(str)->list of str
            tokenizer that splits a text to a list of tokens

        Returns
        -------
        list
            training vocab list
        list of list
            list of transformed word sequences
        list
            list of words with very few occurrences that are removed from training transformed word sequences. Should be removed from dev/test set too during preprocessing


        """

        train_word_seqs = [tokenizer(text) for text in train_texts]
        train_words_list = [word for word_seq in train_word_seqs for word in word_seq]

        ctr_train_words_list = Counter(train_words_list)
        banned_words = [k for (k, v) in ctr_train_words_list.most_common() if v < lowest_allowed_freq]

        train_words_vocab = sorted(set(train_words_list) - set(banned_words))
        train_words_vocab += [UNK_SYMBOL]

        filled_train_word_seqs = [[word if word not in banned_words else UNK_SYMBOL for word in word_seq] for word_seq in train_word_seqs]

        return train_words_vocab, filled_train_word_seqs, banned_words

    def transform_test_content(test_texts, train_vocab, tokenizer):
        """

        Parameters
        ----------
        test_texts : list
            list of utterances from test/dev set

        train_vocab : list
            list of vocab words that was formed using the training set

        tokenizer : func(str)->list of str
            tokenizer that splits a string text to list of string tokens

        Returns
        -------
         list of list
            trainable list of word sequences

        """
        test_word_seqs = [tokenizer(text) for text in test_texts]
        filled_test_word_seqs = [[word if word in train_vocab else UNK_SYMBOL for word in word_seq] for word_seq in test_word_seqs]

        return filled_test_word_seqs

In [13]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    # DONE(schowdhary): see behaviour with lowest_allowed_freq=2 (& 1 also) in the bigger dataset.
    # There was not much difference. Just that with a smaller vocab we can get similar accuracy scores
    dev_train_vocab, filled_dev_train_word_seqs, dev_train_removed_words = create_vocab_from_content_dataset(train_texts=dev_texts_train, lowest_allowed_freq=1, tokenizer=tweet_tokenize)

    print(len(dev_train_removed_words), " these many words were removed from the train vocab")
    print("Size of train set vocab, ", len(dev_train_vocab))
    print("Banned words:\n", dev_train_removed_words)
    print("A sample tokenized train text:\n", filled_dev_train_word_seqs[360])
    print("And the actual text was:\n", dev_texts_train[360])

When we used simple whitespace splitting and excluded words with freq < 2, many compound color words were excluded. We therefore also split on punctuation so that more color words are retained. Even so, removing words with freq < 2 still maps approximately 50% of the full train vocab to \<UNK\>. Some of these are typos, but others are real words that, though rare, carry genuine color information. Hence, we decide not to remove freq=1 words from the train_vocab for now. **_This also motivates trying subword tokenizers such as BERT's_**.
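To gauge how much of a vocabulary a frequency cutoff of 2 would discard, the share of words seen exactly once can be measured directly. A minimal sketch on toy data follows; the function name and the sequences are illustrative, not taken from the corpus.

```python
from collections import Counter

def singleton_share(token_seqs):
    """Return the fraction of word types seen exactly once, and those words."""
    counts = Counter(tok for seq in token_seqs for tok in seq)
    singletons = sorted(w for w, c in counts.items() if c == 1)
    return len(singletons) / len(counts), singletons

toy_seqs = [["dark", "blue"], ["blue"], ["dark", "teal"], ["purpleish"]]
share, rare_words = singleton_share(toy_seqs)
print(share, rare_words)
```

In this toy case half of the word types are singletons, and both of them (`teal`, `purpleish`) are meaningful color words, which is the situation described above.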

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    def tokenize_example_on_vocab(s, train_vocab, tokenizer):
        """

        Parameters
        ----------
        s : str
            string text
        train_vocab : list
            list of supported words from training dataset
        tokenizer : func(str)-> list of str
            tokenizer function that takes string converts it into list of tokens

        Returns
        -------
        list
            list of words in a tokenized format
        """
        tokens_s = tokenizer(s)
        filled_word_seq = [word if word in train_vocab else UNK_SYMBOL for word in tokens_s]

        return filled_word_seq

In [16]:
def tokenize_example(s):
    if 'IS_GRADESCOPE_ENV' not in os.environ:
        return tokenize_example_on_vocab(s, train_vocab=dev_train_vocab, tokenizer=tweet_tokenize)
    else:
        return [START_SYMBOL] + s.lower().split() + [END_SYMBOL]


In [17]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print("Tokenized text:")
    print(tokenize_example(dev_texts_train[376]))
    print("Actual text:")
    print(dev_texts_train[376])

## Use the tokenizer

In [None]:
dev_seqs_train = [tokenize_example(s) for s in dev_texts_train]

dev_seqs_test = [tokenize_example(s) for s in dev_texts_test]

We use only the train set to derive a vocabulary for the model:

In [None]:
dev_vocab = sorted({w for toks in dev_seqs_train for w in toks})

dev_vocab += [UNK_SYMBOL]

In [22]:
len(dev_vocab)

1050

In [23]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    import cmath
    from itertools import product

# Based on Monroe et al. (2016)
def represent_color_context(colors):
    # return [color for color in colors]
    return [represent_color(color) for color in colors]

def represent_color(color):

    if 'IS_GRADESCOPE_ENV' not in os.environ:
        # the color loaded from the dataset here is a 3-d vector (HLS) with range (0-1, 0-1, 0-1) that was scaled from (0-360, 0-100, 0-100)

        # HLS to HSV
        actual_hue = 360 * color[0]
        l = color[1]
        s_l = color[2]

        v = l + s_l * min(l, 1-l)
        s_v = 0. if v == 0 else 2 * (1 - (l/v))

        # the HSV color is in range (0-360, 0-1, 0-1)

        # HSV to Fourier representation
        # Monroe et al. (2016) requires (h, s, v) in (0-360, 0-200, 0-200)
        # which is then normalized as (h/360, s/200, v/200).
        # So the Fourier transformation requires them in the range (h, s, v) ~ (0-1, 0-1, 0-1)

        f_real = []
        f_imag = []

        # collect the values across the cross-product of axes
        # (the third index is named m so that it does not shadow the lightness l)
        for j, k, m in product((0, 1, 2), repeat = 3):
            f_hat_jkm = cmath.rect(1, (-2 * cmath.pi * (j * (actual_hue/360.0) + k * s_v + m * v)))
            f_real.append(f_hat_jkm.real)
            f_imag.append(f_hat_jkm.imag)

        f_color = f_real + f_imag

        return f_color

    else:
        return color

In [24]:
print(represent_color_context(dev_rawcols_train[0]))

if 'IS_GRADESCOPE_ENV' not in os.environ:
    res = represent_color_context(dev_rawcols_train[370])
    print("Color representation raw:\n", dev_rawcols_train[370])
    print("Length of one color's dimension:\n", len(res[0]))
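The two steps inside `represent_color` can be checked in isolation. The sketch below re-implements them with hypothetical helper names (`hls_to_hsv`, `fourier_features`), compares the HLS-to-HSV conversion against Python's `colorsys` module, and confirms the feature length: 3 x 3 x 3 = 27 frequency combinations, each contributing a real and an imaginary part, for 54 values per color. The input color is an arbitrary example.

```python
import cmath
import colorsys
from itertools import product

def hls_to_hsv(h, l, s_l):
    # Same conversion as in represent_color, with all channels in [0, 1].
    v = l + s_l * min(l, 1 - l)
    s_v = 0.0 if v == 0 else 2 * (1 - l / v)
    return h, s_v, v

def fourier_features(h, s, v):
    # One unit-modulus complex number per (j, k, m) in {0, 1, 2}^3.
    real, imag = [], []
    for j, k, m in product((0, 1, 2), repeat=3):
        z = cmath.rect(1, -2 * cmath.pi * (j * h + k * s + m * v))
        real.append(z.real)
        imag.append(z.imag)
    return real + imag

h, l, s_l = 0.75, 0.4, 0.6  # arbitrary normalized HLS color
hsv = hls_to_hsv(h, l, s_l)
reference = colorsys.rgb_to_hsv(*colorsys.hls_to_rgb(h, l, s_l))
features = fourier_features(*hsv)
print(hsv, len(features))
```

The agreement with `colorsys` (via an RGB round trip) is a quick sanity check that the closed-form conversion is the standard one.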

## Use the color representer

In [None]:
dev_cols_train = [represent_color_context(colors) for colors in dev_rawcols_train]

dev_cols_test = [represent_color_context(colors) for colors in dev_rawcols_test]

At this point, our preprocessing steps are complete, and we can fit a first model.

In [None]:
dev_mod = ContextualColorDescriber(
    dev_vocab,
    early_stopping=True)

In [29]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    %time _ = dev_mod.fit(dev_cols_train, dev_seqs_train)
else:
    dev_mod.fit(dev_cols_train, dev_seqs_train)

Stopping after epoch 107. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 39.15624952316284

Wall time: 7min 2s

In [30]:
evaluation = dev_mod.evaluate(dev_cols_test, dev_seqs_test)

In [31]:
evaluation.keys()

dict_keys(['listener_accuracy', 'corpus_bleu', 'target_index', 'predicted_index', 'predicted_utterance'])

In [32]:
evaluation['listener_accuracy']

0.7926864382378347

In [33]:
dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test)

0.7926864382378347

In [34]:
evaluation['corpus_bleu']

0.6637952127945157

In [35]:
bleu, predicted_utterances = dev_mod.corpus_bleu(dev_cols_test, dev_seqs_test)

bleu

0.6637952127945157

In [36]:
evaluation['target_index'][: 5]

[2, 2, 2, 2, 2]

In [37]:
evaluation['predicted_index'][: 5]

[0, 2, 2, 0, 2]

In [38]:
evaluation['predicted_utterance'][: 5]

We can also see the model's predicted sequences given color context inputs:

In [None]:
dev_mod.predict(dev_cols_test[: 1])

In [40]:
dev_seqs_test[: 1]

[['<s>', 'bright', 'purple', '</s>']]

**_Evaluating Fourier color representations:_**
For normalized HSL values, the result: **_Error: 52.733 (unstable perplexity calculations), listener_accuracy: 0.3852, bleu: 0.4948_**
For Fourier-transformed HSV values (converted from HSL), the result: **_Error: 39.156, listener_accuracy: 0.79268, bleu: 0.66379_**

**_As we can see, the Fourier transformation substantially improves learning._**

## GloVe embeddings

The above model uses a random initial embedding, as configured by the decoder used by `ContextualColorDescriber`. Let's instead try using GloVe inputs.
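Swapping in pretrained inputs amounts to building an embedding matrix with one row per vocab word. The sketch below shows one common way to do that, where words missing from the pretrained lookup receive small random vectors. It is an illustration under that assumption, not the behaviour of `utils.create_pretrained_embedding`; `build_embedding` and the toy lookup are hypothetical.

```python
import random

def build_embedding(vocab, lookup, dim, seed=0):
    """Return one row per vocab word and the number of pretrained rows."""
    rng = random.Random(seed)
    rows, n_pretrained = [], 0
    for word in vocab:
        if word in lookup:
            rows.append(list(lookup[word]))  # copy the pretrained vector
            n_pretrained += 1
        else:
            # Assumption: unseen words get a small random vector.
            rows.append([rng.uniform(-0.5, 0.5) for _ in range(dim)])
    return rows, n_pretrained

toy_lookup = {"blue": [0.1, 0.2, 0.3], "green": [0.0, -0.1, 0.4]}
toy_vocab = ["<s>", "</s>", "blue", "green", "purpleish", "$UNK"]
matrix, n_pretrained = build_embedding(toy_vocab, toy_lookup, dim=3)
print(len(matrix), n_pretrained)
```

Reporting how many rows were actually pretrained is useful here, since special symbols and rare color words are unlikely to appear in a general-purpose lookup.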


In [None]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [42]:
def create_glove_embedding(vocab, glove_base_filename='glove.6B.50d.txt'):

    glove_lookup = utils.glove2dict(
        os.path.join(GLOVE_HOME, glove_base_filename))

    glove_embedding, glove_vocab = utils.create_pretrained_embedding(
            glove_lookup, vocab)

    return glove_embedding, glove_vocab



## Try the GloVe representations

In [None]:
# TODO(schowdhary): Try character level tokenization and representations

Let's see if GloVe helped for our development data:

In [None]:
dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab)

In [47]:
len(dev_vocab)

1050

In [48]:
len(dev_glove_vocab)

1050

In [49]:
dev_mod_glove = ContextualColorDescriber(
    dev_glove_vocab,
    embedding=dev_glove_embedding,
    early_stopping=True)

In [50]:
_ = dev_mod_glove.fit(dev_cols_train, dev_seqs_train)

Stopping after epoch 93. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 40.069143533706665

In [51]:
dev_mod_glove.listener_accuracy(dev_cols_test, dev_seqs_test)

0.788367405701123

In [52]:
# further evaluating the GloVe embedding-based model

if 'IS_GRADESCOPE_ENV' not in os.environ:
    dev_glove_embedding, dev_glove_vocab = create_glove_embedding(dev_vocab, glove_base_filename='glove.6B.50d.txt')

    print("Total vocab size before:\n", len(dev_vocab))
    print("Vocab size for GloVe embeddings:\n", len(dev_glove_vocab))

    custom_dev_mod_glove = ContextualColorDescriber(
        dev_glove_vocab,
        embedding=dev_glove_embedding,
        early_stopping=True)

    _ = custom_dev_mod_glove.fit(dev_cols_train, dev_seqs_train)

Total vocab size before:
 1050
Vocab size for GloVe embeddings:
 1050


Stopping after epoch 114. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 36.531978368759155

In [55]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print(custom_dev_mod_glove.listener_accuracy(dev_cols_test, dev_seqs_test))

**_GloVe 50d and 100d embeddings work fine, with accuracy ~ 0.7877_**
With 200d the error goes down further, to 32.58, but test accuracy does not improve. This suggests the larger embeddings fit the training data better without generalizing any better.

### Already learnt representations (from word-relatedness research work)

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    VSM_HOME = os.path.join('data', 'vsmdata')

    def create_df_vsm():
        full_matrix_df = pd.read_csv(os.path.join(VSM_HOME, "best_devset_word_repr_vsm.csv.gz"), index_col=0)
        return full_matrix_df

    custom_vsm = create_df_vsm()

    custom_vsm_lookup = {word:custom_vsm.loc[word] for word in custom_vsm.index}

    print(custom_vsm_lookup["water"])

0       0.007726
1      -0.002936
2       0.001923
3      -0.002022
4       0.029006
          ...   
1275    0.009053
1276   -0.020875
1277    0.027672
1278    0.033489
1279    0.011569
Name: water, Length: 1280, dtype: float64

In [57]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    custom_vsm_embedding, custom_vsm_vocab = utils.create_pretrained_embedding(
            custom_vsm_lookup, dev_vocab)

    print("Total vocab size before:\n", len(dev_vocab))
    print("Vocab size for Custom word-relatedness VSM embeddings:\n", len(custom_vsm_vocab))

    custom_vsm_dev_mod = ContextualColorDescriber(
        custom_vsm_vocab,
        embedding=custom_vsm_embedding,
        early_stopping=True)

    _ = custom_vsm_dev_mod.fit(dev_cols_train, dev_seqs_train)

Total vocab size before:
 1050
Vocab size for Custom word-relatedness VSM embeddings:
 1050


Stopping after epoch 15. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 52.26255178451538

In [58]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print(custom_vsm_dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test))

**_Not a promising performance._**

## Color-Input Describer (as used by Monroe et al.)

In [68]:
from torch_color_describer import Decoder
import torch
import torch.nn as nn


class ColorContextDecoder(Decoder):
    def __init__(self, color_dim, *args, **kwargs):
        self.color_dim = color_dim
        super().__init__(*args, **kwargs)

        self.rnn = nn.GRU(
            input_size=self.embed_dim + self.color_dim, # -> embed_dim + c (from get_embeddings below)
            hidden_size=self.hidden_dim,
            batch_first=True)


    def get_embeddings(self, word_seqs, target_colors=None):
        # word_seqs -> (m, k)
        word_seqs_embedding = self.embedding(word_seqs) # -> (m, k, embed_dim)
        # target_colors -> (m, c)
        target_colors = torch.unsqueeze(target_colors, 1) # -> (m, 1, c)
        target_colors_across_word_seqs = torch.repeat_interleave(target_colors, word_seqs.shape[1], dim=1) # -> (m, k, c)

        return torch.cat((word_seqs_embedding, target_colors_across_word_seqs), dim=2) # -> (m, k, embed_dim + c)
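The shape bookkeeping in `get_embeddings` is easier to follow on a single example. The sketch below uses plain lists instead of tensors and drops the batch dimension; `append_color_to_each_step` is a hypothetical helper, and the numbers are arbitrary.

```python
def append_color_to_each_step(token_embeddings, target_color):
    """token_embeddings: k vectors of length embed_dim; target_color: length c.

    Returns k vectors of length embed_dim + c, one per timestep.
    """
    return [list(emb) + list(target_color) for emb in token_embeddings]

token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # k=3, embed_dim=2
target_color = [1.0, 0.0, 0.5, 0.25]                     # c=4
decoder_inputs = append_color_to_each_step(token_embeddings, target_color)
print(len(decoder_inputs), len(decoder_inputs[0]))
```

Every timestep sees the same target color next to its own word embedding, which is why the GRU's `input_size` is `embed_dim + color_dim`.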



In [78]:
from torch_color_describer import EncoderDecoder

class ColorizedEncoderDecoder(EncoderDecoder):

    def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):
        if hidden is None:
            hidden = self.encoder(color_seqs)

        # color_seqs -> (m, 3, color_dim)

        # Integer indexing already drops the context axis, so no squeeze is needed.
        target_color_seq = color_seqs[:, 2, :] # -> (m, color_dim)

        output, hidden = self.decoder(
            word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=target_color_seq)

        if self.training:
            return output
        else:
            return output, hidden

In [79]:
from torch_color_describer import Encoder

class ColorizedInputDescriber(ContextualColorDescriber):

    def build_graph(self):

        encoder = Encoder(
            color_dim=self.color_dim,
            hidden_dim=self.hidden_dim)

        color_context_decoder = ColorContextDecoder(
            color_dim=self.color_dim,
            vocab_size=self.vocab_size,
            embed_dim=self.embed_dim,
            embedding=self.embedding,
            hidden_dim=self.hidden_dim,
            freeze_embedding=self.freeze_embedding)

        return ColorizedEncoderDecoder(encoder, color_context_decoder)



In [80]:
def test_full_system(describer_class):
    toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
        group_size=50, vec_dim=2)

    toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
        train_test_split(toy_color_seqs, toy_word_seqs)

    toy_mod = describer_class(toy_vocab)

    _ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

    acc = toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

    return acc

In [81]:
test_full_system(ColorizedInputDescriber)

Finished epoch 1000 of 1000; error is 0.11461793631315231

1.0

In [102]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    decoder_with_target_dev_mod = ColorizedInputDescriber(
        dev_vocab,
        embed_dim=128,
        hidden_dim=128,
        optimizer_class=torch.optim.Adam,
        eta=0.001,
        early_stopping=True)

    _ = decoder_with_target_dev_mod.fit(dev_cols_train, dev_seqs_train)

Stopping after epoch 71. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 43.2279314994812

In [103]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print("listener_accuracy:\n", decoder_with_target_dev_mod.listener_accuracy(dev_cols_test, dev_seqs_test))
    print("bleu:\n", decoder_with_target_dev_mod.corpus_bleu(dev_cols_test, dev_seqs_test)[0])

listener_accuracy:
 0.8016124388137057
bleu:
 0.6622045640998858


Results for this model under different settings:
-> default parameters: listener_accuracy: 0.780, bleu: 0.664
-> embed_dim = hidden_dim = 100: listener_accuracy: 0.7929, bleu: 0.6626 (accuracy drops when both are around 200)
-> embed_dim = 200, hidden_dim = 100: reduced accuracy of 0.789
-> **_embed_dim = hidden_dim = 128: listener_accuracy: 0.7953, bleu: 0.6615_**
-> **_as above (embed_dim = hidden_dim = 128) with Adam and eta = 0.001: listener_accuracy: 0.802, bleu: 0.662_**
-> embed_dim = hidden_dim = 128 with Adam and eta = 0.004: listener_accuracy: 0.783, bleu: 0.657
-> Adadelta with eta = 0.2 does not produce good results.

## Original system

Apart from the experiments above, let's try some promising techniques, such as BERT representations in the embedding matrix, regularization, etc. Since BERT embeddings are extremely costly to generate, we choose to generate them for the full train vocab at once. Things to try during integration:

1. Generate training data/vocab with some trimming of words (representing them as \<UNK\>).
2. BERT tokenization and representation for the entire train vocab, to be used later as the embedding matrix for the decoder.
3. Use the Fourier-transformed color space as color feature vectors.
4. Do some experiments with the actual encoder-decoder structure.
5. Optimize the hyperparameters.
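The last item can be organized as a small grid search rather than by hand. The sketch below only enumerates configurations and picks the best one under a scoring function. The search space is illustrative, and `score_fn` is a placeholder: in this notebook it would presumably fit a describer with the given settings and return its `listener_accuracy` on the held-out split.

```python
from itertools import product

def grid(search_space):
    """Yield one dict per combination of the listed hyperparameter values."""
    keys = sorted(search_space)
    for values in product(*(search_space[k] for k in keys)):
        yield dict(zip(keys, values))

def pick_best(configs, score_fn):
    # score_fn maps a config dict to a number; higher is better.
    return max(configs, key=score_fn)

search_space = {
    "embed_dim": [100, 128],
    "hidden_dim": [100, 128],
    "eta": [0.001, 0.004],
}
configs = list(grid(search_space))
print(len(configs), configs[0])
```

Each model fit here takes minutes, so the grid should stay small; running it on the two-word dev corpus first keeps the cost manageable.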

### 1. Generating full data and truncating vocab set

In [161]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_corpus = ColorsCorpusReader(
        COLORS_SRC_FILENAME,
        normalize_colors=True)

    full_examples = list(full_corpus.read())

    print("Total examples:\n", len(full_examples))

Total examples:
 46994


In [162]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_rawcols, full_texts = zip(*[[ex.colors, ex.contents] for ex in full_examples])
    full_rawcols_train, full_rawcols_test, full_texts_train, full_texts_test = \
    train_test_split(full_rawcols, full_texts)

In [163]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_train_vocab, filled_full_train_word_seqs, full_train_removed_words = create_vocab_from_content_dataset(train_texts=full_texts_train, lowest_allowed_freq=1, tokenizer=tweet_tokenize)

    print("Total vocab size:\n", len(full_train_vocab))
    print("Total vocab words removed:\n", len(full_train_removed_words))

Total vocab size:
 2997
Total vocab words removed:
 0


In [164]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    def full_tokenize_example(s):
        return tokenize_example_on_vocab(s, train_vocab=full_train_vocab, tokenizer=tweet_tokenize)

In [165]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_seqs_train = [full_tokenize_example(s) for s in full_texts_train]
    full_seqs_test = [full_tokenize_example(s) for s in full_texts_test]
    full_cols_train = [represent_color_context(colors) for colors in full_rawcols_train]
    full_cols_test = [represent_color_context(colors) for colors in full_rawcols_test]

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_mod = ContextualColorDescriber(
        full_train_vocab,
        early_stopping=True)

    %time _ = full_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

**_Baseline for full train vocab: listener_accuracy: 0.832, bleu: 0.448_**

In [174]:
# trying to increase the relevant vocab size
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from matplotlib import colors as mcolors

    colors = dict(mcolors.BASE_COLORS, **mcolors.CSS4_COLORS)
    color_vocab = colors.keys()
    color_vocab = [color_word.lower() for color_word in color_vocab if len(color_word) > 2]

    def is_subseq(x, y):
        it = iter(y)
        # return all(any(c == ch for c in it) for ch in x)
        is_subsequence = False
        max_seq_count = 4
        curr_seq_count = 0

        for ch in x:
            if any(c == ch for c in it):
                curr_seq_count+=1
                if curr_seq_count == max_seq_count:
                    is_subsequence = True
                    break
        return is_subsequence

    def get_rare_vocab_from_banned_words(rejected_words):
        wild_card_vocab = []
        ctr = 0

        for removed_word in rejected_words:
            for color_word in color_vocab:
                if is_subseq(color_word, removed_word):
                    ctr += 1
                    wild_card_vocab.append(removed_word)
                    break

        wild_card_vocab = set(wild_card_vocab)
        still_removed_words = set(rejected_words) - wild_card_vocab
        # print(wild_card_vocab)
        # print("Total infrequent words with some color information:\n", ctr)
        return sorted(wild_card_vocab), sorted(still_removed_words)

In [175]:
# if 'IS_GRADESCOPE_ENV' not in os.environ:
#     rare_vocab, truly_banned_vocab = get_rare_vocab_from_banned_words(full_train_removed_words)
    # print(rare_vocab)

**_This is an attempt to extract even more relevant color words from the infrequent words_**
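The rule in `is_subseq` is looser than a true subsequence test: a rare word is kept as soon as four characters of some color name appear in it in order. Below is a standalone sketch of the same rule under a hypothetical name, with toy words.

```python
def shares_ordered_chars(color_word, candidate, min_matches=4):
    """True if min_matches characters of color_word occur, in order, in candidate."""
    it = iter(candidate)
    matches = 0
    for ch in color_word:
        # any() consumes the iterator up to and including the first match;
        # once a character is not found, the iterator is exhausted.
        if any(c == ch for c in it):
            matches += 1
            if matches == min_matches:
                return True
    return False

print(shares_ordered_chars("purple", "purpleish"))    # 'p', 'u', 'r', 'p' match
print(shares_ordered_chars("turquoise", "turqoise"))  # misspelling still matches
print(shares_ordered_chars("red", "reddish"))         # only three characters exist
```

One consequence of the fixed threshold of four is that color names with three letters, such as `red` or `tan`, can never rescue a rare word.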

In [186]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    from collections import Counter

    def create_tweaked_vocab_from_content_dataset(train_texts, lowest_allowed_freq, tokenizer):
        train_word_seqs = [tokenizer(text) for text in train_texts]
        train_words_list = [word for word_seq in train_word_seqs for word in word_seq]

        ctr_train_words_list = Counter(train_words_list)
        banned_words = [k for (k, v) in ctr_train_words_list.most_common() if v < lowest_allowed_freq]
        banned_words = set(banned_words)

        train_words_vocab = sorted(set(train_words_list) - banned_words)

        rare_train_vocab, truly_banned_train_vocab = get_rare_vocab_from_banned_words(banned_words)

        tweaked_train_words_vocab = sorted(set(train_words_vocab + rare_train_vocab))
        tweaked_train_words_vocab += [UNK_SYMBOL]

        filled_tweaked_train_word_seqs = [[word if word not in truly_banned_train_vocab else UNK_SYMBOL for word in word_seq] for word_seq in train_word_seqs]

        return tweaked_train_words_vocab, filled_tweaked_train_word_seqs, truly_banned_train_vocab


In [187]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_tweaked_train_vocab, filled_full_tweaked_train_word_seqs, full_train_tweaked_removed_words = create_tweaked_vocab_from_content_dataset(train_texts=full_texts_train, lowest_allowed_freq=2, tokenizer=tweet_tokenize)

    print("Total vocab size:\n", len(full_tweaked_train_vocab))
    print("Total vocab words removed:\n", len(full_train_tweaked_removed_words))
    print("Words removed from vocab:\n", full_train_tweaked_removed_words)
    print("A sample tokenized train text:\n", filled_full_tweaked_train_word_seqs[360])
    print("And the actual text was:\n", full_texts_train[360])

Total vocab size:
 1839
Total vocab words removed:
 1158
Words removed from vocab:
A sample tokenized train text:
 ['<s>', 'purple', '</s>']
And the actual text was:
 purple


In [188]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    def tweak_tokenize_example(s):
        return tokenize_example_on_vocab(s, train_vocab=full_tweaked_train_vocab, tokenizer=tweet_tokenize)

In [189]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_tweaked_seqs_train = [tweak_tokenize_example(s) for s in full_texts_train]
    full_tweaked_seqs_test = [tweak_tokenize_example(s) for s in full_texts_test]

In [190]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_cols_train = [represent_color_context(colors) for colors in full_rawcols_train]
    full_cols_test = [represent_color_context(colors) for colors in full_rawcols_test]

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_tweaked_mod = ContextualColorDescriber(
        full_tweaked_train_vocab,
        early_stopping=True)

    %time _ = full_tweaked_mod.fit(full_cols_train, full_tweaked_seqs_train)

In [192]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print("listener_accuracy:\n", full_tweaked_mod.listener_accuracy(full_cols_test, full_tweaked_seqs_test))
    print("bleu:\n", full_tweaked_mod.corpus_bleu(full_cols_test, full_tweaked_seqs_test)[0])



listener_accuracy:
 0.8235594518682441
bleu:
 0.4511033513493236


Tweaking the dataset **_removed 1158 vocab words (about 39% of the 2997-word full vocab)_** and yet the model **_retains its performance, with a slightly improved bleu score_**,
**_default tweaked model,
listener_accuracy: 0.82356-0.831
bleu: 0.45_**

This shows that **_almost 39% of the vocab words (rare words) were almost inconsequential_** to the model's learning.
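The share of the vocab that was trimmed follows directly from the counts printed above (2997 words in the full train vocab, 1839 after tweaking, 1158 removed):

```python
vocab_before = 2997  # size printed for full_train_vocab
vocab_after = 1839   # size printed for full_tweaked_train_vocab
removed = 1158       # number of words printed as removed

share_removed = removed / vocab_before
print(f"{share_removed:.1%} of the vocab is mapped to the unknown token")
```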

### 2. Generating train vocab BERT Embeddings that can be used with decoder

In [197]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    from transformers import BertModel, BertTokenizer
    import vsm

    bert_weights_name = 'bert-base-uncased'
    bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
    bert_model = BertModel.from_pretrained(bert_weights_name)

    global ctr
    ctr = 0

    def hf_bert_phi(text):
        text_bert_ids = vsm.hf_encode(text, bert_tokenizer,
                                    add_special_tokens=True)

        text_bert_reps = vsm.hf_represent(text_bert_ids, bert_model, layer=-1) # -> (1, x, 768)

        global ctr
        if ctr%100 == 0:
            print("Bert encoding done for ctr ", ctr)
        ctr+=1

        return torch.mean(text_bert_reps[0], axis=0).cpu().numpy()

    # To save time on re-runs, let's create BERT embeddings for all the words in the dataset. The model, however, only receives the embeddings for the vocab chosen for the training set (which differs between runs).

    # lookup to create embeddings
    def create_bert_embedding_lookup_on_text_corpora(text_corpora, tokenizer):
        corpora_word_seqs = [tokenizer(text) for text in text_corpora]
        corpora_words_list = [word for word_seq in corpora_word_seqs for word in word_seq]

        corpora_words_set = list(sorted(set(corpora_words_list)))
        corpora_word_bert_embedding = {word:hf_bert_phi(word) for word in corpora_words_set}

        return corpora_word_bert_embedding
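`hf_bert_phi` turns a variable number of BERT position vectors into one fixed-size word vector by averaging over positions; with `add_special_tokens=True` those positions presumably include the special tokens as well as the word pieces. A plain-Python sketch of the pooling step, with a hypothetical helper name and toy three-dimensional vectors in place of the 768-dimensional ones:

```python
def mean_pool(position_vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    n = len(position_vectors)
    dim = len(position_vectors[0])
    return [sum(vec[d] for vec in position_vectors) / n for d in range(dim)]

# Two positions, three dimensions each.
pieces = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
word_vector = mean_pool(pieces)
print(word_vector)
```

Mean pooling gives every word a vector of the same size regardless of how many pieces the tokenizer splits it into, which is what the embedding matrix requires.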


In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    %time full_corpora_word_bert_embedding = create_bert_embedding_lookup_on_text_corpora(full_texts, tweet_tokenize)

In [207]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    def create_bert_embedding(vocab, bert_lookup):
        bert_embedding, bert_vocab = utils.create_pretrained_embedding(bert_lookup, vocab)

        return bert_embedding, bert_vocab

In [208]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_train_bert_embedding, full_train_bert_vocab = create_bert_embedding(full_train_vocab, full_corpora_word_bert_embedding)

In [None]:
# testing the performance of BERT embeddings in the decoder
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_bert_embedding_mod = ContextualColorDescriber(
        full_train_bert_vocab,
        embedding=full_train_bert_embedding,
        hidden_dim=512,
        early_stopping=True)

    %time _ = full_bert_embedding_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_bert_embedding_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_bert_embedding_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

Default model performance:
listener_accuracy: 0.8288, bleu: 0.472
\+ hidden_dim 256: listener_accuracy: 0.8453, bleu: 0.438
**_+ hidden_dim 512: listener_accuracy: 0.850, bleu: 0.447 (accuracy drops for a larger value of 728)_**
In all of these runs eta stays at 0.001 (the default).

### 3. BERT on color input Model

In [None]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_bert_embedding_color_input_mod = ColorizedInputDescriber(
        full_train_bert_vocab,
        embedding=full_train_bert_embedding,
        hidden_dim=256,
        early_stopping=True)

    _ = full_bert_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_bert_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_bert_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

Scores:
default + hidden_dim = 512: listener_accuracy: 0.845, bleu: 0.451
\+ hidden_dim = 256: listener_accuracy: 0.849, bleu: 0.453 (scores drop for hidden_dim < 256)

### 4. Custom Encoder-Decoder Model (Experimental, encoder_hidden_dim always equals decoder_hidden_dim)

In [393]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    class ExpColorContextDecoder(ColorContextDecoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

            self.rnn = nn.LSTM(
                input_size=self.embed_dim + self.color_dim,
                hidden_size=self.hidden_dim,
                num_layers=2,
                batch_first=True,
                dropout=0.8,
                bidirectional=False
            )
            self.output_layer = nn.Linear((int(self.rnn.bidirectional) + 1) * self.hidden_dim, self.vocab_size)

    class ExpEncoder(Encoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

            self.rnn = nn.LSTM(
                input_size=self.color_dim,
                hidden_size=self.hidden_dim,
                num_layers=2,
                batch_first=True,
                dropout=0.8,
                bidirectional=False
            )

    class ExpColorizedEncoderDecoder(ColorizedEncoderDecoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):

            if hidden is None:
                hidden = self.encoder(color_seqs)

            # color_seqs -> (m, 3, color_dim); the target is the last color
            target_color_seq = color_seqs[:, 2, :] # -> (m, color_dim)
            target_color_seq = torch.squeeze(target_color_seq, dim=1) # no-op unless color_dim == 1

            output, hidden = self.decoder(
                word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=target_color_seq)

            if self.training:
                return output
            else:
                return output, hidden


    class ExpColorizedInputDescriber(ColorizedInputDescriber):
        def __init__(self, encoder_hidden_dim=50, decoder_hidden_dim=50, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.encoder_hidden_dim = encoder_hidden_dim
            self.decoder_hidden_dim = decoder_hidden_dim

            # to easily find if self.hidden_dim is being used elsewhere
            self.hidden_dim = 0
            self.params += ['encoder_hidden_dim', 'decoder_hidden_dim']

        def build_graph(self):
            encoder = ExpEncoder(
                color_dim=self.color_dim,
                hidden_dim=self.encoder_hidden_dim
            )

            decoder = ExpColorContextDecoder(
                color_dim=self.color_dim,
                vocab_size=self.vocab_size,
                embed_dim=self.embed_dim,
                embedding=self.embedding,
                hidden_dim=self.decoder_hidden_dim,
                freeze_embedding=self.freeze_embedding
            )

            return ExpColorizedEncoderDecoder(encoder, decoder)


In [484]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    def test_exp_full_system(describer_class):
        toy_color_seqs, toy_word_seqs, toy_vocab = create_example_dataset(
            group_size=50, vec_dim=2)

        toy_color_seqs_train, toy_color_seqs_test, toy_word_seqs_train, toy_word_seqs_test = \
            train_test_split(toy_color_seqs, toy_word_seqs)

        toy_mod = describer_class(vocab=toy_vocab, encoder_hidden_dim=32, decoder_hidden_dim=256)

        _ = toy_mod.fit(toy_color_seqs_train, toy_word_seqs_train)

        acc = toy_mod.listener_accuracy(toy_color_seqs_test, toy_word_seqs_test)

        return acc

    # print("Test result:\n", test_exp_full_system(ExpColorizedInputDescriber))

In [None]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_exp_bert_embedding_color_input_mod = ExpColorizedInputDescriber(
        vocab=full_train_bert_vocab,
        embedding=full_train_bert_embedding,
        encoder_hidden_dim=256,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = full_exp_bert_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_exp_bert_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_exp_bert_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

Scores:

- BERT embeddings + hidden_dim (256, 256) + 1 layer + unidirectional: listener_accuracy: 0.8498-0.852, bleu: 0.443-0.451
- same as above + 2 layers + dropout 0.5: listener_accuracy: 0.851, bleu: 0.451
- same as above + 2 layers + dropout 0.2: listener_accuracy: 0.8467, bleu: 0.4534
- same as above + 2 layers + dropout 0.8: listener_accuracy: 0.854, bleu: 0.447

The bidirectional variant **_does not always converge_**, so we are not investigating it further. Also, this model **_still requires the encoder and decoder to share the same hidden_dim. We address that later._**
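
The equal-hidden_dim requirement follows from how PyTorch's `nn.LSTM` shapes its state: `h_n` and `c_n` have shape `(num_directions * num_layers, batch, hidden_size)`, and the decoder expects an initial state of exactly that shape. A minimal sketch of the shape bookkeeping in plain Python (the helper names `lstm_state_shape` and `states_compatible` are hypothetical and not part of the course code):

```python
def lstm_state_shape(num_layers, bidirectional, batch_size, hidden_dim):
    """Shape of h_n (and c_n) returned by an LSTM with these settings."""
    num_directions = 2 if bidirectional else 1
    return (num_directions * num_layers, batch_size, hidden_dim)


def states_compatible(encoder_cfg, decoder_cfg, batch_size=4):
    """True if the encoder's final state can seed the decoder unchanged."""
    enc = lstm_state_shape(batch_size=batch_size, **encoder_cfg)
    dec = lstm_state_shape(batch_size=batch_size, **decoder_cfg)
    return enc == dec


same = dict(num_layers=2, bidirectional=False, hidden_dim=256)
small_encoder = dict(num_layers=2, bidirectional=False, hidden_dim=32)

print(states_compatible(same, same))           # True: equal dims
print(states_compatible(small_encoder, same))  # False: 32 vs. 256
```

A mismatch in the last dimension is what a projection between encoder and decoder would have to bridge.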

### 5. Pipeline decision

We see that `ExpColorizedInputDescriber` has all the properties of `ColorizedInputDescriber` and of the general context-color model we built earlier. On top of that, it lets us add more layers and regularization. **_So, we are proceeding with this model_**.

We would also like to test the following in our pipeline:
1. RoBERTa/BERT tokenization with a per-token embedding based model (does not perform well)
2. RoBERTa/BERT embeddings on tweet tokens (good results)

### 6. RoBERTa tokenizer vocab and embedding calculation

In [411]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    # find the full vocab on data
    from transformers import RobertaModel, RobertaTokenizer
    import vsm

    roberta_weights_name = 'roberta-large'
    roberta_tokenizer = RobertaTokenizer.from_pretrained(roberta_weights_name)
    roberta_model = RobertaModel.from_pretrained(roberta_weights_name)

In [414]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    def roberta_tokenize(s):
        s = s.lower()
        return roberta_tokenizer.tokenize(s)

    full_roberta_train_vocab, filled_full_roberta_train_word_seqs, full_roberta_train_removed_words = create_vocab_from_content_dataset(train_texts=full_texts_train, lowest_allowed_freq=1, tokenizer=roberta_tokenize)

In [425]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    ctr = 0
    def tf_phi(text, tf_tokenizer, tf_model):
        text_tf_ids = vsm.hf_encode(text, tf_tokenizer,
                                    add_special_tokens=True)

        text_tf_reps = vsm.hf_represent(text_tf_ids, tf_model, layer=-1) # -> (1, x, 1024)

        global ctr
        if ctr%100 == 0:
            print("Transformer encoding done for ctr ", ctr)
        ctr+=1

        return torch.mean(text_tf_reps[0], axis=0).cpu().numpy()

    # lookup to create embeddings
    def create_tf_embedding_lookup_on_text_vocab(vocab, tf_tokenizer, tf_model):
        corpora_word_tf_embedding = {word:tf_phi(word, tf_tokenizer, tf_model) for word in vocab}

        return corpora_word_tf_embedding

    def create_tf_embedding(vocab, tf_repr_lookup):
        tf_embedding, tf_vocab = utils.create_pretrained_embedding(tf_repr_lookup, vocab)
        return tf_embedding, tf_vocab

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    %time full_corpora_word_roberta_embedding = create_tf_embedding_lookup_on_text_vocab(full_roberta_train_vocab, roberta_tokenizer, roberta_model)

In [426]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_train_roberta_embedding, full_train_roberta_vocab = create_tf_embedding(full_roberta_train_vocab, full_corpora_word_roberta_embedding)


In [None]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_exp_roberta_embedding_color_input_mod = ExpColorizedInputDescriber(
        vocab=full_train_roberta_vocab,
        embedding=full_train_roberta_embedding,
        encoder_hidden_dim=256,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = full_exp_roberta_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_exp_roberta_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_exp_roberta_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

**_Clearly, transformer embeddings computed on tokenizer-split tokens and sub-tokens don't work well. This is mainly because the transformer tokenizer splits words into sub-word pieces whose representations are meant to be read in context, so a separate, context-independent representation of each sub-word token does not make much sense. We won't proceed with the same approach for BERT or other transformers._**
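
To make the point concrete: `tf_phi` above encodes each vocabulary item on its own (with `add_special_tokens=True`) and takes the mean of the final-layer states over all positions. A toy sketch of that pooling step, with made-up 2-d vectors standing in for model output (the numbers are illustrative only, not RoBERTa states):

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


# Made-up 2-d vectors standing in for the final-layer states of one
# vocabulary item encoded in isolation: <s>, two sub-tokens, </s>.
toy_states = [
    [0.0, 1.0],   # <s>
    [2.0, 3.0],   # first sub-token
    [4.0, 1.0],   # second sub-token
    [2.0, 3.0],   # </s>
]

static_vector = mean_pool(toy_states)
print(static_vector)  # [2.0, 2.0]
```

Whatever sentence the item later appears in, the decoder receives this one fixed vector, so none of the model's contextual information reaches it.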

### 7. RoBERTa embedding on sentiment-token words

In [431]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_roberta_train_vocab, filled_full_roberta_train_word_seqs, full_roberta_train_removed_words = create_vocab_from_content_dataset(train_texts=full_texts_train, lowest_allowed_freq=1, tokenizer=tweet_tokenize)

In [None]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    %time full_corpora_word_roberta_embedding = create_tf_embedding_lookup_on_text_vocab(full_roberta_train_vocab, roberta_tokenizer, roberta_model)

In [439]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_train_roberta_embedding, full_train_roberta_vocab = create_tf_embedding(full_roberta_train_vocab, full_corpora_word_roberta_embedding)

In [None]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_exp_roberta_embedding_color_input_mod = ExpColorizedInputDescriber(
        vocab=full_train_roberta_vocab,
        embedding=full_train_roberta_embedding,
        encoder_hidden_dim=256,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = full_exp_roberta_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_exp_roberta_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_exp_roberta_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

Scores:

**_Default settings + hidden_dim (256, 256) + 2 hidden layers, 0.8 dropout, unidirectional: listener_accuracy: 0.856, bleu: 0.458_**

### 9. Trying different hidden dims for encoder and decoder (Improvement on the Experimental Model above)

In [503]:
if 'IS_GRADESCOPE_ENV' not in os.environ:

    class VarColorContextDecoder(ColorContextDecoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

            self.rnn = nn.LSTM(
                input_size=self.embed_dim + self.color_dim,
                hidden_size=self.hidden_dim,
                num_layers=2,
                batch_first=True,
                dropout=0.5,
                bidirectional=False
            )
            self.output_layer = nn.Linear((int(self.rnn.bidirectional) + 1) * self.hidden_dim, self.vocab_size)

    class VarEncoder(Encoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

            self.rnn = nn.LSTM(
                input_size=self.color_dim,
                hidden_size=self.hidden_dim,
                num_layers=2,
                batch_first=True,
                dropout=0.5,
                bidirectional=False
            )

    class VarColorizedEncoderDecoder(ColorizedEncoderDecoder):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

            # Adapters
            self.encoder_decoder_adapter_h = nn.Linear(self.encoder.hidden_dim, self.decoder.hidden_dim)
            self.adapter_activation_h = nn.ReLU()
            self.dropout_h = nn.Dropout(p=0.5)

            self.encoder_decoder_adapter_c = nn.Linear(self.encoder.hidden_dim, self.decoder.hidden_dim)
            self.adapter_activation_c = nn.ReLU()
            self.dropout_c = nn.Dropout(p=0.5)

        def forward(self,
            color_seqs,
            word_seqs,
            seq_lengths=None,
            hidden=None,
            targets=None):

            if hidden is None:
                hidden = self.encoder(color_seqs)

            if hidden[0].shape[2] != self.decoder.hidden_dim:
                hidden_h = self.dropout_h(self.adapter_activation_h(self.encoder_decoder_adapter_h(hidden[0])))
                hidden_c = self.dropout_c(self.adapter_activation_c(self.encoder_decoder_adapter_c(hidden[1])))
                hidden = (hidden_h, hidden_c)

            # color_seqs -> (m, 3, color_dim); the target is the last color
            target_color_seq = color_seqs[:, 2, :] # -> (m, color_dim)
            target_color_seq = torch.squeeze(target_color_seq, dim=1) # no-op unless color_dim == 1

            output, hidden = self.decoder(
                word_seqs, seq_lengths=seq_lengths, hidden=hidden, target_colors=target_color_seq)

            if self.training:
                return output
            else:
                return output, hidden


    class VarColorizedInputDescriber(ColorizedInputDescriber):
        def __init__(self, encoder_hidden_dim=50, decoder_hidden_dim=50, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.encoder_hidden_dim = encoder_hidden_dim
            self.decoder_hidden_dim = decoder_hidden_dim

            # to easily find if self.hidden_dim is being used elsewhere
            self.hidden_dim = 0
            self.params += ['encoder_hidden_dim', 'decoder_hidden_dim']

        def build_graph(self):
            encoder = VarEncoder(
                color_dim=self.color_dim,
                hidden_dim=self.encoder_hidden_dim
            )

            decoder = VarColorContextDecoder(
                color_dim=self.color_dim,
                vocab_size=self.vocab_size,
                embed_dim=self.embed_dim,
                embedding=self.embedding,
                hidden_dim=self.decoder_hidden_dim,
                freeze_embedding=self.freeze_embedding
            )

            return VarColorizedEncoderDecoder(encoder, decoder)


In [488]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    print("Test result:\n", test_exp_full_system(VarColorizedInputDescriber))

Finished epoch 1000 of 1000; error is 0.11831463128328323

Test result:
 1.0


In [None]:
# trying this model to get a ballpark for its performance
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_var_bert_embedding_color_input_mod = VarColorizedInputDescriber(
        vocab=full_train_bert_vocab,
        embedding=full_train_bert_embedding,
        encoder_hidden_dim=32,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = full_var_bert_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_var_bert_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_var_bert_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])


Scores (listener_accuracy, bleu):

- default + 1 hidden_layer, 0 dropout, no adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.851, 0.4299
- **_default + 1 hidden_layer, 0 dropout, 0.5 adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.859, 0.433_**
- default + 1 hidden_layer, 0 dropout, 0.8 adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.852, 0.434
- default + 1 hidden_layer, 0 dropout, 0.2 adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.8586, 0.446
- default + 1 hidden_layer, 0 dropout, 0.5 adapter dropout, unidirectional, hidden_dim (32, 512) -> 0.855, x
- default + 1 hidden_layer, 0 dropout, 0.5 adapter dropout, unidirectional, hidden_dim (64, 256) -> 0.857, 0.44 (similar to before)
- **_default + 2 hidden_layer, 0.2 dropout, 0.5 adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.8594, 0.431_**
- **_default + 2 hidden_layer, 0.5 dropout, 0.5 adapter dropout, unidirectional, hidden_dim (32, 256) -> 0.86152, 0.4432_**
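
The adapter in `VarColorizedEncoderDecoder` is what makes the unequal `hidden_dim` runs above possible: it projects each layer's encoder state from `encoder_hidden_dim` to `decoder_hidden_dim`. A NumPy stand-in for the shape arithmetic only, assuming random weights, no training, and no dropout (the function name `adapt_state` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

num_layers, batch_size = 2, 5
encoder_hidden_dim, decoder_hidden_dim = 32, 256

# Final encoder state, shaped like the h_n of a 2-layer unidirectional LSTM.
encoder_h = rng.standard_normal((num_layers, batch_size, encoder_hidden_dim))

# Weights of a linear map from encoder width to decoder width
# (random here; in the model they are learned).
W = rng.standard_normal((encoder_hidden_dim, decoder_hidden_dim)) * 0.1
b = np.zeros(decoder_hidden_dim)


def adapt_state(state, W, b):
    """Linear projection followed by ReLU, applied to every layer and example."""
    return np.maximum(state @ W + b, 0.0)


decoder_h0 = adapt_state(encoder_h, W, b)
print(encoder_h.shape, "->", decoder_h0.shape)
```

In the notebook's version, dropout follows the ReLU, and a second adapter of the same form is applied to the cell state.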


In [None]:
# testing roberta on this
if 'IS_GRADESCOPE_ENV' not in os.environ:
    full_var_roberta_embedding_color_input_mod = VarColorizedInputDescriber(
        vocab=full_train_roberta_vocab,
        embedding=full_train_roberta_embedding,
        encoder_hidden_dim=32,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = full_var_roberta_embedding_color_input_mod.fit(full_cols_train, full_seqs_train)

    print("listener_accuracy:\n", full_var_roberta_embedding_color_input_mod.listener_accuracy(full_cols_test, full_seqs_test))
    print("bleu:\n", full_var_roberta_embedding_color_input_mod.corpus_bleu(full_cols_test, full_seqs_test)[0])

**_Score on a similar RoBERTa setup with hidden_dim (32, 512): listener_accuracy: 0.85998, bleu: 0.44581_**

With hidden_dim (32, 256) the score is: listener_accuracy: 0.8531, bleu: 0.4173

### 10. Super-sample the dataset (swap 1st and 2nd colors and keep the target text the same)

In [519]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    # full_rawcols, full_texts = zip(*[[ex.colors, ex.contents] for ex in full_examples])
    new_full_rawcols = [[colors[1], colors[0], colors[2]] for colors in full_rawcols]
    new_full_texts = full_texts
    sup_full_rawcols = full_rawcols + tuple(new_full_rawcols)
    sup_full_texts = full_texts + new_full_texts

    sup_full_rawcols_train, sup_full_rawcols_test, sup_full_texts_train, sup_full_texts_test = \
    train_test_split(sup_full_rawcols, sup_full_texts)

In [521]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    sup_full_seqs_train = [full_tokenize_example(s) for s in sup_full_texts_train]
    sup_full_seqs_test = [full_tokenize_example(s) for s in sup_full_texts_test]
    sup_full_cols_train = [represent_color_context(colors) for colors in sup_full_rawcols_train]
    sup_full_cols_test = [represent_color_context(colors) for colors in sup_full_rawcols_test]

In [None]:
# The super-sampled dataset has the same texts (only the two distractor colors are swapped), so the vocab and the embeddings stay the same.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    sup_full_var_bert_embedding_color_input_mod = VarColorizedInputDescriber(
        vocab=full_train_bert_vocab,
        embedding=full_train_bert_embedding,
        encoder_hidden_dim=32,
        decoder_hidden_dim=256,
        early_stopping=True)

    _ = sup_full_var_bert_embedding_color_input_mod.fit(sup_full_cols_train, sup_full_seqs_train)

    print("listener_accuracy:\n", sup_full_var_bert_embedding_color_input_mod.listener_accuracy(sup_full_cols_test, sup_full_seqs_test))
    print("bleu:\n", sup_full_var_bert_embedding_color_input_mod.corpus_bleu(sup_full_cols_test, sup_full_seqs_test)[0])


In [523]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    torch.save(sup_full_var_bert_embedding_color_input_mod, os.path.join("data", "colors", "sup_full_var_bert_embedding_color_input_mod.mod"))

In [None]:
# trying this on RoBERTa
# The super-sampled dataset has the same texts (only the two distractor colors are swapped), so the vocab and the embeddings stay the same.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    sup_full_var_roberta_embedding_color_input_mod = VarColorizedInputDescriber(
        vocab=full_train_roberta_vocab,
        embedding=full_train_roberta_embedding,
        encoder_hidden_dim=32,
        decoder_hidden_dim=512,
        early_stopping=True)

    _ = sup_full_var_roberta_embedding_color_input_mod.fit(sup_full_cols_train, sup_full_seqs_train)

    print("listener_accuracy:\n", sup_full_var_roberta_embedding_color_input_mod.listener_accuracy(sup_full_cols_test, sup_full_seqs_test))
    print("bleu:\n", sup_full_var_roberta_embedding_color_input_mod.corpus_bleu(sup_full_cols_test, sup_full_seqs_test)[0])


In [537]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    torch.save(sup_full_var_roberta_embedding_color_input_mod, os.path.join("data", "colors", "sup_full_var_roberta_embedding_color_input_mod.mod"))

**_Best model so far (BERT embedding based): listener_accuracy: 0.9172, bleu: 0.453_**

**_Better one (RoBERTa embedding based): listener_accuracy: 0.9264, bleu: 0.4671_**
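
One caveat on these two scores: the split in this section is drawn after super-sampling, so an example and its swapped copy can land on opposite sides of the train/test boundary, which may inflate dev accuracy relative to truly held-out data. A minimal sketch of the alternative order (split first, then augment only the training portion), using toy data and the hypothetical helpers `swap_distractors` and `split_then_augment`:

```python
import random


def swap_distractors(colors):
    """Swap the first two colors; the target (last) stays in place."""
    return [colors[1], colors[0], colors[2]]


def split_then_augment(rawcols, texts, test_fraction=0.25, seed=0):
    """Split into train/test first, then add swapped copies to train only."""
    indices = list(range(len(rawcols)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    cols_train = [rawcols[i] for i in train_idx]
    texts_train = [texts[i] for i in train_idx]
    # Augment the training portion only.
    cols_train += [swap_distractors(rawcols[i]) for i in train_idx]
    texts_train += [texts[i] for i in train_idx]

    cols_test = [rawcols[i] for i in test_idx]
    texts_test = [texts[i] for i in test_idx]
    return cols_train, cols_test, texts_train, texts_test


# Toy contexts: each "color" is just a label here.
toy_cols = [[f"d{i}a", f"d{i}b", f"t{i}"] for i in range(8)]
toy_texts = [f"utterance {i}" for i in range(8)]

cols_train, cols_test, texts_train, texts_test = split_then_augment(
    toy_cols, toy_texts)
print(len(cols_train), len(cols_test))  # 12 2
```

With this order, no test context has a swapped twin in the training data.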

In [538]:
def evaluate_original_system(trained_model, color_seqs_test, texts_test):

    tok_seqs = [full_tokenize_example(s) for s in texts_test]

    col_seqs = [represent_color_context(colors)
                for colors in color_seqs_test]

    evaluation = trained_model.evaluate(col_seqs, tok_seqs)

    return evaluation

In [539]:
my_evaluation = evaluate_original_system(sup_full_var_roberta_embedding_color_input_mod, sup_full_rawcols_test, sup_full_texts_test)



In [540]:
my_evaluation['listener_accuracy']

0.9261607864833808

In [541]:
my_evaluation['corpus_bleu']

0.46715179883476743

## Test results (run on the online judge):
**_'listener_accuracy': 0.9030034465780404, 'corpus_bleu': 0.6912203345869538_**

**_The average accuracy of humans at identifying the target color from descriptions written by other humans (in the SCC dataset) is 90%._**
This model is on par with human performance in that experiment.
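
For context on what is being compared with the human figure, here is a sketch of a listener-accuracy computation. It assumes the metric asks which color in the context makes the utterance most probable under the speaker when that color is placed in target position. The scoring function `toy_utterance_prob` is a hypothetical stand-in for the trained model, and the actual `listener_accuracy` method may differ in detail:

```python
def listener_choice(context, utterance, utterance_prob):
    """Index of the color that best explains the utterance as the target."""
    scores = []
    for i in range(len(context)):
        # Move candidate i to the last (target) position.
        others = [c for j, c in enumerate(context) if j != i]
        scores.append(utterance_prob(others + [context[i]], utterance))
    return max(range(len(scores)), key=scores.__getitem__)


def listener_accuracy(contexts, utterances, utterance_prob):
    """Fraction of examples where the true target (last color) is chosen."""
    correct = sum(
        listener_choice(ctx, utt, utterance_prob) == len(ctx) - 1
        for ctx, utt in zip(contexts, utterances))
    return correct / len(contexts)


def toy_utterance_prob(context, utterance):
    # Hypothetical speaker: favors utterances that name the target color.
    return 0.9 if utterance == context[-1] else 0.1


toy_contexts = [["red", "green", "blue"], ["pink", "teal", "gray"]]
toy_utterances = ["blue", "teal"]
print(listener_accuracy(toy_contexts, toy_utterances, toy_utterance_prob))  # 0.5
```

In the toy data the second utterance names a distractor, so the listener picks the wrong color and accuracy is 0.5.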

## Future Work
1. Reducing the vocab to include only effective tokens actually improves the English outputs. Since the accuracy did not improve much, we did not include this in the final pipeline, but it might yield a good bleu score along with a decent listener_accuracy.
2. We can train the ColorDescriber in a neural conversation with a neural listener (that replies back too). The dataset supports 2-way conversation.
3. We can pursue an ensemble of the well-performing models above by combining their probability distributions.
4. We can try character-level tokenization and representation too.
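
Item 3 could be read in several ways; one simple option is to average the next-token distributions of several trained describers at each decoding step. A toy sketch under that assumption (the vocabulary and probabilities are invented, and `average_distributions` is a hypothetical helper):

```python
def average_distributions(distributions, weights=None):
    """Weighted average of probability distributions over the same vocab."""
    if weights is None:
        weights = [1.0 / len(distributions)] * len(distributions)
    vocab_size = len(distributions[0])
    return [
        sum(w * dist[i] for w, dist in zip(weights, distributions))
        for i in range(vocab_size)
    ]


toy_vocab = ["blue", "green", "purple"]
# Invented next-token probabilities from two models for the same context.
model_a = [0.6, 0.3, 0.1]
model_b = [0.2, 0.7, 0.1]

combined = average_distributions([model_a, model_b])
best = toy_vocab[max(range(len(combined)), key=combined.__getitem__)]
print(best)  # green
```

The models being combined would need to share a vocabulary, which is not the case for the BERT- and RoBERTa-based vocabularies above unless they are aligned first.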

### References

1. [Monroe et al. 2016](https://nlp.stanford.edu/pubs/monroe2016color.pdf)
2. [Monroe et al. 2017](https://transacl.org/ojs/index.php/tacl/article/view/1142)
3. [Prof. Chris Potts' NLP Research Architecture](https://github.com/cgpotts/cs224u)