*What this is*: exploratory analysis of the natural language descriptions of ARC tasks and their ground-truth programs.

Simple baselines to run:
- Unigram model over the training data; report the likelihood of the programs under the resulting model.
- Linear/simple prediction model from the NL encoder vector to scores over the unigrams.
- Dot-product similarities between the NL encoder vector and the unigrams.
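
The first baseline can be sketched as follows. This is a minimal sketch rather than the notebook's implementation: it assumes programs are already flattened into whitespace-separated token strings, uses add-one smoothing with a reserved `<unk>` token, and the names `fit_unigram` and `log_likelihood` are hypothetical.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for tokens unseen in training


def fit_unigram(tokenized_programs):
    """Returns add-one-smoothed unigram log-probabilities over the training tokens."""
    counts = Counter(tok for prog in tokenized_programs for tok in prog.split())
    vocab = set(counts) | {UNK}
    total = sum(counts.values()) + len(vocab)
    return {tok: math.log((counts[tok] + 1) / total) for tok in vocab}


def log_likelihood(log_probs, tokenized_program):
    """Sums per-token log-probabilities; unseen tokens fall back to UNK."""
    return sum(log_probs.get(tok, log_probs[UNK]) for tok in tokenized_program.split())


# Toy training set (illustrative token strings, not real data).
train = ["to min grid $0", "grid to block $0 1"]
unigram = fit_unigram(train)
print(log_likelihood(unigram, "grid to block $0"))
```

Reporting log-likelihoods rather than raw likelihoods keeps long programs from underflowing to zero.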

In [23]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf

### Data exploration

In [7]:
DATA_DIRECTORY = "/Users/catwong/Desktop/zyzzyva/code/ec_arc/data/arc/"
LANGUAGE_DATA_FILE = "ManyProgramsPlusNlDescription.csv"

language_file = os.path.join(DATA_DIRECTORY, LANGUAGE_DATA_FILE)

# Encoder candidates: T5 (using only the encoder) or RoBERTa.

In [34]:
import csv

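# Maps task name -> (program, language). A task can appear on several rows,
# so append "_" until the key is unique and no earlier row is overwritten.
task_to_data = dict()
with open(language_file, 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        task_name, program, language = row["taskName"], row['program'], row['nlDescription']
        while task_name in task_to_data:
            task_name += "_"
        task_to_data[task_name] = (program, language)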
    


In [35]:
def tokenize_program(program):
    replace_tokens = ["lambda", "_", ")", "(", "false", "true"]
    for replace_token in replace_tokens:
        program = program.replace(replace_token, " ")
    program = " ".join(program.split())
    return program
    
for task_name in task_to_data:
    program, nl_0 = task_to_data[task_name]
    tokenized = tokenize_program(program)
    print(f"{nl_0}")
    print("\n")
    print(f"{tokenized}")
    print("\n\n")

look at both the left and right parts of the input grid. You will notice that the left and right parts are 3x3. For each square that is colored on both the left and right parts, color the output grid with red on the new 3x3.


overlap split blocks split grid $0 color logical $0 $1 red land



look at both the left and right parts of the input grid. You will notice that the left and right parts are 3x3. For each square that is colored on both the left and right parts, color the output grid with red on the new 3x3.


overlap split blocks split grid $0 is tile grow grid to block $0 1 color logical $1 $0 red land



make it the area that is a different color.


to min grid remove black b remove color grid to block $0 nth primary color grid to block $0 1



make it the area that is a different color.


blocks to min grid find blocks by color $0 nth primary color grid to block $0 2



you just need to make sure the colored image is exactly (top, bottom, left, right) in the new grid. No color

In [36]:
class DummyModel():
    def __init__(self):
        pass
    
    def fit(self, training_data):
        pass
    
    def evaluate_likelihood(self, language, ground_truth_program):
        return 0.0

def leave_one_out_evaluation(task_to_data, model):
    print(f"Running leave one out evaluation on {len(task_to_data)} tasks.")
    tasks_to_likelihoods = dict()
    for task in task_to_data:
        # Tuples are stored as (program, language); unpack in that order.
        ground_truth_program, language = task_to_data[task]
        training_tasks = {t : d for t, d in task_to_data.items() if t != task}
        model.fit(training_tasks)
        likelihood = model.evaluate_likelihood(language, ground_truth_program)
        tasks_to_likelihoods[task] = likelihood
    print(f"Average likelihood: {np.mean(list(tasks_to_likelihoods.values()))}")
    return tasks_to_likelihoods
        

tasks_to_likelihoods = leave_one_out_evaluation(task_to_data, DummyModel())

Running leave one out evaluation on 136 tasks.
Average likelihood: 0.0


### Large language model and linear prediction baseline

TODO:
- Encode the natural language into a vector.
- Fit a linear decoder from the natural language vector to unigram predictions.
- Convert the unigram scores into a unigram grammar.
- Evaluate the likelihood of a ground-truth program under the grammar.
- Training and evaluation loop: use leave-one-out prediction.
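
The score-to-grammar and likelihood steps can be sketched without the encoder. This is a minimal sketch, assuming the linear decoder emits one real-valued score per vocabulary token: a softmax turns the scores into unigram probabilities, and a program's log-likelihood is the sum of its tokens' log-probabilities. The function names and toy scores are hypothetical, and a real unigram grammar may carry type constraints that this ignores.

```python
import math


def scores_to_log_probs(scores):
    """Log-softmax over a dict mapping token -> real-valued score."""
    max_score = max(scores.values())  # subtract the max for numerical stability
    log_norm = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores.values())
    )
    return {tok: s - log_norm for tok, s in scores.items()}


def program_log_likelihood(log_probs, program_tokens):
    """Sum of token log-probabilities; assumes every token is in the vocabulary."""
    return sum(log_probs[tok] for tok in program_tokens)


# Toy decoder scores (made up for illustration).
toy_scores = {"grid": 2.0, "to": 1.0, "block": 0.5, "reflect": -1.0}
toy_log_probs = scores_to_log_probs(toy_scores)
print(program_log_likelihood(toy_log_probs, ["grid", "to", "block"]))
```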

### Language encoding utilities

In [38]:
demo_task = '67e8384a.json'
demo_program, demo_language = task_to_data[demo_task]
print(demo_language)

from transformers import pipeline, T5Tokenizer, TFT5EncoderModel
T5_MODEL = 't5-small' # Source: https://huggingface.co/transformers/pretrained_models.html 
ROBERTA_MODEL = 'distilroberta-base'
T5 = 't5'

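class LMUnigramPredictionLinear():
    """Encodes natural language as a fixed-size vector using a pretrained LM encoder.

    Intended as the input to a linear predictor over program unigrams (see TODO above).
    """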
    def __init__(self, lm_model_name):
        self.lm_model_name = lm_model_name
        if T5 in self.lm_model_name:
            self.tokenizer = T5Tokenizer.from_pretrained(self.lm_model_name)
            self.t5_encoder_model = TFT5EncoderModel.from_pretrained(self.lm_model_name)
            self.featurizer = self._t5_featurizer_fn
        else:
            self.featurizer = pipeline('feature-extraction', self.lm_model_name)
            
    def _t5_featurizer_fn(self, language):
        """Featurizes batch of sentences using a mean over the tokens in each sentence.
        args:
            language: [array of N sentences]
        ret: numpy array of size N x <HIDDEN_STATE_DIM>
        """
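        # Tokenize the whole batch; padding makes all sequences the same length.
        encoded = self.tokenizer(list(language), return_tensors="tf", padding=True, truncation=True)
        outputs = self.t5_encoder_model(encoded.input_ids, attention_mask=encoded.attention_mask)
        last_hidden_states = outputs.last_hidden_state
        # Mean over real tokens only: mask out padding positions before averaging.
        mask = tf.cast(encoded.attention_mask, last_hidden_states.dtype)[:, :, tf.newaxis]
        reduced = tf.reduce_sum(last_hidden_states * mask, axis=1) / tf.reduce_sum(mask, axis=1)
        return reduced.numpy()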
    
    def _featurize_programs(self, programs):
        # Placeholder: program featurization is not implemented yet.
        raise NotImplementedError
    
    def _featurize_language(self, language):
        outputs = self.featurizer(language)
        return outputs
    
    def fit(self, task_to_data):
        """
        task_to_data: dict from task_names to (program, language) for each task.
        """
        programs, language = zip(*task_to_data.values())
        featurized_language = self._featurize_language(language)
        print(featurized_language.shape)
        

model = LMUnigramPredictionLinear(T5_MODEL)
outputs = model._featurize_language([demo_language, demo_language])

model.fit(task_to_data)


copy-paste the pattern in each corner. The pattern on the bottoms flips parallel to the top.


Some layers from the model checkpoint at t5-small were not used when initializing TFT5EncoderModel: ['decoder']
- This IS expected if you are initializing TFT5EncoderModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFT5EncoderModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFT5EncoderModel were initialized from the model checkpoint at t5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5EncoderModel for predictions without further training.


(136, 512)


In [40]:
demo_program

'(lambda (to_min_grid (reflect (remove_black_b (move (reflect (remove_black_b (move (reflect (grid_to_block $0) true) 3 south true)) false) 3 east true)) true) true))'