# How to data

This notebook explains the structure of the data.

Please familiarize yourself with it before you start.

## Declare some helper functions

In [1]:
import json
from glob import glob
from tqdm.notebook import tqdm
import os
import random
import re


def read_json(path):
    """Read json file.

    Args
    ----
    path: path to the json

    Returns
    -------
    loaded: loaded dict

    """
    with open(path, "r") as stream:
        loaded = json.load(stream)

    return loaded


def atoi(text):
    """Convert a digit-only string to int; return any other string unchanged."""
    return int(text) if text.isdigit() else text


def natural_keys(text):
    """
    copied from https://stackoverflow.com/questions/5967500/how-to-correctly-sort-a-string-with-a-number-inside
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)

    """
    return [atoi(c) for c in re.split(r"(\d+)", text)]
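
To see why `natural_keys` is used for sorting the file paths, here is a small sketch that compares plain string sorting with natural sorting. The helper functions are repeated so the example runs on its own, and the file names are just examples in the naming scheme of this dataset.

```python
import re


def atoi(text):
    # Digit-only chunks become ints so they compare numerically.
    return int(text) if text.isdigit() else text


def natural_keys(text):
    # Split into digit and non-digit chunks, e.g. ['', 128, '_', 16, '.json'].
    return [atoi(c) for c in re.split(r"(\d+)", text)]


names = ["128_16.json", "128_2.json", "128_64.json", "128_8.json", "128_1.json"]

plain_sorted = sorted(names)
natural_sorted = sorted(names, key=natural_keys)

print(plain_sorted)
# ['128_1.json', '128_16.json', '128_2.json', '128_64.json', '128_8.json']
print(natural_sorted)
# ['128_1.json', '128_2.json', '128_8.json', '128_16.json', '128_64.json']
```

With plain string sorting, `128_16.json` comes before `128_2.json`; with natural sorting the files are ordered by memory capacity.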

## Load all of the data to `data_all`

The files are named `<maximum_history>_<memory_capacity>.json`.

`<maximum_history>` is the number of observations. For example, a value of 128
means that the agent has observed the objects and their locations 128 times.

The value of `<maximum_history>` is fixed to 128 for simplicity.

`<memory_capacity>` is the maximum number of memories per memory system.
For example, a value of 2 means that the agent has 2 episodic and 2
semantic memories in its brain. The more memories the agent has, the more likely
it is to answer a given question correctly.

Every `json` file has two data splits: `val` and `test`. You should try
various templates, choose the template that has the best score on the `val` split, and then
run the same template on the `test` split. Report the scores of all the templates you tried
so that they can be compared.

You don't have to merge the files, since they are all different datasets. At the moment,
there are 7 json files, so you will report validation and test scores on all seven of them.

All 7 files include the scores of the hand-crafted models. For example, if you read `128_64.json`,
you'll see the following key-value pair:

```
"accuracy": {
    "val": 0.96875,
    "test": 0.9921875
}
```

The accuracy on both the validation and test splits is close to 1.0, which means that the model was able
to answer most of the questions. This makes sense because it has a lot of memories (i.e., 64 per memory system).

Let's also look at the performance of the hand-crafted model on `128_8.json`:

```
"accuracy": {
    "val": 0.3046875,
    "test": 0.4375
}
```
As you can see, this model performed worse than the one that has 64 memories per system.
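
To compare the hand-crafted scores across files, a small table sorted by memory capacity can help. The sketch below is only an illustration: `parse_filename`, `accuracy_table`, and `example_scores` are hypothetical names, and the dictionary holds just the two accuracy entries quoted above instead of the real files.

```python
import os


def parse_filename(path):
    """Split '<maximum_history>_<memory_capacity>.json' into two ints."""
    stem = os.path.splitext(os.path.basename(path))[0]
    maximum_history, memory_capacity = stem.split("_")
    return int(maximum_history), int(memory_capacity)


def accuracy_table(scores_by_path):
    """Return (memory_capacity, val, test) rows sorted by memory capacity."""
    rows = []
    for path, data in scores_by_path.items():
        _, memory_capacity = parse_filename(path)
        rows.append(
            (memory_capacity, data["accuracy"]["val"], data["accuracy"]["test"])
        )
    return sorted(rows)


# Stand-in for the loaded files; only the "accuracy" entries quoted above.
example_scores = {
    "../data/original/128_64.json": {"accuracy": {"val": 0.96875, "test": 0.9921875}},
    "../data/original/128_8.json": {"accuracy": {"val": 0.3046875, "test": 0.4375}},
}

table = accuracy_table(example_scores)
for memory_capacity, val, test in table:
    print(f"capacity={memory_capacity}\tval={val}\ttest={test}")
```

The same function should work on the full dictionary of loaded files, as long as each entry has an `accuracy` key with `val` and `test` values.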

In the cell below, we'll load the data and see what's actually inside.

In [2]:
data_all = {}

paths = glob("../data/original/*.json")
paths.sort(key=natural_keys)
for path in tqdm(paths):
    data = read_json(path)

    print(path, "\t", list(data.keys()))
    data_all[path] = data

    maximum_history = os.path.basename(path).split("_")[0]
    memory_capacity = os.path.basename(path).split("_")[1].split(".json")[0]

    print(
        f"maximum_history: {maximum_history}\nmemory_capacity per system: {memory_capacity}\n"
    )

../data/original/128_1.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 1

../data/original/128_2.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 2

../data/original/128_4.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 4

../data/original/128_8.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 8

../data/original/128_16.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 16

../data/original/128_32.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_history: 128
memory_capacity per system: 32

../data/original/128_64.json 	 ['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']
maximum_his

In [3]:
data.keys(), data["accuracy"]

(dict_keys(['val', 'test', 'max_history', 'capacity', 'rewards', 'accuracy']),
 {'val': 0.953125, 'test': 0.9453125})

## Let's take a look at a random data point.

In [4]:
random_point = random.choice(data_all["../data/original/128_4.json"]["val"])

## This has five {key:value} pairs

In [5]:
from pprint import pprint

pprint(random_point)

{'correct_answer': 'garage',
 'episodic_memory_system': [["James's bus",
                             'AtLocation',
                             "James's city",
                             1008891552.5016184],
                            ["David's bench",
                             'AtLocation',
                             "David's garden",
                             1008977952.5017927],
                            ["James's boat",
                             'AtLocation',
                             "James's water",
                             1009064352.5021348],
                            ["David's motorcycle",
                             'AtLocation',
                             "David's garage",
                             1009150752.5025997]],
 'prediction_hand_crafted': 'garage',
 'question': ["James's motorcycle", 'AtLocation'],
 'semantic_memory_system': [['boat', 'AtLocation', 'water', 4],
                            ['person', 'AtLocation', 'building', 2],
     

This sample has at most 4 episodic memories.

In [6]:
random_point["episodic_memory_system"]

[["James's bus", 'AtLocation', "James's city", 1008891552.5016184],
 ["David's bench", 'AtLocation', "David's garden", 1008977952.5017927],
 ["James's boat", 'AtLocation', "James's water", 1009064352.5021348],
 ["David's motorcycle", 'AtLocation', "David's garage", 1009150752.5025997]]

It can also have at most 4 semantic memories.

In [7]:
random_point["semantic_memory_system"]

[['boat', 'AtLocation', 'water', 4],
 ['person', 'AtLocation', 'building', 2],
 ['bench', 'AtLocation', 'garden', 2],
 ['motorcycle', 'AtLocation', 'garage', 6]]

The cell below shows the question.

In [8]:
random_point["question"]

["James's motorcycle", 'AtLocation']

Now your job is to construct a prompt using (1) the episodic memory system, (2) the semantic memory system, and (3) the question.

You can make a prompt like the one built below. Be as creative as possible!

Let's first convert the Unix timestamps to something easier to read.


Just to be sure, let's first sort the list of memories by timestamp.

In [9]:
random_point["episodic_memory_system"] = sorted(
    random_point["episodic_memory_system"], key=lambda x: x[-1]
)

In [10]:
random_point["episodic_memory_system"]

[["James's bus", 'AtLocation', "James's city", 1008891552.5016184],
 ["David's bench", 'AtLocation', "David's garden", 1008977952.5017927],
 ["James's boat", 'AtLocation', "James's water", 1009064352.5021348],
 ["David's motorcycle", 'AtLocation', "David's garage", 1009150752.5025997]]

Alright, it's ordered nicely.

Let's make the timestamps more human readable!

In [11]:
# Replace each Unix timestamp with a relative label. The number of days is
# derived from the position in the sorted list, not from the timestamp value,
# so the last (most recent) memory becomes "today".
num_memories = len(random_point["episodic_memory_system"])
for idx, mem in enumerate(random_point["episodic_memory_system"]):
    days = num_memories - idx - 1
    if days == 0:
        timestamp = "today"
    else:
        timestamp = f"{days} days ago"
    mem[-1] = timestamp

In [12]:
random_point["episodic_memory_system"]

[["James's bus", 'AtLocation', "James's city", '3 days ago'],
 ["David's bench", 'AtLocation', "David's garden", '2 days ago'],
 ["James's boat", 'AtLocation', "James's water", '1 days ago'],
 ["David's motorcycle", 'AtLocation', "David's garage", 'today']]

Alright, the episodic memory system looks nice!
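
The cell above derives the number of days from the position of each memory in the list. As an alternative, the labels can be computed from the timestamps themselves. The sketch below assumes that the timestamps are Unix seconds and that "today" is the day of the most recent memory; `relative_days` is a hypothetical helper, not part of the dataset code. It also writes "1 day ago" instead of "1 days ago".

```python
SECONDS_PER_DAY = 86400


def relative_days(timestamps):
    """Turn Unix timestamps into labels relative to the most recent one."""
    latest = max(timestamps)
    labels = []
    for ts in timestamps:
        # Round to whole days; the sample timestamps are about one day apart.
        days = round((latest - ts) / SECONDS_PER_DAY)
        if days == 0:
            labels.append("today")
        elif days == 1:
            labels.append("1 day ago")
        else:
            labels.append(f"{days} days ago")
    return labels


# Timestamps of the four episodic memories shown above.
timestamps = [
    1008891552.5016184,
    1008977952.5017927,
    1009064352.5021348,
    1009150752.5025997,
]

labels = relative_days(timestamps)
print(labels)
# ['3 days ago', '2 days ago', '1 day ago', 'today']
```

This version still gives sensible labels when two memories fall on the same day or when some days have no memory at all.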

Now let's create a prompt using (1) the episodic memory system, (2) the semantic memory system, and (3) the question.

In [13]:
prompt = []

for mem in random_point["episodic_memory_system"]:
    prompt.append(f"{mem[0]} was at {mem[2]}, {mem[3]}.")

for mem in random_point["semantic_memory_system"]:
    prompt.append(f"{mem[-1]} {mem[0]} were found at {mem[2]}.")

prompt.append(f"Where is {random_point['question'][0]}?")

prompt = " ".join(prompt)

print(prompt)

James's bus was at James's city, 3 days ago. David's bench was at David's garden, 2 days ago. James's boat was at James's water, 1 days ago. David's motorcycle was at David's garage, today. 4 boat were found at water. 2 person were found at building. 2 bench were found at garden. 6 motorcycle were found at garage. Where is James's motorcycle?


Again, use your own creativity to construct prompts.
If you can't fit all the memories into one prompt due to memory constraints, come up
with your own strategy for removing some of them.
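
As one possible starting point, the sketch below keeps only a fixed number of memories: first those that mention the queried object type, then the most recent (episodic) or most frequent (semantic) ones. This is just one heuristic under assumed budgets, not a recommended strategy; `select_memories` and its parameters are hypothetical names.

```python
def object_type(name):
    """Strip the owner, e.g. "James's motorcycle" -> "motorcycle"."""
    return name.split("'s ")[-1]


def select_memories(episodic, semantic, question, max_episodic, max_semantic):
    """Keep the memories that look most useful for the question."""
    queried = object_type(question[0])

    # Episodic: memories about the same object type first, then the newest.
    ranked = sorted(
        episodic,
        key=lambda mem: (object_type(mem[0]) == queried, mem[3]),
        reverse=True,
    )
    # Restore chronological order for the kept memories.
    kept_episodic = sorted(ranked[:max_episodic], key=lambda mem: mem[3])

    # Semantic: memories about the same object type first, then highest count.
    kept_semantic = sorted(
        semantic,
        key=lambda mem: (mem[0] == queried, mem[3]),
        reverse=True,
    )[:max_semantic]

    return kept_episodic, kept_semantic


# The sample shown above, with the original numeric timestamps.
episodic = [
    ["James's bus", "AtLocation", "James's city", 1008891552.5016184],
    ["David's bench", "AtLocation", "David's garden", 1008977952.5017927],
    ["James's boat", "AtLocation", "James's water", 1009064352.5021348],
    ["David's motorcycle", "AtLocation", "David's garage", 1009150752.5025997],
]
semantic = [
    ["boat", "AtLocation", "water", 4],
    ["person", "AtLocation", "building", 2],
    ["bench", "AtLocation", "garden", 2],
    ["motorcycle", "AtLocation", "garage", 6],
]
question = ["James's motorcycle", "AtLocation"]

kept_episodic, kept_semantic = select_memories(
    episodic, semantic, question, max_episodic=2, max_semantic=1
)
print(kept_episodic)
print(kept_semantic)
```

Here the two kept episodic memories are the boat and the motorcycle, and the kept semantic memory is the one about motorcycles. The function expects numeric timestamps, so apply it before converting them to labels such as "today".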

Let's take this prompt to the Hugging Face T0pp model and see whether it can answer the question properly.

In [14]:
# https://huggingface.co/bigscience/T0pp?text=Mary%27s+knife+was+at+Mary%27s+forest%2C+3+days+ago.+Charles%27s+cat+was+at+Charles%27s+lap%2C+2+days+ago.+Karen%27s+cow+was+at+Karen%27s+countryside%2C+1+days+ago.+Thomas%27s+bicycle+was+at+Thomas%27s+town%2C+today.+2+toothbrush+were+found+at+suitcase.+Where+is+Charles%27s+umbrella%3F
# The answer I get from t0pp is "at Charles's lap"

In [15]:
random_point["correct_answer"]

'garage'

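The correct answer for this sample is "garage". Note that the T0pp query in the cell above was made with a different random sample (it asks about Charles's umbrella), whose correct answer was "closet". In that case, the prediction was wrong.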

Global accuracy (i.e., `(TP + TN) / (TP + FP + TN + FN)`) is good enough for evaluation.
We don't need a weighted F1 score, since the classes are fairly balanced anyway.
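
For this question-answering task, global accuracy is simply the fraction of questions answered correctly. A minimal sketch, assuming exact string match between the prediction and `correct_answer`; `global_accuracy` is a hypothetical helper, and the toy samples only mimic the keys of a real sample.

```python
def global_accuracy(predictions, answers):
    """Fraction of predictions that exactly match the correct answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(pred == ans for pred, ans in zip(predictions, answers))
    return correct / len(answers)


# Toy samples with the same keys as the entries of a `val` or `test` split.
samples = [
    {"correct_answer": "garage", "prediction_hand_crafted": "garage"},
    {"correct_answer": "closet", "prediction_hand_crafted": "lap"},
    {"correct_answer": "water", "prediction_hand_crafted": "water"},
    {"correct_answer": "garden", "prediction_hand_crafted": "garden"},
]

acc = global_accuracy(
    [s["prediction_hand_crafted"] for s in samples],
    [s["correct_answer"] for s in samples],
)
print(acc)  # 0.75
```

A language model may answer with a phrase such as "at Charles's lap" rather than a single word, so you will probably need to normalize its output before comparing it with `correct_answer`.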

Tae's hand-crafted models are also evaluated with global accuracy, so you can compare
your results directly to his.

You should try your N templates and get N scores on the `val` split. Choose the template
that has the best score on the `val` split and then run it on the `test` split to get
the test scores.
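
The selection procedure can be sketched as follows. Everything here is a toy under stated assumptions: the two templates, the samples, and `dummy_predict` (a stand-in for a real language model call) are hypothetical and only serve to make the example runnable.

```python
def episodic_template(sample):
    parts = [f"{m[0]} was at {m[2]}." for m in sample["episodic_memory_system"]]
    parts.append(f"Where is {sample['question'][0]}?")
    return " ".join(parts)


def semantic_template(sample):
    parts = [
        f"{m[3]} {m[0]} were found at {m[2]}."
        for m in sample["semantic_memory_system"]
    ]
    parts.append(f"Where is {sample['question'][0]}?")
    return " ".join(parts)


def dummy_predict(prompt):
    # Stand-in for a model: return the location of the last memory in the prompt.
    location = prompt.rsplit(" at ", 1)[-1].split(".")[0]
    return location.split("'s ")[-1]


def evaluate(template, samples, predict):
    correct = sum(predict(template(s)) == s["correct_answer"] for s in samples)
    return correct / len(samples)


def select_template(templates, val, test, predict):
    """Pick the best template on `val`, then score only that one on `test`."""
    val_scores = {name: evaluate(t, val, predict) for name, t in templates.items()}
    best = max(val_scores, key=val_scores.get)
    return best, val_scores, evaluate(templates[best], test, predict)


def make_sample(episodic, semantic, question, answer):
    return {
        "episodic_memory_system": episodic,
        "semantic_memory_system": semantic,
        "question": question,
        "correct_answer": answer,
    }


val = [
    make_sample(
        [["David's motorcycle", "AtLocation", "David's garage", 1.0]],
        [["motorcycle", "AtLocation", "garage", 6]],
        ["James's motorcycle", "AtLocation"],
        "garage",
    ),
    make_sample(
        [["James's boat", "AtLocation", "James's water", 2.0]],
        [["bench", "AtLocation", "garden", 2]],
        ["David's bench", "AtLocation"],
        "garden",
    ),
]
test = [
    make_sample(
        [["James's bus", "AtLocation", "James's city", 3.0]],
        [["bus", "AtLocation", "city", 3]],
        ["David's bus", "AtLocation"],
        "city",
    ),
]

templates = {"episodic": episodic_template, "semantic": semantic_template}
best, val_scores, test_score = select_template(templates, val, test, dummy_predict)
print(best, val_scores, test_score)
```

The point of the design is that the `test` split is touched only once, by the template that won on `val`, while the `val` scores of all templates are kept for reporting.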