#### Reason for these tests
An issue was raised in [ISSUE_1](https://github.com/frankaging/Reason-SCAN/issues/1), where the reporter found some discrepancies in the split numbers. Specifically, the `test` split in our main data frame does not match our sub-test splits `p1`, `p2` and `p3`. The report also exposes a second problem with our documentation of the splits (i.e., how we generate them). We use this live debug notebook to address these comments.

#### The Issue

In [12]:
import os, json
p1_test_path_to_data = "../../ReaSCAN-v1.0/ReaSCAN-compositional-p1-test/data-compositional-splits.txt"
print(f"Reading dataset from file: {p1_test_path_to_data}...")
p1_test_data = json.load(open(p1_test_path_to_data, "r"))
print(len(p1_test_data["examples"]["test"]))

p2_test_path_to_data = "../../ReaSCAN-v1.0/ReaSCAN-compositional-p2-test/data-compositional-splits.txt"
print(f"Reading dataset from file: {p2_test_path_to_data}...")
p2_test_data = json.load(open(p2_test_path_to_data, "r"))
print(len(p2_test_data["examples"]["test"]))

p3_test_path_to_data = "../../ReaSCAN-v1.0/ReaSCAN-compositional-p3-test/data-compositional-splits.txt"
print(f"Reading dataset from file: {p3_test_path_to_data}...")
p3_test_data = json.load(open(p3_test_path_to_data, "r"))
print(len(p3_test_data["examples"]["test"]))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-p1-test/data-compositional-splits.txt...
921
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-p2-test/data-compositional-splits.txt...
2120
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-p3-test/data-compositional-splits.txt...
2712


In [15]:
len(p1_test_data["examples"]["test"]) + len(p2_test_data["examples"]["test"]) + len(p3_test_data["examples"]["test"])

5753

In [6]:
ReaSCAN_path_to_data = "../../ReaSCAN-v1.0/ReaSCAN-compositional/data-compositional-splits.txt"
print(f"Reading dataset from file: {ReaSCAN_path_to_data}...")
ReaSCAN_data = json.load(open(ReaSCAN_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional/data-compositional-splits.txt...


In [8]:
p1_test_example_filtered = []
p2_test_example_filtered = []
p3_test_example_filtered = []
for example in ReaSCAN_data["examples"]["test"]:
    if example['derivation'] == "$OBJ_0":
        p1_test_example_filtered += [example]
    elif example['derivation'] == "$OBJ_0 ^ $OBJ_1":
        p2_test_example_filtered += [example]
    elif example['derivation'] == "$OBJ_0 ^ $OBJ_1 & $OBJ_2":
        p3_test_example_filtered += [example]
print(f"p1 test example count={len(p1_test_example_filtered)}")
print(f"p2 test example count={len(p2_test_example_filtered)}")
print(f"p3 test example count={len(p3_test_example_filtered)}")

p1 test example count=907
p2 test example count=2122
p3 test example count=2724


In [10]:
len(p1_test_example_filtered) + len(p2_test_example_filtered) + len(p3_test_example_filtered)

5753

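For instance, `p1 test example count` should equal `921`, the size of the `p1` sub-test split, but filtering the main `test` split gives `907`. The `p2` counts (`2120` vs. `2122`) and the `p3` counts (`2712` vs. `2724`) differ as well, although the totals match (`5753` in both cases). The **root cause** is potentially that our sub-test splits were created asynchronously from the test split in the main data, so the two are out of sync.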
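The per-pattern comparison can be condensed into one helper. This is a minimal sketch, assuming each example is a dict with a `derivation` key as in the cells above; the names `count_by_derivation` and `count_mismatches` and the toy data are hypothetical and not part of the ReaSCAN code base.

```python
from collections import Counter

# Derivation strings used in this notebook to identify the p1/p2/p3 patterns.
PATTERN_BY_DERIVATION = {
    "$OBJ_0": "p1",
    "$OBJ_0 ^ $OBJ_1": "p2",
    "$OBJ_0 ^ $OBJ_1 & $OBJ_2": "p3",
}

def count_by_derivation(examples):
    """Count examples per pattern; derivations outside the map are ignored."""
    counts = Counter()
    for example in examples:
        pattern = PATTERN_BY_DERIVATION.get(example["derivation"])
        if pattern is not None:
            counts[pattern] += 1
    return counts

def count_mismatches(main_counts, sub_split_sizes):
    """Map pattern -> (count in main test split, sub-test split size) where they differ."""
    return {
        pattern: (main_counts.get(pattern, 0), size)
        for pattern, size in sub_split_sizes.items()
        if main_counts.get(pattern, 0) != size
    }

# Toy data (hypothetical), not the real ReaSCAN files.
toy_main_test = [
    {"derivation": "$OBJ_0"},
    {"derivation": "$OBJ_0 ^ $OBJ_1"},
    {"derivation": "$OBJ_0 ^ $OBJ_1"},
]
toy_sub_sizes = {"p1": 2, "p2": 1, "p3": 0}
mismatches = count_mismatches(count_by_derivation(toy_main_test), toy_sub_sizes)
print(mismatches)
```

On the real data, `sub_split_sizes` would hold the lengths of the three sub-test splits, and any non-empty result flags a pattern whose counts disagree.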

Before confirming the **root cause**, we first need to analyze the actual **impact** on performance numbers. Does it change our results qualitatively, or only quantitatively? We devise some tests around this issue, starting with basic checks and moving to more complex ones.

#### Test-1: Validity
We need to ensure our sub-test splits contain **only** commands that appear in the training set. Otherwise, our test splits would become compositional splits.

In [16]:
train_command_set = set([])
for example in ReaSCAN_data["examples"]["train"]:
    train_command_set.add(example["command"])

In [20]:
for example in p1_test_data["examples"]["test"]:
    assert example["command"] in train_command_set
for example in p2_test_data["examples"]["test"]:
    assert example["command"] in train_command_set
for example in p3_test_data["examples"]["test"]:
    assert example["command"] in train_command_set
print("Test-1 Passed")

Test-1 Passed
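
Because `assert` stops at the first failure, a variant that collects every unseen command can be more informative when a check does fail. This is a minimal sketch; the helper name `unseen_commands` and the toy commands are hypothetical.

```python
def unseen_commands(test_examples, train_examples):
    """Return the sorted list of distinct test commands that never occur in training."""
    train_commands = {example["command"] for example in train_examples}
    return sorted({
        example["command"]
        for example in test_examples
        if example["command"] not in train_commands
    })

# Toy data (hypothetical), using the comma-separated command format seen in this notebook.
toy_train = [{"command": "walk,to,the,red,circle"}, {"command": "push,the,square"}]
toy_test = [{"command": "push,the,square"}, {"command": "pull,the,box"}]
missing = unseen_commands(toy_test, toy_train)
print(missing)
```

An empty list corresponds to the passing assertions above.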


#### Test-2: Overestimating?
What about the shape world? Do any test examples (command together with its world) also appear in the training set?

In [27]:
import hashlib
train_example_hash = set([])
for example in ReaSCAN_data["examples"]["train"]:
    example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
    train_example_hash.add(example_hash_object.hexdigest())
assert len(train_example_hash) == len(ReaSCAN_data["examples"]["train"])

In [39]:
p1_test_example_hash = set([])
for example in p1_test_data["examples"]["test"]:
    example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
    p1_test_example_hash.add(example_hash_object.hexdigest())
assert len(p1_test_example_hash) == len(p1_test_data["examples"]["test"])

p2_test_example_hash = set([])
for example in p2_test_data["examples"]["test"]:
    example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
    p2_test_example_hash.add(example_hash_object.hexdigest())
assert len(p2_test_example_hash) == len(p2_test_data["examples"]["test"])

p3_test_example_hash = set([])
for example in p3_test_data["examples"]["test"]:
    example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
    p3_test_example_hash.add(example_hash_object.hexdigest())
assert len(p3_test_example_hash) == len(p3_test_data["examples"]["test"])

In [40]:
p1_test_dup_count = 0
for hash_str in p1_test_example_hash:
    if hash_str in train_example_hash:
        p1_test_dup_count += 1
        
p2_test_dup_count = 0
for hash_str in p2_test_example_hash:
    if hash_str in train_example_hash:
        p2_test_dup_count += 1

p3_test_dup_count = 0
for hash_str in p3_test_example_hash:
    if hash_str in train_example_hash:
        p3_test_dup_count += 1

In [41]:
print(f"p1_test_dup_count={p1_test_dup_count}")
print(f"p2_test_dup_count={p2_test_dup_count}")
print(f"p3_test_dup_count={p3_test_dup_count}")

p1_test_dup_count=858
p2_test_dup_count=1982
p3_test_dup_count=2548


In [42]:
main_p1_test_example_hash = set([])
for example in p1_test_example_filtered:
    example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
    main_p1_test_example_hash.add(example_hash_object.hexdigest())
assert len(main_p1_test_example_hash) == len(p1_test_example_filtered)

In [43]:
main_p1_test_dup_count = 0
for hash_str in main_p1_test_example_hash:
    if hash_str in train_example_hash:
        main_p1_test_dup_count += 1

In [45]:
print(f"main_p1_test_dup_count={main_p1_test_dup_count}")

main_p1_test_dup_count=0


**Conclusion**: Yes. As you can see, many examples in our random test splits are duplicates of training examples (858 of 921 for `p1`, 1982 of 2120 for `p2`, and 2548 of 2712 for `p3`), whereas the `p1` examples filtered from the main `test` split contain none. This means that we need to use updated test splits for evaluating performance. As a result, **Table 3** in the paper needs to be updated, since it currently overestimates model performance on the non-generalizing test splits (e.g., `p1`, `p2` and `p3`).

**Action Required**: Re-evaluate model performance on those splits.
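
The duplicate counts in Test-2 hash `json.dumps(example)` directly. Since `json.dumps` keeps dictionary insertion order by default, two examples with identical content but different key order would receive different hashes. The sketch below uses `sort_keys=True` to hash a canonical form and a set intersection to count overlaps. The helper names and the toy examples (including the `grid_size` key) are hypothetical, and whether this changes any count on the real data has not been checked here.

```python
import hashlib
import json

def example_hash(example):
    """Hash an example by its canonical JSON form (keys sorted at every level)."""
    canonical = json.dumps(example, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def duplicate_count(test_examples, train_examples):
    """Number of distinct test examples that also appear in the training examples."""
    train_hashes = {example_hash(e) for e in train_examples}
    test_hashes = {example_hash(e) for e in test_examples}
    return len(test_hashes & train_hashes)

# Toy data (hypothetical): the first test example matches training up to key order.
toy_train = [{"command": "walk", "situation": {"grid_size": 6}}]
toy_test = [
    {"situation": {"grid_size": 6}, "command": "walk"},
    {"command": "push", "situation": {"grid_size": 6}},
]
dup = duplicate_count(toy_test, toy_train)
print(dup)
```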

#### Test-3: Does this issue affect any other generalization splits?
Do our generalization splits contain duplicates of training examples?

In [48]:
def get_example_hash_set(split):
    split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-{split}/data-compositional-splits.txt"
    print(f"Reading dataset from file: {split_test_path_to_data}...")
    split_test_data = json.load(open(split_test_path_to_data, "r"))
    split_test_data_test_example_hash = set([])
    for example in split_test_data["examples"]["test"]:
        example_hash_object = hashlib.md5(json.dumps(example).encode('utf-8'))
        split_test_data_test_example_hash.add(example_hash_object.hexdigest())
    assert len(split_test_data_test_example_hash) == len(split_test_data["examples"]["test"])
    return split_test_data_test_example_hash
    

In [50]:
a1_hash = get_example_hash_set("a1")
a2_hash = get_example_hash_set("a2")
a3_hash = get_example_hash_set("a3")

b1_hash = get_example_hash_set("b1")
b2_hash = get_example_hash_set("b2")

c1_hash = get_example_hash_set("c1")
c2_hash = get_example_hash_set("c2")

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a1/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a2/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a3/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-b1/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-b2/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-c1/data-compositional-splits.txt...
Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-c2/data-compositional-splits.txt...


In [53]:
a1_dup_count = 0
for hash_str in a1_hash:
    if hash_str in train_example_hash:
        a1_dup_count += 1
a2_dup_count = 0
for hash_str in a2_hash:
    if hash_str in train_example_hash:
        a2_dup_count += 1
a3_dup_count = 0
for hash_str in a3_hash:
    if hash_str in train_example_hash:
        a3_dup_count += 1

In [54]:
print(f"a1_dup_count={a1_dup_count}")
print(f"a2_dup_count={a2_dup_count}")
print(f"a3_dup_count={a3_dup_count}")

a1_dup_count=0
a2_dup_count=0
a3_dup_count=0


In [55]:
b1_dup_count = 0
for hash_str in b1_hash:
    if hash_str in train_example_hash:
        b1_dup_count += 1
b2_dup_count = 0
for hash_str in b2_hash:
    if hash_str in train_example_hash:
        b2_dup_count += 1

In [56]:
print(f"b1_dup_count={b1_dup_count}")
print(f"b2_dup_count={b2_dup_count}")

b1_dup_count=0
b2_dup_count=0


In [57]:
c1_dup_count = 0
for hash_str in c1_hash:
    if hash_str in train_example_hash:
        c1_dup_count += 1
c2_dup_count = 0
for hash_str in c2_hash:
    if hash_str in train_example_hash:
        c2_dup_count += 1

In [58]:
print(f"c1_dup_count={c1_dup_count}")
print(f"c2_dup_count={c2_dup_count}")

c1_dup_count=0
c2_dup_count=0


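**Conclusion**: No. None of the generalization splits (`a1` to `a3`, `b1`, `b2`, `c1`, `c2`) shares an example with the training set.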

#### Test-4: What about correctness of generalization splits in general?
We see there are no duplicates, but what about general correctness? Are the splits created correctly? In this section, we add more sanity checks to show the correctness of each generalization split.

For each split, we verify two things:
* the generalization split can ONLY contain test examples that it is designed to test.
* the training split DOES NOT contain examples that are aligned with the generalization split.
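
Both conditions can be expressed with a single predicate per split. The following is a minimal sketch under that assumption; `check_split` and `has_yellow_square` are hypothetical names, and the toy commands are not taken from the dataset.

```python
def check_split(test_examples, train_examples, is_target):
    """Return (test examples lacking the target property, train examples having it).

    Both lists are empty when the split satisfies the two conditions.
    """
    stray_test = [e for e in test_examples if not is_target(e)]
    leaked_train = [e for e in train_examples if is_target(e)]
    return stray_test, leaked_train

def has_yellow_square(example):
    """Predicate for a split that targets commands mentioning a yellow square."""
    return "yellow,square" in example["command"]

# Toy data (hypothetical): one training command leaks the target phrase.
toy_test = [{"command": "walk,to,the,yellow,square"}]
toy_train = [{"command": "walk,to,the,red,circle"}, {"command": "push,the,yellow,square"}]
stray, leaked = check_split(toy_test, toy_train, has_yellow_square)
print(len(stray), len(leaked))
```

Returning the offending examples, rather than asserting, makes it easier to inspect what went wrong when a split is not clean.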

A1: novel color modifier

In [59]:
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-a1/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a1/data-compositional-splits.txt...


In [60]:
for example in split_test_data["examples"]["test"]:
    assert "yellow,square" in example["command"]

In [62]:
for example in ReaSCAN_data["examples"]["train"]:
    assert "yellow,square" not in example["command"]

A2: novel color attribute

In [63]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-a2/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a2/data-compositional-splits.txt...


In [67]:
for example in ReaSCAN_data["examples"]["train"]:
    assert "red,square" not in example["command"]

In [71]:
for example in split_test_data["examples"]["test"]:
    if "red,square" not in example["command"]:
        # then, some background object referred in the command needs to be a red square!!
        if example["derivation"] == "$OBJ_0":
            assert example['situation']['placed_objects']['0']['object']['shape'] == "square"
            assert example['situation']['placed_objects']['0']['object']['color'] == "red"
        elif example["derivation"] == "$OBJ_0 ^ $OBJ_1":
            assert example['situation']['placed_objects']['0']['object']['shape'] == "square" or example['situation']['placed_objects']['1']['object']['shape'] == "square"
            assert example['situation']['placed_objects']['0']['object']['color'] == "red" or example['situation']['placed_objects']['1']['object']['color'] == "red"
        elif example["derivation"] == "$OBJ_0 ^ $OBJ_1 & $OBJ_2":
            assert example['situation']['placed_objects']['0']['object']['shape'] == "square" or example['situation']['placed_objects']['1']['object']['shape'] == "square" or example['situation']['placed_objects']['2']['object']['shape'] == "square"
            assert example['situation']['placed_objects']['0']['object']['color'] == "red" or example['situation']['placed_objects']['1']['object']['color'] == "red" or example['situation']['placed_objects']['2']['object']['color'] == "red"
    else:
        pass
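
The A2 assertions above check that some referred object is a square and that some referred object is red, but not that a single object is both. A stricter variant is sketched below. It assumes, as the cell above does, that the referred objects are stored under the keys `'0'`, `'1'`, ... of `placed_objects`; the helper names and toy worlds are hypothetical.

```python
def has_red_square(example, num_referred):
    """True if one of the first `num_referred` placed objects is both red and a square."""
    placed = example["situation"]["placed_objects"]
    for index in range(num_referred):
        obj = placed[str(index)]["object"]
        if obj["shape"] == "square" and obj["color"] == "red":
            return True
    return False

def make_object(shape, color):
    """Build a toy placed-object entry with only the fields used here."""
    return {"object": {"shape": shape, "color": color}}

# Passes the weaker check (a square exists, a red object exists) but has no red square.
weak_only = {"situation": {"placed_objects": {
    "0": make_object("square", "blue"),
    "1": make_object("circle", "red"),
}}}
# Contains an actual red square.
strict_ok = {"situation": {"placed_objects": {
    "0": make_object("square", "red"),
    "1": make_object("circle", "blue"),
}}}
print(has_red_square(weak_only, 2), has_red_square(strict_ok, 2))
```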

A3: novel size attribute

In [73]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-a3/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-a3/data-compositional-splits.txt...


In [75]:
for example in split_test_data["examples"]["test"]:
    assert "small,cylinder" in example['command'] or \
        "small,red,cylinder" in example['command'] or \
        "small,blue,cylinder" in example['command'] or \
        "small,yellow,cylinder" in example['command'] or \
        "small,green,cylinder" in example['command']

In [77]:
for example in ReaSCAN_data["examples"]["train"]:
    assert not ("small,cylinder" in example['command'] or \
        "small,red,cylinder" in example['command'] or \
        "small,blue,cylinder" in example['command'] or \
        "small,yellow,cylinder" in example['command'] or \
        "small,green,cylinder" in example['command'])

B1: novel co-occurrence of objects

In [83]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-b1/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-b1/data-compositional-splits.txt...


In [80]:
from collections import namedtuple, OrderedDict
seen_command_structs = {}
seen_concepts = {} # add in seen concepts, so we can select concepts that are seen, but new composites!
seen_object_co = set([])
seen_rel_co = set([])

for example_selected in ReaSCAN_data["examples"]["train"]:
    rel_map = OrderedDict({})
    for ele in example_selected["relation_map"]:
        rel_map[tuple(ele[0])] = ele[1]
    example_struct = OrderedDict({
        'obj_pattern_map': example_selected["object_pattern_map"],
        'rel_map': rel_map,
        'obj_map': example_selected["object_expression"],
        'grammer_pattern': example_selected['grammer_pattern'],
        'adverb': example_selected['adverb_in_command'],
        'verb': example_selected['verb_in_command']
    })
    obj_co = []
    for k, v in example_selected["object_expression"].items():
        if v not in seen_concepts:
            seen_concepts[v] = 1
        else:
            seen_concepts[v] += 1
        obj_co += [v]
    obj_co.sort()
    seen_object_co.add(tuple(obj_co))
    
    rel_co = []
    for k, v in rel_map.items():
        if v not in seen_concepts:
            seen_concepts[v] = 1
        else:
            seen_concepts[v] += 1
        rel_co += [v]
    rel_co.sort()
    seen_rel_co.add(tuple(rel_co))

In [86]:
test_seen_command_structs = {}
test_seen_concepts = {} # add in seen concepts, so we can select concepts that are seen, but new composites!
test_seen_object_co = set([])
test_seen_rel_co = set([])

for example_selected in split_test_data["examples"]["test"]:
    rel_map = OrderedDict({})
    for ele in example_selected["relation_map"]:
        rel_map[tuple(ele[0])] = ele[1]
    example_struct = OrderedDict({
        'obj_pattern_map': example_selected["object_pattern_map"],
        'rel_map': rel_map,
        'obj_map': example_selected["object_expression"],
        'grammer_pattern': example_selected['grammer_pattern'],
        'adverb': example_selected['adverb_in_command'],
        'verb': example_selected['verb_in_command']
    })
    obj_co = []
    for k, v in example_selected["object_expression"].items():
        if v not in test_seen_concepts:
            test_seen_concepts[v] = 1
        else:
            test_seen_concepts[v] += 1
        obj_co += [v]
    obj_co.sort()
    test_seen_object_co.add(tuple(obj_co))
    
    rel_co = []
    for k, v in rel_map.items():
        if v not in test_seen_concepts:
            test_seen_concepts[v] = 1
        else:
            test_seen_concepts[v] += 1
        rel_co += [v]
    rel_co.sort()
    test_seen_rel_co.add(tuple(rel_co))

In [91]:
test_seen_object_co.intersection(seen_object_co)

set()

B2: novel co-occurrence of relations

In [92]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-b2/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-b2/data-compositional-splits.txt...


In [93]:
test_seen_command_structs = {}
test_seen_concepts = {} # add in seen concepts, so we can select concepts that are seen, but new composites!
test_seen_object_co = set([])
test_seen_rel_co = set([])

for example_selected in split_test_data["examples"]["test"]:
    rel_map = OrderedDict({})
    for ele in example_selected["relation_map"]:
        rel_map[tuple(ele[0])] = ele[1]
    example_struct = OrderedDict({
        'obj_pattern_map': example_selected["object_pattern_map"],
        'rel_map': rel_map,
        'obj_map': example_selected["object_expression"],
        'grammer_pattern': example_selected['grammer_pattern'],
        'adverb': example_selected['adverb_in_command'],
        'verb': example_selected['verb_in_command']
    })
    obj_co = []
    for k, v in example_selected["object_expression"].items():
        if v not in test_seen_concepts:
            test_seen_concepts[v] = 1
        else:
            test_seen_concepts[v] += 1
        obj_co += [v]
    obj_co.sort()
    test_seen_object_co.add(tuple(obj_co))
    
    rel_co = []
    for k, v in rel_map.items():
        if v not in test_seen_concepts:
            test_seen_concepts[v] = 1
        else:
            test_seen_concepts[v] += 1
        rel_co += [v]
    rel_co.sort()
    test_seen_rel_co.add(tuple(rel_co))

In [94]:
test_seen_rel_co

{('$IS_INSIDE', '$SAME_SIZE')}
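
The B2 cell displays the relation co-occurrences found in the test split but does not compare them with training. A possible follow-up check is sketched below, assuming `relation_map` is a list of `[object_pair, relation]` entries as in the cells above. The helper name and toy examples are hypothetical, and unlike the `OrderedDict` version above, this sketch does not merge entries that share the same object pair.

```python
def relation_cooccurrences(examples):
    """Collect, per example, the sorted tuple of relations in its relation_map."""
    result = set()
    for example in examples:
        relations = sorted(relation for _, relation in example["relation_map"])
        result.add(tuple(relations))
    return result

# Toy data (hypothetical), not taken from the ReaSCAN files.
toy_train = [
    {"relation_map": [[["$OBJ_0", "$OBJ_1"], "$SAME_ROW"],
                      [["$OBJ_0", "$OBJ_2"], "$SAME_SIZE"]]},
]
toy_test = [
    {"relation_map": [[["$OBJ_0", "$OBJ_1"], "$SAME_SIZE"],
                      [["$OBJ_0", "$OBJ_2"], "$IS_INSIDE"]]},
]
overlap = relation_cooccurrences(toy_test) & relation_cooccurrences(toy_train)
print(overlap)
```

An empty intersection would indicate that no relation combination in the test split was seen during training.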

C1: novel conjunctive clause length

In [95]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-c1/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-c1/data-compositional-splits.txt...


In [104]:
for example in split_test_data["examples"]["test"]:
    assert example["derivation"] == "$OBJ_0 ^ $OBJ_1 & $OBJ_2 & $OBJ_3"
    assert (example["command"].count("and")) == 2
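
Note that `str.count("and")` counts substrings, so any word that merely contains "and" would also be counted. If that ever matters, a token-level count is a small change. This sketch assumes commands are comma-separated token strings, as the checks above suggest; the helper name and the toy command (including the word "sandy") are hypothetical.

```python
def count_token(command, token):
    """Count exact occurrences of `token` in a comma-separated command string."""
    return command.split(",").count(token)

# Toy command (hypothetical): "sandy" contains "and" as a substring.
toy_command = "walk,to,the,sandy,box,and,the,circle"
substring_count = toy_command.count("and")
token_count = count_token(toy_command, "and")
print(substring_count, token_count)
```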

C2: novel relative clauses

In [105]:
# This test may be a little too weak for now. Maybe improve it to verify the shape world?
split_test_path_to_data = f"../../ReaSCAN-v1.0/ReaSCAN-compositional-c2/data-compositional-splits.txt"
print(f"Reading dataset from file: {split_test_path_to_data}...")
split_test_data = json.load(open(split_test_path_to_data, "r"))

Reading dataset from file: ../../ReaSCAN-v1.0/ReaSCAN-compositional-c2/data-compositional-splits.txt...


In [108]:
for example in split_test_data["examples"]["test"]:
    assert example["derivation"] == "$OBJ_0 ^ $OBJ_1 ^ $OBJ_2"
    assert (example["command"].count("that,is")) == 2

**Conclusion**: No. The checks above surface no correctness issues in the generalization splits.