## Edit tokenization 

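The CHILDES Treebank is tokenized inconsistently. The function below normalizes the variants we found, for example `'t` to `n't`, `okay` to `ok`, and `will n't` to `wo n't`, and writes edited copies of the treebank files.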

In [None]:
import pandas as pd

def edit_tokenization(corpus_string):
    """Normalize inconsistent token variants in a space-tokenized corpus string."""
    # Restore the contracted negation: 't -> n't
    corpus_string_edited = corpus_string.replace(" 't ", " n't ")
    # Reduce tokens of the form <letter>l to the bare letter, mid-line and at line start
    for letter in "abcdefghijklmnopqrstuvwxyz":
        corpus_string_edited = corpus_string_edited.replace(" " + letter + "l ", " " + letter + " ")
        corpus_string_edited = corpus_string_edited.replace("\n" + letter + "l ", "\n" + letter + " ")
    # Replace the token "cha" with "you", mid-line and at line start
    corpus_string_edited = corpus_string_edited.replace(" cha ", " you ")
    corpus_string_edited = corpus_string_edited.replace("\ncha ", "\nyou ")
    # Unify spelling variants (matched on the trailing space only)
    corpus_string_edited = corpus_string_edited.replace("okay ", "ok ")
    corpus_string_edited = corpus_string_edited.replace("hmm ", "hm ")
    corpus_string_edited = corpus_string_edited.replace("will n't ", "wo n't ")
    # Lowercase every occurrence of "ING"
    corpus_string_edited = corpus_string_edited.replace("ING", "ing")
    return corpus_string_edited

with open("treebank-decl-quest.txt") as f:
    treebank_decl_quest = f.read()
treebank_decl_quest_edited = edit_tokenization(treebank_decl_quest)
with open("treebank-decl-quest-edited.txt", 'w') as f:
    f.write(treebank_decl_quest_edited)

with open("treebank-allsents.txt") as f:
    treebank_allsents = f.read()
treebank_allsents_edited = edit_tokenization(treebank_allsents)
with open("treebank-allsents-edited.txt", 'w') as f:
    f.write(treebank_allsents_edited)
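
To make the effect of the replacement rules concrete, here is a minimal sketch that applies a subset of them, in the same order as `edit_tokenization`, to a made-up sample. The names `RULES` and `apply_rules` and the sample string are assumptions for illustration and are not part of the original notebook.

```python
# Subset of the rules from edit_tokenization, applied in the same order.
# RULES and apply_rules are hypothetical names used only for this sketch.
RULES = [
    (" 't ", " n't "),
    ("okay ", "ok "),
    ("will n't ", "wo n't "),
    ("ING", "ing"),
]

def apply_rules(text, rules=RULES):
    """Apply each (old, new) string replacement in sequence."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# Made-up sample, not taken from the treebank.
sample = "he will 't go\nokay i am goING\n"
print(apply_rules(sample))

# A token at the very end of a line is followed by "\n", not a space,
# so a rule keyed on a trailing space leaves it unchanged.
print(repr(apply_rules("that is okay\n")))
```

The order matters: `" 't "` is rewritten to `" n't "` first, which lets the later `"will n't "` rule turn `will 't` into `wo n't`.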

# Checking tokenization differences

The test set is built by searching the CHILDES Treebank for every question that appears in the excluded set, ignoring capitalization and punctuation. If the two sources differ in tokenization in ways other than punctuation, the comparison is invalid, so here we look for such differences.
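
As a rough illustration of why this matters, the sketch below compares a few made-up strings after stripping everything except lowercase alphanumeric characters. The helper `normalize` is a hypothetical stand-in that follows the same idea as the `to_alnum` function used in this notebook; the example sentences are invented.

```python
def normalize(string):
    """Lowercase and keep only alphanumeric characters (hypothetical helper)."""
    return "".join(ch.lower() for ch in string if ch.isalnum())

# Differences in spacing, punctuation and capitalization disappear.
same = normalize("Is n't it okay ?") == normalize("isn't it okay")

# A spelling-level tokenization difference ("okay" vs "ok") survives,
# so the two strings would not be matched.
differs = normalize("is n't it okay ?") != normalize("is n't it ok ?")

print(same, differs)
```

This is why spelling variants have to be unified before the lookup: stripping punctuation alone cannot reconcile them.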

`childes-treebank-allsents.txt` contains all the sentences in `childes-xml` from the corpora that are in the CHILDES_Treebank. If there are questions that are not in `childes-treebank-allsents.txt` (ignoring punctuation), then there is a tokenization difference, or possibly a data processing problem. 

Below I loop through all the yes/no questions in the CHILDES Treebank and print the ones that are not in `childes-treebank-allsents.txt`. 

In [None]:
import pandas as pd

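def to_alnum(string):
    # Lowercase and keep only alphanumeric characters and newlines,
    # so spacing, punctuation and capitalization are ignored.
    return ''.join(e.lower() for e in string if e.isalnum() or e == "\n")

# Normalized corpus sentences, with xxx (CHAT's marker for unintelligible speech) removed
with open("childes-treebank.txt") as f:
    childes_treebank_alnum = set(to_alnum(f.read()).replace('xxx','').splitlines())

# Raw treebank sentences, one per line; they are normalized inside the loop below
with open("treebank-allsents-edited.txt") as f:
    treebank_allsents_edited_alnum = f.read().splitlines()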

count = 0
for sent in treebank_allsents_edited_alnum:
    if to_alnum(sent) not in childes_treebank_alnum:
        print(sent)
        count += 1

print(count/len(treebank_allsents_edited_alnum))
print(count,len(treebank_allsents_edited_alnum))

In [None]:
quest = list(pd.read_table("treebank-decl-quest-edited.txt", header=None)[1])

count = 0
for q in quest:
    if to_alnum(q) not in childes_treebank_alnum:# and "supposed" not in q:
        print(q)
        count += 1

print(count/len(quest))
print(count,len(quest))

# Splitting CHILDES Treebank 

The following script splits the treebank data into training, validation and test sets for finetuning. 

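The test set consists of the questions that were held out from pretraining, listed in `excluded.txt`. Of the remaining pairs, every tenth one goes to the validation set until it is about as large as the test set, and the rest go to the training set.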

In [None]:
def split_treebank(excluded, decl, quest):
    # Normalize each excluded sentence; "".join accepts a string or a list of tokens.
    excluded_alnum = set(to_alnum("".join(s)) for s in excluded)

    train = []
    valid = []
    test = []

    # Test set: every declarative/question pair whose question was held out.
    for i, q in enumerate(quest):
        if to_alnum(q) in excluded_alnum:
            test.append(decl[i] + " " + quest[i] + "\n")

    test_set = set(test)  # set membership avoids rescanning the list
    len_test = len(test)
    len_valid = 0
    for i in range(len(quest)):
        pair = decl[i] + " " + quest[i] + "\n"
        if pair not in test_set:
            # Every tenth pair goes to validation, capped at len(test) + 1 pairs.
            if i % 10 == 0 and len_valid <= len_test:
                valid.append(pair)
                len_valid += 1
            else:
                train.append(pair)

    return train, valid, test
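
One way to sanity-check the three lists returned by `split_treebank` is to count how many lines they share. The sketch below is an assumption about how such a check could look: `check_split` is a hypothetical helper, and the toy lines are invented rather than taken from the treebank.

```python
def check_split(train, valid, test):
    """Return the split sizes and pairwise overlap counts (hypothetical helper)."""
    train_set, valid_set, test_set = set(train), set(valid), set(test)
    return {
        "sizes": (len(train), len(valid), len(test)),
        "train_valid_overlap": len(train_set & valid_set),
        "train_test_overlap": len(train_set & test_set),
        "valid_test_overlap": len(valid_set & test_set),
    }

# Toy lines in a "declarative question" shape, made up for illustration.
toy_train = ["you can go . can you go ?\n", "he is here . is he here ?\n"]
toy_valid = ["she will eat . will she eat ?\n"]
toy_test = ["it is big . is it big ?\n"]

report = check_split(toy_train, toy_valid, toy_test)
print(report)

# A line that appears in two splits shows up as a non-zero overlap.
leaky = check_split(toy_train, toy_train[:1], toy_test)
print(leaky["train_valid_overlap"])
```

The function above already keeps test pairs out of the other two splits, but a pair that occurs more than once in the data can still land in both the training and validation sets, which is what the `train_valid_overlap` count would reveal.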