# Final Assembly for Choi et al. (2022)'s Soft Constraint Method

We've annotated our sampled training set; all that remains is to assemble the final fine-tuning set by combining the glossary, the unchanged training examples, and the annotated sample.

In [1]:
from datasets import load_dataset, Dataset

#Converts lines in "src<TAB>tgt" format into a Dataset with "en" and "fr" columns for model training
def convertToDictFormat(data):
    source = []
    target = []
    for example in data:
        #Drop surrounding whitespace (including the trailing newline), then split on the tab
        sentences = example.strip().split("\t")
        source.append(sentences[0])
        target.append(sentences[1]) #Raises IndexError if a line contains no tab
    return Dataset.from_dict({"en": source, "fr": target})
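
The converter above assumes every line holds exactly one source and one target separated by a tab. As a minimal sketch of a more defensive variant, assuming malformed lines should be skipped and counted rather than raise an error, the hypothetical helper `split_bitext_lines` below (not part of the original notebook) returns a plain dictionary of the same shape that is passed to `Dataset.from_dict`. It uses only the standard library and toy data.

```python
def split_bitext_lines(lines):
    """Split 'src<TAB>tgt' lines into parallel lists, skipping malformed ones.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    source, target, skipped = [], [], 0
    for line in lines:
        # Remove only the line terminator so leading/trailing spaces survive.
        parts = line.rstrip("\r\n").split("\t")
        # Keep a line only if it has exactly two non-empty fields.
        if len(parts) != 2 or not parts[0] or not parts[1]:
            skipped += 1
            continue
        source.append(parts[0])
        target.append(parts[1])
    return {"en": source, "fr": target}, skipped


# Toy input: two well-formed pairs and one line without a tab.
pairs, skipped = split_bitext_lines(["heart\tcœur\n", "no tab here\n", "liver\tfoie"])
print(pairs, skipped)
```

Reporting the number of skipped lines makes it easy to notice when a component file is not in the expected format.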

In [2]:
#Load in our training set components and convert them to Dataset objects
entire_glossary = load_dataset("ethansimrm/MeSpEn_enfr_dirty_glossary", split = "train")
unchanged_train = load_dataset("ethansimrm/choi_unchanged_train", split = "train")
annotated_train = load_dataset("ethansimrm/choi_annotated_sampled_train", split = "train")
glossary_ready = convertToDictFormat(entire_glossary['text'])
unchanged_train_ready = convertToDictFormat(unchanged_train['text'])
annotated_train_ready = convertToDictFormat(annotated_train['text'])

Downloading and preparing dataset text/ethansimrm--MeSpEn_enfr_dirty_glossary to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--MeSpEn_enfr_dirty_glossary-d8e0c39300233912/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--MeSpEn_enfr_dirty_glossary-d8e0c39300233912/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.
Downloading and preparing dataset text/ethansimrm--choi_unchanged_train to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_unchanged_train-c72ff8ff983bcf98/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_unchanged_train-c72ff8ff983bcf98/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.
Downloading and preparing dataset text/ethansimrm--choi_annotated_sampled_train to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_annotated_sampled_train-6fdbfe2540b33c16/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Dataset text downloaded and prepared to C:/Users/ethan/.cache/huggingface/datasets/ethansimrm___text/ethansimrm--choi_annotated_sampled_train-6fdbfe2540b33c16/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


In [7]:
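#Concatenate the three components, then shuffle with a fixed seed for reproducibility
#flatten_indices() rewrites the rows in shuffled order so that no indices mapping remains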
from datasets import concatenate_datasets
choi_full_train_unshuffled = concatenate_datasets([glossary_ready, unchanged_train_ready, annotated_train_ready])
choi_full_train_ready = choi_full_train_unshuffled.shuffle(seed=42).flatten_indices()
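
To illustrate why the seed is fixed, here is a minimal standard-library sketch of the same concatenate-then-shuffle idea on toy lists. The helper `concat_and_shuffle` and the toy pairs are assumptions made for illustration; this is not how `datasets` implements `shuffle`, and its ordering will differ from the one produced above.

```python
import random


def concat_and_shuffle(parts, seed=42):
    """Concatenate lists of (en, fr) pairs and shuffle them reproducibly.

    Hypothetical helper for illustration only.
    """
    combined = [pair for part in parts for pair in part]
    # A local RNG keeps the result reproducible without touching global random state.
    rng = random.Random(seed)
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the three training set components.
part_a = [("heart", "cœur")]
part_b = [("The patient slept.", "Le patient a dormi.")]
part_c = [("liver", "foie"), ("kidney", "rein")]

first = concat_and_shuffle([part_a, part_b, part_c])
second = concat_and_shuffle([part_a, part_b, part_c])
print(first == second)  # same seed, same order
```

Shuffling after concatenation matters here because otherwise the glossary entries would all appear at the start of the training file.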

In [11]:
#Write the shuffled set back out in src [TAB] tgt [NEWLINE] format, ready for upload
with open("choi_full_train.txt", "w", encoding = "utf8") as output:
    for bitext in choi_full_train_ready:
        output.write(bitext["en"] + "\t" + bitext["fr"] + "\n")
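
Before uploading, it may be worth confirming that every written line still contains exactly one tab, since a stray tab or newline inside a sentence would break the format. The sketch below assumes a hypothetical helper, `count_malformed_lines`, and demonstrates it on a throwaway file; in the notebook, the path `"choi_full_train.txt"` would be passed instead.

```python
import os
import tempfile


def count_malformed_lines(path):
    """Return (total, malformed) line counts for a 'src<TAB>tgt' file.

    Hypothetical helper for illustration; a line is malformed unless it
    contains exactly one tab.
    """
    total = malformed = 0
    with open(path, encoding="utf8") as f:
        for line in f:
            total += 1
            if line.rstrip("\r\n").count("\t") != 1:
                malformed += 1
    return total, malformed


# Demonstration on a temporary file with one good line and one bad line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("heart\tcœur\n")
        f.write("broken line\n")
    total, malformed = count_malformed_lines(demo_path)

print(total, malformed)
```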