Before starting, note that the ```ST2022``` folder contains only the data and the surprise data; all other files were removed. To run the evaluation function used at the end, which compares the predictions against the real target sequences, you need to install the official ```ST2022``` package.

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers.schedules import ExponentialDecay
from tensorflow.keras.layers import (
    Embedding,
    TextVectorization,
    Concatenate,
    Dense,
    Dropout,
)
from tensorflow.keras import layers
import keras_nlp
import numpy as np
import random
from tqdm import tqdm

The following cell reads the data into a dictionary. The lines that must be changed to switch to a different condition are marked with a comment.

In [2]:
lst_dirs = [ # Training data
    "abrahammonpa",
    "allenbai",
    "backstromnorthernpakistan",
    "castrosui",
    "davletshinaztecan",
    "felekesemitic",
    "hantganbangime",
    "hattorijaponic",
    "listsamplesize",
    "mannburmish",
]

### Uncomment the following list for surprise data
# lst_dirs = [ 
#     "bantubvd",
#     "beidazihui",
#     "birchallchapacuran",
#     "bodtkhobwa",
#     "bremerberta",
#     "deepadungpalaung",
#     "hillburmish",
#     "kesslersignificance",
#     "luangthongkumkaren",
#     "wangbai"
# ]

max_num_lang = 0
max_seq_len = 0
num_langs = set()
data_dict = {}
val_data_dict = {}
for dir in lst_dirs:
    with open(
        "../ST2022/data/" + dir + "/training-0.10.tsv", # change to training-0.50.tsv for 50% proportion
        encoding="UTF-8",
    ) as f:
        file = f.readlines()
        data = list(map(lambda x: x.strip("\n").split("\t")[1:], file))
        
        val_num = int(len(data) * 0.025)
        
        # start at 1 so that the header row (index 0) never ends up in the validation slice
        start_index = random.randint(1, len(data) - val_num)
        end_index = start_index + val_num
        train_data = data[:start_index] + data[end_index:]
        val_data = data[start_index:end_index]

    header = data[0]
    for i, head in enumerate(header):
        for data_point in train_data[1:]:
            if (
                data_point[i] == ""
            ):  # dictionary keys are strings with target language and target form. If there is no target form, skip the iteration
                continue
            else:
                offset_target_form = f"[start] {head} " + data_point[i] + " [end]"

                data_dict[offset_target_form] = [
                    f"{header[k]} {data_point[k]}"
                    for k in range(len(data_point))
                    if k != i
                ]

                for form in data_dict[offset_target_form]:
                    if len(form.split()) > max_seq_len:
                        max_seq_len = len(form.split())
        for data_point in val_data:
            if (
                data_point[i] == ""
            ):  # dictionary keys are strings with target language and target form. If there is no target form, skip the iteration
                continue
            else:
                offset_target_form = f"[start] {head} " + data_point[i] + " [end]"
                val_data_dict[offset_target_form] = [
                    f"{header[k]} {data_point[k]}"
                    for k in range(len(data_point))
                    if k != i
                ]
                for form in val_data_dict[offset_target_form]:
                    if len(form.split()) > max_seq_len:
                        max_seq_len = len(form.split())
                        
max_seq_len += 5
target_lang = []
target_lang_form = []
neighbor_lang_form = []
# neighbor_lang_form_augment = []

val_target_lang = []
val_target_lang_form = []
val_neighbor_lang_form = []


for key, value in data_dict.items():
    target_lang_form.append(key)
    neighbor_lang_form.append(value)
    target_lang.append(
        key.split()[1]
    )  # the first element in the sequence of targets is [start] followed by language code

for key, value in val_data_dict.items():
    val_target_lang_form.append(key)
    val_neighbor_lang_form.append(value)
    val_target_lang.append(
        key.split()[1]
    )  # the first element in the sequence of targets is [start] followed by language code


max_num_lang = max([len(langs) for langs in neighbor_lang_form])  # needed for padding

all_langs = list(set(target_lang))
all_forms = list(
    set(
        [
            char
            for lst in neighbor_lang_form + val_neighbor_lang_form
            for seq in lst
            for char in seq.split()
        ]
    )
)

form_vectorization = TextVectorization(
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=max_seq_len + 1,
)

lang_vectorization = TextVectorization(
    standardize=None,
    split="whitespace",
    output_mode="int",
    output_sequence_length=1,
)

lang_vectorization.adapt(all_langs)
### Vectorize target languages
int_target_lang = lang_vectorization(target_lang)
val_int_target_lang = lang_vectorization(val_target_lang)

form_vectorization.adapt(all_forms)
### Vectorize target sequence
int_target_lang_form = form_vectorization(target_lang_form)
val_int_target_lang_form = form_vectorization(val_target_lang_form)

### Prepare lists for vectorization and padding of neighbor forms
int_neighbor_lang_form = []
val_int_neighbor_lang_form = []


vocab_size_forms = form_vectorization.vocabulary_size()
vocab_size_langs = lang_vectorization.vocabulary_size()

lang_form_vocab = form_vectorization.get_vocabulary()
lang_form_lookup = dict(zip(range(len(lang_form_vocab)), lang_form_vocab))
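
To make the dictionary layout concrete, here is a minimal sketch on a made-up cognate table. The language names and forms are invented for illustration, and `build_pairs` is a hypothetical helper that is not part of the notebook. Every non-empty form becomes a target key once, with all other columns of the same row as its neighbors:

```python
def build_pairs(header, rows):
    """Map '[start] LANG FORM [end]' to the list of 'LANG FORM' neighbor strings."""
    pairs = {}
    for row in rows:
        for i, lang in enumerate(header):
            if row[i] == "":  # no target form for this language: nothing to predict
                continue
            key = f"[start] {lang} {row[i]} [end]"
            # all remaining columns of the row act as neighbors of the target
            pairs[key] = [f"{header[k]} {row[k]}" for k in range(len(row)) if k != i]
    return pairs


# Toy data: three hypothetical languages, segmented forms separated by spaces.
header = ["LangA", "LangB", "LangC"]
rows = [
    ["p a", "b a", ""],  # LangC has no reflex in this cognate set
    ["t i", "d i", "t e"],
]

pairs = build_pairs(header, rows)
for key, neighbors in pairs.items():
    print(key, "<-", neighbors)
```

Because the targets are used as dictionary keys, two cognate sets that yield the same target string would overwrite each other, both in this sketch and in the cell above.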



The following cell takes a long time to run, as it is poorly optimized for now. The waiting time can be avoided by downloading the zipped ```tf.data.Dataset``` corresponding to the configuration of interest.

- Loop 1 vectorizes the training neighbor forms and then pads them with sequences of zeros up to the required shape (18, 19)
- Loop 2 does the same for the validation neighbor forms
- Loop 3 performs the data augmentation described in the paper. Because the augmented data is three times the size of the original dataset, the target forms and target languages are also repeated three times
- Loop 4 does the same for the validation data

By chance, a few cognate sets may have all of their neighbors deleted during augmentation and will thus consist entirely of zeros. To remove such examples, the function ```filter_zeros``` is applied afterwards.

In [None]:
    
# Loop 1 
for neighborhood in tqdm(neighbor_lang_form, desc=f"Processing neighborhoods"):
    int_neighborhood = form_vectorization(neighborhood)
    num_neighbors = int_neighborhood.shape[0]
    if num_neighbors < max_num_lang:
        to_add = max_num_lang - num_neighbors
        padding = tf.constant([[0, to_add], [0, 0]], dtype=tf.int32)
        int_neighborhood = tf.pad(int_neighborhood, padding)
    int_neighbor_lang_form.append(int_neighborhood)

# Loop 2
for neighborhood in tqdm(val_neighbor_lang_form, desc=f"Processing validation neighborhoods"):
    int_neighborhood = form_vectorization(neighborhood)
    num_neighbors = int_neighborhood.shape[0]
    if num_neighbors < max_num_lang:
        to_add = max_num_lang - num_neighbors
        padding = tf.constant([[0, to_add], [0, 0]], dtype=tf.int32)
        int_neighborhood = tf.pad(int_neighborhood, padding)
    val_int_neighbor_lang_form.append(int_neighborhood)

int_neighbor_lang_form = tf.convert_to_tensor(int_neighbor_lang_form)
val_int_neighbor_lang_form = tf.convert_to_tensor(val_int_neighbor_lang_form)

# Loop 3
augmented_data = []
other_augmented_data = []
# one all-zero row with the same length as a vectorized form (max_seq_len + 1)
zero_tensor = tf.zeros(shape=(max_seq_len + 1,), dtype=tf.int64)
for neighborhood in tqdm(int_neighbor_lang_form, desc=f"Processing augmented neighborhoods"):
    augmented_neighborhood = []
    other_augmented_neighborhood = []
    for neighbor in neighborhood:
        if all(neighbor == zero_tensor):
            augmented_neighborhood.append(zero_tensor)
            other_augmented_neighborhood.append(zero_tensor)
        elif random.randint(0, 1) == 1:
            augmented_neighborhood.append(neighbor)
            other_augmented_neighborhood.append(zero_tensor)
        else:
            augmented_neighborhood.append(zero_tensor)
            other_augmented_neighborhood.append(neighbor)
    augmented_data.append(augmented_neighborhood)
    other_augmented_data.append(other_augmented_neighborhood)
augmented_data = tf.convert_to_tensor(augmented_data + other_augmented_data)

int_target_lang = tf.concat([int_target_lang, int_target_lang, int_target_lang], axis=0)
int_target_lang_form = tf.concat([int_target_lang_form, int_target_lang_form, int_target_lang_form], axis=0)
int_neighbor_lang_form = tf.concat([int_neighbor_lang_form, augmented_data], axis=0)


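# Loop 4
augmented_data = []
other_augmented_data = []
# one all-zero row with the same length as a vectorized form (max_seq_len + 1)
zero_tensor = tf.zeros(shape=(max_seq_len + 1,), dtype=tf.int64)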
for neighborhood in tqdm(val_int_neighbor_lang_form, desc=f"Processing augmented validation neighborhoods"):
    augmented_neighborhood = []
    other_augmented_neighborhood = []
    for neighbor in neighborhood:
        if all(neighbor == zero_tensor):
            augmented_neighborhood.append(zero_tensor)
            other_augmented_neighborhood.append(zero_tensor)
        elif random.randint(0, 1) == 1:
            augmented_neighborhood.append(neighbor)
            other_augmented_neighborhood.append(zero_tensor)
        else:
            augmented_neighborhood.append(zero_tensor)
            other_augmented_neighborhood.append(neighbor)
    augmented_data.append(augmented_neighborhood)
    other_augmented_data.append(other_augmented_neighborhood)
augmented_data = tf.convert_to_tensor(augmented_data + other_augmented_data)

val_int_target_lang = tf.concat([val_int_target_lang, val_int_target_lang, val_int_target_lang], axis=0)
val_int_target_lang_form = tf.concat([val_int_target_lang_form, val_int_target_lang_form, val_int_target_lang_form], axis=0)
val_int_neighbor_lang_form = tf.concat([val_int_neighbor_lang_form, augmented_data], axis=0)


def filter_zeros(example, target):
    neighbor_langs_forms = example["neighbor_langs_forms"]
    all_zeros = tf.reduce_all(tf.equal(neighbor_langs_forms, 0))
    return tf.logical_not(all_zeros)





train_dataset = tf.data.Dataset.from_tensor_slices(
    (
        {
            "neighbor_langs_forms": int_neighbor_lang_form[:, :, :-1],
            "target_langs": int_target_lang,
            "target_langs_forms": int_target_lang_form[:, :-1],
        },
        int_target_lang_form[:, 1:],
    )
)
train_dataset = train_dataset.filter(filter_zeros)

### Uncomment the following line to save the dataset
# train_dataset.save("train_dataset_prop_0.10")

val_dataset = tf.data.Dataset.from_tensor_slices(
    (
        {
            "neighbor_langs_forms": val_int_neighbor_lang_form[:, :, :-1],
            "target_langs": val_int_target_lang,
            "target_langs_forms": val_int_target_lang_form[:, :-1],
        },
        val_int_target_lang_form[:, 1:],
    )
)
val_dataset = val_dataset.filter(filter_zeros)

### Uncomment the following line to save the dataset
# val_dataset.save("val_dataset_prop_0.10")
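
The augmentation in Loops 3 and 4 splits every padded neighborhood into two complementary copies: each non-empty neighbor is kept in exactly one copy and zeroed in the other. Since the loops are slow, a vectorized NumPy sketch of the same idea may be useful. This is an illustration under assumptions: `complementary_split` is a hypothetical helper, it uses NumPy's random generator instead of `random.randint`, and it is not the code that produced the published datasets.

```python
import numpy as np


def complementary_split(neighborhoods, rng):
    """Split (N, L, T) integer neighborhoods into two complementary masked copies."""
    # One coin flip per neighbor row; all-zero padding rows stay zero in both copies.
    keep_first = rng.integers(0, 2, size=neighborhoods.shape[:2]).astype(bool)
    first = np.where(keep_first[..., None], neighborhoods, 0)
    second = np.where(keep_first[..., None], 0, neighborhoods)
    return first, second


rng = np.random.default_rng(0)
# Toy batch: 2 neighborhoods, 3 neighbor slots, 4 tokens each; the last slot is padding.
toy = np.array(
    [
        [[5, 6, 7, 0], [8, 9, 0, 0], [0, 0, 0, 0]],
        [[3, 4, 0, 0], [2, 2, 2, 0], [0, 0, 0, 0]],
    ],
    dtype=np.int64,
)
first, second = complementary_split(toy, rng)
augmented = np.concatenate([toy, first, second], axis=0)  # three times the original size

# A masked copy can end up entirely zero; such examples are what `filter_zeros` removes.
non_empty = augmented.reshape(len(augmented), -1).any(axis=1)
print(augmented.shape, int(non_empty.sum()))
```

The two copies always sum back to the original neighborhood, which is a quick way to check that no neighbor is lost or duplicated.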

Note that in the following cell, two lines can be uncommented. These two lines load the datasets from saved folders, in case you haven't run the previous cell.

In [None]:
buffer_size = 75000
batch_size = 32

# val_dataset = tf.data.Dataset.load("val_dataset_prop_0.10") # uncomment to load
validation_dataset_batched = val_dataset.batch(batch_size=batch_size)

# train_dataset = tf.data.Dataset.load("train_dataset_prop_0.10") # uncomment to load
train_dataset_batched = (
    train_dataset.shuffle(buffer_size=buffer_size).batch(batch_size=batch_size).prefetch(tf.data.AUTOTUNE).cache()
)

The following cell subclasses Keras' ```Layer``` class to create simple stacks of encoder and decoder layers.

In [None]:
@tf.keras.utils.register_keras_serializable()
class Encoder(keras.layers.Layer):
    def __init__(self, *, num_layers, num_heads, hidden_dim, dropout_rate, **kwargs):
        super(Encoder, self).__init__(**kwargs)
        self.num_layers = num_layers
        self.encoder_layers = [
            keras_nlp.layers.TransformerEncoder(
                intermediate_dim=hidden_dim,
                num_heads=num_heads,
                dropout=dropout_rate,
            )
            for _ in range(num_layers)
        ]

    def call(self, x):
        for i in range(self.num_layers):
            x = self.encoder_layers[i](x)
        return x

    def get_config(self):
        config = super(Encoder, self).get_config()
        config["num_layers"] = self.num_layers
        config["num_heads"] = self.encoder_layers[0].num_heads
        config["hidden_dim"] = self.encoder_layers[0].intermediate_dim
        config["dropout_rate"] = self.encoder_layers[0].dropout
        return config


@tf.keras.utils.register_keras_serializable()
class Decoder(keras.layers.Layer):
    def __init__(self, *, num_layers, num_heads, hidden_dim, dropout_rate, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.num_layers = num_layers
        self.decoder_layers = [
            keras_nlp.layers.TransformerDecoder(
                intermediate_dim=hidden_dim,
                num_heads=num_heads,
                dropout=dropout_rate,
            )
            for _ in range(num_layers)
        ]

    def call(self, x, context):
        for i in range(self.num_layers):
            x = self.decoder_layers[i](decoder_sequence=x, encoder_sequence=context)
        return x

    def get_config(self):
        config = super(Decoder, self).get_config()
        config["num_layers"] = self.num_layers
        config["num_heads"] = self.decoder_layers[0].num_heads
        config["hidden_dim"] = self.decoder_layers[0].intermediate_dim
        config["dropout_rate"] = self.decoder_layers[0].dropout
        return config


In [None]:
def build_model():
    embeeding_dim_lang = 64
    embedding_dim_form = 192
    embed_drop = 0.15
    dense_dim_langs = 512
    num_heads_langs = 4
    num_layers_enc = 2
    num_layers_langs = 1
    num_heads_enc = 3
    dense_dim_enc = 256
    drop_enc = 0.1
    drop_dec = 0.2
    dense_dim_dec = 1024
    num_heads_dec = 5
    num_layers_dec = 4
    
    neighbors_encoder_input = keras.Input(
        shape=(max_num_lang, max_seq_len), dtype="int64", name="neighbor_langs_forms"
    )

    langs_encoder_input = keras.Input(
        shape=(1,),  # a one-element tuple; (1) is just the integer 1
        dtype="int64",
        name="target_langs",
    )

    form_pos_embed = keras_nlp.layers.TokenAndPositionEmbedding(
        vocabulary_size=vocab_size_forms,
        sequence_length=max_seq_len,
        embedding_dim=embedding_dim_form,
        mask_zero=True,
        name="form_embedding",
    )

    seqs_by_lang = tf.unstack(neighbors_encoder_input, axis=1)
    embedded_lst = []
    for seq in seqs_by_lang:
        embed_neighbor_form = form_pos_embed(seq)
        embed_neighbor_form = Dropout(embed_drop)(embed_neighbor_form)
        embedded_lst.append(embed_neighbor_form)

    embed_target_langs = Embedding(
        input_dim=vocab_size_langs,
        output_dim=embeeding_dim_lang,
        mask_zero=True,
        embeddings_regularizer=keras.regularizers.OrthogonalRegularizer(
            factor=0.05, mode="rows"
        ),
        name="langs_embedding",
    )(langs_encoder_input)

    encoder_langs = Encoder(
        num_layers=num_layers_langs,
        num_heads=num_heads_langs,
        hidden_dim=dense_dim_langs,
        dropout_rate=drop_enc,
        name="target_langs_encoder",
    )
    encoder_target_langs = encoder_langs(embed_target_langs)

    embedded_input_concat = Concatenate(axis=1)(embedded_lst)

    encoder_neighbor_forms = Encoder(
        num_layers=num_layers_enc,
        num_heads=num_heads_enc,
        hidden_dim=dense_dim_enc,
        dropout_rate=drop_enc,
    )

    encoder_out = encoder_neighbor_forms(embedded_input_concat)
    # the following line is needed to ensure the possibility of concatenating the output of two encoders
    encoder_target_langs = tf.repeat(
        encoder_target_langs, tf.shape(encoder_out)[1], axis=1
    )

    combined_output = Concatenate(axis=-1)([encoder_target_langs, encoder_out])

    decoder_target_form_input = keras.Input(
        shape=(None,), dtype="int64", name="target_langs_forms"
    )

    embed_target_lang_form = form_pos_embed(decoder_target_form_input)
    embed_target_lang_form = Dropout(embed_drop)(embed_target_lang_form)

    decoder = Decoder(
        num_layers=num_layers_dec,
        num_heads=num_heads_dec,
        hidden_dim=dense_dim_dec,
        dropout_rate=drop_dec,
    )

    decoder_out = decoder(embed_target_lang_form, combined_output)

    decoder_target_form_output = Dense(vocab_size_forms)(decoder_out)
    neighbor_model = keras.Model(
        [neighbors_encoder_input, langs_encoder_input, decoder_target_form_input],
        decoder_target_form_output,
    )
    # Standard Sparse Categorical Crossentropy that additionally ignores masked elements in target sequence
    def masked_loss(label, pred):
        mask = label != 0
        loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction="none"
        )
        loss = loss_object(label, pred)
        mask = tf.cast(mask, dtype=loss.dtype)
        loss *= mask

        loss = tf.reduce_sum(loss) / tf.reduce_sum(mask)
        return loss

    
    init_lr = 0.0005
    lr_schedule = ExponentialDecay(
        initial_learning_rate=init_lr,
        decay_steps=50000,
        decay_rate=0.5,
    )
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=lr_schedule, beta_1=0.9, beta_2=0.98
    )

    neighbor_model.compile(
        optimizer=optimizer,
        loss=masked_loss,
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return neighbor_model
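
As a sanity check on `masked_loss`, the same computation can be reproduced in plain NumPy on a toy example. This is only a sketch: `masked_loss_np` is a hypothetical reimplementation for illustration, and the labels and logits are made up.

```python
import numpy as np


def masked_loss_np(labels, logits):
    """Mean sparse categorical cross-entropy over positions whose label is not 0."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the gold token at each position.
    nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    mask = (labels != 0).astype(nll.dtype)
    return (nll * mask).sum() / mask.sum()


# One sequence of length 3 over a vocabulary of 4 tokens; the last position is padding.
labels = np.array([[2, 1, 0]])
logits = np.zeros((1, 3, 4))  # uniform predictions everywhere...
logits[0, 2, 3] = 50.0  # ...except a confident guess on the padded position

loss = masked_loss_np(labels, logits)
print(loss)  # the padded position does not contribute to the average
```

With uniform logits over four tokens, each unmasked position contributes log(4), so the result equals log(4) regardless of what the model predicts on the padded position.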

In [None]:
multilingual_neighbor_model = build_model()

callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath="train_0.10.keras",
        save_best_only=True,
        verbose=1,
        mode="max",
        monitor="val_sparse_categorical_accuracy",
    ),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_sparse_categorical_accuracy", patience=5
    ),
]

# Note that the number of epochs is excessive but due to use of callbacks, most realistically the model will stop around epoch 15
num_epochs = 50
multilingual_neighbor_model.fit(
    train_dataset_batched,
    validation_data=validation_dataset_batched,
    epochs=num_epochs,
    callbacks=callbacks,
)


Since the model is overfitted by the time it reaches the last epoch, we need to load the checkpoint with the best validation sparse categorical accuracy. Note that the exact state of the optimizer doesn't matter, as we will only use the model to predict the test data.

In [None]:
def masked_loss(label, pred):
    mask = label != 0
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction="none"
    )
    loss = loss_object(label, pred)
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask

    loss = tf.reduce_sum(loss) / tf.reduce_sum(mask)
    return loss


optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

neighbor_model = keras.models.load_model(
    "train_0.10.keras",
    custom_objects={
        "TokenAndPositionEmbedding": keras_nlp.layers.TokenAndPositionEmbedding, # may not be required, depending on tensorflow/keras_nlp version
        "TransformerEncoder": keras_nlp.layers.TransformerEncoder, # may not be required, depending on tensorflow/keras_nlp version
        "TransformerDecoder": keras_nlp.layers.TransformerDecoder, # may not be required, depending on tensorflow/keras_nlp version
        "Encoder": Encoder,
        "Decoder": Decoder,
    },
    compile=False,
)
neighbor_model.compile(loss=masked_loss, optimizer=optimizer)

In [3]:
target_langs = []
target_langs_forms = []
neighbor_langs_forms = []
family_dict = {}
for dir in lst_dirs:
    with open("../ST2022/data/" + dir + "/test-0.10.tsv", encoding="UTF-8") as f, open(
        "../ST2022/data/" + dir + "/solutions-0.10.tsv",
        encoding="UTF-8",
    ) as f_sol:
        file = f.readlines()
        file_sol = f_sol.readlines()
        data = list(map(lambda x: x.strip("\n").split("\t")[1:], file))
        data_sol = list(map(lambda x: x.strip("\n").split("\t")[1:], file_sol))
    lang_names = data[0]
    for lang in lang_names:
        family_dict[lang] = dir  # will be needed during metrics computation
    for i in range(1, len(data)):
        to_pred = data[i].index("?")
        target_langs_forms.append(
            f"[start] {lang_names[to_pred]} {data_sol[i][to_pred]} [end]"
        )
        target_langs.append(lang_names[to_pred])
        neighbor_langs_forms.append(
            [
                f"{lang_names[j]} {data[i][j]}"
                for j in range(len(data[i]))
                if j != to_pred
            ]
        )

int_target_lang = lang_vectorization(target_langs)

### Vectorize target sequence
int_target_lang_form = form_vectorization(target_langs_forms)
### Vectorize and pad
int_neighbor_lang_form = []
for neighborhood in neighbor_langs_forms:
    int_neighborhood = form_vectorization(neighborhood)
    num_neighbors = int_neighborhood.shape[0]
    if num_neighbors < max_num_lang:
        to_add = max_num_lang - num_neighbors
        padding = tf.constant([[0, to_add], [0, 0]], dtype=tf.int32)
        int_neighborhood = tf.pad(int_neighborhood, padding)
    int_neighbor_lang_form.append(int_neighborhood)

int_neighbor_lang_form = tf.convert_to_tensor(int_neighbor_lang_form)


lang_form_vocab = form_vectorization.get_vocabulary()
lang_form_lookup = dict(zip(range(len(lang_form_vocab)), lang_form_vocab))

test_dataset = tf.data.Dataset.from_tensor_slices(
    (
        {
            "neighbor_langs_forms": int_neighbor_lang_form[:, :, :-1],
            "target_langs": int_target_lang,
            "target_langs_forms": int_target_lang_form[:, :-1],
        },
        int_target_lang_form[:, 1:],
    )
)

The following function takes a long time to run and requires additional optimization. All four files with decoded target forms are provided in the repository. The function is especially slow for the proportion in which 50% of the data is retained for testing. Note that although ```test_dataset``` contains the target sequences, they are not used in any way for prediction.

In [None]:

def predict_decode_forms(test_dataset):
    filename = open("predicted_train_0.10.tsv", "w", encoding="UTF-8")
    print("Predicted\tActual", file=filename)
    decoded_forms = []
    total_samples = len(test_dataset)
    progress_bar = tqdm(total=total_samples, desc="Decoding Forms", unit="sample")
    count = 0
    for input, target in test_dataset.take(-1):
        neighbor_langs_forms = tf.expand_dims(input["neighbor_langs_forms"], axis=0)
        target_langs = tf.expand_dims(input["target_langs"], axis=0)
        decoded_form = "[start]"

        for i in range(max_seq_len):
            tokenized_target_form = form_vectorization([decoded_form])[:, :-1]
            predictions = neighbor_model(
                [neighbor_langs_forms, target_langs, tokenized_target_form]
            )
            sampled_token_index = np.argmax(predictions[0, i, :])
            sampled_token = lang_form_lookup[sampled_token_index]
            decoded_form += " " + sampled_token
            if sampled_token == "[end]":
                break

        decoded_forms.append(decoded_form)
        target_decoded = "[start]"

        for ind in target:
            target_decoded += " " + lang_form_lookup[int(ind)]

            if lang_form_lookup[int(ind)] == "[end]":
                break

        print(f"{decoded_form}\t{target_decoded}", file=filename)

        if count % 25 == 0:
            print(f"{decoded_form}\t{target_decoded}")
        count += 1
        progress_bar.update(1)

    filename.close()
    progress_bar.close()

predict_decode_forms(test_dataset)
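
The decoding loop above is plain greedy autoregressive decoding: start from `[start]`, append the highest-scoring next token, and stop at `[end]` or after a maximum number of steps. A dependency-free sketch with a stand-in scoring function follows. The `toy_next_token_scores` table is invented; in the notebook this role is played by `neighbor_model`.

```python
def toy_next_token_scores(prefix):
    """Stand-in for the model: scores for the next token given the decoded prefix."""
    table = {
        ("[start]",): {"LangA": 0.9, "p": 0.1},
        ("[start]", "LangA"): {"p": 0.7, "a": 0.3},
        ("[start]", "LangA", "p"): {"a": 0.8, "[end]": 0.2},
        ("[start]", "LangA", "p", "a"): {"[end]": 0.95, "a": 0.05},
    }
    return table.get(tuple(prefix), {"[end]": 1.0})


def greedy_decode(score_fn, max_len):
    """Repeatedly append the arg-max token until '[end]' or max_len steps."""
    decoded = ["[start]"]
    for _ in range(max_len):
        scores = score_fn(decoded)
        best = max(scores, key=scores.get)
        decoded.append(best)
        if best == "[end]":
            break
    return " ".join(decoded)


result = greedy_decode(toy_next_token_scores, max_len=10)
truncated = greedy_decode(toy_next_token_scores, max_len=2)
print(result)
print(truncated)
```

When `max_len` is too small, the output has no `[end]` token, which matters for any post-processing that assumes one is present.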

In [4]:
from sigtypst2022 import compare_words # this function is provided by the package that came with the shared task
import pandas as pd

#### There was an important error in the `strip_start_end_lang` function, which is now fixed. In the initial version I was accidentally removing some additional characters. I will not change the results in the paper, but the fix slightly improves them.

In [5]:
decoded_forms = pd.read_csv(
    "predicted_train_0.10.tsv",
    sep="\t",
    encoding="UTF-8",
)

def strip_start_end_lang(inp_str):
    new_string_lst = inp_str.split()
    final_string = " ".join(new_string_lst[2:-1])  # to remove lang name and [start], [end]
    return final_string

decoded_forms = decoded_forms.Predicted
decoded_forms = list(decoded_forms)
decoded_forms = list(map(lambda x: strip_start_end_lang(x), decoded_forms)) # remove [start] and [end] tokens, as well as languages from target sequences
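
For example, the function behaves as follows on two made-up predictions. The sample strings are invented, and the function body is restated from the cell above so that the snippet runs on its own:

```python
def strip_start_end_lang(inp_str):
    tokens = inp_str.split()
    # drop "[start]" and the language name at the front, and the final token at the end
    return " ".join(tokens[2:-1])


complete = strip_start_end_lang("[start] LangA p a [end]")
no_end = strip_start_end_lang("[start] LangA p a")
print(repr(complete), repr(no_end))
```

The first call returns `p a`. The second returns only `p`: when decoding stopped without producing `[end]`, the last real segment is removed instead.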

In [6]:
### Where to store predictions
dirs_preds = {
    dir: f"../ST2022/data/{dir}/result_train_0.10.tsv"
    for dir in set(family_dict.values())
}

forms_by_fam = {}
for form, lang in zip(decoded_forms, target_langs):
    family = family_dict[lang]
    if family not in forms_by_fam.keys():
        forms_by_fam[family] = []
    forms_by_fam[family].append(form)


The following piece of code may look slightly strange: it takes an existing file with test data and wipes out all the data in it, which was more convenient than creating an empty file in some other way. Ultimately, it produces files in the format required for comparison with the function ```compare_words``` that was imported earlier.

In [7]:
for dir in lst_dirs:
    test_data = pd.read_csv(
        f"../ST2022/data/{dir}/test-0.10.tsv",
        sep="\t",
        encoding="UTF-8",
    )
    solutions = test_data.copy()
    solutions.iloc[:, 1:] = ""

    for language in test_data.columns[1:]:
        language_ind = test_data[test_data[language] == "?"].index
        if len(language_ind) == 0:  # no form to predict for this language
            continue
        # predictions are stored in row order, so each row index selects its own form
        solutions.loc[language_ind, language] = [forms_by_fam[dir][i] for i in language_ind]
    file_path = dirs_preds[dir]
    solutions.to_csv(file_path, sep="\t", encoding="UTF-8", index=False)
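
The result-file layout can be illustrated without pandas. In the sketch below, `fill_predictions` is a hypothetical helper, the `ID` column name and all forms are invented, and it assumes, as the notebook does, exactly one `?` per test row and predictions stored in row order:

```python
def fill_predictions(header, rows, predictions):
    """Return rows that are empty except for the '?' cell, which gets the prediction."""
    filled = []
    for row, form in zip(rows, predictions):
        target = row.index("?")  # column whose form has to be predicted
        new_row = [row[0]] + [""] * (len(row) - 1)  # keep only the first (ID) column
        new_row[target] = form
        filled.append(new_row)
    return [header] + filled


header = ["ID", "LangA", "LangB", "LangC"]
test_rows = [
    ["1", "p a", "?", "b a"],
    ["2", "?", "d i", "t e"],
]
predictions = ["b a", "t i"]  # one predicted form per test row, in file order

result_rows = fill_predictions(header, test_rows, predictions)
for row in result_rows:
    print("\t".join(row))
```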


The function ```compare_words``` returns a list of results. The final element of the list shows the results averaged over the entire language family (as opposed to the results for individual languages). This final element is itself a list that contains several metric values. The following list shows which metric corresponds to which index of that final list:

1 - Edit Distance

2 - Normalized edit distance

3 - B-Cubed F-scores

4 - BLEU
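
To give a feel for the first two metrics, here is a small self-contained sketch of edit distance and a normalized variant over segmented forms. It is not the implementation used by ```compare_words```: the helper names are hypothetical, and normalizing by the length of the longer sequence is an assumed convention that the package may or may not follow.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence (assumed convention)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


predicted = "p a t".split()
actual = "b a t a".split()
ed = edit_distance(predicted, actual)  # one substitution plus one insertion
ned = normalized_edit_distance(predicted, actual)
print(ed, ned)
```

For both metrics, lower values mean that the predicted form is closer to the attested one.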

In [9]:

average = []
for dir in lst_dirs:
    result = compare_words(
        f"../ST2022/data/{dir}/result_train_0.10.tsv",
        f"../ST2022/data/{dir}/solutions-0.10.tsv",
        report=False,
    )[-1][1]
    print(f"The result for {dir} is: {round(result, 4)}")
    average.append(result)
print(
    "The average result across all datasets is: ", round(sum(average) / len(average), 4)
)


The result for abrahammonpa is: 0.1617
The result for allenbai is: 0.2456
The result for backstromnorthernpakistan is: 0.2347
The result for castrosui is: 0.0735
The result for davletshinaztecan is: 0.3955
The result for felekesemitic is: 0.3946
The result for hantganbangime is: 0.3791
The result for hattorijaponic is: 0.3556
The result for listsamplesize is: 0.5351
The result for mannburmish is: 0.5639
The average result across all datasets is:  0.3339
