# Setup Work Environment

Run the following cell to set up all the dependencies and the code. This cell should not require any changes.

Make sure a GPU is attached to your working environment before you continue. To check, open `Runtime -> Manage sessions` and confirm that the word GPU appears next to the notebook's name. You can double-check in `Runtime -> Change runtime type` that GPU is selected in the dropdown menu.
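The same check can also be done from a cell. The snippet below is a minimal sketch that assumes the GPU is an NVIDIA one exposed through the `nvidia-smi` tool; the helper name `gpu_available` is made up for this illustration and is not part of `xib`.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU.

    Hypothetical helper: assumes an NVIDIA GPU managed by `nvidia-smi`.
    """
    if shutil.which("nvidia-smi") is None:
        return False  # the NVIDIA tools are not installed on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L prints one line per attached GPU
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU attached:", gpu_available())
```

If this prints `False`, switch the runtime type to GPU before running the setup cell.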

In [1]:
!rm -rf xib
!pip install pytrie enlighten colorlog inflection ipapy
!git clone https://github.com/akurniawan/xib.git
!cd xib && git clone https://github.com/j-luo93/dev_misc.git && cd dev_misc && git checkout b44fde842a6311e03f731cd4e110dcd9fc394db7 && pip install -e .
!cd xib && pip install -e .

Collecting pytrie
  Downloading https://files.pythonhosted.org/packages/d3/19/15ec77ab9c85f7c36eb590d6ab7dd529f8c8516c0e2219f1a77a99d7ee77/PyTrie-0.4.0.tar.gz (95kB)
Collecting enlighten
  Downloading https://files.pythonhosted.org/packages/2e/15/7a22630323eb816bd560bb2b60b98c9c829a3fb90f55d9d224f3aa4d7bf3/enlighten-1.10.1-py2.py3-none-any.whl (53kB)
Collecting colorlog
  Downloading https://files.pythonhosted.org/packages/32/e6/e9ddc6fa1104fda718338b341e4b3dc31cd8039ab29e52fc73b508515361/colorlog-5.0.1-py2.py3-none-any.whl
Collecting inflection
  Downloading https://files.pythonhosted.org/packages/59/91/aa6bde563e0085a02a435aa99b49ef75b0a4b062635e606dab23ce18d720/inflection-0.5.1-py2.py3-none-any.whl
Collecting ipapy
  Downloading https://files.pythonhosted.org/packages/41/0d/7e8652df6af20a61bb3315f5c9d99fb9ea8f3779ff80fca9d71001230f90/ipapy-0.0.9.

# Setup Dataset

There are two ways to set up your dataset:
1. Mount Google Drive in the Colab environment. If you choose this option and want to access the data directly in our `LCT Project` shared folder, follow the instructions in https://stackoverflow.com/questions/54351852/accessing-shared-with-me-with-colab to load the `Preprocessed file` folder from the shared Google Drive. Essentially, go to the folder's location, right-click it, and choose `Add a shortcut to Drive`. After that, uncomment and run the drive-mounting cell below.
2. Upload your dataset to the `sample_data` folder in the Google Colab environment. Be aware that you **will** lose the data in this folder when you restart the Colab environment.

Remember that every word **must** be in alphabetical or IPA form: no digits and no other non-alphabetical characters. Otherwise, the code will throw an error. If you're unsure whether your data is valid, run the last cell before the **Run Training** section to check whether all your vocabulary entries are valid.


In [14]:
from ipapy.ipastring import IPAString
from ipapy import is_valid_ipa


dia2char = {
    'low': {'à': 'a', 'è': 'e', 'ò': 'o', 'ì': 'i', 'ù': 'u', 'ѐ': 'e', 'ǹ': 'n', 'ỳ': 'y'},
    'high': {'á': 'a', 'é': 'e', 'ó': 'o', 'ú': 'u', 'ý': 'y', 'í': 'i', 'ḿ': 'm', 'ĺ': 'l',
             'ǿ': 'ø', 'ɔ́': 'ɔ', 'ɛ́': 'ɛ', 'ǽ': 'æ', 'ə́': 'ə', 'ŕ': 'r', 'ń': 'n'},
    'rising_falling': {'ã': 'a'},
    'falling': {'â': 'a', 'î': 'i', 'ê': 'e', 'û': 'u', 'ô': 'o', 'ŷ': 'y', 'ĵ': 'j'},
    'rising': {'ǎ': 'a', 'ǐ': 'i', 'ǔ': 'u', 'ǒ': 'o', 'ě': 'e'},
    'extra_short': {'ă': 'a', 'ĕ': 'e', 'ĭ': 'i', 'ŏ': 'o', 'ŭ': 'u'},
    'nasalized': {'ĩ': 'i', 'ũ': 'u', 'ã': 'a', 'õ': 'o', 'ẽ': 'e', 'ṽ': 'v', 'ỹ': 'y'},
    'breathy_voiced': {'ṳ': 'u'},
    'creaky_voiced': {'a̰': 'a', 'ḭ': 'i', 'ḛ': 'e', 'ṵ': 'u'},
    'centralized': {'ë': 'e', 'ä': 'a', 'ï': 'i', 'ö': 'o', 'ü': 'u', 'ÿ': 'y'},
    'mid': {'ǣ': 'æ', 'ū': 'u', 'ī': 'i', 'ē': 'e', 'ā': 'a', 'ō': 'o'},
    'voiceless': {'ḁ': 'a'},
    'extra_high': {'ő': 'o'},
    'extra_low': {'ȁ': 'a'},
    'syllabic': {'ạ': 'a', 'ụ': 'u'}
}


dia2code = {
    'low': 768,
    'high': 769,
    'rising_falling': 771,
    'falling': 770,
    'rising': 780,
    'extra_short': 774,
    'nasalized': 771,
    'breathy_voiced': 804,
    'creaky_voiced': 816,
    'centralized': 776,
    'mid': 772,
    'voiceless': 805,
    'extra_high': 779,
    'extra_low': 783,
    'syllabic': 809,
    'high_rising': 7620,
    'low_rising': 7621,
}


char2ipa_char = dict()
for dia, char_map in dia2char.items():
    code = dia2code[dia]
    s = chr(code)
    for one_char, vowel in char_map.items():
        char2ipa_char[one_char] = vowel + s


to_remove = {'ᶢ', '̍', '-', 'ⁿ', 'ᵑ', 'ᵐ', 'ᶬ', ',', 'ᵊ', 'ˢ', '~', '͍', 'ˣ', 'ᵝ', '⁓', '˭', 'ᵈ', '⁽', '⁾', '˔', 'ᵇ',
             '+', '⁻'}


def clean(s):
    if s == '◌̃':
        return ''
    return ''.join(c for c in s if c not in to_remove)


def sub(s):
    return ''.join(char2ipa_char.get(c, c) for c in s)


to_standardize = {
    'ˁ': 'ˤ',
    "'": 'ˈ',
    '?': 'ʔ',
    'ṭ': 'ʈ',
    'ḍ': 'ɖ',
    'ṇ': 'ɳ',
    'ṣ': 'ʂ',
    'ḷ': 'ɭ',
    ':': 'ː',
    'ˇ': '̌',
    'ỵ': 'y˞',
    'ọ': 'o˞',
    'ř': 'r̝',  # Czech
    '͈': 'ː',  # Irish
    'ŕ̩': sub('ŕ') + '̩',  # sanskrit
    'δ': 'd',  # Greek
    'ń̩': sub('ń') + '̩',  # unsure
    'ε': 'e',
    'X': 'x',
    'ṍ': sub('õ') + chr(769),
    'ÿ̀': sub('ÿ') + chr(768),
    '∅': 'ʏ',  # Norwegian
}


def get_string(s: str) -> IPAString:
    return IPAString(unicode_string=clean(sub(standardize(s))))


def standardize(s):
    return ''.join(to_standardize.get(c, c) for c in s)
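The numbers in `dia2code` are decimal Unicode code points of combining diacritics, and `char2ipa_char` maps each precomposed letter to its base letter followed by that combining mark. The sketch below illustrates the idea with the standard library's `unicodedata` module. It is only an illustration: the notebook uses its hand-written tables, which need not agree with Unicode's canonical decomposition in every case, and `split_diacritic` is a hypothetical helper name.

```python
import unicodedata

# 769 is the value stored under 'high' in dia2code.
acute = chr(769)
print(unicodedata.name(acute))  # name of the combining mark U+0301


def split_diacritic(ch: str):
    """Split a precomposed character into (base letter, combining marks).

    Hypothetical helper using canonical decomposition (NFD); this is not
    how the notebook builds its tables.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]


base, marks = split_diacritic("\u00e1")  # U+00E1 is precomposed a-acute
print(base, [ord(m) for m in marks])

# For this character, the table-style result and NFD coincide.
assert base + marks == "a" + chr(769)
```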

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# Set your dataset path here
KNOWN_LANG_PATH = "/content/sample_data/it_testing.txt"
UNKNOWN_LANG_PATH = "/content/sample_data/srb_lat.txt"

In [22]:
print("Check KNOWN Language file format")
with open(KNOWN_LANG_PATH, "r", encoding="utf8") as f:
    for line in f.readlines():
        s = clean(sub(standardize(line.strip())))
        # print(s, is_valid_ipa(s))
        if not is_valid_ipa(s):
            print(s, "is invalid")


print("\nCheck UNKNOWN_LANG_PATH Language file format")
with open(UNKNOWN_LANG_PATH, "r", encoding="utf8") as f:
    for line in f.readlines():
        s = clean(sub(standardize(line.strip())))
        # print(s, is_valid_ipa(s))
        if not is_valid_ipa(s):
            print(s, "is invalid")

Check KNOWN Language file format
dona39signor is invalid
acquietata8928ritorn is invalid
accetta8929entr is invalid
servi8930entr is invalid

Check UNKNOWN_LANG_PATH Language file format
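
The invalid entries reported for the known-language file all contain digits. As a quick pre-check before running `is_valid_ipa`, the sketch below lists the characters in a word whose Unicode category is a number, punctuation, symbol, or separator. The helper `suspicious_chars` is hypothetical, and the heuristic can flag a few legitimate IPA symbols, so `is_valid_ipa` remains the authoritative check.

```python
import unicodedata


def suspicious_chars(word: str) -> list:
    """Return the characters of `word` that are digits, punctuation,
    symbols, or separators, judged by Unicode major category.

    Hypothetical pre-check; it does not replace `is_valid_ipa`.
    """
    bad_major_classes = {"N", "P", "S", "Z"}
    return [c for c in word if unicodedata.category(c)[0] in bad_major_classes]


for word in ["dona39signor", "servi8930entr", "casa"]:
    print(word, suspicious_chars(word))
```

Words for which the list is empty can still fail `is_valid_ipa`; this only helps locate the most common offenders, such as stray digits.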


# Run Training

As of now, the training runs indefinitely, and we are not sure whether changing this would affect the rest of the code. For that reason, we need to stop the training manually once the losses stop improving. To decide when to stop, monitor the `ll` value that the script prints in a table with the following format (i.e. stop when `ll` is close to zero):

```
+----------------------------------------+  
|                  3_8                   |  
+-----------+----------+--------+--------+  
| name      | value    | weight | mean   |  
+-----------+----------+--------+--------+  
| grad_norm | 58.492   | 60     | 0.975  |  
| ll        | -336.246 | 60     | -5.604 |  
| reg       | 0.555    | 60     | 0.009  |  
+-----------+----------+--------+--------+
```

Another way to decide when to stop is to monitor the model's validation output. See the next section for how to analyze the result.
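
Watching `ll` by eye can be tedious, so here is a minimal sketch of how the `mean` column of the `ll` row could be extracted from captured script output and tested for a plateau. It assumes the table layout shown above; the function names and the `patience` and `min_delta` parameters are hypothetical and not part of `xib`.

```python
def parse_ll_means(log_text: str) -> list:
    """Collect the `mean` column of every `ll` row in the log text."""
    means = []
    for line in log_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 4 and cells[0] == "ll":
            try:
                means.append(float(cells[3]))
            except ValueError:
                pass  # skip rows whose mean is not a number
    return means


def has_plateaued(values, patience=3, min_delta=0.01):
    """Return True if none of the last `patience` values beats the
    earlier best by at least `min_delta` (larger `ll` is better)."""
    if len(values) <= patience:
        return False
    best_before = max(values[:-patience])
    return max(values[-patience:]) < best_before + min_delta


sample = """
| name      | value    | weight | mean   |
| grad_norm | 58.492   | 60     | 0.975  |
| ll        | -336.246 | 60     | -5.604 |
| reg       | 0.555    | 60     | 0.009  |
"""
print(parse_ll_means(sample))
```

A plateau only indicates that `ll` has stopped moving; it is still worth checking the validation output before stopping the run.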

In [4]:
# Total number of phonetic feature groups
NUM_FEATURE_GROUPS = 10

# Total number of phonetic features
NUM_FEATURES = 10

# Initial value of the threshold used to decide whether two words match.
# The larger the value, the more false positives we will get.
# However, if the value is too low, the model will not output anything
THRESHOLD = 1.5

# Cost of an insertion or deletion operation in the edit distance algorithm; refer to the paper for more details
INS_DEL_COST = 100.0

# Learning rate for the Adam optimizer
LR = 0.002

# Number of training steps to run between evaluation steps
EVAL_INTERVAL = 500
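
To give an intuition for `INS_DEL_COST`, the sketch below implements a textbook weighted edit distance in which insertions and deletions share one cost and substitutions cost 1. This is an assumption made for illustration only: the cost function actually used by `xib` is defined in the paper and the code and may differ, and `weighted_edit_distance` is a hypothetical name.

```python
def weighted_edit_distance(a: str, b: str, ins_del_cost: float = 1.0,
                           sub_cost: float = 1.0) -> float:
    """Edit distance where insertion and deletion cost `ins_del_cost`
    and substituting one character for another costs `sub_cost`."""
    # prev[j] is the cost of turning the processed prefix of `a` into b[:j].
    prev = [j * ins_del_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * ins_del_cost]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else sub_cost)
            delete = prev[j] + ins_del_cost
            insert = curr[j - 1] + ins_del_cost
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


# With a high insertion/deletion cost, length changes are heavily penalized.
print(weighted_edit_distance("casa", "cosa", ins_del_cost=100.0))
print(weighted_edit_distance("casa", "cas", ins_del_cost=100.0))
```

With a cost as large as 100.0, alignments that change the word length become far more expensive than substitutions, which is the effect the setting above appears to aim for.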

In [24]:
!PYTHONPATH=/content/xib && /usr/local/bin/python -m xib.main --task extract \
  --vocab_path {KNOWN_LANG_PATH} --data_path {UNKNOWN_LANG_PATH} \
  --dim 112 --min_word_length 1 --max_word_length 10 --input_format text \
  --dense_input --eval_interval {EVAL_INTERVAL} \
  --char_per_batch 128 --gpus 0 \
  --num_feature_groups {NUM_FEATURE_GROUPS} --num_features {NUM_FEATURES} \
  --init_threshold {THRESHOLD} --init_ins_del_cost {INS_DEL_COST} --learning_rate {LR}

  return f(*args, **kwds)
  from collections import MutableSequence
  from pandas import Panel
INFO - 05/31/21 14:50:50 - 0:00:00 at parser.py:152 - xib.ipa.process:
                                                      	min_word_length: 1
                                                      xib.data_loader.BaseIpaDataLoader:
                                                      	char_per_batch: 128
                                                      	data_path: /content/sample_data/srb_lat.txt
                                                      	new_style: False
                                                      	num_workers: 0
                                                      xib.data_loader:
                                                      	broken_words: False
                                                      	input_format: text
                                                      	max_segment_length: 10
                                                    

# Analyzing The Result

The result is written to the following file: `log/<DATE>/default/<TIME>/predictions/extract.<EPOCH>_<STEPS>.tsv`

For some reason, Google Colab doesn't show anything under the `log` folder, so I suggest inspecting the file with the `cat` command or downloading it to your local computer and analyzing it there.

The `tsv` file has four columns: `segment`, `ground_truth`, `prediction`, and `matched_segment`. From my understanding, `segment` is the original segment of the unknown language; `ground_truth` is similar to `segment` but includes the exact index locations; `prediction` is the predicted vocabulary entry in the known language; and `matched_segment` indicates which segment of the unknown language matches the vocabulary entry in the known language. For more details, see the code in `evaluator.py`, line 257.
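
The four columns can be loaded with Python's standard `csv` module. The sketch below assumes the file starts with a header row naming the columns, which is not confirmed here; `read_predictions` is a hypothetical helper, and the sample row contains invented placeholder values rather than real model output.

```python
import csv
import io


def read_predictions(handle) -> list:
    """Read a tab-separated predictions file into a list of dicts.

    Assumes a header row with the column names listed above.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Placeholder content, invented for illustration only.
sample = io.StringIO(
    "segment\tground_truth\tprediction\tmatched_segment\n"
    "abc\tabc\txyz\tabc\n"
    "def\tdef\t\t\n"
)
rows = read_predictions(sample)
with_prediction = [row for row in rows if row["prediction"]]
print(len(rows), "rows,", len(with_prediction), "with a prediction")
```

For a real file, pass an open file object instead, for example `open(path, encoding="utf8", newline="")`.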


In [None]:
# Run the following commands **sequentially**
# !ls log/ # lists the experiment dates
# !ls log/<date>/default/ # lists the experiment times; replace <date> with a value from the command above
# !ls log/<date>/default/<time>/predictions/ # lists the prediction filenames; replace <date> and <time> with values from the commands above

# Replace <date>, <time>, and <filename> with values from the commands above
# !cat log/<date>/default/<time>/predictions/<filename>
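
Instead of listing the directories one level at a time, the most recent predictions file can be located with a glob over the directory layout described above. This is a minimal sketch assuming the default `log/<DATE>/default/<TIME>/predictions/` structure; `latest_prediction_file` is a hypothetical helper, and "most recent" is judged by file modification time.

```python
from pathlib import Path


def latest_prediction_file(log_root="log"):
    """Return the most recently modified predictions file, or None."""
    root = Path(log_root)
    if not root.is_dir():
        return None  # no log directory yet, e.g. before the first run
    candidates = list(root.glob("*/default/*/predictions/extract.*.tsv"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


print(latest_prediction_file())
```

The returned path can then be passed to `cat` or to `files.download`.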

In [None]:
# Or you can run the following command to download the result to your local environment
from google.colab import files

files.download("log/2021-05-29/default/09-15-25/predictions/extract.1_10.tsv")


Alternatively, you can direct the output to your Google Drive folder by setting the script's `--log_dir` parameter. Note, however, that with this parameter set, the directory structure will differ from the default one described above.