# Notebook for Running the ML Models

**Subword Contextual Embeddings for Languages with Rich Morphology**

Arda Akdemir, Tetsuo Shibuya, Tunga Gungor



This notebook makes it easy to re-run the experiments and train new models for the paper "Subword Contextual Embeddings for Languages with Rich Morphology".



In [1]:
## Mount drive to run the source code on Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
## Move cwd to source code
path_to_source_code = "drive/MyDrive/dlaPaper_sourceCode/pyJNERDEP"
%cd $path_to_source_code

/content/drive/MyDrive/dlaPaper_sourceCode/pyJNERDEP


In [13]:
!pwd

/content/drive/My Drive/dlaPaper_sourceCode/pyJNERDEP


***Datasets***: All datasets are available at:

https://drive.google.com/drive/folders/1AHVHB1t0_9-0oMjSMTnnMiOjywr7bavh?usp=sharing


By default, all datasets are expected to be in the "../../datasets" folder, relative to the source code. 
Please place the datasets accordingly.
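
To see where that default resolves, the sketch below normalizes "../../datasets" against the source-code directory used earlier in this notebook. It assumes the scripts are launched from that directory; the helper name `resolve_dataset_dir` is hypothetical and not part of the repository.

```python
import posixpath


def resolve_dataset_dir(source_dir, relative="../../datasets"):
    """Return the absolute folder implied by a relative default (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(source_dir, relative))


# Source-code directory taken from the %cd cell above.
source_dir = "/content/drive/MyDrive/dlaPaper_sourceCode/pyJNERDEP"
dataset_dir = resolve_dataset_dir(source_dir)
print(dataset_dir)  # /content/drive/MyDrive/datasets
```

Under that assumption, the datasets folder sits directly under MyDrive, two levels above the source code.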

In [14]:
# Install requirements
req_path="/content/drive/MyDrive/dlaPaper_sourceCode/pyJNERDEP/docker_req.txt"
!pip3 install -r $req_path;



## Download Data Dependencies

The code below defines helpers for downloading the pretrained models needed to replicate the results.

In [3]:
import subprocess
import sys
import gdown
import zipdir
from zipfile import ZipFile
import os

id_model_map = {"SA": {"twitter_turkish": "10mOwZGp4-NTo9K_bJkE2KlWa4HUudW07",
                       "movie_turkish": "1IoQhYijlWVlK0vHnVUn0e1ohwahePO7R",
                       "movie_english": "1t2XgkbfxGPjvThEg-wkO_ejTOlikIzv7"},
                "NER": {"bert": "1M9-JWPL535IIDUSoDNDpMRzTex8tBbN7",
                        "mbert": "1GCcU5hP86CDnH3Jhb8MBE9bHW9zl3vbb",
                        "bert_en": "1KswPuDKxWOl-3qBQ_BESwBpd0vMzN0zn",
                        "fastext": "1JWHSHDmTxsZoYwkc6_K8Wz76JSUmJVhA",
                        "random_init": "117tWTA18lC6iOd31Maypg8jpNMQhPptf",
                        "word2vec": "1E5jGGlhbevjSg-oprf_e0vU2y89_zhHJ"
                        },
                "DEP": {"bert": "1qdWkuwPkyKMBKKHMgTB2sAw_SmdqJ99m",
                        "mbert": "1p4VJg_hQhC1yIiabJbPOuuDwj8jfeOlP",
                        "bert_en": "1NLxgynxoDVb0QoN2NB8z9bf6OrfVNUU7",
                        "fastext": "1reGe2vMsGU-xbXV6dDACzvn8ViRhJKof",
                        "random_init": "1_AqzeDVlsTUlnhduQtU7tE0VyaZTIADj",
                        "word2vec": "1gF7ujIvRdmwvIKIdGw9bKBoKgstX1Mws"
                        }
                }

names = {"SA": "Sentiment Analysis",
         "NER": "Named Entity Recognition",
         "DEP": "Dependency Parsing",
         "FLAT_DEP": "Multi-task Learning DEP",
         "FLAT_NER": "Multi-task Learning NER"}

word2vec_driveIds = {"jp": "1dYISBXsgK3yR6mw-LRGfjGrcN3aVme2q",
                     "tr": "14WH-amhKXn4ayqi2lugUSIoS7b8q0U9H",
                     "hu": "1dmEC0-7Zkmc4p9OmTIw3JKMJ7aNyGYIt",
                     "en": "1avdWgjq138lrfJnIZVRpa9EaLJrU4HVj",
                     "fi": "1wtqAc4FZ6wl4w4_kSozWbDjgUUV22Fjr",
                     "cs": "1ibFwJ6B01Kpm6k6qdI1cJ64wLyaJy-s3"}

word2vec_dict = {"jp": "../word_vecs/jp/jp.bin",
                 "tr": "../word_vecs/tr/tr.bin",
                 "hu": "../word_vecs/hu/hu.bin",
                 "en": "../word_vecs/en/en.txt",
                 "fi": "../word_vecs/fi/fi.bin",
                 "cs": "../word_vecs/cs/cs.txt"}
                 
def download_link_generator(drive_id):
    """Build a direct-download URL from a Google Drive file ID."""
    return "https://drive.google.com/uc?id={}".format(drive_id)


def drive_download_w2v(lang, save_path):
    """Download the word2vec model for the given language code to save_path."""
    print("\nDownloading word2Vec model for {} to {}".format(lang, save_path))
    link = download_link_generator(word2vec_driveIds[lang])
    gdown.download(link, save_path, quiet=False)
    return save_path


def unzip(src, dest):
    """Extract every file in the zip archive src into the folder dest."""
    with ZipFile(src, 'r') as zipObj:
        zipObj.extractall(dest)
        print('File is unzipped in {} folder'.format(dest))


def load_download_models(model_type, word_type, save_folder=None):
    """Download and unzip the trained models for (model_type, word_type) unless already present."""
    drive_id = id_model_map[model_type][word_type]
    print("\n===Downloading trained {} models to replicate the result===\n".format(names[model_type]))
    link = download_link_generator(drive_id)
    dest = "../{}_{}_models.zip".format(model_type, word_type)
    # Default to a folder named after the zip file, without its extension
    unzip_path = save_folder if save_folder else "../{}".format(os.path.split(dest)[-1].split(".")[0])
    if os.path.exists(unzip_path):
        print("Models for {} {} are already downloaded.".format(model_type, word_type))
        return unzip_path
    print("{} not found. Downloading trained models for {} {}".format(unzip_path, model_type, word_type))
    gdown.download(link, dest, quiet=False)
    unzip(dest, unzip_path)
    print("Trained models are stored in {}".format(unzip_path))
    return unzip_path
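
Before triggering a large download, it can help to preview which link and folders a call would use. The sketch below mirrors the path logic of load_download_models without touching the network. The function name `download_plan` is hypothetical; the Drive ID is the NER/mbert entry from id_model_map above.

```python
import os


def download_plan(model_type, word_type, drive_id, save_folder=None):
    """Preview the URL, zip path and unzip folder for one model archive (no download)."""
    link = "https://drive.google.com/uc?id={}".format(drive_id)
    dest = "../{}_{}_models.zip".format(model_type, word_type)
    # Same default as the notebook: a folder named after the zip file
    default_folder = "../{}".format(os.path.splitext(os.path.basename(dest))[0])
    unzip_path = save_folder if save_folder else default_folder
    return {
        "link": link,
        "zip": dest,
        "unzip_path": unzip_path,
        "already_downloaded": os.path.exists(unzip_path),
    }


plan = download_plan("NER", "mbert", "1GCcU5hP86CDnH3Jhb8MBE9bHW9zl3vbb")
for key, value in plan.items():
    print(key, "->", value)
```

If "already_downloaded" is True, the real helper would skip the download and return the existing folder.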


# Run Sentiment Analysis Experiments

The bash script below produces the results with a single CLI call.
Steps:

- Download the trained models
- Run for each language/model combination
- Store results in denoted folder

In [4]:
save_dir = "../sa_experiment_results_3103" # folder to store the SA experiment results

In [5]:
import os
# Each combination is (dataset, language code, language name)
for d, l, lang in [["movie","en","english"],["movie","tr","turkish"],["twitter","tr","turkish"]]:
  model_folder = os.path.join("../","SA_{}_{}_models".format(lang,d))
  for word_type in ["bert","mbert","bert_en", "fastext", "word2vec", "random_init"]:
    !bash "experiment_scripts/run_sa_models.sh" $l $lang $d $word_type $model_folder  $save_dir

Downloading trained Sentiment Analysis models to: ../SA_english_movie_models

===Downloading trained Sentiment Analysis models to replicate the result===

Models for SA movie_english are already downloaded.
Content of the save model folder: ['movie_en_bert_best_sa_model_weights.pkh', 'movie_en_mbert_best_sa_model_weights.pkh', 'movie_en_bert_en_best_sa_model_weights.pkh', 'movie_en_word2vec_best_sa_model_weights.pkh', 'movie_en_fastext_best_sa_model_weights.pkh', 'movie_en_random_init_best_sa_model_weights.pkh']
Running for  sa  english  bert
2021-03-31 13:39:26.293838: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading trained Sentiment Analysis models to: ../SA_english_movie_models

===Downloading trained Sentiment Analysis models to replicate the result===

Models for SA movie_english are already downloaded.
Content of the save model folder: ['movie_en_bert_best_sa_model_weights.pkh', 'movie_en_mbert_best

In [6]:
sa_result_path = os.path.join(save_dir,"sa_test_results.txt")
!cat $sa_result_path

Model	Accuracy	F1
movie_en_bert_en	0.833	0.845
movie_en_fastext	0.746	0.763
movie_en_word2vec	0.761	0.774
movie_en_random_init	0.745	0.768
movie_en_mbert	0.804	0.816
movie_en_bert_en	0.833	0.845
movie_en_fastext	0.746	0.763
movie_en_word2vec	0.761	0.774
movie_en_random_init	0.745	0.768
movie_tr_bert	0.93	0.895
movie_tr_mbert	0.819	0.74
movie_tr_bert_en	0.758	0.709
movie_tr_fastext	0.865	0.805
movie_tr_word2vec	0.876	0.815
movie_tr_random_init	0.87	0.804
twitter_tr_bert	0.866	0.849
twitter_tr_mbert	0.753	0.682
twitter_tr_bert_en	0.761	0.72
twitter_tr_fastext	0.739	0.673
twitter_tr_word2vec	0.745	0.696
twitter_tr_random_init	0.714	0.654
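
The results file is tab-separated and, as the listing above shows, can contain repeated rows (for example, movie_en_bert_en appears twice). A minimal sketch for loading such a file, collapsing duplicates, and picking the best model per dataset by F1 follows. The function names are hypothetical, and the sample rows are copied from the output above.

```python
import csv
import io


def parse_sa_results(text):
    """Parse tab-separated SA results; a later row overwrites an earlier duplicate."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows[row["Model"]] = (float(row["Accuracy"]), float(row["F1"]))
    return rows


def best_by_f1(rows):
    """Group models by their '<dataset>_<lang>' prefix and keep the highest-F1 one."""
    best = {}
    for model, (_, f1) in rows.items():
        dataset = "_".join(model.split("_")[:2])
        if dataset not in best or f1 > best[dataset][1]:
            best[dataset] = (model, f1)
    return best


sample = (
    "Model\tAccuracy\tF1\n"
    "movie_en_bert_en\t0.833\t0.845\n"
    "movie_en_mbert\t0.804\t0.816\n"
    "movie_en_bert_en\t0.833\t0.845\n"
    "movie_tr_bert\t0.93\t0.895\n"
    "movie_tr_mbert\t0.819\t0.74\n"
)
rows = parse_sa_results(sample)
best = best_by_f1(rows)
print(len(rows), "unique models")
print(best)
```

To run it on the real file, pass `open(sa_result_path).read()` instead of `sample`.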


# Run NER and DEP Experiments

**NOTE!!** To run monolingual Hungarian BERT models, you need to obtain the TensorFlow-based huBERT  from:  
 https://hlt.bme.hu/en/resources/hubert 

By default, the model folder should reside under "../bert_models/hubert".
Use the --hubert_path argument to change this path.

## Single Task Learning Results


In [None]:
import os
import subprocess
save_dir = "all_stl_results_3103"
langs=["czech", "turkish",  "japanese", "english", "finnish", "hungarian"]
lang_prefs = ["cs","tr","jp","en","fi","hu"]
bert_embed_types = ["bert_en","mbert","bert"]
nonbert_embed_types = ["random_init","word2vec","fastext"]
tasks = ["NER","DEP"]

for task in tasks:
  for word_type in bert_embed_types + nonbert_embed_types:
    model_folder=os.path.join("../","{}_{}_models".format(task,word_type))
    load_download_models(task, word_type, save_folder=model_folder)
    for l_p, l in zip(lang_prefs,langs):
      task_lower = task.lower()
      load_path = os.path.join(model_folder,"{}_{}_{}_best_{}_model.pkh".format(task,word_type,l_p,task_lower))
      !python jointtrainer_multilang.py --mode "predict" --eval_mode $task --model_type $task  --load_model 1 --load_path $load_path --word_embed_type $word_type   --lang $l_p --save_dir $save_dir
    cmd = "rm -r {}".format(model_folder)
    print("Removing the downloaded models to free space...")
    subprocess.call(cmd,shell=True)


===Downloading trained Dependency Parsing models to replicate the result===

../DEP_word2vec_models not found. Downloading trained models for DEP word2vec


Permission denied: https://drive.google.com/uc?id=1gF7ujIvRdmwvIKIdGw9bKBoKgstX1Mws
Maybe you need to change permission over 'Anyone with the link'?


FileNotFoundError: ignored
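
The log above shows a Drive permission error followed by a FileNotFoundError, which suggests the unzip step ran on an archive that was never written. One defensive option, sketched below, is to validate the archive before extracting it so that a failed download is reported and skipped instead of stopping the loop. The helper name `extract_if_valid` is hypothetical and not part of the repository.

```python
import os
import tempfile
from zipfile import ZipFile, is_zipfile


def extract_if_valid(src, dest):
    """Extract src into dest only if src exists and is a zip archive.

    Returns True on success and False when the archive is missing or invalid.
    """
    if not os.path.isfile(src) or not is_zipfile(src):
        print("Skipping {}: archive missing or not a valid zip file".format(src))
        return False
    with ZipFile(src, "r") as archive:
        archive.extractall(dest)
    return True


# Demonstration in a temporary directory: one missing archive, one valid archive.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "out")
    missing_ok = extract_if_valid(os.path.join(tmp, "DEP_word2vec_models.zip"), out_dir)

    good_zip = os.path.join(tmp, "good.zip")
    with ZipFile(good_zip, "w") as archive:
        archive.writestr("model.pkh", "placeholder")
    good_ok = extract_if_valid(good_zip, out_dir)
    extracted = os.listdir(out_dir)

print(missing_ok, good_ok, extracted)
```

The permission message itself points at the sharing settings of the Drive file, so the check only keeps the remaining combinations running; it does not recover the missing models.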

# Run Multi-Task Learning (MTL) Experiments

First, download all the pretrained models from the link below:

https://drive.google.com/drive/folders/1OL52-tDvHPWOReNnNytgjotxDB7z_187?usp=sharing

The models are grouped by (task, word_embed) pair. For example, the word2vec-based DEP MTL models for all 6 languages are stored in a single zip file.

 
  - Tasks: DEP, NER
  - Word_embed types: bert, mbert, bert_en, fastext, word2vec, random_init

For each (task, word_embed) combination, make sure the unzipped model folder exists under "root_folder".
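
As a pre-flight check, the sketch below lists every checkpoint path that the evaluation cell in this section would try to load and reports how many are missing. The folder and file name patterns are copied from that cell; whether the downloaded MTL archives follow them exactly is an assumption, and `expected_model_paths` is a hypothetical helper.

```python
import os

TASKS = ["NER", "DEP"]
WORD_TYPES = ["bert_en", "mbert", "bert", "random_init", "word2vec", "fastext"]
LANG_PREFS = ["cs", "tr", "jp", "en", "fi", "hu"]


def expected_model_paths(root_folder):
    """Return the checkpoint paths for every (task, word_embed, language) combination."""
    paths = []
    for task in TASKS:
        for word_type in WORD_TYPES:
            folder = os.path.join(root_folder, "{}_{}_models".format(task, word_type))
            for lang in LANG_PREFS:
                name = "{}_{}_{}_best_{}_model.pkh".format(task, word_type, lang, task.lower())
                paths.append(os.path.join(folder, name))
    return paths


paths = expected_model_paths("../")
missing = [p for p in paths if not os.path.exists(p)]
print("{} expected, {} missing".format(len(paths), len(missing)))
```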

**NOTE**: Due to the Colab space limits, you may need to unmount the folder containing the pretrained models of the previous experiments.
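
To judge whether space is running low, and to clear a finished experiment's model folder without shelling out to rm -r, something like the following can be used. Both helper names are hypothetical; the demonstration runs on a temporary folder, not on real model data.

```python
import os
import shutil
import tempfile


def free_gb(path="."):
    """Return the free disk space at path, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def remove_folder(folder):
    """Delete folder and its contents; return True if a folder was removed."""
    if os.path.isdir(folder):
        shutil.rmtree(folder)
        return True
    return False


# Demonstration on a throwaway folder containing one placeholder file.
demo = tempfile.mkdtemp()
open(os.path.join(demo, "model.pkh"), "w").close()
print("Free space: {:.1f} GB".format(free_gb(demo)))
removed = remove_folder(demo)
print(removed, os.path.exists(demo))
```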

In [None]:

import os
import subprocess
root_folder = "../"
save_dir = "all_mtl_results_3103"
langs=["czech", "turkish",  "japanese", "english", "finnish", "hungarian"]
lang_prefs = ["cs","tr","jp","en","fi","hu"]
bert_embed_types = ["bert_en","mbert","bert"]
nonbert_embed_types = ["random_init","word2vec","fastext"]
tasks = ["NER","DEP"]
model_type = "FLAT"
for task in tasks:
  for word_type in bert_embed_types + nonbert_embed_types:
    model_folder=os.path.join(root_folder,"{}_{}_models".format(task,word_type))
    for l_p, l in zip(lang_prefs,langs):
      task_lower = task.lower()
      load_path = os.path.join(model_folder,"{}_{}_{}_best_{}_model.pkh".format(task,word_type,l_p,task_lower))
      !python jointtrainer_multilang.py --mode "predict" --eval_mode $task --model_type $model_type  --load_model 1 --load_path $load_path --word_embed_type $word_type   --lang $l_p --save_dir $save_dir
    cmd = "rm -r {}".format(model_folder)
    print("Removing the downloaded models to free space...")
    subprocess.call(cmd,shell=True)

# Training New Models

You can also use this Colab notebook to train new models.
For word2vec models, you need to obtain the pretrained embeddings before starting training.  
You can obtain word2vec embeddings from:

https://drive.google.com/drive/folders/1GqUlRknYWjSECdczeFQ0xxLzxnJAotHQ?usp=sharing

Make sure to update the path variables accordingly. By default, word2vec embeddings must reside under "../word_vecs".  
Examples:
- Hungarian Word2Vec path: "../word_vecs/hu/hu.bin"
- English Word2Vec path: "../word_vecs/en/en.txt" 
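
The layout above can be captured in a small lookup that returns the expected file for a language and whether it is a binary or text file. The extensions are copied from word2vec_dict in the download cell; treating ".bin" as the binary word2vec format is an assumption, and `word2vec_location` is a hypothetical helper.

```python
import os

# File extensions per language code, copied from word2vec_dict above.
W2V_EXT = {"jp": "bin", "tr": "bin", "hu": "bin", "en": "txt", "fi": "bin", "cs": "txt"}


def word2vec_location(lang, root="../word_vecs"):
    """Return (path, is_binary) for the word2vec file of a language code."""
    ext = W2V_EXT[lang]
    path = "{}/{}/{}.{}".format(root, lang, lang, ext)
    return path, ext == "bin"


for lang in ["hu", "en"]:
    path, is_binary = word2vec_location(lang)
    kind = "binary" if is_binary else "text"
    status = "found" if os.path.exists(path) else "missing"
    print(lang, path, kind, status)
```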

**NOTE!!** To train monolingual Hungarian BERT models, you need to obtain the TensorFlow-based huBERT  from:  
 https://hlt.bme.hu/en/resources/hubert 

By default, the model folder should reside under "../bert_models/hubert".
Use the --hubert_path argument to change this path.
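
The training cells that follow all call the same script with four flags: --model_type (NER, DEP, or FLAT for multi-task learning), --word_embed_type, --lang, and --save_dir. A small sketch for assembling that call programmatically is given below. It uses only flags that appear in this notebook; the function name `build_train_command` is hypothetical.

```python
import shlex


def build_train_command(model_type, word_embed_type, lang, save_dir):
    """Assemble the training call as an argument list (flags taken from this notebook)."""
    return [
        "python", "jointtrainer_multilang.py",
        "--model_type", model_type,
        "--word_embed_type", word_embed_type,
        "--lang", lang,
        "--save_dir", save_dir,
    ]


cmd = build_train_command("NER", "mbert", "tr", "../ner_turkish_mbert_1")
cmd_line = " ".join(shlex.quote(part) for part in cmd)
print(cmd_line)
# To launch it from Python instead of a "!" cell: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when a save directory contains spaces, such as a path under "My Drive".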

In [None]:
langs = ["cs","fi","en","tr","jp","hu"]
bert_embed_types = ["bert_en","mbert","bert"]
nonbert_embed_types = ["random_init","word2vec","fastext"]
save_dir="../ner_turkish_mbert_1"
# Training NER model with mbert for Turkish
!python jointtrainer_multilang.py --model_type NER  --word_embed_type mbert  --lang tr --save_dir $save_dir


2021-03-31 13:59:58.500495: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
{'data_dir': 'data/depparse', 'data_folder': '../../datasets', 'wordvec_dir': '../word_vecs', 'train_file': None, 'eval_file': None, 'output_file': None, 'gold_file': None, 'ner_result_out_file': 'ner_results', 'ner_test_result_file': 'ner_test_results.txt', 'dep_test_result_file': 'dep_test_results.txt', 'log_file': 'jointtraining.log', 'ner_train_file': '../../datasets/traindev_pos.tsv', 'dep_train_file': '../../datasets/tr_imst-ud-traindev.conllu', 'ner_val_file': '../../datasets/dev_pos.tsv', 'dep_val_file': '../../datasets/tr_imst-ud-train.conllu', 'ner_test_file': '../../datasets/test_pos.tsv', 'dep_test_file': '../../datasets/tr_imst-ud-test.conllu', 'ner_output_file': 'joint_ner_out.txt', 'dep_output_file': 'joint_dep_out.txt', 'conll_file_name': 'conll_ner_output', 'config_file': 'config.json', 'mode': 'train', 'eval_mode': 'BOTH', '

In [10]:
save_dir="../dep_turkish_bert_en_1"
# Training DEP model with bert_en for Turkish
!python jointtrainer_multilang.py --model_type DEP  --word_embed_type bert_en  --lang tr --save_dir $save_dir


Training:  23% 88/383 [00:16<00:53,  5.53it/s]
Traceback (most recent call last):
  File "jointtrainer_multilang.py", line 1603, in <module>
    main(args)
  File "jointtrainer_multilang.py", line 

In [12]:
save_dir="../mtl_czech_fastext_1"
# Training MTL (FLAT) model with fastext for Czech
!python jointtrainer_multilang.py --model_type FLAT  --word_embed_type fastext  --lang cs --save_dir $save_dir

2021-03-31 12:49:05.251035: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
{'data_dir': 'data/depparse', 'data_folder': '../../datasets', 'wordvec_dir': '../word_vecs', 'train_file': None, 'eval_file': None, 'output_file': None, 'gold_file': None, 'ner_result_out_file': 'ner_results', 'ner_test_result_file': 'ner_test_results.txt', 'dep_test_result_file': 'dep_test_results.txt', 'log_file': 'jointtraining.log', 'ner_train_file': '../../datasets/traindev_pos.tsv', 'dep_train_file': '../../datasets/tr_imst-ud-traindev.conllu', 'ner_val_file': '../../datasets/dev_pos.tsv', 'dep_val_file': '../../datasets/tr_imst-ud-train.conllu', 'ner_test_file': '../../datasets/test_pos.tsv', 'dep_test_file': '../../datasets/tr_imst-ud-test.conllu', 'ner_output_file': 'joint_ner_out.txt', 'dep_output_file': 'joint_dep_out.txt', 'conll_file_name': 'conll_ner_output', 'config_file': 'config.json', 'mode': 'train', 'eval_mode': 'BOTH', '