## **server*NMT*&reg; with OpenNMT-py 2.0**
*Streamlining workflows for Neural Machine Translation*

![DCU](https://github.com/seamusl/nmt/blob/main/logos/dcu.png?raw=true) ![MTU](https://github.com/seamusl/nmt/blob/main/logos/mtu.png?raw=true)




# **Readme** 

*  Specify the directory on your Google Drive where models and results are to be stored.
*  Upload a zip file containing the following files:
   * src-val.txt, tgt-val.txt, src-test.txt, tgt-test.txt
   * src-train.txt, tgt-train.txt
   * vanilla.yaml, transformer.yaml (defaults are downloaded if not specified)
*  The experiment log contains the following evaluations:
   * BLEU corpus level (mixed case), BLEU corpus level (lower case)
   * chrF 1, chrF 3, METEOR
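
The upload format above can be checked before a run starts. The following is a minimal sketch and not part of the notebook: `missing_corpus_files` and `REQUIRED_FILES` are hypothetical names, and the two YAML configs are treated as optional because defaults are downloaded when they are not supplied.

```python
import zipfile

# Corpus files the Readme lists as required; the YAML configs are optional.
REQUIRED_FILES = [
    "src-train.txt", "tgt-train.txt",
    "src-val.txt", "tgt-val.txt",
    "src-test.txt", "tgt-test.txt",
]

def missing_corpus_files(zip_path):
    """Return the required file names that are not present in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # compare on base names so files inside a sub-folder are still found
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in REQUIRED_FILES if name not in present]
```

An empty list means every required corpus file was found in the archive.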

MIT License

##### © 2023 Adapt Centre, DCU / MTU, Ireland.
##### Author: Séamus Lankford  
##### seamus.lankford[at]adaptcentre.ie


Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

![MIT-license](https://github.com/seamusl/nmt/blob/main/logos/mit_license.png?raw=true) 




## **Citation**

### **Using bibtex**

```
@misc{lankford_way_alfi_2023,
 title={serverNMT with OpenNMT 2.0},
 url={https://github.com/adaptNMT},
 publisher={Adapt Centre, Dublin City University},
 author={Lankford, Seamus and Way, Andy and Alfi, Haithem},
 year={2021},
 month={Mar}
}
```

### Using text citation

Lankford, S., Way, A., & Alfi, H. (2023, January 22). serverNMT with OpenNMT 2.0 (Version 1.0) [Computer software]. Retrieved from https://github.com/adaptNMT


# **Initialize**



Some or all of the following may need to be installed outside of Colab, depending on the machine on which the server application is installed.
```
sudo apt-get install python3.7
# following is required to be installed locally for sentencepiece to work
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
# git must be installed locally in order to pull down repos
sudo apt install git
sudo apt install gnome-online-accounts
```

## **Jupyter notebook server**

### Setup Jupyter notebook



To allow Colaboratory to connect to your locally running Jupyter server, perform the following steps.

Step 1: Install Jupyter on your local machine.

Step 2: Install and enable the `jupyter_http_over_ws` Jupyter extension (one-time).

The `jupyter_http_over_ws` extension is authored by the Colaboratory team and is available on GitHub.


```
pip3 install jupyterlab
pip3 install jupyter_http_over_ws
sudo apt  install jupyter-core
jupyter serverextension enable --py jupyter_http_over_ws
```

### Start server and authenticate

Step 3: Start the server and authenticate. New notebook servers are started normally, though you will need to set a flag to explicitly trust WebSocket connections from the Colaboratory frontend.

```
jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888 \
  --NotebookApp.port_retries=0
```

Once the server has started, it will print a message with the initial backend URL used for authentication. Make a copy of this URL as you'll need to provide this in the next step.

Step 4: Connect to the local runtime.

In Colaboratory, click the "Connect" button and select "Connect to local runtime...". Enter the URL from the previous step in the dialog that appears and click the "Connect" button. You should now be connected to your local runtime.

Browser-specific settings

Note: If you're using Mozilla Firefox, you'll need to set the `network.websocket.allowInsecureFromHTTPS` preference within the Firefox config editor. Colaboratory makes a connection to your local kernel using a WebSocket. By default, Firefox disallows connections from HTTPS domains using standard WebSockets.

If Jupyter fails to start on your system, upgrading ipykernel may fix it:

```
pip3 install --upgrade ipykernel
```

See issues:
ipython/ipython#11258 and ipython/ipykernel#335



# **Setup Environment**



In [None]:
#@markdown ### Directory for results and models:
results_dir = "test_server" #@param {type:"string"}

#@markdown ### Cell debug:
debug_on = True #@param {type:"boolean"}

noise = '&> /dev/null'
if debug_on:
  noise = ''

HOME = '~'
%cd $HOME

In [None]:
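import os
# results_dir is a Python variable, so test its path rather than the literal
# string '$results_dir'
exists = os.path.isdir(os.path.expanduser(os.path.join(HOME, results_dir)))
if not exists:
  # recursively create the results directory with its parents
  %mkdir -p $results_dir
  %cd $HOME/$results_dir
  %mkdir data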

/home/seamus/test_server


In [None]:
%cd $HOME
# install OpenNMT 2.0 in google home dir
''' <pip install opennmt> installs old version of OpenNMT so instead 
must download OpenNMT source and build'''

import os
exists = os.path.isdir('OpenNMT-py')
if not exists:
  !git clone https://github.com/OpenNMT/OpenNMT-py.git $noise

%cd OpenNMT-py/
%pwd
!pip3 install -e . $noise

In [None]:
#@title Vocab size, submodel type and input dataset

# install sentencepiece in google home dir
%cd $HOME 

use_sub_model = False #@param {type:"boolean"}
v_size = 32000 #@param {type:"number"}
m_type = 'bpe' #@param ["unigram", "bpe"]

''' <pip install sentencepiece> doesn't install properly so
must download source and build 
More info: https://github.com/google/sentencepiece ''' 

if use_sub_model == True:
  !git clone https://github.com/google/sentencepiece.git
  %cd sentencepiece
  %mkdir build
  %cd build
  ! cmake .. $noise
  ! make -j $(nproc) $noise
  ! make install $noise
  ! ldconfig -v

### **Setup graphics and packages**

In [None]:
# Display GPU details provided by Google 
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu '
        'to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)
  # append the GPU details to the experiment log and close the file afterwards
  with open("experiment_log.txt", "a") as log:
    print(gpu_info, file=log)

Sun Mar 28 11:49:18 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 24%   47C    P5    13W / 180W |    497MiB /  8119MiB |     32%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# tee command reads standard input and writes it to both standard output and
# one or more files. Useful for logging corpora details and evaluation results
!pip3 install tee $noise

# execution time of each cell tracked with following package. 
# useful for tracking model training times
!pip3 install ipython-autotime $noise
%load_ext autotime

# **Anvil Server**

## **Setup local Anvil server**

### Enter the Uplink key from your Anvil web app.
For information on how to get your app's Uplink key, see [Step 4 - Enable the Uplink](https://anvil.works/learn/tutorials/google-colab-to-web-app#step-4-enable-the-uplink).

In [None]:
from getpass import getpass
uplink_key = getpass('Enter your Uplink key: ')

Enter your Uplink key: ··········


In [None]:
!pip3 install anvil-uplink
import anvil.server
anvil.server.connect(uplink_key)

# **API functions**

### **Split input datasets into train, test and val sets**

In [None]:
@anvil.server.callable
def split_train_val_test(src, tgt):
  import random

  # user_dir is not set inside this function; it must already exist as a global
  %cd $user_dir/data/
  train_percent = 92.5
  valid_percent = 5
  test_percent = 2.5

  with open(src, 'r') as f:
    s_data = f.readlines()
  with open(tgt, 'r') as f:
    t_data = f.readlines()

  # s_data and t_data must be the same length for a parallel set
  assert len(s_data) == len(t_data), "source and target line counts differ"
  num_of_data = len(s_data)

  num_train = int((train_percent/100.0)*num_of_data)
  num_valid = int((valid_percent/100.0)*num_of_data)
  num_tests = int((test_percent/100.0)*num_of_data)

  # shuffle the line indices once, then slice, so each sentence pair stays aligned
  indices = list(range(num_of_data))
  random.shuffle(indices)
  splits = {
    'train': indices[:num_train],
    'val': indices[num_train:num_train + num_valid],
    'test': indices[num_train + num_valid:num_train + num_valid + num_tests],
  }

  # write src-train.txt, tgt-train.txt, src-val.txt, ... in the current directory
  for name, split_indices in splits.items():
    with open('src-%s.txt' % name, 'w') as s_file, \
         open('tgt-%s.txt' % name, 'w') as t_file:
      for i in split_indices:
        s_file.write(s_data[i])
        t_file.write(t_data[i])

  return
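
The split performed by `split_train_val_test` can be tried without file I/O or notebook magics. The following is a minimal sketch under stated assumptions: `split_parallel` and its `seed` parameter are hypothetical and not part of the notebook, and the percentages are the 92.5 / 5 / 2.5 defaults used above.

```python
import random

def split_parallel(src_lines, tgt_lines, train_percent=92.5,
                   valid_percent=5.0, test_percent=2.5, seed=None):
    """Randomly split aligned source/target lines into train, val and test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    total = len(src_lines)
    num_train = int(train_percent / 100.0 * total)
    num_valid = int(valid_percent / 100.0 * total)
    num_tests = int(test_percent / 100.0 * total)

    # shuffle the line indices once so every sentence pair stays aligned
    indices = list(range(total))
    random.Random(seed).shuffle(indices)

    bounds = {
        "train": (0, num_train),
        "val": (num_train, num_train + num_valid),
        "test": (num_train + num_valid, num_train + num_valid + num_tests),
    }
    return {
        name: [(src_lines[i], tgt_lines[i]) for i in indices[start:end]]
        for name, (start, end) in bounds.items()
    }
```

Passing a fixed `seed` makes the split repeatable, which can help when comparing models trained on the same corpus.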

### **Common API functions**

In [None]:
'''
With pytorch, need to use tensorboardX instead of tensorboard for visualisation
https://github.com/lanpa/tensorboardX 
https://tensorboardx.readthedocs.io/en/latest/tutorial.html 
https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/utils/statistics.py 
# !pip install -U tensorboard-plugin-profile
'''
@anvil.server.callable
def visualize():
  %cd $HOME/$results_dir
  !pip install tensorboardX &> /dev/null

  %load_ext tensorboard
  %tensorboard --logdir runs

In [None]:
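# the following is required once to make /tmp writeable:
# sudo chmod a=rwx,u+t /tmp
# this function creates a temp file for onmt_translate

@anvil.server.callable
def make_file():
  import os, tempfile
  # callers write to the returned name relative to /tmp, so stay in /tmp
  %cd /tmp
  # mkstemp creates a uniquely named empty file and returns an open descriptor
  fd, path = tempfile.mkstemp(dir='/tmp')
  os.close(fd)
  # return only the file name; callers build the path as /tmp/<name>
  return os.path.basename(path)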

## **serverNMT&reg; Translate**

In [None]:
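# function takes incoming source string and returns target translation
@anvil.server.callable
def tran_api_sent(src_text):
  print(src_text)
  src_temp = make_file()
  # store the source string in the temporary file; writing from Python avoids
  # shell quoting problems with punctuation in the input
  with open('/tmp/' + src_temp, 'a') as f:
    f.write(src_text + '\n')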
  translation_model = "model_step_100000.pt"
  !echo -e "-Translating sentence ...." $translation_model | tee -a server_log.txt
  
  %cd ~/production/
  full_prediction = !onmt_translate --model $translation_model \
          --src /tmp/$src_temp \
          --output pred \
          -replace_unk -verbose

  !echo $full_prediction
  prediction = !echo $full_prediction | sed 's/.*PRED 1\(.*\)PRED SCORE.*/\1/'
  return prediction 
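
The `sed` expression above keeps whatever lies between `PRED 1` and `PRED SCORE`. A Python sketch of the same extraction follows. `extract_prediction` is a hypothetical helper, `sample_output` is made up for illustration, and the exact layout of the `onmt_translate -verbose` output may differ between versions. Unlike the `sed` pattern, this version also strips the colon and the surrounding whitespace.

```python
import re

def extract_prediction(verbose_output):
    """Return the text between 'PRED 1' and 'PRED SCORE', or None if absent."""
    match = re.search(r"PRED 1:?\s*(.*?)\s*PRED SCORE", verbose_output, re.DOTALL)
    return match.group(1) if match else None

# hypothetical sample, shaped like the text the sed pattern expects
sample_output = "SENT 1: ['Hello']\nPRED 1: Dia duit\nPRED SCORE: -0.4183"
print(extract_prediction(sample_output))
```

Returning `None` when nothing matches lets the caller tell a failed translation apart from an empty one.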

In [None]:
import anvil.media

@anvil.server.callable
def translate_api(file):
  %cd ~/production
  with anvil.media.TempFile(file) as src_tmp:
    print("printing the file ..")
    !echo $src_tmp
    translation_model = "model_step_100000.pt"
    !echo -e "-Translating file ...." $translation_model | tee -a server_log.txt
    prediction = !onmt_translate --model $translation_model \
          --src $src_tmp \
          --output pred.txt \
          -replace_unk -verbose
    with open('pred.txt', 'r') as f:
      contents = f.read()      
  return contents

time: 1.8 ms (started: 2021-03-23 19:59:45 +00:00)


## **serverNMT&reg; Build**


### **Auto NMT**



In [None]:
import anvil.media

# Write the byte contents of the uploaded media object to src.txt or tgt.txt
# in a new build directory (the file is opened in binary mode, so f.write()
# accepts bytes)
@anvil.server.callable
def store_src(file):
  user_dir = make_dir()
  %cd /tmp/$user_dir
  with open('src.txt', 'wb') as f:
    f.write(file.get_bytes())
  return

@anvil.server.callable
def store_tgt(file):
  user_dir = make_dir()
  %cd /tmp/$user_dir
  with open('tgt.txt', 'wb') as f:
    f.write(file.get_bytes())
  return

In [None]:
# the following is required once to make /tmp writeable:
# sudo chmod a=rwx,u+t /tmp

# this function creates a temporary build directory under /tmp
@anvil.server.callable
def make_dir():
  import os, tempfile
  # mkdtemp creates a uniquely named directory and returns its full path
  path = tempfile.mkdtemp(dir='/tmp')
  %cd $path
  %mkdir data
  print("in make_dir")
  # return only the directory name; callers prepend /tmp/
  return os.path.basename(path)

In [None]:
@anvil.server.callable
def button2():
  print("in button 2")
  button2_return = "Getting there"
  return button2_return

In [None]:
import anvil.media

@anvil.server.callable
def build(mode, model_type):
  visualize()
  user_dir = make_dir()
  %cd /tmp/$user_dir
  !cp /home/seamus/production/defaults/* ./data
  if mode == 'vanilla':
    config = set_vanilla(model_type)
  else:
    config = set_transformer(model_type)

  import time
  start_train = time.time()

  print("calling split now ..")
#  %cd data
#  %pwd
# split_train_val_test()

#  if use_sub_model:
#    !cat src-train.txt tgt-train.txt> train.txt
#    !wc tgt-train.txt | tee -a experiment_log.txt
#    !spm_train --input='train.txt' --model_prefix=spm \
#      --vocab_size=$v_size --character_coverage=1.0 --model_type='bpe'

  !onmt_build_vocab -config $config -n_sample=-1
  !onmt_train -config $config
  !cat $config | tee -a data/experiment_log.txt

  end_train = time.time()
  train_time = int(end_train - start_train)
  print(train_time)
  # append the training time to the experiment log and close the file afterwards
  with open("data/experiment_log.txt", "a") as log:
    print("+ Model Training Time + \n" + str(train_time), file=log)

  build_return = 1
  return build_return
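
The timing and logging pattern in `build` can be wrapped in a reusable context manager. This is a sketch only: `log_duration` is a hypothetical helper, not part of the notebook, and it writes the same two-line entry that `build` appends to the experiment log.

```python
import time
from contextlib import contextmanager

@contextmanager
def log_duration(label, log_path):
    """Time the enclosed block and append the elapsed seconds to log_path."""
    start = time.time()
    try:
        yield
    finally:
        elapsed = int(time.time() - start)
        # the with-statement closes the log file once the entry is written
        with open(log_path, "a") as log:
            log.write("+ %s + \n%d\n" % (label, elapsed))
```

It could be used as `with log_duration("Model Training Time", "data/experiment_log.txt"):` around the training calls, so the time is logged even if training raises an exception.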

In [None]:
# map the requested vanilla model type to its config file
@anvil.server.callable
def set_vanilla(vanilla_type):
  print("in set vanilla")
  if vanilla_type == 'Base':
    config = "data/vanilla.yaml"
  elif vanilla_type == 'Adam':
    config = "data/vanilla_adam.yaml"
  elif vanilla_type == 'BPE':
    config = "data/vanilla_bpe.yaml"
  else:
    config = "data/config.yaml" 
  return config
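
The same lookup can be written as a dictionary with a default. The sketch below is an alternative form rather than the notebook's code: `VANILLA_CONFIGS`, `DEFAULT_CONFIG` and `pick_config` are hypothetical names, and the paths are the ones used in `set_vanilla`.

```python
# model type -> config path, using the names from set_vanilla
VANILLA_CONFIGS = {
    "Base": "data/vanilla.yaml",
    "Adam": "data/vanilla_adam.yaml",
    "BPE": "data/vanilla_bpe.yaml",
}
DEFAULT_CONFIG = "data/config.yaml"

def pick_config(model_type, configs=VANILLA_CONFIGS, default=DEFAULT_CONFIG):
    """Return the config path for model_type, or the default if it is unknown."""
    return configs.get(model_type, default)
```

Adding a new model type then only needs a new dictionary entry, and an unknown type always falls back to the default config.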

In [None]:
# map the requested transformer model type to its config file
@anvil.server.callable
def set_transformer(transformer_type):
  if transformer_type == 'Base':
    config = "data/transformer.yaml"
  elif transformer_type == 'BPE':
    config = "data/transformer_bpe.yaml"
  else:
    config = "data/config.yaml"
  return config

## **Run server**

In [None]:
anvil.server.wait_forever() 
# ! >> server_log.txt

# **Acknowledgments**


This work was supported by the ADAPT Centre, which is funded under the SFI Research Centres
Programme (Grant 13/RC/2016) and is co-funded by the European Regional Development Fund.



> ![Adapt](https://github.com/seamusl/nmt/blob/main/logos/adapt.png?raw=true)


>![SFI](https://github.com/seamusl/nmt/blob/main/logos/sfi.png?raw=true)

# **References**




The Nvidia driver can be chosen by listing the available drivers.

Select the recommended driver and install it.

https://www.howtogeek.com/451262/how-to-use-rclone-to-back-up-to-google-drive-on-linux/ 

https://rclone.org/drive/


https://www.howtogeek.com/101288/how-to-schedule-tasks-on-linux-an-introduction-to-crontab-files/

```
# install pip3
sudo apt-get install python3-pip

# choose optimal Nvidia driver
ubuntu-drivers devices
sudo apt install nvidia-driver-460

# setup cron job for rclone to check production server every 15 minutes 
sudo crontab -e
# enter the following job (a crontab entry must be on a single line)
*/15 * * * * /usr/bin/rclone copy --update --verbose --transfers 30 --checkers 8 --contimeout 60s --timeout 300s --retries 3 --low-level-retries 10 --stats 1s "mygoogledrive:/" "/home/seamus/production"

# to view contents of cloned drive (only production server cloned using root_id in the rclone config file)
rclone ls mygoogledrive:/

# to clone googledrive to local directory 
/usr/bin/rclone copy --update --verbose --transfers 30 --checkers 8 --contimeout 60s --timeout 300s --retries 3 \
--low-level-retries 10 --stats 1s "mygoogledrive:/" "/home/seamus/production"

```



# **Translate**

Specify the model to be used for translation.

An ensemble of models can be used for translation by specifying multiple models.
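
As a rough sketch of how the ensemble case could be driven from Python, the helper below builds an `onmt_translate` argument list for one or more checkpoints. It assumes `--model` accepts several paths, as the OpenNMT-py options describe for ensemble decoding. `translate_command` is a hypothetical helper and `model_step_90000.pt` is a made-up second checkpoint. The remaining flags are the ones used elsewhere in this notebook.

```python
import shlex

def translate_command(models, src, output):
    """Build an onmt_translate argument list for one or more checkpoints."""
    if isinstance(models, str):
        models = [models]
    if not models:
        raise ValueError("at least one model is required")
    return (["onmt_translate", "--model"] + list(models) +
            ["--src", src, "--output", output, "-replace_unk", "-verbose"])

# model_step_90000.pt is a hypothetical second checkpoint
cmd = translate_command(["model_step_100000.pt", "model_step_90000.pt"],
                        "src-test.txt", "pred.txt")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting problems with unusual file names.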

In [None]:
%cd /home/seamus/OpenNMT-py/

!ls
translation_model1 = "model_step_100000.pt" 

!echo -e "-Translating using ...." $translation_model1

!onmt_translate --model $translation_model1 --src src-test.txt \
          --output pred.txt -replace_unk -verbose

In [None]:
import anvil.media

@anvil.server.callable
def setup_build(src, tgt):
  ## create a build directory using a randomly generated name
  user_dir = make_dir()
  %cd /tmp/$user_dir
  # copy the uploaded media objects into the build directory
  with open('src.txt', 'wb') as f:
    f.write(src.get_bytes())
  with open('tgt.txt', 'wb') as f:
    f.write(tgt.get_bytes())
  return

In [None]:
@anvil.server.callable
def test_api():
  msg = "Translating .. Yipee"
  print("Hello from the uplink")
  return msg

In [None]:
@anvil.server.callable
def translate_api_sent():
  %cd ~/production
  translation_model = "model_step_100000.pt"
  !echo -e "-Translating sentence using ...." $translation_model | tee -a server_log.txt
  %cd ~/production/
  print("Not using subword")
  prediction = !onmt_translate --model $translation_model \
          --src "Hello" \
          --output pred.txt \
          -replace_unk -verbose
  return prediction

In [None]:
!whoami
%cd /tmp
!echo "hello hello" >> tmp.4tqAVHvxnr
#make_file("hello")
#!cat $tfile  

In [None]:
@anvil.server.callable
def say_hello(name):
  print("Hello from the uplink, %s!" % name)

In [None]:
def make_file(str):
  tfile=!$(mktemp)
  !ls -t | head -n1
  temp_file = !ls -t | head -n1
  !echo $str >> $temp_file
  return
make_file("MTU")

In [None]:
@anvil.server.callable
def file_processor(file):
  
  return prediction

time: 1.22 ms (started: 2021-03-23 19:27:02 +00:00)


In [None]:
@anvil.server.callable
def tran_api_sent(src_text):
  src_temp = make_file()
  !echo "just called make_file"
  print(src_text)  
  # >> /tmp/$src_temp 
  translation_model = "model_step_100000.pt"
  #!echo -e "-Translating sentence using ...." $translation_model | tee -a server_log.txt
  %cd ~/production/

  prediction = !onmt_translate --model $translation_model \
          --src /tmp/$src_temp \
          --output pred \
          -replace_unk -verbose
  
  return prediction 

In [None]:
@anvil.server.callable
def auto_build(user_dir):
  import time
  start_train = time.time()
  %cd data
  %pwd
  print("calling split now ..")
  #split_train_val_test()

#  if use_sub_model:
#    !cat src-train.txt tgt-train.txt> train.txt
#    !wc tgt-train.txt | tee -a experiment_log.txt
#    !spm_train --input='train.txt' --model_prefix=spm \
#      --vocab_size=$v_size --character_coverage=1.0 --model_type='bpe'

  %cd $user_dir
  !onmt_build_vocab -config $config -n_sample=-1
  !onmt_train -config $config
  !cat $config | tee -a data/experiment_log.txt

  end_train = time.time()
  train_time = int(end_train - start_train)
  print(train_time)
  print("+ Model Training Time + \n" + str(train_time), \
        file=open("data/experiment_log.txt", "a"))
  return