# 🌍 Vietnamese-to-English Translation Demo (OpenNMT-py)
# Credit: manhtech264@gmail.com

This notebook demonstrates how to use a **pretrained Transformer model** from the [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py) framework for **translating Vietnamese to English**.

> ⚠️ **Note**: The Vietnamese → English direction is used **only as a demonstration**.  
> This approach is fully adaptable to other low-resource or minority languages, such as **Ede**, **Khmer**, and more.

---

## ✅ Why use this instead of a multilingual pretrained model?

Most multilingual models (e.g., mBART, NLLB):
- ❌ Are **hard to fine-tune or retrain** for languages outside their pretraining set.
- ❌ Use **fixed tokenizers** whose vocabularies segment under-represented languages poorly.

In contrast, this approach provides:
- ✅ A fully **trainable and fine-tunable** Transformer model.
- ✅ **Custom-trained tokenizers** for both source and target languages, allowing better adaptation to minority or unseen languages.
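
As a rough illustration of the custom-tokenizer point, the sketch below assembles one `spm_train` command per language side (the `spm_train` tool comes from the SentencePiece build later in this notebook). The corpus and model names (`train.vi`, `train.en`, `spm_vi`, `spm_en`) and the settings (`vocab_size=8000`, `model_type="bpe"`) are assumptions chosen for the example, not the values used to train the demo tokenizers.

```python
from typing import List


def spm_train_args(corpus: str, model_prefix: str, vocab_size: int = 8000,
                   model_type: str = "bpe",
                   character_coverage: float = 1.0) -> List[str]:
    """Build the argument list for SentencePiece's `spm_train` CLI."""
    return [
        "spm_train",
        f"--input={corpus}",                # plain-text corpus, one sentence per line
        f"--model_prefix={model_prefix}",   # writes <prefix>.model and <prefix>.vocab
        f"--vocab_size={vocab_size}",
        f"--model_type={model_type}",
        f"--character_coverage={character_coverage}",
    ]


# Hypothetical file names: one tokenizer per language side.
src_cmd = spm_train_args("train.vi", "spm_vi")
tgt_cmd = spm_train_args("train.en", "spm_en")
print(" ".join(src_cmd))
print(" ".join(tgt_cmd))
```

Adapting to another language pair would then mostly mean swapping the corpus files and, if needed, the vocabulary size.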

---

## 📦 Demo Assets Provided
- 🧠 **Pretrained Transformer checkpoint** (`OpenNMT-py` format)  
- 🔠 **Source language tokenizer** (Vietnamese)  
- 🔡 **Target language tokenizer** (English)
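
To show where each asset would fit, here is a minimal sketch of a three-stage pipeline: encode the source with the Vietnamese tokenizer, translate with the checkpoint, then decode with the English tokenizer. All file names are hypothetical placeholders, and the sketch assumes the checkpoint was trained on SentencePiece pieces; it only builds the commands and does not run them.

```python
from typing import List


def translation_pipeline(checkpoint: str, src_spm: str, tgt_spm: str,
                         raw_src: str, workdir: str = ".") -> List[List[str]]:
    """Return the tokenize -> translate -> detokenize commands in order."""
    tok_src = f"{workdir}/src.sp"   # tokenized source (hypothetical name)
    tok_hyp = f"{workdir}/hyp.sp"   # tokenized model output (hypothetical name)
    hyp = f"{workdir}/hyp.txt"      # final detokenized translation
    return [
        ["spm_encode", f"--model={src_spm}", "--output_format=piece",
         f"--input={raw_src}", f"--output={tok_src}"],
        ["onmt_translate", "-model", checkpoint, "-src", tok_src,
         "-output", tok_hyp],
        ["spm_decode", f"--model={tgt_spm}", "--input_format=piece",
         f"--input={tok_hyp}", f"--output={hyp}"],
    ]


# Placeholder asset names, not the actual demo files.
steps = translation_pipeline("model.pt", "spm_vi.model", "spm_en.model",
                             "input.vi.txt")
for step in steps:
    print(" ".join(step))
```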

---

Let's get started with loading the tokenizer and running inference in the next cells! ⬇️


# Prepare the environment

In [1]:
!apt-get update

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,798 kB]
Get:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,747 kB]
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [4,532

In [2]:
!apt-get install -y cmake build-essential pkg-config libgoogle-perftools-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
cmake is already the newest version (3.22.1-1ubuntu1.22.04.2).
The following packages were automatically installed and are no longer required:
  libbz2-dev libpkgconf3 libreadline-dev
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  libunwind-dev
The following packages will be REMOVED:
  pkgconf r-base-dev
The following NEW packages will be installed:
  libgoogle-perftools-dev libunwind-dev pkg-config
0 upgraded, 3 newly installed, 2 to remove and 40 not upgraded.
Need to get 2,401 kB of archives.
After this operation, 9,812 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 pkg-config amd64 0.29.2-1ubuntu3 [48.2 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libunwind-dev amd64 1.3.2-2build2.1 [1,883 kB]
Get:3 http://arc

In [3]:
!git clone https://github.com/google/sentencepiece.git

Cloning into 'sentencepiece'...
remote: Enumerating objects: 5415, done.
remote: Total 5415 (delta 0), reused 0 (delta 0), pack-reused 5415 (from 1)
Receiving objects: 100% (5415/5415), 31.75 MiB | 17.67 MiB/s, done.
Resolving deltas: 100% (3703/3703), done.


In [4]:
%cd sentencepiece

/content/sentencepiece


In [5]:
!mkdir -p build

In [6]:
%cd build

/content/sentencepiece/build


In [7]:
!cmake ..

  Compatibility with CMake < 3.10 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

-- VERSION: 0.2.1
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found TCMalloc: /usr/lib/x86_64-linux-gnu/libtcmalloc

In [8]:
!make -j $(nproc)

[  1%] Building CXX object src/CMakeFiles/sentencepiece.dir/__/third_party/protobuf-lite/arena.cc.o
[  1%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arena.cc.o
[  2%] Building CXX object src/CMakeFiles/sentencepiece.dir/__/third_party/protobuf-lite/arenastring.cc.o
[  3%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arenastring.cc.o
[  4%] Building CXX object src/CMakeFiles/sentencepiece.dir/__/third_party/protobuf-lite/bytestream.cc.o
[  5%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/bytestream.cc.o
[  6%] Building CXX object src/CMakeFiles/sentencepiece.dir/__/third_party/protobuf-lite/coded_stream.cc.o
[  7%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/coded_stream.cc.o
[  8%] Building CXX object src/CMakeFiles/sentencepiece.dir

In [9]:
!make install

[ 35%] Built target sentencepiece
[ 45%] Built target sentencepiece_train
[ 81%] Built target sentencepiece-static
[ 91%] Built target sentencepiece_train-static
[ 93%] Built target spm_encode
[ 94%] Built target spm_decode
[ 96%] Built target spm_normalize
[ 98%] Built target spm_train
[100%] Built target spm_export_vocab
Install the project...
-- Install configuration: ""
-- Installing: /usr/local/lib/pkgconfig/sentencepiece.pc
-- Installing: /usr/local/lib/libsentencepiece.so.0.0.0
-- Installing: /usr/local/lib/libsentencepiece.so.0
-- Installing: /usr/local/lib/libsentencepiece.so
-- Installing: /usr/local/lib/libsentencepiece_train.so.0.0.0
-- Installing: /usr/local/lib/libsentencepiece_train.so.0
-- Set non-toolchain portion of runtime path of "/usr/local/lib/libsentencepiece_train.so.0.0.0" to ""
-- Installing: /usr/local/lib/libsentencepiece_train.so
-- Installing: /usr/local/lib/libsentencepiece.a
-- Installing: /usr/local/lib/libsentencepiece_train.a
-- Installing: /
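
The clone, configure, compile, and install cells above can be recapped as one script. The following is a sketch, not part of the original notebook: the helper names `sentencepiece_build_steps` and `run_steps` are made up for illustration, and the default `dry_run=True` only prints the commands so nothing is rebuilt by accident.

```python
import os
import subprocess
from typing import List, Tuple

Step = Tuple[List[str], str]  # (command, working directory)


def sentencepiece_build_steps(repo_dir: str = "sentencepiece") -> List[Step]:
    """Return the clone/configure/compile/install commands in order."""
    build_dir = os.path.join(repo_dir, "build")
    jobs = str(os.cpu_count() or 1)  # same intent as `make -j $(nproc)`
    return [
        (["git", "clone", "https://github.com/google/sentencepiece.git", repo_dir], "."),
        (["mkdir", "-p", "build"], repo_dir),
        (["cmake", ".."], build_dir),
        (["make", "-j", jobs], build_dir),
        (["make", "install"], build_dir),
    ]


def run_steps(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Print each step; execute it only when dry_run is False."""
    log = []
    for command, cwd in steps:
        line = f"[{cwd}] {' '.join(command)}"
        print(line)
        log.append(line)
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run(command, cwd=cwd, check=True)
    return log


steps = sentencepiece_build_steps()
log = run_steps(steps, dry_run=True)  # nothing is executed in dry-run mode
```

Unlike separate notebook cells, which keep running after an error, `check=True` would stop the sequence at the first failing command.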

In [10]:
!ldconfig -v

/sbin/ldconfig.real: Path `/usr/local/cuda-12/targets/x86_64-linux/lib' given more than once
(from /etc/ld.so.conf.d/988_cuda-12.conf:1 and /etc/ld.so.conf.d/000_cuda.conf:1)
/sbin/ldconfig.real: Path `/usr/local/cuda-12.5/targets/x86_64-linux/lib' given more than once
(from /etc/ld.so.conf.d/gds-12-5.conf:1 and /etc/ld.so.conf.d/000_cuda.conf:1)
/sbin/ldconfig.real: Path `/usr/local/lib' given more than once
(from /etc/ld.so.conf.d/libc.conf:2 and /etc/ld.so.conf.d/colab.conf:1)
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
(from /etc/ld.so.conf.d/x86_64-linux-gnu.conf:4 and /etc/ld.so.conf.d/x86_64-linux-gnu.conf:3)
/sbin/ldconfig.real: Path `/usr/lib32' given more than once
(from /etc/ld.so.conf.d/zz_i38
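
Since the `ldconfig -v` output is long and mostly warnings, a quick check from Python may be easier to read. This sketch assumes the install step placed the `spm_*` tools on `PATH` and registered `libsentencepiece` with the dynamic linker; the function name `sentencepiece_status` is illustrative only.

```python
import ctypes.util
import shutil
from typing import Dict


def sentencepiece_status() -> Dict[str, bool]:
    """Report whether the SentencePiece tools and shared library are visible."""
    tools = ["spm_train", "spm_encode", "spm_decode"]
    # shutil.which returns None when the executable is not on PATH.
    status = {tool: shutil.which(tool) is not None for tool in tools}
    # find_library returns None when the linker cannot locate the library.
    status["libsentencepiece"] = ctypes.util.find_library("sentencepiece") is not None
    return status


status = sentencepiece_status()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry reports `MISSING` after a successful `make install`, rerunning `ldconfig` or checking `PATH` would be a reasonable first step.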

In [11]:
%cd /content

/content


In [12]:
!wget https://github.com/OpenNMT/OpenNMT-py/archive/refs/tags/2.3.0.tar.gz

--2025-06-20 08:06:18--  https://github.com/OpenNMT/OpenNMT-py/archive/refs/tags/2.3.0.tar.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/OpenNMT/OpenNMT-py/tar.gz/refs/tags/2.3.0 [following]
--2025-06-20 08:06:18--  https://codeload.github.com/OpenNMT/OpenNMT-py/tar.gz/refs/tags/2.3.0
Resolving codeload.github.com (codeload.github.com)... 140.82.112.10
Connecting to codeload.github.com (codeload.github.com)|140.82.112.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘2.3.0.tar.gz’

2.3.0.tar.gz            [            <=>     ]  77.81M  18.1MB/s    in 4.7s    

2025-06-20 08:06:23 (16.5 MB/s) - ‘2.3.0.tar.gz’ saved [81586137]



In [13]:
!tar -zxvf 2.3.0.tar.gz

OpenNMT-py-2.3.0/
OpenNMT-py-2.3.0/.github/
OpenNMT-py-2.3.0/.github/workflows/
OpenNMT-py-2.3.0/.github/workflows/push.yml
OpenNMT-py-2.3.0/.github/workflows/release.yml
OpenNMT-py-2.3.0/.gitignore
OpenNMT-py-2.3.0/CHANGELOG.md
OpenNMT-py-2.3.0/CONTRIBUTING.md
OpenNMT-py-2.3.0/LICENSE.md
OpenNMT-py-2.3.0/README.md
OpenNMT-py-2.3.0/SECURITY.md
OpenNMT-py-2.3.0/available_models/
OpenNMT-py-2.3.0/available_models/example.conf.json
OpenNMT-py-2.3.0/build_vocab.py
OpenNMT-py-2.3.0/config/
OpenNMT-py-2.3.0/config/config-rnn-summarization.yml
OpenNMT-py-2.3.0/config/config-transformer-base-1GPU.yml
OpenNMT-py-2.3.0/config/config-transformer-base-4GPU.yml
OpenNMT-py-2.3.0/data/
OpenNMT-py-2.3.0/data/README.md
OpenNMT-py-2.3.0/data/align_data.yaml
OpenNMT-py-2.3.0/data/data.yaml
OpenNMT-py-2.3.0/data/data_features/
OpenNMT-py-2.3.0/data/data_features/src-test.feat0
OpenNMT-py-2.3.0/data/data_features/src-test.txt
OpenNMT-py-2.3.0/data/data_features/src-train.feat0
OpenNMT-py-2.3.0/data/data_fe

In [14]:
!mv OpenNMT-py-2.3.0 OpenNMT-py
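
The extract-and-rename steps can also be done in Python, which avoids printing every file in the archive. This is a sketch under the assumption that the archive is trusted and contains a single top-level directory; `extract_and_rename` is a hypothetical helper, and the demo at the end builds a tiny throwaway archive so the example runs without downloading anything.

```python
import os
import tarfile
import tempfile


def extract_and_rename(archive_path: str, dest_dir: str, new_name: str) -> str:
    """Extract a .tar.gz with one top-level folder and rename that folder."""
    with tarfile.open(archive_path, "r:gz") as tar:
        top_levels = {member.name.split("/")[0] for member in tar.getmembers()}
        if len(top_levels) != 1:
            raise ValueError(f"expected one top-level directory, got {sorted(top_levels)}")
        tar.extractall(dest_dir)  # only use on archives you trust
    extracted = os.path.join(dest_dir, top_levels.pop())
    target = os.path.join(dest_dir, new_name)
    os.rename(extracted, target)
    return target


# Offline demo: a stand-in archive that mimics the layout of 2.3.0.tar.gz.
tmp = tempfile.mkdtemp()
fake_src = os.path.join(tmp, "OpenNMT-py-2.3.0")
os.makedirs(fake_src)
with open(os.path.join(fake_src, "README.md"), "w") as handle:
    handle.write("placeholder\n")
archive = os.path.join(tmp, "2.3.0.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(fake_src, arcname="OpenNMT-py-2.3.0")

out_dir = os.path.join(tmp, "out")
os.makedirs(out_dir)
target = extract_and_rename(archive, out_dir, "OpenNMT-py")
print(target)
```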


In [15]:
%cd OpenNMT-py

/content/OpenNMT-py


In [16]:
!pip install -e .

Obtaining file:///content/OpenNMT-py
  Preparing metadata (setup.py) ... done
Collecting torchtext==0.5.0 (from OpenNMT-py==2.3.0)
  Downloading torchtext-0.5.0-py3-none-any.whl.metadata (6.2 kB)
Collecting configargparse (from OpenNMT-py==2.3.0)
  Downloading configargparse-1.7.1-py3-none-any.whl.metadata (24 kB)
Collecting waitress (from OpenNMT-py==2.3.0)
  Downloading waitress-3.0.2-py3-none-any.whl.metadata (5.8 kB)
Collecting pyonmttok<2,>=1.23 (from OpenNMT-py==2.3.0)
  Downloading pyonmttok-1.37.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting sacrebleu (from OpenNMT-py==2.3.0)
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.6.0->OpenNMT-py==2.3.0)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata

In [17]:
!pip install transformers



In [18]:
!pip install 'keras<3.0.0'

Collecting keras<3.0.0
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 3.8.0
    Uninstalling keras-3.8.0:
      Successfully uninstalled keras-3.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.18.0 requires keras>=3.5.0, but you have keras 2.15.0 which is incompatible.
Successfully installed kera

In [19]:
!pip install mediapipe-model-maker --no-deps

Collecting mediapipe-model-maker
  Downloading mediapipe_model_maker-0.2.1.4-py3-none-any.whl.metadata (1.7 kB)
Downloading mediapipe_model_maker-0.2.1.4-py3-none-any.whl (133 kB)
Installing collected packages: mediapipe-model-maker
Successfully installed mediapipe-model-maker-0.2.1.4


In [20]:
!pip install gdown



In [21]:
!pip install pandas



## If a future package version breaks this notebook, downgrade to the versions listed below
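
One way to act on this advice is to compare what is installed against a short list of pins and print the matching `pip install` commands. The sketch below is an illustration rather than part of the original notebook: the version numbers are copied from the package listing in this section, but which packages actually need pinning is an assumption, and `installed_version` and `downgrade_commands` are hypothetical helpers.

```python
from importlib import metadata
from typing import Dict, List, Optional

# Versions taken from the `pip list` snapshot in this notebook.
PINNED = {
    "pyonmttok": "1.37.1",
    "ConfigArgParse": "1.7.1",
    "keras": "2.15.0",
    "gdown": "5.2.0",
    "pandas": "2.2.2",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def downgrade_commands(pins: Dict[str, str]) -> List[str]:
    """List pip commands for packages whose version differs from the pin."""
    commands = []
    for package, wanted in pins.items():
        if installed_version(package) != wanted:
            commands.append(f"pip install {package}=={wanted}")
    return commands


for command in downgrade_commands(PINNED):
    print(command)
```

Printing the commands instead of running them leaves the decision to reinstall with the reader.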

In [22]:
"""
!pip list
Package                               Version             Editable project location
------------------------------------- ------------------- -------------------------
absl-py                               1.4.0
accelerate                            1.7.0
aiofiles                              24.1.0
aiohappyeyeballs                      2.6.1
aiohttp                               3.11.15
aiosignal                             1.3.2
alabaster                             1.0.0
albucore                              0.0.24
albumentations                        2.0.8
ale-py                                0.11.1
altair                                5.5.0
annotated-types                       0.7.0
antlr4-python3-runtime                4.9.3
anyio                                 4.9.0
argon2-cffi                           25.1.0
argon2-cffi-bindings                  21.2.0
array_record                          0.7.2
arviz                                 0.21.0
astropy                               7.1.0
astropy-iers-data                     0.2025.6.16.0.38.47
astunparse                            1.6.3
atpublic                              5.1
attrs                                 25.3.0
audioread                             3.0.1
autograd                              1.8.0
babel                                 2.17.0
backcall                              0.2.0
backports.tarfile                     1.2.0
beautifulsoup4                        4.13.4
betterproto                           2.0.0b6
bigframes                             2.6.0
bigquery-magics                       0.9.0
bleach                                6.2.0
blinker                               1.9.0
blis                                  1.3.0
blobfile                              3.0.0
blosc2                                3.4.0
bokeh                                 3.7.3
Bottleneck                            1.4.2
bqplot                                0.12.45
branca                                0.8.1
build                                 1.2.2.post1
CacheControl                          0.14.3
cachetools                            5.5.2
catalogue                             2.0.10
certifi                               2025.6.15
cffi                                  1.17.1
chardet                               5.2.0
charset-normalizer                    3.4.2
chex                                  0.1.89
clarabel                              0.11.1
click                                 8.2.1
cloudpathlib                          0.21.1
cloudpickle                           3.1.1
cmake                                 3.31.6
cmdstanpy                             1.2.5
colorama                              0.4.6
colorcet                              3.1.0
colorlover                            0.3.0
colour                                0.1.5
community                             1.0.0b1
confection                            0.1.5
ConfigArgParse                        1.7.1
cons                                  0.4.6
contourpy                             1.3.2
cramjam                               2.10.0
cryptography                          43.0.3
cuda-python                           12.6.2.post1
cudf-cu12                             25.2.1
cudf-polars-cu12                      25.2.2
cufflinks                             0.17.3
cuml-cu12                             25.2.1
cupy-cuda12x                          13.3.0
curl_cffi                             0.11.3
cuvs-cu12                             25.2.1
cvxopt                                1.3.2
cvxpy                                 1.6.6
cycler                                0.12.1
cyipopt                               1.5.0
cymem                                 2.0.11
Cython                                3.0.12
dask                                  2024.12.1
dask-cuda                             25.2.0
dask-cudf-cu12                        25.2.2
dask-expr                             1.1.21
dataproc-spark-connect                0.7.5
datascience                           0.17.6
datasets                              2.14.4
db-dtypes                             1.4.3
dbus-python                           1.2.18
debugpy                               1.8.0
decorator                             4.4.2
defusedxml                            0.7.1
diffusers                             0.33.1
dill                                  0.3.7
distributed                           2024.12.1
distributed-ucxx-cu12                 0.42.0
distro                                1.9.0
dlib                                  19.24.6
dm-tree                               0.1.9
docstring_parser                      0.16
docutils                              0.21.2
dopamine_rl                           4.1.2
duckdb                                1.2.2
earthengine-api                       1.5.19
easydict                              1.13
editdistance                          0.8.1
eerepr                                0.1.2
einops                                0.8.1
en_core_web_sm                        3.8.0
entrypoints                           0.4
et_xmlfile                            2.0.0
etils                                 1.12.2
etuples                               0.3.9
Farama-Notifications                  0.0.4
fastai                                2.7.19
fastapi                               0.115.12
fastcore                              1.7.29
fastdownload                          0.0.7
fastjsonschema                        2.21.1
fastprogress                          1.0.3
fastrlock                             0.8.3
ffmpy                                 0.6.0
filelock                              3.18.0
firebase-admin                        6.9.0
Flask                                 3.1.1
flatbuffers                           25.2.10
flax                                  0.10.6
folium                                0.19.7
fonttools                             4.58.4
frozendict                            2.4.6
frozenlist                            1.7.0
fsspec                                2025.3.2
future                                1.0.0
gast                                  0.6.0
gcsfs                                 2025.3.2
GDAL                                  3.8.4
gdown                                 5.2.0
geemap                                0.35.3
geocoder                              1.38.1
geographiclib                         2.0
geopandas                             1.0.1
geopy                                 2.4.1
gin-config                            0.5.0
gitdb                                 4.0.12
GitPython                             3.1.44
glob2                                 0.7
google                                2.0.3
google-ai-generativelanguage          0.6.15
google-api-core                       2.25.1
google-api-python-client              2.172.0
google-auth                           2.38.0
google-auth-httplib2                  0.2.0
google-auth-oauthlib                  1.2.2
google-cloud-aiplatform               1.97.0
google-cloud-bigquery                 3.34.0
google-cloud-bigquery-connection      1.18.3
google-cloud-bigquery-storage         2.32.0
google-cloud-core                     2.4.3
google-cloud-dataproc                 5.20.0
google-cloud-datastore                2.21.0
google-cloud-firestore                2.21.0
google-cloud-functions                1.20.4
google-cloud-iam                      2.19.1
google-cloud-language                 2.17.2
google-cloud-resource-manager         1.14.2
google-cloud-spanner                  3.55.0
google-cloud-storage                  2.19.0
google-cloud-translate                3.20.3
google-colab                          1.0.0
google-crc32c                         1.7.1
google-genai                          1.20.0
google-generativeai                   0.8.5
google-pasta                          0.2.0
google-resumable-media                2.7.2
googleapis-common-protos              1.70.0
googledrivedownloader                 1.1.0
gradio                                5.31.0
gradio_client                         1.10.1
graphviz                              0.21
greenlet                              3.2.3
groovy                                0.1.2
grpc-google-iam-v1                    0.14.2
grpc-interceptor                      0.15.4
grpcio                                1.73.0
grpcio-status                         1.71.0
grpclib                               0.4.8
gspread                               6.2.1
gspread-dataframe                     4.0.0
gym                                   0.25.2
gym-notices                           0.0.8
gymnasium                             1.1.1
h11                                   0.16.0
h2                                    4.2.0
h5netcdf                              1.6.1
h5py                                  3.14.0
hdbscan                               0.8.40
hf_transfer                           0.1.9
hf-xet                                1.1.3
highspy                               1.11.0
holidays                              0.74
holoviews                             1.20.2
hpack                                 4.1.0
html5lib                              1.1
httpcore                              1.0.9
httpimport                            1.4.1
httplib2                              0.22.0
httpx                                 0.28.1
huggingface-hub                       0.33.0
humanize                              4.12.3
hyperframe                            6.1.0
hyperopt                              0.2.7
ibis-framework                        9.5.0
idna                                  3.10
imageio                               2.37.0
imageio-ffmpeg                        0.6.0
imagesize                             1.4.1
imbalanced-learn                      0.13.0
immutabledict                         4.2.1
importlib_metadata                    8.7.0
importlib_resources                   6.5.2
imutils                               0.5.4
inflect                               7.5.0
iniconfig                             2.1.0
intel-cmplr-lib-ur                    2025.1.1
intel-openmp                          2025.1.1
ipyevents                             2.0.2
ipyfilechooser                        0.6.0
ipykernel                             6.17.1
ipyleaflet                            0.20.0
ipyparallel                           8.8.0
ipython                               7.34.0
ipython-genutils                      0.2.0
ipython-sql                           0.5.0
ipytree                               0.2.2
ipywidgets                            7.7.1
itsdangerous                          2.2.0
jaraco.classes                        3.4.0
jaraco.context                        6.0.1
jaraco.functools                      4.1.0
jax                                   0.5.2
jax-cuda12-pjrt                       0.5.1
jax-cuda12-plugin                     0.5.1
jaxlib                                0.5.1
jeepney                               0.9.0
jieba                                 0.42.1
Jinja2                                3.1.6
jiter                                 0.10.0
joblib                                1.5.1
jsonpatch                             1.33
jsonpickle                            4.1.1
jsonpointer                           3.0.0
jsonschema                            4.24.0
jsonschema-specifications             2025.4.1
jupyter-client                        6.1.12
jupyter-console                       6.1.0
jupyter_core                          5.8.1
jupyter_kernel_gateway                2.5.2
jupyter-leaflet                       0.20.0
jupyter-server                        1.16.0
jupyterlab_pygments                   0.3.0
jupyterlab_widgets                    3.0.15
jupytext                              1.17.2
kaggle                                1.7.4.5
kagglehub                             0.3.12
keras                                 2.15.0
keras-hub                             0.18.1
keras-nlp                             0.18.1
keyring                               25.6.0
keyrings.google-artifactregistry-auth 1.1.2
kiwisolver                            1.4.8
langchain                             0.3.25
langchain-core                        0.3.65
langchain-text-splitters              0.3.8
langcodes                             3.5.0
langsmith                             0.3.45
language_data                         1.3.0
launchpadlib                          1.10.16
lazr.restfulclient                    0.14.4
lazr.uri                              1.0.6
lazy_loader                           0.4
libclang                              18.1.1
libcudf-cu12                          25.2.1
libcugraph-cu12                       25.2.0
libcuml-cu12                          25.2.1
libcuvs-cu12                          25.2.1
libkvikio-cu12                        25.2.1
libpysal                              4.13.0
libraft-cu12                          25.2.0
librosa                               0.11.0
libucx-cu12                           1.18.1
libucxx-cu12                          0.42.0
lightgbm                              4.5.0
linkify-it-py                         2.0.3
llvmlite                              0.43.0
locket                                1.0.0
logical-unification                   0.4.6
lxml                                  5.4.0
Mako                                  1.1.3
marisa-trie                           1.2.1
Markdown                              3.8
markdown-it-py                        3.0.0
MarkupSafe                            3.0.2
matplotlib                            3.10.0
matplotlib-inline                     0.1.7
matplotlib-venn                       1.1.2
mdit-py-plugins                       0.4.2
mdurl                                 0.1.2
mediapipe-model-maker                 0.2.1.4
miniKanren                            1.0.3
missingno                             0.5.2
mistune                               3.1.3
mizani                                0.13.5
mkl                                   2025.0.1
ml-dtypes                             0.4.1
mlxtend                               0.23.4
more-itertools                        10.7.0
moviepy                               1.0.3
mpmath                                1.3.0
msgpack                               1.1.1
multidict                             6.4.4
multipledispatch                      1.0.0
multiprocess                          0.70.15
multitasking                          0.0.11
murmurhash                            1.0.13
music21                               9.3.0
namex                                 0.1.0
narwhals                              1.43.0
natsort                               8.4.0
nbclassic                             1.3.1
nbclient                              0.10.2
nbconvert                             7.16.6
nbformat                              5.10.4
ndindex                               1.10.0
nest-asyncio                          1.6.0
networkx                              3.5
nibabel                               5.3.2
nltk                                  3.9.1
notebook                              6.5.7
notebook_shim                         0.2.4
numba                                 0.60.0
numba-cuda                            0.2.0
numexpr                               2.11.0
numpy                                 2.0.2
nvidia-cublas-cu12                    12.4.5.8
nvidia-cuda-cupti-cu12                12.4.127
nvidia-cuda-nvcc-cu12                 12.5.82
nvidia-cuda-nvrtc-cu12                12.4.127
nvidia-cuda-runtime-cu12              12.4.127
nvidia-cudnn-cu12                     9.1.0.70
nvidia-cufft-cu12                     11.2.1.3
nvidia-curand-cu12                    10.3.5.147
nvidia-cusolver-cu12                  11.6.1.9
nvidia-cusparse-cu12                  12.3.1.170
nvidia-cusparselt-cu12                0.6.2
nvidia-ml-py                          12.575.51
nvidia-nccl-cu12                      2.21.5
nvidia-nvcomp-cu12                    4.2.0.11
nvidia-nvjitlink-cu12                 12.4.127
nvidia-nvtx-cu12                      12.4.127
nvtx                                  0.2.12
nx-cugraph-cu12                       25.2.0
oauth2client                          4.1.3
oauthlib                              3.2.2
omegaconf                             2.3.0
openai                                1.86.0
opencv-contrib-python                 4.11.0.86
opencv-python                         4.11.0.86
opencv-python-headless                4.11.0.86
OpenNMT-py                            2.3.0               /content/OpenNMT-py
openpyxl                              3.1.5
opt_einsum                            3.4.0
optax                                 0.2.5
optree                                0.16.0
orbax-checkpoint                      0.11.15
orjson                                3.10.18
osqp                                  1.0.4
packaging                             24.2
pandas                                2.2.2
pandas-datareader                     0.10.0
pandas-gbq                            0.29.1
pandas-stubs                          2.2.2.240909
pandocfilters                         1.5.1
panel                                 1.7.1
param                                 2.2.1
parso                                 0.8.4
parsy                                 2.1
partd                                 1.4.2
pathlib                               1.0.1
patsy                                 1.0.1
peewee                                3.18.1
peft                                  0.15.2
pexpect                               4.9.0
pickleshare                           0.7.5
pillow                                11.2.1
pip                                   24.1.2
platformdirs                          4.3.8
plotly                                5.24.1
plotnine                              0.14.5
pluggy                                1.6.0
ply                                   3.11
polars                                1.21.0
pooch                                 1.8.2
portalocker                           3.2.0
portpicker                            1.5.2
preshed                               3.0.10
prettytable                           3.16.0
proglog                               0.1.12
progressbar2                          4.5.0
prometheus_client                     0.22.1
promise                               2.3
prompt_toolkit                        3.0.51
propcache                             0.3.2
prophet                               1.1.7
proto-plus                            1.26.1
protobuf                              5.29.5
psutil                                5.9.5
psycopg2                              2.9.10
ptyprocess                            0.7.0
py-cpuinfo                            9.0.0
py4j                                  0.10.9.7
pyarrow                               18.1.0
pyasn1                                0.6.1
pyasn1_modules                        0.4.2
pycairo                               1.28.0
pycocotools                           2.0.10
pycparser                             2.22
pycryptodomex                         3.23.0
pydantic                              2.11.7
pydantic_core                         2.33.2
pydata-google-auth                    1.9.1
pydot                                 3.0.4
pydotplus                             2.0.2
PyDrive                               1.3.1
PyDrive2                              1.21.3
pydub                                 0.25.1
pyerfa                                2.0.1.5
pygame                                2.6.1
pygit2                                1.18.0
Pygments                              2.19.1
PyGObject                             3.42.0
PyJWT                                 2.10.1
pylibcudf-cu12                        25.2.1
pylibcugraph-cu12                     25.2.0
pylibraft-cu12                        25.2.0
pymc                                  5.23.0
pymystem3                             0.2.0
pynndescent                           0.5.13
pynvjitlink-cu12                      0.6.0
pynvml                                12.0.0
pyogrio                               0.11.0
pyomo                                 6.9.2
pyonmttok                             1.37.1
PyOpenGL                              3.1.9
pyOpenSSL                             24.2.1
pyparsing                             3.2.3
pyperclip                             1.9.0
pyproj                                3.7.1
pyproject_hooks                       1.2.0
pyshp                                 2.3.1
PySocks                               1.7.1
pyspark                               3.5.1
pytensor                              2.31.3
pytest                                8.3.5
python-apt                            0.0.0
python-box                            7.3.2
python-dateutil                       2.9.0.post0
python-louvain                        0.16
python-multipart                      0.0.20
python-slugify                        8.0.4
python-snappy                         0.7.3
python-utils                          3.9.1
pytz                                  2025.2
pyviz_comms                           3.0.5
PyWavelets                            1.8.0
PyYAML                                6.0.2
pyzmq                                 24.0.1
raft-dask-cu12                        25.2.0
rapids-dask-dependency                25.2.0
ratelim                               0.1.6
referencing                           0.36.2
regex                                 2024.11.6
requests                              2.32.3
requests-oauthlib                     2.0.0
requests-toolbelt                     1.0.0
requirements-parser                   0.9.0
rich                                  13.9.4
rmm-cu12                              25.2.0
roman-numerals-py                     3.1.0
rpds-py                               0.25.1
rpy2                                  3.5.17
rsa                                   4.9.1
ruff                                  0.11.13
sacrebleu                             2.5.1
safehttpx                             0.1.6
safetensors                           0.5.3
scikit-image                          0.25.2
scikit-learn                          1.6.1
scipy                                 1.15.3
scooby                                0.10.1
scs                                   3.2.7.post2
seaborn                               0.13.2
SecretStorage                         3.3.3
semantic-version                      2.10.0
Send2Trash                            1.8.3
sentence-transformers                 4.1.0
sentencepiece                         0.2.0
sentry-sdk                            2.30.0
setproctitle                          1.3.6
setuptools                            75.2.0
shap                                  0.48.0
shapely                               2.1.1
shellingham                           1.5.4
simple-parsing                        0.1.7
simplejson                            3.20.1
simsimd                               6.4.9
six                                   1.17.0
sklearn-compat                        0.1.3
sklearn-pandas                        2.2.0
slicer                                0.0.8
smart-open                            7.1.0
smmap                                 5.0.2
sniffio                               1.3.1
snowballstemmer                       3.0.1
sortedcontainers                      2.4.0
soundfile                             0.13.1
soupsieve                             2.7
soxr                                  0.5.0.post1
spacy                                 3.8.7
spacy-legacy                          3.0.12
spacy-loggers                         1.0.5
spanner-graph-notebook                1.1.7
Sphinx                                8.2.3
sphinxcontrib-applehelp               2.0.0
sphinxcontrib-devhelp                 2.0.0
sphinxcontrib-htmlhelp                2.1.0
sphinxcontrib-jsmath                  1.0.1
sphinxcontrib-qthelp                  2.0.0
sphinxcontrib-serializinghtml         2.0.0
SQLAlchemy                            2.0.41
sqlglot                               25.20.2
sqlparse                              0.5.3
srsly                                 2.5.1
stanio                                0.5.1
starlette                             0.46.2
statsmodels                           0.14.4
stringzilla                           3.12.5
stumpy                                1.13.0
sympy                                 1.13.1
tables                                3.10.2
tabulate                              0.9.0
tbb                                   2022.1.0
tblib                                 3.1.0
tcmlib                                1.3.0
tenacity                              9.1.2
tensorboard                           2.18.0
tensorboard-data-server               0.7.2
tensorflow                            2.18.0
tensorflow-datasets                   4.9.9
tensorflow_decision_forests           1.11.0
tensorflow-hub                        0.16.1
tensorflow-io-gcs-filesystem          0.37.1
tensorflow-metadata                   1.17.1
tensorflow-probability                0.25.0
tensorflow-text                       2.18.1
tensorstore                           0.1.74
termcolor                             3.1.0
terminado                             0.18.1
text-unidecode                        1.3
textblob                              0.19.0
tf_keras                              2.18.0
tf-slim                               1.1.0
thinc                                 8.3.6
threadpoolctl                         3.6.0
tifffile                              2025.6.11
tiktoken                              0.9.0
timm                                  1.0.15
tinycss2                              1.4.0
tokenizers                            0.21.1
toml                                  0.10.2
tomlkit                               0.13.3
toolz                                 0.12.1
torch                                 2.6.0+cu124
torchao                               0.10.0
torchaudio                            2.6.0+cu124
torchdata                             0.11.0
torchsummary                          1.5.1
torchtext                             0.5.0
torchtune                             0.6.1
torchvision                           0.21.0+cu124
tornado                               6.4.2
tqdm                                  4.67.1
traitlets                             5.7.1
traittypes                            0.2.1
transformers                          4.52.4
treelite                              4.4.1
treescope                             0.1.9
triton                                3.2.0
tsfresh                               0.21.0
tweepy                                4.15.0
typeguard                             4.4.3
typer                                 0.16.0
types-pytz                            2025.2.0.20250516
types-setuptools                      80.9.0.20250529
typing_extensions                     4.14.0
typing-inspection                     0.4.1
tzdata                                2025.2
tzlocal                               5.3.1
uc-micro-py                           1.0.3
ucx-py-cu12                           0.42.0
ucxx-cu12                             0.42.0
umap-learn                            0.5.7
umf                                   0.10.0
uritemplate                           4.2.0
urllib3                               2.4.0
uvicorn                               0.34.3
vega-datasets                         0.9.0
wadllib                               1.3.6
waitress                              3.0.2
wandb                                 0.20.1
wasabi                                1.1.3
wcwidth                               0.2.13
weasel                                0.4.1
webcolors                             24.11.1
webencodings                          0.5.1
websocket-client                      1.8.0
websockets                            15.0.1
Werkzeug                              3.1.3
wheel                                 0.45.1
widgetsnbextension                    3.6.10
wordcloud                             1.9.4
wrapt                                 1.17.2
wurlitzer                             3.1.1
xarray                                2025.3.1
xarray-einstats                       0.9.0
xgboost                               2.1.4
xlrd                                  2.0.2
xxhash                                3.5.0
xyzservices                           2025.4.0
yarl                                  1.20.1
ydf                                   0.12.0
yellowbrick                           1.5
yfinance                              0.2.63
zict                                  3.0.0
zipp                                  3.23.0
zstandard                             0.23.0
"""


# Demo

## prepare files, pretrained tokenizer models, and checkpoint

In [23]:
%cd /content

/content


In [24]:
!git clone https://github.com/WandererGuy/Ethnic-Minority-Machine-Translation-API.git

Cloning into 'Ethnic-Minority-Machine-Translation-API'...
remote: Enumerating objects: 537, done.
remote: Counting objects: 100% (245/245), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 537 (delta 125), reused 170 (delta 70), pack-reused 292 (from 3)
Receiving objects: 100% (537/537), 104.32 MiB | 29.39 MiB/s, done.
Resolving deltas: 100% (145/145), done.


In [25]:
import os
import shutil

# Overwrite two OpenNMT-py modules with the patched versions shipped in the cloned repo.
# shutil.copy replaces an existing destination file, so no explicit os.remove is needed.
patch_dir = "Ethnic-Minority-Machine-Translation-API/OpenNMT_replace"
shutil.copy(os.path.join(patch_dir, "model_builder.py"), "OpenNMT-py/onmt")
shutil.copy(os.path.join(patch_dir, "model_saver.py"), "OpenNMT-py/onmt/models")


'OpenNMT-py/onmt/models/model_saver.py'

In [49]:
!gdown "https://drive.google.com/drive/folders/1SY972yXXgrr0PGkQgGuseSq3NIkRSbxe?usp=sharing" --folder

Retrieving folder contents
Processing file 19KkE4_ccmR_c8avVfqVJDtbWbirwbyRR vie_6M_checkpoint.pt
Processing file 1yMLoLYZ2az1mWE6cbHNc7mEzsaLxUiBI vie_6M_source.model
Processing file 1g1z_AOoNhZhgmw0tzONwUEQSaP1nTWzj vie_6M_target.model
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=19KkE4_ccmR_c8avVfqVJDtbWbirwbyRR
From (redirected): https://drive.google.com/uc?id=19KkE4_ccmR_c8avVfqVJDtbWbirwbyRR&confirm=t&uuid=4206ebf5-441e-4043-ae6e-8055c821bd7b
To: /content/machine_translation_vi_to_en/vie_6M_checkpoint.pt
100% 622M/622M [00:12<00:00, 50.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1yMLoLYZ2az1mWE6cbHNc7mEzsaLxUiBI
To: /content/machine_translation_vi_to_en/vie_6M_source.model
100% 347k/347k [00:00<00:00, 4.51MB/s]
Downloading...
From: https://drive.google.com/uc?id=1g1z_AOoNhZhgmw0tzONwUEQSaP1nTWzj
To: /content/machine_translation_vi_to_en/vie_6M_

In [50]:
model_checkpoint_path = "/content/machine_translation_vi_to_en/vie_6M_checkpoint.pt"
source_checkpoint_tokenizer_path = "/content/machine_translation_vi_to_en/vie_6M_source.model"
target_checkpoint_tokenizer_path = "/content/machine_translation_vi_to_en/vie_6M_target.model"
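
Before running inference, it may help to confirm that all three assets are really on disk, because a failed or partial download only shows up later as a confusing tokenizer or translation error. The following is a minimal sketch, not part of the original notebook: the helper name `missing_assets` is hypothetical, and the paths are copied from the cell above.

```python
from pathlib import Path


def missing_assets(paths):
    """Return the entries of `paths` that are not existing, non-empty files."""
    missing = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(p))
    return missing


# Paths as defined in the cell above (repeated so the sketch is self-contained).
assets = [
    "/content/machine_translation_vi_to_en/vie_6M_checkpoint.pt",
    "/content/machine_translation_vi_to_en/vie_6M_source.model",
    "/content/machine_translation_vi_to_en/vie_6M_target.model",
]
for path in missing_assets(assets):
    print("missing or empty:", path)
```

If nothing is printed, all three files exist and are non-empty. This says nothing about whether their contents are valid.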


## input your Vietnamese text here

In [84]:
file_to_translate_content = """tôi rất yêu đất nước Việt Nam và bóng đá Việt Nam.
Tôi rất thích cầu thủ Doan Van Hau.

Mẹ tôi rất thích múa và hát mỗi buổi tối.

Kinh tế Việt Nam đã tăng trưởng đều đặn 8 phần trăm GDP một năm.

Ca sĩ Trang đã đạt giải Oscar cho nam diễn viên xuất xắc nhất.

Vùng núi Bac Ninh rất xanh và có nhiều con vật quý hiếm.

Tôi yêu đất nước Việt Nam.

Sông Hồng chảy qua Hà Nội.

Phố cổ Hội An thu hút khách.

Đội tuyển nữ Việt Nam vô địch.

Nông dân trồng lúa ở Đồng Tháp.

Chùa Một Cột nằm ở Hà Nội.

Học sinh đi học mỗi sáng.

Sinh viên bảo vệ luận văn tốt nghiệp.

Người dân đi chợ mỗi ngày.

Du khách thăm vịnh Hạ Long.

Công nhân làm việc tại nhà máy.

Trường đại học tổ chức lễ khai giảng.

Công ty công nghệ phát triển phần mềm.

Chính phủ đầu tư vào y tế.

"""








In [85]:
os.makedirs("output", exist_ok=True)
new_save_path = "output/1_input.txt"

In [86]:
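# Write one sentence per line, since the tokenizer and translator work line by line.
# Note: splitting on "." also drops the period and would break decimals such as "8.5".
with open(new_save_path, "w", encoding='utf-8') as new_file:
    sentences = file_to_translate_content.split(".")
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        new_file.write(sentence + "\n")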
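
Splitting on every `.` works for the sample text, but it would cut a number like `8.5` in two and it removes the sentence-final punctuation. As a possible alternative, the sketch below splits only on whitespace that follows `.`, `!` or `?`. The function name `split_sentences` is hypothetical. Whether the checkpoint translates better with or without the final period is not stated in this notebook, so treat keeping the punctuation as an assumption to test.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


sample = "Tôi yêu Việt Nam. GDP tăng 8.5 phần trăm.\n\nXin chào!"
for sentence in split_sentences(sample):
    print(sentence)
```

The decimal `8.5` stays inside its sentence because the period there is not followed by whitespace.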


In [87]:
tokenized_for_infer_file = "output/2_tokenized_output.txt"


In [88]:
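import subprocess

# Tokenize the input with the source SentencePiece model (one sentence per line).
command = ["spm_encode",
            f"--model={source_checkpoint_tokenizer_path}",
            "--input", new_save_path,
            "--output", tokenized_for_infer_file]
print(" ".join(command))
result = subprocess.run(command, capture_output=True, text=True)
# Surface tokenizer errors instead of silently continuing with a missing file.
if result.returncode != 0:
    print("Error:", result.stderr)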


spm_encode --model=/content/machine_translation_vi_to_en/vie_6M_source.model --input output/1_input.txt --output output/2_tokenized_output.txt


In [89]:
output_filepath = "output/3_translated_output.txt"

In [90]:
# Translate the tokenized file; "--gpu 0" selects the first GPU, so this needs a GPU runtime.
command = ["onmt_translate",
        "--model", model_checkpoint_path,
        "--src", tokenized_for_infer_file,
        "--output", output_filepath,
        "--verbose",
        "--gpu", "0"]
print("start process ....")
print("running command", " ".join(command))
process = subprocess.run(command,
        capture_output=True,
        shell=False,
        text=True,
        encoding='utf-8')

# Print the captured stdout on success, otherwise the captured stderr.
if process.returncode == 0:
    print(process.stdout)
else:
    print("Error:", process.stderr)
print("end process ....")

start process ....
running command onmt_translate --model /content/machine_translation_vi_to_en/vie_6M_checkpoint.pt --src output/2_tokenized_output.txt --output output/3_translated_output.txt --verbose --gpu 0

end process ....
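
The tokenize and translate steps both launch an external tool with `subprocess.run` and capture its output, so a failure is easy to miss in the notebook. One option is a small wrapper that raises as soon as a command exits with a non-zero code. This is only a sketch: `run_checked` is a hypothetical helper, and the command used here is a stand-in so the example runs without SentencePiece or OpenNMT-py installed.

```python
import subprocess
import sys


def run_checked(command):
    """Run `command` (a list of arguments) and return stdout; raise with stderr on failure."""
    process = subprocess.run(command, capture_output=True, text=True, encoding="utf-8")
    if process.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with code {process.returncode}:\n{process.stderr}"
        )
    return process.stdout


# Stand-in command so the sketch runs anywhere; in the notebook this would be
# the spm_encode or onmt_translate argument list.
demo_output = run_checked([sys.executable, "-c", "print('ok')"])
print(demo_output)
```

Raising stops the notebook at the failing step, so later cells do not run on a missing or stale file.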


In [91]:
detokenized_output_filepath = "output/4_detokenized_output.txt"

In [92]:
command = f"spm_decode --model={target_checkpoint_tokenizer_path} --input_format=piece < {output_filepath} > {detokenized_output_filepath}"
print (command)
# Running the subprocess with the provided command
subprocess.run(command, shell=True, check=True)


spm_decode --model=/content/machine_translation_vi_to_en/vie_6M_target.model --input_format=piece < output/3_translated_output.txt > output/4_detokenized_output.txt


CompletedProcess(args='spm_decode --model=/content/machine_translation_vi_to_en/vie_6M_target.model --input_format=piece < output/3_translated_output.txt > output/4_detokenized_output.txt', returncode=0)
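
Once the detokenized file exists, each source sentence can be lined up with its translation. The sketch below assumes the translator writes exactly one output line per input line, which is worth verifying rather than taking for granted, so it raises if the two files differ in length. `pair_lines` is a hypothetical helper, and the demo uses throwaway files with placeholder translations instead of real model output.

```python
import os
import tempfile


def pair_lines(source_path, translated_path):
    """Return (source, translation) tuples; raise if the files differ in line count."""
    with open(source_path, encoding="utf-8") as f:
        source = [line.rstrip("\n") for line in f]
    with open(translated_path, encoding="utf-8") as f:
        translated = [line.rstrip("\n") for line in f]
    if len(source) != len(translated):
        raise ValueError(
            f"{len(source)} source lines vs {len(translated)} translated lines"
        )
    return list(zip(source, translated))


# Demo with two throwaway files; in the notebook, pass new_save_path and
# detokenized_output_filepath instead.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.txt")
    out = os.path.join(tmp, "out.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("Tôi yêu đất nước Việt Nam\nSông Hồng chảy qua Hà Nội\n")
    with open(out, "w", encoding="utf-8") as f:
        f.write("first translation\nsecond translation\n")
    pairs = pair_lines(src, out)

for source, translation in pairs:
    print(f"{source}  ->  {translation}")
```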

# result

In [93]:
with open(new_save_path, "r", encoding='utf-8') as src_file:
    print("original sentences\n", src_file.read())
with open(detokenized_output_filepath, "r", encoding='utf-8') as out_file:
    print("translated sentences\n", out_file.read())

original sentences
 tôi rất yêu đất nước Việt Nam và bóng đá Việt Nam
Tôi rất thích cầu thủ Doan Van Hau
Mẹ tôi rất thích múa và hát mỗi buổi tối
Kinh tế Việt Nam đã tăng trưởng đều đặn 8 phần trăm GDP một năm
Ca sĩ Trang đã đạt giải Oscar cho nam diễn viên xuất xắc nhất
Vùng núi Bac Ninh rất xanh và có nhiều con vật quý hiếm
Tôi yêu đất nước Việt Nam
Sông Hồng chảy qua Hà Nội
Phố cổ Hội An thu hút khách
Đội tuyển nữ Việt Nam vô địch
Nông dân trồng lúa ở Đồng Tháp
Chùa Một Cột nằm ở Hà Nội
Học sinh đi học mỗi sáng
Sinh viên bảo vệ luận văn tốt nghiệp
Người dân đi chợ mỗi ngày
Du khách thăm vịnh Hạ Long
Công nhân làm việc tại nhà máy
Trường đại học tổ chức lễ khai giảng
Công ty công nghệ phát triển phần mềm
Chính phủ đầu tư vào y tế

translated sentences
 I love Vietnam and Vietnamese football.
I really liked Doan Van Hau.
My mother loves dancing and singing every night.
Vietnam's economy has grown steadily 8 percent of GDP a year.
Singer Trang won the Academy Award for Best Actress
The