<a href="https://colab.research.google.com/github/dsynderg/CS-479-machine-translation/blob/main/multilingual_with_scentance_peice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sentencepiece



In [2]:
import sentencepiece as spm

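# Train the source-side tokenizer. The "multi" corpus appears to mix both
# translation directions, so despite its prefix this model sees both languages.
spm.SentencePieceTrainer.train(
  input='trainsrcmulti.txt',
  model_prefix='english_sp_model',  # writes english_sp_model.model and .vocab
  vocab_size=16000,
)

# Train the target-side tokenizer on the matching target file.
spm.SentencePieceTrainer.train(
  input='traintgtmulti.txt',
  model_prefix='spanish_sp_model',  # writes spanish_sp_model.model and .vocab
  vocab_size=16000,
)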
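
Judging by the `SENT` lines in the translation log further down, each source line starts with a target-language tag such as `<english>` or `<spanish>`. A minimal sketch for checking how the two directions are balanced before training, assuming the training file follows the same convention; `count_language_tags` is a hypothetical helper, not part of SentencePiece or OpenNMT-py.

```python
import os
import re
from collections import Counter

# Assumption: every source line begins with a tag like "<english>" or "<spanish>",
# as the SENT lines in the translation log suggest.
TAG_PATTERN = re.compile(r"^\s*<([A-Za-z_]+)>")


def count_language_tags(lines):
    """Count leading language tags; untagged lines are counted under None."""
    counts = Counter()
    for line in lines:
        match = TAG_PATTERN.match(line)
        counts[match.group(1) if match else None] += 1
    return counts


# Only runs when the training file is present in the working directory.
if os.path.exists("trainsrcmulti.txt"):
    with open("trainsrcmulti.txt", encoding="utf-8") as f:
        print(count_language_tags(f))
```

The log also shows each tag being split into three pieces (`▁<`, `english`, `>`). SentencePiece's trainer accepts a `user_defined_symbols` option that keeps listed strings as single pieces; whether that helps this model is untested here.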

In [3]:
!pip install OpenNMT-py
!pip install "numpy<2.0" # Pin NumPy below 2.0: OpenNMT-py is no longer maintained and errors out with NumPy 2.x

Collecting OpenNMT-py
  Downloading OpenNMT_py-3.5.1-py3-none-any.whl.metadata (8.8 kB)
Collecting torch<2.3,>=2.1 (from OpenNMT-py)
  Downloading torch-2.2.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting configargparse (from OpenNMT-py)
  Downloading configargparse-1.7.1-py3-none-any.whl.metadata (24 kB)
Collecting ctranslate2<5,>=4 (from OpenNMT-py)
  Downloading ctranslate2-4.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting waitress (from OpenNMT-py)
  Downloading waitress-3.0.2-py3-none-any.whl.metadata (5.8 kB)
Collecting pyonmttok<2,>=1.37 (from OpenNMT-py)
  Downloading pyonmttok-1.37.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting sacrebleu (from OpenNMT-py)
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51.8/51.8 kB 8.6 MB/s eta 0:00:00
Collecting rapidfuzz (from OpenNMT-p

In [1]:
# Build YAML Config file
yaml_text = """
# sentance_peice_yaml.yaml

## Where the samples will be written
save_data: /content/example

src_vocab: /content/example.vocab.src
tgt_vocab: /content/example.vocab.tgt

# Overwrite existing files in the folder (set to False to prevent overwriting)
overwrite: True

# -- Sentencepiece Params --
# Tokenization options
src_subword_type: sentencepiece
src_subword_model: english_sp_model.model
tgt_subword_type: sentencepiece
tgt_subword_model: spanish_sp_model.model

# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"


# Corpus opts:
data:
    corpus_1:
        path_src: /content/trainsrcmulti.txt
        path_tgt: /content/traintgtmulti.txt
        transforms: [onmt_tokenize]
    valid:
        path_src: /content/valsrcmulti.txt
        path_tgt: /content/valtgtmulti.txt
        transforms: [onmt_tokenize]

# Train on a single GPU
bucket_size: 10000
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: /content/modle
save_checkpoint_steps: 500
train_steps: 20000
valid_steps: 500
"""

with open("sentance_peice_yaml.yaml", "w") as f:
    f.write(yaml_text)
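
Both `onmt_build_vocab` and `onmt_train` read the corpus files and SentencePiece models named in this config, so a mistyped path only surfaces once those commands run. A minimal sketch of a pre-flight check, assuming one `key: value` pair per line as in the config above; `missing_inputs` and `INPUT_KEYS` are hypothetical names, not part of OpenNMT-py.

```python
import os
import re

# Keys whose values are input files that must exist before training.
# Assumption: one "key: value" pair per line, as in the config written above.
INPUT_KEYS = ("path_src", "path_tgt", "src_subword_model", "tgt_subword_model")


def missing_inputs(config_text, exists=os.path.exists):
    """Return the input paths named in the config text that do not exist."""
    missing = []
    for line in config_text.splitlines():
        match = re.match(r"^\s*(\w+):\s*(\S+)\s*$", line)
        if match and match.group(1) in INPUT_KEYS and not exists(match.group(2)):
            missing.append(match.group(2))
    return missing
```

Calling `missing_inputs(yaml_text)` in the notebook should return an empty list when all corpus files and both `.model` files are in place. The `exists` parameter is only there so the check can be exercised without touching the filesystem.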

In [2]:
!onmt_build_vocab -config sentance_peice_yaml.yaml

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2025-10-26 01:40:22,621 INFO] Counter vocab from 5000 samples.
[2025-10-26 01:40:22,621 INFO] Build vocab on 5000 transformed examples/corpus.
[2025-10-26 01:40:23,427 INFO] Counters src: 11787
[2025-10-26 01:40:23,427 INFO] Counters tgt: 11692


In [3]:
!onmt_train -config sentance_peice_yaml.yaml

[2025-10-26 01:40:40,028 INFO] Parsed 2 corpora from -data.
[2025-10-26 01:40:40,028 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-10-26 01:40:40,111 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', ',', '▁<', '>', '.', '▁de', '▁the']
[2025-10-26 01:40:40,111 INFO] The decoder start token is: <s>
[2025-10-26 01:40:40,111 INFO] Building model...
[2025-10-26 01:40:45,991 INFO] Switching model to float32 for amp/apex_amp
[2025-10-26 01:40:45,991 INFO] Non quantized layer compute is fp32
[2025-10-26 01:40:46,355 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(11792, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3, inplace=False)
    )
    (rnn): LSTM(500, 500, num_layers=2, batch_first=True, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Seq
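
The `Embedding(11792, 500, padding_idx=1)` size in the model printout differs from the `Counters src: 11787` figure reported by `onmt_build_vocab`. One plausible accounting, offered as an assumption rather than confirmed behaviour: OpenNMT-py adds the four special tokens listed at the head of the vocabulary (`<unk>`, `<blank>`, `<s>`, `</s>`) and then pads the size up to a multiple of 8 (its `vocab_size_multiple` option; the default of 8 is assumed here). `padded_vocab_size` is a hypothetical helper that reproduces the arithmetic.

```python
def padded_vocab_size(counter_size, num_specials=4, multiple=8):
    """Add special tokens, then round up to the next multiple of `multiple`."""
    total = counter_size + num_specials
    return -(-total // multiple) * multiple  # ceiling division, then scale back


# 11787 counted tokens + 4 specials = 11791, rounded up to a multiple of 8.
print(padded_vocab_size(11787))  # prints 11792
```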

In [9]:
translate_yaml_text = """
model: /content/modle_step_20000.pt
src: /content/testsrcmulti.txt
output: /content/sentancepeice_2000.txt
gpu: 0
verbose: True

# Sentencepiece Params

# Tokenization options
transforms: onmt_tokenize
src_subword_type: sentencepiece
src_subword_model: english_sp_model.model
tgt_subword_type: sentencepiece
tgt_subword_model: spanish_sp_model.model

src_subword_nbest: 20
src_subword_alpha: 0.1
"""

with open("translate_yaml.yaml", "w") as f:
    f.write(translate_yaml_text)
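
With `save_checkpoint_steps: 500` and `train_steps: 20000`, training writes checkpoints named `<save_model>_step_<N>.pt`, which is where `/content/modle_step_20000.pt` above comes from. If training stops early, that hard-coded path will not exist. A minimal sketch for picking the most recent checkpoint instead; `latest_checkpoint` is a hypothetical helper, and the naming pattern is inferred from the paths in this notebook.

```python
import glob
import re


def latest_checkpoint(prefix):
    """Return the path `<prefix>_step_<N>.pt` with the largest N, or None."""
    best_step, best_path = -1, None
    for path in glob.glob(glob.escape(prefix) + "_step_*.pt"):
        match = re.search(r"_step_(\d+)\.pt$", path)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

In this notebook the call would be `latest_checkpoint("/content/modle")`, and the result could be substituted into the `model:` line before writing the YAML file.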

In [10]:
!onmt_translate -config translate_yaml.yaml

Streaming output truncated to the last 5000 lines.
PRED 1001: ▁and ▁they ▁needed ▁the ▁young ▁women ▁to ▁form ▁part ▁of ▁him .
PRED SCORE: -0.3824

[2025-10-26 03:09:44,377 INFO] 
SENT 1002: ['▁<', 'english', '>', '<unk>', '▁fuente', 's', '▁para', '▁apoyar', '▁la', '▁valid', 'ez', '▁de', '▁la', '▁información', '▁', 'en', '▁el', '▁árbol', '.']
PRED 1002: <unk> ▁Source s ▁to ▁support ▁the <unk> ▁of ▁the ▁information ▁on ▁the ▁tree .
PRED SCORE: -0.4458

[2025-10-26 03:09:44,378 INFO] 
SENT 1003: ['▁<', 'spanish', '>', '▁The', '<unk>', '▁of', '▁Mormon', ',', '▁which', '▁consist', '▁of', '▁a', 'n', '▁abr', 'id', 'g', 'ment', '▁by', '▁Mormon', '▁from', '▁the', '▁large', '▁plates', '▁of', '▁Nephi', ',', '▁with', '▁many', '▁comment', 'aries', '.', '▁These', '▁plates', '▁also', '▁contained', '▁a', '<unk>', 'ation', '▁of', '▁the', '▁history', '▁by', '▁Mormon', '▁and', '▁addition', 's', '▁by', '▁his', '▁son', '▁Moroni', '.']
PRED 1003: ▁El <unk> ▁de ▁Mormón , ▁que ▁representa ba ▁d
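
The verbose log prints each prediction as space-separated SentencePiece pieces, where `▁` marks the start of a word. A minimal sketch of mapping such a line back to plain text, assuming the standard SentencePiece convention; `pieces_to_text` is a hypothetical helper, and the file named by `output` may already be detokenized by the `onmt_tokenize` transform.

```python
def pieces_to_text(pieces):
    """Join SentencePiece pieces and turn the word-boundary marker into spaces."""
    return "".join(pieces).replace("\u2581", " ").strip()


# PRED 1001 from the log above, copied verbatim.
pred = "▁and ▁they ▁needed ▁the ▁young ▁women ▁to ▁form ▁part ▁of ▁him ."
print(pieces_to_text(pred.split()))
# prints: and they needed the young women to form part of him.
```

Pieces without a leading `▁` attach to the previous piece, which is why `▁Source s` in PRED 1002 reads as "Sources".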