# Vietnamese Inverse Text Normalization - assignment description

Inverse text normalization (ITN) is the task of transforming text from its spoken form into its written form. It is particularly useful in automatic speech recognition (ASR) systems, where proper names are often transcribed by their pronunciation instead of their written form. By applying ITN, we can significantly improve the readability of the ASR system’s output. This dataset provides data for the ITN task in Vietnamese.

For example:

| Spoken (src)                                     | Written (tgt) | Type                       |
|--------------------------------------------------|---------------|----------------------------|
| tám giờ chín phút ngày ba tháng tư năm hai nghìn | 8h9 3/4/2000 | time and date              |
| tám mét khối năm mươi ki lô gam                  | 8m3 50 kg    | number and unit of measure |
| không chín sáu hai bảy bảy chín chín không bốn   | 0962779904   | phone number               |
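The phone-number row is the simplest case to illustrate: each spoken digit maps to one written digit. The snippet below is only a minimal rule-based sketch of that single case, not the approach used in this assignment (which trains a sequence-to-sequence model). The names `DIGIT_WORDS` and `spoken_digits_to_written` are hypothetical.

```python
# Hypothetical lookup table for illustration only; both spellings of "seven"
# (bảy / bẩy) appear in the dataset samples.
DIGIT_WORDS = {
    "không": "0", "một": "1", "hai": "2", "ba": "3", "bốn": "4",
    "năm": "5", "sáu": "6", "bảy": "7", "bẩy": "7", "tám": "8", "chín": "9",
}

def spoken_digits_to_written(spoken: str) -> str:
    """Convert a sequence of spoken Vietnamese digits into a digit string."""
    words = spoken.split()
    unknown = [w for w in words if w not in DIGIT_WORDS]
    if unknown:
        raise ValueError(f"not digit words: {unknown}")
    return "".join(DIGIT_WORDS[w] for w in words)

print(spoken_digits_to_written("không chín sáu hai bảy bảy chín chín không bốn"))
# prints 0962779904
```

Dates, units, and large numbers depend on context, so a lookup table like this does not generalize to them; a learned model is one way to handle that.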

## Dataset

The ITN dataset has 3 splits: _train_, _validation_, and _test_. In the _train_ and _validation_ splits, both the input (src) and its label (tgt) are provided.

| Dataset Split | Number of Instances in Split |
| ------------- |----------------------------- |
| Train         | 500,000                      |
| Validation    | 2,500                       |
| Test          | 2,500                       |

In [None]:
!pip install transformers==4.21.3
!pip install datasets sentencepiece sacrebleu jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install metrics
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [1]:
from datasets import load_dataset

In [12]:
norm_data = load_dataset('VietAI/spoken_norm_assignment')
norm_data

Using custom data configuration VietAI--spoken_norm_assignment-ada0fdcdb6b08774
Reusing dataset parquet (/home/ailab/.cache/huggingface/datasets/VietAI___parquet/VietAI--spoken_norm_assignment-ada0fdcdb6b08774/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 2500
    })
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 500000
    })
    valid: Dataset({
        features: ['src', 'tgt'],
        num_rows: 2500
    })
})

#### Example train/valid

In the _train_ and _validation_ sets, the input (src) and the output (tgt) are segmented and aligned.

In [13]:
sample_idx = 91201
set_name = 'train'
list(zip(norm_data[set_name][sample_idx]['src'][90: 120], norm_data[set_name][sample_idx]['tgt'][90: 120]))

[('máy', 'máy'),
 ('chỉ', 'chỉ'),
 ('được', 'được'),
 ('lưu', 'lưu'),
 ('thông', 'thông'),
 ('từ', 'từ'),
 ('sáu giờ', '6h'),
 ('đến', 'đến'),
 ('tháng chín', 'tháng 9'),
 ('các', 'các'),
 ('loại', 'loại'),
 ('ô', 'ô'),
 ('tô', 'tô'),
 ('qua', 'qua'),
 ('hầm', 'hầm'),
 ('cộng hai bẩy bốn ba bốn hai ba một sáu bẩy hai', '+27434231672'),
 ('phải', 'phải'),
 ('giữ', 'giữ'),
 ('khoảng', 'khoảng'),
 ('cách', 'cách'),
 ('ba nghìn ba trăm bẩy lăm phẩy ba trăm bốn mươi bảy oát giờ trên tấn',
  '3375,347 wh/tấn'),
 ('chạy', 'chạy'),
 ('nhanh', 'nhanh'),
 ('nhất', 'nhất'),
 ('chín triệu ba mươi ngàn', '9.030.000'),
 ('km/giờ', 'km/giờ'),
 ('và', 'và'),
 ('chậm', 'chậm'),
 ('nhất', 'nhất'),
 ('là', 'là')]
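Because the segments are aligned one-to-one, the segments that actually need normalization can be found by comparing each pair. The sketch below uses a few pairs copied from the output above; the variable names are illustrative only.

```python
# A few aligned (src, tgt) segment pairs copied from the sample above.
pairs = [
    ("từ", "từ"),
    ("sáu giờ", "6h"),
    ("đến", "đến"),
    ("tháng chín", "tháng 9"),
]

# Segments where the written form differs from the spoken form.
changed = [(src, tgt) for src, tgt in pairs if src != tgt]
changed_ratio = len(changed) / len(pairs)

print(changed)        # [('sáu giờ', '6h'), ('tháng chín', 'tháng 9')]
print(changed_ratio)  # 0.5
```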

#### Example test

In the _test_ split, only the input (src) is provided. The cell below reads from the _valid_ split, so the target (tgt) can be printed as well.

In [38]:
sample_idx = 0
set_name = 'valid'
print("src: ", norm_data[set_name][sample_idx]['src'][0])
print("tgt: ", norm_data[set_name][sample_idx]['tgt'])

src:  ô
tgt:  ông đỗ hữu trí chủ một ngôi nhà mới xây ở quận -69.619,290 thành phố hồ chí minh cho biết do tin tưởng chủ thầu nên ông giao toàn bộ việc đổ móng cột sàn cho họ toàn bộ các hạng mục đều có hợp đồng kể cả hợp đồng cụ thể với bên cung cấp bê tông tươi theo đó phải cung cấp bê tông mác +83412264145 để đổ sàn và cột chủ công trình/nhà có thể thuê máy trộn bê tông mini để giám sát chất lượng bê tông kết quả là mẫu bê tông chỉ có mác 81 bê tông tươi mác -3388.06939 hiện có giá khoảng 87738002236 672 ngày/ft đồng/m3 mác 1000 khoảng 458 cc đồng mác 46.056 từ -8834.746 pa/m2 đồng/m3 những nơi làm ăn gian dối thường sử dụng đá non dễ vỡ cát có nhiều tạp chất để tiết kiệm ít nhất 1000 đồng/m3 bằng cách này họ sẽ bỏ túi từ 2.067.251 -12,085 đồng/m3 bê tông tươi


In [1]:
# Save data to disk
import os
from pathlib import Path
from datasets import load_from_disk
DATA_DIR = "data/"
file = "dataset.hf"

def join_words(batch):
    source_words = batch['src']
    target_words = batch['tgt'] 
    source = [' '.join(word) for word in source_words]
    target = [' '.join(word) for word in target_words]
    batch['src'] = source
    batch['tgt'] = target
    return batch

# Join the segments only once; later runs reuse the copy saved on disk.
dataset_path = os.path.join(DATA_DIR, file)
if not Path(dataset_path).is_dir():
    os.makedirs(DATA_DIR, exist_ok=True)
    # Only the valid and train splits are joined here; test is saved as loaded.
    norm_data['valid'] = norm_data['valid'].map(
        join_words,
        batched=True,
        batch_size=10000
    )
    norm_data['train'] = norm_data['train'].map(
        join_words,
        batched=True,
        batch_size=10000
    )
    norm_data.save_to_disk(dataset_path)

# Load data from disk
norm_data = load_from_disk(os.path.join(DATA_DIR, file))
norm_data

DatasetDict({
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 2500
    })
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 500000
    })
    valid: Dataset({
        features: ['src', 'tgt'],
        num_rows: 2500
    })
})
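With `batched=True`, the function passed to `map` receives a dictionary of columns, where each column is a list with one entry per example. The sketch below mimics that call on a hand-made batch without the `datasets` library; `join_segments` and the batch contents are illustrative stand-ins, not the notebook's actual data.

```python
def join_segments(batch):
    """Join each example's list of segments into one space-separated string."""
    return {
        "src": [" ".join(segments) for segments in batch["src"]],
        "tgt": [" ".join(segments) for segments in batch["tgt"]],
    }

# A hand-made batch of two examples, shaped the way a batched map passes data.
batch = {
    "src": [["sáu giờ", "đến"], ["tháng chín"]],
    "tgt": [["6h", "đến"], ["tháng 9"]],
}

joined = join_segments(batch)
print(joined["src"])  # ['sáu giờ đến', 'tháng chín']
print(joined["tgt"])  # ['6h đến', 'tháng 9']
```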

In [2]:
SRC_MAX_LENGTH = 100
TGT_MAX_LENGTH = 100

In [3]:
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os
from transformers import EncoderDecoderModel

cache_dir = './cache'
model_name = 'nguyenvulebinh/envibert'

def download_tokenizer_files():
    resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
    for item in resources:
        if not os.path.exists(os.path.join(cache_dir, item)):
            tmp_file = hf_bucket_url(model_name, filename=item)
            tmp_file = cached_path(tmp_file, cache_dir=cache_dir)
            os.rename(tmp_file, os.path.join(cache_dir, item))

def init_tokenizer():
    download_tokenizer_files()
    tokenizer = SourceFileLoader("envibert.tokenizer",
                                 os.path.join(cache_dir,
                                              'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
    return tokenizer

def init_model():
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    download_tokenizer_files()
    tokenizer = SourceFileLoader("envibert.tokenizer",
                                 os.path.join(cache_dir,
                                              'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
    # Encoder and decoder are both initialized from the same checkpoint,
    # but their weights are not tied (tie_encoder_decoder=False)
    roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(model_name,
                                                                         model_name,
                                                                         tie_encoder_decoder=False)

    # set special tokens
    roberta_shared.config.decoder_start_token_id = tokenizer.bos_token_id
    roberta_shared.config.eos_token_id = tokenizer.eos_token_id
    roberta_shared.config.pad_token_id = tokenizer.pad_token_id

    # set default decoding params; num_beams = 1 means greedy decoding,
    # so pass num_beams > 1 to generate() to use beam search
    roberta_shared.config.max_length = 100
    roberta_shared.config.early_stopping = True
    roberta_shared.config.no_repeat_ngram_size = 3
    roberta_shared.config.length_penalty = 2.0
    roberta_shared.config.num_beams = 1
    roberta_shared.config.vocab_size = roberta_shared.config.encoder.vocab_size

    return roberta_shared, tokenizer

In [4]:
model, tokenizer = init_model()

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at nguyenvulebinh/envibert and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weigh

In [5]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, PreTrainedTokenizerBase
from dataclasses import dataclass
from transformers.utils import PaddingStrategy
from typing import Optional, Union, Any
import numpy as np
from datasets import load_metric
import torch

# The DataCollator is used for tokenization and padding to make batch input
@dataclass
class DataCollatorForEnViMT:
    def __init__(
        self,
        tokenizer: PreTrainedTokenizerBase,
        model: Optional[Any] = None,
        padding: Union[bool, str, PaddingStrategy] = True,
        max_length: Optional[int] = None,
        target_max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        label_pad_token_id: int = -100,
        return_tensors: str = "pt"
    ):
        self.tokenizer = tokenizer
        self.model = model
        self.padding = padding
        self.max_length = max_length
        self.target_max_length = target_max_length
        self.pad_to_multiple_of = pad_to_multiple_of
        self.label_pad_token_id = label_pad_token_id
        self.return_tensors = return_tensors
        
    def __call__(self, features, return_tensors=None):
        features_tokenized = []
        for feature in features:
            src = feature['src']
            tgt = feature['tgt']
            if len(src) == 0 or len(tgt) == 0:
                continue
            temp = {}
            # Tokenize the source sentence
            temp['input_ids'] = self.tokenizer(src, max_length=self.max_length, truncation=True)["input_ids"]
            # Set up the tokenizer for targets
            with self.tokenizer.as_target_tokenizer():
                # [1:] drops the first token id of the tokenized target
                # (presumably BOS, which the decoder start token replaces)
                temp['labels'] = self.tokenizer(
                    tgt, max_length=self.target_max_length, truncation=True
                )["input_ids"][1:]
            features_tokenized.append(temp)
        features = features_tokenized

        if return_tensors is None:
            return_tensors = self.return_tensors
        labels = [feature["labels"] for feature in features] if "labels" in features[0].keys() else None
        # We have to pad the labels before calling `tokenizer.pad` as this method won't pad them and needs them of the
        # same length to return tensors.
        if labels is not None:
            max_label_length = max(len(l) for l in labels)
            if self.pad_to_multiple_of is not None:
                max_label_length = (
                    (max_label_length + self.pad_to_multiple_of - 1)
                    // self.pad_to_multiple_of
                    * self.pad_to_multiple_of
                )

            padding_side = self.tokenizer.padding_side
            for feature in features:
                remainder = [self.label_pad_token_id] * (max_label_length - len(feature["labels"]))
                
                if isinstance(feature["labels"], list):
                    feature["labels"] = (
                        feature["labels"] + remainder if padding_side == "right" else remainder + feature["labels"]
                    )
                elif padding_side == "right":
                    feature["labels"] = np.concatenate([feature["labels"], remainder]).astype(np.int64)
                else:
                    feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
        
        features = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=return_tensors,
        )
        return features
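The label handling in the collator can be sketched in plain Python. Labels are padded with `-100`, the default index that PyTorch's cross-entropy loss ignores. To my understanding, `EncoderDecoderModel` then builds the decoder inputs by shifting the labels one position to the right, prepending the decoder start token, and replacing `-100` with the pad id. The helpers below are hypothetical illustrations, not library functions; the ids 0 (BOS) and 1 (pad) and 2 (EOS) follow the model config printed later in this notebook.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss

def pad_labels(label_seqs, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the length of the longest one."""
    longest = max(len(seq) for seq in label_seqs)
    return [seq + [pad_value] * (longest - len(seq)) for seq in label_seqs]

def shift_right(labels, decoder_start_token_id, pad_token_id):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + labels[:-1]
    return [pad_token_id if tok == IGNORE_INDEX else tok for tok in shifted]

# Two toy label sequences, each ending with EOS (id 2).
labels = pad_labels([[11, 12, 2], [21, 2]])
decoder_inputs = [shift_right(row, decoder_start_token_id=0, pad_token_id=1)
                  for row in labels]

print(labels)          # [[11, 12, 2], [21, 2, -100]]
print(decoder_inputs)  # [[0, 11, 12], [0, 21, 2]]
```

This also suggests why the collator drops the first token of the tokenized target: the start token is added again when the decoder inputs are built.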

In [6]:
data_collator = DataCollatorForEnViMT(tokenizer, model=model, max_length=SRC_MAX_LENGTH, target_max_length=TGT_MAX_LENGTH, padding='max_length')

In [7]:
# Define a function for evaluation
def get_metric_compute_fn(tokenizer):
    metric = load_metric("sacrebleu")

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        # In case the model returns more than the prediction logits
        if isinstance(preds, tuple):
            preds = preds[0]

        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

        # Replace -100s in the labels as we can't decode them
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Some simple post-processing
        decoded_preds = [pred.strip() for pred in decoded_preds]
        decoded_labels = [[label.strip()] for label in decoded_labels]

        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        return {"bleu": result["score"]}

    return compute_metrics

In [8]:
import wandb

wandb.init(project="VietAI-NLP-assigment1", entity="ponyo")

[34m[1mwandb[0m: Currently logged in as: [33mponyo[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [9]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, PreTrainedTokenizerBase
import os
#from metrics import get_metric_compute_fn

def init_trainer(model, tokenizer, dataset, data_collator, epochs=5, batch_size=16):
    checkpoint_path = "./itn_checkpoints"
    training_args = Seq2SeqTrainingArguments(
        output_dir="VietAI-NLP-ITN",
        logging_strategy = 'steps',
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        gradient_accumulation_steps=1,
        predict_with_generate=True,
        save_total_limit=2,
        do_train=True,
        do_eval=True,
        logging_steps=1000,
        num_train_epochs = epochs,
        warmup_ratio=1 / epochs,
        logging_dir=os.path.join(checkpoint_path, 'log'),
        overwrite_output_dir=True,
        metric_for_best_model='bleu',
        greater_is_better=True,
        eval_accumulation_steps=10,
        dataloader_num_workers=20,
        # sharded_ddp="simple",
        #fp16=True,
        remove_unused_columns=False,
        report_to='wandb',
        push_to_hub=True,
    )

    # instantiate trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        compute_metrics=get_metric_compute_fn(tokenizer),
        train_dataset=dataset['train'],    # full training split; use .shard() or .select() for a quicker run
        eval_dataset=dataset['valid'],    # full validation split
        data_collator=data_collator,
        tokenizer=tokenizer
    )
    return trainer

In [10]:
trainer = init_trainer(model, tokenizer, norm_data, data_collator)

/home/ailab/DatNT/assignment1/VietAI-NLP-ITN is already a clone of https://huggingface.co/datnth1709/VietAI-NLP-ITN. Make sure you pull the latest changes with `repo.git_pull()`.


In [11]:
trainer.train()

***** Running training *****
  Num examples = 500000
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 156250
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


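| Epoch | Training Loss | Validation Loss | Bleu      |
|-------|---------------|-----------------|-----------|
| 1     | 0.6529        | 0.565955        | 78.731511 |
| 2     | 0.5125        | 0.477003        | 81.397934 |
| 3     | 0.4798        | 0.455359        | 81.672031 |
| 4     | 0.4568        | 0.443488        | 81.775255 |
| 5     | 0.4387        | 0.437754        | 81.857126 |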


***** Running Evaluation *****
  Num examples = 2500
  Batch size = 16
Saving model checkpoint to VietAI-NLP-ITN/checkpoint-31250
Configuration saved in VietAI-NLP-ITN/checkpoint-31250/config.json
Model weights saved in VietAI-NLP-ITN/checkpoint-31250/pytorch_model.bin
tokenizer config file saved in VietAI-NLP-ITN/checkpoint-31250/tokenizer_config.json
Special tokens file saved in VietAI-NLP-ITN/checkpoint-31250/special_tokens_map.json
tokenizer config file saved in VietAI-NLP-ITN/tokenizer_config.json
Special tokens file saved in VietAI-NLP-ITN/special_tokens_map.json
Deleting older checkpoint [VietAI-NLP-ITN/checkpoint-157] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2500
  Batch size = 16
Saving model checkpoint to VietAI-NLP-ITN/checkpoint-62500
Configuration saved in VietAI-NLP-ITN/checkpoint-62500/config.json
Model weights saved in VietAI-NLP-ITN/checkpoint-62500/pytorch_model.bin
tokenizer config file saved in VietAI-NLP-ITN/checkpoint-62500/toke

TrainOutput(global_step=156250, training_loss=0.8985652455078125, metrics={'train_runtime': 23847.9965, 'train_samples_per_second': 104.831, 'train_steps_per_second': 6.552, 'total_flos': 9.40655415e+16, 'train_loss': 0.8985652455078125, 'epoch': 5.0})
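The step counts in the training log follow from the dataset size and batch settings. The arithmetic below is a small sketch that assumes a single device, which matches the logged total batch size of 16. The warm-up figure is approximate and simply applies `warmup_ratio = 1 / epochs` to the total.

```python
import math

num_examples = 500_000
batch_size = 16
gradient_accumulation_steps = 1
epochs = 5

# One optimization step consumes batch_size * accumulation examples.
steps_per_epoch = math.ceil(num_examples / (batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * epochs

# warmup_ratio = 1 / epochs, so warm-up covers roughly the first epoch.
warmup_steps = total_steps // epochs

print(steps_per_epoch)  # 31250
print(total_steps)      # 156250
print(warmup_steps)     # 31250
```

The per-epoch figure matches the first checkpoint name in the log (`checkpoint-31250`), and the total matches `Total optimization steps = 156250`.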

In [12]:
trainer.push_to_hub(tags="translation", commit_message="Training complete")

Saving model checkpoint to VietAI-NLP-ITN
Configuration saved in VietAI-NLP-ITN/config.json
Model weights saved in VietAI-NLP-ITN/pytorch_model.bin
tokenizer config file saved in VietAI-NLP-ITN/tokenizer_config.json
Special tokens file saved in VietAI-NLP-ITN/special_tokens_map.json
Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Sequence-to-sequence Language Modeling', 'type': 'text2text-generation'}, 'metrics': [{'name': 'Bleu', 'type': 'bleu', 'value': 81.8571258360715}]}
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 32.0k/594M [00:00<?, ?B/s]

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/datnth1709/VietAI-NLP-ITN
   f15f351..e6190ef  main -> main



## Model evaluation

The _test_ set will be normalized using the trained model. The [WER](https://huggingface.co/spaces/evaluate-metric/wer) (word error rate) metric is used to evaluate this assignment.

In [30]:
from datasets import load_metric

In [31]:
wer = load_metric('wer')

### Baseline WER result on the _valid_ set (do-nothing model)

In [33]:
predictions = [' '.join(item) for item in norm_data['valid']['src']]
references = [' '.join(item) for item in norm_data['valid']['tgt']]
wer.compute(predictions=predictions,
            references=references)

0.31806296858909483
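The baseline above scores the unnormalized spoken text directly against the written reference. As a rough sketch of what that number measures, WER is the word-level edit distance (substitutions, deletions, insertions) summed over all sentences, divided by the total number of reference words. The pure-Python functions below are an illustrative reimplementation under that definition, not the code behind `load_metric('wer')`.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (row-by-row DP)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def word_error_rate(references, predictions):
    """Total word edits divided by the total number of reference words."""
    errors = sum(edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_ref_words = sum(len(r.split()) for r in references)
    return errors / total_ref_words

# "Do nothing" prediction: the spoken text is returned unchanged.
score = word_error_rate(
    references=["từ 6h đến tháng 9"],
    predictions=["từ sáu giờ đến tháng chín"],
)
print(score)  # 0.6 (3 word edits over 5 reference words)
```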

In [13]:
DATA_DIR = "data/"
file = "dataset.hf"

# Load data from disk
dataset = load_from_disk(os.path.join(DATA_DIR, file))

In [14]:
trained_model, tokenizer = init_model()
# from_pretrained is a classmethod, so call it on the class to load the fine-tuned weights
trained_model = EncoderDecoderModel.from_pretrained("datnth1709/VietAI-NLP-ITN")

loading configuration file https://huggingface.co/nguyenvulebinh/envibert/resolve/main/config.json from cache at /home/ailab/.cache/huggingface/transformers/13a4cc8c4ffe1ad0098bfac7e49814b38a03fd1d5559f0416552fc8b525e717a.8b442c60cec207f1183833b0ed7caf38612373e2e24c8e8c913b92c586ec7a4c
Model config RobertaConfig {
  "_name_or_path": "nguyenvulebinh/envibert",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 6,
  "output_hidden_states": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "type

Model config EncoderDecoderConfig {
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "nguyenvulebinh/envibert",
    "add_cross_attention": true,
    "architectures": [
      "RobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": 0,
    "chunk_size_feed_forward": 0,
    "classifier_dropout": null,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 2,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_siz

Downloading:   0%|          | 0.00/594M [00:00<?, ?B/s]

storing https://huggingface.co/datnth1709/VietAI-NLP-ITN/resolve/main/pytorch_model.bin in cache at /home/ailab/.cache/huggingface/transformers/2ca4e11caed16518c1886982dcb1869ca84b94dcf765b9cfa7b063bcb005150f.5af94e95c93a7bc5ef2b7562af9d91de7987ffd82d14ade8222bf33a98473007
creating metadata file for /home/ailab/.cache/huggingface/transformers/2ca4e11caed16518c1886982dcb1869ca84b94dcf765b9cfa7b063bcb005150f.5af94e95c93a7bc5ef2b7562af9d91de7987ffd82d14ade8222bf33a98473007
loading weights file https://huggingface.co/datnth1709/VietAI-NLP-ITN/resolve/main/pytorch_model.bin from cache at /home/ailab/.cache/huggingface/transformers/2ca4e11caed16518c1886982dcb1869ca84b94dcf765b9cfa7b063bcb005150f.5af94e95c93a7bc5ef2b7562af9d91de7987ffd82d14ade8222bf33a98473007
All model checkpoint weights were used when initializing EncoderDecoderModel.

All the weights of EncoderDecoderModel were initialized from the model checkpoint at datnth1709/VietAI-NLP-ITN.
If your task is similar to the task the model

## Inference

In [15]:
import random
SRC_MAX_LENGTH = 100
TGT_MAX_LENGTH = 100

idx = random.randrange(2500) # the valid and test sets each contain 2500 sentences, so valid indices are 0..2499
set_name = 'valid'
sentence = dataset[set_name]['src'][idx]

print("Input: ")
# print(dataset['valid']['src'][idx])
print(sentence)
print(100 * '-')

if set_name == 'valid':
  print("Output ground truth: ")
  print(dataset[set_name]['tgt'][idx])
  print(100 * '-')

print("Output without Beam-search: ")
# encode context the generation is conditioned on
input_ids = tokenizer(sentence, max_length=SRC_MAX_LENGTH, truncation=True, return_tensors='pt')["input_ids"]
# generate text without beam-search
outputs = trained_model.generate(
    input_ids, 
    max_length=TGT_MAX_LENGTH, 
    num_return_sequences=1, 
    early_stopping=True
)
for i, output in enumerate(outputs):
  output_pieces = tokenizer.convert_ids_to_tokens(output.numpy().tolist())
  output_text = tokenizer.sp_model.decode(output_pieces)
  print(output_text)
print(100 * '-')


# generate text using beam-search
print("Output with Beam-search: ")
beam_outputs = trained_model.generate(
    input_ids, 
    max_length=TGT_MAX_LENGTH, 
    num_beams=10, 
    no_repeat_ngram_size=2, 
    num_return_sequences=3, 
    early_stopping=True
)


for i, beam_output in enumerate(beam_outputs):
  output_pieces = tokenizer.convert_ids_to_tokens(beam_output.numpy().tolist())
  output_text = tokenizer.sp_model.decode(output_pieces)
  print("{}: {}\n{}".format(i, output_text,'-'*20))

Input:
thời báo kinh tế sài gòn online bộ trưởng âm chín chín phẩy chín tám bộ giao thông vận tải hồ nghĩa dũng đã ký quyết định phê duyệt quy hoạch sân tám trăm tám mươi chín lít bay tại tỉnh an giang giai đoạn đến mười bốn giờ năm mươi ba và định hướng đến mồng một tháng bảy có tổng nhu cầu vốn dự kiến là tám chín ba sáu tám một ba bẩy bốn năm bốn tỉ đồng bình nguyên dự án sân bay tại xã cần đăng huyện châu thành sẽ được thực hiện thành một ngàn tám trăm ba mươi bẩy giai đoạn với số vốn cần cho giai đoạn đến bốn giờ không phút là âm tám tám chấm không năm mươi tỉ đồng và giai đoạn định hướng đến một nghìn sáu trăm là âm chín ba ngàn hai trăm năm bốn phẩy bốn nghìn chín mươi bẩy tỉ đồng theo quyết định phê duyệt quy hoạch số i gi gi rờ lờ chín không không đến ngày mười hai sân bay an giang sẽ có một đường hạ cất cánh dài hai mươi ba ngàn bẩy trăm mười bốn phẩy ba nghìn tám trăm sáu mươi chín mét và rộng cộng bốn hai sáu sáu sáu không năm bốn sáu một mét đảm bảo cho các hoạt động khai 
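The cell above decodes the same input twice, once greedily and once with beam search. The difference can be illustrated on a toy problem: greedy decoding commits to the most likely token at each step, while beam search keeps several partial hypotheses and can recover a sequence whose overall probability is higher. `TOY_MODEL` and `beam_search` below are hypothetical and unrelated to the `generate()` implementation in `transformers`.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, num_beams, steps):
    """Keep the num_beams best prefixes by cumulative log-probability."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, steps=2)[0]
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, steps=2)[0]

print(greedy_seq, round(math.exp(greedy_score), 2))  # starts with 'A', probability 0.3
print(beam_seq, round(math.exp(beam_score), 2))      # ('B', 'x') 0.36
```

With `num_beams=1` the search is greedy and ends up with a sequence of probability 0.3, while two beams find the sequence with probability 0.36.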

In [None]:
import random
SRC_MAX_LENGTH = 100
TGT_MAX_LENGTH = 100

idx = random.randrange(2500)  # valid indices are 0..2499
sentence = dataset['valid']['src'][idx]