# Finetune T5 locally for machine translation on COVID-19 Health Service Announcements with Hugging Face

[reference](https://github.com/aws/studio-lab-examples/blob/main/natural-language-processing/NLP_Disaster_Recovery_Translation.ipynb)

This notebook is designed to run in SageMaker Studio Lab on a `g4dn.xlarge` GPU instance. If you are not on a GPU runtime right now, please restart your session and select `GPU`; this lets you train your model in tens of minutes rather than hours.

If you are ready to train a large-scale machine translation model, please check out Hugging Face on Amazon SageMaker!

Otherwise, please enjoy this notebook.

### Step 0. Install all necessary packages

In [4]:
%%writefile requirements.txt

ipywidgets
git+https://github.com/huggingface/transformers
datasets
sacrebleu
torch
sentencepiece
mlfoundry

Overwriting requirements.txt


In [5]:
%pip install -r requirements.txt

Collecting git+https://github.com/huggingface/transformers (from -r requirements.txt (line 3))
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-znnldqlm
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-znnldqlm
  Resolved https://github.com/huggingface/transformers to commit 49cd736a288a315d741e5c337790effa4c9fa689
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Note: you may need to restart the kernel to use updated packages.


In [6]:
import IPython
# make sure to restart your kernel to use the newly installed packages
# IPython.Application.instance().kernel.do_shutdown(True) 
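
After restarting the kernel, a quick environment check can confirm that the packages from `requirements.txt` are importable and that PyTorch can see a GPU. The sketch below is only an illustration: the helper names `installed_version` and `training_device` are hypothetical, and the package list simply mirrors the requirements file above. It is worth running because the training log in Step 4 reports `_n_gpu=0`, which suggests that run did not see a GPU.

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def training_device():
    """Report whether PyTorch would train on a GPU ('cuda') or on the CPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check also works before installation
    return "cuda" if torch.cuda.is_available() else "cpu"


# Package names taken from requirements.txt above
for name in ("transformers", "datasets", "sacrebleu", "torch", "sentencepiece"):
    print(name, installed_version(name))
print("device:", training_device())
```

If the device is reported as `cpu`, training should still work, but it will likely take considerably longer.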

### Step 1. Explore the available datasets on Translators without Borders
Then, download a pair you would like to use for training a language translation model. The steps below download the translation pairs for English to Spanish, but you are welcome to modify these and use a different pair if you prefer.

Overall site page: https://tico-19.github.io/

Page with all language pairs: https://tico-19.github.io/memories.html 

Scroll through all supported language pairs and pick your favorite. We'll demonstrate English to Spanish, `en-to-es`.

Copy the link to that pair. For `en-to-es` it looks like this:
- https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip 

In [7]:
path_to_my_data = 'https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip'

In [8]:
!wget {path_to_my_data}

--2022-07-01 05:16:05--  https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip
Resolving tico-19.github.io (tico-19.github.io)... 185.199.110.153, 185.199.109.153, 185.199.111.153, ...
Connecting to tico-19.github.io (tico-19.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 381511 (373K) [application/zip]
Saving to: 'all.en-es-LA.tmx.zip'


2022-07-01 05:16:05 (9.49 MB/s) - 'all.en-es-LA.tmx.zip' saved [381511/381511]



In [9]:
local_file = path_to_my_data.split('/')[-1]
print (local_file)
filename = local_file.split('.zip')[0]
print (filename)

all.en-es-LA.tmx.zip
all.en-es-LA.tmx


In [10]:
!unzip {local_file}

Archive:  all.en-es-LA.tmx.zip
  inflating: all.en-es-LA.tmx        
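
If `wget` or `unzip` is not available in your environment, the same download-and-extract steps can be sketched with the Python standard library. The function names `zip_name_from_url` and `fetch_and_unzip` are hypothetical helpers introduced for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def zip_name_from_url(url):
    """Return the file name at the end of a URL path."""
    return Path(urlparse(url).path).name


def fetch_and_unzip(url, dest_dir="."):
    """Download a zip archive (unless it is already present) and extract it.

    Returns the list of extracted file paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / zip_name_from_url(url)
    if not zip_path.exists():
        urlretrieve(url, zip_path)  # network access happens only here
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [str(dest / name) for name in archive.namelist()]


path_to_my_data = 'https://tico-19.github.io/data/TM/all.en-es-LA.tmx.zip'
print(zip_name_from_url(path_to_my_data))  # all.en-es-LA.tmx.zip

# Uncomment to actually download and extract into the current directory:
# extracted = fetch_and_unzip(path_to_my_data)
```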


### Step 2. Extract data from the `.tmx` file
Next, you can use this local function to extract sentence pairs from the `.tmx` file and format them for local training with Hugging Face.

In [11]:
# paste the name of your file and language codes here
source_code_1 = 'en'
target_code_2 =  'es'

In [12]:
def parse_tmx(filename, source_code_1, target_code_2):
    '''
    Takes a local TMX filename and codes for source and target languages. 
    Walks through your file, row by row, looking for tmx / html specific formatting.
    If a row starts and ends with the expected tags, cleans the string and adds it to a dictionary for downstream pandas formatting.
    '''
    
    data = {source_code_1:[], target_code_2:[]}

    # read as UTF-8 so accented characters in the target language are preserved
    with open(filename, encoding='utf-8') as f:

        for row in f:

            if not row.endswith('</seg></tuv>\n'):
                continue

            if row.startswith('<seg>'):

                st_1 = row.strip()

                st_1 = st_1.replace('<seg>', '')
                st_1 = st_1.replace('</seg></tuv>', '')

                data[source_code_1].append(st_1)

            # when you use your own target code, remove the -LA string 
            if row.startswith('<tuv xml:lang="{}-LA"><seg>'.format(target_code_2)):

                st_2 = row.strip()
                # when you use your own target code, remove the -LA string 
                st_2 = st_2.replace('<tuv xml:lang="{}-LA"><seg>'.format(target_code_2), '')
                st_2 = st_2.replace('</seg></tuv>', '')

                data[target_code_2].append(st_2)
                
        return data

data = parse_tmx(filename, source_code_1, target_code_2)
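
The line-based parser above depends on the exact line layout of the file and on the `-LA` suffix. As an alternative, here is a minimal sketch that reads the same information with the standard library's XML parser. It assumes the file is well-formed XML with un-namespaced `tu`, `tuv`, and `seg` elements, which is the usual TMX layout but has not been verified against this particular file. The function name `parse_tmx_xml` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tmx_xml(source, source_code, target_code):
    """Collect (source, target) segment pairs from a TMX file path or file object.

    Language variants such as 'es-LA' are matched on their base code ('es').
    """
    data = {source_code: [], target_code: []}
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG, "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # keep a translation unit only when both sides are present
        if source_code in segments and target_code in segments:
            data[source_code].append(segments[source_code])
            data[target_code].append(segments[target_code])
        elem.clear()  # free memory for large files
    return data


# Tiny in-memory sample that mimics the assumed TMX structure
sample = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>and along with a fever</seg></tuv>
<tuv xml:lang="es-LA"><seg>y también fiebre</seg></tuv></tu>
</body></tmx>"""

pairs = parse_tmx_xml(io.BytesIO(sample.encode("utf-8")), "en", "es")
print(pairs)
```

Because both sides are appended together, the two lists should always have the same length, and XML entities such as `&amp;` are decoded by the parser.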

In [13]:
# this makes sure you got actual pairs
assert len(data[source_code_1]) == len(data[target_code_2])

In [14]:
import pandas as pd

df = pd.DataFrame.from_dict(data, orient = 'columns')

df.head()

| | en | es |
|---|---|---|
| 0 | about how long have these symptoms been going on? | ¿cuánto hace más o menos que tiene estos sínto... |
| 1 | and all chest pain should be treated this way ... | y siempre el dolor de pecho debe tratarse de e... |
| 2 | and along with a fever | y también fiebre |
| 3 | and also needs to be checked your cholesterol ... | y también debe controlarse su colesterol y pre... |
| 4 | and are you having a fever now? | ¿y tiene fiebre ahora? |


In [15]:
# write to disk in case you need to restart your kernel later
df.to_csv('language_pairs.csv', index=False, header=True)

### Step 3. Format extracted data for machine translation with Hugging Face
Core examples available right here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/translation 

Guidance on formatting for Hugging Face datasets here:
https://huggingface.co/docs/datasets/loading_datasets.html#json-files 

In [16]:
import pandas as pd

df = pd.read_csv('language_pairs.csv')
df.head()

| | en | es |
|---|---|---|
| 0 | about how long have these symptoms been going on? | ¿cuánto hace más o menos que tiene estos sínto... |
| 1 | and all chest pain should be treated this way ... | y siempre el dolor de pecho debe tratarse de e... |
| 2 | and along with a fever | y también fiebre |
| 3 | and also needs to be checked your cholesterol ... | y también debe controlarse su colesterol y pre... |
| 4 | and are you having a fever now? | ¿y tiene fiebre ahora? |


The translation script accepts custom data only as JSON Lines files: each line is a dictionary with a single key "translation", whose value is another dictionary keyed by language code. For example:

`{ "translation": { "en": "Others have dismissed him as a joke.", "ro": "Alții l-au numit o glumă." } }
{ "translation": { "en": "And some are holding out for an implosion.", "ro": "Iar alții așteaptă implozia." } }`
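
To make the expected shape concrete, the sketch below checks a single JSON Lines record against the format described above. The helper name `check_translation_line` is hypothetical; it only illustrates the structure and is not part of the Hugging Face script.

```python
import json


def check_translation_line(line, source_code, target_code):
    """Parse one JSON Lines record and raise ValueError if its shape is wrong."""
    record = json.loads(line)  # invalid JSON raises a ValueError subclass
    pair = record.get("translation") if isinstance(record, dict) else None
    if not isinstance(pair, dict):
        raise ValueError('missing "translation" dictionary')
    for code in (source_code, target_code):
        text = pair.get(code)
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"missing or empty text for language {code!r}")
    return record


# ensure_ascii=False keeps accented characters readable in the file
line = json.dumps(
    {"translation": {"en": "and along with a fever", "es": "y también fiebre"}},
    ensure_ascii=False,
)
record = check_translation_line(line, "en", "es")
print(record["translation"]["es"])
```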


In [17]:
objs = []

for idx, row in df.iterrows():
    
    obj = {"translation": {source_code_1: row[source_code_1], target_code_2: row[target_code_2]}} 
    objs.append(obj)

In [18]:
objs[:5]

[{'translation': {'en': 'about how long have these symptoms been going on?',
   'es': '¿cuánto hace más o menos que tiene estos síntomas?'}},
 {'translation': {'en': 'and all chest pain should be treated this way especially with your age',
   'es': 'y siempre el dolor de pecho debe tratarse de esta manera, en especial a su edad'}},
 {'translation': {'en': 'and along with a fever', 'es': 'y también fiebre'}},
 {'translation': {'en': 'and also needs to be checked your cholesterol blood pressure',
   'es': 'y también debe controlarse su colesterol y presión arterial'}},
 {'translation': {'en': 'and are you having a fever now?',
   'es': '¿y tiene fiebre ahora?'}}]

In [19]:
import json 
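# -p avoids an error if the directory already exists
!mkdir -p data
# write as UTF-8 because ensure_ascii=False keeps accented characters as-is
with open('data/train.json', 'w', encoding='utf-8') as f: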
    for row in objs:
        j = json.dumps(row, ensure_ascii = False)
        f.write(j)
        f.write('\n')
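
The training command in Step 4 leaves `--do_eval` and `--validation_file` commented out, so every pair is used for training. If you want to hold out some pairs for evaluation, one option is a seeded random split before writing the files. This is a minimal sketch: the function names, the 10% fraction, the seed, and the file name `data/valid.json` are all assumptions chosen for illustration.

```python
import json
import random


def split_records(records, valid_fraction=0.1, seed=42):
    """Shuffle a copy of the records and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII characters as-is."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in records; in the notebook you would pass `objs` instead
demo = [{"translation": {"en": f"sentence {i}", "es": f"oración {i}"}} for i in range(20)]
train_records, valid_records = split_records(demo)
print(len(train_records), len(valid_records))

# Hypothetical file names:
# write_jsonl(train_records, 'data/train.json')
# write_jsonl(valid_records, 'data/valid.json')
```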

### Step 4. Finetune a machine translation model locally
To do this, let's first download the raw Python file we need from Hugging Face to finetune our model.

In [20]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/translation/run_translation.py

--2022-07-01 05:16:07--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/pytorch/translation/run_translation.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28254 (28K) [text/plain]
Saving to: 'run_translation.py'


2022-07-01 05:16:07 (9.90 MB/s) - 'run_translation.py' saved [28254/28254]



In [21]:
# full Hugging Face Trainer API args available here
# https://github.com/huggingface/transformers/blob/de635af3f1ef740aa32f53a91473269c6435e19e/src/transformers/training_args.py
# T5 config options available here
# https://huggingface.co/transformers/model_doc/t5.html#t5config
!python run_translation.py \
    --model_name_or_path t5-small \
    --do_train \
    --source_lang en \
    --target_lang es \
    --source_prefix "translate English to Spanish: " \
    --train_file data/train.json \
    --output_dir output/tst-translation \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate \
    --save_strategy epoch \
    --num_train_epochs 3
#     --do_eval \
#     --validation_file path_to_jsonlines_file \
#     --dataset_name cov-19 \
#     --dataset_config_name en-es \


07/01/2022 05:16:09 - INFO - __main__ - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
full_determinism=False,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_ski

In [22]:
!ls output/tst-translation

README.md	  config.json		   tokenizer_config.json
all_results.json  pytorch_model.bin	   train_results.json
checkpoint-1536   special_tokens_map.json  trainer_state.json
checkpoint-2304   spiece.model		   training_args.bin
checkpoint-768	  tokenizer.json
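
The `checkpoint-768`, `checkpoint-1536`, and `checkpoint-2304` directories line up with the end of each of the 3 epochs, since `--save_strategy epoch` saves once per epoch. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation (the log above shows `gradient_accumulation_steps=1`). The dataset size of 3071 is an assumed value chosen to be consistent with the checkpoint names; check `len(df)` for the real number.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


num_examples = 3071  # assumption for illustration, not read from the data
batch_size = 4       # matches --per_device_train_batch_size=4
epochs = 3           # matches --num_train_epochs 3

per_epoch = steps_per_epoch(num_examples, batch_size)
checkpoints = [per_epoch * epoch for epoch in range(1, epochs + 1)]
print(checkpoints)
```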


### Step 5. Test your newly fine-tuned translation model

In [23]:
# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the matching class for T5
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")

model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_model_name_or_path = 'output/tst-translation')

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [24]:
# switch the model to evaluation mode (disables dropout) before running inference
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32100, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32100, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

Next, let's test it! Remember that with the default settings of only 3 epochs, your translations are probably not going to be state of the art (SOTA). To get closer to SOTA, we recommend migrating to Amazon SageMaker to scale up and out. Scaling up means moving your code to a more powerful instance type, such as the p4 series or even Trainium. Scaling out means adding more instances, going from one to many. With more compute you can train for much longer on much larger datasets, which can directly translate to a more accurate model.

In [25]:
input_sequences = ['about how long have these symptoms been going on?',	
'and all chest pain should be treated this way especially with your age	',
'and along with a fever	',
'and also needs to be checked your cholesterol blood pressure',	
'and are you having a fever now?	',
'and are you having any of the following symptoms with your chest pain',	
'and are you having a runny nose?',	
'and are you having this chest pain now?',
'and besides do you have difficulty breathing',
'and can you tell me what other symptoms are you having along with this?',
'and does this pain move from your chest?',
'and drink lots of fluids',
'and how high has your fever been',
'and i have a cough too',
'and i have a little cold and a cough',
'''and i'm really having some bad chest pain today''']

task_prefix = "translate English to Spanish: "

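for i in input_sequences:
    # task_prefix already ends with a space, so plain concatenation is enough
    input_ids = tokenizer(task_prefix + i, return_tensors='pt').input_ids
    # generate() runs with its default length limit here, so longer translations may be cut off;
    # pass max_new_tokens to allow longer outputs
    outputs = model.generate(input_ids)
    print(i, tokenizer.decode(outputs[0], skip_special_tokens=True))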


about how long have these symptoms been going on? en el trabajo de sntomas?
and all chest pain should be treated this way especially with your age	 y todos los dolores de la población del 
and along with a fever	 y a las fièvres
and also needs to be checked your cholesterol blood pressure y es necesario a verificar la pression sanguina del
and are you having a fever now?	 y tu ayuda una fièvre?
and are you having any of the following symptoms with your chest pain y tiene el sntomas cio
and are you having a runny nose? y tu ayuda un nez agua?
and are you having this chest pain now? y tiene el dolor de la población en e
and besides do you have difficulty breathing y ahora ahora a las dificultas
and can you tell me what other symptoms are you having along with this? y tu pueden me dire quels sntomas
and does this pain move from your chest? y se movió el dolor de ta pobla?
and drink lots of fluids y boire lots de fluides
and how high has your fever been y como el trabajo de fièvre
and i ha

In [26]:
model.save_pretrained('my-tf-en-to-sp')

In [27]:
!tar -czf my_model.tar.gz my-tf-en-to-sp

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
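
The warning above appears because `!tar` forks a new process after the tokenizer has already used parallelism. One way to sidestep the fork is to build the archive in-process with the standard library's `tarfile` module. This is a minimal sketch: the function name `make_tarball` is hypothetical, and the demo runs on a temporary stand-in directory rather than the real model folder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(model_dir, archive_path):
    """Create a gzip-compressed tar archive of model_dir without spawning a shell."""
    model_dir = Path(model_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the directory name inside the archive, like `tar -czf`
        tar.add(str(model_dir), arcname=model_dir.name)
    return str(archive_path)


# Demo on a throwaway directory that stands in for the saved model folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "my-tf-en-to-sp"
    demo_dir.mkdir()
    (demo_dir / "config.json").write_text("{}")
    archive = make_tarball(demo_dir, Path(tmp) / "my_model.tar.gz")
    with tarfile.open(archive) as tar:
        archived_names = sorted(tar.getnames())
print(archived_names)

# In the notebook, the equivalent call would be:
# make_tarball('my-tf-en-to-sp', 'my_model.tar.gz')
```

Alternatively, the warning itself suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly before the tokenizer is first used.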
