Prepare data for NLLB finetuning - what is the format of data files? #18

edchengg · 2022-12-20T17:25:08Z

Hello, @kauterry! Thanks a lot for the detailed answers on other issues.
I would like to prepare data for NLLB finetuning using the pre-trained SPM model.
My question is about the data files described in the README: what format should they be in?
For example, I am assuming each file has one sentence per line?

$ tree $DATA_PATH
my_corpora
├── arb_Arab-eng_Latn
│   ├── mycorpus.arb_Arab.gz
│   └── mycorpus.eng_Latn.gz
└── eng_Latn-lij_Latn
    ├── nllbseed.eng_Latn.gz
    ├── nllbseed.lij_Latn.gz
    ├── tatoeba.eng_Latn.gz
    └── tatoeba.lij_Latn.gz


kauterry · 2022-12-20T18:46:38Z

Yes, that is correct. Put each source (src) sentence in the src file and its translation (tgt) in the tgt file, one sentence per line.
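
For anyone preparing files this way, here is a minimal Python sketch of the format described above. It is not taken from the NLLB tooling. The helper names (`write_parallel`, `count_lines`, `is_aligned`) are hypothetical, and the sentence pairs are placeholders. It assumes that line N of the src file corresponds to line N of the tgt file, and that the directory and file names follow the `tree` listing in the question. The thread does not say how the pipeline orders the two language codes in the directory name, so check that against the README.

```python
import gzip
import tempfile
from pathlib import Path


def write_parallel(root, corpus, src_lang, tgt_lang, pairs):
    """Write (src, tgt) sentence pairs as two gzip files, one sentence per line."""
    # Directory and file naming mirror the tree listing (an assumption, see above).
    pair_dir = Path(root) / f"{src_lang}-{tgt_lang}"
    pair_dir.mkdir(parents=True, exist_ok=True)
    src_path = pair_dir / f"{corpus}.{src_lang}.gz"
    tgt_path = pair_dir / f"{corpus}.{tgt_lang}.gz"
    with gzip.open(src_path, "wt", encoding="utf-8") as fs, \
         gzip.open(tgt_path, "wt", encoding="utf-8") as ft:
        for src, tgt in pairs:
            # Collapse internal whitespace (including newlines) so that each
            # sentence occupies exactly one line and the files stay aligned.
            fs.write(" ".join(src.split()) + "\n")
            ft.write(" ".join(tgt.split()) + "\n")
    return src_path, tgt_path


def count_lines(path):
    """Count lines in a gzip-compressed UTF-8 text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


def is_aligned(src_path, tgt_path):
    """Return True when both files contain the same number of lines."""
    return count_lines(src_path) == count_lines(tgt_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder strings, not real translations.
        pairs = [("source sentence 1", "target sentence 1"),
                 ("source sentence 2", "target sentence 2")]
        src, tgt = write_parallel(tmp, "mycorpus", "arb_Arab", "eng_Latn", pairs)
        print(src.name, tgt.name, is_aligned(src, tgt))
```

A matching line count is only a necessary check. It does not prove that the sentences on each line are actually translations of each other.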

edchengg · 2022-12-20T19:40:43Z

Thanks!

edchengg closed this as completed Dec 20, 2022
