cdli-gh/Semi-Supervised-NMT-for-Sumerian-English

Figure-1: Shows a Cuneiform inscription, extracted from actual tablets.
Sumerian: pisan-dub-ba sza3-bi su-ga sag-nig2-gur11-ra u3 zi-ga lu2-kal-la i3-gal2 ...
English: Basket-of-tablets: therefroms, restitutions, debits, and credits, of Lukalla are here; ...

Sumerian-English Neural Machine Translation

As a part of the MTAAC project at CDLI, we aim to build an end-to-end NMT Pipeline while making use of the extensive monolingual Sumerian Data.

Previous models for English<-->Sumerian translation have used only the available parallel corpora. At present the parallel corpus contains only about 50K extracted sentences for both languages, whereas the Sumerian monolingual corpus contains around 1.47M sentences.

This large monolingual corpus can be used to improve the NMT system through techniques such as Back Translation, Transfer Learning and Dual Learning, which have proved especially useful for low-resource languages like Sumerian that have only a limited amount of parallel data. We also plan to implement models such as XLM and MASS for the same purpose.

Requirements

- Python 3.5.2 or higher
- NumPy
- Pandas
- PyTorch
- Torch Text
- OpenNMT-py
- fairseq

Repository Structure

|__ translation/ --> all translation models used for Sumerian-English Translation 
        |__ transformer/ --> Supervised NMT using Vanilla Transformer
                |__ runTransformerSumEn.sh --> to perform training
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ backtranslation/ --> fairseq usage for Back Translation using Vanilla Transformers
        |__ backtranslation-onmt/ --> OpenNMT usage for Back Translation using Vanilla Transformers
                |__ backtranslateONMT.py --> to translate all Sumerian Text in a given shard using weights from the previous iteration
                |__ stack.py --> To stack the backtranslated sentences to the parallel corpora for training
                |__ runTransformerSumEn.sh --> To retrain the transformer model using the updated parallel data from the last step
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ XLM/ --> Unsupervised and Semi-Supervised NMT using Cross-Lingual Language Model Pretraining
                |__ XLM/ --> directory containing all model, data preparation and inference scripts
                |__ models.txt --> lists the possible commands and parameter combinations for XLM training and inference.
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-unmt/ --> Unsupervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-snmt/ --> Semi-Supervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.

|__ dataset/ --> All Sumerian language textual datasets from CDLI
        |__ README.md --> Gives a detailed description of the dataset and the different sub-folders.
        |__ dataToUse/ --> Contains all the parallel data divided among training, test and dev sets, in 4 different categories
                |__ UrIIICompSents/ --> UrIII Admin Data with complete sentence translations
                |__ AllCompSents/ --> All kinds of Sumerian Data with complete sentence translations
                |__ UrIIILineByLine/ --> UrIII Admin Data with line by line translations
                |__ AllLineByLIne/ --> All kinds of Sumerian Data with line by line translations
        |__ cleaned/ --> Contains data after cleaning using the helper scripts, including the monolingual data. Divided in the same 4 categories.
        |__ original/ --> Contains all of the data before cleaning
        |__ oldFormat/ --> Contains data from last year, for comparison
        

Refer to the README of each folder and sub-folder to understand its contents thoroughly and to reproduce the translation models.
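The three scripts listed under backtranslation-onmt/ describe a loop: translate a shard of monolingual Sumerian text with the weights from the previous iteration, stack the result onto the parallel corpus, then retrain. The sketch below illustrates only that control flow. It is not the code in backtranslateONMT.py or stack.py; the names `shard`, `back_translation_rounds` and `toy_train` are hypothetical, and `toy_train` is a word-for-word lexicon standing in for an actual OpenNMT training run.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (Sumerian sentence, English sentence)
Translator = Callable[[str], str]


def shard(sentences: List[str], size: int) -> List[List[str]]:
    """Split the monolingual corpus into consecutive shards of at most `size` sentences."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def back_translation_rounds(
    parallel: List[Pair],
    monolingual: List[str],
    train: Callable[[List[Pair]], Translator],
    shard_size: int,
) -> List[Pair]:
    """Process one shard per round: translate, stack, retrain."""
    data = list(parallel)
    translate = train(data)  # initial model sees only the genuine parallel data
    for chunk in shard(monolingual, shard_size):
        # Translate the shard with the model from the previous iteration.
        synthetic = [(src, translate(src)) for src in chunk]
        # Stack the synthetic pairs onto the training corpus.
        data.extend(synthetic)
        # Retrain on the enlarged corpus before the next shard.
        translate = train(data)
    return data


def toy_train(pairs: List[Pair]) -> Translator:
    """Hypothetical stand-in for training: build a word-for-word lexicon."""
    lexicon = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            lexicon.setdefault(s, t)
    return lambda sentence: " ".join(lexicon.get(w, "<unk>") for w in sentence.split())


if __name__ == "__main__":
    corpus = back_translation_rounds(
        parallel=[("lugal e2", "king house")],
        monolingual=["e2 lugal", "lugal", "e2"],
        train=toy_train,
        shard_size=2,
    )
    for pair in corpus:
        print(pair)
```

In a real run, `train` would launch the training script and return a wrapper around the resulting checkpoint, and the shard size would be chosen so that synthetic pairs do not swamp the genuine parallel data too early.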

Results

Table-1: Sumerian-English Machine Translation.
All numeric values other than those in Human Evaluation represent the BLEU Score.
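For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence, whitespace tokenization and no smoothing. It is an illustration of the formula only, not the scoring tool used to produce Table-1; scores from different tokenizers or BLEU implementations are not directly comparable. The function names `ngrams` and `corpus_bleu` are chosen here for illustration.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    ref = "barley rations of the female weavers"
    print(corpus_bleu([ref], [ref]))
    print(corpus_bleu(["barley rations of the female"], [ref]))
```

The second call shows the brevity penalty at work: every n-gram in the shorter hypothesis matches the reference, yet the score drops below 100 because the hypothesis has five tokens against six.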

Visualisations and Interpretations

Figure-2: Selected output tokens for the Sumerian input text "sze-ba geme2 usz-bar kiszib3 ur-dasznan ugula", which translates to "barley rations of the female weavers under seal of UrAnan the foreman"

Figure-3: Feature Ablation and Attention Attributions, respectively, for a span of input and output text through the Data Augmented XLM

Mentors:

  1. Niko Schenk
  2. Ravneet Punia

Tasks:

  • Preparing the parallel and monolingual texts for final use, tokenizing them with methods such as BPE and BBPE (byte-level BPE)
  • Implementing the Vanilla Transformer for Sumerian to English as well as English to Sumerian
  • Back Translation using Sumerian Monolingual data
  • Transfer Learning from pre-trained models of other languages
  • XLM for Unsupervised NMT
  • XLM for Semi-Supervised NMT
  • MASS for Unsupervised NMT
  • MASS for Semi-Supervised NMT
  • Pre-training using Augmented Data
  • Interpretation of the NMT Models
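The first task mentions BPE tokenization. As a rough illustration of what BPE does, the sketch below learns merge operations from a word list and applies them to new words. It is a minimal, self-contained version of the general algorithm and is not the tokenizer used in this repository; the function names `learn_bpe`, `merge_pair` and `apply_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple

Symbols = Tuple[str, ...]


def merge_pair(symbols: Symbols, pair: Tuple[str, str]) -> Symbols:
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe("sze-ba geme2 usz-bar kiszib3 ur-dasznan ugula".split(), 20)
    print(apply_bpe("sze-ba", merges))
    print(apply_bpe("geme2-ba", merges))
```

Byte-level BPE (BBPE) follows the same procedure but starts from the bytes of the encoded text rather than from characters, so no input symbol is ever out of vocabulary.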

...

For an end-to-end translation pipeline making use of translation models from this repository, refer to the cdli-gh/Sumerian_Translation-Pipeline project, where you can give an ATF file containing Sumerian sentences as input and get an ATF file with corresponding English translations as the output.
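To make the ATF-in, ATF-out idea concrete, here is a minimal sketch that walks through ATF text and adds an English translation line after each numbered transliteration line. It is not the code of the Sumerian_Translation-Pipeline project. It assumes the common CDLI conventions that text lines begin with a line number followed by a period and that English translations are carried on `#tr.en:` lines; the `annotate_atf` function, the `translate` callable and the sample tablet header are hypothetical.

```python
import re
from typing import Callable, List

# A transliteration line such as "1. pisan-dub-ba" or "2'. sza3-bi su-ga".
TEXT_LINE = re.compile(r"^(\d+'?)\.\s+(.*)$")


def annotate_atf(atf_text: str, translate: Callable[[str], str]) -> str:
    """Copy the ATF text, adding a '#tr.en:' line after every transliteration line."""
    out: List[str] = []
    for line in atf_text.splitlines():
        out.append(line)  # headers, structure markers and comments pass through unchanged
        match = TEXT_LINE.match(line)
        if match:
            out.append("#tr.en: " + translate(match.group(2)))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "&P000000 = hypothetical tablet",
        "#atf: lang sux",
        "@obverse",
        "1. pisan-dub-ba",
        "2. sza3-bi su-ga",
    ])
    # str.upper is a placeholder for a call into a trained translation model.
    print(annotate_atf(sample, translate=str.upper))
```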