Intermediate-Task-Code-Switching

This is the repo for the experiments mentioned in the paper

Datasets

The datasets required for the experiments can be found at the following link: Data. The download contains the datasets listed below, in the structure shown. Place the final_Data folder beside the final_Code folder and train_final.sh.

final_Data
  ├── MLM
  │   ├── generalCS
  │   ├── generalCS_movieCS_nlipremise
  │   ├── generalCS_movieCS_qactxt
  │   ├── movieCS_generalCS
  │   ├── movieCS_nlipremise
  │   └── movieCS_qactxt
  ├── QA_EN_HI
  │   ├── GLUECoS (filename: train-v2.0.json & dev-v2.0.json)
  │   ├── SQuAD (English, Romanised Hindi & Bilingual)
  │   ├── XQuAD Test set in English
  │   ├── XQuAD Machine Translated  & Transliterated to Romanised Hindi using Google API
  │   ├── XQuAD Transliterated to Romanised Hindi using Google API
  │   └── XQuAD Transliterated to Romanised Hindi using IndicTrans
  └── NLI_EN_HI
      ├── english_MNLI
      ├── gluecos
      ├── romanised_hindi_MNLI
      ├── XNLI_eng
      ├── XNLI_GoogMT
      ├── XNLI_GoogTranslit
      └── XNLI_IndicTranslit
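Before launching training, it can help to confirm that the download was unpacked into the layout above. The following is a minimal sketch and not part of the repository: the helper name `missing_entries` is hypothetical, and the check covers only those entries of the tree that read as literal names. The QA_EN_HI entries are descriptions rather than names, so only the top-level QA_EN_HI folder is checked.

```python
from pathlib import Path

# Names copied from the tree above. The entries under QA_EN_HI are
# descriptions rather than literal names, so only the top-level folder
# is listed for that task.
EXPECTED_ENTRIES = [
    "MLM/generalCS",
    "MLM/generalCS_movieCS_nlipremise",
    "MLM/generalCS_movieCS_qactxt",
    "MLM/movieCS_generalCS",
    "MLM/movieCS_nlipremise",
    "MLM/movieCS_qactxt",
    "QA_EN_HI",
    "NLI_EN_HI/english_MNLI",
    "NLI_EN_HI/gluecos",
    "NLI_EN_HI/romanised_hindi_MNLI",
    "NLI_EN_HI/XNLI_eng",
    "NLI_EN_HI/XNLI_GoogMT",
    "NLI_EN_HI/XNLI_GoogTranslit",
    "NLI_EN_HI/XNLI_IndicTranslit",
]


def missing_entries(root="final_Data"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing from final_Data:")
        for name in missing:
            print("  " + name)
    else:
        print("final_Data layout looks complete.")
```

Run it from the directory that holds final_Data, final_Code and train_final.sh.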

Training models

The code provides methods for intermediate-task training on different datasets, as well as for fine-tuning on the GLUECoS benchmarks.

The following intermediate-task training routines are available:

QA
- on monolingual SQuAD (English, Romanised Hindi)
- on bilingual SQuAD
- interspersed MLM and bilingual SQuAD
- on machine translated (Google API) XQuAD Test-set (bilingual version)
- on XQuAD Test-set transliterated using IndicTrans (bilingual version)
- on XQuAD Test-set transliterated using Google API (bilingual version)
NLI
- on monolingual MNLI (English, Romanised Hindi)
- on bilingual MNLI
- interspersed MLM and bilingual MNLI
- on machine translated (Google API) XNLI Test-set (bilingual version)
- on XNLI Test-set transliterated using IndicTrans (bilingual version)
- on XNLI Test-set transliterated using Google API (bilingual version)
Interspersed QA and NLI

Methods for fine-tuning the model on the GLUECoS NLI and QA benchmarks are also available.

Training requirements

The requirements for running the code are listed in the file requirements.txt. They can be installed via pip install -r requirements.txt

Training

The command below runs both intermediate-task training and fine-tuning experiments. The training scripts use the Hugging Face library and support models based on BERT, XLM, XLM-RoBERTa and similar architectures.

bash train_final.sh MODEL MODEL_TYPE TASK

Example Usage

bash train_final.sh bert-base-multilingual-cased bert GLUECOS_QA_EN_HI
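To queue several experiments, the same command can be assembled programmatically. This is a minimal sketch, not part of the repository: `build_command` and `run_tasks` are hypothetical helper names, and whether consecutive runs can safely share one working directory (for example, without overwriting each other's outputs) is an assumption to verify against train_final.sh. By default the sketch only prints the commands.

```python
import shlex
import subprocess


def build_command(model, model_type, task, script="train_final.sh"):
    """Build the argument list for: bash train_final.sh MODEL MODEL_TYPE TASK."""
    return ["bash", script, model, model_type, task]


def run_tasks(model, model_type, tasks, dry_run=True):
    """Handle the tasks in order; with dry_run=True only print the commands."""
    commands = [build_command(model, model_type, task) for task in tasks]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the queue at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_tasks(
        "bert-base-multilingual-cased",
        "bert",
        ["GLUECOS_QA_EN_HI", "GLUECOS_NLI_EN_HI"],
    )
```

Pass `dry_run=False` to execute the commands instead of printing them.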

The available tasks are:

GLUECOS_QA_EN_HI
engSQUAD_QA_EN_HI
roman_hinSQUAD_QA_EN_HI
bilingualSQUAD_QA_EN_HI
dual_MLM1_bilSQUAD_EN_HI
dual_MLM2_bilSQUAD_EN_HI
dual_MLM3_bilSQUAD_EN_HI
xquad_bilingual_MT_test_then_gluecos_QA_EN_HI
xquad_bilingual_GoogTranslit_test_then_gluecos_QA_EN_HI
xquad_bilingual_IndicTranslit_test_then_gluecos_QA_EN_HI
GLUECOS_NLI_EN_HI
engMNLI_NLI_EN_HI
roman_hin_MNLI_NLI_EN_HI
bilingual_MNLI_NLI_EN_HI
bilingXNLI_GoogMT_EN_HI
bilingXNLI_GoogTranslit_EN_HI
bilingXNLI_IndicTranslit_EN_HI
dual_MLM1_bilMNLI_EN_HI
dual_MLM2_bilMNLI_EN_HI
dual_MLM3_bilMNLI_EN_HI
dual_NLI_QA_EN_HI
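The task names mix upper and lower case and are easy to mistype, so a small check before starting a long run may save time. The sketch below is illustrative only: `task_family` is a hypothetical helper, the names are copied from the list above, and the assumption that the task argument must match a listed name exactly (including case) is not confirmed by this README.

```python
import difflib

# Task names copied verbatim from the list above.
TASKS = (
    "GLUECOS_QA_EN_HI",
    "engSQUAD_QA_EN_HI",
    "roman_hinSQUAD_QA_EN_HI",
    "bilingualSQUAD_QA_EN_HI",
    "dual_MLM1_bilSQUAD_EN_HI",
    "dual_MLM2_bilSQUAD_EN_HI",
    "dual_MLM3_bilSQUAD_EN_HI",
    "xquad_bilingual_MT_test_then_gluecos_QA_EN_HI",
    "xquad_bilingual_GoogTranslit_test_then_gluecos_QA_EN_HI",
    "xquad_bilingual_IndicTranslit_test_then_gluecos_QA_EN_HI",
    "GLUECOS_NLI_EN_HI",
    "engMNLI_NLI_EN_HI",
    "roman_hin_MNLI_NLI_EN_HI",
    "bilingual_MNLI_NLI_EN_HI",
    "bilingXNLI_GoogMT_EN_HI",
    "bilingXNLI_GoogTranslit_EN_HI",
    "bilingXNLI_IndicTranslit_EN_HI",
    "dual_MLM1_bilMNLI_EN_HI",
    "dual_MLM2_bilMNLI_EN_HI",
    "dual_MLM3_bilMNLI_EN_HI",
    "dual_NLI_QA_EN_HI",
)


def task_family(task):
    """Return 'QA', 'NLI' or 'QA+NLI' for a listed task; raise ValueError otherwise."""
    if task not in TASKS:
        hints = difflib.get_close_matches(task, TASKS, n=3)
        raise ValueError(f"Unknown task {task!r}. Close matches: {hints}")
    if task == "dual_NLI_QA_EN_HI":
        return "QA+NLI"
    # Every NLI task name in the list contains "NLI"; no QA task name does.
    return "NLI" if "NLI" in task else "QA"
```

Calling `task_family` on a name before passing it to train_final.sh surfaces typos early and suggests the closest listed names.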

Note: MLM1 refers to GeneralCS, and MLM2 refers to GeneralCS+MovieCS. MLM3 refers to GeneralCS+MovieCS+QA contexts for QA intermediate-task training, and to GeneralCS+MovieCS+NLI Premise for NLI intermediate-task training.
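The naming rule in the note can be written as a small lookup. This is a sketch for illustration, not code from the repository: `mlm_corpus` is a hypothetical helper, and it returns the corpus description from the note rather than a folder under final_Data/MLM, because this README does not state how the MLM labels map to folder names.

```python
import re

# Matches the six interspersed MLM task names, e.g. dual_MLM1_bilSQUAD_EN_HI.
DUAL_MLM_TASK = re.compile(r"dual_MLM([123])_bil(SQUAD|MNLI)_EN_HI")

# MLM3 adds a task-specific corpus on top of GeneralCS+MovieCS.
MLM3_EXTRA = {"SQUAD": "QA contexts", "MNLI": "NLI Premise"}


def mlm_corpus(task):
    """Describe the MLM corpus used by an interspersed MLM task name."""
    match = DUAL_MLM_TASK.fullmatch(task)
    if match is None:
        raise ValueError(f"Not an interspersed MLM task: {task!r}")
    level, target = match.groups()
    if level == "1":
        return "GeneralCS"
    if level == "2":
        return "GeneralCS+MovieCS"
    return "GeneralCS+MovieCS+" + MLM3_EXTRA[target]
```

For instance, `mlm_corpus("dual_MLM3_bilSQUAD_EN_HI")` gives the QA variant of MLM3, while the same call with `bilMNLI` gives the NLI Premise variant.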

Evaluation

To obtain test-set results on the GLUECoS benchmarks, please refer to the GLUECoS repo.
