DialectBench

Get started

Clone the github repository.

git clone https://github.com/ffaisal93/DialectBench.io.git
cd DialectBench

Download Data

Download all data available [except mt and the ones loadable through huggingface]
```
bash download_data.sh --task all
```

Download data for Turkish dialectal machine translation

bash download_data.sh --task machine_translation_turkish

Package Installation

Dependency parsing: Install Adapter Package

bash install.sh --task install_adapter

Extractive Question Answering [SDQA]: Install Transformers 3.4.0

bash install.sh --task install_transformers_qa

Other Structured Prediction, QA and Classification tasks: Transformers 4.21.1

bash install.sh --task install_transformers

Task Specific Training and Evaluation

Dependency Parsing

Training

Finetune all available language-specific models on both pretrained mBERT and XLMR at once
```
./all_commands.sh --action train_udp --execute bash
```

Finetune one single available language-specific model

bash install.sh --task train_udp --lang UD_English-EWT --MODEL_NAME mbert

Prediction

Prediction on all finetuned model (for both pretrained mBERT and XLMR) and if no training data available for a specific language variety, do zeroshot from English variety "UD_English-EWT"
```
./all_commands.sh --action predict_udp --execute bash
```
Do zero-shot prediction from a specific language variety (e.g. UD_English-EWT) and on all available variety defined in --lang_config metadata/udp_metadata.json
```
bash install.sh --task predict_udp_zeroshot_all --lang UD_English-EWT --MODEL_NAME mbert
```
Do test data prediction on a single finetuned language variety (e.g. UD_English-EWT)
```
bash install.sh --task predict_udp_single --lang UD_English-EWT --MODEL_NAME mbert
```

Parts of Speech (POS) Tagging

Training

Finetune all available language-specific models on both pretrained mBERT and XLMR at once
```
./all_commands.sh --action train_pos --execute bash
```

Prediction

Prediction on all finetuned model (for both pretrained mBERT and XLMR) and if no training data available for a specific language variety, do zeroshot from English variety "UD_English-EWT"
```
./all_commands.sh --action predict_pos --execute bash
```

Named Entity Recognition (NER)

Training

Performing in-variety Finetuning on all available language varieties on both pretrained mBERT and XLMR at one go.
```
./all_commands.sh --action train_pos --execute bash
```
or, If you want to performing in-variety finetuning for a single language only, try the following:
```
bash install.sh --task train_ner --lang bokmaal --MODEL_NAME bert --dataset wikiann
```
We have two datasets supported in DialectBench at this point. wikiann and norwegian_ner.
- wikiann: language varieties ("ar" "az" "ku" "tr" "hsb" "nl" "fr" "zh" "en" "mhr" "it" "de" "pa" "es" "hr" "lv" "hi" "ro" "el" "bn"). Use --dataset wikiann to finetune varieties from this dataset.
- norwegian_ner: language varieties ("bokmaal" "nynorsk" "samnorsk"). Use --dataset scripts/ner/norwegian_ner.py to finetune varieties from this dataset.

Prediction

Prediction using all in-variety finetuned models (for both pretrained mBERT and XLMR) as well as performing zeroshot prediction using English variety en on the varieties available in --lang_config metadata/metadata/ner_metadata.json at one go.
```
./all_commands.sh --action predict_ner --execute bash
```

Topic Classification (TC)

Training

Performing In-cluster finetuning (on both pretrained mbert and xlm-r) on selected varieties from different language cluster.

./all_commands.sh --action train_topic_classification_lm --execute bash

Add or remove specific variety for finetuning from SIB-200 dataset here in command-bash.sh file.

    if [[ "$task" = "train_topic_classification_lm" || "$task" = "predict_topic_classification_lm" ]]; then

      export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" 
        "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn")

Prediction

Performing inference on all available varieties across different language clusters (as defined in --lang_config metadata/topic_metadata.json) and on top of different pretrained models (mbert, xlmr)

./all_commands.sh --action predict_topic_classification_lm --execute bash

Natural language inference (NLI)

Training

Performing zero-shot finetuning from English (on top of both pretrained mbert and xlm-r) on selected varieties from different language cluster.

./all_commands.sh --action train_nli --execute bash

Add or remove specific variety for finetuning from translate-test dialect_nli dataset here in command-bash.sh file.

  if [[ "$task" = "train_nli" || "$task" = "predict_nli" ]]; then

    # export ALL_LANGS=("eng_Latn" "ita_Latn" "azj_Latn" "ckb_Arab" "nob_Latn" "nld_Latn" "lvs_Latn" "arb_Arab" "lij_Latn" "zho_Hans" "spa_Latn" "nso_Latn" "ben_Beng")
    export ALL_LANGS=("eng_Latn")
    for lang in "${ALL_LANGS[@]}"; do
      echo ${base_model}
      echo ${lang}
      echo ${dataset}
      bash install.sh --task ${task} --lang ${lang} --MODEL_NAME ${base_model}
    done

  fi

dialect_nli dataset loading script: --dataset_script scripts/nli/dialect_nli.py

Prediction

Performing inference on all available varieties across different language clusters (as defined in --lang_config metadata/nli_metadata.json) and on top of different pretrained models (mbert, xlmr).

./all_commands.sh --action predict_nli --execute bash

Sentiment Analysis (SA)

Training

At this point, DialectBench only supports arabic dialectal sentiment analysis. To finetune variety-specific models:

./all_commands.sh --action train_sa --execute bash

Prediction

To evaluate each variety-specific model at one go:

./all_commands.sh --action predict_sa --execute bash

Add or remove specific variety for finetuning in command-bash.sh file.

  if [[ "$task" = "train_sa" || "$task" = "predict_sa" ]]; then

    export ALL_LANGS=("aeb_Arab" "aeb_Latn" "arb_arab" "ar-lb" "arq_arab" "ary_arab" "arz_arab" "jor_arab" "sau_arab")
    for lang in "${ALL_LANGS[@]}"; do
      echo ${base_model}
      echo ${lang}
      echo ${dataset}
      bash install.sh --task ${task} --lang ${lang} --lang2 arabic --MODEL_NAME ${base_model}
    done

  fi

Dialect Identification (DId)

Training

Finetune Arabic, English, Mandarin, Portuguese, Spanish and Swiss-Dialect identification models (mbert and xlmr based)

./all_commands.sh --action train_did --execute bash

Finetune a dialect identification model of a single language

export lang="arabic" #"arabic" english" "greek" "mandarin_simplified" "mandarin_traditional" "portuguese" "spanish" "swiss-dialects"
export base_model="mbert" #"mbert" "xlmr"
bash install.sh --task train_did --lang ${lang} --dataset ${dataset} --MODEL_NAME ${base_model}

Prediction

./all_commands.sh --action predict_did_lm --execute bash

Machine Reading Comprehension (MRC)

Training

./all_commands.sh --action train_reading_comprehension --execute bash

Prediction

./all_commands.sh --action predict_reading_comprehension --execute bash

Extractive Question Answering

Training

Finetune on all language at once as well as on singlae language and it's varieties.

./all_commands.sh --action train_sdqa --execute bash

Add or remove specific language cluster in this command-bash.sh block.

f [[ "$task" = "train_sdqa" || "$task" = "predict_sdqa" ]]; then

  export ALL_MODELS=("all" "arabic" "bengali" "english" "finnish" "indonesian" "korean" "russian" "swahili" "telugu")

  for MODEL_NAME in "${ALL_MODELS[@]}"; do
    echo ${base_model}
    echo ${MODEL_NAME}
    bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset dev
    bash install.sh --task ${task} --lang ${MODEL_NAME} --MODEL_NAME ${base_model} --dataset test
  done
fi

Prediction

./all_commands.sh --action predict_sdqa --execute bash

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
metadata		metadata
results		results
scripts		scripts
slurm		slurm
test/data		test/data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
all_commands.sh		all_commands.sh
command-bash.sh		command-bash.sh
command-slurm.sh		command-slurm.sh
command.sh		command.sh
download_data.sh		download_data.sh
explore_data.ipynb		explore_data.ipynb
install.sh		install.sh
requirements.txt		requirements.txt

License

ffaisal93/DialectBench

Folders and files

Latest commit

History

Repository files navigation

DialectBench

Get started

Download Data

Package Installation

Task Specific Training and Evaluation

Dependency Parsing

Training

Prediction

Parts of Speech (POS) Tagging

Training

Prediction

Named Entity Recognition (NER)

Training

Prediction

Topic Classification (TC)

Training

Prediction

Natural language inference (NLI)

Training

Prediction

Sentiment Analysis (SA)

Training

Prediction

Dialect Identification (DId)

Training

Prediction

Machine Reading Comprehension (MRC)

Training

Prediction

Extractive Question Answering

Training

Prediction

About

Resources

License

Stars

Watchers

Forks

Languages