Bilingual or Byelingual? Code mixing and romanization in Hindi Sentiment Analysis

Problem Definition

Code-Mixing: a communication phenomenon in which speakers embed words, phrases, or morphemes from one language into another.

Hindi-English Code-Mixing looks like:

Example: ye class bauhat fun hai

Translation: this class is a lot of fun

Romanization: the representation of a language text using the Roman alphabet.
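To make the definition concrete, here is a minimal sketch (not part of this repository) that counts the characters of a text by Unicode script. The function name `script_profile` is hypothetical. It illustrates why romanized Hindi is hard to separate from English by script alone: both are written entirely in Latin characters.

```python
import unicodedata


def script_profile(text: str) -> dict:
    """Count letters and combining marks by script: Devanagari, Latin, or other."""
    counts = {"devanagari": 0, "latin": 0, "other": 0}
    for ch in text:
        # Keep letters (L*) and combining marks (M*); skip spaces, digits, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            counts["devanagari"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # The code-mixed example from above contains only Latin characters.
    print(script_profile("ye class bauhat fun hai"))
```

A character-level check like this can tell Devanagari Hindi apart from romanized text, but it cannot tell which Latin-script words are Hindi and which are English.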

Do romanization and code-mixing trigger substantial changes in model performance?

Dataset

The following datasets were used -

UMSAB - Contains 14.7K tweets in 8 different languages. This dataset was used for finetuning our baseline models.
Sentimix-2020 - The finetuned models are evaluated on this romanized, code-switched Hindi-English Twitter data.
Synthetic data - Pure English translations, pure Hindi translations (in Devanagari script), and transliterated Hindi (in Latin script) were generated from the Sentimix-2020 data. GPT-3.5 Turbo was used for the translations and IndicXLIT for the transliterations.

Classes and labels -

The following label mapping was used consistently throughout the three datasets -

{
  'negative' : 0,
  'neutral' : 1,
  'positive' : 2
}
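As a rough illustration of how this mapping might be applied, the sketch below encodes string labels to integer ids and decodes predicted ids back to strings. The helper names `encode_labels` and `decode_predictions` are hypothetical and are not taken from the repository code.

```python
# Label mapping as stated above.
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}

# Inverse mapping, used to turn predicted class ids back into label strings.
ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}


def encode_labels(labels):
    """Map label strings to integer ids (case- and whitespace-insensitive)."""
    return [LABEL2ID[label.strip().lower()] for label in labels]


def decode_predictions(ids):
    """Map integer class ids back to label strings."""
    return [ID2LABEL[int(i)] for i in ids]


if __name__ == "__main__":
    print(encode_labels(["Positive", "neutral", "negative"]))
    print(decode_predictions([0, 2, 1]))
```

Keeping a single mapping and deriving its inverse from it avoids the two directions drifting apart across datasets.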

Models

The following three baselines were examined: XLM-T, mBERT, and TwHIN-BERT.

Environment setup

To create a conda environment with the provided environment.yml file, run the following commands -

conda env create -f environment.yml
source activate my_env

Code

The main.py file accepts the following command line arguments to control the experiments -

1. model : one of XLM-T, mBERT, or TWHIN-Bert. The tokenizer and model are loaded from the Hugging Face Hub. (defaults to XLM-T)
2. model_on_disk : use this if the model is stored on disk. Provide the complete path to the model.
3. dataset : either Sentimix or UMSAB. Sentimix loads the full set: the original code-switched tweets, the translations into Hindi (Devanagari) and English, and the transliteration into romanized Hindi. (defaults to UMSAB)
4. data_dir : path to the Sentimix dataset.
5. task : either inference or finetuning (finetuning applies to mBERT/TwHIN-BERT and is to be used with the UMSAB dataset only). (defaults to inference)
6. cpt_dir : the directory in which to store checkpoints. (defaults to a folder named checkpoint_logs)
7. op_dir : the directory in which to store predictions and other outputs, such as the validation history, throughout the job. (defaults to a folder named output_logs)
8. BATCH_SIZE : defaults to 200.
9. seed : defaults to 1.
10. lr : defaults to 0.0001.
11. max_epochs : defaults to 20.
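The arguments above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the list, not the actual contents of main.py: the function name `build_parser`, the argument types, the `choices` restrictions, and the `None` defaults for the two path arguments are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Sentiment analysis on code-mixed Hindi-English data"
    )
    parser.add_argument("--model", default="XLM-T",
                        choices=["XLM-T", "mBERT", "TWHIN-Bert"])
    # Assumption: no default path is documented, so None is used here.
    parser.add_argument("--model_on_disk", default=None,
                        help="complete path to a model stored on disk")
    parser.add_argument("--dataset", default="UMSAB",
                        choices=["Sentimix", "UMSAB"])
    parser.add_argument("--data_dir", default=None,
                        help="path to the Sentimix dataset")
    # The example commands pass both 'inference' and 'Finetuning',
    # so the value is left unrestricted in this sketch.
    parser.add_argument("--task", default="inference")
    parser.add_argument("--cpt_dir", default="checkpoint_logs")
    parser.add_argument("--op_dir", default="output_logs")
    parser.add_argument("--BATCH_SIZE", type=int, default=200)
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--lr", type=float, default=0.0001)
    parser.add_argument("--max_epochs", type=int, default=20)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

With a parser like this, any option that is omitted on the command line falls back to the default listed above.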

Finetuning

Finetuning is performed for mBERT and TwHIN-BERT on the UMSAB dataset only.

To finetune mBERT, execute the following command -

python3 main.py --model mBERT --dataset UMSAB --task Finetuning --cpt_dir checkpoint_logs_mbert --op_dir output_logs_mbert --BATCH_SIZE 8 --lr 2e-5

To finetune TwHIN-BERT, execute the following command -

python3 main.py --model TWHIN-Bert --dataset UMSAB --task Finetuning --cpt_dir checkpoint_logs --op_dir output_logs --BATCH_SIZE 8 --lr 2e-5

Inference

To perform inference with the chosen model, execute the following command -

python3 main.py --model XLM-T --dataset UMSAB --task inference --cpt_dir checkpoint_logs --op_dir output_logs

(This will evaluate XLM-T on the UMSAB dataset.)

Additionally, a sample shell file (run_job.sh) is provided to execute these jobs on High Performance Computing clusters.

Collaborators

Anisha Bhatnagar (ab10945@nyu.edu)
Gauri Gupta (gg2751@nyu.edu)
Benjamin Feuer (bf996@nyu.edu)
Daiheng Zhang (dz2266@nyu.edu)
