BanglaParaphrase

This repository contains the code, data, and associated models of the paper titled "BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset", accepted in Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: AACL 2022.

Datasets

Disclaimer: You must agree to the license and terms of use before using the dataset.

The dataset files are organized in .jsonl format, i.e., one JSON object per line. Download the dataset from here.

One example from the test part of the dataset is given below in JSON format.

{
  "source": "খোঁজ খবর রাখতেন বিজ্ঞানের অগ্রগতি নিয়ে।", 
  "target": "বিজ্ঞানের অগ্রগতির দিকে তিনি নজর রেখেছিলেন।"
}
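A minimal sketch of reading such a file in Python, assuming each non-empty line is a JSON object with the "source" and "target" keys shown above. The helper name read_pairs and the file name in the usage comment are hypothetical and not part of this repository.

```python
import json
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["source"], record["target"]


# Usage with a file (the file name below is a placeholder):
# with open("test.jsonl", encoding="utf-8") as f:
#     for source, target in read_pairs(f):
#         print(source, "->", target)
```

Opening the file with encoding="utf-8" matters here because the sentences are Bangla text.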

Data Splits

The train, validation, and test example counts are given below:

Language	ISO 639-1 Code	Train	Validation	Test
Bengali	bn	419,967	23,331	23,332

Filtering Pipeline

The following filtering pipeline was used to preprocess the raw dataset to ensure high quality.

Filter Name	Significance	Filtering Parameters
PINC	Ensure diversity in generated paraphrase	0.65, 0.76, 0.80
BERTScore	Preserve semantic coherence with the source	lower 0.91 - 0.93, upper 0.98
N-gram repetition	Reduce n-gram repetition during inference	2 - 4 grams
Punctuation	Prevent generating non-terminating sentences during inference	N/A
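To make the PINC row concrete, here is a rough sketch of the score as it is commonly defined: the average, over n-gram orders 1 to N, of the fraction of candidate n-grams that do not appear in the source. Whitespace tokenization, N = 4, and skipping orders for which the candidate is too short are assumptions of this sketch; the script in the PINCScore folder may differ in all three.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source: str, candidate: str, max_n: int = 4) -> float:
    """Higher values mean the candidate shares fewer n-grams with the source."""
    src = source.split()    # assumption: whitespace tokenization
    cand = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # assumption: skip orders the candidate is too short for
        overlap = len(ngrams(src, n) & cand_ngrams)
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition an exact copy of the source scores 0 and a paraphrase with no shared words scores 1, so a diversity filter would presumably keep pairs whose score is at or above the chosen threshold.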

In the respective folders, instructions on how to run certain filtering and scoring scripts are provided.
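As an illustration of the last two rows of the table, the sketch below shows one plausible reading of the n-gram repetition and punctuation checks: flag a sentence if any 2- to 4-gram occurs more than once, and require a sentence-ending mark at the end. The function names, the whitespace tokenization, and the set of terminators are assumptions made for this example, not the repository's actual implementation.

```python
from collections import Counter

# Assumed sentence terminators: the Bangla danda plus "?" and "!".
TERMINATORS = ("।", "?", "!")


def has_repeated_ngram(sentence: str, min_n: int = 2, max_n: int = 4) -> bool:
    """Return True if any n-gram (min_n <= n <= max_n) occurs more than once."""
    tokens = sentence.split()  # assumption: whitespace tokenization
    for n in range(min_n, max_n + 1):
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        if any(count > 1 for count in counts.values()):
            return True
    return False


def ends_with_terminator(sentence: str) -> bool:
    """Return True if the sentence ends with one of the assumed terminators."""
    return sentence.rstrip().endswith(TERMINATORS)
```

A pair could then be kept only when its paraphrase has no repeated n-gram and ends with a terminator.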

Run the full pipeline

Install the requirements from requirements.txt and then run the following command.

bash filter.sh -i <input> -p <pinc_threshold> -l <lower_bert_score_threshold> -h <higher_bert_score_threshold>

Here, input is the path to the .jsonl file containing sentences and their corresponding paraphrases as key-value pairs, pinc_threshold is the threshold for the PINC score, and lower_bert_score_threshold and higher_bert_score_threshold are the lower and upper limits for BERTScore on a scale of 0 to 1.

This will generate two files named source.bn and target.bn in the working directory containing the filtered pairs after passing through all the filtering steps.
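The two output files can be recombined into pairs with a short helper like the one below. It assumes the files are line-aligned, with one sentence per line and line k of source.bn corresponding to line k of target.bn. This is a guess about the output layout, and pair_lines is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def pair_lines(sources: Iterable[str], targets: Iterable[str]) -> List[Tuple[str, str]]:
    """Zip two line-aligned sequences into (source, target) pairs."""
    src = [line.rstrip("\n") for line in sources]
    tgt = [line.rstrip("\n") for line in targets]
    if len(src) != len(tgt):
        # A length mismatch means the line-alignment assumption does not hold.
        raise ValueError(f"line count mismatch: {len(src)} vs {len(tgt)}")
    return list(zip(src, tgt))


# Usage, run from the directory where filter.sh wrote its output:
# with open("source.bn", encoding="utf-8") as s, open("target.bn", encoding="utf-8") as t:
#     pairs = pair_lines(s, t)
```

Raising on a length mismatch, rather than silently truncating as a bare zip would, makes a broken alignment visible early.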

Training & Evaluation

For training and evaluation, please refer to the repository of BanglaNLG.

Models

The model checkpoint from the paper is available on the Hugging Face model hub.

License

Contents of this repository are restricted to only non-commercial research purposes under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents belongs to the original copyright holders.

Citation

@article{akil2022banglaparaphrase,
  title={BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset},
  author={Akil, Ajwad and Sultana, Najrin and Bhattacharjee, Abhik and Shahriyar, Rifat},
  journal={arXiv preprint arXiv:2210.05109},
  year={2022}
}
