This repository contains python implementation of automatic reduplication identification along with classification into different types.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. Please see to instructions to run the project.
It is encouraged to work in virtual virtual environments.
We used Python 3 along with IndicNLP library. IndicNLP a python library is required for processing Indian languages.
- Install indic nlp:
pip3 install indicnlp
- Copy the file under kept in 'indicnlp' directory to the installation directory location 'indicnlp/tokenize' where IndicNLP is installed.
For more details of IndicNLP package visit github link- https://github.com/anoopkunchukuttan/indic_nlp_resources#indic-nlp-resources
We have used three dataset for our experiment. Except the third one, the other two are own collection.
-
Social Media dataset (sample file included in the repo, named as 'social_media_post.txt' )
-
Agricultural dataset ( sample file inculded in repo, named as 'agicultural_dataset.txt' )
-
TDIL dataset: Acquired dataset named 'Assamese Monolingual Text Corpus ILCI-II' from TDIL, Indian Languages Corpora Initiative phase –II (ILCI Phase-II) project, initiated by the MeitY, Govt. of India, Jawaharlal Nehru University. Not allowed for open publishing.
TDIL Link: https://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1877&lang=en
To run the programme, please run the 'find_redup.py' file-
python find_redup.py
-
This will take 'social_media_post.txt' as input.
-
After procesing, the programm will generate a text file named redup_word.txt in the same directory which contain all the reduplicated word along with its classification.
If you use our code or dataset, please cite this paper:
@article{pathak2022reduplication,
title={Reduplication in Assamese: Identification and Modeling},
author={Pathak, Dhrubajyoti and Nandi, Sukumar and Sarmah, Priyankoo},
journal={Transactions on Asian and Low-Resource Language Information Processing},
volume={21},
number={5},
pages={1--18},
articleno = {90},
year={2022},
issue_date = {September 2022},
publisher={ACM New York, NY}
url = {https://doi.org/10.1145/3510419},
doi = {10.1145/3510419},
}