loan-word-detection

Paper: A Generalized Method for Automated Multilingual Loanword Detection

Authors: Abhijnan Nath, Sina Mahdipour Saravani, Ibrahim Khebour, Sheikh Mannan, Zihui Li, and Nikhil Krishnaswamy

Published at COLING 2022

supported_languages.txt lists the languages our system currently supports in principle. We have evaluated on English-French, English-German, Indonesian-Dutch, Polish-French, Romanian-French, Kazakh-Russian, Persian-Arabic, Romanian-Hungarian, German-French, Hindi-Persian, Finnish-Swedish, Azerbaijani-Arabic, Mandarin-English, Hungarian-German, German-Italian, and Catalan-Arabic.

Pipeline

Python 3.7 is recommended (Python 3.8+ caused errors with googletrans package).
Specify language pairs and Wiktionary links in language-pairs.json.
- Resources:
  - EpiTran codes
  - ISO 639-1 codes (for Google Translate)
Run wiktionary-scraper-python/scraper.ipynb to get L1-L2 loan pairs and homonyms for each pair.
Run wiktionary-scraper-python/scaper_lemmas.ipynb to get all L2 lemmas.
Run Datasets/make-datasets.ipynb (running on Colab will prompt you to upload all resource files, including scraped data; running locally will require you to move the *-AllLemmas.csv data files into Datasets/AllLemmas).
- If running make-datasets.ipynb locally, move all downloaded files into the correct folders within Datasets.
- Recommended to run remove-overlaps.ipynb to remove remaining overlaps between synonyms, hard negatives, and loans (occurs due to case sensitivity in Google Translate, will incorporate fix into make-datasets.ipynb for final release).
Run make-train-test-splits.ipynb.
Run get-logits-cos_sims.ipynb to train phonetic alignment networks and get logits and cosine similarities for each pair.
Move any language pairs you want held out from the training into language-pairs-holdout.json.
Run evaluate-experiments.ipynb to run evaluations!

Extending Epitran

Not all languages supported by MBERT and XLM are supported by Epitran. Luckily, it's simple to extend Epitran to another language if you know its orthography and sound pattern. See David Mortensen's discussion here. We have provided sample additional map and preprocessing files for Finnish, a simple use case, in epitran-extensions. These files would need to be moved into the corresponding folder in the Epitran distribution in your Python's site-packages.

Name		Name	Last commit message	Last commit date
Latest commit History 374 Commits
Classifiers		Classifiers
Datasets		Datasets
Final_results		Final_results
FleissKappa		FleissKappa
epitran-extensions		epitran-extensions
torch_models/SM_model		torch_models/SM_model
wiktionary-scraper-python		wiktionary-scraper-python
.gitignore		.gitignore
README.md		README.md
language-pairs-holdout.json		language-pairs-holdout.json
language-pairs-pipelinetest.json		language-pairs-pipelinetest.json
language-pairs.json		language-pairs.json
supported_languages.txt		supported_languages.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

loan-word-detection

Pipeline

Extending Epitran

About

Releases 1

Packages

Contributors 6

Languages

csu-signal/loan-word-detection

Folders and files

Latest commit

History

Repository files navigation

loan-word-detection

Pipeline

Extending Epitran

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 6

Languages

Packages