Entity Projection via MT for Cross-Lingual NER
This repository contains the implementation of the Translate-Match-Project (TMP) method described in the accompanying paper. We demonstrate that, using off-the-shelf Machine Translation (MT) systems and a few simple heuristics, significant gains can be made in cross-lingual NER for medium-resource languages (see footnote 1).
This code was written in Python 3.6. Please create a dedicated
environment (using virtualenv or conda) and install the packages listed in
the requirements file in that environment using the command below. Also
create a data/ directory where all the input and output files can be
stored. Note that all the commands and paths listed in this README assume
that you are in the parent project directory.
pip install -r requirements.txt
Using Google Cloud Translation API
The Google Cloud Translation API needs to be used twice in order to successfully run the TMP method:
- To translate sentences from source (language) to target.
- To translate each entity phrase in a source sentence to target.
Please find below instructions to set up and use the API.
Setting up the API
Please follow the official setup steps for the Google Cloud Translation API. Once setup is finished, you will have an API key, which is required for authentication whenever the API is used. Store this key (a string) in a plain-text file (not JSON) in your project directory. Keep the key private to avoid unauthorized usage billed to your account.
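As an illustration of the key storage described above, the following minimal sketch reads the key back from its text file. The function name load_api_key and the file name in the usage note are hypothetical and are not part of this repository.

```python
from pathlib import Path


def load_api_key(key_path):
    """Return the API key stored as a single string in a plain-text file.

    Hypothetical helper: the repository may read the key differently.
    """
    # Strip surrounding whitespace so a trailing newline does not break authentication.
    key = Path(key_path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {key_path}")
    return key
```

For example, load_api_key("api_key.txt") would return the key as a string that can be passed on as the api_key argument of a translation call. Adding the key file to your version-control ignore list is one way to keep it private.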
Using the API programmatically
The following function in src/util/tmp.py accesses the Translation API:
from googleapiclient.discovery import build
.
.
.
def get_google_translations(src_list, src_lang_code, tgt_lang_code, api_key):
    service = build('translate', 'v2', developerKey=api_key)
    tgt_dict = service.translations().list(
        source=src_lang_code, target=tgt_lang_code, q=src_list).execute()
    return [t['translatedText'] for t in tgt_dict['translations']]
The Google Cloud Translation service can at times return an error because
requests arrive faster than the maximum rate allowed. To minimize such
errors, the batch size argument (in main.py) has been set to 128 and
time_sleep to 10 (seconds). Note that these arguments are used only while
translating sentences. Entity phrase translation occurs on a
sentence-by-sentence basis without batching and without any wait time.
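The batching and waiting described above can be sketched as follows. This is an illustrative sketch rather than the code in main.py: translate_in_batches and its translate_fn parameter are hypothetical names, and translate_fn stands in for any callable that takes a list of sentences and returns their translations.

```python
import time


def translate_in_batches(sentences, translate_fn, batch_size=128, time_sleep=10):
    """Translate sentences batch by batch, pausing between API requests.

    translate_fn is an assumed callable: list of str -> list of str.
    """
    translations = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        translations.extend(translate_fn(batch))
        # Wait only when more batches remain, to stay under the request-rate limit.
        if start + batch_size < len(sentences):
            time.sleep(time_sleep)
    return translations
```

One way to obtain translate_fn would be functools.partial applied to get_google_translations with the language codes and API key bound as keyword arguments. Larger batches mean fewer requests, while a longer pause lowers the request rate at the cost of run time.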
Despite these measures, such errors may still occur. If that happens,
please note down the index of the batch (while translating sentences) or
that of the sentence (while translating entity phrases) at which the error
occurs. If the error occurs while translating sentences, re-run the process
with the corresponding batch index argument (in main.py) set to this index.
If the error occurs while translating entity phrases, do the same with the
phrase_iter argument. Initially, both these indices are set to -1, so that
all sentences or entity phrases get sent to the API for translation. When
one or both of these have positive integer values, the batches or sentences
numbered lower than these indices are not sent again for translation.
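The resume behaviour described above can be illustrated with a small sketch. It is not the repository's implementation: translate_with_resume, translate_fn, and start_index are hypothetical names, and saving earlier results between runs is left out.

```python
def translate_with_resume(batches, translate_fn, start_index=-1):
    """Translate batches, skipping those numbered lower than start_index.

    Returns (results, failed_index). failed_index is None when every
    remaining batch succeeded; otherwise it is the index to resume from.
    """
    results = {}
    for index, batch in enumerate(batches):
        if index < start_index:
            continue  # translated in an earlier run, so not sent again
        try:
            results[index] = translate_fn(batch)
        except Exception:
            # Stop at the first failure and report where to restart.
            return results, index
    return results, None
```

With the default of -1 nothing is skipped, which matches the initial setting described above. After a failure, passing the returned index as start_index on the next run sends only the failed batch and those after it.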
Getting annotated target data
Preprocess the source files
TRANSLATE from source to target
MATCH and PROJECT
Training a model in the target language
We used the code from...
Please send an email to email@example.com in case of any questions or suggestions related to the paper or the code.
1 We define medium-resource languages as those for which strong off-the-shelf MT systems exist but large annotated corpora for NER do not.