HAREM Datasets Preprocessing

The HAREM collections are popular Portuguese datasets commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate multiple solutions). This annotation scheme is good for representing vagueness and indeterminacy. However, it complicates modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.

The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities:

For each Entity with multiple classes, it selects the first valid class.
For each <ALT> tag, it selects the solution with the highest number of entities.
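
The two selection rules above can be sketched as follows. This is a minimal illustration, not the code in xml_to_json.py: the function names, the representation of an alternative as a list of entity dicts, and the tie-breaking rule (the first alternative wins) are all assumptions, as is the reading of "valid" as "belonging to the class set of the chosen scenario".

```python
from typing import List, Optional, Sequence, Set


def select_class(categ: str, valid_classes: Set[str]) -> Optional[str]:
    """Return the first valid class of a '|'-separated class string.

    Hypothetical helper: returns None when no class is valid.
    """
    for label in categ.split("|"):
        label = label.strip()
        if label in valid_classes:
            return label
    return None


def select_alternative(alternatives: Sequence[List[dict]]) -> List[dict]:
    """Return the alternative that contains the most entities.

    Assumption: each alternative is a list of entity dicts. On a tie,
    max() keeps the first alternative, which may differ from the script.
    """
    return max(alternatives, key=len)


valid = {"PESSOA", "LOCAL"}
chosen_class = select_class("OBRA|LOCAL|PESSOA", valid)

alternatives = [
    [{"text": "Universidade de Lisboa"}],
    [{"text": "Universidade"}, {"text": "Lisboa"}],
]
chosen_alt = select_alternative(alternatives)
```

Here OBRA is skipped because it is not in the valid set, so the entity is labeled with the next class in the list, and the second alternative is kept because it contains two entities rather than one.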

The script has been tested on the following XML files:

FirstHAREM: CDPrimeiroHAREMprimeiroevento.xml
MiniHAREM: CDPrimeiroHAREMMiniHAREM.xml

Total and Selective scenarios

Recent works often train models and report performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 entity classes:

PESSOA (Person)
ORGANIZACAO (Organization)
LOCAL (Location)
TEMPO (Date)
VALOR (Value)
ABSTRACCAO (Abstraction)
ACONTECIMENTO (Event)
COISA (Thing)
OBRA (Title)
OUTRO (Other)

The Selective scenario considers only the first 5 classes of the list above.

The script is compatible with both scenarios and selects only the entities that belong to the chosen scenario.
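
As a rough sketch of how the two scenarios relate, the class sets and a scenario filter could be written as below. The names TOTAL_CLASSES, SELECTIVE_CLASSES, classes_for and filter_entities, and the "label" key on each entity, are hypothetical and are not taken from the script.

```python
from typing import Dict, List

# The ten classes of the Total scenario, in the order listed above.
TOTAL_CLASSES = [
    "PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
    "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO",
]
# The Selective scenario keeps only the first five classes.
SELECTIVE_CLASSES = TOTAL_CLASSES[:5]


def classes_for(scenario: str) -> List[str]:
    """Map a scenario name ('total' or 'selective') to its class list."""
    if scenario == "total":
        return TOTAL_CLASSES
    if scenario == "selective":
        return SELECTIVE_CLASSES
    raise ValueError(f"unknown scenario: {scenario!r}")


def filter_entities(entities: List[Dict[str, str]], scenario: str) -> List[Dict[str, str]]:
    """Keep only the entities whose label belongs to the scenario."""
    allowed = set(classes_for(scenario))
    return [entity for entity in entities if entity["label"] in allowed]


entities = [
    {"text": "Lisboa", "label": "LOCAL"},
    {"text": "Os Lusíadas", "label": "OBRA"},
]
selective = filter_entities(entities, "selective")
total = filter_entities(entities, "total")
```

With these inputs, the Selective scenario drops the OBRA entity, while the Total scenario keeps both.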

Usage

The scripts have been tested with Python 3.6.

Install the requirements:

$ pip install -r requirements.txt

Run the script:

$ python xml_to_json.py path_to_xml_file.xml --scenario [total|selective]

The converted file is saved under the same name as the input file, with the suffix -{scenario}.json.
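
One plausible reading of this naming rule is sketched below. It assumes the .xml extension is dropped before the suffix is appended and that the output lands next to the input; the script itself may behave differently, and output_path is a hypothetical helper.

```python
from pathlib import Path


def output_path(xml_path: str, scenario: str) -> Path:
    """Build the output path: <input name without .xml>-<scenario>.json.

    Assumption: the file is written in the same directory as the input.
    """
    path = Path(xml_path)
    return path.with_name(f"{path.stem}-{scenario}.json")


result = output_path("CDPrimeiroHAREMMiniHAREM.xml", "selective")
```

Under these assumptions, converting CDPrimeiroHAREMMiniHAREM.xml in the Selective scenario would produce CDPrimeiroHAREMMiniHAREM-selective.json.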

Tests

To run the tests, first install the test requirements, then run the test script:

$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py
