Add support for HIPE 2022 #2675

Merged
merged 6 commits into master from hipe-2022-dataset on Mar 15, 2022

Conversation

stefan-it
Member

@stefan-it stefan-it commented Mar 14, 2022

Hi,

this PR adds support for the recently released HIPE 2022 Shared Tasks NER datasets.

HIPE 2022 is considerably more challenging than the previous CLEF-HIPE 2020 dataset, as it covers more datasets, more languages, and different label sets.

The currently released v1.0 version of the dataset includes support for the following datasets and languages:

Dataset name   Languages        Labels
ajmc           de, en           pers, work, loc, object, date, scope
hipe2020       de, en, fr       pers, org, prod, time, loc
letemps        fr               pers, loc
newseye        de, fi, fr, sv   PER, LOC, ORG, HumanProd
sonar          de               PER, LOC, ORG
topres19th     en               LOC, BUILDING, STREET (not used: ALIEN, OTHER, FICTION)

More details can be found in the dataset documentation in the HIPE 2022 repository.

In its current form, there's no "opinionated" corpus implementation. That means no special normalization (e.g. de-hyphenation) is done: all datasets are processed as accurately as possible.

Here's a quick example of how to use the Finnish part of the newseye dataset:

corpus = NER_HIPE_2022(dataset_name="newseye", language="fi")

As HIPE 2022 is part of an ongoing Shared Task, there's no test data available in the current v1.0 release. It will be added in future versions of HIPE 2022.

Caveats

Some datasets come with no training split, such as ajmc, the English part of hipe2020, and sonar.

The newseye datasets are special, because they come with two different development sets: dev and dev2. For this reason, the dev_split_name argument can be used to control which development split is used:

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", dev_split_name="dev2")

Example

The NER datasets in HIPE 2022 can be used with the NER_HIPE_2022 implementation. Only two arguments are necessary to initialize a corpus:

  • Dataset name (dataset_name)
  • Language (language)

The dev_split_name argument is mentioned in the previous section. Another useful option is add_document_separator: it adds special -DOCSTART- sentences to the corpus to mark document boundaries. These document boundaries are very helpful when using the previously added FLERT approach. To use it, the corpus can be initialized with:

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", add_document_separator=True)
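
For illustration, here is a minimal fine-tuning sketch (not part of this PR) that combines the document separators with FLERT-style context embeddings; the chosen model name and hyper-parameters are just assumptions:

from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", add_document_separator=True)
label_dict = corpus.make_label_dictionary("ner")

# use_context=True enables the FLERT document-level context feature
embeddings = TransformerWordEmbeddings("xlm-roberta-base", fine_tune=True, use_context=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("resources/taggers/hipe2022-newseye-de", learning_rate=5e-6, mini_batch_size=4)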

@alanakbik
Collaborator

@stefan-it awesome, thanks a lot for adding this! The unit tests fail due to two minor flake8 errors:

  • Unused import: /home/runner/work/flair/flair/flair/datasets/__init__.py:156:1: F401 '.sequence_labeling.NER_HIPE_2022' imported but unused
  • Should use 'not in' to test for membership: /home/runner/work/flair/flair/flair/datasets/sequence_labeling.py:3958:47: E713 test for membership should be 'not in'

@stefan-it
Member Author

Thanks for these hints @alanakbik! I added the missing entry to __all__ and also fixed the membership check 🤗
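
For reference, the two fixes look roughly like this (illustrative only; the variable name in the second snippet is made up and the actual diff may differ):

# 1) F401 in flair/datasets/__init__.py: list the new class in __all__ so the
#    re-export is no longer flagged as an unused import
__all__ = [
    # ... existing exports ...
    "NER_HIPE_2022",
]

# 2) E713 in flair/datasets/sequence_labeling.py: use "not in" instead of
#    negating an "in" test
misc_value = "NoSpaceAfter|EndOfSentence"  # illustrative value
if "EndOfSentence" not in misc_value:  # before: if not "EndOfSentence" in misc_value:
    print("token does not end a sentence")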

@alanakbik
Collaborator

Awesome @stefan-it finally we have the ALIEN class in Flair :D

@alanakbik
Collaborator

@stefan-it thanks again for this. A few questions:

  • I actually can't find most tags in topres19th, only LOC, BUILDING and STREET
  • sonar 'de' also does not have a train split
  • In newseye 'de', lots of sentences are broken (I assume due to OCR errors), but some are also really long and span many sentences (perhaps an issue with sentence splitting?)

Here's my script to check if all annotations are there:

from flair.datasets import NER_HIPE_2022

for config in [
    # ("ajmc", "de"), no training split
    ("hipe2020", "de"),
    ("letemps", "fr"),
    ("newseye", "de"),
    # ("sonar", "de"),  no training split
    ("topres19th", "en"),
]:
    print("\n\n ---- " + str(config))
    corpus = NER_HIPE_2022(dataset_name=config[0],
                           language=config[1],
                           add_document_separator=True,
                           in_memory=False)
    print(corpus)
    print(corpus.make_label_dictionary('ner'))

@stefan-it
Member Author

stefan-it commented Mar 15, 2022

Hi @alanakbik

  • I was also not able to find ALIEN, OTHER and FICTION, so I'll open an issue in the HIPE 2022 data repo
  • Exactly, sonar (only available for German) has no training splits
  • NewsEye is a bit special: the original dataset had no end-of-sentence marker; paragraphs were used instead. For HIPE 2022, it seems that the dataset was not automatically sentence-segmented. Instead, the end of each paragraph seems to be annotated with the "EndOfSentence" marker. This could explain the variety in sentence length.

From the original NewsEye dataset:

in	O	O	O	null	null	SpaceAfter
die	O	O	O	null	null	SpaceAfter
Wogen	O	O	O	null	null	SpaceAfter
der	O	O	O	null	null	SpaceAfter
schäumenden	O	O	O	null	null	SpaceAfter
Enns	O	O	O	null	null	SpaceAfter
getrieben	O	O	O	null	null
.	O	O	O	null	null

Mit	O	O	O	null	null	SpaceAfter
den	O	O	O	null	null	SpaceAfter
Fliehenden	O	O	O	null	null	SpaceAfter
drangen	O	O	O	null	null	SpaceAfter
wir	O	O	O	null	null	SpaceAfter
durch	O	O	O	null	null	SpaceAfter
die	O	O	O	null	null	SpaceAfter
Thore	O	O	O	null	null	SpaceAfter

HIPE 2022:

in	O	_	O	_	_	O	_	_	_
die	O	_	O	_	_	O	_	_	_
Wogen	O	_	O	_	_	O	_	_	_
der	O	_	O	_	_	O	_	_	_
schäumenden	O	_	O	_	_	O	_	_	_
Enns	O	_	O	_	_	O	_	_	_
getrieben	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	NoSpaceAfter|EndOfSentence
Mit	O	_	O	_	_	O	_	_	_
den	O	_	O	_	_	O	_	_	_
Fliehenden	O	_	O	_	_	O	_	_	_
drangen	O	_	O	_	_	O	_	_	_
wir	O	_	O	_	_	O	_	_	_
durch	O	_	O	_	_	O	_	_	_
die	O	_	O	_	_	O	_	_	_
Thore	O	_	O	_	_	O	_	_	_

For HIPE 2022, the original end-of-paragraph marker (the blank line between sentences) in the original NewsEye dataset is removed and an "EndOfSentence" marker is added instead.
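
To make this concrete, here is a rough standalone sketch (not the actual implementation in this PR) of how the "EndOfSentence" marker in the MISC column can be used to segment the HIPE 2022 token stream into sentences; the column layout is assumed from the example above:

def split_sentences(lines):
    """Group HIPE 2022 TSV token lines into sentences via the MISC column."""
    sentences, sentence = [], []
    for line in lines:
        # skip comment lines and empty lines
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        token, misc = columns[0], columns[-1]
        sentence.append(token)
        if "EndOfSentence" in misc:
            sentences.append(sentence)
            sentence = []
    if sentence:  # trailing tokens without a final EndOfSentence marker
        sentences.append(sentence)
    return sentences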

@alanakbik alanakbik merged commit 9b353e7 into master Mar 15, 2022
@alanakbik
Collaborator

@stefan-it thanks for the infos and the PR!

@simon-clematide

Just a comment on the motivation for having dev and dev2 in the newseye data. NewsEye already published a public test set that people might (will) have published results for, but NewsEye reserved a second, currently still private test set for HIPE 2022. In order not to confuse participants of the HIPE shared task, we felt it would be better not to call the already published set a "test set". Additionally, we still wanted people to be able to evaluate on the published NewsEye train/test/dev splits even if they use the HIPE 2022 data packages.

@simon-clematide

simon-clematide commented Mar 16, 2022

Another comment: EndOfSentence in MISC is now used for all datasets where we have relatively good automatic or manual sentence splitting. EndOfLine in MISC refers to layout information, as before. The ajmc dataset currently consists of just a sample; a proper train/dev split will be available tomorrow.

@alanakbik
Collaborator

@simon-clematide thanks for the info!

@stefan-it does the class have to be adapted to make use of this information?

@stefan-it
Member Author

Whenever there's a new version out, I will update split information and test cases 🤗

@stefan-it
Member Author

Hi @simon-clematide, do you by any chance plan to perform de-hyphenation in the upstream data as well? If not, I'm going to add a flag to enable de-hyphenation. As far as I can tell, this is only needed for hipe2020 and newseye - and the two datasets mark hyphenation differently:

NewsEye:

den	O	O	O	null	null	SpaceAfter
Ver¬	B-LOC	O	O	null	n
einigten	I-LOC	O	O	null	n
Staaten	I-LOC	O	O	null	n
.	O	O	O	null	null

HIPE-2020:

sich	O	O	O	O	O	O	_	_	_
immer	O	O	O	O	O	O	_	_	_
verlän	O	O	O	O	O	O	_	_	NoSpaceAfter
¬	O	O	O	O	O	O	_	_	EndOfLine
gernden	O	O	O	O	O	O	_	_	_
Krieges	O	O	O	O	O	O	_	_	_
ist	O	O	O	O	O	O	_	_	_
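
For illustration, here is a hedged sketch of what such a de-hyphenation flag could do for the two styles shown above (token level only, merging of the NER labels is ignored; this is not code from this PR):

def dehyphenate(tokens):
    """Merge tokens that were split by the '¬' hyphenation marker."""
    merged, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        if token.endswith("¬") and len(token) > 1 and i + 1 < len(tokens):
            # NewsEye style: "Ver¬" + "einigten" -> "Vereinigten"
            merged.append(token[:-1] + tokens[i + 1])
            i += 2
        elif tokens[i + 1:i + 2] == ["¬"] and i + 2 < len(tokens):
            # HIPE-2020 style: "verlän" + "¬" + "gernden" -> "verlängernden"
            merged.append(token + tokens[i + 2])
            i += 3
        else:
            merged.append(token)
            i += 1
    return merged

print(dehyphenate(["den", "Ver¬", "einigten", "Staaten", "."]))
# ['den', 'Vereinigten', 'Staaten', '.']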

@stefan-it stefan-it deleted the hipe-2022-dataset branch May 12, 2022 09:47