Add support for HIPE 2022 #2675

Merged
merged 6 commits into master from hipe-2022-dataset on Mar 15, 2022

Conversation

stefan-it
Member

@stefan-it stefan-it commented Mar 14, 2022

Hi,

this PR adds support for the recently released HIPE 2022 Shared Tasks NER datasets.

HIPE 2022 is considerably more challenging than the previous CLEF-HIPE 2020 dataset, as it covers more datasets, more languages, and different label sets.

The currently released v1.0 version of the dataset includes support for the following datasets and languages:

Dataset name   Languages        Labels
ajmc           de, en           pers, work, loc, object, date, scope
hipe2020       de, en, fr       pers, org, prod, time, loc
letemps        fr               pers, loc
newseye        de, fi, fr, sv   PER, LOC, ORG, HumanProd
sonar          de               PER, LOC, ORG
topres19th     en               LOC, BUILDING, STREET (not used: ALIEN, OTHER, FICTION)

More details can be found in the dataset documentation in the HIPE 2022 repository.

In its current form, there's no "opinionated" corpus implementation. That means no special normalization (e.g. de-hyphenation) is done: all datasets are processed as accurately as possible.

Here's a quick example of how to use the Finnish part of the newseye dataset:

corpus = NER_HIPE_2022(dataset_name="newseye", language="fi")

As HIPE 2022 is part of an ongoing Shared Task, there's no test data available in the current v1.0 release. It will be added in future versions of HIPE 2022.

Caveats

Some datasets come with no training split, such as ajmc, the English part of hipe2020, and sonar.

The newseye datasets are special, because they come with two different development sets: dev and dev2. For this reason, the dev_split_name argument can be used to control which development split is used:

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", dev_split_name="dev2")

Example

The NER datasets in HIPE 2022 can be used with the NER_HIPE_2022 implementation. Only two arguments are necessary to initialize a corpus:

  • Dataset name (dataset_name)
  • Language (language)

The dev_split_name argument is mentioned in the previous section. Another useful option is add_document_separator: it adds special -DOCSTART- sentences to the corpus to mark document boundaries. These document boundaries are very helpful when using the previously added FLERT approach. To use it, the corpus can be initialized with:

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", add_document_separator=True)
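
For illustration, here is a minimal fine-tuning sketch (not part of this PR) that combines the document separators with FLERT-style context embeddings; the chosen model name and hyper-parameters are just assumptions:

from flair.datasets import NER_HIPE_2022
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = NER_HIPE_2022(dataset_name="newseye", language="de", add_document_separator=True)
label_dict = corpus.make_label_dictionary("ner")

# use_context=True enables the FLERT document-level context feature
embeddings = TransformerWordEmbeddings("xlm-roberta-base", fine_tune=True, use_context=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("resources/taggers/hipe2022-newseye-de", learning_rate=5e-6, mini_batch_size=4)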

@alanakbik
Collaborator

@stefan-it awesome, thanks a lot for adding this! The unit tests fail due to two minor flake8 errors:

  • Unused import: /home/runner/work/flair/flair/flair/datasets/__init__.py:156:1: F401 '.sequence_labeling.NER_HIPE_2022' imported but unused
  • Should use 'not in' to test for membership: /home/runner/work/flair/flair/flair/datasets/sequence_labeling.py:3958:47: E713 test for membership should be 'not in'

@stefan-it
Member Author

Thanks for these hints @alanakbik! I added the missing entry to __all__ and also fixed the membership check 🤗
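
For reference, the two fixes look roughly like this (illustrative only; the variable name in the second snippet is made up and the actual diff may differ):

# 1) F401 in flair/datasets/__init__.py: list the new class in __all__ so the
#    re-export is no longer flagged as an unused import
__all__ = [
    # ... existing exports ...
    "NER_HIPE_2022",
]

# 2) E713 in flair/datasets/sequence_labeling.py: use "not in" instead of
#    negating an "in" test
misc_value = "NoSpaceAfter|EndOfSentence"  # illustrative value
if "EndOfSentence" not in misc_value:  # before: if not "EndOfSentence" in misc_value:
    print("token does not end a sentence")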

@alanakbik
Collaborator

Awesome @stefan-it finally we have the ALIEN class in Flair :D

@alanakbik
Collaborator

@stefan-it thanks again for this. A few questions:

  • I actually can't find most tags in topres19th, only LOC, BUILDING and STREET
  • sonar 'de' also does not have a train split
  • In newseye 'de', lots of sentences are broken (I assume due to OCR errors), but some are also really long and span many sentences (perhaps an issue with sentence splitting?)

Here's my script to check if all annotations are there:

from flair.datasets import NER_HIPE_2022

for config in [
    # ("ajmc", "de"), no training split
    ("hipe2020", "de"),
    ("letemps", "fr"),
    ("newseye", "de"),
    # ("sonar", "de"),  no training split
    ("topres19th", "en"),
]:
    print("\n\n ---- " + str(config))
    corpus = NER_HIPE_2022(dataset_name=config[0],
                           language=config[1],
                           add_document_separator=True,
                           in_memory=False)
    print(corpus)
    print(corpus.make_label_dictionary('ner'))

@stefan-it
Member Author

stefan-it commented Mar 15, 2022

Hi @alanakbik

  • I was also not able to find ALIEN, OTHER and FICTION, so I'll open an issue in the HIPE 2022 data repo
  • Exactly, sonar (only available for German) has no training splits
  • NewsEye is a bit special: the original dataset had no end-of-sentence marker; paragraphs were used instead. For HIPE 2022, it seems that the dataset was not automatically sentence-segmented. Instead, the end of each paragraph seems to be annotated with the "EndOfSentence" marker. This could explain the variety in sentence length.

From the original NewsEye dataset:

in	O	O	O	null	null	SpaceAfter
die	O	O	O	null	null	SpaceAfter
Wogen	O	O	O	null	null	SpaceAfter
der	O	O	O	null	null	SpaceAfter
schäumenden	O	O	O	null	null	SpaceAfter
Enns	O	O	O	null	null	SpaceAfter
getrieben	O	O	O	null	null
.	O	O	O	null	null

Mit	O	O	O	null	null	SpaceAfter
den	O	O	O	null	null	SpaceAfter
Fliehenden	O	O	O	null	null	SpaceAfter
drangen	O	O	O	null	null	SpaceAfter
wir	O	O	O	null	null	SpaceAfter
durch	O	O	O	null	null	SpaceAfter
die	O	O	O	null	null	SpaceAfter
Thore	O	O	O	null	null	SpaceAfter

HIPE 2022:

in	O	_	O	_	_	O	_	_	_
die	O	_	O	_	_	O	_	_	_
Wogen	O	_	O	_	_	O	_	_	_
der	O	_	O	_	_	O	_	_	_
schäumenden	O	_	O	_	_	O	_	_	_
Enns	O	_	O	_	_	O	_	_	_
getrieben	O	_	O	_	_	O	_	_	NoSpaceAfter
.	O	_	O	_	_	O	_	_	NoSpaceAfter|EndOfSentence
Mit	O	_	O	_	_	O	_	_	_
den	O	_	O	_	_	O	_	_	_
Fliehenden	O	_	O	_	_	O	_	_	_
drangen	O	_	O	_	_	O	_	_	_
wir	O	_	O	_	_	O	_	_	_
durch	O	_	O	_	_	O	_	_	_
die	O	_	O	_	_	O	_	_	_
Thore	O	_	O	_	_	O	_	_	_

For HIPE 2022, the original end-of-paragraph marker (the blank line between sentences) in the original NewsEye dataset is removed and an "EndOfSentence" marker is added instead.
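
To make this concrete, here is a rough standalone sketch (not the actual implementation in this PR) of how the "EndOfSentence" marker in the MISC column can be used to segment the HIPE 2022 token stream into sentences; the column layout is assumed from the example above:

def split_sentences(lines):
    """Group HIPE 2022 TSV token lines into sentences via the MISC column."""
    sentences, sentence = [], []
    for line in lines:
        # skip comment lines and empty lines
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        token, misc = columns[0], columns[-1]
        sentence.append(token)
        if "EndOfSentence" in misc:
            sentences.append(sentence)
            sentence = []
    if sentence:  # trailing tokens without a final EndOfSentence marker
        sentences.append(sentence)
    return sentences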

@alanakbik alanakbik merged commit 9b353e7 into master Mar 15, 2022
@alanakbik
Collaborator

@stefan-it thanks for the infos and the PR!

@simon-clematide

Just a comment on the motivation for having dev and dev2 in the newseye data. NewsEye already published a public test set that people might (will) have published results for, but NewsEye reserved a second, currently still private test set for HIPE 2022. In order not to confuse participants of the HIPE shared task, we felt it would be better not to call the already published set a "test set". Additionally, we still wanted people to be able to evaluate on the published NewsEye train/test/dev splits even if they use the HIPE 2022 data packages.

@simon-clematide

simon-clematide commented Mar 16, 2022

Another comment: EndOfSentence in MISC is now used for all datasets where we have relatively good automatic or manual sentence splitting. EndOfLine in MISC refers to layout information, as before. The ajmc dataset currently consists of just a sample; a proper train/dev split will be available tomorrow.

@alanakbik
Collaborator

@simon-clematide thanks for the info!

@stefan-it does the class have to be adapted to make use of this information?

@stefan-it
Member Author

Whenever there's a new version out, I will update split information and test cases 🤗

@stefan-it
Member Author

Hi @simon-clematide, do you by any chance plan to perform de-hyphenation in the upstream data as well? If not, I'm going to add a flag to enable de-hyphenation. As far as I can tell, this is only needed for hipe2020 and newseye - and the two datasets mark hyphenation differently:

NewsEye:

den	O	O	O	null	null	SpaceAfter
Ver¬	B-LOC	O	O	null	n
einigten	I-LOC	O	O	null	n
Staaten	I-LOC	O	O	null	n
.	O	O	O	null	null

HIPE-2020:

sich	O	O	O	O	O	O	_	_	_
immer	O	O	O	O	O	O	_	_	_
verlän	O	O	O	O	O	O	_	_	NoSpaceAfter
¬	O	O	O	O	O	O	_	_	EndOfLine
gernden	O	O	O	O	O	O	_	_	_
Krieges	O	O	O	O	O	O	_	_	_
ist	O	O	O	O	O	O	_	_	_
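
For illustration, here is a hedged sketch of what such a de-hyphenation flag could do for the two styles shown above (token level only, merging of the NER labels is ignored; this is not code from this PR):

def dehyphenate(tokens):
    """Merge tokens that were split by the '¬' hyphenation marker."""
    merged, i = [], 0
    while i < len(tokens):
        token = tokens[i]
        if token.endswith("¬") and len(token) > 1 and i + 1 < len(tokens):
            # NewsEye style: "Ver¬" + "einigten" -> "Vereinigten"
            merged.append(token[:-1] + tokens[i + 1])
            i += 2
        elif tokens[i + 1:i + 2] == ["¬"] and i + 2 < len(tokens):
            # HIPE-2020 style: "verlän" + "¬" + "gernden" -> "verlängernden"
            merged.append(token + tokens[i + 2])
            i += 3
        else:
            merged.append(token)
            i += 1
    return merged

print(dehyphenate(["den", "Ver¬", "einigten", "Staaten", "."]))
# ['den', 'Vereinigten', 'Staaten', '.']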

@stefan-it stefan-it deleted the hipe-2022-dataset branch May 12, 2022 09:47