Added universal propositions bank for French and German #1866

Dabendorf · 2020-09-15T13:34:56Z

I have added the German and French data from the Universal Proposition Bank (https://github.com/System-T/UniversalPropositions).

alanakbik

I've tested with the following code and the output does not look fully correct:

from flair.datasets import UP_GERMAN

# load corpus
corpus = UP_GERMAN()
print(corpus)

# print first sentence in train data
print(corpus.train[0])

This prints:

Sentence: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor ."   [− Tokens: 17  − Token-Labels: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor ."]

Two problems here:

(1) The sentence is read as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor .", but sent_id should not be part of the sentence. The problem is that the UP files contain comment lines. These lines are prefixed by a # symbol and should be skipped. You can get this behavior by setting the comment_symbol parameter of the ColumnCorpus class.

(2) The frames are not annotated. The annotation is printed as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor ." But the frame annotation should be on the verbs, so the object currently selects the wrong column as the frame.
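
To illustrate problem (1), here is a minimal sketch in plain Python of a column reader that skips comment lines. It is not Flair's implementation: the function name `read_column_sentences`, the tab delimiter, and the sample lines are assumptions made for illustration only.

```python
from typing import Iterable, List


def read_column_sentences(
    lines: Iterable[str], text_column: int = 1, comment_symbol: str = "#"
) -> List[List[str]]:
    """Collect the tokens of each sentence, ignoring comment lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(comment_symbol):
            # Metadata such as "# sent_id = ..." is not part of the sentence.
            continue
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t")[text_column])
    if current:
        # Flush the last sentence if the input does not end with a blank line.
        sentences.append(current)
    return sentences


# Made-up sample in a CoNLL-U-like layout (not real UP data).
sample = [
    "# sent_id = example-1",
    "1\tSehr\tsehr",
    "2\tgute\tgut",
    "3\tBeratung\tBeratung",
    "",
]
print(read_column_sentences(sample))  # [['Sehr', 'gute', 'Beratung']]
```

Without the `comment_symbol` check, a reader that splits on whitespace would presumably take the second field of `# sent_id = ...` as a token, which would explain the stray `sent_id` at the start of the sentence above.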

alanakbik · 2020-09-16T11:19:08Z

flair/datasets/sequence_labeling.py

+            base_path: Path = Path(base_path)
+
+        # column format
+        columns = {1: "text", 10: "frame"}


The columns are wrong: it seems that column 10 does not hold the frame information.

It was column no. 9, counting from 0. Sorry, my mistake. It is fixed now.
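
An off-by-one like this is easy to check by listing the fields of one token line together with their zero-based indices, which is how the column format dictionary is keyed in this exchange. A small sketch follows; the helper `indexed_fields` and the sample line are hypothetical and are not taken from the UP files.

```python
from typing import Dict


def indexed_fields(line: str, delimiter: str = "\t") -> Dict[int, str]:
    """Map each zero-based column index to the field found in that column."""
    return dict(enumerate(line.split(delimiter)))


# Hypothetical token line with eleven tab-separated fields (not real UP data).
line = "\t".join(f"field{i}" for i in range(11))
fields = indexed_fields(line)

print(fields[9])   # the tenth field when counting from 1
print(fields[10])  # the eleventh field when counting from 1
```

Printing the indexed fields of a real line from the data file shows directly which index holds the frame labels before it is written into the `columns` dictionary.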

alanakbik · 2020-09-16T11:20:00Z

flair/datasets/sequence_labeling.py

+            train_file="de-up-train.conllu",
+            test_file="de-up-dev.conllu",
+            dev_file="de-up-test.conllu",
+            in_memory=in_memory,


The comment_symbol parameter is missing here (the UP and UD datasets have comment lines that should not be read).

I have added comment_symbol="#" in both classes, thank you

alanakbik

Looks good, but test and dev splits are switched! Can you change this?

alanakbik · 2020-09-18T13:35:57Z

flair/datasets/sequence_labeling.py

+            encoding="utf-8",
+            train_file="de-up-train.conllu",
+            test_file="de-up-dev.conllu",
+            dev_file="de-up-test.conllu",


This is switched: You are loading the dev split as test_file and the test split as dev_file.

Oh. Merde. Sorry about that. Such sloppy work. I am going to fix that in a couple of minutes, thank you!

Changed it, thank you!
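
A swap like this can be caught mechanically by checking that each file name mentions the split it is assigned to. The sketch below is only an illustration: the helper `mismatched_splits` is hypothetical, it assumes the file names contain the words train, dev, and test (as they do in this PR), and the `fixed` mapping is the presumed assignment after the correction.

```python
from typing import Dict, List


def mismatched_splits(files: Dict[str, str]) -> List[str]:
    """Return the split names whose file name does not contain that split name."""
    return [split for split, name in files.items() if split not in name]


# File assignment as originally submitted in this PR (dev and test swapped).
submitted = {
    "train": "de-up-train.conllu",
    "test": "de-up-dev.conllu",
    "dev": "de-up-test.conllu",
}

# Presumed assignment after the fix.
fixed = {
    "train": "de-up-train.conllu",
    "test": "de-up-test.conllu",
    "dev": "de-up-dev.conllu",
}

print(mismatched_splits(submitted))  # ['test', 'dev']
print(mismatched_splits(fixed))      # []
```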

alanakbik · 2020-09-18T13:36:12Z

flair/datasets/sequence_labeling.py

+            encoding="utf-8",
+            train_file="fr-up-train.conllu",
+            test_file="fr-up-dev.conllu",
+            dev_file="fr-up-test.conllu",


alanakbik · 2020-09-18T14:36:33Z

@Dabendorf thanks for adding this!

Added universal propositions bank for French and German

003db41

alanakbik requested changes Sep 16, 2020

Changed wrong frame column, added comment character

26e4d85

Dabendorf requested a review from alanakbik September 16, 2020 12:20

alanakbik requested changes Sep 18, 2020

Changed swaped lines, bugfix

172e80c

alanakbik approved these changes Sep 18, 2020

alanakbik merged commit 4e1c7b6 into flairNLP:master Sep 18, 2020
