
Added universal propositions bank for French and German #1866

Merged
merged 3 commits into from
Sep 18, 2020


Dabendorf
Contributor

I have added the German and French data from the Universal Proposition Bank (https://github.com/System-T/UniversalPropositions).

Collaborator

@alanakbik left a comment

I've tested with the following code and the output does not look fully correct:

from flair.datasets import UP_GERMAN

# load corpus
corpus = UP_GERMAN()
print(corpus)

# print first sentence in train data
print(corpus.train[0])

This prints:

Sentence: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor ."   [− Tokens: 17  − Token-Labels: "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor ."]

Two problems here:

(1) The sentence is read as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so stelle ich mir Kundenservice vor .", but sent_id should not be part of the sentence. The problem is that the UP files contain comment lines. These lines are prefixed by a # symbol and should be skipped. You can get this behavior by setting the comment_symbol in the ColumnCorpus class.

(2) The frames are not annotated. The annotation is printed as "sent_id Sehr gute Beratung , schnelle Behebung der Probleme , so <AM-MNR> stelle ich <A0> mir Kundenservice <A1> vor .", but the frame annotation should be on the verbs. So the object currently selects the wrong column as the frame.
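Both problems can be reproduced without the real data. The following is a minimal sketch in plain Python, not flair's actual reader: `read_sentences`, the sample rows, and the frame label `vorstellen.01` are invented for illustration. It assumes that lines are split on whitespace and that the frame sits in column 9 (counting from 0), as stated later in this thread.

```python
from typing import Dict, List, Optional

# Hypothetical miniature sample; real UP files have more columns filled in.
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "1 Ich _ _ _ _ _ _ _ _",
    "2 stelle _ _ _ _ _ _ _ vorstellen.01",
    "3 mir _ _ _ _ _ _ _ _",
    "4 Kundenservice _ _ _ _ _ _ _ _",
    "5 vor _ _ _ _ _ _ _ _",
    "6 . _ _ _ _ _ _ _ _",
])


def read_sentences(
    text: str,
    columns: Dict[int, str],
    comment_symbol: Optional[str] = "#",
) -> List[List[Dict[str, str]]]:
    """Read column-formatted text into sentences of {label: value} tokens."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if comment_symbol and line.startswith(comment_symbol):
            # Comment lines carry metadata such as sent_id, not tokens.
            continue
        fields = line.split()
        # Use "_" when a line is too short to contain the requested column.
        current.append(
            {name: fields[i] if i < len(fields) else "_" for i, name in columns.items()}
        )
    if current:
        sentences.append(current)
    return sentences


columns = {1: "text", 9: "frame"}

# With comment skipping, the sentence starts at the first real token.
fixed = read_sentences(SAMPLE, columns)
print([tok["text"] for tok in fixed[0]])

# Without it, the comment line is read as a token whose text is "sent_id".
broken = read_sentences(SAMPLE, columns, comment_symbol=None)
print(broken[0][0]["text"])
```

In this sketch, the second field of the comment line is `sent_id`, which is why it shows up as the first word of the sentence when comments are not skipped.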

base_path: Path = Path(base_path)

# column format
columns = {1: "text", 10: "frame"}
Collaborator

The columns are wrong: it seems that column 10 is not the frame information.

Contributor Author

It was column no. 9, counting from 0. Sorry, my mistake. It is fixed now.

train_file="de-up-train.conllu",
test_file="de-up-dev.conllu",
dev_file="de-up-test.conllu",
in_memory=in_memory,
Collaborator

the comment_symbol parameter is missing here (the UP and UD datasets have comments that should not be read)

Contributor Author

I have added comment_symbol="#" in both classes, thank you

Collaborator

@alanakbik left a comment

Looks good, but test and dev splits are switched! Can you change this?

encoding="utf-8",
train_file="de-up-train.conllu",
test_file="de-up-dev.conllu",
dev_file="de-up-test.conllu",
Collaborator

This is switched: You are loading the dev split as test_file and the test split as dev_file.

Contributor Author

Oh, damn. Sorry about that, such sloppy work. I am going to fix it in a couple of minutes, thank you!

Contributor Author

Changed it, thank you!

encoding="utf-8",
train_file="fr-up-train.conllu",
test_file="fr-up-dev.conllu",
dev_file="fr-up-test.conllu",
Collaborator

Same here
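A swap like this is easy to miss by eye in both classes. One way to guard against it, sketched here in plain Python under the assumption that the files follow the `<lang>-up-<split>.conllu` naming seen in the diff, is to derive the file arguments from the split name. The helper names `split_files` and `mismatched_splits` are hypothetical and not part of flair.

```python
from typing import Dict, List


def split_files(prefix: str) -> Dict[str, str]:
    """Build the three file arguments so each split name is written only once."""
    return {
        f"{split}_file": f"{prefix}-up-{split}.conllu"
        for split in ("train", "dev", "test")
    }


def mismatched_splits(file_args: Dict[str, str]) -> List[str]:
    """Return the argument names whose file name does not mention the same split."""
    wrong = []
    for arg, filename in file_args.items():
        split = arg[: -len("_file")]  # "dev_file" -> "dev"
        if f"-{split}." not in filename:
            wrong.append(arg)
    return wrong


# The arguments as submitted in the diff above.
submitted = {
    "train_file": "de-up-train.conllu",
    "test_file": "de-up-dev.conllu",
    "dev_file": "de-up-test.conllu",
}

print(mismatched_splits(submitted))          # ['test_file', 'dev_file']
print(mismatched_splits(split_files("de")))  # []
```

The resulting dictionary could then be unpacked into the corpus constructor as keyword arguments, so the German and French classes would differ only in the prefix.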

@alanakbik merged commit 4e1c7b6 into flairNLP:master on Sep 18, 2020
@alanakbik
Collaborator

@Dabendorf thanks for adding this!
