
Format question #9

Open
stefan-it opened this issue Jul 27, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@stefan-it

Hi guys,

I'm currently wondering why the "text" line contains the tokenized sentence, whereas the "text" line in the German GSD corpus contains the un-tokenized sentence.

Is there any format specification that defines whether the tokenized or untokenized sentence should be used there? 🤔

@akoehn
Member

akoehn commented Jul 27, 2021

This comes from the different sources. HDT development started ~20 years ago with scraped HTML3, and we converted the already tokenized HDT to UD. Obtaining the untokenized text is possible (the correspondences are in some extra files) but not trivial.

Trying to automatically un-tokenize would be odd: everyone else could do that themselves as well, and it would give the impression of being the original text even though it is not. Therefore, we do not provide a non-tokenized text.

@dan-zeman
Member

Is there any format specification that defines if the tokenized/untokenized sentence should be used there

The text attribute should contain the untokenized sentence. For treebanks where it is available, it contains the original text before any processing started, which is of course highly preferable. However, there are many other treebanks for which the original is not available, and in these treebanks the detokenization has been estimated automatically using heuristics. So if HDT goes that way, it will definitely not be the only treebank to do so.
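To make the relationship concrete, here is a minimal sketch of how the `# text =` line relates to the token lines in CoNLL-U: the surface text is the concatenation of the FORM fields, with a space after each token unless its MISC column carries `SpaceAfter=No`. Heuristic detokenization works in the opposite direction, guessing which tokens should get `SpaceAfter=No`. The sample sentence and function name below are illustrative, not taken from HDT, and multiword token ranges are not handled.

```python
def text_from_tokens(conllu_token_lines):
    """Rebuild the surface text from CoNLL-U token lines,
    inserting a space unless MISC contains SpaceAfter=No."""
    parts = []
    for line in conllu_token_lines:
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        # Skip empty nodes (IDs like "5.1"); this sketch assumes
        # no multiword token ranges (IDs like "3-4").
        if "." in tok_id or "-" in tok_id:
            continue
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()

tokens = [
    "1\tDas\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tist\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tgut\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_",
]
print(text_from_tokens(tokens))  # Das ist gut.
```

A treebank whose `text` lines were produced this way yields sensible training data for tokenizers, whereas a `text` line with spaces between all tokens does not.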

I would say that this approach is preferable (while the README should explain that it is not the original). If people train an end-to-end model, such as UDPipe, on a treebank where there are spaces between all tokens, they will get a model that is useless.

@dan-zeman dan-zeman added the bug Something isn't working label Aug 5, 2021
@AngledLuffa

Bump: the tokenization is still essentially useless for training a model, per @dan-zeman's explanation above.

4 participants