
Format question #9

Open
stefan-it opened this issue Jul 27, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@stefan-it

Hi guys,

I'm currently wondering why the "text" line contains the tokenized sentence, whereas the "text" line in the German GSD corpus contains the un-tokenized sentence.

Is there any format specification that defines whether the tokenized or untokenized sentence should be used there? 🤔

@akoehn
Member

akoehn commented Jul 27, 2021

This comes from the different sources. HDT development started ~20 years ago with scraped HTML3, and we converted the already tokenized HDT to UD. Obtaining the untokenized text is possible (the correspondences are in some extra files) but not trivial.

Trying to automatically un-tokenize would be odd: everyone else could do that themselves as well, and it would give the impression of being the original text even though it is not. Therefore, we do not provide a non-tokenized text.

@dan-zeman
Member

Is there any format specification that defines if the tokenized/untokenized sentence should be used there

The text attribute should contain the untokenized sentence. For treebanks where it is available, it contains the original text before any processing started, which is of course highly preferable. However, there are many other treebanks for which the original is not available, and in these treebanks the detokenization has been estimated automatically using heuristics. So if HDT goes that way, it will definitely not be the only treebank to do so.
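To make the relationship concrete, here is a minimal sketch of how the `# text =` line relates to the token lines in CoNLL-U: the surface text is the concatenation of the FORM fields, with a space after each token unless its MISC column carries `SpaceAfter=No`. Heuristic detokenization works in the opposite direction, guessing which tokens should get `SpaceAfter=No`. The sample sentence and function name below are illustrative, not taken from HDT, and multiword token ranges are not handled.

```python
def text_from_tokens(conllu_token_lines):
    """Rebuild the surface text from CoNLL-U token lines,
    inserting a space unless MISC contains SpaceAfter=No."""
    parts = []
    for line in conllu_token_lines:
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        # Skip empty nodes (IDs like "5.1"); this sketch assumes
        # no multiword token ranges (IDs like "3-4").
        if "." in tok_id or "-" in tok_id:
            continue
        parts.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()

tokens = [
    "1\tDas\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tist\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tgut\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_",
]
print(text_from_tokens(tokens))  # Das ist gut.
```

A treebank whose `text` lines were produced this way yields sensible training data for tokenizers, whereas a `text` line with spaces between all tokens does not.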

I would say that this approach is preferable (while the README should explain that it is not the original). If people train an end-to-end model, such as UDPipe, on a treebank where there are spaces between all tokens, they will get a model that is useless.

@dan-zeman dan-zeman added the bug Something isn't working label Aug 5, 2021
@AngledLuffa

Bump: the tokenization is still essentially useless for training a model, per @dan-zeman's explanation above.

4 participants