Certain tokens receive null POS and token is not outputted #15

matgrioni · 2017-11-02T21:05:19Z

Some tokens, such as âManli, are tagged as:

''/null

I couldn't find anything in the documentation about a null POS being returned in any case, so I assume this is unintended behavior. Either way, the original token should be included in the output.


datquocnguyen · 2017-11-03T01:19:31Z

Thanks for the report. I cannot figure out why this happened. It would be great if you could provide the whole sentence where âManli appears, and tell me which language/model you are working with.

matgrioni · 2017-11-03T22:06:53Z

I am using the Latin model. Also, after grepping and looking into the issue more, it seems that many tokens receive a null POS, even seemingly normal tokens such as ". I've attached a file that I am running this on, which produces multiple instances of null POS when run through the tagger. This file has been converted to ISO-8859-1 encoding for legacy reasons, but the same problem exists when it is in UTF-8, although I'm not sure whether it affects the same tokens.

Here is an example with ". The quote that is actually tagged as null is the one before cedant. I've included the surrounding sentences in case the model uses context (I'm not sure whether it does).

nimis magna poena te consule constituta est sive malo poetae sive libero. 'scripsisti enim: "cedant arma togae."' quid tum? 'haec res tibi fluctus illos excitavit.'

Against_Lucius_Calpurnius_Piso.txt
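For anyone trying to find which tokens are affected, a minimal sketch along these lines could summarize them. It assumes the tagger output is whitespace-separated `word/tag` items and that the missing tag is printed literally as `null`, as in the `''/null` example above; the function name `null_tagged_tokens` is made up for illustration.

```python
from collections import Counter


def null_tagged_tokens(tagged_text: str, null_tag: str = "null") -> Counter:
    """Count tokens whose tag equals `null_tag` in word/tag formatted output.

    Assumption: items are separated by whitespace and the tag follows the
    LAST slash of each item, so words that contain a slash are still handled.
    """
    counts = Counter()
    for item in tagged_text.split():
        word, sep, tag = item.rpartition("/")
        # Items without any slash are skipped rather than guessed at.
        if sep and tag == null_tag:
            counts[word] += 1
    return counts
```

Calling it on the contents of the tagged output file and printing `most_common()` gives a quick list of the tokens that would need lexicon entries.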

datquocnguyen · 2017-11-03T23:54:12Z

Note that RDRPOSTagger requires a tokenized/word-segmented input corpus, so you have to tokenize before POS tagging. You also need to add the entry
'' PUNCT to the file la_ittb-upos.DICT (here '' is two single quotation marks). RDRPOSTagger needs this entry to work properly, but for some reason the Universal Dependencies training data for Latin contains neither '' nor ".
After tokenizing and adding the entry '' PUNCT (you can also add entries for other missing punctuation marks if you want), RDRPOSTagger will work, e.g. on the sentence ' scripsisti enim : " cedant arma togae . " '
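The two steps above (tokenize, then add missing lexicon entries) could be sketched roughly as follows. This is not RDRPOSTagger's own code: the punctuation set, the function names `tokenize` and `add_missing_entries`, and the assumption that a .DICT file holds one whitespace-separated `token TAG` pair per line are all illustrative. Back up the .DICT file before modifying it.

```python
import re
from pathlib import Path

# Punctuation to split off as separate tokens. This is an illustrative set,
# not a list taken from RDRPOSTagger; extend it for your corpus.
_PUNCT = re.compile(r"""([.,;:!?"'()])""")


def tokenize(sentence: str) -> str:
    """Put spaces around each punctuation mark and collapse whitespace."""
    return " ".join(_PUNCT.sub(r" \1 ", sentence).split())


def add_missing_entries(dict_path, entries):
    """Append 'token TAG' lines for tokens not already in the lexicon file.

    Assumption: each non-empty line starts with the token, followed by
    whitespace and its tag. Returns the list of tokens that were added.
    """
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8")
    known = {parts[0] for parts in (line.split() for line in text.split("\n")) if parts}
    missing = {tok: tag for tok, tag in entries.items() if tok not in known}
    if missing:
        # Make sure the appended lines do not fuse with an unterminated last line.
        sep = "" if text.endswith("\n") or not text else "\n"
        with path.open("a", encoding="utf-8") as fh:
            fh.write(sep + "".join(f"{tok} {tag}\n" for tok, tag in missing.items()))
    return list(missing)


if __name__ == "__main__":
    print(tokenize('\'scripsisti enim: "cedant arma togae."\''))
    # prints: ' scripsisti enim : " cedant arma togae . " '
```

A call such as `add_missing_entries("la_ittb-upos.DICT", {"''": "PUNCT"})` would then add the entry only if it is not already present. Note that this naive tokenizer also splits apostrophes inside words, which is acceptable for the Latin example here but not in general.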

matgrioni · 2017-11-04T12:18:18Z

Sorry, I was unclear. The file I attached is the raw file. I tokenize it before it is sent to the POS tagger, adding a space to separate any punctuation from the next character. I assume the fix will work the same regardless.

Thanks!

datquocnguyen · 2017-11-04T12:23:37Z

You can re-download RDRPOSTagger and run it on the tokenized corpus. I just fixed the errors by adding the missing punctuation mark '' to the .DICT files. It should work with la_ittb-upos.DICT and la_ittb-upos.RDR.

datquocnguyen closed this as completed Nov 3, 2017
