Certain tokens receive null POS and token is not outputted #15

Closed
matgrioni opened this issue Nov 2, 2017 · 5 comments

Comments

@matgrioni

Some tokens, such as âManli, are tagged as:

''/null

I couldn't find anything in the documentation about a null POS ever being returned, so I figure this is undesirable behavior. Either way, the original token should be included in the output.

@datquocnguyen
Owner

Thanks for the report. I cannot figure out any reason for what happened. It would be great if you could provide the whole sentence where âManli appears, and tell me which language/model you are working with.

@matgrioni
Author

I am using the Latin model. Also after grepping and looking into the issue more it seems like many tokens receive null POS, even seemingly normal tokens such as ". I've attached a file that I am running this on, which has multiple instances of null POS when run through the tagger. This file has been converted to iso-8859-1 encoding for legacy reasons, but the same problem exists when it is in utf-8 format, although I'm not sure if it's the same tokens.

Here is an example with ": the quote that is tagged as null is the one before cedant. I've included the context of the sentence in case the model uses it (I'm not sure).

nimis magna poena te consule constituta est sive malo poetae sive libero. 'scripsisti enim: "cedant arma togae."' quid tum? 'haec res tibi fluctus illos excitavit.'

Against_Lucius_Calpurnius_Piso.txt
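
For anyone wanting to reproduce the grep described above, here is a minimal sketch. It assumes the tagger output is whitespace-separated word/TAG items, as the ''/null example at the top of this issue suggests; the function name and the sample line are made up for illustration and are not real tagger output.

```python
def find_null_tagged(tagged_line):
    """Return the words in one line of word/TAG output whose tag is 'null'."""
    hits = []
    for item in tagged_line.split():
        # Split on the last slash so a word that itself contains '/' stays intact.
        word, sep, tag = item.rpartition("/")
        if sep and tag == "null":
            hits.append(word)
    return hits


# Made-up line in the assumed word/TAG format (tags are illustrative only).
line = "scripsisti/VERB enim/ADV ''/null cedant/VERB"
print(find_null_tagged(line))
```

Running this over every line of the tagged file would list the tokens that are missing from the model's dictionary.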

@datquocnguyen
Owner

Note that RDRPOSTagger requires a tokenized/word-segmented input corpus, so you have to perform tokenization before POS tagging. You also need to add the entry '' PUNCT to the file la_ittb-upos.DICT (here '' is two single quotation marks). This entry is important for RDRPOSTagger to work properly, but somehow the Universal Dependencies training data for Latin does not contain both '' and ".
After tokenizing and adding the entry '' PUNCT (you can also add entries for other missing punctuation marks if you want!), RDRPOSTagger will definitely work, e.g. on the sentence ' scripsisti enim : " cedant arma togae . " '
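
The manual edit described above could be scripted roughly as follows. This is only a sketch under assumptions: that each line of the .DICT file is a word followed by a space and its tag (as the '' PUNCT entry suggests), and that the file is UTF-8. The helper name add_missing_entries is hypothetical and not part of RDRPOSTagger.

```python
from pathlib import Path


def add_missing_entries(dict_path, entries, encoding="utf-8"):
    """Append (word, tag) entries to a .DICT file if the word is not listed yet.

    Assumes one 'word tag' pair per line; returns the words that were added.
    """
    path = Path(dict_path)
    lines = path.read_text(encoding=encoding).splitlines()
    # The first whitespace-separated field of each non-empty line is the word.
    known = {line.split()[0] for line in lines if line.strip()}
    added = []
    for word, tag in entries:
        if word not in known:
            lines.append(f"{word} {tag}")
            known.add(word)
            added.append(word)
    if added:
        path.write_text("\n".join(lines) + "\n", encoding=encoding)
    return added


# Example call (back up the file first; path as named in this thread):
# add_missing_entries("la_ittb-upos.DICT", [("''", "PUNCT")])
```

Calling it a second time with the same entries leaves the file untouched, so it is safe to re-run.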

@matgrioni
Author

Sorry, I was unclear. The file I attached is the raw file. I tokenize the text before sending it to the POS tagger, adding a space to separate any punctuation from the next character. I assume the fix will work the same way regardless.

Thanks!
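
A minimal sketch of the kind of punctuation-separating tokenization discussed in this thread, for readers who need one. It is an assumption about what such a step might look like, not the code actually used here: it puts spaces around every character that is neither a word character nor whitespace, which is cruder than a real tokenizer (it would also split abbreviations and hyphenated words). The function name is hypothetical.

```python
import re


def separate_punctuation(text):
    """Surround each punctuation character with spaces, then normalize whitespace."""
    # [^\w\s] matches any single character that is not a letter, digit,
    # underscore, or whitespace, i.e. punctuation and symbols.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return " ".join(spaced.split())


print(separate_punctuation("'scripsisti enim: \"cedant arma togae.\"'"))
# prints: ' scripsisti enim : " cedant arma togae . " '
```

The printed line matches the tokenized sentence given earlier in the thread.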

@datquocnguyen
Owner

datquocnguyen commented Nov 4, 2017

You can re-download rdrpostagger to run on the tokenized corpus. I just fixed the errors by adding the missing punctuation mark '' to the .DICT files. It should work with la_ittb-upos.DICT and la_ittb-upos.RDR.
