Issues with SLP model and POS tagging (pattern.en) #182

Closed
markus-beuckelmann opened this Issue Jun 13, 2017 · 3 comments

markus-beuckelmann commented Jun 13, 2017

There are currently some issues with POS misclassification in pattern.en on Python 2.7, which cause multiple tests in test_en.py and test_text.py to fail. Take test_tag(), for instance (see Travis log), which looks like this:

# Assert [("black", "JJ"), ("cats", "NNS")].
v = en.tag("black cats")
self.assertEqual(v, [("black", "JJ"), ("cats", "NNS")])

The test fails because 'black' gets classified as JJS (adjective, superlative) instead of JJ (adjective).

Here is what happens: when we call en.tag(), the call is passed down to en.parse(), which is handled by parse() in text/__init__.py (source), which in turn calls find_tags() (source). Inside find_tags(), the word is looked up in the lexicon (here), which assigns the correct (!) label JJ. This label is then overruled (here) by the model, because 'black' is listed in model.unknown, and classify() (source) returns the wrong label 'JJS'.
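
To make the override concrete, here is a minimal sketch of the lookup order described above. It is not the actual pattern code: ToyModel, its canned answers, and this simplified find_tags() signature are hypothetical stand-ins for illustration, and the real SLP model computes its prediction from context rather than from a lookup table.

```python
# Simplified sketch of the lookup order described above; NOT the actual pattern code.
# ToyModel and its canned answers are hypothetical stand-ins for the SLP model.

class ToyModel(object):
    def __init__(self, unknown, answers):
        self.unknown = set(unknown)   # words the model is allowed to re-tag
        self.answers = dict(answers)  # canned predictions (a real model computes these)

    def classify(self, token):
        return self.answers.get(token)


def find_tags(tokens, lexicon, model=None, default="NN"):
    tagged = []
    for token in tokens:
        # Step 1: lexicon lookup.
        tag = lexicon.get(token)
        # Step 2: the model takes over if the lexicon has no tag, or if the
        # token is listed in model.unknown (even when the lexicon had a tag).
        if model is not None and (tag is None or token in model.unknown):
            tag = model.classify(token) or tag
        tagged.append((token, tag or default))
    return tagged


lexicon = {"black": "JJ", "cats": "NNS"}
model = ToyModel(unknown=["black"], answers={"black": "JJS"})

print(find_tags(["black", "cats"], lexicon))         # lexicon only
print(find_tags(["black", "cats"], lexicon, model))  # model overrides "black"
```

In this sketch the first call keeps the lexicon's JJ for 'black', while the second returns JJS because the word appears in the model's unknown set, which mirrors the failure seen in test_tag().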

There are many similar examples that you can look at: test_parse (see e.g. misclassification for 'sat'), test_find_tags, test_tagged_string, test_word, test_document.

Sure, the SLP model is a statistical model and is consequently allowed to be wrong in some cases, but what bothers me is that it apparently used to work some time ago. Sentences of the form "The black cat sat on..." are scattered so widely across the unit tests that I can't believe the model got them wrong all along.

I just can't find the cause for this change. @tom-de-smedt, what am I missing?

markus-beuckelmann commented Jul 28, 2017

I finally narrowed down the cause of this problem. It looks like this line in pattern/text/__init__.py, introduced in dc85534, is responsible for the problems mentioned above.

piyush0609 commented Jul 30, 2017

@markus-beuckelmann I would be glad to work on the issue and try to resolve it, if you are not already working on it.

markus-beuckelmann commented Aug 1, 2017

@piyush0609, it is already fixed as of 93235fe. It's great that you want to contribute though, keep an eye out for issues tagged with the "help" label.
