Improving spaCy POS Tagger Accuracy (by 0.5-1% or more) #2087
[JJ, NN, NNP, ...] would be the parts of speech of the word, sorted by probability. So the word would most probably be a JJ, it is less likely to be an NN, and so on.
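As a minimal sketch of what "sorted by probability" would mean, assuming a made-up four-tag inventory and one made-up row of softmax scores for a single token, `numpy.argsort` gives the ranking:

```python
import numpy

# Hypothetical tag inventory and one row of softmax scores for one token.
tag_names = ["JJ", "NN", "NNP", "VB"]
scores = numpy.array([0.55, 0.30, 0.10, 0.05])

# Sort tag indices by descending probability, then map back to tag names.
ranked = [tag_names[i] for i in numpy.argsort(scores)[::-1]]
print(ranked)  # ['JJ', 'NN', 'NNP', 'VB']
```

The tag names and scores here are illustrative only; in practice the scores would come from the tagger's softmax output.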
Yes, this has been a requested feature that I'd like to see built. For now you can do it in your own code by subclassing the `Tagger` and overriding its `predict` method:

```python
import numpy

def predict(self, docs):
    tokvecs = self.model.tok2vec(docs)
    scores = self.model.softmax(tokvecs)
    guesses = []
    for doc_scores in scores:
        doc_guesses = doc_scores.argmax(axis=1)
        # On GPU the argmax result isn't a numpy array; .get() copies it to the CPU.
        if not isinstance(doc_guesses, numpy.ndarray):
            doc_guesses = doc_guesses.get()
        guesses.append(doc_guesses)
    return guesses, tokvecs
```

If the dictionary is getting big, make sure you're doing the mapping with the integers, not the strings. If it's still too large, we can put it in a Cython variable. As far as implementation goes, it's probably fine to just build a boolean array for each word in the doc, and then multiply the scores by that. If that's not fast enough, I have similar logic in the parser.
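The boolean-array idea above can be sketched in plain numpy. The tag inventory, scores, and allowed-tag mask here are all made up for illustration:

```python
import numpy

# Softmax scores: one row per token, one column per tag (hypothetical 3-tag set).
scores = numpy.array([
    [0.2, 0.5, 0.3],   # token 0: all tags allowed
    [0.6, 0.1, 0.3],   # token 1: dictionary disallows tag 0
])

# Per-token boolean mask of dictionary-allowed tags, as floats for multiplication.
allowed = numpy.array([
    [1, 1, 1],
    [0, 1, 1],
], dtype=scores.dtype)

# Zero out impossible tags, then take the argmax as usual.
masked = scores * allowed
guesses = masked.argmax(axis=1)
print(guesses)  # token 1's prediction moves from tag 0 to tag 2
```

Multiplying by the mask before the argmax means the model's ranking among the allowed tags is preserved while impossible tags can never win.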
Thanks very much. I actually want to apply the part-of-speech filter after training, as it would take a lot of time to learn how to train spaCy. Is it possible to get the probabilities as output after I apply, for example, `doc = nlp(u"This is a sentence.")`? I tried, but this reports an error.
I haven't run the following, so it probably has bugs --- but something like this:

```python
import numpy
import spacy
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token
from spacy.language import Language

Doc.set_extension('tag_scores', default=None)
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            # Stash the full probability table on the Doc before taking the argmax.
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)
```

This replaces the `tagger` factory, so models loaded afterwards use the subclass. The following should then work:

```python
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence.")
doc._.tag_scores     # Should be a numpy array, with one row per token
doc[0]._.tag_scores  # Should be one row of the above array, accessed as a view
```

You can read more about the extension attributes and processing pipeline here: https://spacy.io/usage/processing-pipelines#section-custom-components
Thanks. It doesn't work: I get an error on the first line, `Doc.set_extension('tag_scores', default=None)`. But if you would like to try to add this feature to spaCy, please give me your email so I can send you the code (one simple function) and a .txt file with a list of possible POS tags. I also added a few rules that transform the predicted tags.
Actually, here are a few lines I used to increase accuracy for the old perceptron tagger. I also added more features and made it bidirectional, so it achieved around 98.26% accuracy. I guess you also have bidirectional CNNs, and making the network bidirectional gives an increase of only around 0.3%, while this part should lift accuracy by at least 0.5%, provided the CNN predicts the second tag's probability with similar accuracy to the perceptron.
Is it possible to get a (sorted) list of probabilities of the predicted parts of speech?
For example, for "Apples are red", I should get a list of possible parts of speech for the word "red": [JJ, NN, NNP, ...].
Then I should be able to choose the part of speech from the list of possible parts of speech for a certain word, simply because I noticed that for some sentences, the predicted part of speech is not even a possible part of speech for that word. For example, the word "free" gets tagged as NN, and from dictionaries one can find that the word "free" is not always ...
Doing so should likely improve the part-of-speech tagger by 0.5%-1%, so it should be around 98% instead of around 97%, which is not bad.
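A minimal sketch of this post-hoc filtering idea, with a made-up tag inventory, dictionary, and score row (not spaCy's actual API): walk the tags in order of descending model probability and return the first one the dictionary allows for the word.

```python
import numpy

# Hypothetical tag inventory and one row of softmax scores for one token.
tag_names = ["JJ", "NN", "VB"]
scores = numpy.array([0.2, 0.5, 0.3])     # the model prefers NN
possible = {"free": {"JJ", "VB"}}         # dictionary disallows NN for "free"

def constrained_tag(word, scores):
    """Return the highest-scoring tag the dictionary permits for `word`."""
    order = numpy.argsort(scores)[::-1]   # tag indices, best first
    for i in order:
        # Words missing from the dictionary keep the model's top choice.
        if word not in possible or tag_names[i] in possible[word]:
            return tag_names[i]
    return tag_names[order[0]]

print(constrained_tag("free", scores))  # 'VB'
```

With these illustrative numbers, NN wins on raw score but is filtered out by the dictionary, so the next-best allowed tag (VB) is chosen instead.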