Improving Spacy Pos Tagger Accuracy (by 0.5-1% or more) #2087

Oxi84 · 2018-03-11T17:41:41Z

Is it possible to get a (sorted) list of probabilities for the predicted parts of speech?
For example, for "Apples are red" I should get a list of possible parts of speech for the word "red": [JJ,NN,NNP ...].

Then I should be able to choose the part of speech from the list of possible parts of speech for a given word, simply because I noticed that for some sentences the predicted part of speech is not always a possible part of speech for that word. For example, the word "free" gets tagged as NN, and from dictionaries one can find that the word "free" is not always

Doing so should likely improve the part-of-speech tagger's accuracy by 0.5%-1%, so it would be around 98% instead of around 97%, which is already not bad.

Oxi84 · 2018-03-11T17:44:22Z

[JJ,NN,NNP ...] would be the parts of speech of the word, sorted by probability. So the word is most probably a JJ, less likely an NN, and so on.
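To make the request concrete, here is a minimal sketch of the ranking step in plain Python. The labels and scores are made-up values, and `rank_tags` is a hypothetical helper, not part of spaCy.

```python
def rank_tags(labels, scores):
    """Return the tag labels sorted from most to least probable."""
    # Pair each label with its score, then sort by score, highest first.
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked]


# Made-up scores for the word "red" in "Apples are red".
labels = ["NN", "JJ", "NNP", "VB"]
scores = [0.25, 0.60, 0.10, 0.05]
ranked_tags = rank_tags(labels, scores)
print(ranked_tags)
```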

honnibal · 2018-03-11T17:54:53Z

Yes, this has been a requested feature that I'd like to see built.

For now you can do it in your code by subclassing the Tagger class and overriding the .predict() method. The current implementation:

    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for doc_scores in scores:
            doc_guesses = doc_scores.argmax(axis=1)
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

The scores variable will hold a list of numpy (or cupy if gpu) arrays, of length len(docs) and shape (len(doc), len(tags)). The function then needs to return a list of class IDs, which should be integer indices into the self.labels list. The metadata that describes the lookup table should go into the cfg dictionary if it's smallish (e.g. if you only want it to apply to frequent words and closed classes). You might want to populate your table during Tagger.begin_training(). The begin_training function receives a callable that produces an iterable over the training data.

If the dictionary is getting big, make sure you're doing the mapping with the integers, not the strings. If it's still too large, we can put it in a Cython variable. PreshMap is very space efficient.
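A minimal sketch of such an integer-keyed table, in plain Python. The names `build_allowed_table`, `word_ids` and the toy lexicon are hypothetical; in spaCy the word IDs would come from the vocabulary and the tag indices from self.labels.

```python
def build_allowed_table(lexicon, word_ids, labels):
    """Map integer word IDs to frozensets of allowed tag indices."""
    tag_index = {tag: i for i, tag in enumerate(labels)}
    table = {}
    for word, tags in lexicon.items():
        # Drop tags the tagger does not know; store integer indices only.
        allowed = frozenset(tag_index[t] for t in tags if t in tag_index)
        if allowed:
            table[word_ids[word]] = allowed
    return table


labels = ["JJ", "NN", "RB", "VB", "VBP"]   # stand-in for self.labels
word_ids = {"free": 101, "the": 7}          # stand-in for vocabulary IDs
lexicon = {"free": ["JJ", "VB", "VBP", "RB"], "the": ["DT"]}
table = build_allowed_table(lexicon, word_ids, labels)
print(table)
```

Words whose tags are all unknown to the tagger (here "the") get no entry, so they would fall through to the unconstrained prediction.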

As far as implementation goes, it's probably fine to just build a boolean array for each word in the doc, and then multiply the scores by that. If that's not fast enough, I have similar logic in the parser, see the arg_max_if_valid function within the spacy/syntax/nn_parser.pyx file.
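The masking idea can be sketched as follows. This is an illustration on plain Python lists rather than the numpy or cupy arrays the tagger actually produces, and `masked_argmax` is a hypothetical name, not a spaCy function.

```python
def masked_argmax(scores, allowed):
    """Pick the best-scoring tag index among the allowed ones.

    scores: list of per-tag probabilities for one token.
    allowed: set of permitted tag indices, or None/empty if the word is unknown.
    """
    if not allowed:
        # No dictionary entry: fall back to the unconstrained prediction.
        return max(range(len(scores)), key=lambda i: scores[i])
    # Boolean mask: 1 for allowed tags, 0 otherwise, multiplied into the scores.
    masked = [s * (1 if i in allowed else 0) for i, s in enumerate(scores)]
    return max(range(len(masked)), key=lambda i: masked[i])


# Toy example with labels ["JJ", "NN", "RB", "VB"]: the model prefers NN (index 1),
# but the dictionary only allows JJ, RB and VB for this word.
scores = [0.30, 0.55, 0.05, 0.10]
best = masked_argmax(scores, allowed={0, 2, 3})
print(best)
```

With real arrays the same step would be an elementwise multiply of the score matrix by a boolean mask, followed by an argmax over the tag axis.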

Oxi84 · 2018-03-11T18:43:20Z

Thanks very much.

I actually want to apply the part-of-speech filter after training, as it would take a lot of time to learn how to train spaCy. Is it possible to get the probabilities as output after I run, for example, doc = nlp(u"This is a sentence.")?

I tried:

import spacy
from spacy.pipeline import Tagger
nlp = spacy.load('en')
doc = nlp(u"This is a sentence.")
print [e.tag_ for e in doc]
tagger = Tagger(nlp.vocab)
scores = tagger.predict(doc)
print "scores",scores

But this reports an error.

honnibal · 2018-03-12T01:36:11Z

I haven't run the following, so it probably has bugs --- but something like this:

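import numpy
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token
from spacy.language import Language

# Doc-level attribute holding the full score array, one row per token.
Doc.set_extension('tag_scores', default=None)
# Token-level attribute: look up this token's row via the doc's extension.
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])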

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)
  
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)

This should:

Register an extension attribute that can be accessed at doc._.tag_scores
Register an extension attribute that can be accessed at token._.tag_scores
Subclass the tagger, so that during prediction, the tag scores are saved in the new extension attribute
Register the new class in the Language.factories dictionary, so that the subclass will be created.

The following should then work:

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence.")
doc._.tag_scores # Should be numpy array, with one row per token
doc[0]._.tag_scores # Should be one row of the above array, accessed as a view.

You can read more about the extension attributes and processing pipeline here: https://spacy.io/usage/processing-pipelines#section-custom-components

Oxi84 · 2018-03-12T14:11:23Z

Thanks. It doesn't work, because I get an error on the first line, Doc.set_extension('tag_scores', default=None).

But if you would like to try adding this feature to spaCy, please give me your email so I can send you the code (one simple function) and a .txt file with a list of possible POS tags.

I also added a few rules that transform the predicted tags. For example:

    if predtag == "VBG":
        if tag_bef in ("DT", "JJ", "CD"):
            if predtag in possible_tags:
                predtag = "NN"
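A rule of this kind could be applied over a whole sentence roughly as follows. This is a sketch under assumptions: `apply_vbg_rule` and `NOUN_CONTEXT_TAGS` are hypothetical names, and the check that NN is allowed for the word is one reading of the rule, which may differ from the original function.

```python
NOUN_CONTEXT_TAGS = {"DT", "JJ", "CD"}


def apply_vbg_rule(tags, possible_tags):
    """Retag VBG as NN when it follows DT, JJ or CD and NN is allowed.

    tags: predicted tags, one per token.
    possible_tags: one set of dictionary-allowed tags per token.
    """
    corrected = list(tags)
    for i in range(1, len(tags)):
        # Look at the originally predicted tag of the preceding token.
        if (tags[i] == "VBG"
                and tags[i - 1] in NOUN_CONTEXT_TAGS
                and "NN" in possible_tags[i]):
            corrected[i] = "NN"
    return corrected


# "the building collapsed", tagged DT VBG VBD by a hypothetical tagger.
tags = ["DT", "VBG", "VBD"]
possible = [{"DT"}, {"NN", "VBG"}, {"VBD", "VBN"}]
corrected = apply_vbg_rule(tags, possible)
print(corrected)
```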

Oxi84 · 2018-03-12T14:28:32Z

Actually, here are a few lines I used to increase accuracy for the old perceptron tagger. I also added more features and made it bidirectional, so it achieved around 98.26% accuracy. I guess you also have bidirectional CNNs, though, and making the network bidirectional adds only around 0.3%, while this part should lift it by at least 0.5%, provided the CNN predicts the second-most-probable tag with similar accuracy to the perceptron.

    import os

    # Get the list of words and their possible tags from a text file
    # (this can surely be done in a much better way).
    # Example line: free JJ VB VBP RB  (split it to get the elements)
    new_ptlist_w = []
    new_ptlist_t = []
    script_dir = os.path.dirname(__file__)
    thefile_path = os.path.join(script_dir, "word_tag_words.txt")
    with open(thefile_path, "r") as f:
        for line in f:
            tlist = line.split()
            new_ptlist_w.append(tlist[0])   # the word
            new_ptlist_t.append(tlist[1:])  # its possible tags

    wordtotag = "Free"  # for example

    # Get the possible tags for a given word.
    theindex = new_ptlist_w.index(wordtotag)
    possible_tags = new_ptlist_t[theindex]

    # After spaCy predicts a list of possible tags, remove the ones that are not in possible_tags.
    # Say the predicted tags are in predicted_tags_list, sorted from highest to lowest probability:
    predicted_tags_list = [item for item in predicted_tags_list if item in possible_tags]

    predicted_tag = predicted_tags_list[0]
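The same filter can be sketched with a dictionary lookup instead of two parallel lists. `load_possible_tags` and `choose_tag` are hypothetical names; lowercasing the word and falling back to the tagger's top prediction when nothing survives the filter are assumptions, not part of the code above.

```python
def load_possible_tags(lines):
    """Parse 'word TAG TAG ...' lines into a dict of word -> list of tags."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank lines and words without tags
            table[parts[0].lower()] = parts[1:]
    return table


def choose_tag(word, predicted_tags, table):
    """Return the most probable predicted tag that the dictionary allows."""
    possible = table.get(word.lower())
    if possible is None:
        return predicted_tags[0]  # unknown word: trust the tagger
    for tag in predicted_tags:    # already sorted by probability
        if tag in possible:
            return tag
    return predicted_tags[0]      # nothing survived the filter: trust the tagger


table = load_possible_tags(["free JJ VB VBP RB", ""])
chosen = choose_tag("Free", ["NN", "JJ", "VB"], table)
print(chosen)
```

A dict lookup avoids the linear scan of list.index() and does not raise an error when the word is missing from the file.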

lock · 2018-05-07T20:52:59Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the labels "enhancement" (Feature requests and improvements) and "training" (Training and updating models) Mar 27, 2018

honnibal closed this as completed Mar 27, 2018

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
