Improving Spacy Pos Tagger Accuracy (by 0.5-1% or more) #2087

Closed
Oxi84 opened this issue Mar 11, 2018 · 7 comments
Labels
enhancement Feature requests and improvements training Training and updating models

Comments

Oxi84 commented Mar 11, 2018

Is it possible to get a list of the predicted parts of speech for each word, sorted by probability?
For example, for the sentence "Apples are red", I would like a list of possible parts of speech for the word "red": [JJ, NN, NNP, ...].

Then I should be able to choose the part of speech from the list of possible parts of speech for a given word, simply because I noticed that, for some sentences, the predicted part of speech is not a possible part of speech for that word. For example, the word "free" gets tagged as NN, and from dictionaries one can find that the word "free" is not always

Doing so would likely improve the part-of-speech tagger's accuracy by 0.5-1%, bringing it to around 98% instead of around 97%, which is not bad.

Author

Oxi84 commented Mar 11, 2018

[JJ, NN, NNP, ...] would be the parts of speech of the word, sorted by probability. So this word is most probably a JJ, less likely an NN, and so on.
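
A minimal sketch of the ranking described here, assuming the tagger can expose one score per tag for a word. The tag names and scores are made-up illustration values, and `rank_tags` is a hypothetical helper, not part of spaCy:

```python
def rank_tags(labels, scores):
    """Return the tag labels sorted from most to least probable."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pairs = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in pairs]


# Made-up scores for the word "red" in "Apples are red".
labels = ["NN", "JJ", "VB", "NNP"]
scores = [0.15, 0.70, 0.05, 0.10]
ranked = rank_tags(labels, scores)
print(ranked)
```

The first element of `ranked` is the tag the tagger would normally output; the remaining elements are the alternatives that a dictionary filter could fall back on.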

Member

honnibal commented Mar 11, 2018

Yes, this has been a requested feature that I'd like to see built.

For now you can do it in your code by subclassing the Tagger class and overriding the .predict() method. The current implementation:

    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for doc_scores in scores:
            doc_guesses = doc_scores.argmax(axis=1)
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

The scores variable will hold a list of numpy (or cupy if gpu) arrays, of length len(docs) and shape (len(doc), len(tags)). The function then needs to return a list of class IDs, which should be integer indices into the self.labels list. The metadata that describes the lookup table should go into the cfg dictionary if it's smallish (e.g. if you only want it to apply to frequent words and closed classes). You might want to populate your table during Tagger.begin_training(). The begin_training function receives a callable that produces an iterable over the training data.

If the dictionary is getting big, make sure you're doing the mapping with the integers, not the strings. If it's still too large, we can put it in a Cython variable. PreshMap is very space efficient.

As far as implementation goes, it's probably fine to just build a boolean array for each word in the doc, and then multiply the scores by that. If that's not fast enough, I have similar logic in the parser, see the arg_max_if_valid function within the spacy/syntax/nn_parser.pyx file.
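
The masking idea above could be sketched roughly as follows. This is a dependency-free illustration using plain lists, not spaCy code: `mask_and_pick`, the example labels, scores and masks are all hypothetical, and the fallback for tokens with no known constraint is an added assumption. With numpy or cupy arrays the same step would be an elementwise multiply followed by `argmax(axis=1)`.

```python
def mask_and_pick(doc_scores, doc_masks):
    """Pick, for each token, the best-scoring tag ID among the allowed ones.

    doc_scores: one row per token, one score per tag ID.
    doc_masks:  same shape, True where the tag is allowed for that token.
    """
    guesses = []
    for scores, mask in zip(doc_scores, doc_masks):
        if any(mask):
            # Multiplying by the boolean mask zeroes out disallowed tags.
            masked = [score * allowed for score, allowed in zip(scores, mask)]
        else:
            # Assumption: no constraint known for this token, keep plain argmax.
            masked = scores
        guesses.append(max(range(len(masked)), key=masked.__getitem__))
    return guesses


labels = ["JJ", "NN", "VB"]          # hypothetical tag set
doc_scores = [[0.40, 0.45, 0.15],    # e.g. "free": NN scores highest
              [0.10, 0.80, 0.10]]    # a word with no dictionary entry
doc_masks = [[True, False, True],    # the dictionary does not allow NN here
             [False, False, False]]
guesses = mask_and_pick(doc_scores, doc_masks)
print([labels[i] for i in guesses])
```

The returned values are integer indices into the label list, which matches what `predict()` is expected to return.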

Author

Oxi84 commented Mar 11, 2018

Thanks very much.

I actually want to apply the part-of-speech filter after training, as it would take a lot of time to learn how to train spaCy. Is it possible to get the probabilities as output after I run, for example, doc = nlp(u"This is a sentence.")?

I tried:

import spacy
from spacy.pipeline import Tagger
nlp = spacy.load('en')
doc = nlp(u"This is a sentence.")
print [e.tag_ for e in doc]
tagger = Tagger(nlp.vocab)
scores = tagger.predict(doc)
print "scores",scores 

But this reports an error.

Member

honnibal commented Mar 12, 2018

I haven't run the following, so it probably has bugs --- but something like this:

import numpy
import spacy
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token
from spacy.language import Language

# Per-doc array of tag scores, plus a per-token view into that array.
Doc.set_extension('tag_scores', default=None)
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            # Save the full score matrix before reducing it to the best tag.
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)
            # On GPU the result is a cupy array; copy it back to the host.
            if not isinstance(doc_guesses, numpy.ndarray):
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

# Make the pipeline build the subclass whenever a 'tagger' component is created.
Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)

This should:

  • Register an extension attribute that can be accessed at doc._.tag_scores
  • Register an extension attribute that can be accessed at token._.tag_scores
  • Subclass the tagger, so that during prediction, the tag scores are saved in the new extension attribute
  • Register the new class in the Language.factories dictionary, so that the subclass will be created.

The following should then work:

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence.")
doc._.tag_scores # Should be numpy array, with one row per token
doc[0]._.tag_scores # Should be one row of the above array, accessed as a view.

You can read more about the extension attributes and processing pipeline here: https://spacy.io/usage/processing-pipelines#section-custom-components

Author

Oxi84 commented Mar 12, 2018

Thanks. It doesn't work, because I get an error on the first line, Doc.set_extension('tag_scores', default=None).

But if you would like to try adding this feature to spaCy, please give me your email address so I can send you the code (one simple function) and a .txt file with a list of possible POS tags.

I also added a few rules that transform the predicted tags, for example:

    if predtag == "VBG":
        if tag_bef in ("DT", "JJ", "CD"):
            if predtag in possible_tags:
                predtag = "NN"

Author

Oxi84 commented Mar 12, 2018

Actually, here are a few lines I used to increase accuracy for the old perceptron tagger. I also added more features and made it bidirectional, so it achieved around 98.26% accuracy. I guess you also have bidirectional CNNs, though, and making the network bidirectional only adds around 0.3%, whereas this part should lift it by at least 0.5%, provided the CNN predicts the second most probable tag with accuracy similar to the perceptron's.

    import os

    # Get the list of words and possible tags from a txt file
    # (this can be done in a much better way for sure).
    #
    # Example line in the file:  free JJ VB VBP RB
    # (so you just need to split it in order to get the elements)
    new_ptlist_w = []  # words
    new_ptlist_t = []  # possible tags for the word at the same index
    script_dir = os.path.dirname(__file__)
    thepath = "word_tag_words.txt"
    thefile_path = os.path.join(script_dir, thepath)
    with open(thefile_path, "r") as f:
        for line in f:
            tlist = line.split()
            if not tlist:  # skip blank lines
                continue
            new_ptlist_w.append(tlist[0])
            new_ptlist_t.append(tlist[1:])

    wordtotag = "Free"  # for example

    # Get the possible tags for a given word.
    theindex = new_ptlist_w.index(wordtotag)
    possible_tags = new_ptlist_t[theindex]

    # After spaCy predicts a list of possible tags, you just remove the ones that are not in possible_tags.
    # Let's say the predicted tags are in predicted_tags_list, sorted from the highest to the lowest probability:
    predicted_tags_list = [tag for tag in predicted_tags_list if tag in possible_tags]

    predicted_tag = predicted_tags_list[0]
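
One way the lookup above might be tightened is to keep the word list in a dictionary rather than two parallel lists, and to fall back to the tagger's top guess when a word is unknown or no predicted tag survives the filter. The sketch below is only an illustration under those assumptions: `load_lexicon` and `choose_tag` are hypothetical names, the lowercase matching is an added assumption, and the in-memory lines stand in for the word_tag_words.txt file.

```python
def load_lexicon(lines):
    """Parse lines such as 'free JJ VB VBP RB' into {word: set_of_tags}."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words with no listed tags
        lexicon[parts[0].lower()] = set(parts[1:])
    return lexicon


def choose_tag(word, predicted_tags, lexicon):
    """Return the most probable predicted tag that the lexicon allows.

    predicted_tags must be sorted from highest to lowest probability.
    """
    possible_tags = lexicon.get(word.lower())
    if possible_tags:
        for tag in predicted_tags:
            if tag in possible_tags:
                return tag
    # Unknown word, or nothing survived the filter: keep the top guess.
    return predicted_tags[0]


lexicon = load_lexicon(["free JJ VB VBP RB", "", "red JJ NN"])
first = choose_tag("Free", ["NN", "JJ", "VB"], lexicon)
second = choose_tag("spaCy", ["NNP", "NN"], lexicon)
print(first, second)
```

The dictionary lookup avoids the linear scan of `list.index()` and does not raise an error for words that are missing from the file.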

@honnibal honnibal added enhancement Feature requests and improvements training Training and updating models labels Mar 27, 2018
lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 7, 2018