app.nlp.ner: Turning ChainNer2 into an annotator, many cleanups #32
Conversation
@andrewmccallum @samanz, please review and comment.
Since this is for CoNLL 2003 training data, let's name it something that doesn't have "NER2" in it.
Thanks for doing this! One thing: with the new lexicon objects we lose the BILOU labels on the lexicon features that my Lexicons class provided, which gave a boost in F1 when I originally trained this model. That should not be a big change to add to Lexicon. Also, if we want this model to work out of the box like the other models, we need our own Brown clusters file (which we could generate) and a trainer object with the default options ner.aggregate = true. Those are all things I can take care of after you merge, though. Otherwise it looks good to me. opts.sigmaSq is no longer used and can be removed. There were many problems before with deserializing a trained ChainNer2 model for some reason, and I'd like to test that now with the new serialization.
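To illustrate the BILOU lexicon features being discussed, here is a minimal sketch in Python of position-aware lexicon matching; the function name and shapes are illustrative only, not the project's actual Scala Lexicon API. A phrase match spanning several tokens emits B-/I-/L- prefixed features rather than a bare match indicator, and a single-token match emits U-:

```python
# Hypothetical sketch (illustrative names, not the actual Lexicon API).
# A lexicon phrase matching tokens i..j emits position-aware features:
# U- for a single-token match, B-/I-/L- for a multi-token match.

def bilou_lexicon_features(tokens, lexicon_phrases, lexicon_name):
    """Return one set of lexicon feature strings per token."""
    features = [set() for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for start in range(len(tokens)):
        for phrase in lexicon_phrases:
            n = len(phrase)
            if lowered[start:start + n] == [w.lower() for w in phrase]:
                if n == 1:
                    features[start].add("U-" + lexicon_name)
                else:
                    features[start].add("B-" + lexicon_name)
                    for k in range(start + 1, start + n - 1):
                        features[k].add("I-" + lexicon_name)
                    features[start + n - 1].add("L-" + lexicon_name)
    return features
```

For example, matching the phrase "new york city" over the tokens ["New", "York", "City", "mayor"] would tag the three tokens B-, I-, L- respectively, which gives the model positional information a plain membership feature loses.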
My plan is to extend this code so it will also work with OntoNotes labels, so I'm fine with naming it NER3. On Thu, Aug 8, 2013 at 1:35 PM, Andrew McCallum notifications@github.com wrote:
+1 for revisiting this. It would be great to have for comparison with previous work. Let me know if I can help with testing.
Changed the name to NER3. Is this ready to merge now?
@samanz, thanks for the comments. Is there value in keeping the code around for aggregate = false, bp = false, and twoStage = false? Vineet is working on some word embeddings which can already outperform Brown clusters on a worse NER system, and I think the plan going forward is to check those into lexicons and use them in this NER, so I'm not too worried about shipping a model with Brown clusters right now.
@alextp twoStage = false is important: it lets somebody who wants most of the accuracy without the added time penalty of running two models use only the first model. It should always be true during training. The other two are useful when adding new features: by selectively turning them on, you can replicate different experiments from the paper the features were derived from, and see how a new feature changes the outcome of a model with a specific set of features enabled. Since setting them all to true adds all the features, which gives the best results, they should all default to true.
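The twoStage control flow described above can be sketched as follows; this is a minimal Python illustration with made-up names (predict, build_secondary_features), not the actual ChainNer2/NER3 code:

```python
# Hypothetical sketch of the twoStage flow (illustrative names only).

def build_secondary_features(document, labels):
    # Illustrative stand-in: pair each token with its first-stage label.
    return list(zip(document, labels))

def predict(document, model1, model2, two_stage=True):
    """Run the first-stage model; optionally refine with the second stage."""
    labels = model1.predict(document)  # first pass over base features
    if two_stage:
        # Secondary features are built from the first-stage predictions,
        # then the second model re-labels the document.
        secondary = build_secondary_features(document, labels)
        labels = model2.predict(document, secondary)
    return labels
```

With two_stage=False only the first model runs, trading some accuracy for speed, which matches the use case described above.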
I see. Thanks for the explanation :-) On Thu, Aug 8, 2013 at 2:37 PM, Sam Anzaroot notifications@github.com wrote:
Also, I have to run initSecondaryFeatures only when using the second model. On Thu, Aug 8, 2013 at 2:42 PM, Alexandre Passos alexandre.tp@gmail.com wrote:
That's correct. It gets the features by looking at each token's attr[BilouConllNerLabel].categoryValue, which is hopefully the prediction from the previous model. This may not be the best way of doing it, since there is no guarantee that this is the source of the label; in particular, when a document is first loaded, categoryValue is the same as target.categoryValue. It isn't a big problem, since this code is only ever used here and we can control where the values come from, but it isn't perfect, and we need to take care that the wrong values don't seep in. Do you think we should have separate labels for the first-stage and second-stage models?
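The label-source concern above can be made concrete with a small sketch; this is illustrative Python with invented names (Token, secondary_features), not the actual attr[BilouConllNerLabel] machinery. The secondary features read whatever label is currently set on the token, so if the first stage has not yet overwritten it, they silently read the gold target:

```python
# Hypothetical sketch of the label-leak risk (illustrative names only).

class Token:
    def __init__(self, string, target_category):
        self.string = string
        self.target_category = target_category  # gold label
        # The mutable label starts out equal to the gold target on load.
        self.category = target_category

def secondary_features(tokens):
    # Reads the mutable label, which should hold the first-stage prediction.
    return ["PREV-LABEL=" + t.category for t in tokens]

tokens = [Token("Paris", "U-LOC")]
# Without running stage one, the feature leaks the gold label:
leaky = secondary_features(tokens)
# After stage one overwrites it with a prediction, the feature is legitimate:
tokens[0].category = "O"
clean = secondary_features(tokens)
```

Separate label slots for the two stages, as suggested, would rule this out by construction rather than by caller discipline.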
I'm fine with you merging this whenever it is ready.
Since the labels are created within the model, I think the current approach is fine. On Thu, Aug 8, 2013 at 2:57 PM, Sam Anzaroot notifications@github.com wrote:
Since ChainNer2 is the one NER system we have with state-of-the-art performance on CoNLL, I think it's desirable to revisit it and make it work in the new framework.
The current situation, where its maintenance amounts to "make it compile", is not ideal for something that so much effort was put into. This pull request turns it into an annotator and simplifies the code in a way that hopefully preserves correctness.