Questions RE your NER #62
Hey! It's still about 3-4% less accurate than Stanford and MITIE on OntoNotes. The plan is to switch to using DBPedia as a gazetteer, for entity linking.
In more detail:
I believe that the addition of syntactic features and a gazetteer will raise performance to around the state of the art on these evaluations. However, I don't really buy these data sets as good benchmarks: I believe they under-estimate the importance of gazetteers for real-world performance. My plan is to hash the DBPedia entities and hold them in memory. By designing the data structures carefully, and doing a bit of pruning, I think I can do entity linking entirely in memory. Currently, systems rely on DB queries, which imposes a lot of extra complexity.
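The in-memory linking idea above can be sketched roughly like this. This is a toy illustration, not spaCy's implementation: the hash function, alias table, and entity IDs are all made up for the example, and a real system would load millions of DBPedia aliases rather than two.

```python
import hashlib

def hash64(text):
    """Map a surface string to a 64-bit integer key."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

# Toy alias table: 64-bit hash of a lower-cased alias -> canonical entity ID.
# In practice this would be built from a DBPedia dump.
aliases = {
    hash64("amazon"): "Amazon.com_Inc.",
    hash64("mcdonald's"): "McDonald's",
}

def link(mention):
    """Entity linking as a single dict lookup -- no database round trip."""
    return aliases.get(hash64(mention.lower()))

print(link("Amazon"))
print(link("some unknown string"))
```

The point of the hashing is that only fixed-width integer keys need to be held in memory, not the alias strings themselves.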
Thank you very much for the fantastic response. I've found MITIE often does some unexpected things, like tagging Amazon as a location or McDonald's as a person. Would the gazetteer help with this?
The current model I'm shipping in spaCy will make this kind of mistake.
Sounds awesome. Are you thinking Bloom filters and count-min sketches?
The DBPedia gazetteer will be a great addition.
I'm thinking that won't be necessary. DBPedia has about 4 million entities, and we can probably prune away 25-50% of them, since there's probably a long tail. We probably want to store 2-3 aliases per entity. So we want to store about 12 million 64-bit hashes --- that's only ~100 MB! No need to do anything special. We'll also want a per-word gazetteer that asks "Does this word begin an entity of category X?". But this is a boolean value, and there's still plenty of room in the lexicon's bit vector --- I think I have about 30 values free. So this won't take any extra memory at all.
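As a back-of-envelope check of the figures above (assuming a packed table of raw 8-byte hashes, ignoring any container overhead):

```python
# ~4M DBPedia entities x ~3 aliases each -> ~12 million 64-bit keys
n_hashes = 12_000_000
bytes_each = 8                       # one 64-bit hash per alias
raw_mb = n_hashes * bytes_each / 1e6
print(raw_mb)                        # 96.0, i.e. roughly 100 MB
```

An actual Python dict would add per-entry overhead on top of this, so the ~100 MB figure assumes a compact layout such as a sorted array of hashes or an open-addressed table of packed integers.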
Do you think a gazetteer is better than just manually adding modern training data to the mix?
Hi. I'm seeing quite a few situations where the NER won't tag properly when an entire sentence or phrase has no capital letters. Admittedly, this is a problem across the board with most state-of-the-art taggers, last I checked. However, Alan Ritter did some work on this topic about a year ago: https://github.com/aritter/twitter_nlp tl;dr: he built a tagger that performed reasonably well on noisy Twitter text. Do you have any plans to add support for robust NER in noisy text? Alternatively, do you plan to add the ability to slot in other NER modules when necessary?
I'm also seeing situations where a sentence gets wrongly tagged (and parsed) because of wrong capitalization, and I'd also be interested in this use case (Twitter, noisy text).
Actually, I wrote a fairly naive NER for tweets a few months back, and wouldn't mind rewriting and updating it for this project. Of course, this assumes you are (a) looking for contributors, and (b) willing to wait a month or so while I finish off a couple of other projects.
Hi, I should be rolling out a new model with more robust training within a ...
Just pushed version 0.85. The NER should be a bit more robust, although it's still not great. I'm working on various fixes. One idea is to add corruption to the training data, e.g. swapping casing. I've noticed this is an effective trick in ASR and OCR, and thought it'd be good to put it in an NLP model. Initial results are promising. Still working on the gazetteer.
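A rough sketch of the casing-corruption idea: randomly lower-case or upper-case training tokens so the model can't lean entirely on capitalization. The function name and probabilities here are illustrative, not spaCy's actual implementation.

```python
import random

def corrupt_casing(words, p=0.3, rng=random):
    """Return a copy of `words` with casing randomly corrupted.

    Each word is lower-cased with probability p/2, upper-cased with
    probability p/2, and left unchanged otherwise.
    """
    out = []
    for w in words:
        r = rng.random()
        if r < p / 2:
            out.append(w.lower())
        elif r < p:
            out.append(w.upper())
        else:
            out.append(w)
    return out

rng = random.Random(0)  # seeded for reproducible augmentation
print(corrupt_casing(["Apple", "is", "opening", "in", "San", "Francisco"], rng=rng))
```

Applied to the training sentences, this forces the tagger to rely on context and lexical features in addition to case, which is exactly what noisy lower-cased text requires.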
Re: the gazetteer, I think it's great that it's based on DBpedia; that's a feature we're really looking forward to, as we already use this dataset. However, will it be possible to easily extend the gazetteer with our own lists? For example, we would like to link to restaurant names.
Yes, definitely. I want to have a black / grey / white list system, where the grey list is ...
This sounds fantastic. Can't wait to see how it performs.
It would also be nice if you could provide a case-insensitive model. The current model is basically useless for social media data such as tweets, where people often write in all lower case.
@honnibal Really excited about the planned addition of DBPedia. Have you made any progress towards that? |
@ma2rten I'm waiting for the lowercase NER feature too.
@honnibal is the lower case NER in progress? |
@honnibal Any progress on the lowercase NER? |
@honnibal And the gazetteer? |
I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.