
Questions RE your NER #62

Closed
AWNystrom opened this issue Apr 29, 2015 · 23 comments
Labels
enhancement Feature requests and improvements

Comments

@AWNystrom

I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?

@honnibal
Member

Hey,

It's still about 3-4% less accurate than Stanford and MITIE on OntoNotes and CoNLL '03. It's not an exact replication of anything, but I guess you could say it's using something like SEARN.

The plan is to switch to using DBPedia as a gazetteer, for entity linking.


@honnibal
Member

In more detail:

  • Uses same code as the shift/reduce parser, with a transition system based on the BILOU tagging scheme. Instead of coding the problem as a tagging task, we maintain a stack, with the transitions B, I, L, U and O, and constrain the actions so that {I, L} are only valid while an entity is on the stack.
  • At the moment, no syntactic features are being used. But this scheme will make it easy to do joint parsing and NER.
  • Ships a model trained on OntoNotes 5
  • On OntoNotes WSJ, scores ~82% evaluating all entity types; reportedly Stanford gets 85%
  • On CoNLL '03, scores ~86%; reportedly Stanford gets 90% there. MITIE reports 88%.
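For illustration only, the action constraint described in the first bullet might be sketched like this. The function and state names are hypothetical, not spaCy's internals:

```python
# Hypothetical sketch of the BILOU transition constraint: I(nside) and
# L(ast) are only valid while an entity is open on the stack, i.e. after
# a B(egin) that hasn't yet been closed by an L.

def valid_actions(entity_open):
    """Return the BILOU actions permitted in the current state."""
    if entity_open:
        return {"I", "L"}       # continue or close the open entity
    return {"B", "U", "O"}      # begin, unit-length entity, or outside

def apply_action(entity_open, action):
    """Advance the 'entity open' flag, rejecting invalid actions."""
    if action not in valid_actions(entity_open):
        raise ValueError(f"{action!r} is invalid in this state")
    if action == "B":
        return True             # an entity is now on the stack
    if action == "L":
        return False            # the entity is popped
    return entity_open          # I keeps it open; U and O stay outside
```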

I believe that the addition of syntactic features and gazetteer will raise performance to around the state-of-the-art on these evaluations. However, I don't really buy these data sets as good benchmarks.

I believe these evaluations under-estimate the importance of gazetteers for real-world performance. My plan is to hash the DBPedia entities and hold them in memory. By designing the data structures carefully, and doing a bit of pruning, I think I can do entity linking entirely in-memory. Current systems rely on DB queries, which imposes a lot of extra complexity.
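As a rough illustration of the in-memory idea (not spaCy code; the class and the hash function choice are assumptions made for the sketch):

```python
import hashlib

def hash64(text):
    """Stable 64-bit hash of a case-normalized alias string."""
    digest = hashlib.blake2b(text.lower().encode("utf8"), digest_size=8)
    return int.from_bytes(digest.digest(), "little")

class HashedGazetteer:
    """Holds entity aliases as 64-bit hashes, so a membership test is a
    pure in-memory set lookup with no database round-trip."""
    def __init__(self, aliases):
        self._hashes = {hash64(alias) for alias in aliases}

    def __contains__(self, span_text):
        return hash64(span_text) in self._hashes

gazetteer = HashedGazetteer(["Amazon", "McDonald's", "Amazon rainforest"])
```

At 8 bytes per hash, tens of millions of aliases fit in a few hundred megabytes, which is the point being made above.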

@AWNystrom
Author

Thank you very much for the fantastic response. I've found MITIE often does some unexpected things, like usually tagging Amazon as a location, or McDonald's as a person. Would the gazetteer help with this?

@honnibal
Member

honnibal commented May 4, 2015

Yes, definitely. These are good examples of the problem, thanks! "Amazon" occurs a handful of times in the CoNLL data, always as a reference to the Amazon rainforest. McDonald's doesn't occur at all. If the system is only trained on this data, it's going to get these really easy cases wrong.

The current model I'm shipping in spaCy will make this kind of mistake. Clever use of the DBPedia data should correct this. I want to be able to match whole entity spans, and I want to include some sort of prior about how "prominent" the entity is, e.g. by using Wikipedia page view stats, or link counts, or something.


@AWNystrom
Author

Sounds awesome. Are you thinking bloom filters and count min sketches?

@elyase
Contributor

elyase commented May 4, 2015

The DBPedia gazetteer will be a great addition.

@honnibal
Member

honnibal commented May 4, 2015

I'm thinking that won't be necessary.

DBPedia has about 4 million entities, we can probably prune away 25-50% of them, as there's probably a long tail. And we probably want to store 2-3 aliases per entity.

So, we want to store 12 million 64-bit hashes --- at 8 bytes each, that's only about 100 MB! No need to do anything special.

We'll also want a per-word gazetteer that asks "Does this word begin an entity of category X?". But this is a boolean value, and there's still plenty of room in the lexicon's bit vector --- I think I have about 30 values free. So, this won't take any extra memory at all.
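A minimal sketch of that per-word boolean idea, assuming a flags-bitfield lexicon (the bit positions and names here are made up for illustration):

```python
# Each lexicon entry carries an integer bit vector; "this word can begin
# an entity of category X" costs one bit per category, so it adds no
# memory beyond the existing per-word flags field.
B_PERSON = 1 << 0   # word can begin a PERSON entity
B_ORG    = 1 << 1   # word can begin an ORG entity
B_GPE    = 1 << 2   # word can begin a GPE (location) entity

lexicon = {}

def set_flag(word, flag):
    lexicon[word] = lexicon.get(word, 0) | flag

def check_flag(word, flag):
    return bool(lexicon.get(word, 0) & flag)

set_flag("McDonald's", B_ORG)
set_flag("Amazon", B_ORG)
set_flag("Amazon", B_GPE)   # "Amazon" can also begin a location name
```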

@AWNystrom
Author

Do you think a gazetteer is better than just manually adding modern training data to the mix?

@derekduoba

Hi. I'm seeing quite a few situations where the NER will not tag properly when an entire sentence or phrase does not have any capital letters. Admittedly, this is a problem across the board with most state-of-the-art taggers last I checked. However, Alan Ritter did some work on this topic about a year ago:

https://github.com/aritter/twitter_nlp
https://aritter.github.io/twitter_ner.pdf

tl;dr He created a tagger that performed reasonably well on noisy Twitter text.

Do you have any plans to add support for robust NER in noisy text? Alternatively, do you plan to add the ability to slot-in other NER modules when necessary?

@elyase
Contributor

elyase commented May 29, 2015

I am also seeing situations where a sentence gets tagged (and parsed) incorrectly because of unusual capitalization, and I would also be interested in this use case (Twitter, noisy text).

@derekduoba

Actually, I wrote a fairly naive NER for Tweets a few months back, and wouldn't mind rewriting and updating it for this project. Of course, this assumes you are A.) looking for contributors, and B.) willing to wait a month or so while I finish off a couple other projects.

@honnibal
Member

Hi,

I should be rolling out a new model with more robust training within a week. The new model still lacks a gazetteer, but at least it's trained on better data. A gazetteer is the next step after that.


@honnibal
Member

honnibal commented Jun 7, 2015

Just pushed version 0.85. The NER should be a bit more robust, although it's still not great.

I'm working on various fixes. One idea is to add corruption to the training data, e.g. swap casing etc. This is an effective trick used in ASR and OCR, and I thought it'd be good to try it in an NLP model. Initial results are promising.
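The casing-corruption idea might look roughly like this (a sketch with made-up probabilities, not the actual training code):

```python
import random

def corrupt_casing(words, rng=random):
    """Randomly re-case a training sentence so the model can't lean
    too heavily on capitalization cues."""
    roll = rng.random()
    if roll < 0.10:
        return [w.lower() for w in words]   # all lower case
    if roll < 0.15:
        return [w.upper() for w in words]   # ALL CAPS
    if roll < 0.20:
        return [w.title() for w in words]   # Title Case
    return list(words)                      # leave most examples alone
```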

Still working on the gazetteer.

@lechatpito

Wrt the gazetteer, I think it's great that it's based on DBpedia; that's a feature we're really looking forward to, as we already use this dataset. However, will it be possible to easily extend the gazetteer with our own lists? For example, we would like to link restaurant names.

@honnibal
Member

Yes, definitely.

I want to have a black / grey / white list system, where the grey list is used as a feature, and the black and white lists are deterministic.
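Something like the following sketch, where the names and the signature are hypothetical:

```python
def decide(span_text, model_prediction, blacklist, whitelist, greylist):
    """Black and white lists override the model deterministically;
    the grey list only contributes a feature the model can weigh."""
    if span_text in blacklist:
        return False, {}                  # never an entity
    if span_text in whitelist:
        return True, {}                   # always an entity
    features = {"in_grey_list": span_text in greylist}
    return model_prediction, features     # model decides, seeing the feature
```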


@AWNystrom
Author

This sounds fantastic. Can't wait to see how it performs.


@ma2rten

ma2rten commented Jul 23, 2015

It would also be nice if you could provide a case-insensitive model. The current model is basically useless for social media data, such as tweets, where people often write in all lower case.

@matichorvat

@honnibal Really excited about the planned addition of DBPedia. Have you made any progress towards that?

@forrestbao

@ma2rten I am expecting the lower case NER feature too.

@honnibal honnibal added the enhancement Feature requests and improvements label Jan 18, 2016
@bawongfai

@honnibal is the lower case NER in progress?
If not, would you mind giving some instructions on how to train that?
Thanks. :)

@icyc9

icyc9 commented Jul 21, 2016

@honnibal Any progress on the lowercase NER?

@cmuell89

@honnibal And the gazetteer?

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018