
Questions RE your NER #62

Closed
AWNystrom opened this issue Apr 29, 2015 · 23 comments
Labels
enhancement Feature requests and improvements

Comments

@AWNystrom

I see spaCy has an NER now. Very nice. I'm curious about how it compares to other NER systems. Have you benchmarked it on a standard dataset? What algorithm are you using? How does it compare to MITIE and Stanford?

@honnibal
Member

Hey,

It's still about 3-4% less accurate than Stanford and MITIE on OntoNotes and CoNLL '03. It's not an exact replication of anything, but I guess you could say it's using something like SEARN.

The plan is to switch to using DBPedia as a gazetteer, for entity linking.


@honnibal
Member

In more detail:

  • Uses same code as the shift/reduce parser, with a transition system based on the BILOU tagging scheme. Instead of coding the problem as a tagging task, we maintain a stack, with the transitions B, I, L, U and O, and constrain the actions so that {I, L} are only valid while an entity is on the stack.
  • At the moment, no syntactic features are being used. But this scheme will make it easy to do joint parsing and NER.
  • Ships a model trained on OntoNotes 5
  • On OntoNotes WSJ, scores ~82% evaluating all entity types; reportedly Stanford gets 85%
  • On CoNLL '03, scores ~86%; reportedly Stanford gets 90% there. MITIE reports 88%.
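For illustration only, the action constraint described in the first bullet might be sketched like this. The function and state names are hypothetical, not spaCy's internals:

```python
# Hypothetical sketch of the BILOU transition constraint: I(nside) and
# L(ast) are only valid while an entity is open on the stack, i.e. after
# a B(egin) that hasn't yet been closed by an L.

def valid_actions(entity_open):
    """Return the BILOU actions permitted in the current state."""
    if entity_open:
        return {"I", "L"}       # continue or close the open entity
    return {"B", "U", "O"}      # begin, unit-length entity, or outside

def apply_action(entity_open, action):
    """Advance the 'entity open' flag, rejecting invalid actions."""
    if action not in valid_actions(entity_open):
        raise ValueError(f"{action!r} is invalid in this state")
    if action == "B":
        return True             # an entity is now on the stack
    if action == "L":
        return False            # the entity is popped
    return entity_open          # I keeps it open; U and O stay outside
```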

I believe that the addition of syntactic features and gazetteer will raise performance to around the state-of-the-art on these evaluations. However, I don't really buy these data sets as good benchmarks.

I believe these evaluations under-estimate the importance of gazetteers for real-world performance. My plan is to hash the DBPedia entities and hold them in memory. By designing the data structures carefully, and doing a bit of pruning, I think I can do entity linking entirely in-memory. Current systems rely on DB queries, which imposes a lot of extra complexity.
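As a rough illustration of the in-memory idea (not spaCy code; the class and the hash function choice are assumptions made for the sketch):

```python
import hashlib

def hash64(text):
    """Stable 64-bit hash of a case-normalized alias string."""
    digest = hashlib.blake2b(text.lower().encode("utf8"), digest_size=8)
    return int.from_bytes(digest.digest(), "little")

class HashedGazetteer:
    """Holds entity aliases as 64-bit hashes, so a membership test is a
    pure in-memory set lookup with no database round-trip."""
    def __init__(self, aliases):
        self._hashes = {hash64(alias) for alias in aliases}

    def __contains__(self, span_text):
        return hash64(span_text) in self._hashes

gazetteer = HashedGazetteer(["Amazon", "McDonald's", "Amazon rainforest"])
```

At 8 bytes per hash, tens of millions of aliases fit in a few hundred megabytes, which is the point being made above.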

@AWNystrom
Author

Thank you very much for the fantastic response. I've found MITIE often does some unexpected things, like usually tagging Amazon as a location, or McDonald's as a person. Would the gazetteer help with this?

@honnibal
Member

honnibal commented May 4, 2015

Yes, definitely. These are good examples of the problem, thanks! "Amazon" occurs a handful of times in the CoNLL data, always as a reference to the Amazon rainforest. McDonald's doesn't occur at all. If the system is only trained on this data, it's going to get these really easy cases wrong.

The current model I'm shipping in spaCy will make this kind of mistake. Clever use of the DBPedia data should correct this. I want to be able to match whole entity spans, and I want to include some sort of prior about how "prominent" the entity is, e.g. by using Wikipedia page view stats, or link counts, or something.


@AWNystrom
Author

Sounds awesome. Are you thinking bloom filters and count min sketches?

@elyase
Contributor

elyase commented May 4, 2015

The DBPedia gazetteer will be a great addition.

@honnibal
Member

honnibal commented May 4, 2015

I'm thinking that won't be necessary.

DBPedia has about 4 million entities, we can probably prune away 25-50% of them, as there's probably a long tail. And we probably want to store 2-3 aliases per entity.

So, we want to store 12 million 64-bit hashes --- at 8 bytes each, that's only about 100 MB! No need to do anything special.

We'll also want a per-word gazetteer that asks "Does this word begin an entity of category X?". But this is a boolean value, and there's still plenty of room in the lexicon's bit vector --- I think I have about 30 values free. So, this won't take any extra memory at all.
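A minimal sketch of that per-word boolean idea, assuming a flags-bitfield lexicon (the bit positions and names here are made up for illustration):

```python
# Each lexicon entry carries an integer bit vector; "this word can begin
# an entity of category X" costs one bit per category, so it adds no
# memory beyond the existing per-word flags field.
B_PERSON = 1 << 0   # word can begin a PERSON entity
B_ORG    = 1 << 1   # word can begin an ORG entity
B_GPE    = 1 << 2   # word can begin a GPE (location) entity

lexicon = {}

def set_flag(word, flag):
    lexicon[word] = lexicon.get(word, 0) | flag

def check_flag(word, flag):
    return bool(lexicon.get(word, 0) & flag)

set_flag("McDonald's", B_ORG)
set_flag("Amazon", B_ORG)
set_flag("Amazon", B_GPE)   # "Amazon" can also begin a location name
```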

@AWNystrom
Author

Do you think a gazetteer is better than just manually adding modern training data to the mix?

@derekduoba

Hi. I'm seeing quite a few situations where the NER will not tag properly when an entire sentence or phrase does not have any capital letters. Admittedly, this is a problem across the board with most state-of-the-art taggers last I checked. However, Alan Ritter did some work on this topic about a year ago:

https://github.com/aritter/twitter_nlp
https://aritter.github.io/twitter_ner.pdf

tl;dr He created a tagger that performed reasonably well on noisy Twitter text.

Do you have any plans to add support for robust NER in noisy text? Alternatively, do you plan to add the ability to slot-in other NER modules when necessary?

@elyase
Contributor

elyase commented May 29, 2015

I am also seeing situations where a sentence gets tagged (and parsed) incorrectly because of unusual capitalization, and I would also be interested in this use case (Twitter, noisy text).

@derekduoba

Actually, I wrote a fairly naive NER for Tweets a few months back, and wouldn't mind rewriting and updating it for this project. Of course, this assumes you are A.) looking for contributors, and B.) willing to wait a month or so while I finish off a couple other projects.

@honnibal
Member

Hi,

I should be rolling out a new model with more robust training within a week. The new model still lacks a gazetteer, but at least it's trained on better data. A gazetteer is the next step after that.


@honnibal
Member

honnibal commented Jun 7, 2015

Just pushed version 0.85. The NER should be a bit more robust, although it's still not great.

I'm working on various fixes. One idea is to add corruption to the training data, e.g. swap casing etc. This is an effective trick used in ASR and OCR, and I thought it'd be good to try it in an NLP model. Initial results are promising.
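The casing-corruption idea might look roughly like this (a sketch with made-up probabilities, not the actual training code):

```python
import random

def corrupt_casing(words, rng=random):
    """Randomly re-case a training sentence so the model can't lean
    too heavily on capitalization cues."""
    roll = rng.random()
    if roll < 0.10:
        return [w.lower() for w in words]   # all lower case
    if roll < 0.15:
        return [w.upper() for w in words]   # ALL CAPS
    if roll < 0.20:
        return [w.title() for w in words]   # Title Case
    return list(words)                      # leave most examples alone
```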

Still working on the gazetteer.

@lechatpito

Wrt the gazetteer, I think it's great that it's based on DBpedia; that's a feature we're really looking forward to, as we already use this dataset. However, will it be possible to easily extend the gazetteer with our own lists? For example, we would like to link restaurant names.

@honnibal
Member

Yes, definitely.

I want to have a black / grey / white list system, where the grey list is used as a feature, and the black and white lists are deterministic.
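Something like the following sketch, where the names and the signature are hypothetical:

```python
def decide(span_text, model_prediction, blacklist, whitelist, greylist):
    """Black and white lists override the model deterministically;
    the grey list only contributes a feature the model can weigh."""
    if span_text in blacklist:
        return False, {}                  # never an entity
    if span_text in whitelist:
        return True, {}                   # always an entity
    features = {"in_grey_list": span_text in greylist}
    return model_prediction, features     # model decides, seeing the feature
```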


@AWNystrom
Author

This sounds fantastic. Can't wait to see how it performs.


@ma2rten

ma2rten commented Jul 23, 2015

It would also be nice if you could provide a case-insensitive model. The current model is basically useless for social media data, such as tweets, where people often write in all lower case.

@matichorvat

@honnibal Really excited about the planned addition of DBPedia. Have you made any progress towards that?

@forrestbao

@ma2rten I am expecting the lower case NER feature too.

@honnibal honnibal added the enhancement Feature requests and improvements label Jan 18, 2016
@bawongfai

@honnibal is the lower case NER in progress?
If not, would you mind giving some instructions on how to train that?
Thanks. :)

@icyc9

icyc9 commented Jul 21, 2016

@honnibal Any progress on the lowercase NER?

@cmuell89

@honnibal And the gazetteer?

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018