
Example code for Spacy Entity Linking? #4511

Closed
davidbernat opened this issue Oct 23, 2019 · 29 comments
Labels
feat / nel Feature: Named Entity linking usage General spaCy usage

Comments

@davidbernat

Apologies for what is likely a simple failure to find the right documentation. I understand spaCy recently added entity linking. How do I enable this in the default pipeline? Which model do I need to install to get this capability? Sample code would be very helpful.

import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("entity_linker"))
doc = nlp("The Democratic Party has cycled through various candidates.")

Throws error:
Model for component 'entity_linker' not initialized. Did you forget to load a model, or forget to call begin_training()?

Where do I get the model? Thanks!
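
For context, an entity linker resolves a mention against a knowledge base that maps aliases (surface forms) to candidate entity IDs with prior probabilities. Below is a toy pure-Python sketch of that idea only; it is not the spaCy API, and the second entity ID is hypothetical:

```python
# Toy "knowledge base": alias -> list of (entity ID, prior probability).
# Q29552 is the Wikidata ID for the Democratic Party (United States);
# the second candidate is a hypothetical alternative sense.
TOY_KB = {
    "Democratic Party": [("Q29552", 0.85), ("Q000001", 0.05)],
}

def link_by_prior(mention):
    """Return the candidate entity ID with the highest prior, or None."""
    candidates = TOY_KB.get(mention, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

print(link_by_prior("Democratic Party"))  # Q29552
print(link_by_prior("Unknown mention"))   # None
```

A real linker (spaCy's included) also scores candidates against the sentence context rather than relying on priors alone; the error above means no such KB or trained weights have been attached to the component yet.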

@AmoghM

AmoghM commented Oct 23, 2019

@davidbernat I think these two pointers should help you out:

  1. https://spacy.io/usage/linguistic-features#entity-linking
  2. https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking

@davidbernat
Author

Thanks. I'd read those previously. There is no pre-trained model for download? This surprises me.

@AmoghM

AmoghM commented Oct 23, 2019

Link #2 talks about downloading the Wikidata KB. Are you talking about something else?

@davidbernat
Author

Item 2 discusses downloading the spaCy Wikipedia knowledge base and using the spaCy training paradigm to train a model. It does not describe downloading a pre-trained model, which surprises me. Surely an already-trained model exists out there? And such a pre-trained model would also be useful as a warm start when building other knowledge bases.

@petulla

petulla commented Oct 24, 2019

@AmoghM Can you clarify how step 2 is run? I find the instructions confusing.

Do train instances and dev instances need to be set? The annotations in wikidata_train_entity_linker.py say the split defaults to 90/10, but judging from the code and the command-line output, the script seems to bork out if they're not set.

I'm running python wikidata_train_entity_linker.py './output' where output has my kb directory.

@svlandeg
Member

svlandeg commented Oct 24, 2019

@davidbernat on the original question: the NEL functionality is still in a sort of beta phase, as we're still working on refining the data and models. All the APIs etc. have been implemented, though, and can be used for early experimentation. This does indeed mean training your own model on Wikipedia/Wikidata dumps that you have to download, as detailed in the README file linked by @AmoghM.

@petulla: I'm not entirely sure what you mean by "the script seems to bork out", but the script should indeed work if train_inst and dev_inst are not set. This will extract all data from Wikipedia (which is quite a lot) and by default assign each article whose ID ends in "3" (an arbitrary choice) to the dev set. If you run into further issues when running this script, feel free to open a separate issue.
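
The default split heuristic described above can be sketched as follows (an illustration of the rule only, not the script's actual code):

```python
def goes_to_dev(article_id):
    """Default heuristic: articles whose ID ends in "3" land in the dev set."""
    return str(article_id).endswith("3")

article_ids = [101, 102, 103, 113, 120, 133]
dev = [a for a in article_ids if goes_to_dev(a)]
train = [a for a in article_ids if not goes_to_dev(a)]
print(train, dev)  # over many articles this gives roughly a 90/10 split
```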

@svlandeg svlandeg added feat / nel Feature: Named Entity linking usage General spaCy usage labels Oct 24, 2019
@davidbernat
Author

@svlandeg Thanks Sofie! And great presentations by you on the feature floating around the web!

  1. Could you kindly share some rough parameters for the training routine, in your experience? Which hyperparameters? How many epochs? Typical wall-clock duration and performance characteristics?

So that I know I am training to approximately state-of-the-art (or state-of-your-art). The same information for a typical pre-training warm start for a new model would be superbly helpful.

  2. Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

I am really looking to include this feature in a release I am making this weekend.

  3. What other examples of entity linking models have you made? I'm building a tool for scientific journal article comprehension. Thinking of word sense disambiguation as well.

Would love to discuss NLP via email. If you're open to it.

Thanks!

@svlandeg
Member

Happy to hear you've enjoyed the presentations! It's also good to see the community interest in the EL work. Unfortunately it's taking us a little longer than expected to get a proper model out, as we've run into several issues with the data, also want to add coreference resolution, etc. This is the reason we haven't officially released this yet... but any feedback on the current implementation is of course very welcome!

Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

Not one I'm satisfied with, no, hence the work-in-progress ;-)

@davidbernat
Author

Oh pish posh! :-) What do they say? Show your drafts early? ;-)

OK. What co-ref are you using? Hugging Face's impressed me with its demo accuracy.
https://github.com/huggingface/neuralcoref

If you change your mind re: models, please email me. Will be appreciated.

@svlandeg
Member

Yes - we're actually working together with Hugging Face to keep neuralcoref up-to-date; see e.g. huggingface/neuralcoref#211. It will be more closely integrated with spaCy in the future!

And I do show my drafts early - it's all on the current master branch :-)

@davidbernat
Author

I'm confused. Are you saying there is a trained model on the master branch?

@alepiscopo

Hi, I'm trying to train a model using the scripts in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking.
Do you have any estimate of how long it should take? I'm using a 4-core GCP instance with 128GB RAM.
Thanks.

@miikargh

miikargh commented Nov 6, 2019

Hi,

Great work on making EL more accessible!

I'm also trying to train a model for the Finnish language, and I'm confused about the "model" parameter that wikidata_pretrain_kb.py requires. What is this "model"?

Thanks in advance!

@cwulfman

cwulfman commented Nov 6, 2019

I'm really excited to see this work happening! Maybe I'm being too eager to use this WIP, but I've tried twice to run wikidata_train_entity_linker.py and each time, after 36 hours or so, I get a memory error:

[1]+ python ./bin/wiki_entity_linking/wikidata_train_entity_linker.py ~/projects/nel/out/ &
$ python ./bin/wiki_entity_linking/wikidata_train_entity_linker.py ~/projects/nel/out/
2019-11-05 18:28:42,575 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2019-11-05 18:28:42,575 - INFO - __main__ - STEP 1a: Loading model from /Users/cwulfman/projects/nel/out/nlp_kb
2019-11-05 18:28:59,520 - INFO - __main__ - STEP 1b: Loading KB from /Users/cwulfman/projects/nel/out/kb
2019-11-05 18:29:09,389 - INFO - __main__ - STEP 2: Reading training dataset from /Users/cwulfman/projects/nel/out/gold_entities.jsonl
2019-11-05 18:29:09,389 - INFO - bin.wiki_entity_linking.wikipedia_processor - Reading train data with limit None

[edited]

1122567it [17:20:10, 66.05it/s]
/Users/cwulfman/.pyenv/versions/3.8.0/lib/python3.8/multiprocessing/resource_tracker.py:203: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Is there anything I can do to fix this, or should I just be patient and wait for @svlandeg to release a model she's happy with? 😉

@alepiscopo

My training script keeps failing as well (with 256GB RAM). It loads gold_entities.json in memory and then fails with the following message:
MemoryError: Unable to allocate array with shape (4728, 8, 64, 2) and data type float32
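
For what it's worth, the failing allocation itself is small; a quick back-of-the-envelope calculation shows the array in that message is only about 18.5 MiB, so the MemoryError means the process had already exhausted available memory before this final allocation:

```python
# Array from the error message: shape (4728, 8, 64, 2), dtype float32.
n_elements = 4728 * 8 * 64 * 2       # number of float32 values
size_mib = n_elements * 4 / 1024**2  # 4 bytes per float32
print(f"{n_elements} floats = {size_mib:.1f} MiB")
```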

@sharner

sharner commented Nov 12, 2019

Try limiting the number of training examples, as in python wikidata_train_entity_linker.py spacy_el -t 50000 -d 10000.

@alepiscopo

That seems to be working – it feels weird to have memory problems, even after having upgraded to 312GB RAM. I guess I'll have to find the optimal balance of RAM and training data size.

@StudyExchange

Happy to hear you've enjoyed the presentations! It's also good to see the community interest in the EL work. Unfortunately it's taking us a little longer than expected to get a proper model out, as we've run into several issues with the data, also want to add coreference resolution, etc. This is the reason we haven't officially released this yet... but any feedback on the current implementation is of course very welcome!

Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

Not one I'm satisfied with, no, hence the work-in-progress ;-)

Even if the official entity linking model is not perfect yet, could you please release a version as a demo?

@petulla

petulla commented Nov 21, 2019

@StudyExchange if you need an entity linking model and are OK with something slightly dated and pre-baked, the DBpedia Spotlight API might suffice. https://www.dbpedia-spotlight.org/demo/

@StudyExchange

@StudyExchange if you need an entity linking model and are OK with something slightly dated and pre-baked, the DBpedia Spotlight API might suffice. https://www.dbpedia-spotlight.org/demo/

Thank you! DBpedia is OK for getting a basic feel for entity linking.

@zainbnv

zainbnv commented Dec 10, 2019

Hi,

Great work on making EL more accessible!

I'm also trying to train a model for the Finnish language, and I'm confused about the "model" parameter that wikidata_pretrain_kb.py requires. What is this "model"?

Thanks in advance!

Here, "model" can be any spaCy language model, for example 'en_core_web_lg'.

@svlandeg
Member

svlandeg commented Dec 16, 2019

@alepiscopo, @cwulfman, @petulla, @kevingeng et al (also in response to Issue #4544):

PR #4811 addresses the memory requirements for training the Entity Linking pipe.
On a machine with 16GB, it now takes me about 10h to run one epoch with 50,000 WP articles (-t 50000). Note that -t and -d have been modified and now refer to the number of ARTICLES, not the number of entities, so 50,000 is already quite a big training set!

Also, there is now a more informative progress bar that lets you estimate how long one epoch will take.

If there are any more issues after merging / trying this PR - please feel free to open a new Issue.

[Edit 6 April 2020]: this was the original command I ran:

python wikidata_train_entity_linker.py KB/ -o EL_50000/ -t 50000 -d 100 -l CARDINAL,DATE,MONEY,ORDINAL,QUANTITY,TIME,PERCENT -e 5

@icochico

icochico commented Jan 9, 2020

@svlandeg Thanks so much for this amazing work! I am in the process of testing the entity linking and creating a custom model. Is there any actual example of how to run it? Would you mind providing a sample invocation of wikidata_pretrain_kb.py? That would be very helpful, thank you.

@svlandeg
Member

svlandeg commented Jan 9, 2020

I'm not sure what you mean by an example on how to run the script - you just need to fill in the appropriate parameters and it'll run for you. See also the readme file at https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking.

@svlandeg
Member

svlandeg commented Jan 9, 2020

For a quick test (only using a small part of the data to see whether the script runs), use -lt 20000 -lp 2000 -lw 3000 -f 1

@icochico

icochico commented Jan 9, 2020

@svlandeg Thanks for your quick response. I am referring to the somewhat unclear part of the documentation in the README.

  1. Do the dump files need to be extracted before they can be fed to the script?
  2. Does spaCy need to be built from source for this to work? Or can we just use a released version like 2.2.3, for example?
  3. When I try to run the script with SpaCy from source using the model en-core-web-sm-2.2.5, I get this Error: ValueError: thinc.extra.search.Beam size changed, may indicate binary incompatibility. Expected 120 from C header, got 112 from PyObject

Thanks for your quick test sample parameters :)

@zainbnv

zainbnv commented Jan 10, 2020

@svlandeg Thanks for your quick response. I am referring to the somewhat unclear part of the documentation in the README.

1. Do the dump files need to be extracted before they can be fed to the script?

2. Does spaCy need to be built from source for this to work? Or can we just use a released version like `2.2.3`, for example?

3. When I try to run the script with SpaCy from source using the model `en-core-web-sm-2.2.5`, I get this Error: `ValueError: thinc.extra.search.Beam size changed, may indicate binary incompatibility. Expected 120 from C header, got 112 from PyObject`

Thanks for your quick test sample parameters :)

  1. You don't need to extract the file.

  2. You simply need spaCy 2.2.3 installed in your environment (with an updated en_core_web_sm).

@icochico

icochico commented Jan 10, 2020

@svlandeg @zainbnv Thanks both for your help!

I was able to fix the issues on a clean conda environment using spaCy 2.2.3 and model en-core-web-lg==2.2.5

@icochico

icochico commented Jan 11, 2020

Fixed it; it was another environment issue. FYI, for anyone interested, I am running with:
python3 ./bin/wiki_entity_linking/wikidata_train_entity_linker.py output_kb_pretrain/ -t 50000 -o output_train/

@lock lock bot locked as resolved and limited conversation to collaborators Feb 11, 2020
@explosion explosion deleted a comment from lock bot Dec 11, 2020