
Example code for Spacy Entity Linking? #4511

Closed
davidbernat opened this issue Oct 23, 2019 · 29 comments
Labels
feat / nel Feature: Named Entity linking usage General spaCy usage

Comments

@davidbernat

Apologies for what is likely a simple failure to find the right documentation. I understand spaCy recently added entity linking. How do I enable this in the default pipeline? Which model do I need to install to get this capability? Sample code would be very helpful.

import spacy
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(nlp.create_pipe("entity_linker"))
doc = nlp("The Democratic Party has cycled through various candidates.")

Throws error:
Model for component 'entity_linker' not initialized. Did you forget to load a model, or forget to call begin_training()?

Where do I get the model? Thanks!
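
For context, an entity linker resolves a mention against a knowledge base that maps aliases (surface forms) to candidate entity IDs with prior probabilities. Below is a toy pure-Python sketch of that idea only; it is not the spaCy API, and the second entity ID is hypothetical:

```python
# Toy "knowledge base": alias -> list of (entity ID, prior probability).
# Q29552 is the Wikidata ID for the Democratic Party (United States);
# the second candidate is a hypothetical alternative sense.
TOY_KB = {
    "Democratic Party": [("Q29552", 0.85), ("Q000001", 0.05)],
}

def link_by_prior(mention):
    """Return the candidate entity ID with the highest prior, or None."""
    candidates = TOY_KB.get(mention, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

print(link_by_prior("Democratic Party"))  # Q29552
print(link_by_prior("Unknown mention"))   # None
```

A real linker (spaCy's included) also scores candidates against the sentence context rather than relying on priors alone; the error above means no such KB or trained weights have been attached to the component yet.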

@AmoghM

AmoghM commented Oct 23, 2019

@davidbernat I think these two pointers should help you out:

  1. https://spacy.io/usage/linguistic-features#entity-linking
  2. https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking

@davidbernat
Author

Thanks. I'd read those previously. There is no pre-trained model for download? This surprises me.

@AmoghM

AmoghM commented Oct 23, 2019

Link #2 talks about downloading the Wikidata KB. Are you talking about something else?

@davidbernat
Author

Item 2 discusses downloading the spaCy Wikipedia knowledge base and using the spaCy training paradigm to train a model. It does not describe downloading a pre-trained model, which surprises me. Surely an already-trained model exists out there? And such a pre-trained model would also be useful as a warm start when building other knowledge bases.

@petulla

petulla commented Oct 24, 2019

@AmoghM Can you clarify how step 2 is run? I find the instructions confusing.

Do train instances and dev instances need to be set? The annotations in wikidata_train_entity_linker.py say the split defaults to 90/10, but judging from the code and the command-line output, the script seems to bork out if they're not set.

I'm running python wikidata_train_entity_linker.py './output' where output has my kb directory.

@svlandeg
Member

svlandeg commented Oct 24, 2019

@davidbernat on the original question: the NEL functionality is still in a sort of beta phase, as we're still working on refining the data and models. All the APIs etc. have been implemented, though, and can be used for early experimentation. This does indeed mean training your own model on Wikipedia/Wikidata dumps that you have to download, as detailed in the README file linked by @AmoghM.

@petulla: I'm not entirely sure what you mean by "the script seems to bork out", but the script should indeed work if train_inst and dev_inst are not set. This will extract all data from Wikipedia (which is quite a lot) and by default assign each article whose ID ends in "3" (an arbitrary choice) to the dev set. If you run into further issues when running this script, feel free to open a separate issue.
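
The default split heuristic described above can be sketched as follows (an illustration of the rule only, not the script's actual code):

```python
def goes_to_dev(article_id):
    """Default heuristic: articles whose ID ends in "3" land in the dev set."""
    return str(article_id).endswith("3")

article_ids = [101, 102, 103, 113, 120, 133]
dev = [a for a in article_ids if goes_to_dev(a)]
train = [a for a in article_ids if not goes_to_dev(a)]
print(train, dev)  # over many articles this gives roughly a 90/10 split
```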

@svlandeg svlandeg added feat / nel Feature: Named Entity linking usage General spaCy usage labels Oct 24, 2019
@davidbernat
Author

@svlandeg Thanks Sofie! And great presentations by you on the feature floating around the web!

  1. Could you kindly share some rough parameters for the training routine, in your experience? Which hyperparameters? How many epochs? Typical wall-clock duration and performance characteristics?

So that I know I am training to approximately state-of-the-art (or state-of-your-art). The same information for a typical pre-training warm start for a new model would be superbly helpful.

  2. Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

I am really looking to include this feature in a release I am making this weekend.

  3. What other examples of entity linking models have you made? I'm building a tool for scientific journal article comprehension. Thinking of word sense disambiguation as well.

Would love to discuss NLP via email. If you're open to it.

Thanks!

@svlandeg
Member

Happy to hear you've enjoyed the presentations! It's also good to see the community interest in the EL work. Unfortunately it's taking us a little longer than expected to get a proper model out, as we've run into several issues with the data, also want to add coreference resolution, etc. This is the reason we haven't officially released this yet... but any feedback on the current implementation is of course very welcome!

Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

Not one I'm satisfied with, no, hence the work-in-progress ;-)

@davidbernat
Author

Oh pish posh! :-) What do they say? Show your drafts early? ;-)

OK. What co-ref are you using? Hugging Face's impressed me with its demo accuracy.
https://github.com/huggingface/neuralcoref

If you change your mind re: models, please email me. Will be appreciated.

@svlandeg
Member

Yes - we're actually working together with Hugging Face to keep neuralcoref up-to-date; see e.g. huggingface/neuralcoref#211. It will be more closely integrated with spaCy in the future!

And I do show my drafts early - it's all on the current master branch :-)

@davidbernat
Author

I'm confused. Are you saying there is a trained model on the master branch?

@alepiscopo

Hi, I'm trying to train a model using the scripts in https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking.
Do you have any estimate of how long it should take? I'm using a 4-core GCP instance with 128GB RAM.
Thanks.

@miikargh

miikargh commented Nov 6, 2019

Hi,

Great work on making EL more accessible!

I'm also trying to train a model for the Finnish language, and I'm confused about the "model" parameter that wikidata_pretrain_kb.py requires. What is this "model"?

Thanks in advance!

@cwulfman

cwulfman commented Nov 6, 2019

I'm really excited to see this work happening! Maybe I'm being too eager to use this WIP, but I've tried twice to run wikidata_train_entity_linker.py and each time, after 36 hours or so, I get a memory error:

[1]+ python ./bin/wiki_entity_linking/wikidata_train_entity_linker.py ~/projects/nel/out/ &
$ python ./bin/wiki_entity_linking/wikidata_train_entity_linker.py ~/projects/nel/out/
2019-11-05 18:28:42,575 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2019-11-05 18:28:42,575 - INFO - __main__ - STEP 1a: Loading model from /Users/cwulfman/projects/nel/out/nlp_kb
2019-11-05 18:28:59,520 - INFO - __main__ - STEP 1b: Loading KB from /Users/cwulfman/projects/nel/out/kb
2019-11-05 18:29:09,389 - INFO - __main__ - STEP 2: Reading training dataset from /Users/cwulfman/projects/nel/out/gold_entities.jsonl
2019-11-05 18:29:09,389 - INFO - bin.wiki_entity_linking.wikipedia_processor - Reading train data with limit None

[edited]

1122567it [17:20:10, 66.05it/s]
/Users/cwulfman/.pyenv/versions/3.8.0/lib/python3.8/multiprocessing/resource_tracker.py:203: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Is there anything I can do to fix this, or should I just be patient and wait for @svlandeg to release a model she's happy with? 😉

@alepiscopo

My training script keeps failing as well (with 256GB RAM). It loads gold_entities.json in memory and then fails with the following message:
MemoryError: Unable to allocate array with shape (4728, 8, 64, 2) and data type float32
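
For what it's worth, the failing allocation itself is small; a quick back-of-the-envelope calculation shows the array in that message is only about 18.5 MiB, so the MemoryError means the process had already exhausted available memory before this final allocation:

```python
# Array from the error message: shape (4728, 8, 64, 2), dtype float32.
n_elements = 4728 * 8 * 64 * 2       # number of float32 values
size_mib = n_elements * 4 / 1024**2  # 4 bytes per float32
print(f"{n_elements} floats = {size_mib:.1f} MiB")
```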

@sharner

sharner commented Nov 12, 2019

Try limiting the number of training examples, as in python wikidata_train_entity_linker.py spacy_el -t 50000 -d 10000.

@alepiscopo

That seems to be working – it feels weird to have memory problems, even after having upgraded to 312GB RAM. I guess I'll have to find the optimal balance of RAM and training data size.

@StudyExchange

Happy to hear you've enjoyed the presentations! It's also good to see the community interest in the EL work. Unfortunately it's taking us a little longer than expected to get a proper model out, as we've run into several issues with the data, also want to add coreference resolution, etc. This is the reason we haven't officially released this yet... but any feedback on the current implementation is of course very welcome!

Surely you have a trained model sitting on your hard drive somewhere.. no? :-)

Not one I'm satisfied with, no, hence the work-in-progress ;-)

Even if the official entity linking model is not perfect yet, could you please release a version as a demo?

@petulla

petulla commented Nov 21, 2019

@StudyExchange if you need an entity linking model and are OK with something slightly dated and pre-baked, the DBpedia Spotlight API might suffice. https://www.dbpedia-spotlight.org/demo/

@StudyExchange

@StudyExchange if you need an entity linking model and are OK with something slightly dated and pre-baked, the DBpedia Spotlight API might suffice. https://www.dbpedia-spotlight.org/demo/

Thank you! DBpedia is OK for getting a basic feel for entity linking.

@zainbnv

zainbnv commented Dec 10, 2019

Hi,

Great work on making EL more accessible!

I'm also trying to train a model for the Finnish language, and I'm confused about the "model" parameter that wikidata_pretrain_kb.py requires. What is this "model"?

Thanks in advance!

Here, "model" can be any spaCy language model, for example 'en_core_web_lg'.

@svlandeg
Member

svlandeg commented Dec 16, 2019

@alepiscopo, @cwulfman, @petulla, @kevingeng et al (also in response to Issue #4544):

PR #4811 addresses the memory requirements for training the Entity Linking pipe.
On a machine with 16GB, it now takes me about 10h to run one epoch with 50,000 WP articles (-t 50000). Note that -t and -d have been modified and now refer to the number of ARTICLES, not the number of entities, so 50,000 is already quite a big training set!

Also, there is now a more informative progress bar that lets you estimate how long one epoch will take.

If there are any more issues after merging / trying this PR - please feel free to open a new Issue.

[Edit 6 April 2020]: this was the original command I ran:

python wikidata_train_entity_linker.py KB/ -o EL_50000/ -t 50000 -d 100 -l CARDINAL,DATE,MONEY,ORDINAL,QUANTITY,TIME,PERCENT -e 5

@icochico

icochico commented Jan 9, 2020

@svlandeg Thanks so much for this amazing work! I am in the process of testing the entity linking and creating a custom model. Is there any actual example of how to run it? Would you mind providing a sample invocation of wikidata_pretrain_kb.py? That would be very helpful, thank you.

@svlandeg
Member

svlandeg commented Jan 9, 2020

I'm not sure what you mean by an example on how to run the script - you just need to fill in the appropriate parameters and it'll run for you. See also the readme file at https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking.

@svlandeg
Member

svlandeg commented Jan 9, 2020

For a quick test (only using a small part of the data to see whether the script runs), use -lt 20000 -lp 2000 -lw 3000 -f 1

@icochico

icochico commented Jan 9, 2020

@svlandeg Thanks for your quick response. I am referring to the somewhat unclear part of the documentation in the README.

  1. Do the dump files need to be extracted before they can be fed to the script?
  2. Does spaCy need to be built from source for this to work? Or can we just use a released version like 2.2.3, for example?
  3. When I try to run the script with SpaCy from source using the model en-core-web-sm-2.2.5, I get this Error: ValueError: thinc.extra.search.Beam size changed, may indicate binary incompatibility. Expected 120 from C header, got 112 from PyObject

Thanks for your quick test sample parameters :)

@zainbnv

zainbnv commented Jan 10, 2020

@svlandeg Thanks for your quick response. I am referring to the somewhat unclear part of the documentation in the README.

1. Do the dump files need to be extracted before they can be fed to the script?

2. Does spaCy need to be built from source for this to work? Or can we just use a released version like `2.2.3`, for example?

3. When I try to run the script with SpaCy from source using the model `en-core-web-sm-2.2.5`, I get this Error: `ValueError: thinc.extra.search.Beam size changed, may indicate binary incompatibility. Expected 120 from C header, got 112 from PyObject`

Thanks for your quick test sample parameters :)

  1. You don't need to extract the file.

  2. You simply need spaCy 2.2.3 installed in your environment (with an updated en_core_web_sm).

@icochico

icochico commented Jan 10, 2020

@svlandeg @zainbnv Thanks both for your help!

I was able to fix the issues on a clean conda environment using spaCy 2.2.3 and model en-core-web-lg==2.2.5

@icochico

icochico commented Jan 11, 2020

Fixed it; it was another environment issue. FYI, for anyone interested, I am running with:
python3 ./bin/wiki_entity_linking/wikidata_train_entity_linker.py output_kb_pretrain/ -t 50000 -o output_train/

@lock lock bot locked as resolved and limited conversation to collaborators Feb 11, 2020
@explosion explosion deleted a comment from lock bot Dec 11, 2020