
Step 2 in entity linking runs out of memory after 6-8 hours #4544

Closed
petulla opened this issue Oct 28, 2019 · 9 comments
Labels
feat / nel  Feature: Named Entity linking
perf / memory  Performance: memory use

Comments


petulla commented Oct 28, 2019

I've been stuck on Step 2 of the entity linking docs and could use some help. I'm sure I'm making a basic error.

How to reproduce the problem

I created the knowledge base and training data. This seemed to work, as it printed "Done!" after running for 12 hours.

python wikidata_pretrain_kb.py 'latest-all.json.bz2' 'enwiki-latest-pages-articles-multistream.xml.bz2' './output2' 'en_core_web_lg'

The second step keeps dying on me. Checking the logs, I'm running out of memory (and a lot of memory is being used). I have 32GB of RAM.

I'm running:

python wikidata_train_entity_linker.py output2

output2 is the directory with the knowledge base. Its contents are:

entity_alias.csv
entity_defs.csv
entity_descriptions.csv
entity_freq.csv
gold_entities.jsonl
kb
nlp_kb
prior_prob.csv

Here is the command-line output after running step 2's command. It simply exits back to the command line after running for 6+ hours, having run out of memory:

2019-10-27 19:52:44,320 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2019-10-27 19:52:44,320 - INFO - __main__ - STEP 1a: Loading model from output2/nlp_kb
I1027 19:52:44.486585 4612883904 file_utils.py:39] PyTorch version 1.2.0 available.
I1027 19:52:45.090944 4612883904 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1027 19:52:59.587772 4612883904 wikidata_train_entity_linker.py:67] STEP 1b: Loading KB from output2/kb
I1027 19:53:07.744759 4612883904 wikidata_train_entity_linker.py:75] STEP 2: Reading training dataset from output2/gold_entities.jsonl
I1027 19:53:07.744909 4612883904 wikipedia_processor.py:473] Reading train data with limit None
2247226it [8:19:35,  3.76s/it]Killed: 9

Your Environment

  • Operating System: macOS Mojave
  • Python Version Used: 3.7.4
  • spaCy Version Used: 2.2.2.dev1 (built from source)
  • Environment Information: PyEnv
@svlandeg svlandeg added feat / nel Feature: Named Entity linking perf / memory Performance: memory use labels Oct 28, 2019
svlandeg (Member) commented Oct 28, 2019

Hi @petulla , sorry to hear you're running into issues!

I think it would be good to try the pipeline by parsing only part of the data. Wikipedia is huge, so the training file you've created will likely be huge, too. Can you try setting the limit parameter to something small (2000 or so) to start, just to check whether the pipeline runs?

[EDIT]: I mean only in the second step. It looks like the first step went just fine, so let's not touch that data :-)

@svlandeg svlandeg added the more-info-needed This issue needs more information label Oct 28, 2019
petulla (Author) commented Oct 28, 2019

@svlandeg I'll give it a shot. Which parameter is limit? https://github.com/explosion/spaCy/blob/master/bin/wiki_entity_linking/wikidata_train_entity_linker.py#L29-L38 I only see a limit in step 1.

Edit: I assume you mean train_inst/dev_inst here? Running now with those set to 2000.
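For reference, this is roughly what I'm running now; I'm assuming the plac options expose -t/-d short flags for train_inst/dev_inst (check the script's --help if the letters differ):

python wikidata_train_entity_linker.py output2 -t 2000 -d 2000   # -t/-d assumed to map to train_inst/dev_inst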

@no-response no-response bot removed the more-info-needed This issue needs more information label Oct 28, 2019
petulla (Author) commented Oct 28, 2019

@svlandeg That worked:

I1028 13:03:02.262120 4559787456 wikidata_train_entity_linker.py:189] ent Galaxy ORG Q321
I1028 13:03:02.262177 4559787456 wikidata_train_entity_linker.py:189] ent Douglas Adams PERSON Q42
I1028 13:03:02.262225 4559787456 wikidata_train_entity_linker.py:189] ent Douglas PERSON Q18569
I1028 13:03:02.262270 4559787456 wikidata_train_entity_linker.py:189] ent China GPE Q148
I1028 13:03:02.262312 4559787456 wikidata_train_entity_linker.py:189] ent Brazil GPE Q155
I1028 13:03:02.262353 4559787456 wikidata_train_entity_linker.py:189] ent Doug PERSON Q2426198
I1028 13:03:02.262412 4559787456 wikidata_train_entity_linker.py:189] ent Arthur Dent PERSON Q613901
I1028 13:03:02.262453 4559787456 wikidata_train_entity_linker.py:189] ent Dougledydoug PERSON NIL
I1028 13:03:02.262493 4559787456 wikidata_train_entity_linker.py:189] ent George Washington PERSON Q23
I1028 13:03:02.262533 4559787456 wikidata_train_entity_linker.py:189] ent Homer Simpson PERSON Q7810
I1028 13:03:02.262586 4559787456 wikidata_train_entity_linker.py:159] STEP 6: Writing trained NLP to output2/nlp
I1028 13:03:16.456601 4559787456 wikidata_train_entity_linker.py:162] Done!

Do you know how much RAM is needed to train on the full corpus, or whether there's a way to train in batches? I'm not sure where to go from here.
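As a rough sanity check (my own sketch, not part of the spaCy scripts), I counted how many JSONL lines the trainer would otherwise try to load with limit None:

# Sketch only: count the candidate training lines in gold_entities.jsonl,
# to pick sensible instance limits instead of loading the whole file.
with open("output2/gold_entities.jsonl", "r", encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
print(n_lines, "candidate training lines")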

svlandeg (Member) commented:

Happy to see it runs! I haven't done exhaustive testing to see which settings require how much RAM, etc. For now I'm afraid I can only tell you to experiment a bit and play with the size of the datasets and the hyperparameters. This functionality is still under development, too, so the accuracy levels aren't that great yet. Work in progress ;-)

petulla (Author) commented Oct 28, 2019

Just wondering, what machine did you all use to get it to run? I'll see what I can do. I also have access to Databricks clusters and AWS services, but I'm not sure which I'd use here. @svlandeg

kevingeng commented:

I'm also stuck on this step. It passed with a small amount of data (2000), but when I tried larger quantities, such as 500,000, it failed again. Each failure takes a long time.
I also tried to estimate the memory requirements from the size of the gold-entities file, but I don't know what the correlation is. My server has 64GB of RAM, and I can get more by using cloud servers, but I don't know how to estimate the memory requirements before requesting resources. In addition, I saw an issue saying that this doesn't support GPU. Does it support GPU now?

To sum up, I have two questions: 1. How much memory do I need? 2. Do I need a GPU?

@svlandeg svlandeg reopened this Nov 4, 2019
svlandeg (Member) commented Nov 4, 2019

@kevingeng: did you parse out the full KB in the first step?

I'm asking because you may have put a limit on the number of entities/aliases in the KB, resulting in a relatively small KB. If that's the case, it can take quite a long time before the parsing script finds appropriate training examples in the data: for each candidate example it checks whether the entity is in the KB and, if not, discards it, so that's a costly process.
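Roughly speaking (an illustrative sketch only, not the actual reading code), the reader does something like this for every candidate example, so with a sparse KB most of the scanning is wasted work:

# Illustrative sketch of the filtering idea, not the actual spaCy reader.
def filter_examples(candidate_examples, kb):
    for text, gold_qid in candidate_examples:
        # KnowledgeBase.contains_entity checks whether this QID was added to the KB
        if kb.contains_entity(gold_qid):
            yield text, gold_qid
        # otherwise the example is discarded and the next line is scanned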

@svlandeg svlandeg closed this as completed Nov 4, 2019
kevingeng commented:

> @kevingeng: did you parse out the full KB in the first step?
>
> I'm asking because you may have put a limit on the number of entities/aliases in the KB, resulting in a relatively small KB. If that's the case, it can take quite a long time before the parsing script finds appropriate training examples in the data: for each candidate example it checks whether the entity is in the KB and, if not, discards it, so that's a costly process.

Thanks for your answer.
Yes, you're right: I created the complete knowledge base in step 1 using the default parameters:

    max_per_alias=("Max. # entities per alias (default 10)", "option", "a", int),
    min_freq=("Min. count of an entity in the corpus (default 20)", "option", "f", int),
    min_pair=("Min. count of entity-alias pairs (default 5)", "option", "c", int),

This step created an output directory with the following contents:

     82M Oct 31 18:17 entity_alias.csv
     59M Oct 31 18:17 entity_defs.csv
     27M Oct 31 18:17 entity_descriptions.csv
     436M Oct 31 18:17 entity_freq.csv
     25G Oct 31 18:21 gold_entities.jsonl
     310M Oct 31 18:21 kb
     1.4G Oct 31 18:17 prior_prob.csv 

Then, in step 2, the program gets as far as "STEP 2: reading the training dataset previously created from WP" and fails; the log at that point says "Reading train data with limit None". After loading some of the items, the server runs out of memory and the process is killed.

This server only has 64GB of memory. I'd like to add more, but I don't know how to calculate how much memory the run needs.

Also, based on your answer, are you saying that I can change some parameters to make a smaller gold_entities.jsonl? I need to ask a bit more: does "alias" mean "mention", i.e. the raw text of the entity in the document? And can you give me some advice about the max_per_alias/min_freq/min_pair args?
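For what it's worth, if the plac short flags above map the usual way (-a, -f, -c), I'm guessing a stricter KB build would be invoked something like this; the values and the output directory name are just examples on my part:

python wikidata_pretrain_kb.py 'latest-all.json.bz2' 'enwiki-latest-pages-articles-multistream.xml.bz2' './output_small' 'en_core_web_lg' -a 5 -f 50 -c 10   # flag letters taken from the annotations above; values are guesses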

lock bot commented Dec 7, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Dec 7, 2019