
Step 2 in entity linking runs out of memory after 6-8 hours #4544

Closed
petulla opened this issue Oct 28, 2019 · 9 comments
Labels
feat / nel  Feature: Named Entity linking
perf / memory  Performance: memory use

Comments


petulla commented Oct 28, 2019

I've been stuck on Step 2 of the entity linking docs and could use some help. I'm sure I'm making a basic error.

How to reproduce the problem

I created the knowledge base and training data. This seemed to work, as it printed "Done!" after running for 12 hours.

python wikidata_pretrain_kb.py 'latest-all.json.bz2' 'enwiki-latest-pages-articles-multistream.xml.bz2' './output2' 'en_core_web_lg'

The second step keeps dying on me. Checking the logs, I'm running out of memory (and a lot of memory is being used). I have 32GB of RAM.

I'm running:

python wikidata_train_entity_linker.py output2

output2 is the directory with the knowledge base. Its contents are:

entity_alias.csv
entity_defs.csv
entity_descriptions.csv
entity_freq.csv
gold_entities.jsonl
kb
nlp_kb
prior_prob.csv

Here is the command-line output after running step 2's command. It simply exits back to the command line after running for 6+ hours, having run out of memory:

2019-10-27 19:52:44,320 - INFO - __main__ - Creating Entity Linker with Wikipedia and WikiData
2019-10-27 19:52:44,320 - INFO - __main__ - STEP 1a: Loading model from output2/nlp_kb
I1027 19:52:44.486585 4612883904 file_utils.py:39] PyTorch version 1.2.0 available.
I1027 19:52:45.090944 4612883904 modeling_xlnet.py:194] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
I1027 19:52:59.587772 4612883904 wikidata_train_entity_linker.py:67] STEP 1b: Loading KB from output2/kb
I1027 19:53:07.744759 4612883904 wikidata_train_entity_linker.py:75] STEP 2: Reading training dataset from output2/gold_entities.jsonl
I1027 19:53:07.744909 4612883904 wikipedia_processor.py:473] Reading train data with limit None
2247226it [8:19:35,  3.76s/it]Killed: 9

Your Environment

  • Operating System: macOS Mojave
  • Python Version Used: 3.7.4
  • spaCy Version Used: 2.2.2.dev1 (built from source)
  • Environment Information: PyEnv
@svlandeg svlandeg added feat / nel Feature: Named Entity linking perf / memory Performance: memory use labels Oct 28, 2019
svlandeg (Member) commented Oct 28, 2019

Hi @petulla , sorry to hear you're running into issues!

I think it would be good to try the pipeline by parsing only part of the data. Wikipedia is huge, so the training file you've created will likely be huge, too. Can you try setting the limit parameter to something small (2000 or so) to start, just to check whether the pipeline runs?

[EDIT]: I mean only in the second step. It looks like the first step went just fine, so let's not touch that data :-)

@svlandeg svlandeg added the more-info-needed This issue needs more information label Oct 28, 2019
petulla (Author) commented Oct 28, 2019

@svlandeg I'll give it a shot. Which parameter is limit? https://github.com/explosion/spaCy/blob/master/bin/wiki_entity_linking/wikidata_train_entity_linker.py#L29-L38 I only see a limit in step 1.

Edit: I assume you mean train_inst/dev_inst here? Running now with those set to 2000.
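For reference, this is roughly what I'm running now; I'm assuming the plac options expose -t/-d short flags for train_inst/dev_inst (check the script's --help if the letters differ):

python wikidata_train_entity_linker.py output2 -t 2000 -d 2000   # -t/-d assumed to map to train_inst/dev_inst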

@no-response no-response bot removed the more-info-needed This issue needs more information label Oct 28, 2019
petulla (Author) commented Oct 28, 2019

@svlandeg That worked:

I1028 13:03:02.262120 4559787456 wikidata_train_entity_linker.py:189] ent Galaxy ORG Q321
I1028 13:03:02.262177 4559787456 wikidata_train_entity_linker.py:189] ent Douglas Adams PERSON Q42
I1028 13:03:02.262225 4559787456 wikidata_train_entity_linker.py:189] ent Douglas PERSON Q18569
I1028 13:03:02.262270 4559787456 wikidata_train_entity_linker.py:189] ent China GPE Q148
I1028 13:03:02.262312 4559787456 wikidata_train_entity_linker.py:189] ent Brazil GPE Q155
I1028 13:03:02.262353 4559787456 wikidata_train_entity_linker.py:189] ent Doug PERSON Q2426198
I1028 13:03:02.262412 4559787456 wikidata_train_entity_linker.py:189] ent Arthur Dent PERSON Q613901
I1028 13:03:02.262453 4559787456 wikidata_train_entity_linker.py:189] ent Dougledydoug PERSON NIL
I1028 13:03:02.262493 4559787456 wikidata_train_entity_linker.py:189] ent George Washington PERSON Q23
I1028 13:03:02.262533 4559787456 wikidata_train_entity_linker.py:189] ent Homer Simpson PERSON Q7810
I1028 13:03:02.262586 4559787456 wikidata_train_entity_linker.py:159] STEP 6: Writing trained NLP to output2/nlp
I1028 13:03:16.456601 4559787456 wikidata_train_entity_linker.py:162] Done!

Do you know how much RAM is needed to train on the full corpus, or whether there's a way to train in batches? I'm not sure where to go from here.
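As a rough sanity check (my own sketch, not part of the spaCy scripts), I counted how many JSONL lines the trainer would otherwise try to load with limit None:

# Sketch only: count the candidate training lines in gold_entities.jsonl,
# to pick sensible instance limits instead of loading the whole file.
with open("output2/gold_entities.jsonl", "r", encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
print(n_lines, "candidate training lines")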

svlandeg (Member) commented:

Happy to see it runs! I haven't done exhaustive testing to see which settings require how much RAM, etc. For now I'm afraid I can only tell you to experiment a bit and play with the size of the datasets and the hyperparameters. This functionality is still under development, too, so the accuracy levels aren't that great yet. Work in progress ;-)

petulla (Author) commented Oct 28, 2019

Just wondering, what machine did you all use to get it to run? I'll see what I can do. I also have access to Databricks clusters and AWS services, but I'm not sure which I'd use here. @svlandeg

kevingeng commented:

I'm also stuck on this step. It passed with a small amount of data (2000), but when I tried larger quantities, such as 500,000, it failed again. Each failure takes a long time.
I also tried to estimate the memory requirements from the size of the gold-entities file, but I don't know what the correlation is. My server has 64GB of RAM, and I can get more by using cloud servers, but I don't know how to estimate the memory requirements before requesting resources. In addition, I saw an issue saying that this doesn't support GPU. Does it support GPU now?

To sum up, I have two questions: 1. How much memory do I need? 2. Do I need a GPU?

@svlandeg svlandeg reopened this Nov 4, 2019
svlandeg (Member) commented Nov 4, 2019

@kevingeng: did you parse out the full KB in the first step?

I'm asking because you may have put a limit on the number of entities/aliases in the KB, resulting in a relatively small KB. If that's the case, it can take quite a long time before the parsing script finds appropriate training examples in the data: for each candidate example it checks whether the entity is in the KB and, if not, discards it, so that's a costly process.
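Roughly speaking (an illustrative sketch only, not the actual reading code), the reader does something like this for every candidate example, so with a sparse KB most of the scanning is wasted work:

# Illustrative sketch of the filtering idea, not the actual spaCy reader.
def filter_examples(candidate_examples, kb):
    for text, gold_qid in candidate_examples:
        # KnowledgeBase.contains_entity checks whether this QID was added to the KB
        if kb.contains_entity(gold_qid):
            yield text, gold_qid
        # otherwise the example is discarded and the next line is scanned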

@svlandeg svlandeg closed this as completed Nov 4, 2019
kevingeng commented:

> @kevingeng: did you parse out the full KB in the first step?
>
> I'm asking because you may have put a limit on the number of entities/aliases in the KB, resulting in a relatively small KB. If that's the case, it can take quite a long time before the parsing script finds appropriate training examples in the data: for each candidate example it checks whether the entity is in the KB and, if not, discards it, so that's a costly process.

Thanks for your answer.
Yes, you're right: I created the complete knowledge base in step 1 using the default parameters:

    max_per_alias=("Max. # entities per alias (default 10)", "option", "a", int),
    min_freq=("Min. count of an entity in the corpus (default 20)", "option", "f", int),
    min_pair=("Min. count of entity-alias pairs (default 5)", "option", "c", int),

This step created an output directory with the following contents:

     82M Oct 31 18:17 entity_alias.csv
     59M Oct 31 18:17 entity_defs.csv
     27M Oct 31 18:17 entity_descriptions.csv
     436M Oct 31 18:17 entity_freq.csv
     25G Oct 31 18:21 gold_entities.jsonl
     310M Oct 31 18:21 kb
     1.4G Oct 31 18:17 prior_prob.csv 

Then, in step 2, the program gets as far as "STEP 2: reading the training dataset previously created from WP" and fails; the log at that point says "Reading train data with limit None". After loading some of the items, the server runs out of memory and the process is killed.

This server only has 64GB of memory. I'd like to add more, but I don't know how to calculate how much memory the run needs.

Also, based on your answer, are you saying that I can change some parameters to make a smaller gold_entities.jsonl? I need to ask a bit more: does "alias" mean "mention", i.e. the raw text of the entity in the document? And can you give me some advice about the max_per_alias/min_freq/min_pair args?
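For what it's worth, if the plac short flags above map the usual way (-a, -f, -c), I'm guessing a stricter KB build would be invoked something like this; the values and the output directory name are just examples on my part:

python wikidata_pretrain_kb.py 'latest-all.json.bz2' 'enwiki-latest-pages-articles-multistream.xml.bz2' './output_small' 'en_core_web_lg' -a 5 -f 50 -c 10   # flag letters taken from the annotations above; values are guesses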

lock bot commented Dec 7, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Dec 7, 2019