Step 2 in entity linking runs out of memory after 6-8 hours #4544
Hi @petulla, sorry to hear you're running into issues! I think it would be good to try the pipeline by only parsing part of the data. Wikipedia is quite huge, so the training file you've created will likely be quite huge, too. Can you try setting the limit on the amount of training data that gets read in? [EDIT]: I mean only in the second step. Looks like the first step went just fine, so let's not touch that data :-)
@svlandeg I'll give it a shot. Which parameter is limit? https://github.com/explosion/spaCy/blob/master/bin/wiki_entity_linking/wikidata_train_entity_linker.py#L29-L38 I see limit in step 1. Edit: I assume you mean train_inst/dev_inst here? Running now with those set to 2000.
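(Editorial aside: for readers wanting to see concretely what capping those instance counts amounts to, here is a minimal sketch, not the script's actual code, of lazily reading only the first N records of the step-1 training file instead of loading it all. The path and the idea that the file is line-delimited JSON are assumptions based on this thread.)

```python
from itertools import islice
import srsly  # installed alongside spaCy; read_jsonl() is a lazy generator

# Assumed path: the gold_entities.jsonl file that step 1 writes to the output dir.
train_path = "output2/gold_entities.jsonl"

def read_limited(path, limit=2000):
    """Yield at most `limit` training records without loading the full file."""
    yield from islice(srsly.read_jsonl(path), limit)

train_sample = list(read_limited(train_path, limit=2000))
print(f"Loaded {len(train_sample)} training records")
```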
@svlandeg That worked:
Do you know how much RAM is needed to train on the corpus, or is there a way to batch train? I'm not sure where to go from here.
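(Editorial aside on the batch-training question: I can't speak to what the script itself does internally, but the generic spaCy v2 training pattern is to feed the data through spacy.util.minibatch so each update only sees a small chunk. A rough sketch under those assumptions; the entity-linker pipe, KB, and optimizer setup are omitted, and the toy data below is made up.)

```python
import random
from spacy.util import minibatch, compounding

# Toy stand-in for (text, annotations) pairs; in practice these would come
# from the gold_entities.jsonl training file (an assumption on my part).
train_data = [(f"Example sentence {i}.", {"links": {}}) for i in range(100)]

random.shuffle(train_data)
# compounding() grows the batch size from 4 up to 32; minibatch() materialises
# only one small batch at a time instead of the whole dataset.
for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
    texts, annotations = zip(*batch)
    # Here you would call nlp.update(texts, annotations, sgd=optimizer, ...)
    print(f"batch of {len(texts)}")
```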
Happy to see it runs! I haven't done exhaustive testing to see which settings require how much RAM etc. For now I'm afraid I can only tell you to experiment a bit and play with the size of the datasets & hyperparameters. This functionality is still under development, too, so the accuracy levels aren't that great yet. Work in progress ;-)
Just wondering, what machine did you all use to get it to run? I'll see what I can do. I also have access to Databricks clusters/AWS services, but I'm not sure which I'd use here. @svlandeg
I'm also stuck at this step. It passed with a small amount of data (2000), but when I tried other quantities, such as 500,000, it failed again. Each failure took a long time. To sum up, I have two questions: 1. How much memory do I need? 2. Do I need a GPU?
@kevingeng: did you, in the first step, parse out the full KB? I'm asking because it could be that you put a limit on the number of entities/aliases in the KB, resulting in a relatively small KB. If that's the case, it can take quite long before the parsing script finds appropriate training examples in the data: for each training example it checks whether the entity is in the KB and, if not, discards it, so that's a costly process.
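(Editorial aside: to make that filtering step concrete, here is a simplified sketch of the kind of check being described, using the spaCy 2.x KnowledgeBase API that these scripts are built on. The tiny KB below is made up for illustration; the real script's logic may differ.)

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")

# A toy KB with a single entity and alias, purely for illustration.
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=50, entity_vector=[1.0, 0.0, 0.0])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])

def keep_example(mention, gold_qid, kb):
    """Keep a (mention, gold entity) pair only if the KB can resolve it."""
    candidates = kb.get_candidates(mention)
    return any(c.entity_ == gold_qid for c in candidates)

print(keep_example("Douglas Adams", "Q42", kb))  # True
print(keep_example("Adams", "Q42", kb))          # False: alias not in this tiny KB
```

With a small KB, most examples fail this check and get thrown away, which is why the parsing can grind on for a long time without producing much usable training data.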
Thanks for your answer
In this step, it creates an output directory, like this:
Then in step 2, the program runs to "# STEP 2: Read the training dataset previously created from WP" and fails; the message at that point is "Read training data limited to None". The reason is that after loading some items, the server runs out of memory and the process is killed. There is only 64 GB of memory on this server; I want to add more, but I don't know how to calculate the amount of memory needed to run. Also, by your answer, are you saying that I can change some parameters to make a smaller gold_entities.jsonl? So I need to ask more: does "alias" mean "mention", i.e. the raw text of the entity in the document? Can you give me some advice about the max_per_alias/min_freq/min_pair args?
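(Editorial aside on the alias question, with the caveat that this is my reading rather than the script's documented behaviour: an alias is indeed a surface mention string, and in the KB it maps to one or more candidate entities with prior probabilities. A tiny sketch with the spaCy 2.x API; the second entity ID and all numbers are made up.)

```python
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

# Two entities that could share the surface form "Paris"
# (Q90 is Paris, France; "Q00000" is a placeholder ID for illustration).
kb.add_entity(entity="Q90", freq=500, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="Q00000", freq=20, entity_vector=[0.0, 1.0, 0.0])

# The alias is the raw mention text; the probabilities are priors estimated
# from how often each entity is linked from that mention.
kb.add_alias(alias="Paris", entities=["Q90", "Q00000"], probabilities=[0.8, 0.1])

for cand in kb.get_candidates("Paris"):
    print(cand.entity_, cand.prior_prob)
```

In that picture, a knob like max_per_alias would cap how many candidate entities are kept per alias, while min_freq/min_pair would drop rare entities or rare mention-entity pairs, shrinking the KB; that interpretation of the parameters is an assumption, so please double-check it against the script.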
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I've been stuck on Step 2 of the entity linking docs and could use some help. I'm sure I'm making a basic error.
How to reproduce the problem
I created the knowledge base and training data. This seemed to work, as it said "Done!" after running for 12 hours.
python wikidata_pretrain_kb.py 'latest-all.json.bz2' 'enwiki-latest-pages-articles-multistream.xml.bz2' './output2' 'en_core_web_lg'
The second step is dying for me. Checking the logs, I'm running out of memory (and a lot of memory is being used). I have 32 GB of RAM.
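(Editorial aside: before anything else, it may help to check how large the step-1 training file actually is relative to that 32 GB. A quick sketch; the file name and location are assumptions based on this thread.)

```python
import os

# Assumed location of the training data written by step 1.
path = "output2/gold_entities.jsonl"

size_gb = os.path.getsize(path) / 1024 ** 3
with open(path, encoding="utf8") as f:
    n_lines = sum(1 for _ in f)

print(f"{n_lines} training lines, {size_gb:.1f} GB on disk")
# If this is a large fraction of the available RAM, reading it all into
# memory in step 2 is likely to fail.
```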
I'm running:
python wikidata_train_entity_linker.py output2
output2 is the directory with the knowledge base. Its contents are:
Here is the command-line output after running step 2's command. It simply drops back to the command line after running for 6+ hours and running out of memory:
Your Environment