
"Set changed size during iteration" Error in Worker.put_key_in_memory() #4371

Closed
manuels opened this issue Dec 17, 2020 · 11 comments · Fixed by #5285

Comments

@manuels commented Dec 17, 2020

I am getting an error in this line:

for dep in ts.dependents:

Maybe iterating over a temporary copy of ts.dependents would help here?

Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fa72878fb90>>, <Task finished coro=<Worker.gather_dep() done, defined at [...]/python3.7/site-packages/distributed/worker.py:2000> exception=RuntimeError('Set changed size during iteration')>)
Traceback (most recent call last):
File "[...]/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
  ret = callback()
File "[...]/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
  future.result()
File "[...]/python3.7/site-packages/distributed/worker.py", line 2119, in gather_dep
  self.transition(ts, "memory", value=data[d])
File "[...]/python3.7/site-packages/distributed/worker.py", line 1539, in transition
  state = func(ts, **kwargs)
File "[...]/python3.7/site-packages/distributed/worker.py", line 1605, in transition_flight_memory
  self.put_key_in_memory(ts, value)
File "[...]/python3.7/site-packages/distributed/worker.py", line 1970, in put_key_in_memory
  for dep in ts.dependents:
RuntimeError: Set changed size during iteration
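
For anyone unfamiliar with the error itself: CPython raises this RuntimeError whenever a set is resized while a for loop is iterating over it, so any code path that adds or removes a dependent mid-iteration can trigger it. A minimal standalone illustration (plain Python, nothing distributed-specific):

# Minimal illustration of the failure mode: resizing a set
# while iterating over it raises RuntimeError.
deps = {"task-a", "task-b", "task-c"}

try:
    for dep in deps:
        deps.add(dep + "-child")  # simulates a dependent being added mid-loop
except RuntimeError as e:
    print(e)  # Set changed size during iteration

# Iterating over a snapshot avoids the problem:
for dep in list(deps):
    deps.add(dep + "-grandchild")  # safe: the loop runs over a fixed copy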
@manuels (Author) commented Dec 17, 2020

Maybe this happens because I submit tasks from within other tasks?

Wrapping the loop with this seems to fix the error:

# Retry the whole loop whenever the set is resized mid-iteration.
while True:
    try:
        for dep in ts.dependents:
            ...
        break
    except RuntimeError as e:
        if e.args[0] == 'Set changed size during iteration':
            continue
        raise

@mrocklin (Member) commented Dec 17, 2020 via email

@gforsyth (Contributor)

I think this particular error is fixed by @fjetter's PR #4360 -- @manuels, in the short term you can change it to for dep in list(ts.dependents): and that should do the trick
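
For context, a rough sketch of where that one-line change sits; the body of put_key_in_memory is simplified and largely assumed here, and only the list(...) snapshot is the actual suggestion:

def put_key_in_memory(self, ts, value):
    # ... store the value and do the usual bookkeeping (assumed) ...
    for dep in list(ts.dependents):  # snapshot, so concurrent transitions
        ...                          # can no longer resize the set mid-loop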

@manuels (Author) commented Dec 18, 2020

Even with the while True/try-except block, I run into the issue that some job dependencies are not started at all. I'm not sure whether this is really connected to this issue. Let me try to figure out what is going on on my side and report back.

@manuels (Author) commented Dec 22, 2020

It seems to work fine if I downgrade to v2.30.

@mdering commented Jan 19, 2021

Is this still an issue in 2021.01.1? Has anyone checked or verified?

@fjetter (Member) commented Jan 19, 2021

Judging by the code, this is still a thing. In #4360 I tried to address all of this at once but that just kept on escalating. The temporary copy is probably the best approach here.

A minimal, reproducible example would be great for this.

I believe this is loosely related to #4439
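
For reference, the "submitting tasks from tasks" pattern mentioned above looks roughly like this; a hypothetical sketch using distributed's worker_client, not a confirmed reproducer for this bug:

# Hypothetical sketch of a "tasks submitting tasks" workload,
# the pattern manuels described. NOT a confirmed reproducer.
from distributed import Client, worker_client

def child(i):
    return i * 2

def parent(n):
    # Submit more work from inside a running task.
    with worker_client() as client:
        futures = client.map(child, range(n))
        return sum(client.gather(futures))

if __name__ == "__main__":
    client = Client()
    print(client.gather(client.map(parent, range(10))))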

@gerrymanoim (Contributor)

@fjetter In case this is useful, I can confirm I get this on 2021.7.2. Worker logs:

distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 0 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 0 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 1 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 1 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 2 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 2 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 3 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 3 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 4 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 4 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - INFO - Dependent not found: ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18) 5 .  Asking scheduler
distributed.worker - INFO - Dependent not found: ('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18) 5 .  Asking scheduler
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-fcb17a4c527a0024751780f943dcdd2c', 18)" fetch>} for key ('rename-15e023e766a9c5978b5744bd9d5796f0', 18)
distributed.worker - INFO - Can't find dependencies {<Task "('getitem-61812525dc9c43f3c1b8d1899572b1e9', 18)" fetch>} for key ('rename-54810db74ff49d3d232c63496d2cade8', 18)
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "../ext/public/python/distributed/2021/7/2/dist/lib/python3.7/distributed/worker.py", line 2498, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration
distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "../distributed/2021/7/2/dist/lib/python3.7/distributed/worker.py", line 2498, in handle_missing_dep
    for dep in deps:
RuntimeError: Set changed size during iteration

On the client side I get

ValueError: Could not find dependent ('getitem-fcb17a4c527a0024751780f943dcdd2c', 18).  Check worker logs

Unfortunately I don't have a good repro here.

mrocklin added a commit to mrocklin/distributed that referenced this issue Aug 27, 2021
@mrocklin (Member)

It's a shot in the dark, but maybe #5285 helps? Is it easy for you to test your workload against that branch?

@gerrymanoim (Contributor)

Thanks - will try that as soon as I can.

@jrbourbeau (Member)

Thanks @gerrymanoim
