
Incorrect reporting of memory utilisation #141

Open
david-waterworth opened this issue Mar 22, 2023 · 0 comments

Describe the bug
I'm running into failures with batch transform that I assume are caused by an out-of-memory (OOM) condition. The underlying problem is that, as far as I can tell, there is no way to explicitly configure the batch_size for a batch transform.
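
For context, this is roughly how I create the transform job (SageMaker Python SDK; the instance type, path, and strategy are illustrative, not my exact config). max_payload (MaxPayloadInMB) is the only knob I can find that even indirectly controls batch size:

```python
# `model` is an already-created sagemaker.pytorch.PyTorchModel.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="MultiRecord",  # assemble multiple records into one request
    max_payload=1,           # MaxPayloadInMB; the smallest configurable value
)
transformer.transform(
    data="s3://my-bucket/input/",  # illustrative path
    content_type="text/csv",
    split_type="Line",  # one record per line
)
```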

Instead, the batch size appears to be controlled by MaxPayloadInMB, which has a minimum of 1. I added logging to my predict_fn and observed that I receive a mix of batches containing 1000 examples and batches containing 10k+ examples. The huge batches are almost exactly 1MB in size; I have no idea where the batches of 1000 come from (I wonder if it's splitting the final batch that falls below the 1MB payload limit).
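
For reference, this is roughly the logging I added (a minimal sketch; the model call is illustrative, not my exact handler):

```python
import logging

logger = logging.getLogger(__name__)

def predict_fn(input_data, model):
    # input_data is the already-deserialized batch handed to the handler;
    # each element is one record from the (up to MaxPayloadInMB) payload.
    logger.info("predict_fn received a batch of %d examples", len(input_data))
    return model(input_data)
```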

The issue is that the large batches occasionally cause the worker to crash, and I suspect an out-of-memory condition (the obvious workaround is to pick a machine with more memory). When I look at the logs, the maximum memory utilisation appears to be around 50%, but on closer inspection that metric looks wrong: in the example below, MemoryUsed=3537.828125 and MemoryAvailable=3843.3515625, a ratio of roughly 92%, yet MemoryUtilization is reported as 50%.

Expected behavior
MemoryUtilization = 100.0 * MemoryUsed / MemoryAvailable
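
Quick sanity check against the values in the log below:

```python
memory_used = 3537.828125        # MemoryUsed.Megabytes from the log
memory_available = 3843.3515625  # MemoryAvailable.Megabytes from the log

utilization = 100.0 * memory_used / memory_available
print(f"{utilization:.1f}")  # 92.1 -- yet TS_METRICS reports 50.0
```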

Screenshots or logs

2023-03-22T12:53:27.708+11:00 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryAvailable.Megabytes:3843.3515625|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006
2023-03-22T12:53:27.708+11:00 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUsed.Megabytes:3537.828125|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006
2023-03-22T12:53:27.708+11:00 2023-03-22T01:53:26,857 [INFO ] pool-3-thread-2 TS_METRICS - MemoryUtilization.Percent:50.0|#Level:Host|#hostname:4a73e96743e7,timestamp:1679450006

System information

  • Toolkit version: pytorch
  • Framework version: 1.13.1
  • Python version: 3.9
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): No

