Enforcing limits on the internal memory needed by monitors #1273

Merged: 2 commits into pre/2.5 on Nov 28, 2023

Conversation

momchil-flex
Collaborator

Currently, monitors that produce a small amount of data but may need a large amount of memory internally are not really validated to prevent OOM on our servers. I was thinking of imposing limits like a max number of freqs, or the product num_freqs x num_points x num_modes for mode monitors, etc., but I realized that what probably makes the most sense is restricting the data size of the associated field monitor, which is always needed internally.
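The idea can be sketched as follows. This is a minimal illustration, not the actual tidy3d implementation: the function names, the size formula (6 field components, 8 bytes per value), and the limit value are all assumptions.

```python
MAX_INTERNAL_GB = 10.0  # illustrative limit, not the actual server value


def field_monitor_size_bytes(num_cells: int, num_freqs: int,
                             num_components: int = 6,
                             bytes_per_value: int = 8) -> int:
    """Rough size of the frequency-domain field data stored internally."""
    return num_cells * num_freqs * num_components * bytes_per_value


def validate_internal_storage(num_cells: int, num_freqs: int) -> float:
    """Raise if the implied field-monitor storage exceeds the limit."""
    size_gb = field_monitor_size_bytes(num_cells, num_freqs) / 2**30
    if size_gb > MAX_INTERNAL_GB:
        raise ValueError(
            f"monitor needs ~{size_gb:.1f} GB internally, above the "
            f"{MAX_INTERNAL_GB} GB limit"
        )
    return size_gb
```

The point is that even a monitor returning a few numbers (e.g. a flux) implies full field storage internally, so the validator checks the implied field size rather than the returned data size.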

Collaborator

@tylerflex tylerflex left a comment


A couple of minor comments and suggestions, but overall looks good.


return data_size

def _monitor_data_size(self, monitor):

Can you add type annotations?

# internal storage also does not exceed the limit.
check_monitor_types = [AbstractModeMonitor, SurfaceIntegrationMonitor, DiffractionMonitor]
for monitor in self.monitors:
if not isinstance(monitor, tuple(check_monitor_types)):

Not necessary to convert to tuple here, I think.

if not isinstance(monitor, tuple(check_monitor_types)):
continue
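As context for the suggestion above: in standard Python, `isinstance` accepts a single type or a tuple of types (a plain list raises `TypeError`), so one way to drop the conversion inside the loop is to define the collection as a tuple up front. A minimal sketch with placeholder classes:

```python
# Placeholder classes standing in for the actual monitor types.
class AbstractModeMonitor: ...
class SurfaceIntegrationMonitor: ...
class DiffractionMonitor: ...
class FieldMonitor: ...

# A tuple can be passed to isinstance directly, no per-iteration conversion.
CHECK_MONITOR_TYPES = (
    AbstractModeMonitor,
    SurfaceIntegrationMonitor,
    DiffractionMonitor,
)

monitors = [DiffractionMonitor(), FieldMonitor()]
checked = [m for m in monitors if isinstance(m, CHECK_MONITOR_TYPES)]
```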

mnt_kwargs = {

I think a cleaner way to do this could be to add a property for monitors that returns the equivalent of the backend field monitor that would be used for that type of monitor (before processing) and then compute the storage size of that monitor here in the validator. What do you think? This might also be useful on the backend potentially or for debugging purposes

@momchil-flex
Collaborator Author

I think a cleaner way to do this could be to add a property for monitors that returns the equivalent of the backend field monitor that would be used for that type of monitor (before processing) and then compute the storage size of that monitor here in the validator. What do you think? This might also be useful on the backend potentially or for debugging purposes

So I was thinking about doing it like this initially, but it doesn't actually make that much sense; it doesn't work exactly like that on the backend. For example, surface integration monitors do not actually add a FieldMonitor that is then post-processed, while ModeMonitor-s do. Also, even disregarding that, it wasn't easy for me to come up with a nicely unified way to add this that would not require the property to be overwritten in multiple classes.

I've modified things to just introduce a server storage method. What do you think? I also added a limit on the maximum data that a mode solver call can produce, because more than 20 GB is likely to error currently. Finally, I realized the previous implementation was not right for TimeMonitor-s (essentially FluxTimeMonitor is the only one affected), because it was creating a FieldTimeMonitor with the same number of time steps; in reality, only the fields at a single time step are stored internally at any given time (which also goes to the above point that there's no direct correspondence to a FieldMonitor or FieldTimeMonitor).
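The per-monitor storage method described above might be sketched like this. The class and constant names are hypothetical, the 6-component/8-byte formula is an assumption, and the real tidy3d `_storage_size_solver` signature may differ; the point is only that frequency-domain monitors scale with the number of frequencies, while a time-domain flux monitor holds just one time step internally regardless of `tmesh` length.

```python
BYTES_PER_COMPLEX = 8  # assuming single-precision complex field values


class FreqMonitorSketch:
    """Frequency-domain monitor: fields accumulated for every frequency."""

    def __init__(self, num_freqs: int):
        self.num_freqs = num_freqs

    def _storage_size_solver(self, num_cells: int, tmesh: list) -> int:
        # Internal storage scales with frequencies, not with time steps.
        return 6 * num_cells * self.num_freqs * BYTES_PER_COMPLEX


class FluxTimeMonitorSketch:
    """Time-domain flux monitor: only one time step held internally."""

    def _storage_size_solver(self, num_cells: int, tmesh: list) -> int:
        # Independent of len(tmesh): only the current step's fields exist.
        return 6 * num_cells * BYTES_PER_COMPLEX
```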

@tomflexcompute
Contributor

Can the current limit ensure OOM never occurs?

@momchil-flex
Collaborator Author

Can the current limit ensure OOM never occurs?

Unfortunately, no. There are too many ways in which it can happen, and we need more work on server-side resource allocation too. But this will prevent some cases that would otherwise almost certainly have gone OOM.

Collaborator

@tylerflex tylerflex left a comment


One minor comment, but otherwise looks good.

for monitor in self.monitors:
num_cells = self._monitor_num_cells(monitor)
# intermediate storage needed
internal_data_gb = monitor._storage_size_solver(num_cells=num_cells, tmesh=self.tmesh)

It would be a bit easier to understand if we named the variables like:

internal_data_bytes = monitor._storage_size_solver(num_cells=num_cells, tmesh=self.tmesh)
internal_data_gb = internal_data_bytes / 2**30

Also in the other instance where the 2 ** 30 is divided out.

@momchil-flex momchil-flex merged commit 99248a8 into pre/2.5 Nov 28, 2023
14 checks passed
@momchil-flex momchil-flex deleted the momchil/monitor_limits branch November 28, 2023 22:59