[ML] More accurate job memory overhead #47516

droberts195 · 2019-10-03T16:19:32Z

When an ML job runs, the memory required can be
broken down into:

1. Memory required to load the executable code
2. Instrumented model memory
3. Other memory used by the job's main process or
   ancillary processes that is not instrumented

Previously we added a simple fixed overhead to
account for 1 and 3. This was 100MB for anomaly
detection jobs (large because of the completely
uninstrumented categorization function and
normalize process), and 20MB for data frame
analytics jobs.

However, this was an oversimplification because
the executable code only needs to be loaded once
per machine. Also the 100MB overhead for anomaly
detection jobs was probably too high in most cases
because categorization and normalization don't use
that much memory.

This PR therefore changes the calculation of memory
requirements as follows:

1. A per-node overhead of 30MB for only the first
   job of any type to be run on a given node - this
   is to account for loading the executable code
2. The established model memory (if applicable) or
   model memory limit of the job
3. A per-job overhead of 10MB for anomaly detection
   jobs and 5MB for data frame analytics jobs, to
   account for the uninstrumented memory usage
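
As a rough illustration of the three-part calculation above, here is a minimal Python sketch. It is not the code in this PR: the function and constant names are hypothetical, and it assumes that "established model memory (if applicable)" means the established value is used when one exists and the model memory limit is used otherwise. Only the 30MB, 10MB and 5MB figures come from the description.

```python
# Overheads in MB, taken from the PR description. All names are hypothetical.
PER_NODE_OVERHEAD_MB = 30
PER_JOB_OVERHEAD_MB = {
    "anomaly_detection": 10,
    "data_frame_analytics": 5,
}


def job_model_memory_mb(established_mb, limit_mb):
    """Assumption: prefer established model memory, else the configured limit."""
    return established_mb if established_mb is not None else limit_mb


def node_memory_required_mb(jobs):
    """Total memory needed on one node for the given jobs.

    `jobs` is a list of (job_type, established_mb_or_None, limit_mb) tuples.
    The per-node overhead is counted once, and only if at least one job runs.
    """
    if not jobs:
        return 0
    total = PER_NODE_OVERHEAD_MB
    for job_type, established_mb, limit_mb in jobs:
        total += job_model_memory_mb(established_mb, limit_mb)
        total += PER_JOB_OVERHEAD_MB[job_type]
    return total


example_jobs = [
    ("anomaly_detection", 40, 1024),       # established model memory known
    ("data_frame_analytics", None, 100),   # falls back to the limit
]
print(node_memory_required_mb(example_jobs))
```

For the two example jobs this gives 30 + (40 + 10) + (100 + 5) = 185MB.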

This change will enable more jobs to be run on the
same node. It will be particularly beneficial when
there are a large number of small jobs. It will
have less of an effect when there are a small number
of large jobs.
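
To make the small-jobs versus large-jobs claim concrete, the sketch below compares how many identical anomaly detection jobs fit on a node under the old fixed 100MB overhead and under the new 30MB per-node plus 10MB per-job scheme. The function names, the 2000MB of node memory and the job sizes are assumptions chosen for illustration, not values from this PR.

```python
def max_jobs_old(node_mb, model_mb, fixed_overhead_mb=100):
    """Old scheme: every job pays the full fixed overhead."""
    return node_mb // (model_mb + fixed_overhead_mb)


def max_jobs_new(node_mb, model_mb, per_job_overhead_mb=10,
                 per_node_overhead_mb=30):
    """New scheme: one per-node overhead, then a small per-job overhead."""
    usable_mb = node_mb - per_node_overhead_mb
    if usable_mb < 0:
        return 0
    return usable_mb // (model_mb + per_job_overhead_mb)


node_mb = 2000  # hypothetical memory available to ML on one node

# Many small jobs (10MB of model memory each).
print(max_jobs_old(node_mb, 10), max_jobs_new(node_mb, 10))

# A few large jobs (900MB of model memory each).
print(max_jobs_old(node_mb, 900), max_jobs_new(node_mb, 900))
```

With these assumed numbers, 10MB jobs go from 18 per node to 98, while 900MB jobs stay at 2 per node either way.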


elasticmachine · 2019-10-03T16:19:33Z

Pinging @elastic/ml-core (:ml)


droberts195 added >enhancement :ml Machine learning v8.0.0 v7.5.0 labels Oct 3, 2019

benwtrent approved these changes Oct 3, 2019


droberts195 merged commit d683b20 into elastic:master Oct 4, 2019

droberts195 deleted the adjust_ml_memory_overheads branch October 4, 2019 08:16

droberts195 mentioned this pull request Oct 10, 2019

Extend memory instrumentation to categorization elastic/ml-cpp#724

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
