Allow total memory to be overridden #78750

droberts195 · 2021-10-06T11:35:49Z

Since #65905 Elasticsearch has determined the Java heap settings
from node roles and total system memory.

This change allows the total system memory used in that calculation
to be overridden with a user-specified value. This is intended to
be used when Elasticsearch is running on a machine where some other
software that consumes a non-negligible amount of memory is running.
For example, a user could tell Elasticsearch to assume it was
running on a machine with 3GB of RAM when actually it was running
on a machine with 4GB of RAM.

The system property is es.total_memory_bytes, so, for example,
could be specified using -Des.total_memory_bytes=3221225472.
(It is specified in bytes rather than using a unit, because it
needs to be parsed by startup code that does not have access to
the utility classes that interpret byte size units.)

Since elastic#65905 Elasticsearch has determined the Java heap settings from node roles and total system memory. This change allows the total system memory used in that calculation to be overridden with a user-specified value. This is intended to be used when Elasticsearch is running on a machine where some other software that consumes a non-negligible amount of memory is running. For example, a user could tell Elasticsearch to assume it was running on a machine with 3GB of RAM when actually it was running on a machine with 4GB of RAM. The system property is `es.total_memory_bytes`, so, for example, could be specified using `-Des.total_memory_bytes=3221225472`. (It is specified in bytes rather than using a unit, because it needs to be parsed by startup code that does not have access to the utility classes that interpret byte size units.)

elasticmachine · 2021-10-06T15:59:23Z

Pinging @elastic/es-delivery (Team:Delivery)

elasticmachine · 2021-10-06T15:59:23Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticmachine · 2021-10-06T15:59:24Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2021-10-06T15:59:45Z

Hi @droberts195, I've created a changelog YAML for you.

elasticsearchmachine · 2021-10-06T16:11:26Z

Hi @droberts195, I've updated the changelog YAML for you.

pugnascotia · 2021-10-06T20:22:48Z

...on/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/MachineDependentHeap.java

+        String totalMemoryBytesOption = userDefinedJvmOptions.stream()
+            .filter(option -> option.startsWith("-Des.total_memory_bytes="))
+            .findFirst()
+            .orElse(null);
+        if (totalMemoryBytesOption == null) {
+            return null;
+        }
+        try {
+            return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);
+        } catch (NumberFormatException e) {
+            throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");
+        }


I think you don't need an intermediate variable if you used Optional.map?

Suggested change

String totalMemoryBytesOption = userDefinedJvmOptions.stream()

.filter(option -> option.startsWith("-Des.total_memory_bytes="))

.findFirst()

.orElse(null);

if (totalMemoryBytesOption == null) {

return null;

}

try {

return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);

} catch (NumberFormatException e) {

throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");

}

return userDefinedJvmOptions.stream()

.filter(option -> option.startsWith("-Des.total_memory_bytes="))

.findFirst()

.map(totalMemoryBytesOption -> {

try {

return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);

} catch (NumberFormatException e) {

throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");

}

})

.orElse(null);

pugnascotia · 2021-10-06T20:30:29Z

docs/reference/cluster/stats.asciidoc

+`total_override`::
+(<<byte-units,byte value>>)
+If the amount of physical memory has been overridden using the `es.total_memory_bytes`
+system property on all selected nodes then this reports the sum of the overridden


What if the total memory was only overridden on some of the selected nodes?

Yes, this is a tricky one. I opted not to report any override at all at the cluster level if some nodes have overrides and others don't. (You can still get all the values from the node stats.) I guess the alternative would be to report the sum of overrides on the nodes that have overrides, plus un-overridden total on the nodes that don't, but only if at least one node has an override. Maybe that's better - I'd be interested to hear if subsequent reviewers have any thoughts on this.

This also ties in with #78750 (comment)

I wonder if we need to have any distinct concept of override in the stats at all. With available disk space, when limited through cgroups, we do not show what the real disk has available. Why not just show this as “this is the memory available”?

pugnascotia · 2021-10-06T20:39:23Z

server/src/test/java/org/elasticsearch/monitor/os/OsProbeTests.java

+        assertThat(OsProbe.getTotalMemoryOverride("123456789123456789"), is(123456789123456789L));
+        IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> OsProbe.getTotalMemoryOverride("abc"));
+        assertThat(e.getMessage(), is("Invalid value for [es.total_memory_bytes]: [abc]"));
+        // Although numeric, this value overflows long. This won't be a problem if the override is set appropriately on a 64 bit machine.


I don't understand why this won't be a problem on a 64 bit machine - can you elaborate?

A 64 bit pointer can address 16 exabytes of memory. A signed long can only store half that, but it's still huge. I think as machines start to have memory in the exabyte ranges we'll see 128 bit CPUs and languages will be changed to have 128 bit primitive integer types. At that time all the places where we've stored memory sizes in long will need to be changed to that new 128 bit primitive type, but that's decades in the future.

Basically, for the foreseeable future there's no justification for anyone to need to tell Elasticsearch to work on the assumption that the total memory on the machine is greater than 8 exabytes, so it doesn't matter that attempting to do so is an error. I'll change the comment to say that.

mark-vieira

I left a few comments. I think we should get somebody from core/infra to give this a once over as well. Otherwise 👍 .

...on/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/MachineDependentHeap.java

mark-vieira · 2021-10-06T20:34:45Z

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

+     * is read from the "es.total_memory_bytes" system property. Negative
+     * values or not set at all mean no override.
+     */
+    public Long getTotalMemoryOverride() {


I wonder if we should call this something less implementation-specific, like "totalUsableMemory"? I think in that case it would also make sense to have this return a long primative as well, and if there's no override defined, we just return the same value as totalSystemMemory. This saves downstream callers the trouble of null checks.

mark-vieira · 2021-10-06T20:42:18Z

qa/os/src/test/java/org/elasticsearch/packaging/test/DockerTests.java

+        assertThat(xArgs, hasItems("-Xms304m", "-Xmx304m"));
+    }
+
+    private List<String> machineDependentHeapTest(final String containerMemory, final List<String> extraJvmOptions) throws Exception {


We should make sure we're testing this scenario in our ArchiveTests and PackageTests as well to get full coverage across packaging types.

droberts195 · 2021-10-11T14:15:11Z

Just summarizing some of the comments that have been raised against different lines but are all very interrelated:

Is this like cgroups?
Can we just modify total and free instead of introducing a 3rd field?
If we do introduce a 3rd field, it should always be present, not only present if memory was overridden.

This is not exactly like cgroups. The key difference is that cgroups enforces boundaries but this functionality does not. So with cgroups, if the memory of the machine is 16GB but cgroups says this container can use 4GB then the OS will enforce that the container really can only use 4GB. With this functionality if a container has 4GB but we want Elasticsearch to work on the basis that something else will be using 1GB then there is nothing at the OS level that will stop Elasticsearch using more than 3GB, nor stop the other processes from using more than 1GB. It has been suggested that a better solution to doing all this would be to run the other processes in a separate container, and I guess this issue shows that that is indeed the better solution from the perspective of different processes fighting over memory. But I guess it also creates other problems when the other process is supposed to be monitoring the container where Elasticsearch is running and cannot get all the stats it wants when running outside the Elasticsearch container.

This also makes it hard to modify free to reflect the override. Suppose total is 1024m, and this is overridden to 844m (i.e. 180m reduction) to allow metricbeat to run in the same container without causing of OOM errors. Suppose at the time the node stats are reported free is 233m, and metricbeat is using 97m. It would be wrong to subtract 180m from 233m and report that as free. Instead we should subtract 83m from free, which is the 180m reduction less the 97m that metricbeat is currently using. But to actually do this, we'd need to know that metricbeat is the reason for the reduction, and be able to find it in the process table and get its stats. I don't think a non-root process would be able to find the stats for another non-root process, and even if it could passing the details of the process or multiple processes that the memory override relates to is going to add a lot of extra complexity.

As a result of the answers to points 1 and 2 I think the only way forward is to add a third field in addition to total and free. Messing with total and free in some way that doesn't reflect the operating system's opinion of them could make support cases where memory is a factor very challenging to work on. I will change this third field so that it's always present, but I think since it's always going to be there we should spend a bit more time thinking about the best name for it. total_usable is one option, but I wonder if usable might cause confusion about how it relates to free. Some other options could be ration, budget, share or allowance. Please let me know if you have any thoughts or if there's anyone else who's particularly good at choosing good field names that I should consult.

droberts195 · 2021-10-11T16:08:03Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java

@@ -1286,7 +1286,8 @@ public static boolean allTemplatesInstalled(ClusterState clusterState) {

    /**
     * Find the memory size (in bytes) of the machine this node is running on.
-     * Takes container limits (as used by Docker for example) into account.
+     * Takes container limits (as used by Docker for example) and forced memory
+     * size overrides into account.
     */
    static long machineMemoryFromStats(OsStats stats) {
        long mem = stats.getMem().getTotal().getBytes();


The discussion about cgroups made me think that the 10 lines below (not changed by this PR, so expand the diff to see them) are unnecessary now. They certainly were necessary back in some 6.x release when they were first added.

Is it the case that OperatingSystemMXBean takes cgroups into account only for Java 14 and above, hence:

elasticsearch/distribution/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/DefaultSystemMemoryInfo.java

Line 36 in b968fcb

throw new SystemMemoryInfoException("The minimum required Java version is 14 to use " + this.getClass().getName());

?

If that's the case then this method can be simplified for 8.0+ when we're mandating at least Java 17.

droberts195 · 2021-10-13T16:24:07Z

I wonder if usable might cause confusion about how it relates to free. Some other options could be ration, budget, share or allowance.

I went with adjusted_total in 449cf2a, which makes the new field always present.

I also removed ML's hand coded application of cgroups limits on the assumption Java 14+ does this automatically.

rjernst

LGTM

rjernst · 2021-10-14T19:16:19Z

server/src/main/java/org/elasticsearch/monitor/os/OsProbe.java

+    static Long getTotalMemoryOverride(String memoryOverrideProperty) {
+        try {
+            long memoryOverride = Long.parseLong(memoryOverrideProperty);
+            return (memoryOverride < 0) ? null : memoryOverride;


Why lenience for a negative value, shouldn't that be an error? Instead could we use null as the sentinel for the property not existing?

rjernst · 2021-10-14T19:33:55Z

server/src/main/java/org/elasticsearch/monitor/os/OsStats.java

            assert total >= 0 : "expected total memory to be positive, got: " + total;
-            assert free >= 0 : "expected free memory to be positive, got: " + total;
+            assert adjustedTotal >= 0 : "expected adjusted total memory to be positive, got: " + adjustedTotal;


I see we are asserting here to ensure it is positive, but since asserts are not normally running in production, I think a hard check when parsing would still be better.

I'm reluctant to make it throw exceptions in production here. The assertions date back to #42725, which was fixing #42157. #54415 and #56435 are also related. Even though those issues showed up in CI, I know there were also a load of problems reported by end users related to exceptions caused by negative memory values. This included nasty problems like ECK being unusable on certain operating systems.

This constructor is only called in 2 places in production code, and the code paths that lead to those calls were all changed in #42725 to ensure none of the arguments can be negative. Changing negative values reported by the JVM to zero was the general approach to solving all the related problems.

I'll make the two constructors log an error and change negative values to zero after the assertions. This will add an extra layer of protection that negative values won't corrupt the stats API outputs if a future code change means a negative value gets this far in production. And the assertions will still trip CI, so we should still find out about future JVM bugs or OS bugs as soon as we test with them.

Hrm, but I was specifically referring to the adjusted total, which does not come from the JVM. CI won’t catch that since it will be passed in externally at runtime? Surely we can enforce our own system property does not have a negative value?

Surely we can enforce our own system property does not have a negative value?

Yes, sure, that's done in https://github.com/elastic/elasticsearch/pull/78750/files#diff-709e127f78f07be389e23093a3fd58cde32a752de9b80a7f64230e330d744e36R148 now.

So none of the arguments to this constructor should be negative now.

The adjusted value might still come from the JVM if the system property wasn't set at all, because it will have been set to be the same as total. This class is really just a holder for passing previously calculated values around.

droberts195 · 2021-10-16T06:18:02Z

@elasticmachine update branch

droberts195 · 2021-10-16T08:42:38Z

@elasticmachine run elasticsearch-ci/part-1

* upstream/master: (109 commits) Migrate custom role providers to licensed feature (elastic#79127) Remove stale AwaitsFix in InternalEngineTests (elastic#79323) Fix errors in RefreshListenersTests (elastic#79324) Reeable BwC Tests after elastic#79318 (elastic#79320) Mute BwC Tests for elastic#79318 (elastic#79319) Reenable BwC Tests after elastic#79308 (elastic#79313) Disable BwC Tests for elastic#79308 (elastic#79310) Adjust BWC for node-level field cap requests (elastic#79301) Allow total memory to be overridden (elastic#78750) Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269) Disable BWC tests Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299) Add alias support to fleet search API (elastic#79285) Create a coordinating node level reader for tsdb (elastic#79197) Route documents to the correct shards in tsdb (elastic#77731) Inject migrate action regardless of allocate action (elastic#79090) Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100) Skip building of BWC distributions when building release artifacts (elastic#79180) Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275) Deprecation of transient cluster settings (elastic#78794) ... # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexMode.java # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java

* upstream/master: (521 commits) Migrate custom role providers to licensed feature (elastic#79127) Remove stale AwaitsFix in InternalEngineTests (elastic#79323) Fix errors in RefreshListenersTests (elastic#79324) Reeable BwC Tests after elastic#79318 (elastic#79320) Mute BwC Tests for elastic#79318 (elastic#79319) Reenable BwC Tests after elastic#79308 (elastic#79313) Disable BwC Tests for elastic#79308 (elastic#79310) Adjust BWC for node-level field cap requests (elastic#79301) Allow total memory to be overridden (elastic#78750) Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269) Disable BWC tests Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299) Add alias support to fleet search API (elastic#79285) Create a coordinating node level reader for tsdb (elastic#79197) Route documents to the correct shards in tsdb (elastic#77731) Inject migrate action regardless of allocate action (elastic#79090) Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100) Skip building of BWC distributions when building release artifacts (elastic#79180) Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275) Deprecation of transient cluster settings (elastic#78794) ... # Conflicts: # rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/tsdb/10_settings.yml # server/src/main/java/org/elasticsearch/common/settings/IndexScopedSettings.java # server/src/main/java/org/elasticsearch/common/settings/Setting.java # server/src/main/java/org/elasticsearch/index/IndexMode.java # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java

The current capacity in use in autoscaling would use the full container memory and not the adjusted total memory. ES sometimes responds with `current_capacity` as `required_capacity` and the `current_capacity` therefore need to use the adjusted capacity instead (since the orchestration should add the memory reservation on top). Relates elastic#78750

The current capacity in use in autoscaling would use the full container memory and not the adjusted total memory. ES sometimes responds with `current_capacity` as `required_capacity` and the `current_capacity` therefore need to use the adjusted capacity instead (since the orchestration should add the memory reservation on top). Relates #78750

droberts195 added >enhancement :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts :Core/Infra/Core Core issues without another label :ml Machine learning v8.0.0 labels Oct 6, 2021

droberts195 added 5 commits October 6, 2021 14:36

Fix formatting

0f1dd57

Fix test

26c5579

Fix more tests

431c8bd

Merge branch 'master' into add_memory_override

7d09701

Adding packaging test

6acad57

droberts195 marked this pull request as ready for review October 6, 2021 15:59

elasticmachine added Team:Delivery Meta label for Delivery team Team:Core/Infra Meta label for core/infra team Team:ML Meta label for the ML team labels Oct 6, 2021

droberts195 added 3 commits October 6, 2021 16:59

Update docs/changelog/78750.yaml

0f27a97

Fix changelog

824019a

Update docs/changelog/78750.yaml

898fec6

droberts195 added 2 commits October 6, 2021 17:21

Fix test

82c7937

Adding a launcher test for option parse failure

d1d58e6

pugnascotia reviewed Oct 6, 2021

View reviewed changes

mark-vieira approved these changes Oct 6, 2021

View reviewed changes

Addressing comments related to the launcher code

017263b

droberts195 added 3 commits October 11, 2021 11:31

Move packaging test out of systemd section

3c1eabd

Merge branch 'master' into add_memory_override

8df4836

Fix test

b4c646c

droberts195 commented Oct 11, 2021

View reviewed changes

droberts195 added 2 commits October 13, 2021 16:39

Merge branch 'master' into add_memory_override

4e7b3c1

Using always present adjusted_total instead of optional total_override

449cf2a

droberts195 added 2 commits October 13, 2021 18:56

Fixing tests

ee59b9f

Merge branch 'master' into add_memory_override

2f71542

rjernst approved these changes Oct 14, 2021

View reviewed changes

droberts195 added 6 commits October 15, 2021 09:47

Merge branch 'master' into add_memory_override

7dcd8ec

Address code review comments

be14364

Adapt to packaging test framework changes

8741c98

Merge branch 'master' into add_memory_override

9731c2d

Packaging tests need https now

71f7a2a

Set up security after calling install()

1cd19bf

Merge branch 'master' into add_memory_override

becace8

droberts195 merged commit e86de06 into elastic:master Oct 16, 2021

droberts195 deleted the add_memory_override branch October 16, 2021 11:01

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

henningandersen mentioned this pull request Nov 9, 2021

Autoscaling use adjusted total memory #80528

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow total memory to be overridden #78750

Allow total memory to be overridden #78750

droberts195 commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticsearchmachine commented Oct 6, 2021

elasticsearchmachine commented Oct 6, 2021

pugnascotia Oct 6, 2021

pugnascotia Oct 6, 2021

droberts195 Oct 7, 2021 •

edited

Loading

droberts195 Oct 7, 2021

rjernst Oct 8, 2021

pugnascotia Oct 6, 2021

droberts195 Oct 7, 2021

mark-vieira left a comment

mark-vieira Oct 6, 2021

rjernst Oct 8, 2021

breskeby Oct 11, 2021

mark-vieira Oct 6, 2021

droberts195 commented Oct 11, 2021 •

edited

Loading

droberts195 Oct 11, 2021

droberts195 commented Oct 13, 2021

rjernst left a comment

rjernst Oct 14, 2021

rjernst Oct 14, 2021

droberts195 Oct 15, 2021

rjernst Oct 15, 2021

droberts195 Oct 15, 2021

droberts195 commented Oct 16, 2021

droberts195 commented Oct 16, 2021

Allow total memory to be overridden #78750

Allow total memory to be overridden #78750

Conversation

droberts195 commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticmachine commented Oct 6, 2021

elasticsearchmachine commented Oct 6, 2021

elasticsearchmachine commented Oct 6, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

droberts195 Oct 7, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mark-vieira left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

droberts195 commented Oct 11, 2021 • edited Loading

Choose a reason for hiding this comment

droberts195 commented Oct 13, 2021

rjernst left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

droberts195 commented Oct 16, 2021

droberts195 commented Oct 16, 2021

droberts195 Oct 7, 2021 •

edited

Loading

droberts195 commented Oct 11, 2021 •

edited

Loading