Resiliency: Throw exception if the JVM will corrupt data. #7580

Closed
wants to merge 3 commits into
from

Projects

None yet

8 participants

@rmuir
Contributor
rmuir commented Sep 4, 2014

Detect the worst-offender, known hotspot versions that can
cause index corruption, and fail on startup.

@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offender, known hotspot versions that can
cause index corruption, and fail on startup.
6224eb2
@dadoonet dadoonet and 1 other commented on an outdated diff Sep 4, 2014
src/main/java/org/elasticsearch/bootstrap/Bootstrap.java
@@ -195,11 +198,19 @@ public static void main(String[] args) {
}
String stage = "Initialization";
- try {
+ try {
@dadoonet
dadoonet Sep 4, 2014 Member

Not needed change :)

@rmuir
rmuir Sep 4, 2014 Contributor

I think these changes are unavoidable and auto-created by the eclipse configuration in elasticsearch :(

@s1monw s1monw commented on an outdated diff Sep 4, 2014
src/main/java/org/elasticsearch/bootstrap/Bootstrap.java
@@ -281,4 +292,20 @@ private static String buildErrorMessage(String stage, Throwable e) {
}
return errorMessage.toString();
}
+
+ // Note: this list is for 1.7+ versions only, the vm vendor is always 'Oracle Corporation' here,
+ // (was different before!), even on openjdk/icedtea when configured with hotspot VM.
+ static final Set<String> brokenHotspotVersions = new HashSet<>(Arrays.asList(
+ "21.0-b17", // 1.7.0 loop optimizer bug: https://bugs.openjdk.java.net/browse/JDK-7070134
+ "24.0-b56", // 1.7.0_40 register allocation bug (technically only x86/amd64): https://bugs.openjdk.java.net/browse/JDK-8024830
+ "24.45-b08", // 1.7.0_45 ''
+ "24.51-b03" // 1.7.0_51 ''
+
+ ));
+
+ private static void checkJVM() {
@s1monw
s1monw Sep 4, 2014 Contributor

can we also call this from a test baseclass - I don't want our tests to run on a broken JVM?!

@s1monw s1monw commented on an outdated diff Sep 4, 2014
src/main/java/org/elasticsearch/bootstrap/Bootstrap.java
if (!foreground) {
Loggers.disableConsoleLogging();
System.out.close();
}
+
+ // fail if using broken version
+ if (Boolean.parseBoolean(System.getProperty("elasticsearch.bypass.vm.check"))) {
@s1monw
s1monw Sep 4, 2014 Contributor

I think we should make this a constant. For consistentcy we should use the es. prefix rather than elasticsearch.

@s1monw
Contributor
s1monw commented Sep 4, 2014

I left two comments - looks good in general

@spinscale
Member

"Please upgrade" is the last message the user gets in the exception, without having any clue where to upgrade to and which other versions are affected (maybe even the latest one is).

Can we point the user somewhere, where he gets help instead of being unspecific here? A documentation page which contains unsafe versions we could link to or something?

@dadoonet
Member
dadoonet commented Sep 4, 2014

+1 for @spinscale comment. I really like when projects log errors with links to documentation.

@rmuir
Contributor
rmuir commented Sep 4, 2014

Who will make and maintain such a page and ensure it doesnt go 404?

@clintongormley
Member

@rmuir the docs build process checks all links within the docs before pushing any changes live. it also has hooks for custom checks, eg we also check the links in the config.yml file to ensure that they are still correct. We could add a custom hook for this page too.

@rmuir
Contributor
rmuir commented Sep 4, 2014

I'm not really a webpage designer. I think this is blowing up the scope of the issue with unnecessary requirements.

@rmuir rmuir closed this Sep 4, 2014
@dadoonet
Member
dadoonet commented Sep 4, 2014

I think we should think of something somehow generic. Such as what Twitter4J does.
For each error you get, you have an associated Error code which is unique.

The log says:

Relevant discussions can be found on the Internet at:
    http://www.google.co.jp/search?q=93f2523c or
    http://www.google.co.jp/search?q=8b74a4e3
TwitterException{exceptionCode=[93f2523c-8b74a4e3], statusCode=401, message=Invalid or expired token, code=89, retryAfter=-1, rateLimitStatus=null, version=3.0.3}

I think we should think about something similar but in another issue as we somehow hijacked this one :)

@clintongormley
Member

Well, we need to document good versions anyway. It doesn't require any design, just a page of docs. This is versioned along with the code. You tell me what info needs to be on the page and I'll write it.

@rmuir
Contributor
rmuir commented Sep 4, 2014

Crashing with an empty SIGSEGV is better than today. It prevents data corruption. We don't even need an error message at all, a webpage is over the top. Its stupid to block this issue on such things.

@clintongormley clintongormley reopened this Sep 4, 2014
@clintongormley
Member

Rather than adding a link to the error message, which may well change in the future, let's just add "Please see the documentation" to the error message. Searching for "java version" or "jvm version" will bring up the docs section added in #7591

@nik9000
Contributor
nik9000 commented Sep 4, 2014

@clintongormley watch out for how personalized google can be - that probably would work for you or me but it might not work for someone who's searched for elasticsearch fewer times. Besides, if you move the page it should get a redirect. Or you could go all out and point to a version on the internet archive or something....

@clintongormley
Member

@nik9000 i meant the search in the ES docs :)

won't work now, but will when #7591 is merged

@nik9000
Contributor
nik9000 commented Sep 4, 2014

@nik9000 i meant the search in the ES docs :)

Ah! Cool. Can you link to a page that searches the Elasticsearch docs? Something like [this](http://www.elasticsearch.org/guide/?s=java version)?

@s1monw s1monw removed the review label Sep 5, 2014
@s1monw
Contributor
s1monw commented Sep 5, 2014

@rmuir I left some comments - I think it's close though!

@clintongormley clintongormley changed the title from [Bootstrap] Throw exception if the JVM will corrupt data. to Resiliency: Throw exception if the JVM will corrupt data. Sep 8, 2014
@s1monw
Contributor
s1monw commented Sep 10, 2014

@rmuir can we move forward here it is marked as a blocker an I'd really like to move forward here

@rmuir
Contributor
rmuir commented Sep 10, 2014

I don't think the change will make it in. Please understand my point of view:

I didn't just take a few minutes and whip up a quick patch here, i tested extensively with all impacted hotspot versions (both Oracle JDK and openjdk/IcedTea), including testing the workaround bypass, and testing that "good versions" are ok, etc. This took me a good bit of time.

This is an exceptionally dangerous piece of functionality. With all the scope creep of this issue ("extra features" and "nice to haves" etc), comes the need to retest. There is no way I would touch this logic in a functional way, such as renaming parameters, without retesting.

So we should decide what is really a priority, and what can be a followup issue.

Currently I am fighting other fires, and I just dont have enough time to do it right, sorry.

@s1monw s1monw added v1.5.0 and removed v1.4.0 labels Sep 11, 2014
@s1monw
Contributor
s1monw commented Sep 11, 2014

moved out to 1.5

@clintongormley clintongormley removed the blocker label Sep 26, 2014
@s1monw s1monw added v1.6.0 and removed v1.5.0 labels Mar 17, 2015
@s1monw
Contributor
s1monw commented Mar 20, 2015

@rmuir @rjernst should we revisit this or should I just close it for now?

@s1monw
Contributor
s1monw commented Mar 20, 2015

I really think we should use what we have and expand in other issues. This PR is a good improvement and it's been sitting here for way too long. @rmuir can you rebase this branch and add maybe missing JVMs? all the other requests can be added later.

@rjernst
Member
rjernst commented Mar 20, 2015

+1 to get this in as is.

@rmuir
Contributor
rmuir commented Mar 21, 2015

I updated the check since i had to re-test all JVMs anyway.

I refactored this out of bootstrap, made messaging more verbose, added support for compiler workaround flags, and added logic for IBM (because users have hit this on the mailing list).

Example output with different versions:

1.7.0:

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar org.elasticsearch.bootstrap.JVMCheck

Exception in thread "main" java.lang.RuntimeException: Java version: 1.7.0 suffers from critical bug https://bugs.openjdk.java.net/browse/JDK-7070134 which can cause data corruption.
Please upgrade the JVM, see http://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for current recommendations.
If you absolutely cannot upgrade, please add -XX:-UseLoopPredicate to the JVM_OPTS environment variable.

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar -XX:-UseLoopPredicate org.elasticsearch.bootstrap.JVMCheck

Mar 21, 2015 1:48:58 AM org.elasticsearch.bootstrap.JVMCheck check
WARNING: Workaround flag -XX:-UseLoopPredicate for bug https://bugs.openjdk.java.net/browse/JDK-7070134 found.
This will result in degraded performance!
Upgrading is preferred, see http://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for current recommendations.

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar -Des.bypass.vm.check=true org.elasticsearch.bootstrap.JVMCheck

Mar 21, 2015 1:59:57 AM org.elasticsearch.bootstrap.JVMCheck check
WARNING: bypassing jvm version check for version [1.7.0], this can result in data corruption!

1.7.0_40:

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar org.elasticsearch.bootstrap.JVMCheck

Exception in thread "main" java.lang.RuntimeException: Java version: 1.7.0_40 suffers from critical bug https://bugs.openjdk.java.net/browse/JDK-8024830 which can cause data corruption.
Please upgrade the JVM, see http://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for current recommendations.
If you absolutely cannot upgrade, please add -XX:-UseSuperWord to the JVM_OPTS environment variable.
Upgrading is preferred, this workaround will result in degraded performance.

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar -XX:-UseSuperWord org.elasticsearch.bootstrap.JVMCheck

Mar 21, 2015 1:51:03 AM org.elasticsearch.bootstrap.JVMCheck check
WARNING: Workaround flag -XX:-UseSuperWord for bug https://bugs.openjdk.java.net/browse/JDK-8024830 found.
This will result in degraded performance!
Upgrading is preferred, see http://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for current recommendations.

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar -Des.bypass.vm.check=true org.elasticsearch.bootstrap.JVMCheck

Mar 21, 2015 1:58:52 AM org.elasticsearch.bootstrap.JVMCheck check
WARNING: bypassing jvm version check for version [1.7.0_40], this can result in data corruption!

IBM:

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar org.elasticsearch.bootstrap.JVMCheck

Exception in thread "main" java.lang.RuntimeException: IBM runtimes suffer from several bugs which can cause data corruption.
Please upgrade the JVM, see http://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html for current recommendations.

$ java -cp elasticsearch-2.0.0-SNAPSHOT.jar:lib/lucene-core-5.1.0-snapshot-1662607.jar -Des.bypass.vm.check=true org.elasticsearch.bootstrap.JVMCheck

Mar 21, 2015 1:57:16 AM org.elasticsearch.bootstrap.JVMCheck check
WARNING: bypassing jvm version check for version [1.7.0], this can result in data corruption!

@kimchy
Member
kimchy commented Mar 21, 2015

LGTM

@rmuir rmuir added a commit that closed this pull request Mar 21, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
6da99b3
@rmuir rmuir closed this in 6da99b3 Mar 21, 2015
@rmuir rmuir added a commit that referenced this pull request Mar 21, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
8d4c805
@rmuir rmuir added v1.5.0 v1.4.5 and removed v1.6.0 labels Mar 21, 2015
@rmuir rmuir added a commit that referenced this pull request Mar 21, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
89e5a9b
@rmuir rmuir added a commit that referenced this pull request Mar 21, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
1fe2ad7
@clintongormley
Member

w00t!

@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
086f1cb
@mute mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
@rmuir rmuir [Bootstrap] Throw exception if the JVM will corrupt data.
Detect the worst-offenders, all IBM versions and several known hotspot
versions that can cause index corruption, and fail on startup.

Provide/detect compiler workarounds when they exist, but warn about
performance degradation.

In all cases the check can be bypassed completely with a safety
switch via undocumented system property (es.bypass.vm.check=true)

Closes #7580
0f3c992
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment