Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] add version information in case of crash of native ML process #30674

Merged
merged 3 commits into from May 18, 2018

Conversation

hendrikmuhs
Copy link
Contributor

This change adds version information in case a native ML process crashes, the version is important for choosing the right symbol files when analyzing the crash. Adding the version combines all necessary information in one line.

relates elastic/ml-cpp#94

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

// add version information, so it's conveniently next to the crash log
upstreamMessage += ", version: ";
try {
upstreamMessage += getCppCopyright(Duration.ofMillis(10));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might look a bit weird to see the copyright as part of the "version" part of the message. The bit we actually want comes before the copyright. It could be got rid of quite simply like this:

    upstreamMessage += getCppCopyright(Duration.ofMillis(10)).replaceFirst(" Copyright.*", "");

For example, from:

autodetect (64 bit): Version 7.0.0-alpha1-SNAPSHOT (Build 799d6b9e8bdff9) Copyright (c) 2018 Elasticsearch BV

we'd add:

, version: autodetect (64 bit): Version 7.0.0-alpha1-SNAPSHOT (Build 799d6b9e8bdff9)

If we can already easily distinguish normalize from autodetect then we could add another replaceFirst to strip up to "Version ", so we'd get:

, version: 7.0.0-alpha1-SNAPSHOT (Build 799d6b9e8bdff9)

But it's useful to know the program name and I can't remember how easy it is to find that out from what's in the error message already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process name is part of the log line, e.g.

[2018-05-17T10:31:56,396][ERROR][o.e.x.m.j.p.a.NativeAutodetectProcess] [g6] autodetect process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0x7f5b8d4160b8, library: /home/hendrik/work/git-elastic/elasticsearch/test-install/elasticsearch-7.0.0-alpha1-SNAPSHOT/modules/x-pack/x-pack-ml/platform/linux-x86_64/bin/../lib/libMlMaths.so, base: 0x7f5b8d0de000, normalized address: 0x3380b8', version: autodetect (64 bit): Version based on 7.0.0-alpha1-SNAPSHOT (Build DEVELOPMENT BUILD by hendrik) Copyright (c) 2018 Elasticsearch BV

therfore: yes we can potentially replace it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful then to have a specific getCppVersion method that pulls out just the version data as @droberts195 describes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be a good opportunity to update the comment on the getCppCopyright method as it currently refers to "... the process ID of the C++ process ".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful then to have a specific getCppVersion method that pulls out just the version data

If we do that it would be good to use it instead of getCppCopyright() in NativeController.getNativeCodeInfo().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I am refactoring it, so there is just 1 place that parses the string.

@@ -283,6 +283,15 @@ private void parseMessage(XContent xContent, BytesReference bytesRef) {
if (upstreamMessage.contains("bad_alloc")) {
upstreamMessage += ", process ran out of memory.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The . at the end of this is no longer appropriate, as the line below will add a comma.

@hendrikmuhs
Copy link
Contributor Author

@droberts195 @edsavage

I pushed changes to omit anything but the essential version information. It looks like this now:

[2018-05-17T11:11:30,402][ERROR][o.e.x.m.j.p.a.NativeAutodetectProcess] [g6] autodetect process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0x7fb1da0c90b8, library: /home/hendrik/work/git-elastic/elasticsearch/test-install/elasticsearch-7.0.0-alpha1-SNAPSHOT/modules/x-pack/x-pack-ml/platform/linux-x86_64/bin/../lib/libMlMaths.so, base: 0x7fb1d9d91000, normalized address: 0x3380b8', version: based on 7.0.0-alpha1-SNAPSHOT (build DEVELOPMENT BUILD by hendrik)

On official builds you will see a proper version and build hash

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I just noticed one pre-existing nit that you moved from one file to another.

return info;
} else {
// If this happens it probably means someone has changed the format in lib/ver/CBuildInfo.cc
// in the machine-learning-cpp repo without changing the pattern above to match
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: machine-learning-cpp -> ml-cpp

Copy link
Contributor

@edsavage edsavage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hendrikmuhs hendrikmuhs merged commit d893041 into elastic:master May 18, 2018
hendrikmuhs pushed a commit that referenced this pull request May 18, 2018
…30674)

This change adds version information in case a native ML process
crashes, the version is important for choosing the right symbol files
when analyzing the crash. Adding the version combines all necessary
information on one line.

relates elastic/ml-cpp#94
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 18, 2018
…ngs-to-true

* elastic/master:
  Tests: Fail if test watches could not be triggered (elastic#30392)
  [ML] add version information in case of crash of native ML process (elastic#30674)
  Make TransportClusterStateAction abide to our style (elastic#30697)
  Change required version for Get Settings transport API changes to 6.4.0 (elastic#30706)
  [DOCS] Fixes edit URLs for stack overview (elastic#30583)
  Silence sleep based watcher test
  [TEST] Adjust version skips for movavg/movfn tests
jasontedor added a commit to rjernst/elasticsearch that referenced this pull request May 18, 2018
* elastic/master:
  [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  [DOCS] Removes redundant index.asciidoc files (elastic#30707)
  [TEST] Reduce forecast overflow to disk test memory limit (elastic#30727)
  Plugins: Remove meta plugins (elastic#30670)
  [DOCS] Moves X-Pack configurationg pages in table of contents (elastic#30702)
  TEST: Add engine log to testCorruptFileThenSnapshotAndRestore
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (elastic#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (elastic#30716)
  Default copy settings to true and deprecate on the REST layer (elastic#30598)
  [Build] Add test admin when starting gradle run with trial license and
  This implementation lazily (on 1st forecast request) checks for available diskspace and creates a subfolder for storing data outside of Lucene indexes, but as part of the ES data paths.
  Tests: Fail if test watches could not be triggered (elastic#30392)
  [ML] add version information in case of crash of native ML process (elastic#30674)
  Make TransportClusterStateAction abide to our style (elastic#30697)
  Change required version for Get Settings transport API changes to 6.4.0 (elastic#30706)
dnhatn added a commit that referenced this pull request May 19, 2018
* 6.x:
  Mute testCorruptFileThenSnapshotAndRestore
  Plugins: Remove meta plugins (#30670)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  Docs: Add uptasticsearch to list of clients (#30738)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  [Build] Add test admin when starting gradle run with trial license and
  [ML] provide tmp storage for forecasting and possibly any ml native jobs #30399
  Tests: Fail if test watches could not be triggered (#30392)
  Watcher: Prevent duplicate watch triggering during upgrade (#30643)
  [ML] add version information in case of crash of native ML process (#30674)
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Preserve REST client auth despite 401 response (#30558)
  Make TransportClusterStateAction abide to our style (#30697)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  [DOCS] Add missing callout in IndicesClientDocumentationIT
  Backport get settings API changes to 6.x (#30494)
  Silence sleep based watcher test
  [DOCS] Replace X-Pack terms with attributes
  Improve explanation in rescore (#30629)
  [test] packaging: add windows boxes (#30402)
  [ML] Clean left behind model state docs (#30659)
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  MovingFunction Pipeline agg backport to 6.x (#30658)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
dnhatn added a commit that referenced this pull request May 19, 2018
* master:
  Scripting: Remove getDate methods from ScriptDocValues (#30690)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  [Docs] Fix single page :docs:check invocation (#30725)
  Docs: Add uptasticsearch to list of clients (#30738)
  [DOCS] Removes out-dated x-pack/docs/en/index.asciidoc
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  Plugins: Remove meta plugins (#30670)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  TEST: Add engine log to testCorruptFileThenSnapshotAndRestore
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  Default copy settings to true and deprecate on the REST layer (#30598)
  [Build] Add test admin when starting gradle run with trial license and
  This implementation lazily (on 1st forecast request) checks for available diskspace and creates a subfolder for storing data outside of Lucene indexes, but as part of the ES data paths.
  Tests: Fail if test watches could not be triggered (#30392)
  [ML] add version information in case of crash of native ML process (#30674)
  Make TransportClusterStateAction abide to our style (#30697)
  Change required version for Get Settings transport API changes to 6.4.0 (#30706)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  Silence sleep based watcher test
  [TEST] Adjust version skips for movavg/movfn tests
  [DOCS] Replace X-Pack terms with attributes
  [ML] Clean left behind model state docs (#30659)
  Correct typos
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Remove bogus file accidentally added
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Adjust fast forward for token expiration test  (#30668)
  Improve explanation in rescore (#30629)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
  [Tests] Add debug information to CorruptedFileIT

# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/indices/analysis/AnalysisFactoryTestCase.java
ywelsch pushed a commit to ywelsch/elasticsearch that referenced this pull request May 23, 2018
…lastic#30674)

This change adds version information in case a native ML process crashes, the version is important for choosing the right symbol files when analyzing the crash. Adding the version combines all necessary information on one line.

relates elastic/ml-cpp#94
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants