
Conversation

@tz70s (Contributor) commented Sep 3, 2018:

Previously, the error message was not put as a tag into the failed trace.
Adding it is nice to have and useful for troubleshooting within a request context via a modern UI such as Jaeger or Zipkin.

This also includes a minor change to the primitive action flow, which previously was not enclosed in a failed trace.

I'm not sure folks agree on adopting a tracing system, given the conflicting goal that we already have the transid in the logs. IMHO, at the very least, traces shouldn't be confusing once a user decides to bring tracing into their system: whether or not an accurate error message exists, we should know which span failed.

The semantics follow the OpenTracing semantic conventions:
https://github.com/opentracing/specification/blob/master/semantic_conventions.md
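
(For reference, a minimal sketch of what those conventions prescribe, assuming the io.opentracing Java API; the markFailed helper below is hypothetical, not this PR's code.)

import scala.collection.JavaConverters._

import io.opentracing.Span
import io.opentracing.tag.Tags

// Hypothetical helper: mark a span as failed per the OpenTracing
// semantic conventions. The standardized "error" tag flags the span,
// and a timestamped log entry carries the details.
private def markFailed(span: Span, cause: Throwable): Unit = {
  Tags.ERROR.set(span, true)
  span.log(Map("event" -> "error", "error.object" -> cause).asJava)
}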

Snapshots in Jaeger: (two screenshots: first page, second page)

Signed-off-by: Tzu-Chiao Yeh su3g4284zo6y7@gmail.com

Description

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-io commented:

Codecov Report

Merging #4000 into master will decrease coverage by 5.08%.
The diff coverage is 46.66%.


@@            Coverage Diff             @@
##           master    #4000      +/-   ##
==========================================
- Coverage    85.9%   80.82%   -5.09%     
==========================================
  Files         146      146              
  Lines        7095     7102       +7     
  Branches      443      439       -4     
==========================================
- Hits         6095     5740     -355     
- Misses       1000     1362     +362
Impacted Files Coverage Δ
...la/src/main/scala/whisk/common/TransactionId.scala 93.44% <100%> (ø) ⬆️
...ala/whisk/common/tracing/OpenTracingProvider.scala 21.15% <14.28%> (-1.77%) ⬇️
...isk/core/controller/actions/PrimitiveActions.scala 88.4% <71.42%> (-1.23%) ⬇️
...core/database/cosmosdb/RxObservableImplicits.scala 0% <0%> (-100%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0% <0%> (-95.1%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0% <0%> (-92.6%) ⬇️
...whisk/core/database/cosmosdb/CosmosDBSupport.scala 0% <0%> (-81.82%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 0% <0%> (-58.83%) ⬇️
.../scala/src/main/scala/whisk/utils/Exceptions.scala 20% <0%> (-20%) ⬇️
...whisk/connector/kafka/KafkaProducerConnector.scala 75.67% <0%> (-16.22%) ⬇️
... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0242a60...e3c3899. Read the comment docs.

@chetanmeh (Member) commented:

Surfacing the error state in traces would be useful. @sandeep-paliwal thoughts?


postedFuture andThen {
case Success(_) => transid.finished(this, startLoadbalancer)
case Failure(e) => transid.failed(this, startLoadbalancer, e.getMessage, logLevel = ErrorLevel)
Contributor commented:

Should the error log level be eliminated here? An activation failure doesn't always imply a system failure.

@tz70s (Contributor, Author) commented Sep 5, 2018:

Thanks for the reminder; we can use the default warn level, or I can match on exception types for different levels. However, I think logging here is quite redundant, because we've already logged this at doInvoke.

Should this get changed as well?
https://github.com/apache/incubator-openwhisk/blob/0242a60aeb976d070abda76a8ac2aa7e3306e393/core/controller/src/main/scala/whisk/core/controller/Actions.scala#L292

BTW, OpenTracing provides no standardized warn tag; is this fine?
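
(A rough sketch of what matching on exception types could look like; the exception types and the levelFor helper below are illustrative, not part of this PR.)

import java.util.concurrent.TimeoutException

import akka.event.Logging.{ErrorLevel, InfoLevel, LogLevel, WarningLevel}

// Illustrative only: derive the log level from the failure type, so
// user-caused failures are not reported as system errors.
private def levelFor(t: Throwable): LogLevel = t match {
  case _: IllegalArgumentException => InfoLevel    // e.g. bad user input
  case _: TimeoutException         => WarningLevel // e.g. a slow activation
  case _                           => ErrorLevel   // unexpected system failure
}

// Then, in the andThen blocks above:
// case Failure(e) => transid.failed(this, startLoadbalancer, e.getMessage, logLevel = levelFor(e))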

Contributor commented:

I think that using the standard warn level would be fine.

The snippet you provided depicts a system error, so the error log level in that place is fine IMO.

As long as it doesn't disturb anyone else, I'm fine with leaving the error tag for tracing. My main point was to avoid flooding the logs with messages produced by users, since that could significantly hamper performance.

@tz70s (Contributor, Author) commented:

Ah yes, it's a system error, thanks!

}
} andThen {
case Success(_) => transid.finished(this, startActivation)
case Failure(e) => transid.failed(this, startActivation, e.getMessage, logLevel = ErrorLevel)
Contributor commented:

The same question regarding the log level...

@sandeep-paliwal (Contributor) commented:

@chetanmeh I agree. Having error information in traces for failed invocations will be very useful.


private def setErrorTags(span: Span, message: => String): Unit = {
span.setTag("error", true)
span.setTag("message", message)
@sandeep-paliwal (Contributor) commented:

A tag is usually added to make it easy to search for a specific span. Since the key in this case is always going to be "error", do you think this is a more appropriate case for span.log()?

@tz70s (Contributor, Author) commented:

Hi @sandeep-paliwal, thanks for the suggestion.
I think the big difference between a tag and a log is the timestamp; as you mentioned, a tag is much easier to search for a specific span, and it makes explicit what's going on.

I've tried the error-capturing guidance from the OpenTracing data model standard, but it's not that useful for seeing what's really going on in Jaeger.

Some code snippet (with the import it needs):

import scala.collection.JavaConverters._

private def setErrorTags(span: Span, message: => String): Unit = {
  span.log(Map("event" -> "error", "message" -> message).asJava)
}

Snapshots: (two Jaeger UI screenshots; the error is not explicit)

Am I missing something? If not, I think an alternative is to still use error as a tag and attach the log; then we can eliminate the message tag field, if necessary. WDYT?

i.e.

private def setErrorTags(span: Span, message: => String): Unit = {
  span.setTag("error", true)
  span.log(Map("event" -> "error", "message" -> message).asJava)
}

@sandeep-paliwal (Contributor) commented:

Yeah, I see that just adding a log to the trace is not that helpful in the UI. I guess using a tag is the better option then. Is there any use case for searching all failed invocations with a specific error? In that case, the error message as a tag would also be more useful than log().

@tz70s (Contributor, Author) commented:

Yeah, it might be useful if someone knows the internal error messages. One drawback is polluting the tracing system if it's mixed with other systems besides OpenWhisk, but that can be filtered by module. I think we can keep it as a tag for the advanced usage you mentioned.
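
(For that advanced usage, a hedged example: the Jaeger UI's search form has a Tags field that accepts logfmt-style key=value pairs, so a query like the one below should surface only failed spans; the message value is illustrative.)

error=true message="some internal error"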

Signed-off-by: Tzu-Chiao Yeh <su3g4284zo6y7@gmail.com>
@chetanmeh (Member) left a comment:

LGTM

@chetanmeh chetanmeh merged commit 77884e5 into apache:master Sep 25, 2018
BillZong pushed a commit to BillZong/openwhisk that referenced this pull request Nov 18, 2019
Put error message as a tag into the failed trace. This is nice to have and useful for troubleshooting within a request context via a modern UI such as Jaeger or Zipkin.
