Skip to content
This repository has been archived by the owner on Apr 3, 2018. It is now read-only.

[RFC] Generate enriched errors #657

Closed

Conversation

jodh-intel
Copy link
Collaborator

Context

All components of both Clear Containers and Kata Containers now produce structured logs. This is extremely useful as the log fields provide a lot of context on key stages in the runtime flows.

Current virtcontainers error example

However, errors currently suffer as they are wholly "free-format".

For example, here's an example of a structured log message logged by cc-runtime:

time="2018-02-28T21:18:55.292102004+01:00" level=error msg="Container not ready or running, impossible to signal the container" source=runtime pid=1234

Or formatted with the log-parser on kata-containers/tests#98 for easier reading:

- count: 1
  timedelta: 0
  filename: /home/james/tmp/vc-old.txt
  line: 1
  time: !!timestamp 2018-02-28T21:18:55.292102004+01:00
  pid: 1234
  level: error
  msg: Container not ready or running, impossible to signal the container
  source: runtime
  name: cc-runtime
  data: {}

Notes:

  • The msg is the value passed to fmt.Errorf().
  • The source=runtime is actually wrong - this error was generated by virtcontainers, but it was the runtime that caught the error and logged it (but how would a reader know this?)

Error based on this PR

This PR is just an idea for how we could improve our error handling. It adds a new implementation of Errorf() so that rather than the above, with this PR you'd get the following:

time="2018-03-02T14:40:30.311729832Z" level=error arch=amd64 error="Container not ready or running, impossible to signal the container" error-time="2018-03-02T14:40:30.311705732Z" file="\"/home/james/go/src/github.com/clearcontainers/runtime/vendor/github.com/containers/virtcontainers/container.go\"" function="github.com/clearcontainers/runtime/vendor/github.com/containers/virtcontainers.(*Container).kill" hello=world line=628 name=cc-runtime pid=22469 source=virtcontainers

... and here's the log-parser reformatted version of that line:

- count: 1
  timedelta: 0
  filename: /home/james/tmp/vc.log
  line: 1
  time: !!timestamp 2018-03-02T14:40:30.311729832Z
  pid: 22469
  level: error
  msg: ""
  source: virtcontainers
  name: cc-runtime
  data:
    arch: amd64
    error: Container not ready or running, impossible to signal the container                                                                                                                 
    error-time: "2018-03-02T14:40:30.311705732Z"
    file: '"/home/james/go/src/github.com/clearcontainers/runtime/vendor/github.com/containers/virtcontainers/container.go"'
    function: github.com/clearcontainers/runtime/vendor/github.com/containers/virtcontainers.(*Container).kill
    hello: world
    line: "628"

Notes:

  • We now get two timestamps - one showing the time the error was logged (time) and the other showing the time the error was generated (error-time).
  • The file, line number and function where the error was generated are now logged.
  • The source now correctly shows virtcontainers.

Caveats

This PR is just an example. To be generally useful, we'd need this Errorf() function (and an equivalent to errors.New()) to be available to all components of Clear Containers and Kata Containers. That implies we'd need a new package for this functionality in a separate repo that we'd "vendor in" to all the components.

The other issue is how we handle the logger itself. Ideally, we'd find a way to register the logger with the new package and then have Errorf() use that logger object. But worst case, we'd need to change all the fmt.Errorf() and errors.New() calls to pass in the logger as an extra arg.

/cc @sameo, @sboeuf, @grahamwhaley.

Change the way the logger is constructed to allow for future
improvements by making the logger concrete logrus.Entry rather than a
FieldLogger interface type.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
@@ -625,7 +625,7 @@ func (c *Container) kill(signal syscall.Signal, all bool) error {
}

if c.state.State != StateReady && c.state.State != StateRunning {
return fmt.Errorf("Container not ready or running, impossible to signal the container")
return Errorf("Container not ready or running, impossible to signal the container")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good and returning better error messages will definitely help for faster debug. One question though, why is that only applied to this specific error ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I missed that it was only an example...

@sboeuf
Copy link
Collaborator

sboeuf commented Mar 2, 2018

@jodh-intel PR looks good, and I agree this should be in a new subpackage of virtcontainers. This way, this could be reused accross repos.

Create a new `Errorf()` function that returns an `error` object as
`fmt.Errorf()` does but which also includes additional metadata:

- the structured logging fields of the logger.
- details of the call site of the error.

Fixes containers#656.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Make a call to the new internal error creation function.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
@jodh-intel
Copy link
Collaborator Author

Hi @sboeuf - regarding the separate package, what about Kata? If we agree to adopt "enriched errors", we'd want them used everywhere so wouldn't it be simpler to create a completely separate package (https://github.com/kata-containers/errors or https://github.com/kata-containers/utils or something) that all components could vendor to avoid vendoring (a sub-package of) virtcontainers?

@sboeuf
Copy link
Collaborator

sboeuf commented Mar 2, 2018

Well given the fact that virtcontainers might very likely move under kata-containers/runtime, there is no need to move this to its own repo.

@jodh-intel
Copy link
Collaborator Author

jodh-intel commented Mar 6, 2018

A summary of current thinking...

The errors package

@markdryan pointed me at https://github.com/pkg/errors. This looks interesting, but although it improves the quality of the errors by adding a stack trace...

  • The returned error requires that particular format verbs are used to show the stack trace.
  • The stack trace is designed to be human-readable, not machine-readable.

I'm after a solution that doesn't change the error interface and which generates "structured" errors.

Original API Idea

I have an almost complete implementation whose API is:

package errors

// SetLogger sets the error logger to be used when generating error
// objects.
func SetLogger(logger *logrus.Entry)

// New is equivalent to errors.New().
func New(msg string) error

// Errorf is equivalent to fmt.Errorf().
func Errorf(format string, args ...interface{}) error

// Error takes a "normal" error and converts it into an enriched error
func Error(err error) error

This works in most scenarios but gets tripped up due to the flexible nature of logrus. Specifically, the following idiom -- where a more specialised logger is created (with extra fields) -- is problematic:

// "register" logger with the errors package
errors.SetLogger(virtLog)

// ...

logger := virtLog.WithField("foo", "bar")

err := doSomething()
if err != nil {
    // log the error
    logger.WithError(err).Error("doSomething() failed")

    // return an enriched error.
    //
    // BUG: the Error() function is still using the original logger (virtLog)
    // which doesn't have the "foo" field.
    return errors.Error(err)
}

The ways to resolve this issue (atleast the ones I came up with) aren't particularly palatable:

  • Allow errors.Error(), errors.Errorf() and errors.New() to accept an optional logger.

    It would work, but be error prone to use as easy to forget to pass it.

  • Change the API so that SetLogger() becomes something like:

    func SetLogger(f func() *logrus.Entry)
    

    That way, the code could be reworked so that direct access to virtLog would be replaced by a call to a function that returned the logger. But again, you'd need to remember to call it so it would be error-prone.

A simpler API idea

I had a further think and came up with the following:

// LogError takes a logger which is assumed to contain an error and returns a
// new structured error which includes details of the error call site.
func LogError(entry *logrus.Entry) error

With this, we could then do magic like:

func f() error {
    logger := virtLog.WithField("foo", "bar")

    err := doSomething()
    if err != nil {
        // this logs 
        return LogError(logger.WithError(err))
    }

    value := doSomethingElse()
    if value == "" {
        return LogError(logger.Error("doSomethingElse() failed"))
    }

    return nil
}

The advantages of this approach:

  • The API is extremely simple 😄
  • You can use whichever logger you happen to have in your function:
    • no need to register in advance.
    • all log fields will then be added to the error.
  • LogError() can either just return the error or it could also log it normally then return it as an error if desired.
  • If you forget to use .WithError() with the provided logger, atleast the call site information tells you where your bug is!

An example of that API:

origErr := errors.New("hello world")
err := LogError(logger.WithError(origErr))
fmt.Println(err)

which generates an error something like:

time="2018-03-06T17:29:13.06774561Z" level=error error="hello world" error-file=/home/james/src/go/test_logrus_as_error2.go error-function=main.testLogError error-line=156 error-time="2018-03-06T17:29:13.067665955Z" name=test_logrus_as_error2 pid=25206 source=error-logger

... or re-formatted with the log-parser:

- count: 1
  timedelta: 0
  filename: /home/james/tmp/kata-logs/error-logger.txt
  line: 1
  time: !!timestamp 2018-03-06T17:29:13.06774561Z
  pid: 25206
  level: error
  msg: ""
  source: error-logger
  name: test_logrus_as_error2
  data:
    error: hello world
    error-file: /home/james/src/go/test_logrus_as_error2.go
    error-function: main.testLogError
    error-line: "156"
    error-time: "2018-03-06T17:29:13.067665955Z"

@jodh-intel
Copy link
Collaborator Author

Any thoughts on the above @sameo, @sboeuf, @grahamwhaley?

@grahamwhaley
Copy link
Contributor

grahamwhaley commented Mar 7, 2018

Most of that sounds sane to me, I think.

  • Are you getting tied tightly to having to use a logger, or specifically logrus here - if so, then it would be nice to avoid being tied in if possible (I know all our current code is using logrus and logging, but just noting for being fully generic).
  • I'd value the input from @markdryan here, for his more global go pov

mostly unrelated - query - is the !! deliberate in the timestamps:

 time: !!timestamp 2018-03-06T17:29:13.06774561Z

and if so, what does it mean, and why do they not also appear in the error time?

@jodh-intel
Copy link
Collaborator Author

Yes, LogError() is logrus-specific. If we ever wanted to move to another logger, I'm assuming we'd find a way to rework that function in similar ways to extract the log fields.

Ignore those timestamps - you get the weird prefix when running kata-log-parser --output-format yaml .... See: http://yaml.org/type/timestamp.html.

@sboeuf
Copy link
Collaborator

sboeuf commented Mar 7, 2018

@jodh-intel I am all for simplicity ! And your simpler API idea looks good to me. Especially for logging, it's important not to be confused with too many options.

@sboeuf
Copy link
Collaborator

sboeuf commented Mar 15, 2018

@jodh-intel Need to be closed and moved to https://github.com/kata-containers/runtime/virtcontainers

@sboeuf sboeuf closed this Mar 15, 2018
@sboeuf sboeuf removed the review label Mar 15, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants