server: enhance HTTP and RPC logging#55640
Conversation
This patch ensures that all logging that occurs "under" a HTTP request is annotated with a logging tag `client` with the remote address. Additionally, the patch adds conditional logging of the HTTP requests themselves under `log.V(1)` and a new cluster setting `server.log_web_requests.enabled`. To understand how this patch work, one should be aware of how the HTTP-RPC gateway works inside CockroachDB: 1. `Server.ServeHTTP` gets called by `http.Serve` 2. this calls the grpc-gateway `ServeHTTP` method 3. this initiates a `net.Conn` using the `loopbackListener` inside CockroachDB, and delivers a valid RPC request payload over that connection. 4. the RPC server gets handed a new net.Conn from scratch, and reads an encoded RPC payload from there. Of note, the Go RPC handler method *is not called directly by the HTTP server function*. Instead, the HTTP request is translated to a RPC request and the RPC server receives it "over the wire" This creates complexity with the handling of `context.Context`: because the RPC server gets the request over the wire, there is no `context.Context` attached. It starts from scratch. Any information in the `context.Context` available to the HTTP request at step (2) is lost in this translation. This is why we have this method called `forwardAuthenticationMetadata()` in the code: this takes the useful annotations in the HTTP `context.Context` and translates them to K/V pairs that are then automatically embedded in the RPC request through the loopback interface. Then, "on the other side" in the RPC server, the metadata is loaded back into a `context.Context` via `decorateContextFromWebMetadata()`. To ensure this is done consistently, the new function `decorateContextFromWebMetadata()` is now called from `(*Server).AnnotateCtx()`, which is more-or-less called from every RPC handler. (see note below) Finally, when a RPC handler makes a RPC request as part of its behavior (e.g. a cluster-wide RPC forwards requests to individual nodes), we want these metadata K/V pairs to be forwarded automatically to the "inner" request. This is the logic that was (previously) present in `propagateGatewayMetadata()`. This function has been merged into `decorateContextFromWebMetadata()` and is thus now done for every request that calls `AnnotateCtx()` implicitly, which is presumably all of them. Note: In an ideal world, we'd have the RPC server call `(*Server).AnnotateCtx()` for us. Is there a mechanism for this? If there was, the code can be further simplified. Release note (cli change): It is now possible to trigger logging of HTTP requests (for troubleshooting) to the debug log on all nodes with the cluster setting `server.log_web_requests.enabled`. Release note (cli change): The `vmodule` setting `server=1` now also includes logging of every HTTP request. Note that this is only meant for use during troubleshooting and the particular `vmodule` setting that produces HTTP request logs will likely change in a later version without notice.
Release note (security update): CockroachDB is now able to log every HTTP user login and manual logout event in the connection and authentication log, when the new cluster setting `server.auth_log.web_sessions.enabled` is set. This uses the same logging channel as used by SQL conn&auth logging enabled by the previous `server.auth_log.*` cluster settings. Release note (security update): CockroachDB is now able to log every authenticated HTTP request in the connection and authentication log, when the new cluster setting `server.auth_log.web_requests.enabled` is set. Note that only authenticated requests are logged, and so public HTTP APIs are excluded (e.g. health requests). This uses the same logging channel as used by SQL conn&auth logging enabled by the previous `server.auth_log.*` cluster settings.
f8ddecc to
5746007
Compare
bdarnell
left a comment
There was a problem hiding this comment.
I have no idea why the http-to-rpc stuff works the way it does (isn't it handled by a third-party package grpc-gateway?).
Reviewed 9 of 9 files at r1, 6 of 6 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @knz)
docs/generated/settings/settings.html, line 36 at r1 (raw file):
<tr><td><code>server.auth_log.sql_connections.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log SQL client connect and disconnect events (note: may hinder performance on loaded nodes)</td></tr> <tr><td><code>server.auth_log.sql_sessions.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log SQL session login/disconnection events (note: may hinder performance on loaded nodes)</td></tr> <tr><td><code>server.auth_log.web_requests.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log HTTP requests by authenticated users (note: may hinder performance on loaded nodes)</td></tr>
Nit: The settings introduced in r2 got committed to settings.html in r1.
docs/generated/settings/settings.html, line 43 at r1 (raw file):
<tr><td><code>server.eventlog.ttl</code></td><td>duration</td><td><code>2160h0m0s</code></td><td>if nonzero, event log entries older than this duration are deleted every 10m0s. Should not be lowered below 24 hours.</td></tr> <tr><td><code>server.host_based_authentication.configuration</code></td><td>string</td><td><code></code></td><td>host-based authentication configuration to use during connection authentication</td></tr> <tr><td><code>server.log_web_requests.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log all HTTP requests on every node</td></tr>
The new settings are confusing. Do we need both "log all requests" and "log authenticated requests"? The help text and the non-adjacent positions in the list make it more confusing. To me server.auth_log.* are just for authentication events, and operations by users who happen to be authenticated would be better in a server.log.* namespace.
The fact that logs of authenticated requests go to a separate file while "web request" go to the debug logs is not obvious from the help text, and while the fact that the auth logs are fsynced and web logs are not is indicated with the "hinder performance" comment this seems a little subtle to me.
This offers boolean options for two points in a three-dimensional config space: all vs authed, same file vs separate, and synced vs unsynced. I realize that a large refactoring of log configuration is out of scope for this change, but until we get there we should minimize the number of ad-hoc options we support (and will have to translate to the future configs). Do we really need all three of these? I think we probably need the auth logs but I'd prefer to get rid of server.log_web_requests.enabled (it's fine to keep the log.V(1) version of it)
pkg/server/auth_log.go, line 30 at r2 (raw file):
var logWebSessionAuth = settings.RegisterPublicBoolSetting( "server.auth_log.web_sessions.enabled", "if set, log HTTP session login/disconnection events (note: may hinder performance on loaded nodes)",
s/disconnection/logout/. You're not logging anything on disconnection (nor would it make sense too because the TCP connection doesn't map to an auth session)
This logs successful login events - is there a need for logging failed login attempts?
pkg/server/auth_log.go, line 56 at r2 (raw file):
ctx := req.Context() // Note: the URL is not safe for logging, as it can contain // application data as part of the query path.
This seems unnecessarily restrictive and makes the redacted logs pretty useless. We're good about not hiding parameters in the path portion of the URL, so I think it would be better to parse the URL and log everything but the query portion as "safe".
pkg/server/auth_log.go, line 57 at r2 (raw file):
// Note: the URL is not safe for logging, as it can contain // application data as part of the query path. a.authLogger.Logf(ctx, "HTTP %s: %s", log.Safe(req.Method), req.URL)
Don't we need to log the user here? (or is that coming in from the log tags?). Do we need to skip the logging if there is no user, or are we guaranteed that we'll only get here with a logged-in user? Even if all that is being guaranteed externally, it seems easy to wrap handlers out of order and not have the user info here so it's better to check. (related: any time we're using a log that fsyncs to disk we should add tests. These logs are by definition a stronger commitment to the user than our regular logging)
pkg/server/authentication.go, line 636 at r1 (raw file):
// We are annotating a context while serving a RPC request. if remoteAddrs, ok := md[webSessionRemoteAddrStr]; ok && len(remoteAddrs) == 1 { ctx = logtags.AddTag(ctx, "client", remoteAddrs[0])
s/client/remoteAddr/ or something like that? In any case, this string is used in multiple places (here and server.go) so should probably be pulled out into a constant.
pkg/server/authentication.go, line 470 at r2 (raw file):
if logWebSessionAuth.Get(&s.server.st.SV) { ctx = security.DecorateContext(ctx, username) log.Infof(ctx, "WOO\n%s", debug.Stack())
Don't forget to remove this.
pkg/server/server.go, line 682 at r1 (raw file):
// AnnotateCtx is a convenience wrapper; see AmbientContext. func (s *Server) AnnotateCtx(ctx context.Context) context.Context { ctx = decorateContextFromWebMetadata(ctx)
Nit: I think it would be better to reverse the order of these, so that in the event of overlap (which shouldn't happen, which is why this is just a nit), the metadata overrides the "ambient" context.
pkg/server/server.go, line 2252 at r1 (raw file):
var httpRequestLog = settings.RegisterPublicBoolSetting( "server.log_web_requests.enabled",
The other settings are server.auth_log.*.enabled; maybe this should be server.log.web_requests.enabled?
Informs (possibly closes) #45716
The first of the two commits is just a little bit on the side of "tricky". @bdarnell can you inspect both the commit message and the code to check that what I did is sensical and I didn't make a mess?
(I agree with one of your comments somewhere that this code needs to be clarified. However I'd like this PR to be back-portable to 20.2 and 20.1 due to a customer commitment. So I'm leaving the refactor/clarification to a follow-up PR that will be 21.1-only.)
It took me a solid 3 hours to understand why the
context.Contexton the RPC handler side had lost all the annotations from the HTTP side. I hope in the future we can avoid the loopback and have the RPC handlers called directly by gRPC? Does this make any sense?