Skip to content

server: enhance HTTP and RPC logging#55640

Open
knz wants to merge 2 commits intocockroachdb:masterfrom
knz:20201016-server-log
Open

server: enhance HTTP and RPC logging#55640
knz wants to merge 2 commits intocockroachdb:masterfrom
knz:20201016-server-log

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Oct 16, 2020

Informs (possibly closes) #45716

The first of the two commits is just a little bit on the side of "tricky". @bdarnell can you inspect both the commit message and the code to check that what I did is sensical and I didn't make a mess?

(I agree with one of your comments somewhere that this code needs to be clarified. However I'd like this PR to be back-portable to 20.2 and 20.1 due to a customer commitment. So I'm leaving the refactor/clarification to a follow-up PR that will be 21.1-only.)

It took me a solid 3 hours to understand why the context.Context on the RPC handler side had lost all the annotations from the HTTP side. I hope in the future we can avoid the loopback and have the RPC handlers called directly by gRPC? Does this make any sense?

@knz knz requested a review from bdarnell October 16, 2020 18:22
@knz knz requested a review from a team as a code owner October 16, 2020 18:22
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

knz added 2 commits October 16, 2020 20:56
This patch ensures that all logging that occurs "under" a HTTP
request is annotated with a logging tag `client` with the remote
address.

Additionally, the patch adds conditional logging of the HTTP requests
themselves under `log.V(1)` and a new cluster setting
`server.log_web_requests.enabled`.

To understand how this patch work, one should be aware of how the
HTTP-RPC gateway works inside CockroachDB:

1. `Server.ServeHTTP` gets called by `http.Serve`
2. this calls the grpc-gateway `ServeHTTP` method
3. this initiates a `net.Conn` using the `loopbackListener` inside
   CockroachDB, and delivers a valid RPC request payload over
   that connection.
4. the RPC server gets handed a new net.Conn from scratch,
   and reads an encoded RPC payload from there.

Of note, the Go RPC handler method *is not called directly by the HTTP
server function*. Instead, the HTTP request is translated to a RPC
request and the RPC server receives it "over the wire"

This creates complexity with the handling of `context.Context`:
because the RPC server gets the request over the wire, there is no
`context.Context` attached. It starts from scratch. Any information
in the `context.Context` available to the HTTP request at step (2)
is lost in this translation.

This is why we have this method called
`forwardAuthenticationMetadata()` in the code: this takes the useful
annotations in the HTTP `context.Context` and translates them to K/V
pairs that are then automatically embedded in the RPC request through
the loopback interface.

Then, "on the other side" in the RPC server, the metadata is loaded
back into a `context.Context` via `decorateContextFromWebMetadata()`.

To ensure this is done consistently, the new function
`decorateContextFromWebMetadata()` is now called from
`(*Server).AnnotateCtx()`, which is more-or-less called from every RPC
handler. (see note below)

Finally, when a RPC handler makes a RPC request as part of its
behavior (e.g. a cluster-wide RPC forwards requests to individual
nodes), we want these metadata K/V pairs to be forwarded automatically
to the "inner" request. This is the logic that was (previously)
present in `propagateGatewayMetadata()`. This function has been merged
into `decorateContextFromWebMetadata()` and is thus now done for every
request that calls `AnnotateCtx()` implicitly, which is presumably all
of them.

Note: In an ideal world, we'd have the RPC server call
`(*Server).AnnotateCtx()` for us. Is there a mechanism for this? If
there was, the code can be further simplified.

Release note (cli change): It is now possible to trigger logging
of HTTP requests (for troubleshooting) to the debug log on all nodes with
the cluster setting `server.log_web_requests.enabled`.

Release note (cli change): The `vmodule` setting `server=1` now also
includes logging of every HTTP request. Note that this is only meant
for use during troubleshooting and the particular `vmodule` setting
that produces HTTP request logs will likely change in a later version
without notice.
Release note (security update): CockroachDB is now able to log every
HTTP user login and manual logout event in the connection and
authentication log, when the new cluster setting
`server.auth_log.web_sessions.enabled` is set. This uses the same
logging channel as used by SQL conn&auth logging enabled by the
previous `server.auth_log.*` cluster settings.

Release note (security update): CockroachDB is now able to log every
authenticated HTTP request in the connection and authentication log,
when the new cluster setting `server.auth_log.web_requests.enabled` is
set. Note that only authenticated requests are logged, and so public
HTTP APIs are excluded (e.g. health requests). This uses the same
logging channel as used by SQL conn&auth logging enabled by the
previous `server.auth_log.*` cluster settings.
@knz knz force-pushed the 20201016-server-log branch from f8ddecc to 5746007 Compare October 16, 2020 18:59
Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea why the http-to-rpc stuff works the way it does (isn't it handled by a third-party package grpc-gateway?).

Reviewed 9 of 9 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


docs/generated/settings/settings.html, line 36 at r1 (raw file):

<tr><td><code>server.auth_log.sql_connections.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log SQL client connect and disconnect events (note: may hinder performance on loaded nodes)</td></tr>
<tr><td><code>server.auth_log.sql_sessions.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log SQL session login/disconnection events (note: may hinder performance on loaded nodes)</td></tr>
<tr><td><code>server.auth_log.web_requests.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log HTTP requests by authenticated users (note: may hinder performance on loaded nodes)</td></tr>

Nit: The settings introduced in r2 got committed to settings.html in r1.


docs/generated/settings/settings.html, line 43 at r1 (raw file):

<tr><td><code>server.eventlog.ttl</code></td><td>duration</td><td><code>2160h0m0s</code></td><td>if nonzero, event log entries older than this duration are deleted every 10m0s. Should not be lowered below 24 hours.</td></tr>
<tr><td><code>server.host_based_authentication.configuration</code></td><td>string</td><td><code></code></td><td>host-based authentication configuration to use during connection authentication</td></tr>
<tr><td><code>server.log_web_requests.enabled</code></td><td>boolean</td><td><code>false</code></td><td>if set, log all HTTP requests on every node</td></tr>

The new settings are confusing. Do we need both "log all requests" and "log authenticated requests"? The help text and the non-adjacent positions in the list make it more confusing. To me server.auth_log.* are just for authentication events, and operations by users who happen to be authenticated would be better in a server.log.* namespace.

The fact that logs of authenticated requests go to a separate file while "web request" go to the debug logs is not obvious from the help text, and while the fact that the auth logs are fsynced and web logs are not is indicated with the "hinder performance" comment this seems a little subtle to me.

This offers boolean options for two points in a three-dimensional config space: all vs authed, same file vs separate, and synced vs unsynced. I realize that a large refactoring of log configuration is out of scope for this change, but until we get there we should minimize the number of ad-hoc options we support (and will have to translate to the future configs). Do we really need all three of these? I think we probably need the auth logs but I'd prefer to get rid of server.log_web_requests.enabled (it's fine to keep the log.V(1) version of it)


pkg/server/auth_log.go, line 30 at r2 (raw file):

var logWebSessionAuth = settings.RegisterPublicBoolSetting(
	"server.auth_log.web_sessions.enabled",
	"if set, log HTTP session login/disconnection events (note: may hinder performance on loaded nodes)",

s/disconnection/logout/. You're not logging anything on disconnection (nor would it make sense too because the TCP connection doesn't map to an auth session)

This logs successful login events - is there a need for logging failed login attempts?


pkg/server/auth_log.go, line 56 at r2 (raw file):

		ctx := req.Context()
		// Note: the URL is not safe for logging, as it can contain
		// application data as part of the query path.

This seems unnecessarily restrictive and makes the redacted logs pretty useless. We're good about not hiding parameters in the path portion of the URL, so I think it would be better to parse the URL and log everything but the query portion as "safe".


pkg/server/auth_log.go, line 57 at r2 (raw file):

		// Note: the URL is not safe for logging, as it can contain
		// application data as part of the query path.
		a.authLogger.Logf(ctx, "HTTP %s: %s", log.Safe(req.Method), req.URL)

Don't we need to log the user here? (or is that coming in from the log tags?). Do we need to skip the logging if there is no user, or are we guaranteed that we'll only get here with a logged-in user? Even if all that is being guaranteed externally, it seems easy to wrap handlers out of order and not have the user info here so it's better to check. (related: any time we're using a log that fsyncs to disk we should add tests. These logs are by definition a stronger commitment to the user than our regular logging)


pkg/server/authentication.go, line 636 at r1 (raw file):

	// We are annotating a context while serving a RPC request.
	if remoteAddrs, ok := md[webSessionRemoteAddrStr]; ok && len(remoteAddrs) == 1 {
		ctx = logtags.AddTag(ctx, "client", remoteAddrs[0])

s/client/remoteAddr/ or something like that? In any case, this string is used in multiple places (here and server.go) so should probably be pulled out into a constant.


pkg/server/authentication.go, line 470 at r2 (raw file):

	if logWebSessionAuth.Get(&s.server.st.SV) {
		ctx = security.DecorateContext(ctx, username)
		log.Infof(ctx, "WOO\n%s", debug.Stack())

Don't forget to remove this.


pkg/server/server.go, line 682 at r1 (raw file):

// AnnotateCtx is a convenience wrapper; see AmbientContext.
func (s *Server) AnnotateCtx(ctx context.Context) context.Context {
	ctx = decorateContextFromWebMetadata(ctx)

Nit: I think it would be better to reverse the order of these, so that in the event of overlap (which shouldn't happen, which is why this is just a nit), the metadata overrides the "ambient" context.


pkg/server/server.go, line 2252 at r1 (raw file):

var httpRequestLog = settings.RegisterPublicBoolSetting(
	"server.log_web_requests.enabled",

The other settings are server.auth_log.*.enabled; maybe this should be server.log.web_requests.enabled?

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

X-noremind Bots won't notify about PRs with X-noremind

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants