changefeedccl: add structured logging for changefeed create and failed #79513

samiskin · 2022-04-06T16:40:28Z

Resolves #77283

Previously we didn't have any logging related to the new Telemetry
channel (see
https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/2283798543/Telemetry+Logging).
This patch adds events around changefeed creation and failure, informing
us of some creation options and a general reason for a failure.

Release justification: low risk logging related changes

Release note (ops change): Changefeed creations and failures event logs
are now emitted to the telemetry channel

cockroach-teamcity · 2022-04-06T16:40:36Z

This change is

miretskiy · 2022-04-07T12:23:25Z

pkg/ccl/changefeedccl/changefeed_stmt.go

+	createChangefeedEvent := &eventpb.CreateChangefeedQuery{
+		CommonJobEventDetails: jobEventDetails,
+		NumTables:             int32(len(AllTargets(changefeedDetails))),
+		SinkType:              "core",


Perhaps empty string for core? I'm okay either way... it's just we're using "" elsewhere.

Since this is for the purposes of Telemetry in Snowflake I like "core", we already use "core" in our existing telemetry.Count('changefeed.create.core'), this data will persist into the future when maybe we have another "empty-ish" type of sink that makes "" unintuitive to someone that knows all sinks, and if someone is looking at the data without complete knowledge of the sinks then "core" gives them something to search up.

miretskiy · 2022-04-07T12:34:11Z

pkg/ccl/changefeedccl/helpers_test.go

+	EventType string
+}
+
+var cmLogRe = regexp.MustCompile(`event_log\.go`)


what does cm stand for?

Compiled, this was copied from a bulk test looking at these files, I'll just expand the variable name since its unclear.

miretskiy · 2022-04-07T12:46:31Z

pkg/util/log/eventpb/telemetry.proto

@@ -71,3 +71,31 @@ message CapturedIndexUsageStats {
  bool is_unique = 10 [(gogoproto.jsontag) = ",omitempty"];
  bool is_inverted = 11 [(gogoproto.jsontag) = ",omitempty"];
 }
+
+// CreateChangefeedQuery


Better comment needed if you going to have a comment.
Also, I'm not too keen on having "query" in its name. Maybe simply CreateChangefeed?

It does look a little weird when its alongside other changefeed events unrelated to a Query, I'll switch to just CreateChangefeed 👍

miretskiy · 2022-04-07T12:50:32Z

pkg/util/log/eventpb/telemetry.proto

+// CreateChangefeedQuery
+message CreateChangefeedQuery {
+  CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];
+  CommonJobEventDetails job = 2 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];


Not sure we should embed job details. We do have core style changefeeds -- and those don't have job record(s). So, it's a bit strange to see it here. Secondly, does the job detail proto give us interesting information? I.e. is knowing job id interesting? Personally, I don't think so. Furthermore, we are using "Description" of the job event details to put in, basically a redacted changefeed statement. Why don't we use CommonSQLEventDetails which already has statement field (which gets redacted)?

I was mostly using it for Description, and now I think maybe it'd be better to just use our own CommonChangefeedEventDetails with a description and sink_type in it, since it'll map directly to what would've been the job description, re-using our own redaction.

miretskiy · 2022-04-07T12:54:04Z

pkg/util/log/eventpb/telemetry.proto

+  // The type of sink being emitted to (ex: kafka, nodelocal, webhook-https).
+  string sink_type = 4 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""];
+
+  // The behavior of emitted resolved spans.


is this the value of resolved option (like "resolved=10s")? Is it useful? Or do we care more about the fact that resolved option was specified?
Perhaps we can include a boolean? And that's sufficient?

Also, should we perhaps include booleans for things like initial_scan? I know we have plenty of various options now for this (including the recently added initial_scan_only)... But maybe we can have initial_scan as a string which can take one of the values:
off, only, cursor (i.e. no_initial_scan was specified, so cursor was either specified or we used statement time).

I liked having it as a string simply to keep the option open for future values / maybe we do one day care about how long customers are waiting / maybe we add some more information in resolved messages that then result in more options making sense.

Didn't notice that initial_scan unification was merged, I'll add that in 👍

miretskiy · 2022-04-07T12:55:44Z

pkg/util/log/eventpb/telemetry.proto

+}
+
+// ChangefeedFailed
+message ChangefeedFailed {


i'm questioning why we would want to have the separate message for falures.
Why not have failure_reason above in the main changefeed telemetry message? Even when changefeed creation fails, it is possible that the other fields (e..g format, sink type, etc) could be initialized. So, for example, we could see particular sink types seem to fail more often.

ChangefeedFailed applies for both ChangefeedCreate failures as well as runtime failures such as changefeed_behind (not implemented in this PR) and unknown_error (currently applying to all failures in this PR), so they're distinct events. I'll update the comment here to mention changefeed_behind so that there isn't that implication of it being only Create related errors.

I'll also make both messages use a shared CommonChangefeedEventDetails that includes a sink_type.

I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...

Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?

What would be the benefit of sharing a ChangefeedEvent and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information in CommonChangefeedEventDetails?

There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single ChangefeedEvent struct would get noisy.

Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.

What would be the benefit of sharing a ChangefeedEvent and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information in CommonChangefeedEventDetails?

There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single ChangefeedEvent struct would get noisy.

Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.

Argh... I didn't realize this pr had LG. That's unfortunate. I have not for example, gave much thought to tagged errors, etc. I think I made a mistake LG-ing instead of some other pr.

Thank you for moving to changefeed common details. This addresses my biggest criticism.
I'm still not clear about the split into multiple message types -- there is "no prior" art for this (afaik other systems don't differentiate), nor do I know of any guidance for the telemetry usage.

miretskiy · 2022-04-07T12:56:29Z

pkg/ccl/changefeedccl/changefeed_stmt.go

@@ -71,6 +72,34 @@ func init() {
 	)
 }

+// Wraps changefeedPlanHook to log startup errors to the Telemetry channel
+func changefeedPlanHookWrapped(


going to delay reviewing this part until proto message comments addressed.

samiskin · 2022-04-07T16:02:50Z

pkg/ccl/changefeedccl/changefeed_stmt.go

+	if err != nil {
+		logChangefeedFailedTelemetry(ctx, nil, failureType(err))
+	}
+	hookFnWrapped := func(ctx context.Context, planNodes []sql.PlanNode, resultsCh chan<- tree.Datums) error {


For some reason doing this wrapping, even if it is a complete no-op, seems to cause a test failure in TestTenantStreaming. Not sure how a no-op wrapping of a function in another, just passing in the arguments and returning the return value, could cause an observable difference 🤔

samiskin · 2022-04-07T18:28:57Z

I pushed a new incomplete commit just to have the protobuf changes and general error wrapping setup shown for review. Still adding the filling in of the description and sink type to failure messages.

samiskin · 2022-04-07T19:28:53Z

Actually I think I'll leave it like this in this PR for now just to ensure this gets in, where we don't have any sink_type or description information for Failure events that occur during startup, but we do at least have it for events after the job's been created.

miretskiy · 2022-04-07T20:42:25Z

pkg/util/log/eventpb/telemetry.proto

+  int32 num_tables = 2 [(gogoproto.jsontag) = ",omitempty"];
+
+  // The behavior of emitted resolved spans (ex: yes, no, 10s)
+  string resolved = 3 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) =  "redact:\"nonsensitive\""];


I kinda think that just indication if resolved option was used or not is probably enough.
But okay..

miretskiy · 2022-04-07T20:45:40Z

pkg/util/log/eventpb/telemetry.proto

+}
+
+// ChangefeedFailed
+message ChangefeedFailed {


I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...

Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?

samiskin · 2022-04-11T05:49:00Z

bors r+

samiskin · 2022-04-11T05:51:44Z

bors r-

craig · 2022-04-11T05:51:45Z

Canceled.

Resolves cockroachdb#77283 Previously we didn't have any logging related to the new Telemetry channel in the structured logs which are exported to Snowflake. This patch adds events around changefeed creation and failure, informing us of some creation options and a general reason for a failure. Release justification: low risk logging related changes Release note (ops change): Changefeed creations and failures event logs are now emitted to the telemetry channel

samiskin · 2022-04-11T05:56:47Z

bors r+

craig · 2022-04-11T06:53:13Z

Build succeeded:

GitHub CI (Cockroach)

blathers-crl · 2022-04-11T06:53:24Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 53870c8 to blathers/backport-release-22.1-79513: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 7dc7465 to f5c4768 Compare April 6, 2022 16:55

samiskin marked this pull request as ready for review April 6, 2022 16:55

samiskin requested review from a team as code owners April 6, 2022 16:55

samiskin requested review from ajwerner, a team and miretskiy and removed request for a team April 6, 2022 16:55

samiskin force-pushed the changefeed-telemetry-logging branch from f5c4768 to 5153af1 Compare April 6, 2022 16:58

samiskin added the backport-22.1.x 22.1 is EOL label Apr 6, 2022

samiskin force-pushed the changefeed-telemetry-logging branch 9 times, most recently from 005dd3b to 54b5acc Compare April 6, 2022 20:04

miretskiy approved these changes Apr 7, 2022

View reviewed changes

samiskin commented Apr 7, 2022

View reviewed changes

samiskin force-pushed the changefeed-telemetry-logging branch from 54b5acc to a7593f7 Compare April 7, 2022 18:24

samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 65f1614 to d8b9830 Compare April 7, 2022 19:27

samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 37400d0 to 08fb8a7 Compare April 7, 2022 20:17

miretskiy reviewed Apr 7, 2022

View reviewed changes

samiskin force-pushed the changefeed-telemetry-logging branch 4 times, most recently from 725b538 to ded5d13 Compare April 11, 2022 04:14

samiskin mentioned this pull request Apr 11, 2022

release-22.1: changefeedccl: add structured logging for changefeed create and failed #79749

Merged

samiskin force-pushed the changefeed-telemetry-logging branch from ded5d13 to 53870c8 Compare April 11, 2022 05:56

craig bot merged commit 5c1ae72 into cockroachdb:master Apr 11, 2022

cockroach-teamcity mentioned this pull request Apr 11, 2022

changefeedccl: add structured logging for changefeed create and failed cockroachdb/docs#13564

Closed

samiskin mentioned this pull request Jun 29, 2022

changefeedccl: log create_changefeed usage to telemetry channel #70761

Closed

samiskin mentioned this pull request Jan 9, 2023

changefeedccl: implement sampling of telemetry data for create changefeed event #69721

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

changefeedccl: add structured logging for changefeed create and failed #79513

changefeedccl: add structured logging for changefeed create and failed #79513

samiskin commented Apr 6, 2022

cockroach-teamcity commented Apr 6, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022 •

edited

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

miretskiy Apr 7, 2022

samiskin Apr 8, 2022

miretskiy Apr 11, 2022

miretskiy Apr 7, 2022

samiskin Apr 7, 2022

samiskin commented Apr 7, 2022

samiskin commented Apr 7, 2022

miretskiy Apr 7, 2022

miretskiy Apr 7, 2022

samiskin commented Apr 11, 2022

samiskin commented Apr 11, 2022

craig bot commented Apr 11, 2022

samiskin commented Apr 11, 2022

craig bot commented Apr 11, 2022

blathers-crl bot commented Apr 11, 2022

changefeedccl: add structured logging for changefeed create and failed #79513

changefeedccl: add structured logging for changefeed create and failed #79513

Conversation

samiskin commented Apr 6, 2022

cockroach-teamcity commented Apr 6, 2022

Choose a reason for hiding this comment

samiskin Apr 7, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samiskin commented Apr 7, 2022

samiskin commented Apr 7, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samiskin commented Apr 11, 2022

samiskin commented Apr 11, 2022

craig bot commented Apr 11, 2022

samiskin commented Apr 11, 2022

craig bot commented Apr 11, 2022

blathers-crl bot commented Apr 11, 2022

samiskin Apr 7, 2022 •

edited