-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: add structured logging for changefeed create and failed #79513
changefeedccl: add structured logging for changefeed create and failed #79513
Conversation
7dc7465
to
f5c4768
Compare
f5c4768
to
5153af1
Compare
005dd3b
to
54b5acc
Compare
createChangefeedEvent := &eventpb.CreateChangefeedQuery{ | ||
CommonJobEventDetails: jobEventDetails, | ||
NumTables: int32(len(AllTargets(changefeedDetails))), | ||
SinkType: "core", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps empty string for core? I'm okay either way... it's just we're using "" elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is for the purposes of Telemetry in Snowflake I like "core", we already use "core" in our existing telemetry.Count('changefeed.create.core')
, this data will persist into the future when maybe we have another "empty-ish" type of sink that makes ""
unintuitive to someone that knows all sinks, and if someone is looking at the data without complete knowledge of the sinks then "core" gives them something to search up.
EventType string | ||
} | ||
|
||
var cmLogRe = regexp.MustCompile(`event_log\.go`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does cm stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiled, this was copied from a bulk test looking at these files, I'll just expand the variable name since its unclear.
pkg/util/log/eventpb/telemetry.proto
Outdated
@@ -71,3 +71,31 @@ message CapturedIndexUsageStats { | |||
bool is_unique = 10 [(gogoproto.jsontag) = ",omitempty"]; | |||
bool is_inverted = 11 [(gogoproto.jsontag) = ",omitempty"]; | |||
} | |||
|
|||
// CreateChangefeedQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better comment needed if you going to have a comment.
Also, I'm not too keen on having "query" in its name. Maybe simply CreateChangefeed
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does look a little weird when its alongside other changefeed events unrelated to a Query
, I'll switch to just CreateChangefeed
👍
pkg/util/log/eventpb/telemetry.proto
Outdated
// CreateChangefeedQuery | ||
message CreateChangefeedQuery { | ||
CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true]; | ||
CommonJobEventDetails job = 2 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we should embed job details. We do have core style changefeeds -- and those don't have job record(s). So, it's a bit strange to see it here. Secondly, does the job detail proto give us interesting information? I.e. is knowing job id interesting? Personally, I don't think so. Furthermore, we are using "Description" of the job event details to put in, basically a redacted changefeed statement. Why don't we use CommonSQLEventDetails
which already has statement field (which gets redacted)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly using it for Description, and now I think maybe it'd be better to just use our own CommonChangefeedEventDetails
with a description
and sink_type
in it, since it'll map directly to what would've been the job description, re-using our own redaction.
pkg/util/log/eventpb/telemetry.proto
Outdated
// The type of sink being emitted to (ex: kafka, nodelocal, webhook-https). | ||
string sink_type = 4 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""]; | ||
|
||
// The behavior of emitted resolved spans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the value of resolved option (like "resolved=10s")? Is it useful? Or do we care more about the fact that resolved option was specified?
Perhaps we can include a boolean? And that's sufficient?
Also, should we perhaps include booleans for things like initial_scan? I know we have plenty of various options now for this (including the recently added initial_scan_only)... But maybe we can have initial_scan as a string which can take one of the values:
off, only, cursor (i.e. no_initial_scan was specified, so cursor was either specified or we used statement time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked having it as a string simply to keep the option open for future values / maybe we do one day care about how long customers are waiting / maybe we add some more information in resolved messages that then result in more options making sense.
Didn't notice that initial_scan
unification was merged, I'll add that in 👍
} | ||
|
||
// ChangefeedFailed | ||
message ChangefeedFailed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm questioning why we would want to have the separate message for falures.
Why not have failure_reason above in the main changefeed telemetry message? Even when changefeed creation fails, it is possible that the other fields (e..g format, sink type, etc) could be initialized. So, for example, we could see particular sink types seem to fail more often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ChangefeedFailed
applies for both ChangefeedCreate
failures as well as runtime failures such as changefeed_behind
(not implemented in this PR) and unknown_error
(currently applying to all failures in this PR), so they're distinct events. I'll update the comment here to mention changefeed_behind
so that there isn't that implication of it being only Create related errors.
I'll also make both messages use a shared CommonChangefeedEventDetails
that includes a sink_type
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...
Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the benefit of sharing a ChangefeedEvent
and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information in CommonChangefeedEventDetails
?
There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single ChangefeedEvent
struct would get noisy.
Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the benefit of sharing a
ChangefeedEvent
and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information inCommonChangefeedEventDetails
?There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single
ChangefeedEvent
struct would get noisy.Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.
Argh... I didn't realize this pr had LG. That's unfortunate. I have not for example, gave much thought to tagged errors, etc. I think I made a mistake LG-ing instead of some other pr.
Thank you for moving to changefeed common details. This addresses my biggest criticism.
I'm still not clear about the split into multiple message types -- there is "no prior" art for this (afaik other systems don't differentiate), nor do I know of any guidance for the telemetry usage.
@@ -71,6 +72,34 @@ func init() { | |||
) | |||
} | |||
|
|||
// Wraps changefeedPlanHook to log startup errors to the Telemetry channel | |||
func changefeedPlanHookWrapped( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to delay reviewing this part until proto message comments addressed.
if err != nil { | ||
logChangefeedFailedTelemetry(ctx, nil, failureType(err)) | ||
} | ||
hookFnWrapped := func(ctx context.Context, planNodes []sql.PlanNode, resultsCh chan<- tree.Datums) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason doing this wrapping, even if it is a complete no-op, seems to cause a test failure in TestTenantStreaming
. Not sure how a no-op wrapping of a function in another, just passing in the arguments and returning the return value, could cause an observable difference 🤔
54b5acc
to
a7593f7
Compare
I pushed a new incomplete commit just to have the protobuf changes and general error wrapping setup shown for review. Still adding the filling in of the description and sink type to failure messages. |
65f1614
to
d8b9830
Compare
Actually I think I'll leave it like this in this PR for now just to ensure this gets in, where we don't have any |
37400d0
to
08fb8a7
Compare
pkg/util/log/eventpb/telemetry.proto
Outdated
int32 num_tables = 2 [(gogoproto.jsontag) = ",omitempty"]; | ||
|
||
// The behavior of emitted resolved spans (ex: yes, no, 10s) | ||
string resolved = 3 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda think that just indication if resolved option was used or not is probably enough.
But okay..
} | ||
|
||
// ChangefeedFailed | ||
message ChangefeedFailed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...
Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?
725b538
to
ded5d13
Compare
bors r+ |
bors r- |
Canceled. |
Resolves cockroachdb#77283 Previously we didn't have any logging related to the new Telemetry channel in the structured logs which are exported to Snowflake. This patch adds events around changefeed creation and failure, informing us of some creation options and a general reason for a failure. Release justification: low risk logging related changes Release note (ops change): Changefeed creations and failures event logs are now emitted to the telemetry channel
ded5d13
to
53870c8
Compare
bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 53870c8 to blathers/backport-release-22.1-79513: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Resolves #77283
Previously we didn't have any logging related to the new Telemetry
channel (see
https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/2283798543/Telemetry+Logging).
This patch adds events around changefeed creation and failure, informing
us of some creation options and a general reason for a failure.
Release justification: low risk logging related changes
Release note (ops change): Changefeed creations and failures event logs
are now emitted to the telemetry channel