Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add structured logging for changefeed create and failed #79513

Merged
merged 1 commit into from
Apr 11, 2022

Conversation

samiskin
Copy link
Contributor

@samiskin samiskin commented Apr 6, 2022

Resolves #77283

Previously we didn't have any logging related to the new Telemetry
channel (see
https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/2283798543/Telemetry+Logging).
This patch adds events around changefeed creation and failure, informing
us of some creation options and a general reason for a failure.

Release justification: low risk logging related changes

Release note (ops change): Changefeed creations and failures event logs
are now emitted to the telemetry channel

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@samiskin samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 7dc7465 to f5c4768 Compare April 6, 2022 16:55
@samiskin samiskin marked this pull request as ready for review April 6, 2022 16:55
@samiskin samiskin requested review from a team as code owners April 6, 2022 16:55
@samiskin samiskin requested review from ajwerner, a team and miretskiy and removed request for a team April 6, 2022 16:55
@samiskin samiskin force-pushed the changefeed-telemetry-logging branch from f5c4768 to 5153af1 Compare April 6, 2022 16:58
@samiskin samiskin added the backport-22.1.x 22.1 is EOL label Apr 6, 2022
@samiskin samiskin force-pushed the changefeed-telemetry-logging branch 9 times, most recently from 005dd3b to 54b5acc Compare April 6, 2022 20:04
createChangefeedEvent := &eventpb.CreateChangefeedQuery{
CommonJobEventDetails: jobEventDetails,
NumTables: int32(len(AllTargets(changefeedDetails))),
SinkType: "core",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps empty string for core? I'm okay either way... it's just we're using "" elsewhere.

Copy link
Contributor Author

@samiskin samiskin Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is for the purposes of Telemetry in Snowflake I like "core", we already use "core" in our existing telemetry.Count('changefeed.create.core'), this data will persist into the future when maybe we have another "empty-ish" type of sink that makes "" unintuitive to someone that knows all sinks, and if someone is looking at the data without complete knowledge of the sinks then "core" gives them something to search up.

EventType string
}

var cmLogRe = regexp.MustCompile(`event_log\.go`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does cm stand for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiled, this was copied from a bulk test looking at these files, I'll just expand the variable name since its unclear.

@@ -71,3 +71,31 @@ message CapturedIndexUsageStats {
bool is_unique = 10 [(gogoproto.jsontag) = ",omitempty"];
bool is_inverted = 11 [(gogoproto.jsontag) = ",omitempty"];
}

// CreateChangefeedQuery
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better comment needed if you going to have a comment.
Also, I'm not too keen on having "query" in its name. Maybe simply CreateChangefeed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does look a little weird when its alongside other changefeed events unrelated to a Query, I'll switch to just CreateChangefeed 👍

// CreateChangefeedQuery
message CreateChangefeedQuery {
CommonEventDetails common = 1 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];
CommonJobEventDetails job = 2 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "", (gogoproto.embed) = true];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we should embed job details. We do have core style changefeeds -- and those don't have job record(s). So, it's a bit strange to see it here. Secondly, does the job detail proto give us interesting information? I.e. is knowing job id interesting? Personally, I don't think so. Furthermore, we are using "Description" of the job event details to put in, basically a redacted changefeed statement. Why don't we use CommonSQLEventDetails which already has statement field (which gets redacted)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly using it for Description, and now I think maybe it'd be better to just use our own CommonChangefeedEventDetails with a description and sink_type in it, since it'll map directly to what would've been the job description, re-using our own redaction.

// The type of sink being emitted to (ex: kafka, nodelocal, webhook-https).
string sink_type = 4 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""];

// The behavior of emitted resolved spans.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the value of resolved option (like "resolved=10s")? Is it useful? Or do we care more about the fact that resolved option was specified?
Perhaps we can include a boolean? And that's sufficient?

Also, should we perhaps include booleans for things like initial_scan? I know we have plenty of various options now for this (including the recently added initial_scan_only)... But maybe we can have initial_scan as a string which can take one of the values:
off, only, cursor (i.e. no_initial_scan was specified, so cursor was either specified or we used statement time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked having it as a string simply to keep the option open for future values / maybe we do one day care about how long customers are waiting / maybe we add some more information in resolved messages that then result in more options making sense.

Didn't notice that initial_scan unification was merged, I'll add that in 👍

}

// ChangefeedFailed
message ChangefeedFailed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm questioning why we would want to have the separate message for falures.
Why not have failure_reason above in the main changefeed telemetry message? Even when changefeed creation fails, it is possible that the other fields (e..g format, sink type, etc) could be initialized. So, for example, we could see particular sink types seem to fail more often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChangefeedFailed applies for both ChangefeedCreate failures as well as runtime failures such as changefeed_behind (not implemented in this PR) and unknown_error (currently applying to all failures in this PR), so they're distinct events. I'll update the comment here to mention changefeed_behind so that there isn't that implication of it being only Create related errors.

I'll also make both messages use a shared CommonChangefeedEventDetails that includes a sink_type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...

Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit of sharing a ChangefeedEvent and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information in CommonChangefeedEventDetails?

There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single ChangefeedEvent struct would get noisy.

Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit of sharing a ChangefeedEvent and I would assume adding another string property to distinguish the actual changefeed-specific event type, compared to having multiple events but putting more information in CommonChangefeedEventDetails?

There's an idea for a Backfill Event with information like backfill_type, backfill_length, and backfill_size. If we had multiple events like that with very specific information, I feel like a single ChangefeedEvent struct would get noisy.

Having just one struct would make parsing easier in the test, but I think it's worth that sacrifice for the clarity.

Argh... I didn't realize this pr had LG. That's unfortunate. I have not for example, gave much thought to tagged errors, etc. I think I made a mistake LG-ing instead of some other pr.

Thank you for moving to changefeed common details. This addresses my biggest criticism.
I'm still not clear about the split into multiple message types -- there is "no prior" art for this (afaik other systems don't differentiate), nor do I know of any guidance for the telemetry usage.

@@ -71,6 +72,34 @@ func init() {
)
}

// Wraps changefeedPlanHook to log startup errors to the Telemetry channel
func changefeedPlanHookWrapped(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going to delay reviewing this part until proto message comments addressed.

if err != nil {
logChangefeedFailedTelemetry(ctx, nil, failureType(err))
}
hookFnWrapped := func(ctx context.Context, planNodes []sql.PlanNode, resultsCh chan<- tree.Datums) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason doing this wrapping, even if it is a complete no-op, seems to cause a test failure in TestTenantStreaming. Not sure how a no-op wrapping of a function in another, just passing in the arguments and returning the return value, could cause an observable difference 🤔

@samiskin samiskin force-pushed the changefeed-telemetry-logging branch from 54b5acc to a7593f7 Compare April 7, 2022 18:24
@samiskin
Copy link
Contributor Author

samiskin commented Apr 7, 2022

I pushed a new incomplete commit just to have the protobuf changes and general error wrapping setup shown for review. Still adding the filling in of the description and sink type to failure messages.

@samiskin samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 65f1614 to d8b9830 Compare April 7, 2022 19:27
@samiskin
Copy link
Contributor Author

samiskin commented Apr 7, 2022

Actually I think I'll leave it like this in this PR for now just to ensure this gets in, where we don't have any sink_type or description information for Failure events that occur during startup, but we do at least have it for events after the job's been created.

@samiskin samiskin force-pushed the changefeed-telemetry-logging branch 3 times, most recently from 37400d0 to 08fb8a7 Compare April 7, 2022 20:17
int32 num_tables = 2 [(gogoproto.jsontag) = ",omitempty"];

// The behavior of emitted resolved spans (ex: yes, no, 10s)
string resolved = 3 [(gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda think that just indication if resolved option was used or not is probably enough.
But okay..

}

// ChangefeedFailed
message ChangefeedFailed {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think my question stands: why is changefeed creation should be any different from changfeed error event? Wouldn't we want to know for example, how many webhook sinks with more than 5 tables fail?
When changefeed fails after the job is created, we have all of the information that we include in createChangefeed event...

Also, I kinda think that we should not be adding many changefeed event types. For example, wouldn't we want to also emit events if somebody alters changefeed?
We could probably use CreateChangefeed (though this should be renamed) to ChangefeedEvent maybe)?

@samiskin
Copy link
Contributor Author

bors r+

@samiskin
Copy link
Contributor Author

bors r-

@craig
Copy link
Contributor

craig bot commented Apr 11, 2022

Canceled.

Resolves cockroachdb#77283

Previously we didn't have any logging related to the new Telemetry
channel in the structured logs which are exported to Snowflake. This
patch adds events around changefeed creation and failure, informing us
of some creation options and a general reason for a failure.

Release justification: low risk logging related changes

Release note (ops change): Changefeed creations and failures event logs
are now emitted to the telemetry channel
@samiskin samiskin force-pushed the changefeed-telemetry-logging branch from ded5d13 to 53870c8 Compare April 11, 2022 05:56
@samiskin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 11, 2022

Build succeeded:

@craig craig bot merged commit 5c1ae72 into cockroachdb:master Apr 11, 2022
@blathers-crl
Copy link

blathers-crl bot commented Apr 11, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 53870c8 to blathers/backport-release-22.1-79513: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-22.1.x 22.1 is EOL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cdc: add telemetry logging for create changefeed
3 participants