
changefeed,kv: Excessive memory usage when using rangefeeds. #74219

Closed
miretskiy opened this issue Dec 22, 2021 · 3 comments
Assignees
Labels
A-cdc Change Data Capture A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Dec 22, 2021

This is technically a rangefeed issue as the problem clearly lies in the rangefeed subsystem.

However, it is only applicable to the changefeed system at this point (but will almost certainly impact replication streaming as well). Other users of rangefeeds use them on small system tables and are unlikely to experience this issue.

The crux of the problem is that running a rangefeed against a substantially sized table (thousands of ranges per node) can result in OOMs.

When establishing the rangefeed streaming RPC, each range is treated as a separate stream. Each stream is configured to use a 2MB initial window size (pkg/rpc/context.go). The name "initial" is a bit of a misnomer: using any window size above the gRPC default of 64KB turns off dynamic window sizing, so in effect this is an upper bound on the amount of memory that can be used by an RPC stream (lots of historical context can be found in #35161).
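As a point of reference, here is a minimal grpc-go sketch (not the actual pkg/rpc/context.go code; the address and constant are placeholders) of how a client pins the per-stream flow-control window. Any value above the ~64KB default disables dynamic window sizing, so the value acts as a hard per-stream buffer cap.

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    const initialWindowSize = 2 << 20 // 2MB per stream, mirroring the value described above

    conn, err := grpc.Dial(
        "localhost:26257", // placeholder address
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        // Per-stream window: caps how much data the server may send on one
        // stream before the client application reads it off the stream.
        grpc.WithInitialWindowSize(initialWindowSize),
        // Per-connection window, shared across all streams on the connection.
        grpc.WithInitialConnWindowSize(initialWindowSize),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
}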

Critically, though, the memory used by a gRPC stream is not tracked by any memory monitor.
This memory is "in flight", and the consumer of the RPC stream is expected to drain it quickly. However, no matter how fast the consumer is, in a sufficiently large system the mere act of establishing a rangefeed can result in an OOM. For example, consider a 31-node cluster where a single node wants to talk to all other nodes, each containing 500 ranges. This results in a 30-to-1 fan-in scenario where, if those 500 ranges per node had enough data to send, we could wind up attempting to ingest 30GB of data very rapidly. Pre-reserving 30GB of memory in any memory monitor is excessive -- and probably incorrect.

The second issue, applicable to changefeeds (and soon to replication streaming), is that these systems emit data downstream to external systems -- possibly over high-latency links. Changefeeds have an effective pushback mechanism (which uses a memory monitor) to deal with slow sinks. Unfortunately, as described above, if the consumer of the RPC stream stops consuming, there will still be up to 2MB worth of data buffered by the stream itself. From dist_sender_rangefeed.go:

for {
    event, err := stream.Recv()
    ...
    select {
    case eventCh <- event:
    ...
    }
}

When emission to the events channel blocks -- for any reason -- the stream still keeps receiving data, up to 2MB.

The above observations were verified by writing a unit test that emitted 4MB of data into each of 100 ranges and capturing heap profiles.

[Heap profile screenshot: Screen Shot 2021-12-22 at 11:44:44 AM]

As expected, 200 MB (2MB per stream) are allocated by the http2 client transport.

Lowering the initial window size to 64KB+1 (one byte above the default, so the value still acts as a hard limit) results in 6MB:
[Heap profile screenshot: Screen Shot 2021-12-22 at 12:02:05 PM]

A full solution to this problem is not expected to be backportable, so we'll start off with a partial solution that is.

We need to add a client-side connection class specifically for rangefeeds. This connection class will use a much lower initial window size: perhaps 196KB (hopefully making OOMs roughly 10x less likely), or even 64KB+1 -- a roughly 30x reduction.

A real solution is to ensure that each client rangefeed call (which can span many ranges) establishes one stream per node and multiplexes all of its feeds onto that single stream, so that the per-stream pushback mechanism is effective. That is, each changefeed will establish one stream per node instead of one stream per range.
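Here is a hedged sketch of the client-side half of that idea, using made-up types rather than any actual CockroachDB API: events arriving on the single per-node stream carry a per-feed stream ID and are routed to per-range consumers.

package main

import "fmt"

// muxEvent is an assumed envelope carrying a stream ID so the client can route
// events received on the shared per-node stream to the right consumer.
type muxEvent struct {
    streamID int64
    payload  string
}

// demux routes events from the single per-node stream onto the per-range
// consumer channels registered in sinks, then closes the sinks when the
// shared stream ends.
func demux(events <-chan muxEvent, sinks map[int64]chan string) {
    for ev := range events {
        if sink, ok := sinks[ev.streamID]; ok {
            sink <- ev.payload
        }
    }
    for _, sink := range sinks {
        close(sink)
    }
}

func main() {
    events := make(chan muxEvent)
    sinks := map[int64]chan string{
        1: make(chan string, 1),
        2: make(chan string, 1),
    }
    go demux(events, sinks)
    events <- muxEvent{streamID: 1, payload: "range 1 checkpoint"}
    events <- muxEvent{streamID: 2, payload: "range 2 value"}
    close(events)
    fmt.Println(<-sinks[1], "|", <-sinks[2])
}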

Solutions that probably don't solve anything:

  • Have changefeeds disconnect when they enter pushback. This doesn't solve anything because resuming with a catch-up scan could hit us with even more data, even more rapidly. The changefeed disconnect would probably come too late anyway.
  • Have connection-level limits and pushback. gRPC explicitly removed connection-level limits; all pushback is per stream.

Jira issue: CRDB-11977
Epic CRDB-23738

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture A-kv Anything in KV that doesn't belong in a more specific category. T-cdc labels Dec 22, 2021
@blathers-crl

blathers-crl bot commented Dec 22, 2021

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added this to Triage in [DEPRECATED] CDC Dec 22, 2021
@miretskiy miretskiy added this to Incoming in KV via automation Dec 22, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 22, 2021
craig bot pushed a commit that referenced this issue Jan 5, 2022
74222: kv,rpc: Introduce dedicated rangefeed connection class.  r=miretskiy a=miretskiy

Rangefeeds executing against very large tables may experience OOMs under certain
conditions. The underlying reason is that cockroach uses a 2MB buffer per gRPC
stream to improve system throughput (particularly under high-latency scenarios).
However, because a rangefeed against a large table establishes a dedicated stream
for each range, the possibility of an OOM exists if there are many streams from
which events are not consumed quickly enough.

This PR does two things.

First, it introduces a dedicated RPC connection class for use with rangefeeds.

The rangefeed connection class configures rangefeed streams to use less memory
per stream than the default connection class. The initial window size for the rangefeed
connection class can be adjusted via the `COCKROACH_RANGEFEED_RPC_INITIAL_WINDOW_SIZE`
environment variable, whose default is 128KB.

For rangefeeds to use this new connection class,
the `kv.rangefeed.use_dedicated_connection_class.enabled` setting must be turned on.

Another change in this PR is that the default RPC window size can be adjusted
via the `COCKROACH_RPC_INITIAL_WINDOW_SIZE` environment variable. The default for this
variable is kept at 2MB.

Changing either of those variables is an advanced operation and should not be taken
lightly, since it impacts all aspects of cockroach's performance characteristics.
Values larger than 2MB will be trimmed to 2MB. Setting either variable to 64KB or
below will turn gRPC's dynamic window sizing back on.

It should be noted that this change is a partial mitigation, not a complete fix.
A full fix will require a rework of rangefeed behavior and will be done at a later time.

An alternative to introducing a dedicated connection class was to simply
lower the default connection window size. However, such a change would impact
all RPCs in the system and was deemed too risky at this point; substantial
benchmarking is needed before making such a change.

Informs #74219

Release Notes (performance improvement): Allow rangefeed streams to use a separate
HTTP connection when the `kv.rangefeed.use_dedicated_connection_class.enabled` setting
is turned on. Using a separate connection class reduces the possibility of OOMs when
running rangefeeds against very large tables. The connection window size
for rangefeeds can be adjusted via the `COCKROACH_RANGEFEED_INITIAL_WINDOW_SIZE` environment
variable, whose default is 128KB.


Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
@amruss amruss moved this from Triage to Bugs in [DEPRECATED] CDC Jan 5, 2022
@mwang1026

@miretskiy @amruss removing kv since you seem to be handling this

@mwang1026 mwang1026 removed this from Incoming in KV Jan 11, 2022
@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Jan 21, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jan 26, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough table, or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow-up, up to and including OOMing the node. Since rangefeeds tend to run on (almost) every node, a possibility exists that an entire cluster might get destabilized.

One of the reasons for this is that the memory used by the low(er) level http2 RPC transport is not currently tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `RangeFeedStream`. In a uni-directional streaming RPC, the client establishes a connection to the KV server and requests to receive events for one span. A separate RPC stream is created for each span.

In contrast, `RangeFeedStream` is a bi-directional RPC: the client is expected to connect to the KV server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end.

Note: just like in the uni-directional implementation, each call to the `RangeFeedStream` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implemented as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such streams from `number_of_ranges` down to `number_of_nodes` (per logical rangefeed). This, in turn, enables the client to reserve the worst-case amount of memory before starting the rangefeed. This amount of memory is bounded by the number of nodes in the cluster times the per-stream maximum memory -- 2MB by default, but this can be changed via the dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow-up via http2 pushback, we may want to also manage server-side memory better. Memory accounting for the client (and possibly the server) will be added in follow-on PRs.

The new RPC is turned on by default since the intention is to retire the uni-directional rangefeed in a later release. However, this default may be disabled (and the rangefeed reverted to the old implementation) by setting the `COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable to `false` and restarting the cluster.

Release Notes (enterprise change): Introduce a bi-directional implementation of rangefeed called `RangeFeedStream`. Rangefeeds are used by multiple subsystems including changefeeds. The bi-directional implementation provides a better mechanism to monitor memory used by rangefeeds. The rangefeed implementation can be reverted back to its old implementation by restarting the cluster with the `COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable set to false.
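For illustration, a hedged sketch (assumed types, not the CockroachDB implementation) of the server-side fan-in the commit message describes: many per-range event producers feed a single goroutine that owns the bi-directional stream's send path, so HTTP/2 flow control pushes back on the whole node rather than on each range separately.

package main

import (
    "fmt"
    "sync"
)

// rangeFeedEvent is an assumed event envelope tagged with the per-span stream
// ID the client supplied when it requested the span.
type rangeFeedEvent struct {
    streamID int64
    payload  string
}

// serveMux fans events from every per-range producer into a single send
// function, which stands in for the bi-directional RPC stream's Send method.
func serveMux(producers []<-chan rangeFeedEvent, send func(rangeFeedEvent)) {
    out := make(chan rangeFeedEvent)
    var wg sync.WaitGroup
    for _, p := range producers {
        wg.Add(1)
        go func(p <-chan rangeFeedEvent) {
            defer wg.Done()
            for ev := range p {
                out <- ev
            }
        }(p)
    }
    go func() { wg.Wait(); close(out) }()

    // Single writer: all ranges on this node share one stream.
    for ev := range out {
        send(ev)
    }
}

func main() {
    r1 := make(chan rangeFeedEvent, 1)
    r2 := make(chan rangeFeedEvent, 1)
    r1 <- rangeFeedEvent{streamID: 1, payload: "checkpoint"}
    r2 <- rangeFeedEvent{streamID: 2, payload: "value"}
    close(r1)
    close(r2)
    serveMux([]<-chan rangeFeedEvent{r1, r2}, func(ev rangeFeedEvent) {
        fmt.Println(ev.streamID, ev.payload)
    })
}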
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jan 26, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jan 28, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jan 28, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough table, or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow-up, up to and including OOMing the node. Since rangefeeds tend to run on (almost) every node, a possibility exists that an entire cluster might get destabilized.

One of the reasons for this is that the memory used by the low(er) level http2 RPC transport is not currently tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directional streaming RPC, the client establishes a connection to the KV server and requests to receive events for one span. A separate RPC stream is created for each span.

In contrast, `MuxRangeFeed` is a bi-directional RPC: the client is expected to connect to the KV server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end.

Note: just like in the uni-directional implementation, each call to the `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implemented as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such streams from `number_of_ranges` down to `number_of_nodes` (per logical rangefeed). This, in turn, enables the client to reserve the worst-case amount of memory before starting the rangefeed. This amount of memory is bounded by the number of nodes in the cluster times the per-stream maximum memory -- 2MB by default, but this can be changed via the dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow-up via http2 pushback, we may want to also manage server-side memory better. Memory accounting for the client (and possibly the server) will be added in follow-on PRs.

The new RPC is turned on by default since the intention is to retire the uni-directional rangefeed in a later release. However, this default may be disabled (and the rangefeed reverted to the old implementation) by setting the `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable to `false` and restarting the cluster.

Release Notes (enterprise change): Rangefeeds (which are used internally, e.g. by changefeeds) now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The rangefeed implementation can be reverted back to its old implementation by restarting the cluster with the `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable set to false.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jan 29, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 1, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Feb 1, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 3, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough table, or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow-up, up to and including OOMing the node. Since rangefeeds tend to run on (almost) every node, a possibility exists that an entire cluster might get destabilized.

One of the reasons for this is that the memory used by the low(er) level http2 RPC transport is not currently tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directional streaming RPC, the client establishes a connection to the KV server and requests to receive events for one span. A separate RPC stream is created for each span.

In contrast, `MuxRangeFeed` is a bi-directional RPC: the client is expected to connect to the KV server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end.

Note: just like in the uni-directional implementation, each call to the `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implemented as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such streams from `number_of_ranges` down to `number_of_nodes` (per logical rangefeed). This, in turn, enables the client to reserve the worst-case amount of memory before starting the rangefeed. This amount of memory is bounded by the number of nodes in the cluster times the per-stream maximum memory -- 2MB by default, but this can be changed via the dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow-up via http2 pushback, we may want to also manage server-side memory better. Memory accounting for the client (and possibly the server) will be added in follow-on PRs.

The use of the new RPC is contingent on the callers providing the `WithMuxRangefeed` option when starting the rangefeed. However, an environment variable "kill switch", `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED`, exists to disable the use of mux rangefeed in case serious issues are discovered.

Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds may now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to the new mechanism by specifying the `WithMuxRangefeed` option when starting the rangefeed. However, the cluster-wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 5, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 9, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 10, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 16, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 16, 2022
craig bot pushed a commit that referenced this issue Aug 18, 2022
75581: kv: Introduce MuxRangeFeed bi-directional RPC.  r=miretskiy a=miretskiy

Release Justification: Rangefeed scalability and stability improvement.  Safe to merge
since the functionality is disabled by default.

84683: obsservice: ingest events r=andreimatei a=andreimatei

This patch adds a couple of things:
1. an RPC endpoint to CRDB for streaming out events. This RPC service is
   called by the Obs Service.
2. a library in CRDB for exporting events (see the sketch after this list).
3. code in the Obs Service for ingesting the events and writing them
   into a table in the sink cluster.
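
To make the flow concrete, here is a minimal Go sketch of what component (2)
might look like; the `Event` struct, `Exporter` interface, and channel-backed
implementation are hypothetical illustrations, not the actual Obs Service API.

package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical, simplified stand-in for an OpenTelemetry log
// record carrying a system.eventlog entry.
type Event struct {
	Timestamp time.Time
	Type      string
	Payload   string
}

// Exporter is a hypothetical interface for component (2): CRDB code calls
// Emit, and an implementation forwards events over the streaming RPC (1)
// to the Obs Service, which ingests them into a table (3).
type Exporter interface {
	Emit(ev Event)
}

// chanExporter is a toy implementation that buffers events on a channel,
// standing in for the RPC stream to the Obs Service.
type chanExporter struct{ out chan Event }

func (c *chanExporter) Emit(ev Event) { c.out <- ev }

func main() {
	exp := &chanExporter{out: make(chan Event, 1)}
	exp.Emit(Event{Timestamp: time.Now(), Type: "create_database", Payload: `{"name":"defaultdb"}`})
	fmt.Printf("ingested: %+v\n", <-exp.out)
}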

The first use of event exporting is for the system.eventlog events.
All events written to that table are now also exported. Once the Obs
Service takes hold in the future, I hope we can remove system.eventlog.

The events are represented using OpenTelemetry protos. Unfortunately,
I've had to copy over the otel protos to our tree because I couldn't
figure out a vendoring solution. Problems encountered for vendoring are:
1. The repo where these protos live
(https://github.com/open-telemetry/opentelemetry-proto) is not
go-get-able. This means hacks are needed for our vendoring tools.
2. Some of the protos in that repo do not build with gogoproto (they
only build with protoc), because they use the new-ish "optional"
option on some fields. The logs protos that we use in this patch do not
have this problem, but others do (so we'll need to figure something out
in the future when dealing with the metrics proto).  FWIW, the
OpenTelemetry Collector ironically has the same problem (it uses
gogoproto), and it solved it through a sed that changes all the optional
fields to one-of's.
3. Even if you solve the first two problems, the next one is that we
already have a dependency on these compiled protos in our tree
(go.opentelemetry.io/proto/otlp). This repo contains generated code,
using protoc. We need it because of our use of the otlp trace exporter.
Bringing in the protos again, and building them with gogo, results in Go
files that want to live in the same package/have the same import path.
So changing the import paths is needed.

Between all of these, I've given up - at least for the moment - and
decided to copy over to our tree the few protos that we actually need.
I'm also changing their import paths. You'll notice that there is a
script that codifies the process of transforming the needed protos from
their otel upstream.

Release note: None
Release justification: Non-production.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@miretskiy
Contributor Author

Mux rangefeed merged; it is somewhat undertested, but that's not part of this issue.
Closing this issue.
