release-21.2: kv,rpc: Introduce dedicated rangefeed connection class. #74456

Merged
merged 2 commits into cockroachdb:release-21.2 on Jan 6, 2022

Conversation

miretskiy
Contributor

Backport 2/2 commits from #74222.

/cc @cockroachdb/release


Rangefeeds executing against very large tables may experience
OOMs under certain conditions. The underlying reason is that
cockroach uses a 2MB buffer per gRPC stream to improve system throughput
(particularly under high-latency scenarios). However, because a rangefeed
against a large table establishes a dedicated stream for each range,
the possibility of an OOM exists if there are many streams from which events
are not consumed quickly enough.
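
(As an illustration of the arithmetic only: a rangefeed over a table split into
10,000 ranges could, in the worst case, pin roughly 10,000 × 2MB ≈ 20GB of
unconsumed stream buffers on a single client.)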

This PR does two things.

First, it introduces a dedicated RPC connection class for use with rangefeeds.

The rangefeed connection class configures rangefeed streams to use less memory
per stream than the default connection class. The initial window size for the
rangefeed connection class can be adjusted via the
COCKROACH_RANGEFEED_RPC_INITIAL_WINDOW_SIZE environment variable, whose default
is 128KB.

For rangefeeds to use this new connection class, the
kv.rangefeed.use_dedicated_connection_class.enabled cluster setting must be
turned on.
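
To make the gating concrete, here is a minimal Go sketch (hypothetical type and
function names, not the actual CockroachDB implementation) of a rangefeed
picking the dedicated connection class only when the cluster setting is on,
e.g. after an operator runs
`SET CLUSTER SETTING kv.rangefeed.use_dedicated_connection_class.enabled = true;`:

```go
// Hypothetical sketch: selecting a connection class for rangefeed streams.
package main

import "fmt"

// ConnectionClass identifies which pooled gRPC connection an RPC uses.
type ConnectionClass int

const (
	// DefaultClass uses the default (2MB) initial stream window.
	DefaultClass ConnectionClass = iota
	// RangefeedClass uses a smaller initial stream window (128KB by default).
	RangefeedClass
)

// classForRangefeed mirrors the cluster-setting check described above:
// rangefeeds only move to the dedicated class when the setting is on.
func classForRangefeed(useDedicatedClass bool) ConnectionClass {
	if useDedicatedClass {
		return RangefeedClass
	}
	return DefaultClass
}

func main() {
	fmt.Println(classForRangefeed(true), classForRangefeed(false)) // 1 0
}
```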

Another change in this PR is that the default RPC window size can be adjusted
via the COCKROACH_RPC_INITIAL_WINDOW_SIZE environment variable. The default for
this variable is kept at 2MB.

Changing either of those variables is an advanced operation and should not be
taken lightly, since it impacts all aspects of cockroach's performance
characteristics. Values larger than 2MB will be trimmed to 2MB. Setting either
of them to 64KB or below will turn on "dynamic window" gRPC window sizes.
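
The following is a minimal sketch of the clamping rules just described,
assuming the variables hold a byte count. `grpc.WithInitialWindowSize` and
`grpc.WithInitialConnWindowSize` are real grpc-go dial options; the helper
names and parsing here are illustrative assumptions, not CockroachDB's actual
code:

```go
// Illustrative sketch of the window-size clamping described above.
package main

import (
	"os"
	"strconv"

	"google.golang.org/grpc"
)

const (
	maxWindowSize  = 2 << 20  // values above 2MB are trimmed to 2MB
	dynamicMaxSize = 64 << 10 // at or below 64KB, gRPC's dynamic windows stay on
)

// windowSizeFromEnv reads an initial-window-size override in bytes, falling
// back to def when unset or unparsable, and trims anything above the 2MB cap.
func windowSizeFromEnv(name string, def int32) int32 {
	s, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	v, err := strconv.ParseInt(s, 10, 32)
	if err != nil {
		return def
	}
	if v > maxWindowSize {
		v = maxWindowSize
	}
	return int32(v)
}

// dialOptions turns a window size into grpc-go dial options. Leaving the
// window at or below 64KB keeps gRPC's dynamic (BDP-based) window sizing.
func dialOptions(windowSize int32) []grpc.DialOption {
	if windowSize <= dynamicMaxSize {
		return nil
	}
	return []grpc.DialOption{
		grpc.WithInitialWindowSize(windowSize),
		grpc.WithInitialConnWindowSize(windowSize),
	}
}

func main() {
	defSize := windowSizeFromEnv("COCKROACH_RPC_INITIAL_WINDOW_SIZE", 2<<20)            // 2MB default
	rfSize := windowSizeFromEnv("COCKROACH_RANGEFEED_RPC_INITIAL_WINDOW_SIZE", 128<<10) // 128KB default
	_, _ = dialOptions(defSize), dialOptions(rfSize)
}
```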

It should be noted that this change is a partial mitigation, and not a complete fix.
A full fix will require rework of rangefeed behavior, and will be done at a later time.

An alternative to introducing a dedicated connection class was to simply
lower the default connection window size. However, such a change would impact
all RPCs in the system, and it was deemed too risky at this point.
Substantial benchmarking is needed before making such a change.

Informs #74219

Release Notes (performance improvement): Allow rangefeed streams to use a
separate HTTP connection when the
kv.rangefeed.use_dedicated_connection_class.enabled setting is turned on. Using
a separate connection class reduces the possibility of OOMs when running
rangefeeds against very large tables. The connection window size for rangefeeds
can be adjusted via the COCKROACH_RANGEFEED_INITIAL_WINDOW_SIZE environment
variable, whose default is 128KB.

Record stats while discarding data.

Release Notes: None
@miretskiy miretskiy requested review from a team as code owners January 5, 2022 14:52
@miretskiy miretskiy requested review from stevendanna and removed request for a team January 5, 2022 14:52
@blathers-crl

blathers-crl bot commented Jan 5, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@erikgrinaker erikgrinaker left a comment

Modulo test failures.

@miretskiy miretskiy merged commit 3880f42 into cockroachdb:release-21.2 Jan 6, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Jan 26, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough
table, or perhaps not consuming rangefeed events quickly enough, may cause
undesirable memory blowup, up to OOMing the node. Since rangefeeds tend to run
on (almost) every node, a possibility exists that the entire cluster might get
destabilized.

One of the reasons for this is that the memory used by the low(er)-level HTTP/2
RPC transport is not currently being tracked -- nor do we have a mechanism to
add such tracking. cockroachdb#74456 provided partial mitigation for this
issue, where a single RangeFeed stream could be restricted to use less memory.
Nonetheless, the possibility of excessive memory usage remains since the number
of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `RangeFeedStream`.
In a uni-directional streaming RPC, the client establishes a connection to the
KV server and requests to receive events for a single span. A separate RPC
stream is created for each span.

In contrast, `RangeFeedStream` is a bi-directional RPC: the client is expected
to connect to the KV server and request as many spans as it wishes to receive
from that server. The server multiplexes all events onto a single stream, and
the client de-multiplexes those events appropriately on the receiving end.
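
The following is a minimal, self-contained Go sketch of the de-multiplexing
idea only; the types are hypothetical, and the real rangefeed protobuf messages
and client differ. Each event carries an ID identifying the logical per-span
feed it belongs to, so a single stream per node can serve many spans:

```go
// Hypothetical sketch: routing events from one shared stream to per-span feeds.
package main

import "fmt"

// muxEvent is a single event received on the shared bi-directional stream.
type muxEvent struct {
	StreamID int64  // identifies which per-span feed this event belongs to
	Payload  string // the rangefeed event itself (checkpoint, value, ...)
}

// demux forwards events from the single shared stream to per-span channels.
func demux(shared <-chan muxEvent, perSpan map[int64]chan string) {
	for ev := range shared {
		if out, ok := perSpan[ev.StreamID]; ok {
			out <- ev.Payload
		}
	}
}

func main() {
	shared := make(chan muxEvent, 4)
	perSpan := map[int64]chan string{
		1: make(chan string, 2), // span [a,b)
		2: make(chan string, 2), // span [b,c)
	}
	shared <- muxEvent{StreamID: 1, Payload: "checkpoint@10"}
	shared <- muxEvent{StreamID: 2, Payload: "value k=v"}
	close(shared)
	demux(shared, perSpan)
	fmt.Println(<-perSpan[1], <-perSpan[2]) // checkpoint@10 value k=v
}
```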

Note: just like in the uni-directional implementation, each call to the
`RangeFeedStream` method (i.e. a logical rangefeed established by the client)
is treated independently. That is, even though it is possible to multiplex all
logical rangefeed streams onto a single bi-directional stream to a single KV
node, this optimization is not currently implemented as it raises questions
about scheduling and fairness. If we need to further reduce the number of
bi-directional streams in the future, from
`number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an
optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional
stream reduces the number of such streams from `number_of_ranges` down to
`number_of_nodes` (per logical rangefeed). This, in turn, enables the client to
reserve the worst-case amount of memory before starting the rangefeed. This
amount of memory is bounded by the number of nodes in the cluster times the
per-stream maximum memory -- the default is `2MB`, but this can be changed via
the dedicated rangefeed connection class. The server side of the equation may
also benefit from knowing how many rangefeeds are running right now. Even
though the server is somewhat protected from memory blowup via HTTP/2 pushback,
we may want to also manage server-side memory better. Memory accounting for the
client (and possibly the server) will be added in follow-on PRs.
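
(Illustrative arithmetic: on a 10-node cluster with the default `2MB` window, a
logical rangefeed is bounded by roughly 10 × 2MB = 20MB of stream windows,
regardless of how many ranges it covers, rather than scaling with the range
count.)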

The new RPC is turned on by default since the intention is to retire the
uni-directional rangefeed in a later release. However, this default may be
disabled (and rangefeeds reverted to the old implementation) by setting the
`COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable to `false` and
restarting the cluster.

Release Notes (enterprise change): Introduce a bi-directional implementation of
rangefeeds called `RangeFeedStream`. Rangefeeds are used by multiple
subsystems, including changefeeds. The bi-directional implementation provides a
better mechanism to monitor memory used by rangefeeds. The rangefeed
implementation can be reverted to its old behavior by restarting the cluster
with the `COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable set to
`false`.
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Jan 26, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Jan 28, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Jan 28, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough
table, or perhaps not consuming rangefeed events quickly enough, may cause
undesirable memory blowup, up to OOMing the node. Since rangefeeds tend to run
on (almost) every node, a possibility exists that the entire cluster might get
destabilized.

One of the reasons for this is that the memory used by the low(er)-level HTTP/2
RPC transport is not currently being tracked -- nor do we have a mechanism to
add such tracking. cockroachdb#74456 provided partial mitigation for this
issue, where a single RangeFeed stream could be restricted to use less memory.
Nonetheless, the possibility of excessive memory usage remains since the number
of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `MuxRangeFeed`.
In a uni-directional streaming RPC, the client establishes a connection to the
KV server and requests to receive events for a single span. A separate RPC
stream is created for each span.

In contrast, `MuxRangeFeed` is a bi-directional RPC: the client is expected
to connect to the KV server and request as many spans as it wishes to receive
from that server. The server multiplexes all events onto a single stream, and
the client de-multiplexes those events appropriately on the receiving end.

Note: just like in the uni-directional implementation, each call to the
`MuxRangeFeed` method (i.e. a logical rangefeed established by the client)
is treated independently. That is, even though it is possible to multiplex all
logical rangefeed streams onto a single bi-directional stream to a single KV
node, this optimization is not currently implemented as it raises questions
about scheduling and fairness. If we need to further reduce the number of
bi-directional streams in the future, from
`number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an
optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional
stream reduces the number of such streams from `number_of_ranges` down to
`number_of_nodes` (per logical rangefeed). This, in turn, enables the client to
reserve the worst-case amount of memory before starting the rangefeed. This
amount of memory is bounded by the number of nodes in the cluster times the
per-stream maximum memory -- the default is `2MB`, but this can be changed via
the dedicated rangefeed connection class. The server side of the equation may
also benefit from knowing how many rangefeeds are running right now. Even
though the server is somewhat protected from memory blowup via HTTP/2 pushback,
we may want to also manage server-side memory better. Memory accounting for the
client (and possibly the server) will be added in follow-on PRs.

The new RPC is turned on by default since the intention is to retire the
uni-directional rangefeed in a later release. However, this default may be
disabled (and rangefeeds reverted to the old implementation) by setting the
`COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable to `false` and
restarting the cluster.

Release Notes (enterprise change): Rangefeeds (which are used internally, e.g.
by changefeeds) now use a common HTTP/2 stream per client for all range
replicas on a node, instead of one per replica. This significantly reduces
network buffer memory usage, which could cause nodes to run out of memory if a
client was slow to consume events. The rangefeed implementation can be reverted
to its old behavior by restarting the cluster with the
`COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable set to `false`.
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Jan 29, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Feb 1, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Feb 1, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 3, 2022
As demonstrated in cockroachdb#74219, running a rangefeed over a large enough
table, or perhaps not consuming rangefeed events quickly enough, may cause
undesirable memory blowup, up to OOMing the node. Since rangefeeds tend to run
on (almost) every node, a possibility exists that the entire cluster might get
destabilized.

One of the reasons for this is that the memory used by the low(er)-level HTTP/2
RPC transport is not currently being tracked -- nor do we have a mechanism to
add such tracking. cockroachdb#74456 provided partial mitigation for this
issue, where a single RangeFeed stream could be restricted to use less memory.
Nonetheless, the possibility of excessive memory usage remains since the number
of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `MuxRangeFeed`.
In a uni-directional streaming RPC, the client establishes a connection to the
KV server and requests to receive events for a single span. A separate RPC
stream is created for each span.

In contrast, `MuxRangeFeed` is a bi-directional RPC: the client is expected
to connect to the KV server and request as many spans as it wishes to receive
from that server. The server multiplexes all events onto a single stream, and
the client de-multiplexes those events appropriately on the receiving end.

Note: just like in the uni-directional implementation, each call to the
`MuxRangeFeed` method (i.e. a logical rangefeed established by the client)
is treated independently. That is, even though it is possible to multiplex all
logical rangefeed streams onto a single bi-directional stream to a single KV
node, this optimization is not currently implemented as it raises questions
about scheduling and fairness. If we need to further reduce the number of
bi-directional streams in the future, from
`number_logical_feeds*number_of_nodes` down to `number_of_nodes`, such an
optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional
stream reduces the number of such streams from `number_of_ranges` down to
`number_of_nodes` (per logical rangefeed). This, in turn, enables the client to
reserve the worst-case amount of memory before starting the rangefeed. This
amount of memory is bounded by the number of nodes in the cluster times the
per-stream maximum memory -- the default is `2MB`, but this can be changed via
the dedicated rangefeed connection class. The server side of the equation may
also benefit from knowing how many rangefeeds are running right now. Even
though the server is somewhat protected from memory blowup via HTTP/2 pushback,
we may want to also manage server-side memory better. Memory accounting for the
client (and possibly the server) will be added in follow-on PRs.

The use of the new RPC is contingent on callers providing the
`WithMuxRangefeed` option when starting the rangefeed. However, an environment
variable "kill switch", `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED`, exists to
disable the use of mux rangefeed in case serious issues are discovered.
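
A purely illustrative sketch of what such an opt-in option could look like in
the Go functional-options style; the actual rangefeed client API and option
plumbing in CockroachDB differ:

```go
// Hypothetical sketch: an opt-in option for the multiplexing rangefeed.
package main

import "fmt"

type rangeFeedConfig struct {
	useMuxRangeFeed bool
}

// Option configures a rangefeed at start time.
type Option func(*rangeFeedConfig)

// WithMuxRangefeed opts the caller in to the multiplexing MuxRangeFeed RPC.
func WithMuxRangefeed() Option {
	return func(cfg *rangeFeedConfig) { cfg.useMuxRangeFeed = true }
}

// startRangeFeed applies the provided options and returns the resulting config.
func startRangeFeed(opts ...Option) rangeFeedConfig {
	var cfg rangeFeedConfig
	for _, o := range opts {
		o(&cfg)
	}
	return cfg
}

func main() {
	fmt.Println(startRangeFeed(WithMuxRangefeed()).useMuxRangeFeed) // true
}
```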

Release Notes (enterprise change) Rangefeeds (which are used internally e.g. by changefeeds)
now use a common HTTP/2 stream per client for all range replicas on a node,
instead of one per replica. This significantly reduces the amount of network buffer
memory usage, which could cause nodes to run out of memory if a client was slow to consume events.
The rangefeed implementation can be revered back to its old implementation
via restarting the cluster with the `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED`
environment variable set to false.

Release note (enterprise change): Introduce a new Rangefeed RPC called
`MuxRangeFeed`. Rangefeeds may now use a common HTTP/2 stream per client for
all range replicas on a node, instead of one per replica. This significantly
reduces network buffer memory usage, which could cause nodes to run out of
memory if a client was slow to consume events. The caller may opt in to the new
mechanism by specifying the `WithMuxRangefeed` option when starting the
rangefeed. However, a cluster-wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED`
environment variable may be set to `false` to inhibit the use of this new RPC.
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 3, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 5, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 9, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 10, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Aug 16, 2022
craig bot pushed a commit that referenced this pull request Aug 18, 2022
75581: kv: Introduce MuxRangeFeed bi-directional RPC.  r=miretskiy a=miretskiy

As demonstrated in #74219, running a rangefeed over a large enough table, or
perhaps not consuming rangefeed events quickly enough, may cause undesirable
memory blow-up, including up to OOMing the node.  Since rangefeeds tend to
run on (almost) every node, a possibility exists that an entire cluster might
get destabilized.

One of the reasons for this is that the memory used by low(er) level http2
RPC transport is not currently being tracked -- nor do we have a mechanism
to add such tracking.  #74456 
provided partial mitigation for this issue, where a single RangeFeed stream could 
be restricted to use less memory. Nonetheless, the possibility of excessive memory 
usage remains since the number of rangefeed streams in the system could be very large.

This PR introduces a new bi-directional streaming RPC called `MuxRangeFeed`.
In a uni-directional streaming RPC, the client establishes a connection to the
KV server and requests to receive events for a single span.  A separate RPC stream
is created for each span.

In contrast, `MuxRangeFeed` is a bi-directional RPC: the client is expected to
connect to the kv server and request as many spans as it wishes to receive
from that server. The server multiplexes all events onto a single stream,
and the client de-multiplexes those events appropriately on the receiving
end.

Note: just like in the uni-directional implementation, each call to
the `MuxRangeFeed` method (i.e. a logical rangefeed established by the client)
is treated independently.  That is, even though it is possible to multiplex
all logical rangefeed streams onto a single bi-directional stream to a
single KV node, this optimization is not currently implemented, as it
raises questions about scheduling and fairness.  If we need to further
reduce the number of bi-directional streams in the future, from
`number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such
optimization can be added at a later time.

Multiplexing all of the spans hosted on a node onto a single bi-directional
stream reduces the number of such streams from `number_of_ranges` down to
`number_of_nodes` (per logical rangefeed).  This, in turn, enables the
client to reserve the worst-case amount of memory before starting the
rangefeed.  This amount of memory is bounded by the number of nodes in the
cluster times the per-stream maximum memory -- the default is `2MB`, but this
can be changed via the dedicated rangefeed connection class.  The server side
of the equation may also benefit from knowing how many rangefeeds are running
right now.  Even though the server is somewhat protected from memory blow-up
via http2 pushback, we may want to also manage server-side memory better.
Memory accounting for the client (and possibly the server) will be added in
follow-on PRs.

The use of the new RPC is contingent on callers providing the `WithMuxRangefeed`
option when starting the rangefeed.  However, an environment variable "kill
switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use
of mux rangefeed in case serious issues are discovered.

Release note (enterprise change): Introduce a new Rangefeed RPC called
`MuxRangeFeed`.  Rangefeeds now use a common HTTP/2 stream per client
for all range replicas on a node, instead of one per replica. This
significantly reduces the amount of network buffer memory usage, which could
cause nodes to run out of memory if a client was slow to consume events.
The caller may opt in to the new mechanism by specifying the `WithMuxRangefeed`
option when starting the rangefeed.  However, a cluster-wide
`COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to
`false` to inhibit the use of this new RPC.

Release Justification: Rangefeed scalability and stability improvement.  Safe to merge
since the functionality is disabled by default.

84683: obsservice: ingest events r=andreimatei a=andreimatei

This patch adds a few things:
1. an RPC endpoint to CRDB for streaming out events. This RPC service is
   called by the Obs Service.
2. a library in CRDB for exporting events.
3. code in the Obs Service for ingesting the events and writing them
   into a table in the sink cluster.

The first use of event exporting is for the system.eventlog events.
All events written to that table are now also exported. Once the Obs
Service takes hold in the future, I hope we can remove system.eventlog.
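
As a loose sketch of that export path (every identifier here is hypothetical;
the real interfaces live in the CRDB and Obs Service trees): the CRDB-side
library buffers events and ships batches over the streaming RPC, and the Obs
Service writes each batch into a table in the sink cluster.

```go
// event stands in for an exported event; the real payloads are OpenTelemetry
// log records.
type event struct {
	Type    string
	Payload []byte
}

// runExporter is a hypothetical CRDB-side loop: it batches queued events and
// hands each batch to send, which represents the streaming RPC to the Obs
// Service.
func runExporter(in <-chan event, send func(batch []event) error) {
	const batchSize = 64
	batch := make([]event, 0, batchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		_ = send(batch) // a real implementation would retry or account for drops
		batch = batch[:0]
	}
	for ev := range in {
		batch = append(batch, ev)
		if len(batch) == batchSize {
			flush()
		}
	}
	flush() // input closed: flush whatever remains
}
```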

The events are represented using OpenTelemetry protos. Unfortunately,
I've had to copy over the otel protos to our tree because I couldn't
figure out a vendoring solution. Problems encountered for vendoring are:
1. The repo where these protos live
(https://github.com/open-telemetry/opentelemetry-proto) is not
go-get-able. This means hacks are needed for our vendoring tools.
2. Some of the protos in that repo do not build with gogoproto (they
only build with protoc), because they use the new-ish "optional"
option on some fields. The logs protos that we use in this patch do not
have this problem, but others do (so we'll need to figure something out
in the future when dealing with the metrics proto).  FWIW, the
OpenTelemetry Collector ironically has the same problem (it uses
gogoproto), and it solved it through a sed that changes all the optional
fields to one-of's.
3. Even if you solve the first two problems, the next one is that we
already have a dependency on these compiled protos in our tree
(go.opentelemetry.io/proto/otlp). This repo contains generated code,
using protoc. We need it because of our use of the otlp trace exporter.
Bringing in the protos again, and building them with gogo, results in go
files that want to live in the same package/have the same import path.
So changing the import paths is needed.

Between all of these, I've given up - at least for the moment - and
decided to copy over to our tree the few protos that we actually need.
I'm also changing their import paths. You'll notice that there is a
script that codifies the process of transforming the needed protos from
their otel upstream.

Release note: None
Release justification: Non-production.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>