Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(resource): deprecate auto_restart_interval in favor of health_check_interval #10910

Merged
merged 6 commits into from Jun 2, 2023

Conversation

thalesmg
Copy link
Contributor

@thalesmg thalesmg commented Jun 1, 2023

See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options

Current problem:

In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.

  • auto_restart_interval controls how often the resource attempts to transition from
    disconnected to connected.

  • health_check_interval controls how often the resource is checked and potentially moved
    from connected to disconnected or connecting.

The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers. Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state. For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).

Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.

Proposed solution:

We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks. With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.

Also, as two smaller changes to accompany this one:

  • Increase the default request_timeout from 15 s to 45 s.
  • Rename request_timeout to request_ttl.

Summary

🤖 Generated by Copilot at 34d1c9e

Refactor the bridge resource management and configuration options to improve consistency, clarity, and simplicity. Remove the auto_restart_interval option and feature from all bridge types, as it is redundant and problematic. Rename the request_timeout option to request_ttl in all bridge modules and test suites, to avoid naming conflicts and unify the meaning of the request time-to-live option. Bump the version numbers of the affected bridge applications to reflect the changes.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

…_check_interval`

See:
https://emqx.atlassian.net/wiki/spaces/P/pages/612368639/open+e5.1+remove+auto+restart+interval+from+buffer+worker+resource+options

Current problem:

In 5.0.x, we have two timer options that control the state changing of buffer worker
resources: auto_restart_interval and health_check_interval.

- auto_restart_interval controls how often the resource attempts to transition from
disconnected to connected.

- health_check_interval controls how often the resource is checked and potentially moved
from connected to disconnected or connecting.

The existence of two independent timers for very similar purposes is confusing to users,
QA and even developers.  Also, an intimately related configuration is request_timeout,
which can interact badly with auto_restart_interval if the latter is poorly configured:
requests may always expire if request_timeout < auto_restart_interval and if the resource
enters the disconnected state.  For health_check_interval, we attempt to derive a sane
default that gives requests a chance to retry (if request timeout is finite, then the
resource retries requests with a period of min(health_check_interval, request_timeout /
3).

Another problem with the separate auto_restart_interval is that its default value (60 s)
is too high when compared to the default request timeout and health check, leading to the
problems described above if not tuned.

Proposed solution:

We propose to drop auto_restart_interval in favor of health_check_interval, which will be
used for both disconnected -> connected and connected -> {disconnected, connecting}
transition checks.  With that, the resource will attempt to reconnect at the same interval
as the health check, which currently is 15 s.

Also, as two smaller changes to accompany this one:

- Increase the default request_timeout from 15 s to 45 s.
- Rename request_timeout to request_ttl.
@thalesmg thalesmg force-pushed the unify-restart-interval-v50 branch from 34d1c9e to e6cb0ee Compare June 1, 2023 14:21
@thalesmg thalesmg force-pushed the unify-restart-interval-v50 branch from e6cb0ee to 3e4790e Compare June 1, 2023 16:02
@thalesmg thalesmg marked this pull request as ready for review June 1, 2023 17:57
@thalesmg thalesmg requested a review from a team as a code owner June 1, 2023 17:57
@coveralls
Copy link
Collaborator

coveralls commented Jun 1, 2023

Pull Request Test Coverage Report for Build 5155352642

  • 16 of 16 (100.0%) changed or added relevant lines in 5 files are covered.
  • 431 unchanged lines in 42 files lost coverage.
  • Overall coverage decreased (-0.1%) to 81.618%

Files with Coverage Reduction New Missed Lines %
apps/emqx_bridge_oracle/src/emqx_bridge_oracle.erl 1 93.33%
apps/emqx_bridge_rocketmq/src/emqx_bridge_rocketmq_connector.erl 1 74.74%
apps/emqx_bridge_sqlserver/src/emqx_bridge_sqlserver_connector.erl 1 82.18%
apps/emqx_conf/src/emqx_conf_schema.erl 1 95.0%
apps/emqx_ft/src/emqx_ft_storage.erl 1 91.3%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_channel.erl 1 73.18%
apps/emqx_gateway/src/emqx_gateway_conf.erl 1 89.2%
apps/emqx_gateway/src/emqx_gateway_schema.erl 1 88.89%
apps/emqx_management/src/emqx_mgmt_app.erl 1 85.71%
apps/emqx_modules/src/emqx_delayed.erl 1 95.65%
Totals Coverage Status
Change from base Build 5144947922: -0.1%
Covered Lines: 28888
Relevant Lines: 35394

💛 - Coveralls

@thalesmg thalesmg requested a review from zmstone June 1, 2023 18:43
Copy link
Member

@zmstone zmstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a changelog to describe in detail the changed fields and default values.

apps/emqx_resource/src/schema/emqx_resource_schema.erl Outdated Show resolved Hide resolved
thalesmg and others added 2 commits June 2, 2023 09:08
Co-authored-by: Zaiming (Stone) Shi <zmstone@gmail.com>
@thalesmg
Copy link
Contributor Author

thalesmg commented Jun 2, 2023

please add a changelog to describe in detail the changed fields and default values.

✔️

@thalesmg thalesmg merged commit 33aa879 into emqx:master Jun 2, 2023
165 of 166 checks passed
@thalesmg thalesmg deleted the unify-restart-interval-v50 branch June 2, 2023 19:20
@zmstone zmstone mentioned this pull request Jun 9, 2023
8 tasks
@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

Enhancements

  • The data bridge resource option auto_restart_interval was deprecated in favor of health_check_interval, and request_timeout was renamed to request_ttl. Also, the default request_ttl value went from 15 seconds to 45 seconds.

    The previous existence of both auto_restart_interval and health_check_interval was a source of confusion, as both parameters influenced the recovery of data bridges under failures. An inconsistent configuration of those two parameters could lead to messages being expired without a chance to retry. Now, health_check_interval is used both to control the periodicity of health checks that may transition the data bridge into disconnected or connecting states, as well as recovering from disconnected.

@yanzhiemq
Copy link
Collaborator

yanzhiemq commented Jun 13, 2023

增强

  • 数据桥接资源选项 auto_restart_interval 已被弃用,改为使用 health_check_interval,而 request_timeout 则被重命名为 request_ttl。此外,默认的 request_ttl 值从 15 秒增加到了 45 秒。

    之前同时存在 auto_restart_intervalhealth_check_interval 会导致混淆,因为这两个参数都会影响数据桥接在故障下的恢复。这两个参数的配置不一致可能导致消息过期而无法重试。现在,health_check_interval 用于控制定期进行健康检查的频率,这可能会将数据桥接转换为 disconnectedconnecting 状态,并从 disconnected 状态中进行恢复。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants