Skip to content

CircuitBreaker: fix race condition, adjust reset policy#701

Merged
jamesge merged 6 commits into
apache:masterfrom
TousakaRin:circuit_breaker
Jun 17, 2019
Merged

CircuitBreaker: fix race condition, adjust reset policy#701
jamesge merged 6 commits into
apache:masterfrom
TousakaRin:circuit_breaker

Conversation

@TousakaRin
Copy link
Copy Markdown
Contributor

  1. fix sp->circuit_breaker->Reset() 时可能存在的data race
  2. 调整CircuitBreaker::Reset的策略,Reset之后,如果已经经过了初始化阶段,则继续使用之前ema_latency。如果Reset的时候还处于初始化阶段,则丢弃之前积累的数据

@TousakaRin TousakaRin changed the title fix data race, adjust reset policy of CircuitBreaker CircuitBreaker: fix race condition, adjust reset policy May 7, 2019
Comment thread src/brpc/circuit_breaker.cpp Outdated
if (_is_first_call_after_revived.load(butil::memory_order_relaxed) &&
_is_first_call_after_revived.exchange(false, butil::memory_order_relaxed)) {
_last_revived_time_ms.store(butil::cpuwide_time_ms(), butil::memory_order_relaxed);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个变量在CircuitBreaker::Reset中只赋值一次有什么问题吗,为什么还需要在这里赋值一次呢,这个“ _last_revived_time_ms“在OnCallEnd里赋值似乎语义也不太对,直觉上应该是在revived的时候立刻赋值。这里你想做的是想消除一次rpc的延时带来的误差,我反而觉得可以算进去才符合现在的语义

Copy link
Copy Markdown
Contributor Author

@TousakaRin TousakaRin May 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不是为了消除一次rpc的延迟带来的误差,而是为了去掉健康检查带来的误差。之前的流程是这样:

  1. OnCallEnd返回false,触发SetFailed
  2. WaitAndReset之后开始健康检查,触发Revive,Revive之后调用Reset

但是由于Revive之后立即就会有流量,有小概率导致立即又被熔断(由于rpc的时间一般都比较长,等他们调用OnCallEnd的时候,Reset一般已经完成了)
这个MR修复了这个问题,把Reset()的调用从"Revived之后" 挪到了 "WaitAndReset之后"。 所以要在收到第一个请求之后修正下Revive的时间。

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不是很理解,在reset移动到”WaitAndReset之后“,之后会立刻调Socket::Revive,然后就会有流量了。这里需要修正什么?不修正会造成什么问题呢

Copy link
Copy Markdown
Contributor Author

@TousakaRin TousakaRin May 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

之前放在Socket::Revive 之后,当Revive执行之后,socket就认为是正常的并且会有流量,然而假如这个时候CircuitBreaker::Reset还没有被调用,那么这些请求返回之后调用的OnCallEnd会返回false,于是又会触发SetFailed。

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的意思是_last_revived_time_ms在这里修正合适么,对之后的维护者可能会有点难于理解。既然是“_last_revived_time”,就应该在上一次revived的时候就赋值吧,这里的名字和修正地方有点反直觉,严格来说也是不准确的,这个值等于真实revived_time + first rpc latency?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是有些反直觉。CircuitBreaker::Reset应该放在Socket::Reset之后,那么要在Revived之后立即重置_last_revived_time的话,那再增加一个接口MarkAsRevived?看着也有点诡异

, _short_window(FLAGS_circuit_breaker_short_window_size,
FLAGS_circuit_breaker_short_window_error_percent)
, _last_reset_time_ms(butil::cpuwide_time_ms())
, _last_reset_time_ms(0)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我理解第一次被熔断时,isolation duration 应该等于min_isolation_duration,所以初始化的值改成了0。

@jamesge jamesge merged commit cecb578 into apache:master Jun 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants