Flow-control rules are inaccurate under high concurrency #1620

kldwz · 2020-07-19T11:56:36Z

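First, some background: the demos shipped with the sentinel-1.6.3 source code test quite stably, which led me to believe Sentinel itself was fairly stable. Only when I was about to introduce it into my own system and stress-tested it under high concurrency did I find that the rate limiting is very unstable.

Let's start with a basic demo, FlowQpsDemo, which I modified slightly as shown below: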

```java
public void run() {
    while (!stop) {
        Entry entry = null;

        try {
            entry = SphU.entry(KEY);
            // Token acquired, which means the request passed.

            // Simulate a business operation that takes less than 50 ms.
            Random random2 = new Random();
            try {
                TimeUnit.MILLISECONDS.sleep(random2.nextInt(50));
            } catch (InterruptedException e) {
                // ignore
            }

            pass.addAndGet(1);
        } catch (BlockException e1) {
            block.incrementAndGet();
        } catch (Exception e2) {
            // biz exception
        } finally {
            total.incrementAndGet();
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```

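Originally, this 50 ms pause sat outside the `finally` block that calls `entry.exit()`, so the sleep ran only after the whole flow-control sequence had completed. Tested that way, there is no problem and the limiting is very stable.

But once the test is run with my modification, problems appear. The results are as follows: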
```
1595158591552, total:833752, pass:50, block:833702
98 send qps is: 1039153
1595158592743, total:1039153, pass:26, block:1039128
97 send qps is: 978964
1595158594133, total:978964, pass:20, block:978942
96 send qps is: 784002
1595158594879, total:784002, pass:25, block:783977
```

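Given the usage pattern that Sentinel itself documents, my modification is perfectly legitimate, so why is the difference so large?
After analyzing the code carefully, I found that the problem lies in how the limiter handles concurrency.
Here is the DefaultController.canPass() method:

```java
public boolean canPass(Node node, int acquireCount, boolean prioritized) {
    int curCount = avgUsedTokens(node);
    // If the current QPS plus acquireCount exceeds the rule's count, return false; otherwise return true.
    if (curCount + acquireCount > count) {
        ...
        }
        return false;
    }
    return true;
}
```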

The avgUsedTokens() method:

```java
private int avgUsedTokens(Node node) {
    if (node == null) {
        return DEFAULT_AVG_USED_TOKENS;
    }
    // Get the current QPS or the current thread count, depending on the rule grade.
    return grade == RuleConstant.FLOW_GRADE_THREAD ? node.curThreadNum() : (int)(node.passQps());
}
```

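The problem with this code is that if multiple threads (say 100) reach `if (curCount + acquireCount > count)` concurrently, nothing such as a lock protects the check, so all 100 threads return true and the flow control fails.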
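The check-then-count gap described above can be reproduced outside Sentinel. The following is a minimal sketch, not Sentinel code: the class `CheckThenActDemo` and its method name are hypothetical, a `LongAdder` stands in for the node's pass counter, and a `CountDownLatch` deliberately forces the worst case in which every thread finishes its check before any thread records its pass.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class for illustration; not part of Sentinel.
class CheckThenActDemo {

    /**
     * Runs `threads` workers against a limit of `limit` and returns how many were admitted.
     * The latch makes every worker complete its check before any worker updates the counter.
     */
    static int admittedWithSeparateCheckAndCount(int threads, int limit) {
        LongAdder passCount = new LongAdder();      // stands in for the node's pass counter
        AtomicInteger admitted = new AtomicInteger();
        CountDownLatch allChecked = new CountDownLatch(threads);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                // Step 1: the check, analogous to canPass(): read the counter and compare.
                boolean canPass = passCount.sum() + 1 <= limit;
                allChecked.countDown();
                try {
                    allChecked.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Step 2: the count, analogous to addPassRequest(): it runs only after the check.
                if (canPass) {
                    passCount.increment();
                    admitted.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return admitted.get();
    }

    public static void main(String[] args) {
        // Every thread reads 0 before anyone increments, so all 100 pass a limit of 20.
        System.out.println("limit=20, admitted=" + admittedWithSeparateCheckAndCount(100, 20));
    }
}
```

In a real workload the interleaving is not forced like this, so the overshoot is usually smaller than the full thread count, but nothing in the check itself prevents it.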

The action that finally increments the node's QPS count happens in StatisticSlot:

```java
public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
                  boolean prioritized, Object... args) throws Throwable {
    try {
        // Do some checking.
        fireEntry(context, resourceWrapper, node, count, prioritized, args);

        // Request passed, add thread count and pass count.
        node.increaseThreadNum();
        // Only now is the QPS incremented, but DefaultController.canPass() has already returned true.
        node.addPassRequest(count);
        // ... (rest of the method omitted in the original post)
```

The root cause is that DefaultController applies no concurrency control when it counts.


cdfive · 2020-07-19T13:48:41Z

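Very detailed testing, nice 👍

> But once the test is run with my modification, problems appear. The results are as follows:
> 1595158591552, total:833752, pass:50, block:833702 98 send qps is: 1039153 1595158592743, total:1039153, pass:26, block:1039128 97 send qps is: 978964 1595158594133, total:978964, pass:20, block:978942 96 send qps is: 784002 1595158594879, total:784002, pass:25, block:783977

Is this the data from the first few seconds? What happens after that? I tried this modification locally, and in the later seconds pass stays stable at around 20, the same result as with the sleep placed after the finally block.

> If multiple threads (say 100) reach if (curCount + acquireCount > count) concurrently, nothing such as a lock protects the check, so all 100 threads return true and the flow control fails.

The lack of locking here is probably for performance reasons: a trade-off made where both performance and accuracy remain acceptable.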

yunfeiyanggzq · 2020-07-20T11:22:35Z

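> Very detailed testing, nice 👍
>
> > But once the test is run with my modification, problems appear. The results are as follows:
> > 1595158591552, total:833752, pass:50, block:833702 98 send qps is: 1039153 1595158592743, total:1039153, pass:26, block:1039128 97 send qps is: 978964 1595158594133, total:978964, pass:20, block:978942 96 send qps is: 784002 1595158594879, total:784002, pass:25, block:783977
>
> Is this the data from the first few seconds? What happens after that? I tried this modification locally, and in the later seconds pass stays stable at around 20, the same result as with the sleep placed after the finally block.
>
> > If multiple threads (say 100) reach if (curCount + acquireCount > count) concurrently, nothing such as a lock protects the check, so all 100 threads return true and the flow control fails.
>
> The lack of locking here is probably for performance reasons: a trade-off made where both performance and accuracy remain acceptable.

I think this is a bug, because the same code is also used for thread-count flow control, and there the statistics end up inaccurate. @cdfive @sczyh30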

yunfeiyanggzq · 2020-07-20T11:41:51Z

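It turns out the design here is intentionally not strictly precise; it trades accuracy against performance. That said, we can consider how to improve the accuracy.
Some earlier community discussion: #76 @kldwz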

15200291066 · 2020-09-27T07:53:51Z

fudali113 · 2020-11-17T07:01:38Z

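If we want to improve this, could we use an approach like the following?

What we mainly need to control is the concurrency (thread count) and the pass QPS.

For concurrency:

Stop storing it in a LongAdder and use an AtomicLong instead, because an AtomicLong gives us the actual value after the add (in a concurrency-safe way), and then keep that value in a ThreadLocal. I believe the performance loss of going from LongAdder to AtomicLong is negligible in real business workloads.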
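As a rough illustration of the AtomicLong idea for concurrency, here is a minimal sketch. It is not Sentinel's API: `AtomicConcurrencyLimiter`, `tryEnter` and `exit` are hypothetical names, and the ThreadLocal storage mentioned above is left out. The point is only that the admission decision is made on the value returned by the atomic add itself.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Hypothetical class for illustration; not part of Sentinel.
class AtomicConcurrencyLimiter {
    private final long maxConcurrency;
    private final AtomicLong inFlight = new AtomicLong();

    AtomicConcurrencyLimiter(long maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
    }

    /** Reserves a slot first, then judges on the value the atomic add returned. */
    boolean tryEnter() {
        long afterAdd = inFlight.incrementAndGet();
        if (afterAdd > maxConcurrency) {
            inFlight.decrementAndGet(); // over the limit: roll the reservation back
            return false;
        }
        return true;
    }

    /** Releases a slot; call exactly once for every successful tryEnter(). */
    void exit() {
        inFlight.decrementAndGet();
    }

    long current() {
        return inFlight.get();
    }

    public static void main(String[] args) {
        AtomicConcurrencyLimiter limiter = new AtomicConcurrencyLimiter(20);
        // 100 concurrent attempts, none of which releases its slot.
        long admitted = IntStream.range(0, 100).parallel()
                .filter(i -> limiter.tryEnter())
                .count();
        System.out.println("admitted=" + admitted); // prints admitted=20
    }
}
```

One consequence of reserving before judging is that a rejected caller briefly inflates the counter until its rollback. That never admits more than the limit, but it can reject a request that would have fit had the rollback landed first.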

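For pass QPS:

A similar approach works: before certain operations we perform a concurrency-safe add and obtain the resulting value, much like the snapshot-at-the-beginning idea. As long as we can safely obtain the value during that initial concurrent operation, we can take a per-thread snapshot and use it for the later concurrency check.

For QPS, a time window may be involved. An add operation modifies the current window, and at that point the values in the windows before it can no longer change. After operating on the current window, we snapshot it into the current thread (only one window needs to be snapshotted; the windows before it can simply be referenced, so the extra storage is small).

The values stored in a window may also need to change. For example, we currently store pass; to suit concurrent handling we could store total and perform the add before the check, then derive passQps at check time from the snapshotted window's total minus block. (Could the time gap between the add and the actual check have an effect? This may need deeper thought.)

This looks like it should yield a concurrency-safe limiter: all later checks are based on our snapshot, and taking the snapshot is itself concurrency-safe, so the snapshot's safety underpins the limiter's.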
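To make the "add before the check" idea for QPS concrete, here is a heavily simplified sketch. It assumes a single fixed window rather than Sentinel's sliding window, takes the clock as a parameter so the behavior is reproducible, and uses hypothetical names throughout (`ReserveThenJudgeQpsLimiter`, `Window`, `tryPass`). Each window stores the total number of requests that reserved a slot; a request passes only if the value returned by its own add is within the limit, so the number of passes in a window is the total capped at the limit.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical class for illustration; Sentinel uses a sliding window, not this fixed window.
class ReserveThenJudgeQpsLimiter {

    /** One fixed window: its start time and the total number of requests that reserved in it. */
    private static final class Window {
        final long startMillis;
        final AtomicLong total = new AtomicLong();

        Window(long startMillis) {
            this.startMillis = startMillis;
        }
    }

    private final long limitPerWindow;
    private final long windowMillis;
    private final AtomicReference<Window> current;

    ReserveThenJudgeQpsLimiter(long limitPerWindow, long windowMillis, long nowMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
        this.current = new AtomicReference<>(new Window(nowMillis));
    }

    boolean tryPass(long nowMillis) {
        Window w = current.get();
        while (nowMillis - w.startMillis >= windowMillis) {
            // Window expired: try to install a fresh one, then reload whichever window won.
            current.compareAndSet(w, new Window(nowMillis));
            w = current.get();
        }
        // Add first, then judge on the value that this thread's own add returned.
        return w.total.incrementAndGet() <= limitPerWindow;
    }

    public static void main(String[] args) {
        ReserveThenJudgeQpsLimiter limiter = new ReserveThenJudgeQpsLimiter(20, 1000, 0);
        int firstWindow = 0;
        int secondWindow = 0;
        for (int i = 0; i < 100; i++) {
            if (limiter.tryPass(500)) firstWindow++;
        }
        for (int i = 0; i < 100; i++) {
            if (limiter.tryPass(1000)) secondWindow++;
        }
        // prints first=20, second=20
        System.out.println("first=" + firstWindow + ", second=" + secondWindow);
    }
}
```

The value returned by `incrementAndGet()` plays the role of the per-thread snapshot here. The open question raised above, about the gap between the add and the check, does not arise in this toy version because both happen in one expression; in a sliding-window design with multiple buckets it would still need to be worked out.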

@sczyh30 @jasonjoo2010 What do you think?


cdfive added the area/flow-control Issues or PRs related to flow control label Jul 19, 2020

yunfeiyanggzq mentioned this issue Jul 31, 2020

[docs] ASoC 2020 mid-term summary (cluster concurrency flow control) #1639

Open

yunfeiyanggzq mentioned this issue Aug 25, 2020

[Discussion]How to design the local thread flow control. alibaba/sentinel-golang#211

Open

sczyh30 mentioned this issue Oct 9, 2020

Limiting traffic based on the maximum number of threads is not accurate #1785

Closed

sczyh30 mentioned this issue Nov 16, 2020

Sentinel concurrency limiting results do not match expectations #1861

Open

sczyh30 mentioned this issue Dec 18, 2020

Concurrency limiting is not strictly accurate #1900

Closed

