Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executor exit with unknown reason while running performance test #588

Closed
leeyr338 opened this issue May 17, 2019 · 8 comments
Closed

Executor exit with unknown reason while running performance test #588

leeyr338 opened this issue May 17, 2019 · 8 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@leeyr338
Copy link
Contributor

Description

Executor exit with unknown reason while running performance test

Steps to Reproduce

Run CI performance test

Expected behavior: [What you expect to happen]
Everything Ok.

Actual behavior: [What actually happens]
Exit with unknown reason.

cita-executor.log:

2019-05-17 - 04:36:25 | core_executor::libex - 427   | INFO  - executor grow according to ClosedBlock(height: 2699, hash: 0x7915fb7d017795cd60120b042c2bf96a2c9055612f159e1485b5bb4df427513f, parent_hash: 0xce566e025c78afe65b74f931046984623583f1294d152b3e10d8bb78199969a7, timestamp: 1558038984439, state_root: 0x15cfba67bf3018176e2d0ebd88d52547227ca80dcb49d50aad534e96bebff465, transaction_root: 0x8acb1bd21bfb2ec7dbb809bd9110e5462fd151eef30583f84ca745e64a346754, proposer: 0x121633ae96eed9581d39a97fabb09e61976897db)
2019-05-17 - 04:36:28 | cita_executor::postm - 165   | INFO  - postman receive 2700-th ClosedBlock from executor
2019-05-17 - 04:36:28 | core_executor::libex - 427   | INFO  - executor grow according to ClosedBlock(height: 2700, hash: 0xcab6e42e4c46e3393df8f4fd7c8b14571fd0c444e080fad7f706da13ac91f540, parent_hash: 0x7915fb7d017795cd60120b042c2bf96a2c9055612f159e1485b5bb4df427513f, timestamp: 1558038987462, state_root: 0x8801217bf3ab941a7ba1d0be5c0171f70afb95e123f0b9f886fcd702f1b3f364, transaction_root: 0xb56b50339d85a4f2ba8d63776ce6842840816569a2302543eb211dcfce881a88, proposer: 0xd5af3de8b988cb592bce85e0c72890b76d41c367)
-------------------------------------------------------> Exit here!

2019-05-17 - 04:50:16 | cita_executor        - 172   | INFO  - CITA:executor
2019-05-17 - 04:50:16 | cita_executor        - 176   | INFO  - Version: v0.23.1-98-g88ac295-dev
2019-05-17 - 04:50:16 | cita_executor        - 177   | INFO  - Config: Options { prooftype: 2, grpc_port: 5000, journaldb_type: "archive", genesis_path: "./genesis.json", statedb_cache_size: 5242880, eth_compatibility: false }
2019-05-17 - 04:50:16 | amqp::session        - 196   | INFO  - Session initialized
2019-05-17 - 04:50:17 | amqp::session        - 196   | INFO  - Session initialized
2019-05-17 - 04:50:19 | core_executor::contr - 58    | INFO  - Use default quota price

Reproduce how often: [What percentage of the time does it reproduce?]
High probability.

Versions

Develop.

@leeyr338 leeyr338 added the bug Something isn't working label May 17, 2019
@leeyr338 leeyr338 assigned kaikai1024 and leeyr338 and unassigned kaikai1024 May 17, 2019
@leeyr338
Copy link
Contributor Author

leeyr338 commented May 20, 2019

测过测试,发现 cita-executor 模块在压测过程中内存出现持续的增长,到了一定的值后,被系统 kill 掉。

屏幕快照 2019-05-20 下午6 17 28

现在需确认是否存在内存泄漏的情况。

@leeyr338
Copy link
Contributor Author

当测试完成后,也就是交易量降为 0 后, executor 模块的内存使用量并没有下降。
屏幕快照 2019-05-20 下午10 15 34

从现象上看,其存在内存泄漏的风险。需进一步排查内存泄漏的问题。

@leeyr338
Copy link
Contributor Author

executor 被 kill 时的 dmesg 日志:
dmesg.log

@keroro520
Copy link
Contributor

cita-executor/src/, the main structure is Postman, should not use too much memory. I remember it will prune the staled historical state when processing a ClosedBlock completely, Backlogs::complete, Backlogs::prune. I think it highly unlikely memory leak here (Postman).

cita-executor/core/src/libexecutor, the main structure is Executor, is only a state-machine. It does not store any memory state inside itself but may store in StateDB(pub state_db: Arc<RwLock<StateDB>>).

Recommend to debug State and StateDB. Seems it has always use too much memory. I know less about it, you can ask others.

@leeyr338
Copy link
Contributor Author

问题分析

Postman 会将每次Executor 执行的结果缓存在 Backlogs 的 completed 结构中,然后在收到 Chain 的 RichStatus 后,将其中的缓存清空。其调用的接口为 prune:

pub fn prune(&mut self, height: u64) {
        // Importance guard: we must keep the executed result of the recent
        // 2 height(current_height - 1, current_height - 2), which used when
        // postman check arrived proof via `Postman::check_proof`
        if height + 2 < self.get_current_height() {
            self.completed = self.completed.split_off(&height);
        }
    }

为了使 Executor 能正常验证交易,缓存至少需要保存 [current_height, current_height - 1, current_height - 2] 3 个块的执行结果。

然而以上的代码逻辑并不能满足这个要求:

  1. 代码中的入参 height 指的是下一块高度;而
  2. self.get_current_height() 表示当前高度;
  3. 在正常情况下,height = self.get_current_height() + 1 , 条件永远不成立;也即
  4. 在正常情况下,Executor 不能够正常清理执行缓存,从而导致内存持续增大。

更改方案

pub fn prune(&mut self, height: u64) {
        // Importance guard: we must keep the executed result of the recent
        // 3 height(current_height, current_height - 1, current_height - 2),
        // which used when postman check arrived proof via `Postman::check_proof`
        if self.get_current_height() > 2 {
            let split_height = min(height, self.get_current_height() - 2);
            self.completed = self.completed.split_off(&split_height);
        }
    }

测试用例

#[test]
    fn test_update_rich_status_0() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(0);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 0, block_height[executor] = 5.
        // Expected: remove the block 0 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_some());
        assert!(postman.backlogs.get_completed_result(2).is_some());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    #[test]
    fn test_update_rich_status_1() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(1);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 1, block_height[executor] = 5.
        // Expected: remove the block 0, 1 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_none());
        assert!(postman.backlogs.get_completed_result(2).is_some());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    #[test]
    fn test_update_rich_status_2() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(2);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 2, block_height[executor] = 5.
        // Expected: remove the block 0, 1, 2 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_none());
        assert!(postman.backlogs.get_completed_result(2).is_none());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    #[test]
    fn test_update_rich_status_3() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(3);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 3, block_height[executor] = 5.
        // Expected: postman needs to keep at least 3 block in cache, so remove the block 0, 1, 2 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_none());
        assert!(postman.backlogs.get_completed_result(2).is_none());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    #[test]
    fn test_update_rich_status_4() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(4);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 3, block_height[executor] = 5.
        // Expected: postman needs to keep at least 3 block in cache, so remove the block 0, 1, 2 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_none());
        assert!(postman.backlogs.get_completed_result(2).is_none());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    #[test]
    fn test_update_rich_status_5() {
        let mut postman = helpers::generate_postman(5, Default::default());

        backlogs_prepare(&mut postman);

        let mut rich_status = RichStatus::new();

        rich_status.set_height(5);

        postman.update_by_rich_status(&rich_status);

        // block height in rich_status is from Chain, that means database's block height.
        // block height in postman, that means executed block height which cache in Executor.
        // Testcase 0: block_height[chain] = 3, block_height[executor] = 5.
        // Expected: postman needs to keep at least 3 block in cache, so remove the block 0, 1, 2 from cache, but not other block.
        assert!(postman.backlogs.get_completed_result(0).is_none());
        assert!(postman.backlogs.get_completed_result(1).is_none());
        assert!(postman.backlogs.get_completed_result(2).is_none());
        assert!(postman.backlogs.get_completed_result(3).is_some());
        assert!(postman.backlogs.get_completed_result(4).is_some());
        assert!(postman.backlogs.get_completed_result(5).is_some());
    }

    fn backlogs_prepare(postman: &mut Postman) {
        let execute_result_0 = generate_executed_result(0);
        let execute_result_1 = generate_executed_result(1);
        let execute_result_2 = generate_executed_result(2);
        let execute_result_3 = generate_executed_result(3);
        let execute_result_4 = generate_executed_result(4);
        let execute_result_5 = generate_executed_result(5);

        postman
            .backlogs
            .insert_completed_result(0, execute_result_0);
        postman
            .backlogs
            .insert_completed_result(1, execute_result_1);
        postman
            .backlogs
            .insert_completed_result(2, execute_result_2);
        postman
            .backlogs
            .insert_completed_result(3, execute_result_3);
        postman
            .backlogs
            .insert_completed_result(4, execute_result_4);
        postman
            .backlogs
            .insert_completed_result(5, execute_result_5);
    }

@kaikai1024
Copy link
Contributor

在正常情况下,height = self.get_current_height() + 1 , 条件永远不成立

"height + 2 < self.get_current_height() "

是 self.get_current_height() = height + 3 永远不成立?

@leeyr338
Copy link
Contributor Author

leeyr338 commented Jun 11, 2019

"height + 2 < self.get_current_height() "

是 self.get_current_height() = height + 3 永远不成立?

链如果正常执行,heigth (下一块高) 应该是 (当前高度 + 1) ,所以原来的条件:

if height + 2 < self.get_current_height() {}

是不成立的,也就是说在链正常运行的时候,缓存是没法清理的。

在非常特殊的情况,在链被意外删除后,才会出现条件 height + 2 < self.get_current_height() 成立。但这已经不符合我们的设计目标了。

@kaikai1024
Copy link
Contributor

ok,刚刚理解错了。
我以为 height = self.get_current_height() + 1 这个条件永远不成立

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants