
Too many sst files of rocksdb. #206

Closed
kaikai1024 opened this issue Jan 30, 2019 · 16 comments

Comments

@kaikai1024
Contributor

Description

TBD

Steps to Reproduce

TBD

Expected behavior: TBD

Actual behavior: TBD

Reproduce how often: TBD

Versions

0.20.2

Additional Information

After CITA has been running for a long time, RocksDB generates a large number of sst files. On a test chain that ran for 3 months, the data size is 30 GB and there are nearly 2,000,000 sst files under the nosql directory. This causes the CPU usage of CITA-chain to grow continuously (the node occasionally stalls and stops producing blocks; it is unclear whether this is related). Inode usage is already at 30%; by rough estimate, a node started on a server with 4 cores, 8 GB RAM, and a 100 GB disk would exhaust its inodes within a year. Is there a way to limit the number of sst files generated? Also, instead of putting all sst files directly under the top-level nosql directory, they could be indexed into subdirectories by hash, with the sst files stored in those subdirectories.

@kaikai1024 kaikai1024 added enhancement New feature or request more description TBD help wanted Extra attention is needed labels Jan 30, 2019
@kaikai1024 kaikai1024 changed the title sst files of rocksdb are too many. Sst files of rocksdb are too many. Jan 30, 2019
@kaikai1024 kaikai1024 added the research Need to research label Jan 30, 2019
@kaikai1024 kaikai1024 changed the title Sst files of rocksdb are too many. Too many Sst files of rocksdb. Jan 30, 2019
@kaikai1024 kaikai1024 changed the title Too many Sst files of rocksdb. Too many sst files of rocksdb. Jan 30, 2019
@kaikai1024
Contributor Author

kaikai1024 commented Feb 15, 2019

@janx
Contributor

janx commented Feb 15, 2019

I think @zhangsoledad's suggestion is not to use SSD, but to tune CITA's RocksDB configuration/usage to match HDD.

@kaikai1024
Contributor Author

> I think @zhangsoledad's suggestion is not to use SSD, but to tune CITA's RocksDB configuration/usage to match HDD.

Thx.

@kaikai1024
Contributor Author

@rink1969 @jerry-yu @jiangxianliang007

Could you help paste the information summarized during the earlier investigation?

@kaikai1024
Contributor Author

@jerry-yu Please update the info here.
You can create a new issue if you find other problems.

@jerry-yu
Contributor

jerry-yu commented Feb 21, 2019

About the bug on the test chain: by capturing RabbitMQ packets, we found that MQ had already delivered the received executed_result message to chain, but chain's processing thread, via recv_timeout, reported that the message was never received. After debugging and googling, this turned out to be related to a bug in the standard library channel's recv_timeout function.
-> a pile of issues (rust-lang/rust#48460)

Fix: replace the standard library channel with crossbeam's channel; this requires changes to cita-common and other components.

Workaround: take a snapshot first. Testing shows that low-load nodes (ones that have taken a snapshot) lose only a few messages and recover quickly, while heavily loaded nodes keep losing messages for a longer period.
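For illustration, a minimal sketch of the proposed swap, using the crossbeam-channel crate in place of std::sync::mpsc (the message and structure here are made up for the example, not taken from CITA's code):

```rust
use std::time::Duration;

use crossbeam_channel::{unbounded, RecvTimeoutError};

fn main() {
    let (tx, rx) = unbounded();

    // Simulate MQ delivering a message to chain's processing thread.
    std::thread::spawn(move || {
        tx.send("executed_result").unwrap();
    });

    // crossbeam's recv_timeout reliably observes an already-sent message,
    // unlike the std channel affected by rust-lang/rust#48460.
    match rx.recv_timeout(Duration::from_millis(100)) {
        Ok(msg) => println!("got message: {}", msg),
        Err(RecvTimeoutError::Timeout) => println!("timed out"),
        Err(RecvTimeoutError::Disconnected) => println!("sender disconnected"),
    }
}
```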

@jerry-yu
Contributor

About tuning RocksDB: our current kv-db library exposes very few tunable parameters.
```rust
#[derive(Clone)]
pub struct DatabaseConfig {
    /// Max number of open files.
    pub max_open_files: i32,
    /// Cache sizes (in MiB) for specific columns.
    pub cache_sizes: HashMap<Option<u32>, usize>,
    /// Compaction profile
    pub compaction: CompactionProfile,
    /// Set number of columns
    pub columns: Option<u32>,
    /// Should we keep WAL enabled?
    pub wal: bool,
}

pub struct CompactionProfile {
    /// L0-L1 target file size
    pub initial_file_size: u64,
    /// L2-LN target file size multiplier
    pub file_size_multiplier: i32,
    /// rate limiter for background flushes and compactions, bytes/sec, if any
    pub write_rate_limit: Option<u64>,
}
```
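To make the limitation concrete, here is a rough sketch of how an HDD-oriented profile could be built from just these fields, assuming the structs above; the numbers are illustrative guesses, not tested recommendations:

```rust
use std::collections::HashMap;

// A sketch only: field values are illustrative, biased toward fewer,
// larger sst files and gentler background I/O on spinning disks.
fn hdd_config(columns: Option<u32>) -> DatabaseConfig {
    DatabaseConfig {
        max_open_files: 512,
        cache_sizes: HashMap::new(),
        compaction: CompactionProfile {
            // Larger L0-L1 target files mean fewer files per level.
            initial_file_size: 256 * 1024 * 1024, // 256 MiB
            // Let L2-LN target file sizes grow 2x per level.
            file_size_multiplier: 2,
            // Throttle background flushes/compactions so they do not
            // saturate the HDD.
            write_rate_limit: Some(16 * 1024 * 1024), // 16 MiB/s
        },
        columns,
        wal: true,
    }
}
```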

@classicalliu
Contributor

> About the bug on the test chain: by capturing RabbitMQ packets, we found that MQ had already delivered the received executed_result message to chain, but chain's processing thread, via recv_timeout, reported that the message was never received. After debugging and googling, this turned out to be related to a bug in the standard library channel's recv_timeout function.
> -> a pile of issues (rust-lang/rust#48460)
>
> Fix: replace the standard library channel with crossbeam's channel; this requires changes to cita-common and other components.

It's a new problem; opening a new issue would be better :)

@kaikai1024
Contributor Author

kaikai1024 commented Feb 21, 2019

> About the bug on the test chain: by capturing RabbitMQ packets, we found that MQ had already delivered the received executed_result message to chain, but chain's processing thread, via recv_timeout, reported that the message was never received. After debugging and googling, this turned out to be related to a bug in the standard library channel's recv_timeout function.
> -> a pile of issues (rust-lang/rust#48460)
>
> Fix: replace the standard library channel with crossbeam's channel; this requires changes to cita-common and other components.

So my understanding is that this has become a separate issue, and we then fix it following this solution? If so, could you please open a new issue and reference this one?

@jerry-yu
Contributor

Also: when doing the refactoring, take care to expose a kvdb interface; if it could also support a TiKV backend, that would relieve the disk pressure.

@yangby-cryptape
Contributor

yangby-cryptape commented Mar 5, 2019

Source Code

CITA uses RocksDB to store CurrentProof, CurrentHash, and CurrentHeight.

https://github.com/cryptape/cita/blob/dd3e0cea53b9f96c3bcd9d387d5eead7042646d1/cita-chain/types/src/extras.rs#L48-L76

Check Data in Test Environment

Today, I checked our test environment:

  • There are 2_712_488 files in the chain's rocksdb directory.
    The total size is 33 GiB.
  • 2_712_441 of them are sst files;
    1_314_306 of those sst files are 972 bytes each.
  • I checked a few 972-byte sst files at random; all of them were used for recording CurrentHeight.
    And half of the newly created files were 972 bytes.
  • Almost all files smaller than 1000 bytes are used to store CurrentHeight.
  • Almost all files between 1500 and 2500 bytes are used to store CurrentProof.

Comparison Test

Recently, I wrote a simple patch so that CITA does not store block data and current data in RocksDB.
Today, I ran a chain with three nodes: node-1 and node-2 use my patched version of CITA, and node-3 uses the original version.

At height 6000:

  • My patched version has only 1 sst file.
    The original version has more than 7000 sst files.
  • On the nodes running my patched version, the total data size is 59 MB.
    On the node running the original version, it is 124 MB.

Conclusion

DO NOT store current statuses in RocksDB.

I have not yet found the reason for these results.
If we care about that, I will need more time to study RocksDB.

But if we just want a way to reduce the number of sst files in RocksDB, this conclusion is enough.
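As an illustration of the direction (this is not the actual patch), the tiny, frequently overwritten "current" records could be persisted with a plain write-then-rename, so each overwrite reuses one inode instead of leaving small sst files behind:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Hypothetical helper, not CITA code: persist CurrentHeight outside
// RocksDB. Overwriting a single file keeps exactly one inode, while
// repeatedly overwriting a RocksDB key keeps generating small sst files.
fn store_current_height(dir: &Path, height: u64) -> std::io::Result<()> {
    let tmp = dir.join("current_height.tmp");
    let dst = dir.join("current_height");

    let mut f = File::create(&tmp)?;
    f.write_all(&height.to_le_bytes())?;
    f.sync_all()?; // ensure the data is on disk before the rename

    fs::rename(&tmp, &dst) // atomic replace on POSIX filesystems
}
```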

@rainchen
Member

rainchen commented Mar 5, 2019

RocksDB has an option called target_file_size_multiplier.

The official explanation:

> Q: What is options.target_file_size_multiplier useful for?
>
> A: It's a rarely used feature. For example, you can use it to reduce the number of the SST files.

And the default configuration is:

target_file_size_multiplier=1

i.e. the default value is 1.

Regarding this option, there is a discussion here:

facebook/rocksdb#3265 (comment)

> I think parameter target_file_size_multiplier was introduced in early experimentation with LevelDB and never actually used in production.

Since the default is already 1 and this is not a commonly used parameter, the large number of SST files may not be the real cause of the high CPU usage, because an ext4 partition supports a practically unlimited number of files.

If the high CPU usage is not caused by having too many files, we need profiling tools to find out which part is consuming the CPU.

I suggest enabling statistics first; without statistics we cannot analyze the real cause.

statistics is RocksDB's facility for collecting performance and throughput metrics. Enabling it provides direct observability data and makes it quick to find bottlenecks or understand the system's running state, because instrumentation points throughout the engine's operations update the statistics. Note that enabling statistics adds roughly 5% to 10% extra overhead.

Reference: https://xiking.win/2018/12/05/rocksdb-tuning/
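If we go down this road, a hedged sketch with the rust-rocksdb crate (not our own kvdb wrapper, which hides these options) might look like this; the values are illustrative:

```rust
use rocksdb::{Options, DB};

// Sketch: enable statistics for observability, and show where
// target_file_size_multiplier would be tuned. Illustrative values only.
fn open_with_stats(path: &str) -> Result<DB, rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Collect internal performance/throughput counters (~5-10% overhead)
    // and dump them into the LOG file every 60 seconds.
    opts.enable_statistics();
    opts.set_stats_dump_period_sec(60);

    // Default is 1; values > 1 make files in deeper levels larger,
    // reducing the total number of sst files.
    opts.set_target_file_size_multiplier(2);

    DB::open(&opts, path)
}
```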

@yangby-cryptape
Contributor

> Also, instead of putting all sst files directly under the top-level nosql directory, they could be indexed into subdirectories by hash, with the sst files stored in those subdirectories.

Benchmark: Deep directory structure vs. flat directory structure to store millions of files on ext4:

> Write is 44% faster using a flat directory structure instead of a deep/tree directory structure. Read is even 7.8x faster.
>
> In conclusion, just use a flat directory structure. It's easier to use. Faster in write. Much faster in read. Saves inodes. And doesn't need to pre-create or dynamically generate the branch folders.

@kaikai1024
Contributor Author

@jerry-yu Could you paste the solution here?

@kaikai1024
Contributor Author

Will release a patch version to fix it.
