
flashcache usage guide

Author

digoal

Date

2014-07-04

Tags

PostgreSQL , Linux , flashcache , ssd , zfs


Background

A few days ago I wrote an article on using flashcache to improve PostgreSQL IOPS performance:

http://blog.163.com/digoal@126/blog/static/1638770402014528115551323/

This article covers the things to watch out for when using flashcache, so that you can use it better.

1. Kernel compatibility. flashcache has so far been tested on Linux kernels 2.6.18 through 2.6.38 and works there; on other kernels it is not recommended.
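A quick sanity check before building and loading the module — a minimal sketch (the module name flashcache is what the project builds; everything else is standard tooling):

```bash
# Verify the running kernel is inside the tested 2.6.18 - 2.6.38 range
uname -r

# After building/installing flashcache, load the module and confirm it is present
modprobe flashcache
lsmod | grep flashcache
```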

2. Choosing a cache mode. flashcache currently supports 3 modes:

Writethrough - safest, all writes are cached to ssd but also written to disk  
immediately.  If your ssd has slower write performance than your disk (likely  
for early generation SSDs purchased in 2008-2010), this may limit your system   
write performance.  All disk reads are cached (tunable).    

Writes go to the SSD (flashcache device) and to the disk (backing device) at the same time.

Reads: all data read is also loaded into the SSD (flashcache device); this is tunable via the sysctl dev.flashcache.<cachedev>.cache_all.

Writearound - again, very safe, writes are not written to ssd but directly to  
disk.  Disk blocks will only be cached after they are read.  All disk reads  
are cached (tunable).  

Writes go straight to the disk (backing device) and are not written to the SSD (flashcache device).

Reads: all data read is also loaded into the SSD (flashcache device); this is tunable via the sysctl dev.flashcache.<cachedev>.cache_all.

Writeback - fastest but less safe.  Writes only go to the ssd initially, and  
based on various policies are written to disk later.  All disk reads are  
cached (tunable).    

Writes go to the SSD (flashcache device) first and are written to the disk (backing device) asynchronously.

Reads: all data read is also loaded into the SSD (flashcache device); this is tunable via the sysctl dev.flashcache.<cachedev>.cache_all.

For sequential writes, an ordinary SSD and an ordinary 15K RPM disk do not differ all that much. If the plain disk actually performs better, writearound is the more economical choice; in the common case the three modes perform about the same.

For random writes, an SSD is far better than an ordinary disk, so writeback is a very good fit.

Optimizing sequential writes is covered later.
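To ground that choice in numbers, it helps to measure sequential write throughput on both devices first. A minimal sketch, assuming the SSD and the disk are mounted at the hypothetical scratch paths /mnt/ssd and /mnt/disk:

```bash
# Sequential write throughput of the SSD (direct IO bypasses the page cache)
dd if=/dev/zero of=/mnt/ssd/ddtest bs=1M count=1024 oflag=direct

# The same test against the mechanical disk / RAID volume
dd if=/dev/zero of=/mnt/disk/ddtest bs=1M count=1024 oflag=direct

# Clean up the test files
rm -f /mnt/ssd/ddtest /mnt/disk/ddtest
```

If the disk side wins, writearound (or the skip_seq_thresh_kb sysctl discussed in section 15) is worth considering.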

3. Cache persistence.

Only writeback persists the cache on the SSD (flashcache device): since it writes to disk asynchronously, the cached dirty data must survive and cannot be lost.

For writethrough and writearound, the cached data is gone after a reboot or a device remove, and this does not affect data consistency.

4. Known bugs:

https://github.com/facebook/flashcache/issues

5. Managing the cachedev block device: use the dmsetup command, or the 3 wrapper commands that ship with flashcache.

5.1 Creating a cache dev device

flashcache_create, flashcache_load and flashcache_destroy.   
These utilities use dmsetup internally, presenting a simpler interface to create,   
load and destroy flashcache volumes.  
It is expected that the majority of users can use these utilities instead of using dmsetup.  
  
flashcache_create : Create a new flashcache volume.  
  
# flashcache_create   
Usage: flashcache_create [-v] [-p back|thru|around] [-b block size] [-m md block size] [-s cache size] [-a associativity] cachedev ssd_devname disk_devname  
Usage : flashcache_create Cache Mode back|thru|around is required argument  
Usage : flashcache_create Default units for -b, -m, -s are sectors, or specify in k/M/G. Default associativity is 512.  
  
-v : verbose.  
-p : cache mode (writeback/writethrough/writearound).  
-s : cache size. Optional. If this is not specified, the entire ssd device  
     is used as cache. The default units is sectors. But you can specify   
     k/m/g as units as well.  
-b : block size. Optional. Defaults to 4KB. Must be a power of 2. (Recommended: match the sector size of the SSD (flashcache) device.)  
     The default units is sectors. But you can specify k as units as well.  
     (A 4KB blocksize is the correct choice for the vast majority of   
     applications. But see the section "Cache Blocksize selection" below).  
-f : force create. by pass checks (eg for ssd sectorsize).  
  
Examples :  
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb  
Creates a 1GB writeback cache volume with a 4KB block size on ssd   
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device   
created is "cachedev".  
  
flashcache_create -p thru -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb  
Same as above but creates a write through cache with units specified in   
sectors instead. The name of the device created is "cachedev".  

Note: specify -s (the cache size), otherwise the entire SSD or SSD partition is used; a sizing sketch follows.
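A minimal sketch for picking -s from an existing SSD partition. blockdev is standard util-linux; the device names /dev/sda1, /dev/sdc3 and the name cachedev1 are taken from the examples later in this article:

```bash
# Size of the SSD partition, in 512-byte sectors
blockdev --getsize /dev/sda1

# Use an explicit cache size (units default to sectors; k/M/G also accepted)
flashcache_create -p back -s 10g -b 4k cachedev1 /dev/sda1 /dev/sdc3
```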

-b sets the cache dev block size, -m the cache dev metadata block size.

Guidelines for choosing the cache block size and the metadata block size:

Cache Blocksize selection : (recommendation: match the underlying SSD's sector size)  
=========================  
Cache blocksize selection is critical for good cache utilization and performance.   
A 4KB cache blocksize is the right choice for the vast majority of workloads (and filesystems).  
  
Cache Metadata Blocksize selection : (recommendation: match the underlying SSD's sector size)  
==================================  
This section only applies to the writeback cache mode; only writeback stores metadata blocks.  
Writethrough and writearound modes store no cache metadata at all.  
  
In Flashcache version 1, the metadata blocksize was fixed at 1 (512b) sector.  
Flashcache version 2 removes this limitation. In version 2, we can configure   
a larger flashcache metadata blocksize.   
Version 2 maintains backwards compatibility for caches created with Version 1.   
For these cases, a metadata blocksize of 512 will continue to be used.  
  
flashcache_create -m can be used to optionally configure the metadata blocksize.  
Defaults to 4KB.   
  
Ideal choices for the metadata blocksize are 4KB (default) or 8KB. There is   
little benefit to choosing a metadata blocksize greater than 8KB. The choice   
of metadata blocksize is subject to the following rules :  

Rules for choosing the metadata blocksize:

1) Metadata blocksize must be a power of 2.  
2) Metadata blocksize cannot be smaller than sector size configured on the   
ssd device.  
3) A single metadata block cannot contain metadata for 2 cache sets.   
In other words,   
with the default associativity of 512 (with each cache metadata slot sizing at 16 bytes),   
the entire metadata for a given set fits in 8KB (512*16b).  
For an associativity of 512, we cannot configure a metadata blocksize greater than 8KB.  
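A quick check of rule 3 with the default associativity — a sketch reusing the hypothetical devices from above:

```bash
# Per-set metadata = associativity * 16 bytes; with the default 512 slots:
echo $((512 * 16))   # 8192 bytes, so -m can be at most 8k

# A writeback cache with 4KB cache blocks and 8KB metadata blocks
flashcache_create -p back -b 4k -m 8k -s 10g cachedev1 /dev/sda1 /dev/sdc3
```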

Advantages of choosing a large metadata blocksize:

Advantages of choosing a larger (than 512b) metadata blocksize :  
- Allows the ssd to be configured to larger sectors. For example, some ssds  
  allow choosing a 4KB sector, often a more performant choice.  
- Allows flashcache to do better batching of metadata updates, potentially  
  reducing metadata updates and small ssd writes, reducing write amplification  
  and increasing ssd lifetime. (Fewer SSD write operations mean a longer SSD life.)  
Thanks due to Earle Philhower of Virident for this feature !  

5.2 Loading an existing writeback cache dev device.

Use flashcache_load to load an existing writeback flashcache device.

A reload is needed after every reboot; chkconfig can be used to manage automatic loading (see section 8).

writearound and writethrough do not need loading (as explained above, these two cache modes are not persisted on the SSD; a reboot discards them).

flashcache_load : Load an existing writeback cache volume.    
flashcache_load ssd_devname [cachedev_name]  
  
Example :  
flashcache_load /dev/sdc  
Load the existing writeback cache on /dev/sdc, using the virtual cachedev_name from when the device was created.   
If you're upgrading from an older flashcache device format that didn't store the cachedev name internally, or you want to change the cachedev name used, you can specify it as an optional second argument to flashcache_load.  
  
For writethrough and writearound caches flashcache_load is not needed; flashcache_create   
should be used each time.  
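A minimal reload-and-mount sketch for use after a reboot, reusing the device names from the init-script example in section 8 (your devices and mountpoint will differ):

```bash
# Reload the existing writeback cache; the cachedev name is read from
# the cache superblock on the SSD
flashcache_load /dev/sda1

# The device mapper volume reappears and can be mounted as usual
mount /dev/mapper/cachedev1 /opt
```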

5.3 Destroying a flashcache dev.

Destroying the flashcache device of a writeback setup is quite dangerous: all data in the flashcache is discarded (the documentation does not say whether dirty data is written back first).

For a writeback flashcache device this is not recommended. If you want to remove it, use dmsetup to remove the cache dev instead, because dmsetup automatically writes the dirty data back to disk.

flashcache_destroy : Destroy an existing writeback flashcache. All data will be lost !!!  
  
flashcache_destroy ssd_devname  
  
Example :  
flashcache_destroy /dev/sdc  
Destroy the existing cache on /dev/sdc. All data is lost !!!  
For writethrough and writearound caches this is not necessary.  

6. Removing a flashcache cache dev device (i.e. the device mapper device).

For a writeback cache dev, dirty data is automatically written back to disk before the device is removed.

Removing a flashcache volume :  
============================  
Use dmsetup remove to remove a flashcache volume. For writeback   
cache mode,  the default behavior on a remove is to clean all dirty   
cache blocks to disk. The remove will not return until all blocks   
are cleaned. Progress on disk cleaning is reported on the console   
(also see the "fast_remove" flashcache sysctl).  
  
A reboot of the node will also result in all dirty cache blocks being  
cleaned synchronously   
(again see the note about "fast_remove" in the sysctls section).  
  
For writethrough and writearound caches, the device removal or reboot  
results in the cache being destroyed.  However, there is no harm in  
doing a 'dmsetup remove' to tidy up before boot, and indeed  
this will be needed if you ever need to unload the flashcache kernel  
module (for example to load a new version into a running system).  
  
Example:  
dmsetup remove cachedev  
  
This removes the flashcache volume named cachedev, cleaning  
all blocks prior to removal.   

If the fast_remove option is set to 1, dirty data is not synced to disk on removal. This is very dangerous and not recommended.

dev.flashcache.<cachedev>.fast_remove = 0  
	Don't sync dirty blocks when removing cache. On a reload  
	both DIRTY and CLEAN blocks persist in the cache. This   
	option can be used to do a quick cache remove.   
	CAUTION: The cache still has uncommitted (to disk) dirty  
	blocks after a fast_remove.  
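A sketch of a quick remove followed by a reload — per the sysctl description above, both DIRTY and CLEAN blocks persist across the reload, but the dirty blocks remain uncommitted to disk in the meantime (cache name sda1+sdc3 and devices as in this article's examples):

```bash
# Skip the (potentially long) cleaning pass on remove
sysctl -w dev.flashcache.sda1+sdc3.fast_remove=1
dmsetup remove cachedev1

# ... later: reload the cache; dirty blocks are still on the SSD
flashcache_load /dev/sda1
```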

7. Viewing flashcache cache dev statistics, with dmsetup status or dmsetup table.

Cache Stats :  
===========  
Use 'dmsetup status' for cache statistics.  
  
'dmsetup table' also dumps a number of cache related statistics.  
  
Examples :  
dmsetup status cachedev  
dmsetup table cachedev  

Or read the device's status files directly:

Flashcache errors are reported in   
/proc/flashcache/<cache name>/flashcache_errors  
  
Flashcache stats are also reported in   
/proc/flashcache/<cache name>/flashcache_stats  
for easier parseability.  

For example:

[root@db-172-16-3-150 sda1+sdc3]# dmsetup table cachedev1  
0 207254565 flashcache conf:  
        ssd dev (/dev/sda1), disk dev (/dev/sdc3) cache mode(WRITE_BACK)  
        capacity(10216M), associativity(512), data block size(8K) metadata block size(4096b)  
        disk assoc(256K)  
        skip sequential thresh(0K)  
        total blocks(1307648), cached blocks(0), cache percent(0)  
        dirty blocks(0), dirty percent(0)  
        nr_queued(0)  
Size Hist: 512:2660 1024:851 2048:832 4096:8159776 8192:317   
  
[root@db-172-16-3-150 sda1+sdc3]# dmsetup status cachedev1  
0 207254565 flashcache stats:   
        reads(2477), writes(6)  
        read hits(0), read hit percent(0)  
        write hits(0) write hit percent(0)  
        dirty write hits(0) dirty write hit percent(0)  
        replacement(0), write replacement(0)  
        write invalidates(0), read invalidates(0)  
        pending enqueues(0), pending inval(0)  
        metadata dirties(0), metadata cleans(0)  
        metadata batch(0) metadata ssd writes(0)  
        cleanings(0) fallow cleanings(0)  
        no room(0) front merge(0) back merge(0)  
        force_clean_block(0)  
        disk reads(2477), disk writes(6) ssd reads(0) ssd writes(0)  
        uncached reads(2477), uncached writes(6), uncached IO requeue(0)  
        disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)  
        uncached sequential reads(0), uncached sequential writes(0)  
        pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)  
        lru hot blocks(653824), lru warm blocks(653824)  
        lru promotions(0), lru demotions(0)  

Or inspect the device status files directly:

[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_  
flashcache_errors       flashcache_iosize_hist  flashcache_pidlists     flashcache_stats          

Error counters:

[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_errors   
disk_read_errors=0 disk_write_errors=0 ssd_read_errors=0 ssd_write_errors=0 memory_alloc_errors=0  

Process whitelist and blacklist: per-PID lists controlling which processes' IO uses the flashcache device. The pid sysctls tune these lists; entries are added via ioctls (see the caching-controls section, item 11).

[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_pidlists   
Blacklist:   
Whitelist:   

IO size histogram:

[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_iosize_hist   
512:2660 1024:851 1536:0 2048:832 2560:0 3072:0 3584:0 4096:8159776 4608:0 5120:0 5632:0 6144:0 6656:0 7168:0 7680:0 8192:317 8704:0 9216:0 9728:0 10240:0 10752:0 11264:0 11776:0 12288:0 12800:0 13312:0 13824:0 14336:0 14848:0 15360:0 15872:0 16384:0   

Statistics:

[root@db-172-16-3-150 sda1+sdc3]# cat /proc/flashcache/sda1+sdc3/flashcache_stats   
reads=2477 writes=6   
read_hits=0 read_hit_percent=0 write_hits=0 write_hit_percent=0 dirty_write_hits=0 dirty_write_hit_percent=0 replacement=0 write_replacement=0 write_invalidates=0 read_invalidates=0 pending_enqueues=0 pending_inval=0 metadata_dirties=0 metadata_cleans=0 metadata_batch=0 metadata_ssd_writes=0 cleanings=0 fallow_cleanings=0 no_room=0 front_merge=0 back_merge=0 disk_reads=2477 disk_writes=6 ssd_reads=0 ssd_writes=0 uncached_reads=2477 uncached_writes=6 uncached_IO_requeue=0 uncached_sequential_reads=0 uncached_sequential_writes=0 pid_adds=0 pid_dels=0 pid_drops=0 pid_expiry=0  

8. Autostart script for RedHat or CentOS; for the script itself see https://github.com/facebook/flashcache/blob/master/utils/flashcache

At boot it automatically loads the flashcache module, creates/loads the cache dev, and mounts it.

At shutdown it automatically removes the device mapper block dev. (Note: if the cache dev is not removed at shutdown, the shutdown may fail.)

A few variables must be configured in the script: SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME

Note that it currently supports automatic load/unload of only one cachedev.

Using Flashcache sysVinit script (Redhat based systems):  
=======================================================  
Kindly note that this section only applies to Redhat based systems. Use  
'utils/flashcache' from the repository as the sysvinit script.  
  
This script is to load, unload and get statistics of an existing flashcache   
writeback cache volume. It helps in loading the already created cachedev during   
system boot and removes the flashcache volume before system halt happens.  
  
This script is necessary, because, when a flashcache volume is not removed   
before the system halt, kernel panic occurs.  
  
Configuring the script using chkconfig:  
  
1. Copy 'utils/flashcache' from the repo to '/etc/init.d/flashcache'  
2. Make sure this file has execute permissions,  
   'sudo chmod +x /etc/init.d/flashcache'.  
3. Edit this file and specify the values for the following variables  
   SSD_DISK, BACKEND_DISK, CACHEDEV_NAME, MOUNTPOINT, FLASHCACHE_NAME  
4. Modify the headers in the file if necessary.  
   By default, it starts in runlevel 3, with start-stop priority 90-10  
5. Register this file using chkconfig  
   'chkconfig --add /etc/init.d/flashcache'  

For example:

[root@db-172-16-3-150 ~]# cp /opt/soft_bak/flashcache/flashcache-master/utils/flashcache /etc/init.d/  
[root@db-172-16-3-150 ~]# chmod 755 /etc/init.d/flashcache   
[root@db-172-16-3-150 ~]# vi /etc/init.d/flashcache  
SSD_DISK=/dev/sda1  
BACKEND_DISK=/dev/sdc3  
CACHEDEV_NAME=cachedev1  
MOUNTPOINT=/opt  
FLASHCACHE_NAME=sda1+sdc3  
[root@db-172-16-3-150 ~]# service flashcache start  
Starting Flashcache...  
[root@db-172-16-3-150 ~]# df -h  
/dev/mapper/cachedev1  
                       98G   51G   42G  55% /opt  
  
[root@db-172-16-3-150 ~]# service flashcache status  
Flashcache status: loaded  
0 207254565 flashcache stats:   
        reads(1598), writes(1)  
        read hits(0), read hit percent(0)  
        write hits(0) write hit percent(0)  
        dirty write hits(0) dirty write hit percent(0)  
        replacement(0), write replacement(0)  
        write invalidates(0), read invalidates(0)  
        pending enqueues(0), pending inval(0)  
        metadata dirties(0), metadata cleans(0)  
        metadata batch(0) metadata ssd writes(0)  
        cleanings(0) fallow cleanings(0)  
        no room(0) front merge(0) back merge(0)  
        force_clean_block(0)  
        disk reads(1598), disk writes(1) ssd reads(0) ssd writes(0)  
        uncached reads(1598), uncached writes(1), uncached IO requeue(0)  
        disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)  
        uncached sequential reads(0), uncached sequential writes(0)  
        pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)  
        lru hot blocks(6144), lru warm blocks(6144)  
        lru promotions(0), lru demotions(0)  
  
[root@db-172-16-3-150 ~]# service flashcache stop  
dev.flashcache.sda1+sdc3.fast_remove = 0  
Flushing flashcache: Flushes to /dev/sdc3  

9. flashcache module parameter (sysctl) settings; they are configured per ssd devname+disk devname pair:

FlashCache Sysctls :  
==================  
Flashcache sysctls operate on a per-cache device basis. A couple of examples  
first.  
  
Sysctls for a writearound or writethrough mode cache :  
cache device /dev/ram3, disk device /dev/ram4  
  
dev.flashcache.ram3+ram4.cache_all = 1  
dev.flashcache.ram3+ram4.zero_stats = 0  
dev.flashcache.ram3+ram4.reclaim_policy = 0  
dev.flashcache.ram3+ram4.pid_expiry_secs = 60  
dev.flashcache.ram3+ram4.max_pids = 100  
dev.flashcache.ram3+ram4.do_pid_expiry = 0  
dev.flashcache.ram3+ram4.io_latency_hist = 0  
dev.flashcache.ram3+ram4.skip_seq_thresh_kb = 0  
  
Sysctls for a writeback mode cache :  
cache device /dev/sdb, disk device /dev/cciss/c0d2  
  
dev.flashcache.sdb+c0d2.fallow_delay = 900  
dev.flashcache.sdb+c0d2.fallow_clean_speed = 2  
dev.flashcache.sdb+c0d2.cache_all = 1  
dev.flashcache.sdb+c0d2.fast_remove = 0  
dev.flashcache.sdb+c0d2.zero_stats = 0  
dev.flashcache.sdb+c0d2.reclaim_policy = 0  
dev.flashcache.sdb+c0d2.pid_expiry_secs = 60  
dev.flashcache.sdb+c0d2.max_pids = 100  
dev.flashcache.sdb+c0d2.do_pid_expiry = 0  
dev.flashcache.sdb+c0d2.max_clean_ios_set = 2  
dev.flashcache.sdb+c0d2.max_clean_ios_total = 4  
dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20  
dev.flashcache.sdb+c0d2.stop_sync = 0  
dev.flashcache.sdb+c0d2.do_sync = 0  
dev.flashcache.sdb+c0d2.io_latency_hist = 0  
dev.flashcache.sdb+c0d2.skip_seq_thresh_kb = 0  
  
Sysctls common to all cache modes :  
  
dev.flashcache.<cachedev>.cache_all:  
	Global caching mode to cache everything or cache nothing.  
	See section on Caching Controls. Defaults to "cache everything". Whether to cache everything or nothing (further controllable via the process-ID white/black lists): to use the whitelist, set cache_all=0; to use the blacklist, set cache_all=1.  
dev.flashcache.<cachedev>.zero_stats:  
	Zero stats (once).  
dev.flashcache.<cachedev>.reclaim_policy: (cache reclaim policy; can be adjusted at runtime)  
	FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at   
	runtime.  
dev.flashcache.<cachedev>.io_latency_hist:  (whether to build an IO latency histogram; this has a sizeable performance impact on machines with a slow clocksource)   
	Compute IO latencies and plot these out on a histogram.  
	The scale is 250 usecs. This is disabled by default since   
	internally flashcache uses gettimeofday() to compute latency  
	and this can get expensive depending on the clocksource used.  
	Setting this to 1 enables computation of IO latencies.  
	The IO latency histogram is appended to 'dmsetup status'.  
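A small sketch of how these common sysctls are typically used around a benchmark run (cache name sda1+sdc3 as in this article's examples):

```bash
# Reset the counters once, before the benchmark
sysctl -w dev.flashcache.sda1+sdc3.zero_stats=1

# Optionally collect an IO latency histogram during the run
# (costly on slow clocksources - turn it back off afterwards)
sysctl -w dev.flashcache.sda1+sdc3.io_latency_hist=1
dmsetup status cachedev1
sysctl -w dev.flashcache.sda1+sdc3.io_latency_hist=0
```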

The following are not recommended to tune:

(There is little reason to tune these)  
dev.flashcache.<cachedev>.max_pids:  
	Maximum number of pids in the white/black lists.  
dev.flashcache.<cachedev>.do_pid_expiry:  
	Enable expiry on the list of pids in the white/black lists.  
dev.flashcache.<cachedev>.pid_expiry_secs:  
	Set the expiry on the pid white/black lists.  
  
dev.flashcache.<cachedev>.skip_seq_thresh_kb:  (Somewhat like the ARC design in ZFS: skip caching sequential IO scans, e.g. full scans of large database tables, which you may not want loaded into the cache. Detection happens after the fact, so an IO stream must first reach this size before further caching of it is skipped; the initial part of a sequential stream, before the skip triggers, is still written to the SSD. Pick the threshold from the sequential IO capability of the mechanical disks behind the cache dev, e.g. 100MB.)  
	Skip (don't cache) sequential IO larger than this number (in kb).  
	0 (default) means cache all IO, both sequential and random.  
	Sequential IO can only be determined 'after the fact', so  
	this much of each sequential I/O will be cached before we skip   
	the rest.  Does not affect searching for IO in an existing cache.  

The following settings are allowed only in writeback mode:

Sysctls for writeback mode only :  
dev.flashcache.<cachedev>.fallow_delay = 900   (dirty cached data with no reads or writes for this many seconds is written back to disk)  
	In seconds. Clean dirty blocks that have been "idle" (not   
	read or written) for fallow_delay seconds. Default is 15  
	minutes.    
	Setting this to 0 disables idle cleaning completely.  
dev.flashcache.<cachedev>.fallow_clean_speed = 2  
	The maximum number of "fallow clean" disk writes per set   
	per second. Defaults to 2.  
dev.flashcache.<cachedev>.fast_remove = 0  (whether dirty data is written back to the disk before the device mapper device is removed)  
	Don't sync dirty blocks when removing cache. On a reload  
	both DIRTY and CLEAN blocks persist in the cache. This   
	option can be used to do a quick cache remove.   
	CAUTION: The cache still has uncommitted (to disk) dirty  
	blocks after a fast_remove.  
dev.flashcache.<cachedev>.dirty_thresh_pct = 20   (allowed percentage of dirty data)  
	Flashcache will attempt to keep the dirty blocks in each set   
	under this %. A lower dirty threshold increases disk writes,   
	and reduces block overwrites, but increases the blocks  
	available for read caching.  
dev.flashcache.<cachedev>.stop_sync = 0    (stop an in-progress sync)  
	Stop the sync in progress.  
dev.flashcache.<cachedev>.do_sync = 0    (run a sync: write dirty data back to disk)  
	Schedule cleaning of all dirty blocks in the cache.   
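A sketch of forcing a full writeback before planned maintenance, using the cache name from this article's examples:

```bash
# Schedule cleaning of all dirty blocks; watch the 'dirty blocks'
# counter drop in 'dmsetup table cachedev1'
sysctl -w dev.flashcache.sda1+sdc3.do_sync=1

# Abort the sync if necessary
sysctl -w dev.flashcache.sda1+sdc3.stop_sync=1
```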

The following are not recommended to tune:

(There is little reason to tune these)  
dev.flashcache.<cachedev>.max_clean_ios_set = 2  
	Maximum writes that can be issues per set when cleaning  
	blocks.  
dev.flashcache.<cachedev>.max_clean_ios_total = 4  
	Maximum writes that can be issued when syncing all blocks.  

10. Managing the cache device directly with dmsetup. The flashcache_xxx wrappers cover the same ground, so dmsetup is usually not needed.

Using dmsetup to create and load flashcache volumes :  
===================================================  
Few users will need to use dmsetup natively to create and load   
flashcache volumes. This section covers that.  
  
dmsetup create device_name table_file  
  
where  
  
device_name: name of the flashcache device being created or loaded.  
table_file : other cache args (format below). If this is omitted, dmsetup   
	     attempts to read this from stdin.  
  
table_file format :  
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <dm virtual name> <cache mode> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]  
  
cache mode:  
	   1: Write Back  
	   2: Write Through  
	   3: Write Around  
  
flashcache cmd:   
	   1: load existing cache  
	   2: create cache  
	   3: force create cache (overwriting existing cache). USE WITH CAUTION  
  
blksize in sectors:  
	   4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.  
	   See note on block size selection below.  
	   Unused (can be omitted) for cache loads.  
  
size of cache in sectors:  
	   Optional. if size is not specified, the entire ssd device is used as  
	   cache. Needs to be a power of 2.  
	   Unused (can be omitted) for cache loads.  
  
cache set size:  
	   Optional. The default set size is 512, which works well for most   
	   applications. Little reason to change this. Needs to be a  
	   power of 2.  
	   Unused (can be omitted) for cache loads.  
  
Example :  
  
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 2 8 522000000 | dmsetup create cachedev  
  
This creates a writeback cache device called "cachedev" (/dev/mapper/cachedev)  
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.  
The size of the cache is 522000000 sectors.  
  
(TODO : Change loading of the cache happen via "dmsetup load" instead  
of "dmsetup create").  

11. Caching controls: the whitelist and the blacklist.

Caching Controls  
================  
Flashcache can be put in one of 2 modes - Cache Everything or   
Cache Nothing (dev.flashcache.cache_all). The default is to "cache  
everything".  
  
These 2 modes have a blacklist and a whitelist.  
  
The tgid (thread group id) for a group of pthreads can be used as a  
shorthand to tag all threads in an application. The tgid for a pthread  
is returned by getpid() and the pid of the individual thread is  
returned by gettid().  

The tgid and the per-thread pid are obtained with getpid() and gettid() respectively; you can experiment with systemtap. See:

https://sourceware.org/systemtap/documentation.html

http://blog.163.com/digoal@126/blog/#m=0&t=1&c=fks_084068084086080075085082085095085080082075083081086071084

The algorithm works as follows :  
  
In "cache everything" mode, 缓存所有, 先查黑名单(不缓存), 再查白名单(缓存), 最后达到连续IO限制的话跳过缓存.  
1) If the pid of the process issuing the IO is in the blacklist, do  
not cache the IO. ELSE,  
2) If the tgid is in the blacklist, don't cache this IO. UNLESS  
3) The particular pid is marked as an exception (and entered in the  
whitelist, which makes the IO cacheable).  
4) Finally, even if IO is cacheable up to this point, skip sequential IO   
if configured by the sysctl.  
  
Conversely, in "cache nothing" mode, 不缓存任何, 先查白名单(缓存), 再查黑名单(不换成), 最后达到连续IO限制的话跳过缓存.  
1) If the pid of the process issuing the IO is in the whitelist,  
cache the IO. ELSE,  
2) If the tgid is in the whitelist, cache this IO. UNLESS  
3) The particular pid is marked as an exception (and entered in the  
blacklist, which makes the IO non-cacheable).  
4) Anything whitelisted is cached, regardless of sequential or random  
IO.  
  
Examples :  
--------  
1) You can make the global cache setting "cache nothing", and add the  
tgid of your pthreaded application to the whitelist. Which makes only  
IOs issued by your application cacheable by Flashcache.   
2) You can make the global cache setting "cache everything" and add  
tgids (or pids) of other applications that may issue IOs on this  
volume to the blacklist, which will make those un-interesting IOs not  
cacheable.   
  
Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,  
kswapd would also do the writes, with flashcache caching those. (I.e. this control is only effective for O_DIRECT IO requests.)  
  
The following cacheability ioctls are supported on /dev/mapper/<cachedev>   
  
FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.  
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.  
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to  
cleanup if a process dies.  
  
FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.  
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.  
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to  
cleanup if a process dies.  
  
/proc/flashcache_pidlists shows the list of pids on the whitelist and the blacklist.  
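A sketch of the "cache nothing + whitelist" pattern from example 1 above. Switching the global mode is a plain sysctl; adding a tgid to the whitelist has to be done with the FLASHCACHEADDWHITELIST ioctl on the cache device from (or on behalf of) the application, since as far as I know no command-line helper ships with flashcache:

```bash
# Global mode: cache nothing; only whitelisted pids/tgids are cached
sysctl -w dev.flashcache.sda1+sdc3.cache_all=0

# The application issues FLASHCACHEADDWHITELIST on /dev/mapper/cachedev1
# for its tgid; the resulting lists can be verified with:
cat /proc/flashcache/sda1+sdc3/flashcache_pidlists
```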

12. Cache security: a user process may be able to corrupt data on the cached volume while holding only read permission.

The current workaround is to tighten permissions, withholding even read access from other users.

Security Note :  
=============  
With Flashcache, it is possible for a malicious user process to   
corrupt data in files with only read access. In a future revision  
of flashcache, this will be addressed (with an extra data copy).  
  
Not documenting the mechanics of how a malicious process could   
corrupt data here.  
  
You can work around this by setting file permissions on files in   
the flashcache volume appropriately.  
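A sketch of that permission workaround, assuming the cache volume is mounted at /opt as in the earlier examples and that the hypothetical directory /opt/data holds the files to protect:

```bash
# Withhold all access (even read) from other users on the
# flashcache-backed volume
chmod -R o-rwx /opt/data
```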

13. The problem of low SSD utilization.

SSD sets map to HDD blocks one-to-many, i.e. multiple HDD blocks may compete for the same SSD cache set.

If the blocks competing for a given SSD cache set are all blocks that need caching, while the blocks that never contend need no caching, the worst case occurs and utilization is extremely low. An example:

Why is my cache only (<< 100%) utilized ?  
=======================================  
(Answer contributed by Will Smith)  
  
- There is essentially a  1:many mapping between SSD blocks and HDD blocks.  
- In more detail, a HDD block gets hashed to a set on SSD which contains by   
  default 512 blocks.  It can only be stored in that set on SSD, nowhere else.  
  
So with a simplified SSD containing only 3 sets:  
  
SSD = 1 2 3 , and a HDD with 9 sets worth of data, the HDD sets would map to the SSD   
sets like this:  
  
HDD: 1 2 3 4 5 6 7 8 9  
SSD: 1 2 3 1 2 3 1 2 3  
  
So if your data only happens to live in HDD sets 1 and 4, they will compete for   
SSD set 1 and your SSD will at most become 33% utilized.  

HDD sets 1 and 4 are both stored in SSD set 1. If HDD sets 1 and 4 both need caching while the other HDD sets (2,3,5,6,7,8,9) hold inactive data that does not, then in the worst case only 33% of the SSD is in use. To raise utilization, the XFS filesystem's agsize and agcount parameters can be tuned to mitigate this.

If you use XFS you can tune the XFS agsize/agcount to try and mitigate this   
(described next section).  

14. XFS filesystem tuning, to deal with low cache utilization.

Tune xfs's allocation group parameters agsize and agcount to improve SSD utilization.

Tuning XFS for better flashcache performance :  
============================================  
If you run XFS/Flashcache, it is worth tuning XFS' allocation group  
parameters (agsize/agcount) to achieve better flashcache performance.  
XFS allocates blocks for files in a given directory in a new  (XFS can thus spread the files of one directory across multiple allocation groups, spreading their data blocks)  
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),  
we can achieve much better distribution of blocks across  
flashcache. Better distribution of blocks across flashcache will  
decrease collisions on flashcache sets considerably, increase cache  
hit rates significantly and result in lower IO latencies. (Spreading the blocks reduces flashcache set collisions, the cause described in section 13.)  
  
We can achieve this by computing agsize (and implicitly agcount) using  
these equations,  

The formulas:

C = Cache size,   
V = Size of filesystem Volume.  
  
agsize % C = (1/agcount)*C  
agsize * agcount ~= V  
  
where agsize <= 1000g (XFS limits on agsize).  
  
A couple of examples that illustrate the formula,  
  
For agcount = 4, let's divide up the cache into 4 equal parts (each  
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One  
ideal way to map the allocation groups onto the cache is as follows.  

An ideal striped correspondence between the allocation groups and the cache: each stripe is offset, giving a good cache distribution and fewer collisions over cache sets.

Ag1	     Ag2	Ag3	       Ag4  
--	     --		--	       --  
C1	     C2		C3	       C4	(stripe 1)  
C2	     C3		C4	       C1	(stripe 2)  
C3	     C4		C1	       C2	(stripe 3)  
C4	     C1		C2	       C3	(stripe 4)  
C1	     C2		C3	       C4	(stripe 5)  
  
In this simple example, note that each "stripe" has 2 properties  
1) Each element of the stripe is a unique part of the cache.  
2) The union of all the parts for a stripe gives us the entire cache.  
  
Clearly, this is an ideal mapping, from a distribution across the  
cache point of view.  
  
Another example, this time with agcount = 5, the cache is divided into  
5 equal parts C1, .. C5.  
  
Ag1	     Ag2	Ag3	       Ag4	Ag5  
--	     --		--	       --	--  
C1	     C2		C3	       C4	C5	(stripe 1)  
C2	     C3		C4	       C5	C1	(stripe 2)  
C3	     C4		C5	       C1	C2	(stripe 3)  
C4	     C5		C1	       C2	C3	(stripe 4)  
C5	     C1		C2	       C3	C4	(stripe 5)  
C1	     C2		C3	       C4	C5	(stripe 6)  
  
A couple of examples that compute the optimal agsize for a given  
Cachesize and Filesystem volume size.  
  
a) C = 600g, V = 3.5TB  
Consider agcount = 5  
  
agsize % 600 = (1/5)*600  
agsize % 600 = 120  
  
So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)  
  
b) C = 150g, V = 3.5TB  
Consider agcount=4  
  
agsize % 150 = (1/4)*150  
agsize % 150 = 37.5  
  
So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)  
  
As an alternative,   
  
agsize % C = (1 - (1/agcount))*C  
agsize * agcount ~= V  
  
Works just as well as the formula above.  

If you'd rather not do the computation yourself, try the get_agsize utility that ships with flashcache.

This computation has been implemented in the utils/get_agsize utility.  

Specify agsize or agcount when creating the filesystem with mkfs.xfs.

man mkfs.xfs  
       -d data_section_options  
              These  options  specify  the location, size, and other parameters of the data section of the filesystem.  
              The valid data_section_options are:  
  
                   agcount=value  
                          This is used to specify the number of allocation groups. The data section of the  filesystem  
                          is  divided into allocation groups to improve the performance of XFS. More allocation groups  
                          imply that more parallelism can be achieved when allocating blocks and inodes.  The  minimum  
                          allocation  group size is 16 MiB; the maximum size is just under 1 TiB.  The data section of  
                          the filesystem is divided into value allocation groups (default value is scaled  
                          automatically based on the underlying device size).  
  
                   agsize=value  
                          This  is an alternative to using the agcount suboption. The value is the desired size of the  
                          allocation group expressed in bytes (usually using the m or g suffixes).  This value must be  
                          a  multiple of the filesystem block size, and must be at least 16MiB, and no more than 1TiB,  
                          and may be automatically adjusted to properly align with the stripe geometry.   The  agcount  
                          and agsize suboptions are mutually exclusive.  
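Tying this back to worked example (a) above (C = 600g, agcount = 5, agsize = 720g), a sketch; the device name is the cachedev from this article's examples and the numbers assume example (a)'s sizes:

```bash
# agsize from example (a): 720g satisfies agsize % 600g = 120g = C/5
mkfs.xfs -d agsize=720g /dev/mapper/cachedev1

# Alternatively, let mkfs.xfs derive agsize from a count:
# mkfs.xfs -d agcount=5 /dev/mapper/cachedev1
```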

15. Does caching sequential IO on the SSD hurt performance, and if so, how can we optimize by skipping sequential IO?

With cache-all-IO enabled, problems can arise from sequential IO being loaded into the SSD.

Tuning Sequential IO Skipping for better flashcache performance  
===============================================================  
Skipping sequential IO makes sense in two cases:  
1) your sequential write speed of your SSD is slower than  
   the sequential write speed or read speed of your disk.  In   
   particular, for implementations with RAID disks (especially   
   modes 0, 10 or 5) sequential reads may be very fast.  If   
   'cache_all' mode is used, every disk read miss must also be   
   written to SSD.  If you notice slower sequential reads and writes   
   after enabling flashcache, this is likely your problem.  

If the data device is a RAID array, and the array is large or has a RAID cache, its sequential read/write performance can be very good, possibly even exceeding the SSD's (a PCI-E SSD, of course, is rarely beaten).

In that case, loading sequential IO into the SSD becomes a net loss: it occupies a large amount of SSD space, and there is no performance gain to show for it (precisely when the SSD's sequential performance is below the RAID array's).

If you run into the situation above, it is time to adjust the flashcache module parameter so that sequential IO skips the SSD cache.

2) Your 'resident set' of disk blocks that you want cached, i.e.  
   those that you would hope to keep in cache, is smaller  
   than the size of your SSD.  You can check this by monitoring  
   how quick your cache fills up ('dmsetup table').  If this  
   is the case, it makes sense to prioritize caching of random IO,  
   since SSD performance vastly exceeds disk performance for   
   random IO, but is typically not much better for sequential IO.  

If the SSD fills up quickly, sequential reads may be getting loaded into it; if that has negative effects, it likewise indicates the flashcache module parameter should be adjusted to skip caching sequential IO.

If the negative effect has already appeared (e.g. performance drops after adding the SSD), and the observations above confirm that caching sequential IO is the cause, adjust it as follows.

Set a threshold for dev.flashcache.<cachedev>.skip_seq_thresh_kb via sysctl, starting from 1024 (KB). After each setting, run your test workload and check for an effect; if there is none, lower the value and re-test, until it takes effect.

In the above cases, start with a high value (say 1024k) for  
sysctl dev.flashcache.<device>.skip_seq_thresh_kb, so only the  
largest sequential IOs are skipped, and gradually reduce  
if benchmarks show it's helping.  Don't leave it set to a very  
high value, return it to 0 (the default), since there is some  
overhead in categorizing IO as random or sequential.  
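The tuning loop above as a sketch (cache name sda1+sdc3 from the earlier examples; the benchmark is whatever matches your workload):

```bash
# Start high so only the largest sequential streams are skipped
sysctl -w dev.flashcache.sda1+sdc3.skip_seq_thresh_kb=1024
# ... run the benchmark, check 'dmsetup status cachedev1' ...

# No improvement? Halve the threshold and re-test
sysctl -w dev.flashcache.sda1+sdc3.skip_seq_thresh_kb=512

# If it never helps, return to the default of caching all IO
sysctl -w dev.flashcache.sda1+sdc3.skip_seq_thresh_kb=0
```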

If you hit neither of the problems above, just keep caching all IO.

If neither of the above hold, continue to cache all IO,   
(the default) you will likely benefit from it.  

References

1. https://raw.githubusercontent.com/facebook/flashcache/master/doc/flashcache-sa-guide.txt

2. https://github.com/facebook/flashcache/issues

3. http://blog.163.com/digoal@126/blog/static/1638770402014528115551323/

4. https://github.com/facebook/flashcache/blob/master/utils/flashcache
