FlashCache System Administration Guide
Introduction :
Flashcache is a block cache for Linux, built as a kernel module,
using the Device Mapper. Flashcache supports writeback, writethrough
and writearound caching modes. This document is a quick administration
guide to flashcache.
Requirements :
Flashcache has been tested on a variety of kernels between 2.6.18 and 2.6.38.
If you'd like to build and use it on a newer kernel, please send me an email
and I can help. Kernels older than 2.6.18 are not supported.
Choice of Caching Modes :
Writethrough - safest, all writes are cached to ssd but also written to disk
immediately. If your ssd has slower write performance than your disk (likely
for early generation SSDs purchased in 2008-2010), this may limit your system
write performance. All disk reads are cached (tunable).
Writearound - again, very safe, writes are not written to ssd but directly to
disk. Disk blocks will only be cached after they are read. All disk reads
are cached (tunable).
Writeback - fastest but less safe. Writes only go to the ssd initially, and
based on various policies are written to disk later. All disk reads are
cached (tunable).
Writeonly - variant of writeback caching. In this mode, only incoming writes
are cached. No reads are ever cached.
Cache Persistence :
Writethrough and Writearound caches are not persistent across a device removal
or a reboot. Only Writeback caches are persistent across device removals
and reboots. This reinforces 'writeback is fastest', 'writethrough is safest'.
Known Bugs :
See and report new issues there please.
Data corruption has been reported when using a loopback device for the cache device.
See also the 'Futures and Features' section of the design document, flashcache-doc.txt.
Cache creation and loading using the flashcache utilities :
Included are 3 utilities - flashcache_create, flashcache_load and
flashcache_destroy. These utilities use dmsetup internally, presenting
a simpler interface to create, load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities
instead of using dmsetup.
flashcache_create : Create a new flashcache volume.
flashcache_create [-v] -p back|around|thru [-s cache size] [-w] [-b block size] cachedevname ssd_devname disk_devname
-v : verbose.
-p : cache mode: back (writeback), thru (writethrough) or around (writearound).
-s : cache size. Optional. If this is not specified, the entire ssd device
is used as cache. The default unit is sectors, but you can specify
k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
The default unit is sectors, but you can specify k as the unit as well.
(A 4KB blocksize is the correct choice for the vast majority of
applications. But see the section "Cache Blocksize selection" below).
-f : force create. Bypass checks (e.g. for ssd sectorsize).
-w : write cache mode. Only writes are cached, not reads.
-d : disk associativity. Within each cache set, we store several contiguous
disk extents. Defaults to off.
Examples :
flashcache_create -p back -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB writeback cache volume with a 4KB block size on ssd
device /dev/sdc to cache the disk volume /dev/sdb. The name of the device
created is "cachedev".
flashcache_create -p thru -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but creates a write through cache with units specified in
sectors instead. The name of the device created is "cachedev".
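The two examples above describe the same sizes in different units (1g is 2097152 sectors, 4k is 8 sectors). A minimal sketch of that conversion, assuming 512-byte sectors and binary k/m/g multiples; size_to_sectors is a hypothetical helper, not one of the flashcache utilities:

```shell
# Hypothetical helper (not part of flashcache): convert a size such as
# "1g", "4k" or a plain sector count into 512-byte sectors.
# Assumes 1k = 1024 bytes, so 1k = 2 sectors.
size_to_sectors() {
    case "$1" in
        *k) echo $(( ${1%k} * 2 )) ;;          # kilobytes -> sectors
        *m) echo $(( ${1%m} * 2048 )) ;;       # megabytes -> sectors
        *g) echo $(( ${1%g} * 2097152 )) ;;    # gigabytes -> sectors
        *)  echo "$1" ;;                       # already in sectors
    esac
}
```

For example, size_to_sectors 1g prints 2097152 and size_to_sectors 4k prints 8, which are the values passed to -s and -b in the second example.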
flashcache_load : Load an existing writeback cache volume.
flashcache_load ssd_devname [cachedev_name]
Example :
flashcache_load /dev/sdc
Load the existing writeback cache on /dev/sdc, using the virtual
cachedev_name from when the device was created. If you're upgrading from
an older flashcache device format that didn't store the cachedev name
internally, or you want to change the cachedev name in use, you can specify
it as an optional second argument to flashcache_load.
For writethrough and writearound caches flashcache_load is not needed; flashcache_create
should be used each time.
flashcache_destroy : Destroy an existing writeback flashcache. All data will be lost !!!
flashcache_destroy ssd_devname
Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!
For writethrough and writearound caches this is not necessary.
Removing a flashcache volume :
Use dmsetup remove to remove a flashcache volume. For writeback
cache mode, the default behavior on a remove is to clean all dirty
cache blocks to disk. The remove will not return until all blocks
are cleaned. Progress on disk cleaning is reported on the console
(also see the "fast_remove" flashcache sysctl).
A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously (again see the note about "fast_remove" in the
sysctls section).
For writethrough and writearound caches, the device removal or reboot
results in the cache being destroyed. However, there is no harm in
doing a 'dmsetup remove' to tidy up before a reboot, and indeed
this will be needed if you ever need to unload the flashcache kernel
module (for example to load a new version into a running system).
dmsetup remove cachedev
This removes the flashcache volume named cachedev, cleaning
all dirty blocks prior to removal.
Cache Stats :
Use 'dmsetup status' for cache statistics.
'dmsetup table' also dumps a number of cache related statistics.
Examples :
dmsetup status cachedev
dmsetup table cachedev
Flashcache errors are reported in
/proc/flashcache/<cache name>/flashcache_errors
Flashcache stats are also reported in
/proc/flashcache/<cache name>/flashcache_stats
for easier parseability.
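A rough sketch of pulling one counter out of the stats file. It assumes the file holds whitespace-separated key=value pairs, and the key names used in the usage note (reads, read_hits) are assumptions too; check the file on your system before relying on them. stat_value is a hypothetical helper:

```shell
# Hypothetical helper: print the value of one key from stats text on stdin.
# Assumes the input is whitespace-separated key=value pairs.
stat_value() {
    # Put each key=value pair on its own line, then print the value
    # of the first pair whose key matches $1.
    tr -s ' \t' '\n\n' | awk -F= -v k="$1" '$1 == k { print $2; exit }'
}
```

Usage would look like: stat_value read_hits < /proc/flashcache/<cache name>/flashcache_stats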
Using Flashcache sysVinit script (Redhat based systems):
Note that this section applies only to Red Hat based systems. Use
'utils/flashcache' from the repository as the sysvinit script.
This script loads, unloads and reports statistics for an existing flashcache
writeback cache volume. It loads the already created cachedev during
system boot and removes the flashcache volume before the system halts.
This script is necessary because a kernel panic occurs when a flashcache
volume is not removed before the system halts.
Configuring the script using chkconfig:
1. Copy 'utils/flashcache' from the repo to '/etc/init.d/flashcache'
2. Make sure this file has execute permissions,
'sudo chmod +x /etc/init.d/flashcache'.
3. Edit this file and specify the values for the following variables
4. Modify the headers in the file if necessary.
By default, it starts in runlevel 3, with start-stop priority 90-10
5. Register this file using chkconfig
'chkconfig --add /etc/init.d/flashcache'
Cache Blocksize selection :
Cache blocksize selection is critical for good cache utilization.
A 4KB cache blocksize is the right choice for the vast majority of workloads (and filesystems).
Cache Metadata Blocksize selection :
This section only applies to the writeback cache mode. Writethrough and
writearound modes store no cache metadata at all.
In Flashcache version 1, the metadata blocksize was fixed at 1 sector (512b).
Flashcache version 2 removes this limitation. In version 2, we can configure
a larger flashcache metadata blocksize. Version 2 maintains backwards compatibility
for caches created with version 1. For these caches, a metadata blocksize of 512b
will continue to be used.
flashcache_create -m can be used to optionally configure the metadata blocksize.
Defaults to 4KB.
Ideal choices for the metadata blocksize are 4KB (default) or 8KB. There is
little benefit to choosing a metadata blocksize greater than 8KB. The choice
of metadata blocksize is subject to the following rules :
1) Metadata blocksize must be a power of 2.
2) Metadata blocksize cannot be smaller than sector size configured on the
ssd device.
3) A single metadata block cannot contain metadata for 2 cache sets. In other
words, with the default associativity of 512 (with each cache metadata slot
sizing at 16 bytes), the entire metadata for a given set fits in 8KB (512*16b).
For an associativity of 512, we cannot configure a metadata blocksize greater
than 8KB.
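The three rules above can be checked before calling flashcache_create -m. This is a sketch only: the function names are hypothetical, all sizes are in bytes, and the 16-byte slot size is taken from rule 3:

```shell
# Hypothetical helpers, not part of flashcache. All sizes are in bytes.
is_power_of_2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# md_blocksize_ok <metadata blocksize> <ssd sector size> [associativity]
# Succeeds if the metadata blocksize satisfies rules 1-3 above.
md_blocksize_ok() {
    md_size=$1
    md_sector=$2
    md_assoc=${3:-512}                                   # default associativity
    is_power_of_2 "$md_size" || return 1                 # rule 1
    [ "$md_size" -ge "$md_sector" ] || return 1          # rule 2
    # rule 3: one block must not span two sets (16 bytes per slot)
    [ "$md_size" -le $(( md_assoc * 16 )) ] || return 1
    return 0
}
```

With the default associativity, md_blocksize_ok 8192 512 succeeds, while md_blocksize_ok 16384 512 fails on rule 3.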
Advantages of choosing a larger (than 512b) metadata blocksize :
- Allows the ssd to be configured to larger sectors. For example, some ssds
allow choosing a 4KB sector, often a more performant choice.
- Allows flashcache to do better batching of metadata updates, potentially
reducing the number of metadata updates and small ssd writes, which lowers
write amplification and extends ssd lifetime.
Thanks due to Earle Philhower of Virident for this feature !
FlashCache Sysctls :
Flashcache sysctls operate on a per-cache device basis. A couple of examples follow.
Sysctls for a writearound or writethrough mode cache :
cache device /dev/ram3, disk device /dev/ram4
dev.flashcache.ram3+ram4.cache_all = 1
dev.flashcache.ram3+ram4.zero_stats = 0
dev.flashcache.ram3+ram4.reclaim_policy = 0
dev.flashcache.ram3+ram4.pid_expiry_secs = 60
dev.flashcache.ram3+ram4.max_pids = 100
dev.flashcache.ram3+ram4.do_pid_expiry = 0
dev.flashcache.ram3+ram4.io_latency_hist = 0
dev.flashcache.ram3+ram4.skip_seq_thresh_kb = 0
Sysctls for a writeback mode cache :
cache device /dev/sdb, disk device /dev/cciss/c0d2
dev.flashcache.sdb+c0d2.fallow_delay = 900
dev.flashcache.sdb+c0d2.fallow_clean_speed = 2
dev.flashcache.sdb+c0d2.cache_all = 1
dev.flashcache.sdb+c0d2.fast_remove = 0
dev.flashcache.sdb+c0d2.zero_stats = 0
dev.flashcache.sdb+c0d2.reclaim_policy = 0
dev.flashcache.sdb+c0d2.pid_expiry_secs = 60
dev.flashcache.sdb+c0d2.max_pids = 100
dev.flashcache.sdb+c0d2.do_pid_expiry = 0
dev.flashcache.sdb+c0d2.max_clean_ios_set = 2
dev.flashcache.sdb+c0d2.max_clean_ios_total = 4
dev.flashcache.sdb+c0d2.dirty_thresh_pct = 20
dev.flashcache.sdb+c0d2.stop_sync = 0
dev.flashcache.sdb+c0d2.do_sync = 0
dev.flashcache.sdb+c0d2.io_latency_hist = 0
dev.flashcache.sdb+c0d2.skip_seq_thresh_kb = 0
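In both listings above the sysctl prefix is the last path component of the cache device and of the disk device, joined by '+'. A small sketch that builds the full name on that assumption (it is inferred from these two examples only; flashcache_sysctl is a hypothetical helper):

```shell
# Hypothetical helper: build a flashcache sysctl name.
# $1 = cache (ssd) device, $2 = disk device, $3 = sysctl leaf name.
# Assumes the prefix is "<basename of ssd>+<basename of disk>",
# as in the two example listings.
flashcache_sysctl() {
    echo "dev.flashcache.$(basename "$1")+$(basename "$2").$3"
}
```

For example: sysctl -w "$(flashcache_sysctl /dev/sdb /dev/cciss/c0d2 fast_remove)=1"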
Sysctls common to all cache modes :
dev.flashcache.<cachedev>.cache_all = 1
Global caching mode to cache everything or cache nothing.
See section on Caching Controls. Defaults to "cache everything".
dev.flashcache.<cachedev>.zero_stats = 0
Zero stats (once).
dev.flashcache.<cachedev>.reclaim_policy = 0
FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at runtime.
dev.flashcache.<cachedev>.io_latency_hist = 0
Compute IO latencies and plot these out on a histogram.
The scale is 250 usecs. This is disabled by default since
internally flashcache uses gettimeofday() to compute latency
and this can get expensive depending on the clocksource used.
Setting this to 1 enables computation of IO latencies.
The IO latency histogram is appended to 'dmsetup status'.
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_pids = 100
Maximum number of pids in the white/black lists.
dev.flashcache.<cachedev>.do_pid_expiry = 0
Enable expiry on the list of pids in the white/black lists.
dev.flashcache.<cachedev>.pid_expiry_secs = 60
Set the expiry on the pid white/black lists.
dev.flashcache.<cachedev>.skip_seq_thresh_kb = 0
Skip (don't cache) sequential IO larger than this number (in kb).
0 (default) means cache all IO, both sequential and random.
Sequential IO can only be determined 'after the fact', so
this much of each sequential I/O will be cached before we skip
the rest. Does not affect searching for IO in an existing cache.
Sysctls for writeback mode only :
dev.flashcache.<cachedev>.fallow_delay = 900
In seconds. Clean dirty blocks that have been "idle" (not
read or written) for fallow_delay seconds. Default is 15 minutes
(900 seconds). Setting this to 0 disables idle cleaning completely.
dev.flashcache.<cachedev>.fallow_clean_speed = 2
The maximum number of "fallow clean" disk writes per set
per second. Defaults to 2.
dev.flashcache.<cachedev>.fast_remove = 0
Don't sync dirty blocks when removing cache. On a reload
both DIRTY and CLEAN blocks persist in the cache. This
option can be used to do a quick cache remove.
CAUTION: The cache still has uncommitted (to disk) dirty
blocks after a fast_remove.
dev.flashcache.<cachedev>.dirty_thresh_pct = 20
Flashcache will attempt to keep the dirty blocks in each set
under this %. A lower dirty threshold increases disk writes,
and reduces block overwrites, but increases the blocks
available for read caching.
dev.flashcache.<cachedev>.stop_sync = 0
Stop the sync in progress.
dev.flashcache.<cachedev>.do_sync = 0
Schedule cleaning of all dirty blocks in the cache.
(There is little reason to tune these)
dev.flashcache.<cachedev>.max_clean_ios_set = 2
Maximum writes that can be issued per set when cleaning blocks.
dev.flashcache.<cachedev>.max_clean_ios_total = 4
Maximum writes that can be issued when syncing all blocks.
Using dmsetup to create and load flashcache volumes :
Few users will need to use dmsetup natively to create and load
flashcache volumes. This section covers that.
dmsetup create device_name table_file
device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup
attempts to read this from stdin.
table_file format :
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <dm virtual name> <cache mode> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]
cache mode:
1: Write Back
2: Write Through
3: Write Around
flashcache cmd:
1: load existing cache
2: create cache
3: force create cache (overwriting existing cache). USE WITH CAUTION
blksize in sectors:
4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
See the section "Cache Blocksize selection" above.
Unused (can be omitted) for cache loads.
size of cache in sectors:
Optional. if size is not specified, the entire ssd device is used as
cache. Needs to be a power of 2.
Unused (can be omitted) for cache loads.
cache set size:
Optional. The default set size is 512, which works well for most
applications. Little reason to change this. Needs to be a
power of 2.
Unused (can be omitted) for cache loads.
Example :
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 cachedev 1 2 8 522000000 | dmsetup create cachedev
This creates a writeback cache device called "cachedev" (/dev/mapper/cachedev)
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.
The size of the cache is 522000000 sectors.
(TODO : Change loading of the cache to happen via "dmsetup load" instead
of "dmsetup create").
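The table line in the example above can be assembled from named parts, which makes the numeric cache mode and command fields harder to get wrong. A sketch for the create case only (flashcache cmd 2); flashcache_table is a hypothetical helper and takes the same mode names as flashcache_create -p:

```shell
# Hypothetical helper: print a dmsetup table line that creates a cache.
# args: <disk size in sectors> <disk dev> <ssd dev> <dm virtual name>
#       <back|thru|around> <blksize in sectors> [cache size in sectors]
flashcache_table() {
    case "$5" in
        back)   ft_mode=1 ;;    # Write Back
        thru)   ft_mode=2 ;;    # Write Through
        around) ft_mode=3 ;;    # Write Around
        *) echo "unknown cache mode: $5" >&2; return 1 ;;
    esac
    # The "2" after the mode is the flashcache cmd for "create cache".
    # The cache size is appended only when a 7th argument is given.
    echo "0 $1 flashcache $2 $3 $4 $ft_mode 2 $6${7:+ $7}"
}
```

The example above would then read: flashcache_table "$(blockdev --getsize /dev/cciss/c0d1p2)" /dev/cciss/c0d1p2 /dev/fioa2 cachedev back 8 522000000 | dmsetup create cachedev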
Caching Controls
Flashcache can be put in one of 2 modes - Cache Everything or
Cache Nothing (dev.flashcache.<cachedev>.cache_all). The default is to "cache
everything". These 2 modes have a blacklist and a whitelist.
The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application. The tgid for a pthread
is returned by getpid() and the pid of the individual thread is
returned by gettid().
The algorithm works as follows :
In "cache everything" mode,
1) If the pid of the process issuing the IO is in the blacklist, do
not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
whitelist, which makes the IO cacheable).
4) Finally, even if IO is cacheable up to this point, skip sequential IO
if configured by the sysctl.
Conversely, in "cache nothing" mode,
1) If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
blacklist, which makes the IO non-cacheable).
4) Anything whitelisted is cached, regardless of sequential or random IO.
Examples :
1) You can make the global cache setting "cache nothing", and add the
tgid of your pthreaded application to the whitelist. Which makes only
IOs issued by your application cacheable by Flashcache.
2) You can make the global cache setting "cache everything" and add
tgids (or pids) of other applications that may issue IOs on this
volume to the blacklist, which will make those un-interesting IOs not cacheable.
Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,
kswapd would also do the writes, with flashcache caching those.
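The pid/tgid rules described above can be written out as a small decision function. This is an illustration of the documented rules only, not the kernel code: the lists are plain space-separated strings, the function names are hypothetical, and sequential IO skipping (step 4 of "cache everything") is left out:

```shell
# Hypothetical helpers illustrating the blacklist/whitelist rules.
# in_list <id> "<space-separated ids>"
in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# is_cacheable <everything|nothing> <pid> <tgid> "<blacklist>" "<whitelist>"
# Succeeds if IO from this pid would be cached under the rules above.
is_cacheable() {
    if [ "$1" = everything ]; then
        if in_list "$2" "$4"; then return 1; fi          # 1) pid blacklisted
        if in_list "$3" "$4"; then                        # 2) tgid blacklisted
            if in_list "$2" "$5"; then return 0; fi       # 3) pid is an exception
            return 1
        fi
        return 0
    fi
    # "cache nothing" mode
    if in_list "$2" "$5"; then return 0; fi               # 1) pid whitelisted
    if in_list "$3" "$5"; then                            # 2) tgid whitelisted
        if in_list "$2" "$4"; then return 1; fi           # 3) pid is an exception
        return 0
    fi
    return 1
}
```

For example, is_cacheable nothing 101 100 "" "100" succeeds because the tgid 100 is whitelisted, while is_cacheable nothing 101 100 "101" "100" fails because pid 101 is marked as an exception.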
The following cacheability ioctls are supported on /dev/mapper/<cachedev>
FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
cleanup if a process dies.
FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
cleanup if a process dies.
/proc/flashcache_pidlists shows the list of pids on the whitelist
and the blacklist.
Security Note :
With Flashcache, it is possible for a malicious user process to
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).
Not documenting the mechanics of how a malicious process could
corrupt data here.
You can work around this by setting file permissions on files in
the flashcache volume appropriately.
Why is my cache only (<< 100%) utilized ?
(Answer contributed by Will Smith)
- There is essentially a 1:many mapping between SSD blocks and HDD blocks.
- In more detail, a HDD block gets hashed to a set on SSD which contains by
default 512 blocks. It can only be stored in that set on SSD, nowhere else.
So with a simplified SSD containing only 3 sets:
SSD = 1 2 3 , and a HDD with 9 sets worth of data, the HDD sets would map to the SSD
sets like this:
HDD: 1 2 3 4 5 6 7 8 9
SSD: 1 2 3 1 2 3 1 2 3
So if your data only happens to live in HDD sets 1 and 4, they will compete for
SSD set 1 and your SSD will at most become 33% utilized.
If you use XFS you can tune the XFS agsize/agcount to try and mitigate this
(described next section).
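The 3-set picture above can be reproduced with a few lines of arithmetic. This follows the simplified round-robin illustration only (the real mapping is a hash, as noted above); ssd_set and utilization_pct are hypothetical names:

```shell
# Hypothetical helpers for the simplified example above.
# ssd_set <hdd set number> <number of ssd sets>: 1-based round-robin mapping.
ssd_set() {
    echo $(( ($1 - 1) % $2 + 1 ))
}

# utilization_pct <number of ssd sets> <hdd set>...
# Percentage of SSD sets that can ever be used by the given HDD sets.
utilization_pct() {
    up_nsets=$1
    shift
    up_used=$(for h in "$@"; do ssd_set "$h" "$up_nsets"; done | sort -u | wc -l)
    echo $(( up_used * 100 / up_nsets ))
}
```

utilization_pct 3 1 4 prints 33, matching the case where data lives only in HDD sets 1 and 4.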
Tuning XFS for better flashcache performance :
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
decrease collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.
We can achieve this by computing agsize (and implicitly agcount) using
these equations,
C = Cache size,
V = Size of filesystem Volume.
agsize % C = (1/agcount)*C
agsize * agcount ~= V
where agsize <= 1000g (XFS limits on agsize).
A couple of examples that illustrate the formula,
For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.
Ag1 Ag2 Ag3 Ag4
-- -- -- --
C1 C2 C3 C4 (stripe 1)
C2 C3 C4 C1 (stripe 2)
C3 C4 C1 C2 (stripe 3)
C4 C1 C2 C3 (stripe 4)
C1 C2 C3 C4 (stripe 5)
In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.
Clearly, this is an ideal mapping, from a distribution across the
cache point of view.
Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.
Ag1 Ag2 Ag3 Ag4 Ag5
-- -- -- -- --
C1 C2 C3 C4 C5 (stripe 1)
C2 C3 C4 C5 C1 (stripe 2)
C3 C4 C5 C1 C2 (stripe 3)
C4 C5 C1 C2 C3 (stripe 4)
C5 C1 C2 C3 C4 (stripe 5)
C1 C2 C3 C4 C5 (stripe 6)
A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.
a) C = 600g, V = 3.5TB
Consider agcount = 5
agsize % 600 = (1/5)*600
agsize % 600 = 120
So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)
b) C = 150g, V = 3.5TB
Consider agcount=4
agsize % 150 = (1/4)*150
agsize % 150 = 37.5
So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)
As an alternative,
agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V
Works just as well as the formula above.
This computation has been implemented in the utils/get_agsize utility.
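For a quick check of the two worked examples, the first formula can also be evaluated with integer shell arithmetic. This sketch is not the get_agsize utility: it works in whole gigabytes, and picking the multiple of C that brings agsize closest to V/agcount is an assumption made here:

```shell
# Hypothetical helper: agsize_for <C in g> <V in g> <agcount>
# Finds agsize = k*C + C/agcount with agsize as close as possible
# to V/agcount, so that agsize % C = (1/agcount)*C.
agsize_for() {
    ag_c=$1
    ag_v=$2
    ag_n=$3
    ag_rem=$(( ag_c / ag_n ))           # required remainder, (1/agcount)*C
    ag_target=$(( ag_v / ag_n ))        # ideal agsize if V is split evenly
    # nearest whole number of cache sizes below the target
    ag_k=$(( (ag_target - ag_rem + ag_c / 2) / ag_c ))
    if [ "$ag_k" -lt 0 ]; then ag_k=0; fi
    echo $(( ag_k * ag_c + ag_rem ))
}
```

agsize_for 600 3500 5 prints 720 and agsize_for 150 3500 4 prints 937, the same values as examples a) and b).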
Tuning Sequential IO Skipping for better flashcache performance
Skipping sequential IO makes sense in two cases:
1) The sequential write speed of your SSD is slower than
the sequential write speed or read speed of your disk. In
particular, for implementations with RAID disks (especially
modes 0, 10 or 5) sequential reads may be very fast. If
'cache_all' mode is used, every disk read miss must also be
written to SSD. If you notice slower sequential reads and writes
after enabling flashcache, this is likely your problem.
2) Your 'resident set' of disk blocks that you want cached, i.e.
those that you would hope to keep in cache, is smaller
than the size of your SSD. You can check this by monitoring
how quickly your cache fills up ('dmsetup table'). If this
is the case, it makes sense to prioritize caching of random IO,
since SSD performance vastly exceeds disk performance for
random IO, but is typically not much better for sequential IO.
In the above cases, start with a high value (say 1024k) for
sysctl dev.flashcache.<device>.skip_seq_thresh_kb, so only the
largest sequential IOs are skipped, and gradually reduce it
if benchmarks show it's helping. If it does not help, don't leave
it set to a very high value; return it to 0 (the default), since
there is some overhead in categorizing IO as random or sequential.
If neither of the above holds, continue to cache all IO
(the default); you will likely benefit from it.
Further Information
Git repository :
Developer mailing list :