doc/flashcache-sa-guide.txt

		FlashCache System Administration Guide
		--------------------------------------

Introduction :
============
Flashcache is a writeback block cache for Linux, built as a kernel module,
using the Device Mapper. This document is a quick admministration guide 
to flashcache.

Requirements :
============
Flashcache has been built and tested on 2.6.18 and 2.6.20. If you'd like
to build and use it on a newer kernel, please send me an email and I 
can help. I will not support older than 2.6.18 kernels.

Cache creation and loading using the flashcache utilities :
=========================================================
Included are 3 utilities - flashcache_create, flashcache_load and 
flashcache_destroy. These utilities use dmsetup internally, presenting
a simpler interface to create, load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities
instead of using dmsetup.

flashcache_create : Create a new flashcache volume.

flashcache_create [-s cache size] [-b block size] cachedevname ssd_devname disk_devname
-s : cache size. Optional. If this is not specified, the entire ssd device
     is used as cache. The default units is sectors. But you can specify 
     k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
     The default units is sectors. But you can specify k as units as well.
     (A 4KB blocksize is the correct choice for the vast majority of 
     applications. But see the section "Cache Blocksize selection" below).
-f : force create. by pass checks (eg for ssd sectorsize).

Examples :
flashcache_create -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB cache volume with a 4KB block size on ssd device /dev/sdc
to cache the disk volume /dev/sdb. The name of the device created is
"cachedev".

flashcache_create -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but units specified in sectors instead. The name of the 
device created is "cachedev".

flashcache_load : Load an existing flashcache volume.

flashcache_load cachedevname ssd_devname disk_devname

Example :
flashcache_load cachedev /dev/sdc /dev/sdb
Load the existing cache on /dev/sdc, calling the flashcache volume
cachedev.

flashcache_destroy : Destroy an existing flashcache. All data will be lost !!!

flashcache_destroy ssd_devname

Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!

Removing a flashcache volume :
============================
Use dmsetup remove to remove a flashcache volume. Note that the 
default behavior on a remove is to clean all dirty cacheblock to
disk. The remove will not return until all blocks are cleaned.
Progress on disk cleaning is reported on the console (also 
see the "fast_remove" flashcache sysctl).

A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously (again see the note about "fast_remove" in the
sysctls section).

Example:
dmsetup remove cachedev

This removes the flashcache volume name cachedev. Cleaning
all blocks prior to removal.

Cache Stats :
===========
Use 'dmsetup status' for cache statistics.

'dmsetup table' also dumps a number of cache related statistics.

Examples :
dmsetup status cachedev
dmsetup table cachedev

Flashcache errors are reported in 
/proc/flashcache_errors

Flashcache stats are also reported in 
/proc/flashcache_stats 
for easier parseability.

Cache Blocksize selection :
=========================
Cache blocksize selection is critical for good cache utilization and
performance. 

A 4KB cache blocksize for the vast majority of workloads (and filesystems).

FlashCache Sysctls :
==================
These sysctls will apply to all cache devices on the node.

dev.flashcache.fast_remove:
	Don't sync dirty blocks when removing cache. On a reload
	both DIRTY and CLEAN blocks persist in the cache. This 
	option can be used to do a quick cache remove. 
	CAUTION: The cache still has uncommitted (to disk) dirty
	blocks after a fast_remove.
dev.flashcache.zero_stats:
	Zero out all cache stats reported by "dmsetup status".
dev.flashcache.reclaim_policy:
	FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at 
	runtime.
dev.flashcache.write_merge:
	Enable write merging. When cleaning blocks tack on any
	contigous blocks in the set to the ones that are being cleaned
	Defaults to on.
dev.flashcache.dirty_thresh_pct:
	Flashcache will attempt to keep the dirty blocks in each set 
	under this %. A lower dirty threshold increases disk writes, 
	and reduces block overwrites, but increases the blocks
	available for read caching.
dev.flashcache.do_sync: 
	Schedule cleaning of all dirty blocks in the cache. 
dev.flashcache.stop_sync: 
	Stop the sync in progress.
dev.flashcache.cache_all:
	Global caching mode to cache everything or cache nothing.
	See section on Caching Controls. Defaults to "cache everything".

There is little reason to change these :

dev.flashcache.max_clean_ios_set:
	Maximum writes that can be issues per set when cleaning
	blocks.
dev.flashcache.max_clean_ios_total:
	Maximum writes that can be issued when syncing all blocks.
dev.flashcache.debug:
	Enable verbose debugging.
dev.flashcache.do_pid_expiry:
	Enable expiry on the list of pids in the white/black lists.
dev.flashcache.pid_expiry_secs:
	Set the expiry on the pid white/black lists.
dev.flashcache.max_pids:
	Maximum number of pids in the white/black lists.

Using dmsetup to create and load flashcache volumes :
===================================================
Few users will need to use dmsetup natively to create and load 
flashcache volumes. This section covers that.

dmsetup create device_name table_file

where

device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup 
	     attempts to read this from stdin.

table_file format :
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]

flashcache cmd: 
	   1: load existing cache
	   2: create cache
	   3: force create cache (overwriting existing cache). USE WITH CAUTION

blksize in sectors:
	   4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
	   See note on block size selection below.
	   Unused (can be omitted) for cache loads.

size of cache in sectors:
	   Optional. if size is not specified, the entire ssd device is used as
	   cache. Needs to be a power of 2.
	   Unused (can be omitted) for cache loads.

cache set size:
	   Optional. The default set size is 512, which works well for most 
	   applications. Little reason to change this. Needs to be a
	   power of 2.
	   Unused (can be omitted) for cache loads.

Example :

echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 2 8 522000000 | dmsetup create cachedev

This creates a cache device called "cachedev" (/dev/mapper/cachedev)
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.
The size of the cache is 522000000 sectors.

(TODO : Change loading of the cache happen via "dmsetup load" instead
of "dmsetup create").

Caching Controls
================
Flashcache can be put in one of 2 modes - Cache Everything or 
Cache Nothing (dev.flashcache.cache_all). The defaults is to "cache
everything".

These 2 modes have a blacklist and a whitelist.

The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application. The tgid for a pthread
is returned by getpid() and the pid of the individual thread is
returned by gettid().

The algorithm works as follows :

In "cache everything" mode,
1) If the pid of the process issuing the IO is in the blacklist, do
not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
whitelist, which makes the IO cacheable).

Conversely, in "cache nothing" mode,
1) If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
blacklist, which makes the IO non-cacheable).

Examples :
--------
1) You can make the global cache setting "cache nothing", and add the
tgid of your pthreaded application to the whitelist. Which makes only
IOs issued by your application cacheable by Flashcache. 
2) You can make the global cache setting "cache everything" and add
tgids (or pids) of other applications that may issue IOs on this
volume to the blacklist, which will make those un-interesting IOs not
cacheable. 

Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,
kswapd would also do the writes, with flashcache caching those.

The following cacheability ioctls are supported on /dev/mapper/<cachedev> 

FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
cleanup if a process dies.

FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
cleanup if a process dies.

/proc/flashcache_pidlists shows the list of pids on the whitelist
and the blacklist.

Security Note :
=============
With Flashcache, it is possible for a malicious user process to 
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).

Not documenting the mechanics of how a malicious process could 
corrupt data here.

You can work around this by setting file permissions on files in 
the flashcache volume appropriately.

Tuning XFS for better flashcache performance :
============================================
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
increase collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.

We can achieve this by computing agsize (and implicitly agcount) using
these equations,

C = Cache size, 
V = Size of filesystem Volume.

agsize % C = (1/agcount)*C
agsize * agcount ~= V

where agsize <= 1000g (XFS limits on agsize).

A couple of examples that illustrate the formula,

For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.

Ag1	     Ag2	Ag3	       Ag4
--	     --		--	       --
C1	     C2		C3	       C4	(stripe 1)
C2	     C3		C4	       C1	(stripe 2)
C3	     C4		C1	       C2	(stripe 3)
C4	     C1		C2	       C3	(stripe 4)
C1	     C2		C3	       C4	(stripe 5)

In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.

Clearly, this is an ideal mapping, from a distribution across the
cache point of view.

Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.

Ag1	     Ag2	Ag3	       Ag4	Ag5
--	     --		--	       --	--
C1	     C2		C3	       C4	C5	(stripe 1)
C2	     C3		C4	       C5	C1	(stripe 2)
C3	     C4		C5	       C1	C2	(stripe 3)
C4	     C5		C1	       C2	C3	(stripe 4)
C5	     C1		C2	       C3	C4	(stripe 5)
C1	     C2		C3	       C4	C5	(stripe 6)

A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.

a) C = 600g, V = 3,5TB
Consider agcount = 5

agsize % 600 = (1/5)*600
agsize % 600 = 120

So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)

b) C = 150g, V = 3.5TB
Consider agcount=4

agsize % 150 = (1/5)*150
agsize % 150 = 37.5

So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)

As an alternative, 

agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V

Works just as well as the formula above.