forked from facebookarchive/flashcache
-
Notifications
You must be signed in to change notification settings - Fork 0
/
flashcache-sa-guide.txt
349 lines (271 loc) · 12.1 KB
/
flashcache-sa-guide.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
FlashCache System Administration Guide
--------------------------------------
Introduction :
============
Flashcache is a writeback block cache for Linux, built as a kernel module,
using the Device Mapper. This document is a quick admministration guide
to flashcache.
Requirements :
============
Flashcache has been built and tested on 2.6.18 and 2.6.20. If you'd like
to build and use it on a newer kernel, please send me an email and I
can help. I will not support older than 2.6.18 kernels.
Cache creation and loading using the flashcache utilities :
=========================================================
Included are 3 utilities - flashcache_create, flashcache_load and
flashcache_destroy. These utilities use dmsetup internally, presenting
a simpler interface to create, load and destroy flashcache volumes.
It is expected that the majority of users can use these utilities
instead of using dmsetup.
flashcache_create : Create a new flashcache volume.
flashcache_create [-s cache size] [-b block size] cachedevname ssd_devname disk_devname
-s : cache size. Optional. If this is not specified, the entire ssd device
is used as cache. The default units is sectors. But you can specify
k/m/g as units as well.
-b : block size. Optional. Defaults to 4KB. Must be a power of 2.
The default units is sectors. But you can specify k as units as well.
(A 4KB blocksize is the correct choice for the vast majority of
applications. But see the section "Cache Blocksize selection" below).
-f : force create. by pass checks (eg for ssd sectorsize).
Examples :
flashcache_create -s 1g -b 4k cachedev /dev/sdc /dev/sdb
Creates a 1GB cache volume with a 4KB block size on ssd device /dev/sdc
to cache the disk volume /dev/sdb. The name of the device created is
"cachedev".
flashcache_create -s 2097152 -b 8 cachedev /dev/sdc /dev/sdb
Same as above but units specified in sectors instead. The name of the
device created is "cachedev".
flashcache_load : Load an existing flashcache volume.
flashcache_load cachedevname ssd_devname disk_devname
Example :
flashcache_load cachedev /dev/sdc /dev/sdb
Load the existing cache on /dev/sdc, calling the flashcache volume
cachedev.
flashcache_destroy : Destroy an existing flashcache. All data will be lost !!!
flashcache_destroy ssd_devname
Example :
flashcache_destroy /dev/sdc
Destroy the existing cache on /dev/sdc. All data is lost !!!
Removing a flashcache volume :
============================
Use dmsetup remove to remove a flashcache volume. Note that the
default behavior on a remove is to clean all dirty cacheblock to
disk. The remove will not return until all blocks are cleaned.
Progress on disk cleaning is reported on the console (also
see the "fast_remove" flashcache sysctl).
A reboot of the node will also result in all dirty cache blocks being
cleaned synchronously (again see the note about "fast_remove" in the
sysctls section).
Example:
dmsetup remove cachedev
This removes the flashcache volume name cachedev. Cleaning
all blocks prior to removal.
Cache Stats :
===========
Use 'dmsetup status' for cache statistics.
'dmsetup table' also dumps a number of cache related statistics.
Examples :
dmsetup status cachedev
dmsetup table cachedev
Flashcache errors are reported in
/proc/flashcache_errors
Flashcache stats are also reported in
/proc/flashcache_stats
for easier parseability.
Cache Blocksize selection :
=========================
Cache blocksize selection is critical for good cache utilization and
performance.
A 4KB cache blocksize for the vast majority of workloads (and filesystems).
FlashCache Sysctls :
==================
These sysctls will apply to all cache devices on the node.
dev.flashcache.fast_remove:
Don't sync dirty blocks when removing cache. On a reload
both DIRTY and CLEAN blocks persist in the cache. This
option can be used to do a quick cache remove.
CAUTION: The cache still has uncommitted (to disk) dirty
blocks after a fast_remove.
dev.flashcache.zero_stats:
Zero out all cache stats reported by "dmsetup status".
dev.flashcache.reclaim_policy:
FIFO (0) vs LRU (1). Defaults to FIFO. Can be switched at
runtime.
dev.flashcache.write_merge:
Enable write merging. When cleaning blocks tack on any
contigous blocks in the set to the ones that are being cleaned
Defaults to on.
dev.flashcache.dirty_thresh_pct:
Flashcache will attempt to keep the dirty blocks in each set
under this %. A lower dirty threshold increases disk writes,
and reduces block overwrites, but increases the blocks
available for read caching.
dev.flashcache.do_sync:
Schedule cleaning of all dirty blocks in the cache.
dev.flashcache.stop_sync:
Stop the sync in progress.
dev.flashcache.cache_all:
Global caching mode to cache everything or cache nothing.
See section on Caching Controls. Defaults to "cache everything".
There is little reason to change these :
dev.flashcache.max_clean_ios_set:
Maximum writes that can be issues per set when cleaning
blocks.
dev.flashcache.max_clean_ios_total:
Maximum writes that can be issued when syncing all blocks.
dev.flashcache.debug:
Enable verbose debugging.
dev.flashcache.do_pid_expiry:
Enable expiry on the list of pids in the white/black lists.
dev.flashcache.pid_expiry_secs:
Set the expiry on the pid white/black lists.
dev.flashcache.max_pids:
Maximum number of pids in the white/black lists.
Using dmsetup to create and load flashcache volumes :
===================================================
Few users will need to use dmsetup natively to create and load
flashcache volumes. This section covers that.
dmsetup create device_name table_file
where
device_name: name of the flashcache device being created or loaded.
table_file : other cache args (format below). If this is omitted, dmsetup
attempts to read this from stdin.
table_file format :
0 <disk dev sz in sectors> flashcache <disk dev> <ssd dev> <flashcache cmd> <blksize in sectors> [size of cache in sectors] [cache set size]
flashcache cmd:
1: load existing cache
2: create cache
3: force create cache (overwriting existing cache). USE WITH CAUTION
blksize in sectors:
4KB (8 sectors, PAGE_SIZE) is the right choice for most applications.
See note on block size selection below.
Unused (can be omitted) for cache loads.
size of cache in sectors:
Optional. if size is not specified, the entire ssd device is used as
cache. Needs to be a power of 2.
Unused (can be omitted) for cache loads.
cache set size:
Optional. The default set size is 512, which works well for most
applications. Little reason to change this. Needs to be a
power of 2.
Unused (can be omitted) for cache loads.
Example :
echo 0 `blockdev --getsize /dev/cciss/c0d1p2` flashcache /dev/cciss/c0d1p2 /dev/fioa2 2 8 522000000 | dmsetup create cachedev
This creates a cache device called "cachedev" (/dev/mapper/cachedev)
with a 4KB blocksize to cache /dev/cciss/c0d1p2 on /dev/fioa2.
The size of the cache is 522000000 sectors.
(TODO : Change loading of the cache happen via "dmsetup load" instead
of "dmsetup create").
Caching Controls
================
Flashcache can be put in one of 2 modes - Cache Everything or
Cache Nothing (dev.flashcache.cache_all). The defaults is to "cache
everything".
These 2 modes have a blacklist and a whitelist.
The tgid (thread group id) for a group of pthreads can be used as a
shorthand to tag all threads in an application. The tgid for a pthread
is returned by getpid() and the pid of the individual thread is
returned by gettid().
The algorithm works as follows :
In "cache everything" mode,
1) If the pid of the process issuing the IO is in the blacklist, do
not cache the IO. ELSE,
2) If the tgid is in the blacklist, don't cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
whitelist, which makes the IO cacheable).
Conversely, in "cache nothing" mode,
1) If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2) If the tgid is in the whitelist, cache this IO. UNLESS
3) The particular pid is marked as an exception (and entered in the
blacklist, which makes the IO non-cacheable).
Examples :
--------
1) You can make the global cache setting "cache nothing", and add the
tgid of your pthreaded application to the whitelist. Which makes only
IOs issued by your application cacheable by Flashcache.
2) You can make the global cache setting "cache everything" and add
tgids (or pids) of other applications that may issue IOs on this
volume to the blacklist, which will make those un-interesting IOs not
cacheable.
Note that this only works for O_DIRECT IOs. For buffered IOs, pdflush,
kswapd would also do the writes, with flashcache caching those.
The following cacheability ioctls are supported on /dev/mapper/<cachedev>
FLASHCACHEADDBLACKLIST: add the pid (or tgid) to the blacklist.
FLASHCACHEDELBLACKLIST: Remove the pid (or tgid) from the blacklist.
FLASHCACHEDELALLBLACKLIST: Clear the blacklist. This can be used to
cleanup if a process dies.
FLASHCACHEADDWHITELIST: add the pid (or tgid) to the whitelist.
FLASHCACHEDELWHITELIST: Remove the pid (or tgid) from the whitelist.
FLASHCACHEDELALLWHITELIST: Clear the whitelist. This can be used to
cleanup if a process dies.
/proc/flashcache_pidlists shows the list of pids on the whitelist
and the blacklist.
Security Note :
=============
With Flashcache, it is possible for a malicious user process to
corrupt data in files with only read access. In a future revision
of flashcache, this will be addressed (with an extra data copy).
Not documenting the mechanics of how a malicious process could
corrupt data here.
You can work around this by setting file permissions on files in
the flashcache volume appropriately.
Tuning XFS for better flashcache performance :
============================================
If you run XFS/Flashcache, it is worth tuning XFS' allocation group
parameters (agsize/agcount) to achieve better flashcache performance.
XFS allocates blocks for files in a given directory in a new
allocation group. By tuning agsize and agcount (mkfs.xfs parameters),
we can achieve much better distribution of blocks across
flashcache. Better distribution of blocks across flashcache will
increase collisions on flashcache sets considerably, increase cache
hit rates significantly and result in lower IO latencies.
We can achieve this by computing agsize (and implicitly agcount) using
these equations,
C = Cache size,
V = Size of filesystem Volume.
agsize % C = (1/agcount)*C
agsize * agcount ~= V
where agsize <= 1000g (XFS limits on agsize).
A couple of examples that illustrate the formula,
For agcount = 4, let's divide up the cache into 4 equal parts (each
part is size C/agcount). Let's call the parts C1, C2, C3, C4. One
ideal way to map the allocation groups onto the cache is as follows.
Ag1 Ag2 Ag3 Ag4
-- -- -- --
C1 C2 C3 C4 (stripe 1)
C2 C3 C4 C1 (stripe 2)
C3 C4 C1 C2 (stripe 3)
C4 C1 C2 C3 (stripe 4)
C1 C2 C3 C4 (stripe 5)
In this simple example, note that each "stripe" has 2 properties
1) Each element of the stripe is a unique part of the cache.
2) The union of all the parts for a stripe gives us the entire cache.
Clearly, this is an ideal mapping, from a distribution across the
cache point of view.
Another example, this time with agcount = 5, the cache is divided into
5 equal parts C1, .. C5.
Ag1 Ag2 Ag3 Ag4 Ag5
-- -- -- -- --
C1 C2 C3 C4 C5 (stripe 1)
C2 C3 C4 C5 C1 (stripe 2)
C3 C4 C5 C1 C2 (stripe 3)
C4 C5 C1 C2 C3 (stripe 4)
C5 C1 C2 C3 C4 (stripe 5)
C1 C2 C3 C4 C5 (stripe 6)
A couple of examples that compute the optimal agsize for a given
Cachesize and Filesystem volume size.
a) C = 600g, V = 3,5TB
Consider agcount = 5
agsize % 600 = (1/5)*600
agsize % 600 = 120
So an agsize of 720g would work well, and 720*5 = 3.6TB (~ 3.5TB)
b) C = 150g, V = 3.5TB
Consider agcount=4
agsize % 150 = (1/5)*150
agsize % 150 = 37.5
So an agsize of 937g would work well, and 937*4 = 3.7TB (~ 3.5TB)
As an alternative,
agsize % C = (1 - (1/agcount))*C
agsize * agcount ~= V
Works just as well as the formula above.