Flashcache : A Write Back Block Cache for Linux
Author: Mohan Srinivasan
-----------------------------------------------

Introduction :
============
Flashcache is a write back block cache Linux kernel module. This
document describes the design, future ideas, configuration and tuning
of flashcache, and concludes with a note covering the testability
hooks within flashcache and the testing that we did. Flashcache was
built primarily as a block cache for InnoDB, but is general purpose
and can be used by other applications as well.

Design :
======
Flashcache is built using the Linux Device Mapper (DM), part of the
Linux Storage Stack infrastructure that facilitates building SW-RAID
and other components. LVM, for example, is built using the DM.

The cache is structured as a set associative hash, where the cache is
divided up into a number of fixed size sets (buckets), with linear
probing within a set to find blocks. The set associative hash has a
number of advantages (called out in sections below) and works very
well in practice.

The block size, set size and cache size are configurable parameters,
specified at cache creation. The default set size is 512 (blocks) and
there is little reason to change this.

In what follows, dbn refers to "disk block number", the logical
device block number in sectors.

To compute the target set for a given dbn :

target set = (dbn / block size / set size) mod (number of sets)

Once we have the target set, a linear probe within the set finds the
block. Note that a sequential range of disk blocks will all map onto
a given set.

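To make the mapping concrete, here is a minimal user-space sketch of
the set computation and the linear probe (all identifiers are
illustrative, not the actual flashcache names; dbn and block size are
in sectors) :

#include <stdint.h>

#define VALID   0x01
#define DIRTY   0x02

struct cacheblock {
        uint64_t dbn;     /* disk block (sector number) cached in this slot */
        uint32_t flags;   /* VALID / DIRTY / INVALID */
};

static unsigned int
target_set(uint64_t dbn, unsigned int block_size,
           unsigned int set_size, unsigned int num_sets)
{
        /* dbn and block_size are in sectors; set_size is blocks per set */
        return (unsigned int)((dbn / block_size / set_size) % num_sets);
}

/* Linear probe: scan the target set for a valid block caching dbn. */
static int
cache_lookup(struct cacheblock *cache, uint64_t dbn,
             unsigned int block_size, unsigned int set_size,
             unsigned int num_sets)
{
        unsigned int set = target_set(dbn, block_size, set_size, num_sets);
        unsigned int i, start = set * set_size;

        for (i = start; i < start + set_size; i++)
                if ((cache[i].flags & VALID) && cache[i].dbn == dbn)
                        return (int)i;
        return -1;        /* miss */
}
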
The DM layer breaks up all IOs into blocksize chunks before passing
the IOs down to the cache layer. By default, flashcache caches all
full blocksize IOs, but can be configured to only cache random IO
whilst ignoring sequential IO.

Replacement policy is either FIFO or LRU within a cache set. The
default is FIFO, but the policy can be switched at any point at run
time via a sysctl (see the configuration and tuning section).

To handle a cache read, we compute the target set (from the dbn) and
do a linear search for the dbn in the set. In the case of a cache
hit, the read is serviced from flash. For a cache miss, the data is
read from disk, populated into flash and then returned.

Since the cache is writeback, a write only writes to flash,
synchronously updates the cache metadata (to mark the cache block as
dirty) and completes the write. On a block re-dirty, the metadata
update is skipped.

It is important to note that in the first cut, cache writes are
non-atomic, i.e., the "Torn Page Problem" exists. In the event of a
power failure or a failed write, part of the block could be written,
resulting in a partial write. We have ideas on how to fix this and
provide atomic cache writes (see the Futures section).

Each cache block has on-flash metadata associated with it for cache
persistence. This per-block metadata consists of the dbn (disk block
cached in this slot) and flags (DIRTY, VALID, INVALID).

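One plausible layout for this per-block record, sized to match the 16
bytes quoted in the metadata overhead discussion below (the actual
struct is defined in the flashcache sources) :

#include <stdint.h>

struct cacheblock_md {
        uint64_t dbn;     /* disk block number cached in this slot */
        uint32_t flags;   /* DIRTY / VALID / INVALID */
        uint32_t pad;     /* reserved; keeps the record at 16 bytes */
};
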
Cache metadata is only updated on a write or when a cache block is
cleaned. The former results in the state being marked DIRTY and the
latter results in the state being marked ~DIRTY. To minimize small
flash writes, cache block metadata is not updated in the read path.

In addition, we also have an on-flash cache superblock, which contains
cache parameters (read on a cache reload) and whether the cache
shutdown was clean (orderly) or unclean (node crash, power failure
etc).

On a clean cache shutdown, metadata for all cache blocks is written
out to flash. After an orderly shutdown, both VALID and DIRTY blocks
will persist on a subsequent cache reload. After a node crash or a
power failure, only DIRTY cache blocks will persist on a subsequent
cache reload. Node crashes or power failures will not result in data
loss, but they will result in the cache losing VALID and non-DIRTY
cached blocks.

Cache metadata updates are "batched" when possible. So if we have
pending metadata updates to multiple cache blocks which fall on the
same metadata sector, we batch these updates into 1 flash metadata
write. When a file is written sequentially, we will commonly be able
to batch several metadata updates (resulting from sequential block
writes) into 1 cache metadata update.

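Assuming the 16-byte record above and 512-byte metadata sectors (both
assumptions), one sector holds 32 records and the batching test
reduces to an index comparison; a sketch :

#define MD_SECTOR_SIZE  512
#define MD_PER_SECTOR   (MD_SECTOR_SIZE / sizeof(struct cacheblock_md))

/* Two pending updates can share one flash metadata write iff their
 * cache block indexes land in the same metadata sector. */
static int
same_md_sector(unsigned int block_ix_a, unsigned int block_ix_b)
{
        return (block_ix_a / MD_PER_SECTOR) == (block_ix_b / MD_PER_SECTOR);
}
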
Dirty cache blocks are written lazily to disk in the background.
Flashcache's lazy writing is controlled by a configurable dirty
threshold (see the configuration and tuning section). Flashcache
strives to keep the percentage of dirty blocks in each set below the
dirty threshold. When the dirty blocks in a set exceed the dirty
threshold, the set is eligible for cleaning.

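The per-set eligibility test amounts to a percentage check; a sketch
with illustrative names :

/* A set becomes eligible for cleaning when its dirty fraction
 * exceeds the configured threshold (a percentage). */
static int
set_needs_cleaning(unsigned int nr_dirty, unsigned int set_size,
                   unsigned int dirty_thresh_pct)
{
        return (nr_dirty * 100 / set_size) > dirty_thresh_pct;
}
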
Dirty blocks are also cleaned based on "idleness". By default, a
dirty block not read or written for 15 minutes (dev.flashcache.fallow_delay)
will be cleaned. To disable idle cleaning, set that value to 0.
A two-handed clock-like algorithm is used to pick off fallow dirty
blocks to clean.

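The idea behind the two hands, in a sketch (my illustration of a
generic two-handed clock, not the actual flashcache code; it reuses
struct cacheblock and DIRTY from the lookup sketch above, and
FALLOW_MARK and clean_block() are stand-ins): the leading hand marks
dirty blocks as fallow candidates, and the trailing hand, reaching
the same slot later, cleans any block whose mark survived, since a
read or write in between would have cleared it.

#define FALLOW_MARK     0x04

extern void clean_block(struct cacheblock *b); /* write back, clear DIRTY */

struct clock_hands {
        unsigned int lead;      /* marks idle dirty blocks */
        unsigned int trail;     /* cleans blocks still marked */
};

/* One tick of a two-handed clock over a single cache set. */
static void
fallow_tick(struct cacheblock *set_blocks, unsigned int set_size,
            struct clock_hands *h)
{
        struct cacheblock *b;

        b = &set_blocks[h->lead];
        if (b->flags & DIRTY)
                b->flags |= FALLOW_MARK;        /* candidate: idle so far */
        h->lead = (h->lead + 1) % set_size;

        b = &set_blocks[h->trail];
        if ((b->flags & (DIRTY | FALLOW_MARK)) == (DIRTY | FALLOW_MARK))
                clean_block(b);                 /* untouched since marking */
        h->trail = (h->trail + 1) % set_size;
}
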
DIRTY blocks are selected for cleaning based on the replacement policy
(FIFO vs LRU). Once we have a target set of blocks to clean, we sort
these blocks, search for other contiguous dirty blocks in the set
(which can be cleaned for free since they'll be merged into a large
IO) and send the writes down to the disk.

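The sort-and-merge step in sketch form (qsort() and issue_writeback()
are stand-ins for whatever the kernel code actually uses; struct
cacheblock is from the lookup sketch above) :

#include <stdlib.h>

extern void issue_writeback(struct cacheblock *run, size_t count);

static int
cmp_dbn(const void *a, const void *b)
{
        const struct cacheblock *x = a, *y = b;

        return (x->dbn > y->dbn) - (x->dbn < y->dbn);
}

/* Sort the chosen victims by dbn, then emit one merged disk write
 * per run of dbn-contiguous blocks. */
static void
writeback_merged(struct cacheblock *v, size_t n, unsigned int block_size)
{
        size_t i, run_start = 0;

        qsort(v, n, sizeof(v[0]), cmp_dbn);
        for (i = 1; i <= n; i++) {
                if (i == n || v[i].dbn != v[i - 1].dbn + block_size) {
                        issue_writeback(&v[run_start], i - run_start);
                        run_start = i;
                }
        }
}
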
As mentioned earlier, the DM will break IOs into blocksize pieces
before passing them on to flashcache. For smaller (than blocksize)
IOs, or IOs that straddle 2 cache blocks, we pass the IO directly to
disk. But before doing so, we invalidate any cacheblocks that overlap
the IO. If the overlapping cacheblocks are DIRTY, we clean those
cacheblocks first and pass the new overlapping IO to disk after they
are successfully cleaned. Invalidating cacheblocks for IOs that
overlap 2 cache blocks is easy with a set associative hash; we need
to search for overlaps in precisely 2 cache sets.

Flashcache has support for block checksums, which are computed on
cache population and validated on every cache read. Block checksums
are a compile switch, turned off by default because of the "Torn
Page" problem. If a cache write fails after part of the block was
committed to flash, the block checksum will be wrong and any
subsequent attempt to read that block will fail (because of checksum
mismatches).

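The checksum flow in sketch form (checksum64() is a stand-in, not the
actual flashcache checksum function) :

#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum; anything cheap and deterministic works here. */
static uint64_t
checksum64(const unsigned char *data, size_t len)
{
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < len; i++)
                sum = sum * 131 + data[i];
        return sum;
}

/* On cache population, store checksum64(buf, len) in the block
 * metadata. On every cache read, verify before returning the data;
 * a return of 0 here fails the read. */
static int
block_checksum_ok(const unsigned char *buf, size_t len, uint64_t stored)
{
        return checksum64(buf, len) == stored;
}
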
How much cache metadata overhead do we incur? For each cache block,
we have in-memory state of 24 bytes (on 64 bit architectures) and 16
bytes of on-flash metadata state. For a 300GB cache with 16KB blocks,
we have approximately 20 million cacheblocks, resulting in an
in-memory metadata footprint of 480MB. If we were to configure a
300GB cache with 4KB pages, that would quadruple to 1.8GB.

It is possible to mark IOs issued by particular pids as non-cacheable
via flashcache ioctls. If a process is about to scan a large table
sequentially (for a backup, say), it can mark itself as
non-cacheable. For a read issued by a "non cacheable" process, if the
read results in a cache hit, the data is served from cache. If the
read results in a cache miss, the read is served directly from disk
(without a cache population). For a write issued by a non cacheable
process, the write is sent directly to disk. But before that is done,
we invalidate any overlapping cache blocks (cleaning them first if
necessary).

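In outline, a backup process might bracket its scan like this (the
ioctl names are taken from flashcache_ioctl.h; treat the exact names,
argument conventions and device path as assumptions to verify against
your build) :

#include <sys/ioctl.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include "flashcache_ioctl.h"   /* FLASHCACHEADDNCPID et al. */

int
main(void)
{
        pid_t pid = getpid();
        int fd = open("/dev/mapper/cachedev", O_RDONLY); /* example path */

        if (fd < 0)
                return 1;
        ioctl(fd, FLASHCACHEADDNCPID, &pid);  /* stop caching our IOs */
        /* ... sequential scan / backup runs here ... */
        ioctl(fd, FLASHCACHEDELNCPID, &pid);  /* restore normal caching */
        close(fd);
        return 0;
}

The explicit delete ioctl matters, because flashcache cannot clean up
by itself if the process dies (see below).
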
A few things to note about tagging pids non-cacheable. First, this
only really works reliably with Direct IO. For buffered IO, writes
will almost always happen from kernel threads (e.g. pdflush), so
writes will continue to be cached. For most filesystems, these ioctls
will make buffered reads uncached, since readaheads are kicked off
from the filemap code and therefore from the same context as the
reads.

If a process that marked itself non-cacheable dies, flashcache has no
way of cleaning up (the Linux kernel doesn't have an at_exit() hook).
Applications have to work around this (see configuration below). The
cleanup issue can be fixed by making the cache control aspect of
flashcache a pseudo-filesystem, so that the last close of the fd on
process exit cleans things up (see Futures for more details).

In spite of the limitations, we think the ability to mark Direct IOs
issued by a pid as non-cacheable will be valuable in preventing
backups from wiping out the cache.

Alternatively, rather than specifically marking pids as
non-cacheable, users may wish to experiment with the sysctl
'skip_seq_thresh_kb', which disables caching of IO determined to be
sequential, above a configurable threshold of consecutive reads or
writes. The algorithm to spot sequential IO has some ability to
handle multiple 'flows' of IO, so it should, for example, be able to
skip caching of IOs from two flows of sequential reads or writes,
while still caching IOs from a third random IO flow. Note that
multiple small files may be written to consecutive blocks. If these
are written out in a batch (e.g. by an untar), this may appear as a
single sequential write, and hence these multiple small files will
not be cached. The categorization of IO as sequential or random
occurs purely at the block level, not the file level.

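A sketch of per-flow sequential detection of this general shape (my
illustration only; the real heuristic lives in the flashcache
source): track a handful of recent flows by the dbn at which each
would continue, grow a flow's run length when an IO continues it, and
bypass the cache once the run crosses the threshold.

#include <stdint.h>

#define NFLOWS          8   /* how many concurrent flows we track */
#define SECTORS_PER_KB  2   /* 512-byte sectors */

struct seq_flow {
        uint64_t next_dbn;      /* where this flow would continue */
        unsigned int run_kb;    /* consecutive KB seen so far */
};

static struct seq_flow flows[NFLOWS];

/* Returns nonzero if this IO should bypass the cache. */
static int
io_is_sequential(uint64_t dbn, unsigned int size_kb, unsigned int thresh_kb)
{
        int i;

        for (i = 0; i < NFLOWS; i++) {
                if (flows[i].next_dbn == dbn) { /* continues flow i */
                        flows[i].run_kb += size_kb;
                        flows[i].next_dbn = dbn + size_kb * SECTORS_PER_KB;
                        return flows[i].run_kb >= thresh_kb;
                }
        }
        /* No match: start tracking a new flow (crudely evict slot 0). */
        flows[0].next_dbn = dbn + size_kb * SECTORS_PER_KB;
        flows[0].run_kb = size_kb;
        return 0;
}
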
(For a more detailed discussion about caching controls, see the SA Guide.)

Futures and Features :
====================
Cache Mirroring :
---------------
Mirroring the cache across 2 physical flash devices should work
without any code changes. Since the cache device is a block device,
we can build a RAID-1 block device out of the 2 physical flash
devices and use that as our cache device. (I have not yet tested
this.)

Cache Resizing :
--------------
The easiest way to resize the cache is to bring the cache offline,
and then resize. Resizing the cache when active is complicated and
bug prone.

Integration with ATA TRIM Command :
---------------------------------
The ATA TRIM command was introduced as a way for the filesystem to
inform the SSD that certain blocks were no longer in use, to
facilitate improved wear levelling algorithms in the SSD controller.
Flashcache can leverage this as well. We can simply discard all
blocks falling within a TRIM block range from the cache regardless of
their state, since they are no longer needed.

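In sketch form (cache_lookup_dbn() and cache_invalidate() are
stand-ins for the set probe and invalidation described earlier) :

#include <stdint.h>

extern int  cache_lookup_dbn(uint64_t dbn); /* probe target set; -1 if absent */
extern void cache_invalidate(int index);    /* mark the slot INVALID */

/* Walk the trimmed dbn range one cache block at a time and drop any
 * cached copy, DIRTY or not; the data is dead, so no writeback. */
static void
trim_range(uint64_t start_dbn, uint64_t end_dbn, unsigned int block_size)
{
        uint64_t dbn;

        for (dbn = start_dbn; dbn < end_dbn; dbn += block_size) {
                int ix = cache_lookup_dbn(dbn);

                if (ix >= 0)
                        cache_invalidate(ix);
        }
}
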
Deeper integration with filesystems :
-----------------------------------
Non-cacheability could be much better implemented with a deeper
integration of flashcache and the filesystem. The filesystem could
easily tag IOs as non-cacheable, based on user actions.

Fixing the "Torn Page Problem" (make Cache Writes atomic) :
---------------------------------------------------------
As mentioned above, cache block writes are non-atomic. If we have a
power failure or if the flash write fails, part of the block (a few
sectors) could be written out, corrupting the block. In this respect,
flashcache behaves no differently from disk.

We have ideas on how to fix this and achieve atomic cache block
writes using shadow paging techniques. Mark C. says that we could
avoid doublebuffer writes if we have atomic cache block writes.
However, with flashcache, doublebuffer writes will all be absorbed by
the flash (and we should get excellent write hits/overwrites for
doublebuffer blocks so they would never hit disk). So it is not clear
how much of a win atomic cache block writes will be.

It is, however, a handy feature to provide. If we have atomic block
writes, we could also enable cache block checksums.

There are broadly 3 ways to fix this.
1) If the flash device offers configurable sector sizes, configure it
to match the cache block size (FusionIO offers up to a 4KB
configurable sector size).
2) If we choose a shadow page that falls in the same metadata sector
as the page being overwritten, we can do the shadow page write and
switch the metadata atomically.
3) If we don't want the restriction that the shadow page and the
overwritten page share a metadata sector (allowing us to pick a
shadow page more freely across the cache set), we would need to
introduce a monotonically increasing timestamp per write in the cache
metadata, which would allow us to disambiguate dirty blocks in the
event of a crash.

Breaking up the cache spinlock :
------------------------------
All cache state is protected by a single spinlock. Currently CPU
utilization in the cache routines is very low, and there is no
contention on this spinlock. That may change in the future.

Make non-cacheability more robust :
---------------------------------
The non-cacheability support needs fixing in terms of cleanup when a
process dies. Probably the best way is to handle this in a
pseudo-filesystem-like way.

Several other implementation TODOs/Futures are documented in the code.

Testing and Testability :
=======================
Stress Tester :
-------------
I modified NetApp's open source sio load generator, adding support
for data verification with block checksums maintained in an mmap'ed
file. I've been stress testing the cache with this tool. We can vary
the read/write mix, seq/rand IO mix, block size, direct IO vs
buffered IO, number of IO threads etc with this tool.

In addition, I've used other workloads to stress test flashcache.

Error Injection :
---------------
I've added hooks for injecting all kinds of errors into the
flashcache code (flash IO errors, disk IO errors, various kernel
memory allocation errors). The error injection can be controlled by a
sysctl "error_inject". Writing one of the following flags into
"error_inject" causes the next event of that type to result in an
error. The flag is cleared after the error is simulated, so we need
to set the flag for each error we'd like to simulate.

/* Error injection flags */
#define READDISK_ERROR                          0x00000001
#define READCACHE_ERROR                         0x00000002
#define READFILL_ERROR                          0x00000004
#define WRITECACHE_ERROR                        0x00000008
#define WRITECACHE_MD_ERROR                     0x00000010
#define WRITEDISK_MD_ERROR                      0x00000020
#define KCOPYD_CALLBACK_ERROR                   0x00000040
#define DIRTY_WRITEBACK_JOB_ALLOC_FAIL          0x00000080
#define READ_MISS_JOB_ALLOC_FAIL                0x00000100
#define READ_HIT_JOB_ALLOC_FAIL                 0x00000200
#define READ_HIT_PENDING_JOB_ALLOC_FAIL         0x00000400
#define INVAL_PENDING_JOB_ALLOC_FAIL            0x00000800
#define WRITE_HIT_JOB_ALLOC_FAIL                0x00001000
#define WRITE_HIT_PENDING_JOB_ALLOC_FAIL        0x00002000
#define WRITE_MISS_JOB_ALLOC_FAIL               0x00004000
#define WRITES_LIST_ALLOC_FAIL                  0x00008000
#define MD_ALLOC_SECTOR_ERROR                   0x00010000

I then use a script like this to simulate errors under heavy IO load.

#!/bin/bash

# Arm each error injection flag in turn (one bit per iteration,
# matching the #defines above) while an IO load runs in parallel.
# The driver clears the flag once the error has been simulated.
for ((debug = 0x00000001 ; debug <= 0x00010000 ; debug = debug * 2))
do
        echo $debug >/proc/sys/dev/flashcache/error_inject
        sleep 1
done

Acknowledgements :
================
I would like to thank Bob English for doing a critical review of the
design and the code of flashcache, for discussing this in detail with
me and providing valuable suggestions.

The option to detect and skip sequential IO was added by Will Smith.