Flashcache : A Write Back Block Cache for Linux
Author: Mohan Srinivasan
-----------------------------------------------

Introduction :
============
Flashcache is a write back block cache Linux kernel module. This
document describes the design, future ideas, configuration and tuning
of flashcache, and concludes with a note covering the testability
hooks within flashcache and the testing that we did. Flashcache was
built primarily as a block cache for InnoDB, but it is general purpose
and can be used by other applications as well.

Design :
======
Flashcache is built using the Linux Device Mapper (DM), part of the
Linux Storage Stack infrastructure that facilitates building SW-RAID
and other components. LVM, for example, is built using the DM.

The cache is structured as a set associative hash, where the cache is
divided up into a number of fixed size sets (buckets) with linear
probing within a set to find blocks. The set associative hash has a
number of advantages (called out in sections below) and works very
well in practice.

The block size, set size and cache size are configurable parameters,
specified at cache creation. The default set size is 512 (blocks) and
there is little reason to change this.

In what follows, dbn refers to "disk block number", the logical
device block number in sectors.

To compute the target set for a given dbn

target set = (dbn / block size / set size) mod (number of sets)

Once we have the target set, a linear probe within the set finds the
block. Note that a sequential range of disk blocks will all map onto a
given set.

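As a concrete illustration, here is a minimal sketch in C of the set
selection and linear probe described above. The names (struct
cacheblock, target_set, find_block) are hypothetical and only follow
the description in this document, not flashcache's actual code.

/* Hypothetical sketch of set selection and linear probing. block_size
 * and set_size are the parameters chosen at cache creation; dbn is in
 * sectors, as defined above. */
#include <stdint.h>

#define VALID 0x1

struct cacheblock {
	uint64_t dbn;     /* disk block number cached in this slot */
	int      state;   /* VALID / DIRTY / INVALID flags */
};

static unsigned int
target_set(uint64_t dbn, unsigned int block_size,
	   unsigned int set_size, unsigned int num_sets)
{
	/* target set = (dbn / block size / set size) mod (number of sets) */
	return (unsigned int)((dbn / block_size / set_size) % num_sets);
}

/* Linear probe within one set; returns the slot index on a hit, -1 on
 * a miss. */
static int
find_block(struct cacheblock *cache, unsigned int set,
	   unsigned int set_size, uint64_t dbn)
{
	unsigned int start = set * set_size, i;

	for (i = start; i < start + set_size; i++)
		if ((cache[i].state & VALID) && cache[i].dbn == dbn)
			return (int)i;
	return -1;
}
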
The DM layer breaks up all IOs into blocksize chunks before passing
the IOs down to the cache layer. Flashcache caches all full blocksize
IOs.

The replacement policy is either FIFO or LRU within a cache set. The
default is FIFO, but the policy can be switched at any point at run
time via a sysctl (see the configuration and tuning section).

To handle a cache read, we compute the target set (from the dbn) and
linearly search for the dbn in the set. On a cache hit, the read is
serviced from flash. On a cache miss, the data is read from disk,
populated into flash, and then returned to the reader.

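A rough sketch of that read path follows, reusing the hypothetical
target_set()/find_block() helpers from the sketch above. The
read_from_ssd(), read_from_disk() and populate_flash() functions are
illustrative placeholders, not flashcache functions.

/* Placeholders for the actual flash and disk IO paths. */
int read_from_ssd(int slot, void *buf);
int read_from_disk(uint64_t dbn, void *buf);
void populate_flash(struct cacheblock *cache, unsigned int set,
		    uint64_t dbn, void *buf);

/* Illustrative read path: hit -> serve from flash; miss -> read disk,
 * then populate a slot in the target set. */
static int
cache_read(struct cacheblock *cache, uint64_t dbn, void *buf,
	   unsigned int block_size, unsigned int set_size,
	   unsigned int num_sets)
{
	unsigned int set = target_set(dbn, block_size, set_size, num_sets);
	int idx = find_block(cache, set, set_size, dbn);

	if (idx >= 0)
		return read_from_ssd(idx, buf);    /* cache hit */

	if (read_from_disk(dbn, buf) != 0)         /* cache miss */
		return -1;
	populate_flash(cache, set, dbn, buf);
	return 0;
}
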
Since the cache is write back, a write only writes to flash,
synchronously updates the cache metadata (to mark the cache block as
dirty) and completes the write. When a block is re-dirtied, the
metadata update is skipped.

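The write path can be sketched the same way. Again, write_to_ssd() and
write_metadata_sync() are placeholders; only the ordering (data write,
then a synchronous metadata update marking the block DIRTY, skipped if
the block is already DIRTY) reflects the description above.

#define DIRTY 0x2

/* Placeholders for the data write and the synchronous on-flash
 * metadata update. */
int write_to_ssd(int slot, const void *buf);
int write_metadata_sync(int slot);

static int
cache_write(struct cacheblock *cache, int slot, const void *buf)
{
	if (write_to_ssd(slot, buf) != 0)
		return -1;
	if (!(cache[slot].state & DIRTY)) {
		if (write_metadata_sync(slot) != 0)
			return -1;
		cache[slot].state |= DIRTY;
	}
	/* On a re-dirty the block is already DIRTY, so the metadata
	 * update above is skipped. */
	return 0;
}
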
It is important to note that in the first cut, cache writes are
non-atomic, i.e., the "Torn Page Problem" exists. In the event of a
power failure or a failed write, part of the block could be written,
resulting in a partial write. We have ideas on how to fix this and
provide atomic cache writes (see the Futures section).

Each cache block has on-flash metadata associated with it for cache
persistence. This per-block metadata consists of the dbn (disk block
cached in this slot) and flags (DIRTY, VALID, INVALID).

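A sketch of what such a per-block record could look like. The field
names and exact layout are illustrative, not flashcache's actual
on-flash format; the pad only brings the record to the 16 bytes of
on-flash state mentioned later in this document.

#include <stdint.h>

struct flash_block_md {
	uint64_t dbn;     /* disk block number cached in this slot */
	uint32_t flags;   /* DIRTY / VALID / INVALID */
	uint32_t pad;     /* pad the record to 16 bytes */
};
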
Cache metadata is only updated on a write or when a cache block is
cleaned. The former results in the state being marked DIRTY and the
latter results in the state being marked ~DIRTY. To minimize small
flash writes, cache block metadata is not updated in the read path.

In addition, we also have an on-flash cache superblock, which contains
cache parameters (read on a cache reload) and whether the cache
shutdown was clean (orderly) or unclean (node crash, power failure,
etc).

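Similarly, a hypothetical superblock layout capturing what the
paragraph above says it must hold (the creation-time cache parameters
plus a clean/unclean shutdown marker); names and field sizes are
assumptions for the sketch.

#include <stdint.h>

struct flash_superblock {
	uint32_t block_size;      /* cache block size, in sectors */
	uint32_t set_size;        /* cache blocks per set */
	uint64_t cache_size;      /* total number of cache blocks */
	uint32_t clean_shutdown;  /* 1 = orderly shutdown,
				     0 = crash or power failure */
};
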
On a clean cache shutdown, metadata for all cache blocks is written
out to flash. After an orderly shutdown, both VALID and DIRTY blocks
will persist on a subsequent cache reload. After a node crash or a
power failure, only DIRTY cache blocks will persist on a subsequent
cache reload. Node crashes or power failures will not result in data
loss, but they will result in the cache losing VALID and non-DIRTY
cached blocks.

Cache metadata updates are "batched" when possible. So if we have
pending metadata updates to multiple cache blocks which fall on the
same metadata sector, we batch these updates into 1 flash metadata
write. When a file is written sequentially, we will commonly be able
to batch several metadata updates (resulting from sequential block
writes) into 1 cache metadata update.

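The batching rule can be sketched as follows, assuming 512-byte
metadata sectors and the hypothetical 16-byte per-block record above
(so 32 records per metadata sector); the constant and helper names are
illustrative.

#define MD_RECORDS_PER_SECTOR 32   /* 512 / 16-byte records */

static unsigned int md_sector_of(unsigned int cache_block_index)
{
	return cache_block_index / MD_RECORDS_PER_SECTOR;
}

/* Two pending updates can share a single flash metadata write iff
 * their cache blocks map to the same metadata sector. */
static int can_batch(unsigned int idx_a, unsigned int idx_b)
{
	return md_sector_of(idx_a) == md_sector_of(idx_b);
}
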
Dirty cache blocks are written lazily to disk in the background.
Flashcache's lazy writing is controlled by a configurable dirty
threshold (see the configuration and tuning section). Flashcache
strives to keep the percentage of dirty blocks in each set below the
dirty threshold. When the dirty blocks in a set exceed the dirty
threshold, the set is eligible for cleaning.

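A minimal sketch of the eligibility check, assuming the dirty
threshold is expressed as a percentage of the set (as the paragraph
above suggests); the names are illustrative.

static int
set_needs_cleaning(unsigned int dirty_blocks, unsigned int set_size,
		   unsigned int dirty_thresh_pct)
{
	/* The set becomes a cleaning candidate once its dirty
	 * percentage exceeds the configured threshold. */
	return (dirty_blocks * 100 / set_size) > dirty_thresh_pct;
}
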
Dirty blocks are also cleaned based on "idleness". By default, a
dirty block not read or written for 60 seconds
(dev.flashcache.fallow_delay) will be cleaned. To disable idle
cleaning, set that value to 0.

DIRTY blocks are selected for cleaning based on the replacement policy
(FIFO vs LRU). Once we have a target set of blocks to clean, we sort
these blocks, search for other contiguous dirty blocks in the set
(which can be cleaned for free since they'll be merged into a large
IO) and send the writes down to the disk.

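A simplified sketch of that sort-and-merge step, once the candidate
dirty blocks of a set have been picked. issue_writeback() is a
placeholder for sending one contiguous write down to disk, and
block_size is the cache block size in sectors.

#include <stdint.h>
#include <stdlib.h>

void issue_writeback(uint64_t start_dbn, unsigned int nblocks);

static int cmp_dbn(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return (x > y) - (x < y);
}

/* Sort the candidate dbns, coalesce runs of adjacent cache blocks and
 * issue one writeback IO per run. Returns the number of IOs issued. */
static unsigned int
clean_blocks(uint64_t *dbns, unsigned int count, unsigned int block_size)
{
	unsigned int i = 0, ios = 0;

	qsort(dbns, count, sizeof(uint64_t), cmp_dbn);
	while (i < count) {
		unsigned int run = 1;

		while (i + run < count &&
		       dbns[i + run] == dbns[i] + (uint64_t)run * block_size)
			run++;
		issue_writeback(dbns[i], run);
		ios++;
		i += run;
	}
	return ios;
}
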
As mentioned earlier, the DM will break IOs into blocksize pieces
before passing them on to flashcache. For smaller (than blocksize) IOs
or IOs that straddle 2 cache blocks, we pass the IO directly to disk.
But before doing so, we invalidate any cacheblocks that overlap the
IO. If the overlapping cacheblocks are DIRTY, we clean those
cacheblocks and pass the new overlapping IO to disk after those are
successfully cleaned. Invalidating cacheblocks for IOs that overlap 2
cache blocks is easy with a set associative hash: we only need to
search for overlaps in precisely 2 cache sets.

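The "precisely 2 cache sets" observation follows directly from the set
mapping; a minimal sketch, reusing the hypothetical target_set()
helper from earlier (invalidate_set_range() is a placeholder):

void invalidate_set_range(unsigned int set, uint64_t start_dbn,
			  uint64_t nr_sectors);

/* An IO covering [start_dbn, start_dbn + nr_sectors) can only overlap
 * cache blocks in the set of its first sector and the set of its last
 * sector. */
static void
invalidate_overlaps(uint64_t start_dbn, uint64_t nr_sectors,
		    unsigned int block_size, unsigned int set_size,
		    unsigned int num_sets)
{
	unsigned int first = target_set(start_dbn, block_size,
					set_size, num_sets);
	unsigned int last = target_set(start_dbn + nr_sectors - 1,
				       block_size, set_size, num_sets);

	invalidate_set_range(first, start_dbn, nr_sectors);
	if (last != first)
		invalidate_set_range(last, start_dbn, nr_sectors);
}
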
Flashcache has support for block checksums, which are computed on
cache population and validated on every cache read. Block checksums
are a compile-time switch, turned off by default because of the "Torn
Page" problem. If a cache write fails after part of the block was
committed to flash, the block checksum will be wrong and any
subsequent attempt to read that block will fail (because of checksum
mismatches).

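A sketch of how checksums plug into the read path; the checksum
function below is only a stand-in, since this document does not
specify which checksum flashcache actually uses.

#include <stddef.h>
#include <stdint.h>

/* Computed when a block is populated into flash; stored alongside the
 * block's metadata. */
static uint64_t block_checksum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = sum * 131 + p[i];
	return sum;
}

/* Re-checked on every cache read; a mismatch fails the read. */
static int verify_block(const void *data, size_t len, uint64_t stored)
{
	return block_checksum(data, len) == stored ? 0 : -1;
}
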
How much cache metadata overhead do we incur? For each cache block,
we have in-memory state of 24 bytes (on 64 bit architectures) and 16
bytes of on-flash metadata state. For a 300GB cache with 16KB blocks,
we have approximately 20 million cache blocks, resulting in an
in-memory metadata footprint of 480MB. If we were to configure a 300GB
cache with 4KB pages, that would quadruple to roughly 1.8GB.

It is possible to mark IOs issued by particular pids as non-cacheable
via flashcache ioctls. If a process is about to scan a large table
sequentially (for a backup, say), it can mark itself as non-cacheable.
For a read issued by a "non-cacheable" process, if the read results
in a cache hit, the data is served from cache. If the read results in
a cache miss, the read is served directly from disk (without a cache
population). For a write issued by a non-cacheable process, the
write is sent directly to disk. But before that is done, we invalidate
any overlapping cache blocks (cleaning them first if necessary).

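A userspace sketch of what marking the calling process non-cacheable
could look like. The actual ioctl names and request codes live in
flashcache's headers and have changed across versions (see the SA
Guide referenced later), so FLASHCACHE_ADD_NC_PID and the device path
below are placeholders, not the real interface.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#define FLASHCACHE_ADD_NC_PID  _IOW(0xfe, 200, pid_t)  /* placeholder */

int main(void)
{
	pid_t pid = getpid();
	/* Example device name; use the actual cache device. */
	int fd = open("/dev/mapper/cachedev", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Mark this pid (e.g. a backup about to scan a large table
	 * sequentially) as non-cacheable. */
	if (ioctl(fd, FLASHCACHE_ADD_NC_PID, &pid) < 0)
		perror("ioctl");
	close(fd);
	return 0;
}
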
A few things to note about tagging pids non-cacheable. First, this
only really works reliably with Direct IO. For buffered IO, writes
will almost always happen from kernel threads (eg pdflush), so writes
will continue to be cached. For most filesystems, these ioctls will
make buffered reads uncached, since readaheads are kicked off from the
filemap code and therefore from the same context as the reads.

If a process that marked itself non-cacheable dies, flashcache has
no way of cleaning up (the Linux kernel doesn't have an at_exit() hook).
Applications have to work around this (see configuration below). The
cleanup issue can be fixed by making the cache control aspect of
flashcache a pseudo-filesystem, so that the last close of the fd on
process exit cleans things up (see Futures for more details).

In spite of the limitations, we think the ability to mark Direct IOs
issued by a pid as non-cacheable will be valuable in preventing
backups from wiping out the cache.

(For a more detailed discussion about caching controls, see the SA Guide).

Futures and Features :
====================
Cache Mirroring :
---------------
Mirroring the cache across 2 physical flash devices should work
without any code changes. Since the cache device is a block device, we
can build a RAID-1 block device out of the 2 physical flash devices
and use that as our cache device. (I have not tested this yet.)

Cache Resizing :
--------------
The easiest way to resize the cache is to bring the cache offline and
then resize it. Resizing the cache while it is active is complicated
and bug-prone.

Integration with ATA TRIM Command :
---------------------------------
The ATA TRIM command was introduced as a way for the filesystem to
inform the ssd that certain blocks were no longer in use, to facilitate
improved wear levelling algorithms in the ssd controller. Flashcache
can leverage this as well. We can simply discard all blocks falling
within a TRIM block range from the cache, regardless of their state,
since they are no longer needed.

Deeper integration with filesystems :
-----------------------------------
Non-cacheability could be much better implemented with a deeper
integration of flashcache and the filesystem. The filesystem could
easily tag IOs as non-cacheable, based on user actions.

Fixing the "Torn Page Problem" (make Cache Writes atomic) :
---------------------------------------------------------
As mentioned above, cache block writes are non-atomic. If we have a
power failure or if the flash write fails, part of the block (a few
sectors) could be written out, corrupting the block. In this respect,
flashcache behaves no differently than a disk.

We have ideas on how to fix this and achieve atomic cache block writes
using shadow paging techniques. Mark C. says that we could avoid
doublebuffer writes if we have atomic cache block writes. However,
with flashcache, doublebuffer writes will all be absorbed by the flash
(and we should get excellent write hits/overwrites for doublebuffer
blocks, so they would never hit disk). So it is not clear how much of
a win atomic cache block writes will be.

It is, however, a handy feature to provide. If we have atomic block
writes, we could also enable cache block checksums.

There are broadly 3 ways to fix this.
1) If the flash device offers configurable sector sizes, configure it
to match the cache block size (FusionIO offers up to a 4KB configurable
sector size).
2) If we choose a shadow page that falls in the same metadata sector
as the page being overwritten, we can do the shadow page write and
switch the metadata atomically.
3) If we don't want the restriction that the shadow page and the page
overwritten are part of the same metadata sector, to allow us to pick
a shadow page more freely across the cache set, we would need to
introduce a monotonically increasing timestamp per write in the cache
metadata that will allow us to disambiguate dirty blocks in the event
of a crash.

Breaking up the cache spinlock :
------------------------------
All cache state is protected by a single spinlock. Currently, CPU
utilization in the cache routines is very low and there is no
contention on this spinlock. That may change in the future.

Make non-cacheability more robust :
---------------------------------
The non-cacheability support needs fixing in terms of cleanup when a
process dies. Probably the best way to handle this is in a
pseudo-filesystem-like way.

Several other implementation TODOs/Futures are documented in the code.

Testing and Testability :
=======================
Stress Tester :
-------------
I modified NetApp's open source sio load generator, adding support for
data verification, with block checksums maintained in an mmap'ed
file. I've been stress testing the cache with this tool. We can vary
the read/write mix, seq/rand IO mix, block size, direct IO vs
buffered IO, number of IO threads, etc. with this tool.

In addition, I've used other workloads to stress test flashcache.

Error Injection :
---------------
I've added hooks for injecting all kinds of errors into the flashcache
code (flash IO errors, disk IO errors, various kernel memory
allocation errors). The error injection can be controlled by the
sysctl "error_inject". Writing one of the following flags into
"error_inject" causes the next event of that type to result in an
error. The flag is cleared after the error is simulated, so we need to
set the flag for each error we'd like to simulate.

/* Error injection flags */
#define READDISK_ERROR                    0x00000001
#define READCACHE_ERROR                   0x00000002
#define READFILL_ERROR                    0x00000004
#define WRITECACHE_ERROR                  0x00000008
#define WRITECACHE_MD_ERROR               0x00000010
#define WRITEDISK_MD_ERROR                0x00000020
#define KCOPYD_CALLBACK_ERROR             0x00000040
#define DIRTY_WRITEBACK_JOB_ALLOC_FAIL    0x00000080
#define READ_MISS_JOB_ALLOC_FAIL          0x00000100
#define READ_HIT_JOB_ALLOC_FAIL           0x00000200
#define READ_HIT_PENDING_JOB_ALLOC_FAIL   0x00000400
#define INVAL_PENDING_JOB_ALLOC_FAIL      0x00000800
#define WRITE_HIT_JOB_ALLOC_FAIL          0x00001000
#define WRITE_HIT_PENDING_JOB_ALLOC_FAIL  0x00002000
#define WRITE_MISS_JOB_ALLOC_FAIL         0x00004000
#define WRITES_LIST_ALLOC_FAIL            0x00008000
#define MD_ALLOC_SECTOR_ERROR             0x00010000

I then use a script like this to simulate errors under heavy IO load.

#!/bin/bash

# Walk through every error injection flag, one per second.
for ((debug = 0x00000001 ; debug <= 0x00010000 ; debug = debug * 2))
do
	echo $debug > /proc/sys/dev/flashcache/error_inject
	sleep 1
done

Acknowledgements :
================
I would like to thank Bob English for doing a critical review of the
design and the code of flashcache, for discussing this in detail with
me and providing valuable suggestions.