# Example: File Hash Lookup

__Learning Objectives__:

* Streaming data via `conduit`
* `crypto-api` and `crypto-conduit` for hashing
* Understanding the pitfalls of lazy I/O
* Better file path management via `system-filepath`
* More efficient lookups with `unordered-containers`

__Use Case__:

* Work on a folder containing a large number of files.
* User provides the MD5 hash of one of the files.
* Application reports path to matching file, if present.

_Note_: A simple optimization for this program would be to cache information
between runs. We will not be doing so here; it is left as an exercise to the
reader.

## Simple approach: lazy I/O

Let's start off with a basic approach. We'll use the standard file access
functions from `directory`, `pureMD5` for hashing, and will read file contents
with lazy I/O. We'll start off with an import list:

```haskell
import Data.Maybe (catMaybes)
import Control.Applicative ((<$>))
import Data.Word (Word8)
import Numeric (showHex)

import System.Directory (getDirectoryContents, doesFileExist)
import System.FilePath ((</>))
import System.Environment (getArgs, getProgName)

import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L

import Data.Digest.Pure.MD5 (md5)
import Data.Serialize (encode)
```

We want to represent the MD5 sum as a sequence of hexadecimal characters. Let's
define a simple helper function to convert a `ByteString` to a hex
representation. Note that this implementation is not particularly efficient.
Writing a more efficient version in terms of `unfoldrN` is left as an exercise
to the reader.

```haskell
toHex :: S.ByteString -> S.ByteString
toHex =
    S.concatMap word8ToHex
  where
    word8ToHex :: Word8 -> S.ByteString
    word8ToHex w = S8.pack $ pad $ showHex w []

    -- We know that the input will always be 1 or 2 characters long.
    pad :: String -> String
    pad [x] = ['0', x]
    pad s   = s
```
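
For readers who want a head start on that exercise, here is one possible sketch
built on `unfoldrN`. It is only an illustration, not a tuned implementation:
`toHexFast` is a hypothetical name, and whether it actually beats `toHex` would
need to be measured. The seed is the index of the next output byte, so every
input byte yields exactly two hex digits.

```haskell
import qualified Data.ByteString as S
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8)

-- | Hex-encode a strict ByteString, two lowercase ASCII digits per input byte.
-- (Hypothetical helper; the name is not from any library.)
toHexFast :: S.ByteString -> S.ByteString
toHexFast bs = fst (S.unfoldrN outLen step 0)
  where
    outLen = 2 * S.length bs

    -- The seed is the index of the next output byte to produce.
    step :: Int -> Maybe (Word8, Int)
    step i
      | i >= outLen = Nothing
      | even i      = Just (digit (w `shiftR` 4), i + 1)  -- high nibble
      | otherwise   = Just (digit (w .&. 0x0F), i + 1)    -- low nibble
      where
        w = S.index bs (i `div` 2)

    -- Map a nibble (0..15) to its ASCII hex digit.
    digit :: Word8 -> Word8
    digit n
      | n < 10    = 48 + n   -- '0'..'9'
      | otherwise = 87 + n   -- 'a'..'f'
```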

Next, we'll want to have a function that takes the entire contents of a file as
a lazy `ByteString` and returns its hex representation. This is actually very
simple:

```haskell
hash :: L.ByteString -> S.ByteString
hash = toHex . encode . md5
```

The one trick comes from understanding the need for `encode`. The `md5`
function returns an `MD5Digest` value, which we must convert to a `ByteString`
via the `cereal` package.

Now let's get into the meat of this program. We want to take a `FilePath` to a
folder, and get a list of pairs mapping hash to `FilePath`.

```haskell
buildMap :: FilePath -> IO [(S.ByteString, FilePath)]
buildMap dir = do
    fps <- getDirectoryContents dir
    catMaybes <$> mapM getPair fps
  where
    getPair :: FilePath -- ^ filename without directory!
            -> IO (Maybe (S.ByteString, FilePath))
    getPair name = do
        exists <- doesFileExist fp
        if exists
            then do
                lbs <- L.readFile fp
                return $ Just (hash lbs, fp)
            else return Nothing
      where
        fp = dir </> name
```

`getDirectoryContents` returns a list of filenames contained in a specific
folder. Note that this is a filename *without* the folder. You need to manually
add the folder name to the beginning. We use the `filepath` package's `</>`
operator to do so.

Also, `getDirectoryContents` returns both files and folders, so we need to
explicitly check (via `doesFileExist`) if the path we are looking at is a file.
If it is, we read the contents in lazily, hash them, and return the pair of
hash and filepath. We use `catMaybes` to get rid of any `Nothing` values
returned.

Finally, we have our `main` function:

```haskell
main :: IO ()
main = do
    args <- getArgs
    (folder, needle) <-
        case args of
            [a, b] -> return (a, b)
            _ -> do
                pn <- getProgName
                error $ concat
                    [ "Usage: "
                    , pn
                    , " <folder> <needle>"
                    ]
    md5Map <- buildMap folder
    case lookup (S8.pack needle) md5Map of
        Nothing -> putStrLn "No match found"
        Just fp -> putStrLn $ "Match found: " ++ fp
```

This program seems to work properly:

    $ md5sum README
    d41d8cd98f00b204e9800998ecf8427e  README
    $ runghc simple.hs . d41d8cd98f00b204e9800998ecf8427e
    Match found: ./README

## Problem 1: FilePath handling

There are in fact four problems with how we've dealt with `FilePath`s:

1. It's tedious and error-prone to have to prepend the folder name to the
   results of `getDirectoryContents`.

2. Having to check whether a `FilePath` is a file or folder is, again, tedious
   and error-prone.

3. We are likely not handling character encodings of the paths properly.
   `FilePath` is defined as a type synonym for `[Char]`, but in fact does not
   properly handle encodings on all systems.

4. We're ignoring the contents of subfolders. We haven't actually specified
   whether files in subfolders should be inspected or not, so this is not
   actually a bug, but a feature enhancement.

We'll start with issues 1 and 3. These can be addressed by relying on a
separate library for better filepath handling, `system-filepath`, and its
associated `system-fileio`. The former defines an abstract datatype for
representing paths, along with a number of utility functions for manipulating
them. The latter exposes functions for interacting with the filesystem.

The simplest way to switch over to this library is to add two import
statements:

```haskell
import Prelude hiding (FilePath)
import Filesystem.Path.CurrentOS (FilePath, encodeString, decodeString)
```

All we're doing is replacing the standard type synonym definition of `FilePath`
with `system-filepath`'s abstract definition, as well as importing functions to
convert to and from normal strings.

After that, try to compile and start fixing the errors. We have a few simple
fixes in `main`: `md5Map <- buildMap folder` becomes
`md5Map <- buildMap $ decodeString folder` and
`++ fp` becomes `++ encodeString fp`.

For the next two changes, we'll need to replace our imports from
`System.Directory` with:

```haskell
import Filesystem (listDirectory, isFile)
```

In place of `getDirectoryContents`, we'll use `listDirectory`. However, this
new function will return a full file path, not just the filename. We also
replace `doesFileExist` with `isFile`. In other words, the beginning of
`getPair` is now:

```haskell
getPair fp = do
    exists <- isFile fp
```

Much simpler!

We'll come back to issues 2 and 4 later, after we introduce `conduit`.

## Problem 2: Laziness

How much memory does our program consume? Seemingly, not very much. We never
read the entire body of the files into memory; we only read in one chunk at a
time. So most likely, we read in a chunk, update the state of our MD5 digest,
and then discard that chunk entirely. Lazy I/O to the rescue!

While this may be true, we actually have an entirely different form of resource
exhaustion on our hands: file descriptors. Do you know precisely when the file
handles will be opened? It's not obvious if this happens when we first call
`readFile`, or when the hash value is first evaluated. And how about when the
file handles are closed? It's completely non-deterministic.

If you have 50 files in your folder, no big deal. But suppose you have 5000, or
200,000. You'll quickly run out of file descriptors! This isn't just an
academic concern; this kind of question has come up multiple times on the
Haskell Cafe, and was a bug at one point in `yesod-static`.

One possible solution would be to try adding `seq` in a few places and force
evaluation early. But even this isn't completely deterministic. Instead, let's
tackle this problem by using a library designed to avoid lazy I/O: `conduit`.

In `conduit`, we have three main datatypes: a `Source` is a producer of data, a
`Sink` is a consumer of data, and a `Conduit` is a transformer of data. For our
example, we'll need the `sourceFile` function to produce a stream of bytes from
a file and the `sinkHash` function to consume a stream of bytes into a digest.
We won't be using a `Conduit` here, but a possible usage would be to
automatically decompress any files with a .gz file extension. Adding this
enhancement would be an excellent exercise.

Note: We're actually taking the "long way around" in implementing this, since
the `crypto-conduit` package conveniently provides a `hashFile` function
already. We're doing this all manually to demonstrate how to use `conduit`.

Let's add our import statements:

```haskell
import Data.Conduit (($$), runResourceT)
import Data.Conduit.Filesystem (sourceFile)
import Crypto.Conduit (sinkHash)

import Data.Digest.Pure.MD5 (MD5Digest)
```

We've already explained `sourceFile` and `sinkHash`. One note about the former:
it is imported from the `Data.Conduit.Filesystem` module, which provides
functions that work with `system-filepath`'s `FilePath` type by default. This
means we don't have to do any encoding or decoding of `String` values.

The `$$` operator is known as the __connect operator__. It pulls data from a
`Source` and pushes it to a `Sink`. This is a completely strict, deterministic
action. We are guaranteed that the file handle will be opened immediately, that
data will be pulled one chunk at a time, sent to the sink, and then discarded.
As soon as the last chunk of data is pulled, or when the sink completes
processing early, the source is closed.

The one problem with that statement is exceptions. If we had a sink that threw
an exception halfway through reading a file, we want to be certain that the
file handle is still closed. For this, we have `runResourceT`. Any `conduit`
function which allocates scarce resources must live in a `MonadResource`. When
the file handle is first opened, a release action is registered to close the
file handle. If processing terminates normally, the file handle is closed
immediately. If an exception is thrown, `runResourceT` will catch the
exception, perform any cleanup actions, and then rethrow the exception. Either
way, you can't leak a resource.

That was a pretty long description, but it all boils down to one line of code:

```haskell
digest <- runResourceT $ sourceFile fp $$ sinkHash
```
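
To make those guarantees concrete, here is a rough sketch of the same open,
read-a-chunk, discard, close cycle written by hand with only `base` and
`bytestring`. It is not how `conduit` is implemented. `foldFile` and
`countBytes` are hypothetical names, the 32 KiB chunk size is an arbitrary
assumption, the `FilePath` here is the Prelude's `String` synonym rather than
`system-filepath`'s type, and the byte count is only a stand-in for a real
digest.

```haskell
import qualified Data.ByteString as S
import System.IO (Handle, IOMode (ReadMode), withBinaryFile)

-- | Strictly fold a function over a file's contents, one chunk at a time.
-- 'withBinaryFile' opens the handle right away and closes it when the
-- action finishes, whether it returns normally or throws an exception.
foldFile :: (a -> S.ByteString -> a) -> a -> FilePath -> IO a
foldFile step z fp = withBinaryFile fp ReadMode (loop z)
  where
    loop :: a' ~ a => a' -> Handle -> IO a'
    loop acc h = do
      chunk <- S.hGetSome h 32768          -- assumed chunk size
      if S.null chunk
        then return acc                    -- end of file
        else let acc' = step acc chunk
             in acc' `seq` loop acc' h     -- force before reading more

-- | Hypothetical stand-in for a digest: the total number of bytes read.
countBytes :: FilePath -> IO Int
countBytes = foldFile (\n chunk -> n + S.length chunk) 0
```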

All we're doing is reading from the file, connecting to the hash-producing
function, and pulling out the digest.

Next, we need to turn this `digest` value into a hex-encoded `ByteString`. As
before, we'll need to use `cereal`'s `encode` function. However, `sinkHash`
can work with many different kinds of hash algorithms (e.g., Skein, SHA256). So
we need to explicitly tell GHC which type of digest we want. We do this by
giving an explicit signature to `digest`:

```haskell
let hash = toHex . encode $ (digest :: MD5Digest)
```

The `cryptohash` package provides a large number of hashes. Since we're just
sticking with MD5, this example still uses the `pureMD5` package.

## Dealing with those subfolders

Let's see another example of `conduit`. The `Data.Conduit.Filesystem` module
provides a function `traverse`, which gives a `Source` of all files in a
folder, or any of its subfolders. This can help us deal with issues 2 and 4
from problem 1: it will only provide files, not folders, and will
automatically traverse subfolders. We'll need to update our imports a bit more:

```haskell
import Data.Conduit (($$), (=$), runResourceT)
import qualified Data.Conduit.List as CL
import Data.Conduit.Filesystem (sourceFile, traverse)
```

The `=$` operator is called __right fuse__. It combines a `Conduit` and a
`Sink` together into a `Sink`. The `Data.Conduit.List` module provides a number
of familiar functions for working with conduit, such as `mapM` and `fold`.
Let's see how we put this together:

```haskell
buildMap :: FilePath -> IO [(S.ByteString, FilePath)]
buildMap dir =
    traverse False dir
        $$ CL.mapM getPair
        =$ CL.consume
  where
    getPair :: FilePath -> IO (S.ByteString, FilePath)
    getPair fp = do
        -- Now we know that fp is a file, not a folder.
        -- No need to check it.
        digest <- runResourceT $ sourceFile fp $$ sinkHash
        let hash = toHex . encode $ (digest :: MD5Digest)
        return (hash, fp)
```

The first argument to `traverse` indicates whether we should follow symbolic
links. We've elected not to. Notice how we connect this to `CL.mapM getPair`,
and fuse that with `CL.consume`. What we're really doing is:

* Fusing `CL.mapM getPair` and `CL.consume` into a new `Sink`.
* Connecting that new `Sink` with the `traverse` `Source`.

In this way, it's easy to build up pipelines of operations.

`mapM` does what you would expect: it transforms each element in a stream
using some monadic function. `consume` will read in a stream of values and
store them as a list. By fusing these two actions together, we're creating a
`Sink` that will convert a stream of `FilePath`s into a list of pairs of hashes
and `FilePath`s.

We're also able to completely skip the `isFile` check at this point. In other
words: mission accomplished.
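
For comparison, a hand-written recursive walk using only `directory` and
`filepath` might look like the sketch below. `listFilesRecursive` is a
hypothetical helper, it uses the Prelude's `String`-based `FilePath`, and it
assumes a `directory` version that provides `listDirectory`. It also builds the
whole list in memory instead of streaming, and unlike `traverse False` it
follows symbolic links to folders, because `doesDirectoryExist` does.

```haskell
import Control.Monad (forM)
import System.Directory (doesDirectoryExist, listDirectory)
import System.FilePath ((</>))

-- | Collect the path of every non-directory entry under a folder,
-- descending into subfolders. (Hypothetical helper, not a library function.)
listFilesRecursive :: FilePath -> IO [FilePath]
listFilesRecursive dir = do
  names <- listDirectory dir              -- omits "." and ".."
  paths <- forM names $ \name -> do
    let path = dir </> name               -- prepend the folder by hand
    isDir <- doesDirectoryExist path
    if isDir
      then listFilesRecursive path        -- recurse into the subfolder
      else return [path]
  return (concat paths)
```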

## Problem 3: Inefficient lookup

One last annoyance: that `lookup` we're performing in `main` is an O(n)
operation. Since we're only ever doing a single lookup, we could restructure
our program to simply terminate as soon as it finds a matching hash value. And
in fact, that would be the most efficient thing we could do, given our current
constraints.
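
As a minimal sketch of that early-exit idea, assume we already have a list of
candidate paths and some action that hashes one file. `findM` and `findByHash`
are hypothetical helpers, not part of any library used here, and the hashing
action is passed in as a parameter so the sketch stays independent of `conduit`.
Files after the first match are never hashed.

```haskell
-- | Test each element in turn and stop at the first one that passes.
findM :: Monad m => (a -> m Bool) -> [a] -> m (Maybe a)
findM _ [] = return Nothing
findM p (x:xs) = do
  ok <- p x
  if ok then return (Just x) else findM p xs

-- | Return the first item whose hash equals the needle.
-- 'hashOf' is whatever hashing action the caller supplies (an assumption).
findByHash :: (Monad m, Eq h) => (a -> m h) -> h -> [a] -> m (Maybe a)
findByHash hashOf needle = findM matches
  where
    matches x = do
      h <- hashOf x
      return (h == needle)
```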

But suppose we want to change our program to allow a user to enter multiple
hash values to be looked up, or we want to create a server that will respond
to such requests. We'll want to cache all of the hash values when our program
starts, and continue using that cache throughout the duration of the
application. To do this, we'll use the `unordered-containers` package's
`HashMap`.

First we'll need to import the module in question:

```haskell
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HMap
```

Modifying the main function is simple: just replace `lookup` with
`HMap.lookup`. The real work comes from `buildMap`, but even that's not too
bad:

```haskell
buildMap :: FilePath -> IO (HashMap S.ByteString FilePath)
buildMap dir =
    traverse False dir
        $$ CL.mapM getPair
        =$ CL.fold HMap.union HMap.empty
  where
    getPair :: FilePath -> IO (HashMap S.ByteString FilePath)
    getPair fp = do
        -- Now we know that fp is a file, not a folder.
        -- No need to check it.
        digest <- runResourceT $ sourceFile fp $$ sinkHash
        let hash = toHex . encode $ (digest :: MD5Digest)
        return $ HMap.singleton hash fp
```

Instead of returning a tuple, `getPair` now returns a `HashMap`. And instead of
using `CL.consume`, we use `CL.fold` to join together each successive
`HashMap`.

### Possibly misguided optimization

If we wanted to optimize this a bit more, we could skip the creation of the
intermediate `HashMap`s and avoid the intermediate `Conduit`, by rewriting our
code as:

```haskell
buildMap :: FilePath -> IO (HashMap S.ByteString FilePath)
buildMap dir =
    traverse False dir $$ CL.foldM addFP HMap.empty
  where
    addFP :: HashMap S.ByteString FilePath
          -> FilePath
          -> IO (HashMap S.ByteString FilePath)
    addFP hmap fp = do
        digest <- runResourceT $ sourceFile fp $$ sinkHash
        let hash = toHex . encode $ (digest :: MD5Digest)
        return $ HMap.insert hash fp hmap
```

This is called possibly misguided since there's no actual evidence that this
will speed up the code. As much as our gut may say "look, there are fewer lines
of code," without profiling it, we can't be certain. At this point, it's a
matter of style whether you prefer the previous version of `buildMap` or this
one.

The former is nicer since it clearly separates two distinct actions (turning a
`FilePath` into the hash-pair, and combining multiple `HashMap`s), whereas this
one in some ways makes it clearer what is going on (I'm inserting into an
existing map). Which approach you take is entirely your decision.
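
One thing worth checking before choosing is what happens when two files have
identical contents and therefore the same hash. The sketch below uses
`Data.Map.Strict` from `containers` purely as a stand-in so it runs without
extra packages; `viaUnion` and `viaInsert` are hypothetical names, and the
hashes are plain `String`s. `Map.union` keeps the value from its first
argument, while `Map.insert` replaces an existing value. The
`unordered-containers` documentation describes the same behaviour for
`HMap.union` and `HMap.insert`, but it is worth confirming against the version
you use.

```haskell
import qualified Data.Map.Strict as Map

-- | Mirrors folding 'union' over singleton maps: union is left-biased,
-- so the path already in the accumulator is kept.
viaUnion :: [(String, FilePath)] -> Map.Map String FilePath
viaUnion = foldl (\acc (h, fp) -> Map.union acc (Map.singleton h fp)) Map.empty

-- | Mirrors the insert-based fold: insert replaces an existing value,
-- so the most recently seen path is kept.
viaInsert :: [(String, FilePath)] -> Map.Map String FilePath
viaInsert = foldl (\acc (h, fp) -> Map.insert h fp acc) Map.empty
```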

## Final source code

```haskell
import Prelude hiding (FilePath)
import Data.Word (Word8)
import Numeric (showHex)

import System.Environment (getArgs, getProgName)

import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8

import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HMap

import Filesystem.Path.CurrentOS (FilePath, encodeString, decodeString)

import Data.Conduit (($$), runResourceT)
import qualified Data.Conduit.List as CL
import Data.Conduit.Filesystem (sourceFile, traverse)
import Crypto.Conduit (sinkHash)

import Data.Digest.Pure.MD5 (MD5Digest)
import Data.Serialize (encode)

main :: IO ()
main = do
    args <- getArgs
    (folder, needle) <-
        case args of
            [a, b] -> return (a, b)
            _ -> do
                pn <- getProgName
                error $ concat
                    [ "Usage: "
                    , pn
                    , " <folder> <needle>"
                    ]
    md5Map <- buildMap $ decodeString folder
    case HMap.lookup (S8.pack needle) md5Map of
        Nothing -> putStrLn "No match found"
        Just fp -> putStrLn $ "Match found: " ++ encodeString fp

buildMap :: FilePath -> IO (HashMap S.ByteString FilePath)
buildMap dir =
    traverse False dir $$ CL.foldM addFP HMap.empty
  where
    addFP :: HashMap S.ByteString FilePath
          -> FilePath
          -> IO (HashMap S.ByteString FilePath)
    addFP hmap fp = do
        digest <- runResourceT $ sourceFile fp $$ sinkHash
        let hash = toHex . encode $ (digest :: MD5Digest)
        return $ HMap.insert hash fp hmap

-- Overall, this function is pretty inefficient. Writing an optimized version
-- in terms of unfoldrN is left as an exercise to the reader.
toHex :: S.ByteString -> S.ByteString
toHex =
    S.concatMap word8ToHex
  where
    word8ToHex :: Word8 -> S.ByteString
    word8ToHex w = S8.pack $ pad $ showHex w []

    -- We know that the input will always be 1 or 2 characters long.
    pad :: String -> String
    pad [x] = ['0', x]
    pad s   = s
```