Database for PNG images

DB optimized for a bunch of PNG images.


The idea is to split PNG images into many blocks and store each block in a DB. If several blocks are equal, the data is stored only once. A hash table makes the lookup of such blocks fast.
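As a rough illustration of that idea, here is a minimal sketch in C++ (hypothetical code, not the project's actual API; the real DB hashes blocks with SHA1 and keeps a set of ids per hash to handle collisions):

    // Hypothetical sketch of the core idea: store each distinct block
    // once, keyed by its hash, and refer to it by id.
    #include <map>
    #include <string>

    typedef std::string Block;     // raw block bytes
    typedef unsigned long BlockId;

    struct BlockStore {
        std::map<std::string, BlockId> hashToId; // the "hash table": hash -> id
        std::map<BlockId, Block> idToBlock;      // id -> stored block data
        BlockId nextId;
        BlockStore() : nextId(0) {}

        // Returns the id for the block; stores it only if the hash is new.
        // (The real DB keeps a *set* of ids per SHA1 to handle collisions.)
        BlockId push(const Block& block, const std::string& hash) {
            std::map<std::string, BlockId>::iterator it = hashToId.find(hash);
            if(it != hashToId.end()) return it->second; // duplicate -> reuse
            BlockId id = nextId++;
            hashToId[hash] = id;
            idToBlock[id] = block;
            return id;
        }
    };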

Use case

I am collecting screenshots (for several reasons; one is to play around with machine learning / computer vision). A lot of them. :)

Right now, I have about 88k screenshots, totaling about 77 GB. Many of them have a lot of repetitive areas (on some days, I was taking a screenshot every 10 seconds, even when not using the computer at all, so the only changing part was the time display), and I didn't want to waste so much space on such repetitive data.

With this PNG DB, I get a compression rate of about 400-500% (measured on the first 1k screenshots or so; the rate will probably be even higher over all of them).

This screenshot example is probably an extreme case, where the approach applies extremely well. But I guess that in many other cases where you are collecting a huge number of PNG images (with computer-generated content; real-world photos would not work that well), you can save some space with it.

And if this gets optimized as far as possible, it may even be faster than normal filesystem access (because of less disk IO).

Technical details

To make things easier on the PNG side, it just parses the PNG down until it gets the scanline serialization. Multiple directly following parts of scanline serializations (each of the same width) build up a block, so a block actually matches a rectangular block in the real picture. But I don't do any of the PNG filtering. PNG spec: http://www.w3.org/TR/PNG/
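A simplified sketch of the block cutting (the block dimensions and the helper name are assumptions for illustration; 'scanlines' holds the raw scanline serializations, all of equal length):

    // Cut the serialized scanlines into rectangular blocks: take the
    // same x-range from up to BlockH consecutive scanlines.
    #include <algorithm>
    #include <string>
    #include <vector>

    static const size_t BlockW = 64, BlockH = 64; // assumed block size

    std::vector<std::string> cutBlocks(const std::vector<std::string>& scanlines) {
        std::vector<std::string> blocks;
        if(scanlines.empty()) return blocks;
        for(size_t y = 0; y < scanlines.size(); y += BlockH)
            for(size_t x = 0; x < scanlines[0].size(); x += BlockW) {
                std::string block;
                for(size_t dy = 0; dy < BlockH && y + dy < scanlines.size(); ++dy) {
                    const std::string& line = scanlines[y + dy];
                    block += line.substr(x, std::min(BlockW, line.size() - x));
                }
                blocks.push_back(block);
            }
        return blocks;
    }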

The general DB layout is as follows (a small key-construction sketch follows the list):

  • ("data." unique id -> zlib compressed data) data pairs
  • ("sha1refs." SHA1 -> set of ids) data pairs
  • ("fs." filename -> id) data pairs

Each such data value (uncompressed) starts with a data-type byte. There are currently only 3 types (a small framing sketch follows the list):

  • PNG file summary
  • PNG chunk (all non-data PNG chunks)
  • PNG block
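A sketch of that value framing (the concrete enum values are assumptions; only the three type names come from the list above):

    // Each value is a data-type byte followed by the payload.
    #include <string>

    enum DbDataType {
        DbDataType_PngFileSummary = 1,
        DbDataType_PngChunk       = 2, // all non-data PNG chunks
        DbDataType_PngBlock       = 3
    };

    std::string makeValue(DbDataType type, const std::string& payload) {
        return std::string(1, (char)type) + payload; // prepend the type byte
    }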

There are multiple DB backend implementations (a sketch of the common interface follows the list):

  • The filesystem itself. But it creates a lot of files!
  • Redis. Via hiredis. As everything is kept in memory, you are a bit limited.
  • KyotoCabinet. (Currently the default.)
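A minimal sketch of what such a pluggable backend interface could look like (hypothetical; see Db.h and the Db*Backend.* files in the repo for the real one):

    // A generic key/value backend interface the DB code could run against.
    #include <string>

    struct DbBackend {
        virtual ~DbBackend() {}
        virtual bool get(const std::string& key, /*out*/ std::string& value) = 0;
        virtual bool set(const std::string& key, const std::string& value) = 0;
    };
    // Implementations in the repo: DbFsBackend (filesystem), DbRedisBackend
    // (Redis via hiredis), DbKyotoBackend (KyotoCabinet, the default).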

Comparison with other compression methods / deduplicators

In the beginning, I thought about using some generic image library to be able to handle just any image type and then operate on the raw data. This would even give a slightly better compression rate because right now, I am operating on the PNG scanline serializations, and there are 5 different ways (filters) in PNG to represent a scanline.

However, because I am storing all the data as raw PNG data in the DB, the reconstruction of the PNG should be much faster. In the more generic case, I would have to recompress/reencode the PNG. Now I only have to (roughly) collect and glue the parts together and run the PNG zlib compression over it.
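That last step is essentially a single zlib pass; a sketch using zlib's one-shot compress2 (real PNG writers usually stream via deflate(), and the splitting into IDAT chunks is omitted here):

    // Compress the reassembled scanline serialization into the zlib
    // stream that goes into the PNG IDAT chunk(s).
    #include <zlib.h>
    #include <string>
    #include <vector>

    std::string zlibCompress(const std::string& raw) {
        uLongf destLen = compressBound(raw.size());
        std::vector<Bytef> out(destLen);
        int ret = compress2(&out[0], &destLen,
                            (const Bytef*)raw.data(), raw.size(),
                            Z_DEFAULT_COMPRESSION);
        if(ret != Z_OK) return ""; // compression failed
        return std::string((const char*)&out[0], destLen);
    }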

Using a general deduplicator / compressor on the raw data (uncompressed PNG, TGA or BMP): it would be based on connected chunks of data; in the image, that means one or more consecutive scanlines. What I am doing is based on rectangular blocks in the image instead, so I am able to get much bigger chunks of repetitive data.

Something like what is done in video compression methods like H.264: this might actually be a very good idea, and it should be possible to just add it on top of my current method.


Tools

It comes with several tools. Some of them:

  • db-push: Pushes a single PNG into the DB.
  • db-push-dir: Pushes all PNGs in a given directory into the DB.
  • db-extract-file: Extracts a single PNG from the DB.
  • db-fuse: Simple FUSE interface to the DB. (Slow though, because it is not very optimized!)


Compilation

Just run ./

For Mac: If you don't have MacFUSE installed, install it first.


TODO

  • Many parts could be optimized a lot.
  • Try other DB backend implementations. Maybe MongoDB or Basho Riak. Or improve the filesystem implementation (which is currently incomplete anyway).
  • To make the FUSE interface faster, the caching must be improved. Also, there should be a way to get the file size in advance, and maybe also to seek in constant time. To make this possible, we probably need a fixed compression algorithm, and the file summary must contain some offset information.
  • We could also store other image formats in a similar way. And also general files. There should also be a standard fallback.
  • The FUSE interface could also support writing.

-Albert Zeyer