Database for PNG images

DB optimized for a bunch of PNG images.


The idea is to split PNG images into many blocks and to store each block in a DB. If several blocks are equal, the block is stored only once. A hash table makes the lookup for such blocks fast.
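A minimal sketch of that deduplication idea, not the project's actual code: the class name `BlockStore` and its methods are hypothetical, everything lives in memory, and `std::hash` stands in for SHA1 to keep the example self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory block store (illustration only).
class BlockStore {
public:
    // Stores the block if it is new and returns its id. If an equal block
    // was stored before, the id of the existing copy is returned instead.
    std::size_t put(const std::string& block) {
        // Assumption: std::hash replaces SHA1 for brevity.
        const std::size_t h = std::hash<std::string>{}(block);
        std::vector<std::size_t>& candidates = idsByHash_[h];
        for (std::size_t id : candidates) {
            // Compare the content too, so a hash collision cannot
            // merge two different blocks.
            if (blocks_[id] == block) return id;
        }
        blocks_.push_back(block);
        candidates.push_back(blocks_.size() - 1);
        return blocks_.size() - 1;
    }

    const std::string& get(std::size_t id) const { return blocks_.at(id); }

    std::size_t storedBlocks() const { return blocks_.size(); }

private:
    std::vector<std::string> blocks_;
    // hash -> ids of all stored blocks with that hash
    std::unordered_map<std::size_t, std::vector<std::size_t>> idsByHash_;
};
```

Pushing the same block twice via `put` returns the same id both times, and `storedBlocks()` only grows for blocks that were not seen before.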

Use case

I am collecting screenshots (for several reasons; one is to play around with machine learning / computer vision). A lot of them. :)

Right now, I have about 88k screenshots, totaling about 77 GB. Many of them have a lot of repetitive areas (on some days, I was making a screenshot every 10 seconds, even when not using the computer at all, so the only changing part was the time display), so I didn't want to waste so much space on so much repetitive data.

With this PNG DB, I have a compression rate of about 400-500% (for the first 1k screenshots or so; probably the rate will even be higher for all of them).

This example with the screenshots is probably an extreme case (where this applies extremely well). But I guess in many other cases where you are collecting a huge amount of PNG images (with computer-generated content; real-world images would not work that well), you can save some space with it.

And if this gets optimized as far as possible, it may be even faster than normal filesystem access (because of less disk IO).

Technical details

To make things easier on the PNG side, it just parses the file down until it gets a scanline serialization. Parts (of the same width) of multiple directly following scanline serializations build up a block (so a block really matches a rectangular block in the real picture). But I don't do any of the PNG filtering. PNG spec:
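As a rough sketch of how such blocks could be cut out of the scanline data: the function `splitIntoBlocks` and its `blockWidth` / `blockHeight` parameters are hypothetical, and the scanlines are treated as plain, equally long byte strings (the actual block sizes used by the project are not stated here).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Cuts equally long scanlines into rectangular blocks of at most
// blockWidth bytes by blockHeight rows. Each block is the concatenation of
// the matching part of every scanline it covers, from top to bottom.
// Assumption: all scanlines have the same length as the first one.
std::vector<std::string> splitIntoBlocks(const std::vector<std::string>& scanlines,
                                         std::size_t blockWidth,
                                         std::size_t blockHeight) {
    std::vector<std::string> blocks;
    if (scanlines.empty() || blockWidth == 0 || blockHeight == 0) return blocks;
    const std::size_t lineLen = scanlines[0].size();
    for (std::size_t y = 0; y < scanlines.size(); y += blockHeight) {
        const std::size_t yEnd = std::min(y + blockHeight, scanlines.size());
        for (std::size_t x = 0; x < lineLen; x += blockWidth) {
            std::string block;
            for (std::size_t row = y; row < yEnd; ++row) {
                // substr clamps the length at the end of the scanline.
                block += scanlines[row].substr(x, blockWidth);
            }
            blocks.push_back(block);
        }
    }
    return blocks;
}
```

Because each block covers the same columns in consecutive rows, an unchanged rectangular area of two screenshots produces identical blocks, which is what makes the deduplication effective.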

The general DB layout is as follows:

  • ("data." unique id -> zlib compressed data) data pairs
  • ("sha1refs." SHA1 -> set of ids) data pairs
  • ("fs." filename -> id) data pairs
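To illustrate the layout, here is a small sketch on top of an in-memory `std::map`. The key prefixes are the ones listed above; the helper names, the `KeyValueDb` type and the way ids are encoded as strings are assumptions, and zlib compression is left out.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a DB backend: a plain key/value map.
using KeyValueDb = std::map<std::string, std::string>;

// Key builders for the three kinds of data pairs.
std::string dataKey(const std::string& id) { return "data." + id; }
std::string sha1RefsKey(const std::string& sha1) { return "sha1refs." + sha1; }
std::string fsKey(const std::string& filename) { return "fs." + filename; }

// Resolves filename -> id -> data value. Returns false if either the
// filename or the referenced data entry is missing.
bool lookupFileData(const KeyValueDb& db, const std::string& filename,
                    std::string& out) {
    KeyValueDb::const_iterator idIt = db.find(fsKey(filename));
    if (idIt == db.end()) return false;
    KeyValueDb::const_iterator dataIt = db.find(dataKey(idIt->second));
    if (dataIt == db.end()) return false;
    out = dataIt->second;
    return true;
}
```

Keeping all three kinds of pairs in one flat key space means any backend that offers simple get/set on string keys is sufficient.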

Each such data value (uncompressed) starts with a data-type byte. There are currently only 3 types:

  • PNG file summary
  • PNG chunk (all non-data PNG chunks)
  • PNG block
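A sketch of how such a leading type byte could be written and read back. The enum `EntryType`, its numeric values and the helper functions are hypothetical; the byte values the project really uses are not given here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical type tags (the numeric values are assumptions).
enum class EntryType : unsigned char {
    FileSummary = 1,  // PNG file summary
    Chunk = 2,        // non-data PNG chunk
    Block = 3         // PNG block
};

// Prepends the type byte to the payload.
std::string makeEntry(EntryType type, const std::string& payload) {
    std::string entry;
    entry.push_back(static_cast<char>(type));
    entry += payload;
    return entry;
}

// Reads the type byte; throws on an empty entry or an unknown tag.
EntryType entryType(const std::string& entry) {
    if (entry.empty()) throw std::invalid_argument("empty DB entry");
    const unsigned char tag = static_cast<unsigned char>(entry[0]);
    if (tag < 1 || tag > 3) throw std::invalid_argument("unknown data type byte");
    return static_cast<EntryType>(tag);
}

// Returns everything after the type byte.
std::string entryPayload(const std::string& entry) {
    if (entry.empty()) throw std::invalid_argument("empty DB entry");
    return entry.substr(1);
}
```

A reader would first call `entryType` on the uncompressed value and then decide how to interpret the remaining payload.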

There are multiple DB backend implementations:

  • The filesystem itself. But this creates a lot of files!
  • Redis. Via hiredis. As everything is in memory, you are a bit limited.
  • KyotoCabinet. (Currently the default.)

Comparison with other compression methods / deduplicators

In the beginning, I thought about using some generic image library to be able to handle just any image type and then operate just on the raw data. This would even give me a slightly better compression rate, because right now I am operating on PNG scanline serializations and there are 5 different ways (filters) in PNG to represent a scanline.

However, because I am storing all the data as PNG raw data in the DB, the reconstruction of the PNG should be much faster. In the more generic case, I would have to recompress/reencode the PNG. Now I only have to (roughly) collect and glue the parts together and run the PNG zlib over it.

Using a general deduplicator / compressor on the raw data (uncompressed PNG, TGA or BMP): It would be based on connected chunks of data; i.e., in the image, that would mean one or more consecutive scanlines. But what I am doing is based on rectangular blocks in the image. So I am able to find much bigger chunks of data that are repetitive.

Something like what is done in video compression methods like H.264: This might actually be a very good idea. And it should be possible to just add it to my current method.


It comes with several tools. Some of them:

  • db-push: Pushes a single PNG into the DB.
  • db-push-dir: Pushes all PNGs in a given directory into the DB.
  • db-extract-file: Extracts a single PNG from the DB.
  • db-fuse: Simple FUSE interface to the DB. (Slow though because it is not very optimized!)


Just run ./

For Mac: If you don't have MacFUSE installed, install it from here:


  • Many parts could be optimized a lot.
  • Try other DB backend implementations. Maybe MongoDB or Basho Riak. Or improve the filesystem implementation (which is currently incomplete anyway).
  • To make the FUSE interface faster, the caching must be improved. Also, there should be a way to get the filesize in advance and maybe also to seek in constant time. Probably, to make this possible, we need to have a fixed compression algorithm and the file summary must contain some offset information.
  • We could also store other image formats in a similar way. And also general files. There should also be a standard fallback.
  • The FUSE interface could also support writing.

-Albert Zeyer,
