Benchmark and document the overhead of the SQLite index DB #16
Since this looks, at first glance, like a key/value store, did you consider something more lightweight, like LevelDB (or RocksDB) or LMDB? When it comes to pure key-lookup performance, they are likely to outperform SQLite. Both of them are available in Anaconda. Also, using an ORM layer (even though it is convenient) seems a bit too much for this use case.
I wasn't aware of these, I'll take a look at them, also in terms of whether they require a running server, and of their resilience to failures (including hard-drive failures, machine reboots, etc.). When you refer to the ORM layer, do you mean the use of SQLAlchemy? I.e., could I just call directly into the SQLite library? Or something else? Indeed, I did it mostly for convenience, as I know how SQLAlchemy works. I didn't check the overhead, and considering we already have SQLAlchemy in AiiDA I didn't think it was a problem, but I'm happy to rediscuss.
None of them should require a server, and all of them should be resilient, since they are designed as embedded databases.
Yes, for a single operation like this I would probably avoid the ORM layer and talk to SQLite directly.
Also, this is the sort of library which may be picked up by other projects as well; having SQLAlchemy as a dependency then makes less sense (and adds unneeded complexity).
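To illustrate the suggestion of talking to SQLite directly via the stdlib `sqlite3` module instead of going through an ORM, here is a minimal sketch. The schema and column names (`hashkey`, `pack`, `offset`, `length`) are hypothetical, chosen only to resemble a pack-file index; they are not the package's actual schema.

```python
import os
import sqlite3
import tempfile

# Hypothetical minimal index mapping a content hash to a location in a pack
# file, using the stdlib sqlite3 module directly (no ORM layer involved).
path = os.path.join(tempfile.mkdtemp(), "index.sqlite")
conn = sqlite3.connect(path)
conn.execute(
    "CREATE TABLE IF NOT EXISTS objects ("
    "hashkey TEXT PRIMARY KEY, pack INTEGER, offset INTEGER, length INTEGER)"
)
conn.execute(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    ("abc123", 0, 0, 4096),
)
conn.commit()

# A pure key lookup: one indexed SELECT, no ORM object construction.
row = conn.execute(
    "SELECT pack, offset, length FROM objects WHERE hashkey = ?", ("abc123",)
).fetchone()
print(row)  # -> (0, 0, 4096)
conn.close()
```

The `PRIMARY KEY` on `hashkey` gives an index-backed lookup, which is the operation that dominates this use case.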
OK, if I decide to go in this direction, I'll drop SQLAlchemy before the production-ready release; it should be quick to do at the end. In terms of the suggestions, I think LevelDB is not OK, because its README says:
That is not OK for us (we have multiple daemon workers). I'll look into the other two, which seem more promising in this respect.
I think we'll also have to drop RocksDB. The library itself seems to be actively developed by Facebook, but it lacks a robust, maintained Python interface.
So, in conclusion, there does not seem to be an actively developed Python library supporting recent versions of RocksDB. In addition, RocksDB itself is not trivial to install: the apt packages on Ubuntu are old (16.04 still ships 4.1, for instance, and 18.04 ships 5.8.8), mostly because development is happening very fast, with jumps in major versions; this worries me if there isn't at least a Python wrapper keeping up with it.
As a comment, LMDB seems much easier to install (it just worked with a pip install). For testing, I am currently setting a very big map size to begin with (1 TB; I don't know if this impacts performance). Also, as a final comment: these libraries seem to be faster when dealing with a lot of small objects. However, maybe the performance of the implementation suggested here is enough for the goals of AiiDA? Also, we'll need to double-check how concurrency is managed: the current implementation provides fully parallel writing (at least for loose objects). I'm not sure whether these libraries need to resort to some level of locking in the end, which might turn out to be problematic in production scenarios.
Performance tests performed by @ramirezfranciscof show that while LevelDB is much faster for very small objects (below the page size, ~4 kB), for which it is designed, it loses all its advantage for larger objects, which is our use case. Maybe @ramirezfranciscof could report here a very short summary of the benchmark comparison between this implementation (stating which version) and LevelDB.
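The small-versus-large-object effect described above can be probed with a micro-benchmark. The following is an illustrative harness for the SQLite side only, using the stdlib `sqlite3` module; it is not the benchmark actually run by @ramirezfranciscof, and the sizes and counts are arbitrary.

```python
import os
import sqlite3
import tempfile
import time

def bench_sqlite(n_objects, obj_size):
    """Time writing and then reading back n_objects blobs of obj_size bytes
    through SQLite. Purely illustrative; not the project's benchmark."""
    path = os.path.join(tempfile.mkdtemp(), "bench.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
    payload = os.urandom(obj_size)

    start = time.perf_counter()
    with conn:  # single transaction for all inserts
        conn.executemany(
            "INSERT INTO kv VALUES (?, ?)",
            ((f"key-{i}", payload) for i in range(n_objects)),
        )
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n_objects):
        conn.execute("SELECT v FROM kv WHERE k = ?", (f"key-{i}",)).fetchone()
    read_s = time.perf_counter() - start
    conn.close()
    return write_s, read_s

# Compare objects well below and well above the ~4 kB page size.
for size in (100, 100_000):
    w, r = bench_sqlite(1000, size)
    print(f"{size:>7} B: write {w:.3f} s, read {r:.3f} s")
```

For large blobs the time is dominated by moving the payload itself, which is why the per-key lookup advantage of LSM-tree stores matters less in this use case.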
OK, I have some good results that I report here. These were run on a big container, with 6714808 objects (SDB by @sphuber). I also report some timings to give a sense of the advantage of this library.
The results above therefore suggest that any rsync command would take 2+ days, even if there is nothing to do: even just checking which files exist takes that long.
(3 hours), and the second time it goes super fast: even after dropping the disk caches, if there is nothing to do it takes 1 s:
and it would be relatively fast on additions, since only the last pack would change. Also, to get an idea of the read speed of the disk, after dropping the caches with
I think that with these statistics we have demonstrated the space savings of the new approach, so we can safely close this issue. Also, in terms of speed overhead, I refer to #70 for a speed assessment (and possible code improvements).
Regarding the implementation in
- Measure the overhead of the SQLite DBs (this is already measured by the `get_total_size` method).
- Check the cost per object with a few tests of different sizes, and then report them together with the other tests (see also #3 and #10).
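One way to estimate the index-DB overhead per object is to build an index with N rows and divide the resulting file size by N. The sketch below uses a hypothetical schema (hash key plus pack/offset/length integers), not the package's actual one, and the stdlib `sqlite3` module; the real package reports sizes via its `get_total_size` method.

```python
import os
import sqlite3
import tempfile

def index_overhead_per_object(n_objects):
    """Create a throwaway SQLite index with n_objects rows and return the
    average on-disk bytes per row (illustrative schema, not the real one)."""
    path = os.path.join(tempfile.mkdtemp(), "index.sqlite")
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE objects (hashkey TEXT PRIMARY KEY, "
        "pack INTEGER, offset INTEGER, length INTEGER)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO objects VALUES (?, ?, ?, ?)",
            # 64-hex-char keys mimic sha256 hex digests.
            ((f"{i:064x}", 0, i * 4096, 4096) for i in range(n_objects)),
        )
    conn.close()
    return os.path.getsize(path) / n_objects

overhead = index_overhead_per_object(10_000)
print(f"~{overhead:.0f} bytes of index per object")
```

Multiplying the per-object figure by the object count of a real repository gives a rough upper bound on the index DB's disk footprint.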