
Benchmark and document the overhead of the SQLite index DB #16

Closed · giovannipizzi opened this issue Apr 13, 2020 · 10 comments

@giovannipizzi

Measure the overhead of the SQLite DBs (this is already measured by the get_total_size method).

Check the cost per object with a few tests of different sizes, and then report the results together with the other tests (see also #3 and #10).
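A rough benchmarking sketch could look like the following (the Container entry point, init_container(), add_object() and pack_all_loose() names are assumptions about this package's API and should be adjusted to the actual interface; get_total_size is the method mentioned above):

```python
import os

# Assumption: the package exposes a Container class as its main entry point.
from disk_objectstore import Container

NUM_OBJECTS = 10_000
OBJECT_SIZE = 4_096  # bytes; repeat with different sizes to estimate the per-object cost

container = Container("/tmp/test-container")
container.init_container(clear=True)  # assumed initialisation call

for _ in range(NUM_OBJECTS):
    container.add_object(os.urandom(OBJECT_SIZE))

# Pack the loose objects so that they get entries in the SQLite index
# (method name is an assumption).
container.pack_all_loose()

# get_total_size() is mentioned above as already reporting the DB overhead:
# inspect its output and divide the index size by NUM_OBJECTS to get a
# per-object figure.
print(container.get_total_size())
```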

@dev-zero commented Apr 18, 2020

Since this looks at first glance like a key/value store, did you consider something more lightweight, like LevelDB (or RocksDB) or LMDB? When it comes to pure key-lookup performance, they are likely to outperform SQLite. Both of them are available in Anaconda.

Also, using an ORM layer (even though convenient) seems a bit too much for this use case.

@giovannipizzi

I wasn't aware of these, I'll take a look into them, also in terms of whether they require a running server, and of their resilience to failures (hard-drive failures, machine reboots, etc.).
I agree that the performance will be better, but at the moment it seems acceptable for the AiiDA use cases we have today (it might not be in the future, of course).

When you refer to the ORM layer, do you mean the use of SQLAlchemy? I.e., should I just call directly into the SQLite library? Or something else? Indeed I did it mostly for convenience, as I know how SQLAlchemy works. I didn't check the overhead, and considering we already have SQLAlchemy in AiiDA I didn't think it was a problem, but I'm happy to rediscuss.

@dev-zero

> I wasn't aware of these, I'll take a look into them, also in terms of whether they require a running server, and of their resilience to failures (hard-drive failures, machine reboots, etc.).

None of them requires a server, and all of them should be resilient, since they are designed as embedded databases.

> When you refer to the ORM layer, do you mean the use of SQLAlchemy? I.e., should I just call directly into the SQLite library?

Yes, for a single operation like this I would probably avoid the ORM layer and talk to SQLite directly.
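As a rough sketch of what this suggestion amounts to (the table and column names below are hypothetical, not the ones used by this package), a key lookup via the standard-library sqlite3 module, without any ORM in between, could look like:

```python
import sqlite3

# Hypothetical schema: one table mapping a hash key to the location of the
# object inside a pack file.
conn = sqlite3.connect("index.sqlite")
conn.execute(
    "CREATE TABLE IF NOT EXISTS db_object ("
    "hashkey TEXT PRIMARY KEY, pack_id INTEGER, pack_offset INTEGER, size INTEGER)"
)

# Insert one entry and look it up by key.
conn.execute(
    "INSERT OR REPLACE INTO db_object VALUES (?, ?, ?, ?)",
    ("some-sha256-hash", 0, 1024, 4096),
)
conn.commit()

row = conn.execute(
    "SELECT pack_id, pack_offset, size FROM db_object WHERE hashkey = ?",
    ("some-sha256-hash",),
).fetchone()
print(row)
```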

> Or something else? Indeed I did it mostly for convenience, as I know how SQLAlchemy works. I didn't check the overhead, and considering we already have SQLAlchemy in AiiDA I didn't think it was a problem, but I'm happy to rediscuss.

Well, this is the sort of library that may be picked up by other projects as well; having SQLAlchemy as a dependency then makes less sense (and adds unneeded complexity).
On the other hand: since in AiiDA you already have PostgreSQL, could you reuse that one? (Either tie it tightly to the AiiDA ecosystem and profit from that, or don't tie it at all, if possible.)

@giovannipizzi

OK, if I decide to go in this direction, I'll drop SQLAlchemy before the production-ready release; it should be quick to do at the end.

In terms of the suggestions, I think LevelDB is not an option, because its README says:

Limitations

  • ...
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • ...

which is not OK for us (we have multiple daemon workers).

I'll look into the other two, which seem more promising in this respect.

@giovannipizzi commented Apr 21, 2020

I think we'll also have to drop RocksDB. The library itself seems to be actively developed by Facebook, but it lacks a robust and maintained Python interface.
In particular, there used to be an old Python wrapper, pyrocksdb, but its development stopped at version 0.4, in 2015.
It was later forked as python-rocksdb (which appears on PyPI under at least 3-4 different names, but always seems to link to the same GitHub repository). This fork actually does not have any new tag, even though there is a 0.7 version on PyPI. In any case, there are a few issues:

  • development seems to have stopped in the first half of 2019 anyway (apart from an experimental branch, which supports Python 3.7, and it's not clear which version of RocksDB it targets)
  • the version on pip officially supports only Python 3.4 (but I tested it on 3.7 and the basic functionality is OK), see the docs
  • what is more problematic, as stated in the docs, it supports RocksDB v5.3. Indeed, I could install it only with v5.18.3, but not with v6.7 (the current most recent release).

So, in conclusion, there does not seem to be a maintained library supporting recent versions of RocksDB.

In addition, RocksDB itself is not trivial to install: the apt packages on Ubuntu are old (on 16.04 it's still 4.1, for instance, and on 18.04 it's 5.8.8), mostly because development is happening very fast, with jumps in major versions, which worries me if there isn't at least a Python wrapper following it.
One needs to clone it and build it from source (on macOS I also tried with brew, but the compilation fails). I would try as much as possible to avoid a library that becomes a strong dependency of AiiDA and makes the installation (of AiiDA, specifically) even more complex than it is now.

@giovannipizzi

As a comment, LMDB seems much easier to install (it just worked with a pip install).
We'll give it a try for efficiency (even if I'm quite worried by the fact that, while initialising the repository, one needs to specify a map_size, i.e. the maximum total size of the DB).

There is a set_mapsize(map_size) function, whose doc says: "Change the maximum size of the map file. This function will fail if any transactions are active in the current process."

For testing, I am currently setting a very big map size to begin with (1TB; I don't know if this impacts performance).
But then we'll need to double-check how to safely increase the map size during real usage: it's going to be fragile in real-case scenarios, I fear. (A minimal usage sketch is below.)
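For reference, here is a minimal sketch of the lmdb Python API discussed above (the path and sizes are just examples):

```python
import lmdb

# Open (or create) an environment, reserving a large map_size up front
# (the maximum total size of the DB discussed above).
env = lmdb.open("/tmp/test-lmdb-repo", map_size=1024**4)  # 1 TB

# Simple key/value write and read, each inside its own transaction.
with env.begin(write=True) as txn:
    txn.put(b"some-sha256-hash", b"object content")

with env.begin() as txn:
    print(txn.get(b"some-sha256-hash"))

# Growing the map later; per the docs quoted above, this fails if any
# transaction is active in the current process.
env.set_mapsize(2 * 1024**4)
```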

Also, as a final comment: these libraries seem to be faster when dealing with a lot of small objects. However, maybe the performance of the implementation suggested here is enough for the goals of AiiDA? We'll also need to double-check how concurrency is managed: the current implementation provides fully parallel writing (at least for loose objects), and I'm not sure whether these libraries need to resort to some level of locking in the end, which might turn out to be problematic in production scenarios.

@giovannipizzi

Performance tests performed by @ramirezfranciscof show that, while LevelDB is much faster for very small objects (below the page size, ~4kB), for which it is designed, it loses all its advantage for larger objects, which are our use case. Maybe @ramirezfranciscof could report here a very short summary of the benchmark comparison between this implementation (stating which version) and LevelDB.

@giovannipizzi

OK, I have some good results to report here. These were run on a big container, with 6714808 objects (SDB by @sphuber).

I also report some timings to give a feeling of the advantage of this library.

  • Running du on the folder in the old AiiDA format took > 2 days (!). It reported 263GB.
  • Time to list all files to import (10744328 nodes): 170529.155 s (almost 2 days).
  • Time to read 16195149 files and then store them directly to packs: 192438.410 s (2 days, 5 hours)

The results above therefore suggest that any rsync command would take 2+ days, even if there is nothing to transfer, just to check which files exist.
In the new repo (40 files of ~40GB + 1 SQLite DB of ~1.2GB), rsync is relatively fast the first time:

sent 845 bytes  received 41,309,058,351 bytes  4,035,861.38 bytes/sec
total size is 170,071,626,707  speedup is 4.12

real time:    170m34.242s

(about 3 hours), and the second time it is super fast: even after dropping the disk caches, if there is nothing to do it takes about 1s:

receiving incremental file list

sent 23 bytes  received 965 bytes  395.20 bytes/sec
total size is 170,071,626,707  speedup is 172,137,273.99

real    0m1.339s

and it would be relatively fast on additions, since only the last pack would change.

Also, to get an idea of the read speed of the disk, after dropping the caches with sudo su -c "echo 3 > /proc/sys/vm/drop_caches", if I run time cat packs/* > /dev/null to force reading everything back from disk, I get:

real	1m55.257s
user	0m0.532s
sys	1m4.276s

  • Time to commit folder_meta for 10744328 nodes (note: only 7181854 of them have files) to PostgreSQL: 3149.467 s [this means that storing the file metadata in a nested dictionary is still longish; maybe we can optimise this, but that is not in this library, it depends on the AiiDA implementation - pinging @sphuber]

  • After converting, we get a SQLite DB of 1.2GB, and based on the sizes stored in it, the total size of the files before deduplication is 157GB (using 1024 bytes per KB, 1024 KB per MB, ...), and after removing duplicates (via sha256 hashes) it is 94GB.

  • The DB therefore takes only 1.2GB. Before, we were instead "wasting" 263-157=106GB, occupied either by the inodes of folders, or by the space left at the end of the last block of each file. The block size in our case is 4KB, so we can estimate the wasted size by counting 4KB per folder and 2KB per file (assuming a random distribution of file sizes, so some will almost fill the last block, some will leave it almost empty, ...).

    • We have, in the legacy repo, 16'195'149 files, and about 36 million filesystem entries (inodes) in total, so about 36-16=20 million folders.
    • We estimate 20 million folders at 4KB each = 76GB, plus 16 million files, each of which on average occupies only half of its last block => 30GB, for a total of ~106GB: a very good estimate of what we 'waste'. And this does not count the fact that we had to reformat the HD and waste more space to accommodate the FS table! (See the small worked example after this list.)
  • If we consider deduplication, we save even more space on a real project.
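For reference, a small worked version of the wasted-space estimate above, using only the numbers reported in this comment (GB here means GiB, i.e. 1024^3 bytes):

```python
# Worked version of the wasted-space estimate above (1 GB = 1024**3 bytes here).
GB = 1024**3
KB = 1024

n_folders = 20_000_000   # ~36 million inodes minus ~16 million files
n_files = 16_195_149

folder_waste = n_folders * 4 * KB   # one 4KB block per folder
file_tail_waste = n_files * 2 * KB  # half a 4KB block per file, on average

print(f"folders:    {folder_waste / GB:.0f} GB")                      # ~76 GB
print(f"file tails: {file_tail_waste / GB:.0f} GB")                   # ~31 GB
print(f"total:      {(folder_waste + file_tail_waste) / GB:.0f} GB")  # ~107 GB, close to the measured 263-157=106 GB
```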

@giovannipizzi

I think that with these statistics we have demonstrated the space savings of the new approach, so we can safely close this issue.

Also, in terms of speed overhead, I refer to #70 for a speed assessment (and possible code improvements).

@sphuber commented Jul 15, 2020

Regarding the implementation in aiida-core: I have also used a JSONB column, but its structure is slightly different. I haven't performed any benchmarking yet; I was mostly focusing on getting a working implementation for now.
