
export archive format forces reading of data.json file to memory #493

Closed
fernandogargiulo1986 opened this issue Mar 29, 2017 · 4 comments · Fixed by #5145

@fernandogargiulo1986 (Contributor) commented Mar 29, 2017

It seems to me that when doing a verdi import, such as

verdi import <export_file>.aiida

the content of <export_file>.aiida is first cached in RAM, if not in its entirety, then at least the database portion. As a consequence, the largest file that can be imported is effectively limited by the amount of free RAM available at that moment.
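
For illustration, a minimal sketch of the pattern that causes this (not AiiDA's actual import code; the zip-based archive layout and file names are assumptions): decoding data.json with a plain json.load materializes the entire database content as Python objects, so peak memory grows with the archive size.

import json
import zipfile

# Hypothetical illustration: assuming the .aiida archive is a zip file
# containing data.json, json.load decodes the whole file into Python
# objects at once, so everything ends up in RAM.
with zipfile.ZipFile('export_file.aiida') as archive:
    with archive.open('data.json') as handle:
        data = json.load(handle)  # entire database content now held in memory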

@fernandogargiulo1986 fernandogargiulo1986 changed the title verdi import uses a lot of RAM verdi import is voracious of RAM Mar 30, 2017
@giovannipizzi giovannipizzi added this to To do in Import/export via automation Dec 3, 2018
@giovannipizzi giovannipizzi added this to the v1.1.0 milestone Dec 3, 2018
@ltalirz (Member) commented Jan 14, 2020

I started looking into this issue as well.

It turns out that while JSON is not really designed for streaming, in practice people do have to deal with large JSON files, and there are a couple of JSON stream-parsing libraries.
These would allow us to solve the memory issue without (or before) completely revising the export format.

The one I'd probably go for is ijson; the following shows how to iterate over the groups array in the metadata.json file:

import ijson

# Stream the Group UUIDs from metadata.json without loading the whole file
with open('metadata.json', 'rb') as handle:
    for uuid in ijson.items(handle, 'export_parameters.entities_starting_set.Group.item'):
        print(uuid)

It has built-in support for iterating over list items at arbitrary places in the JSON via the prefix notation above; however, for some reason, iterating over the keys of a dictionary (which is what we need for data.json) is not built in.

I'll look into how to do this if I find the time.

@ltalirz (Member) commented Jan 21, 2020

Support has been added in ICRAR/ijson@d4cca87, i.e. one can now do:

import ijson

# Stream the node attributes from data.json one key/value pair at a time
with open('data.json', 'rb') as handle:
    for key, value in ijson.kvitems(handle, 'node_attributes'):
        print(key, value)

In my test, the pure-Python implementation takes about 0.2 ms per key/value pair (i.e. per node).
However, the package's wheels come with a pre-built C extension for many platforms that takes about 0.01 ms per key/value pair, which puts it within a factor of 2 of calling json.load in terms of time per key/value pair; pretty fast!
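
For reference, a rough sketch of how such per-pair timings could be measured (not the exact benchmark used here; it assumes the same data.json layout as in the snippet above):

import json
import time

import ijson

def time_kvitems(path, prefix='node_attributes'):
    # Average time per key/value pair when streaming with ijson
    with open(path, 'rb') as handle:
        start = time.perf_counter()
        count = sum(1 for _ in ijson.kvitems(handle, prefix))
    return (time.perf_counter() - start) / max(count, 1)

def time_json_load(path, prefix='node_attributes'):
    # Average time per key/value pair when loading the whole file at once
    with open(path, 'r') as handle:
        start = time.perf_counter()
        count = len(json.load(handle)[prefix])
    return (time.perf_counter() - start) / max(count, 1)

print('ijson    :', time_kvitems('data.json'), 's per key/value pair')
print('json.load:', time_json_load('data.json'), 's per key/value pair')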

There's one more issue: the layout of AiiDA's data.json file splits node columns, attributes and extras into three separate lists.
That makes it quite difficult to get a "batch of nodes": you need at least three iterators, and it is not even clear whether all three lists always have the same length and order.

There are several possible ways forward:

A) We stick with the current format and try to implement a (suboptimal and slightly complex) batch parser with three iterators going at once (a rough sketch follows after this list).

B) We change the layout of data.json to something that can be sensibly parsed as a stream - e.g. simply move node_extras and node_attributes inside the Node dict.

C) We switch to a new file format that is made for "seeking" and "slicing" (like HDF5)
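
To make option A concrete, here is a rough sketch (not a worked-out implementation): three independent ijson iterators over the same file, walked in parallel. The prefixes 'export_data.Node', 'node_attributes' and 'node_extras' are assumptions about the data.json layout, and the sketch further assumes all three mappings share the same key order, which, as noted above, is not guaranteed.

import itertools

import ijson

def iter_node_batches(path, batch_size=100):
    # Three independent file handles, one per top-level mapping (prefixes are assumed)
    with open(path, 'rb') as f_fields, open(path, 'rb') as f_attrs, open(path, 'rb') as f_extras:
        fields = ijson.kvitems(f_fields, 'export_data.Node')
        attrs = ijson.kvitems(f_attrs, 'node_attributes')
        extras = ijson.kvitems(f_extras, 'node_extras')
        # Walk the three mappings in lockstep; assumes identical key order (not guaranteed!)
        triples = zip(fields, attrs, extras)
        while True:
            batch = list(itertools.islice(triples, batch_size))
            if not batch:
                break
            yield [
                {'pk': pk, **columns, 'attributes': node_attrs, 'extras': node_extras}
                for (pk, columns), (_, node_attrs), (_, node_extras) in batch
            ]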

Mentioning @giovannipizzi @sphuber for comment

@giovannipizzi (Member) commented:
I would definitely go for B, but the redesign of the export format should possibly take into account other requirements as well, rather than just this one (and could well end up in a solution like C). If, however, implementing B is fast while C would (probably) take longer, it's fine to start working on it.

@sphuber sphuber removed this from the v1.1.0 milestone Feb 28, 2020
@ltalirz ltalirz changed the title verdi import is voracious of RAM export archive format forces reading of data.json file to memory Apr 15, 2020
@ltalirz (Member) commented Jun 26, 2020

Just as a comment: there is a new format called JSON Lines (or "newline-delimited JSON"), which is essentially one JSON object per line and is suitable for storing large numbers of nested data structures in a single file (Google adopted it for BigQuery).
It might be worth considering this format when going with option B); a minimal illustration follows below.
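
For illustration, a minimal sketch of the JSON Lines idea (the node layout here is hypothetical, not a proposed AiiDA schema): each line is a self-contained JSON object, so the file can be written and read one record at a time.

import json

# Write hypothetical node records, one self-contained JSON object per line
with open('nodes.jsonl', 'w') as handle:
    for node in [{'uuid': 'aaaa', 'attributes': {'x': 1}},
                 {'uuid': 'bbbb', 'attributes': {'x': 2}}]:
        handle.write(json.dumps(node) + '\n')

# Read back: only one record is held in memory at a time
with open('nodes.jsonl') as handle:
    for line in handle:
        node = json.loads(line)
        print(node['uuid'])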

See also discussion on "Loading Data Efficiently" for binary file formats to consider in option C)

@chrisjsewell chrisjsewell linked a pull request Sep 24, 2021 that will close this issue
Import/export automation moved this from To do to Done Dec 1, 2021