
Archive export refactor (2) #4534

Merged: 34 commits into develop, Nov 12, 2020

Conversation

@chrisjsewell (Member) commented Oct 30, 2020

FYI @ltalirz, just a heads up, this is the final PR I alluded to (written on top of #4532).

Based on experience from the import refactor, it refactors the archive writer to be somewhat similar: using it as a context manager and "streaming" the data to its methods, as opposed to the current approach, where all the data is passed to it in a single dataclass.

The refactor also removes a few spurious database queries, merging them into the entity queries.

A --batch-size argument is also added to the CLI, to allow control of query batch sizing.

The implementation tentatively works for zip so far (as in, that is what I was running in the meeting demo), but requires a bit more work to clean up and apply to tar as well.
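Schematically, the streaming pattern looks like this (a minimal sketch; apart from `close` and `write_metadata`, which appear in the diff hunks below, the names are illustrative):

```python
from abc import ABC, abstractmethod

class ArchiveWriterAbstract(ABC):
    """Sketch of a writer that export data is streamed to (illustrative names)."""

    def __enter__(self):
        self.open()
        return self

    def __exit__(self, exctype, excval, exctb):
        # pass on whether the export process excepted, so the writer
        # can decide what to do with the (partial) output file
        self.close(excepted=exctype is not None)

    @abstractmethod
    def open(self):
        """Prepare the output resource for writing."""

    @abstractmethod
    def close(self, excepted: bool):
        """Finalize (or discard) the output."""

    @abstractmethod
    def write_metadata(self, data):
        """Write the archive metadata."""

    @abstractmethod
    def write_entity_data(self, name: str, fields: dict):
        """Write one entity's fields (called per row, streamed in query batches)."""
```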

@ltalirz (Member) commented Oct 30, 2020

Ok, I will check this after #4532 is merged.

A --batch-size argument is also added to the CLI, to allow control of query batch sizing.

Is this already impacting memory significantly?

If not, I suggest keeping this parameter in the Python API only for the moment and not exposing it on the CLI yet - we should limit the number of flags, and having the flag there raises expectations that we may not be able to meet (yet).
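For context, the batch size bounds how many query rows are held in memory at once; schematically (a hypothetical helper, not this PR's actual code):

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def iter_batches(rows: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield results in fixed-size batches, bounding peak memory to one batch."""
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```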

@ramirezfranciscof ramirezfranciscof added this to the v1.5.0 milestone Nov 2, 2020
@ltalirz (Member) left a comment

thanks @chrisjsewell
I started having a look through and adding comments; I then saw that it's still in draft status

let me know when I should re-review this

@@ -101,10 +101,18 @@ def inspect(archive, version, data, meta_data):
show_default=True,
help='Include or exclude comments for node(s) in export. (Will also export extra users who commented).'
)
@click.option(
@ltalirz (Member) commented Nov 3, 2020

See comment #4534 (comment)

@chrisjsewell (Member, Author) replied:

This allows for performance (CPU/memory) testing from the CLI, as we have done for other PRs (see below). I would just comment it out before merging; it is definitely something very useful, and will be beneficial if not now then for the new format.

@chrisjsewell (Member, Author) replied:

commented out

},
'conversion_info': export_data.metadata.conversion_info
}
def close(self, excepted: bool):
@ltalirz (Member) commented:

What happens with this bool?
Also, pylint does not complain about this unused argument; perhaps I'm missing something here...

@chrisjsewell (Member, Author) replied:

it allows the writer to decide whether it wants to actually write the output file, given that the export process excepted
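For illustration, one plausible way a writer could use the flag (a sketch assuming hypothetical `self._zipfile`/`self._filepath` attributes, not the PR's actual implementation):

```python
import os

class ZipArchiveWriter:  # hypothetical writer
    def close(self, excepted: bool):
        """Finalize the archive, discarding the partial file if the export excepted."""
        self._zipfile.close()
        if excepted and os.path.exists(self._filepath):
            os.remove(self._filepath)  # don't leave a half-written archive behind
```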

sharded_uuid = export_shard_uuid(uuid)
@abstractmethod
def write_metadata(self, data: ArchiveMetadata):
""" """
@ltalirz (Member) commented:

hehe, that's an interesting docstring. could use some love (also below)?

@chrisjsewell (Member, Author) commented:

I started having a look through and adding comments; I then saw that it's still in draft status

Yes indeed, patience lol.
The general structure of the export function I believe is finished, but I am still working on writers.py, since at this point it is a bit hacky, just to get an initial working implementation.

@chrisjsewell chrisjsewell marked this pull request as ready for review November 4, 2020 13:38
@chrisjsewell (Member, Author) commented Nov 4, 2020

@ltalirz this is now ready for review
(I have not yet run performance testing)

Note: ZipFolder has been replaced by ZipPath.
Really, I could also write a TarPath, but I don't think I have the time now.
This mirrors how the original code worked: for tar, it writes to a folder and then compresses it afterwards, rather than opening the tar and writing to it directly.
I'm not sure if there was a particular reason for that.
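For reference, the pathlib-style interface such a class exposes, in toy form (illustrative only; the real ZipPath in zip_path.py, later the archive-path package, is more complete):

```python
import zipfile
from pathlib import PurePosixPath

class ToyZipPath:
    """Toy pathlib-like interface over a writable zip (illustrative only)."""

    def __init__(self, filepath=None, mode="w", *, _zipfile=None, _at=""):
        # either open a new zipfile, or share the parent's handle
        self._zipfile = _zipfile or zipfile.ZipFile(filepath, mode=mode)
        self._at = _at  # current path inside the archive

    def __truediv__(self, name: str) -> "ToyZipPath":
        # build sub-paths with the `/` operator, like pathlib.Path
        return ToyZipPath(_zipfile=self._zipfile, _at=str(PurePosixPath(self._at) / name))

    def write_text(self, text: str):
        self._zipfile.writestr(self._at, text)

    def close(self):
        self._zipfile.close()

# usage: write directly into the zip, no intermediate folder or compression step
path = ToyZipPath("export.aiida", mode="w")
(path / "metadata.json").write_text("{}")
path.close()
```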

@codecov (bot) commented Nov 4, 2020

Codecov Report

Merging #4534 (86a81ab) into develop (008580e) will decrease coverage by 0.09%.
The diff coverage is 88.87%.


@@             Coverage Diff             @@
##           develop    #4534      +/-   ##
===========================================
- Coverage    79.50%   79.41%   -0.08%     
===========================================
  Files          482      481       -1     
  Lines        35325    35279      -46     
===========================================
- Hits         28083    28015      -68     
- Misses        7242     7264      +22     
Flag Coverage Δ
django 73.57% <88.87%> (-0.09%) ⬇️
sqlalchemy 72.75% <88.87%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
aiida/tools/importexport/common/archive.py 74.70% <ø> (-1.20%) ⬇️
aiida/tools/importexport/common/config.py 100.00% <ø> (ø)
aiida/cmdline/commands/cmd_export.py 90.99% <55.56%> (-3.03%) ⬇️
aiida/tools/importexport/archive/migrators.py 86.40% <68.00%> (-4.44%) ⬇️
aiida/tools/importexport/archive/readers.py 89.32% <72.73%> (-3.09%) ⬇️
aiida/tools/importexport/archive/writers.py 92.00% <91.01%> (-5.29%) ⬇️
aiida/tools/importexport/dbexport/__init__.py 97.40% <95.35%> (-0.69%) ⬇️
aiida/cmdline/commands/cmd_import.py 83.02% <100.00%> (+0.17%) ⬆️
aiida/tools/importexport/archive/common.py 76.05% <100.00%> (-3.51%) ⬇️
...ools/importexport/archive/migrations/v03_to_v04.py 91.16% <100.00%> (ø)
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 008580e...86a81ab.

@chrisjsewell (Member, Author) commented Nov 5, 2020

With a ~40,000-node repo: big speedup, a little more memory usage.

develop:

(base) aiida@088cc428b4c4:/$ cd tmp
(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105105553.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105105553.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_develop.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105105553.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105105553.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_develop.aiida
SYRUPY: Started at 2020-11-05 10:55:53.406105
SYRUPY: Ended at 2020-11-05 10:59:35.599326
SYRUPY: Total run time: 0 hour(s), 03 minute(s), 42.193221 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105105553.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_develop.png")'

[plot: RSS vs. elapsed time for develop - output_develop.png]

this PR:

(base) aiida@088cc428b4c4:/tmp$ syrupy.py -i 1 --separator=, --no-align verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Writing process resource usage samples to 'syrupy_20201105110111.ps.log'
SYRUPY: Writing raw process resource usage logs to 'syrupy_20201105110111.ps.raw'
SYRUPY: Executing command 'verdi export create -G 1 -- mount_folder/export_new.aiida'
SYRUPY: Redirecting command output stream to 'syrupy_20201105110111.out.log'
SYRUPY: Redirecting command error stream to 'syrupy_20201105110111.err.log'
SYRUPY: Completed running: verdi export create -G 1 -- mount_folder/export_new.aiida
SYRUPY: Started at 2020-11-05 11:01:11.163114
SYRUPY: Ended at 2020-11-05 11:03:34.289323
SYRUPY: Total run time: 0 hour(s), 02 minute(s), 23.126209 second(s)
(base) aiida@088cc428b4c4:/tmp$ python -c 'import pandas as pd; ax = pd.read_csv("syrupy_20201105110111.ps.log").set_index("ELAPSED").plot(y="RSS", grid=True); ax.get_figure().savefig("mount_folder/output_new.png")'

[plot: RSS vs. elapsed time for this PR - output_new.png]

(base) aiida@088cc428b4c4:/tmp$ verdi export create -G 1 --verbosity DEBUG -- mount_folder/export_new2.aiida
EXPORT
--------------  ------------------------------
Archive         mount_folder/export_new2.aiida
Format          JSON Zip (compression=8)
Export version  0.9

Inclusion rules
-----------------  ----
Include Comments   True
Include Logs       True

Traversal rules
---------------------------------  -----
Follow links input calc forwards   False
Follow links input calc backwards  True
Follow links create forwards       True
Follow links create backwards      True
Follow links return forwards       True
Follow links return backwards      False
Follow links input work forwards   False
Follow links input work backwards  True
Follow links call calc forwards    True
Follow links call calc backwards   True
Follow links call work forwards    True
Follow links call work backwards   True

STARTING EXPORT...
Collecting nodes in groups               100.0%|████████| 40065/40065
Traversing provenance via links ...      100.0%|████████| 1/1
WRITING METADATA...
Writing links                            100.0%|████████| 40064/40064
Building entity database queries         100.0%|████████| 4/4
Writing entity data                      100.0%|████████| 40066/40066
Writing group UUID -> [nodes UUIDs]
Exporting node repositories: 40065       100.0%|████████| 40065/40065
FINALIZING EXPORT...
Exported Entities:
  - Node  : 40065
  - User  : 1
  - Group : 1

Success: wrote the export archive file to mount_folder/export_new2.aiida

@ltalirz (Member) commented Nov 5, 2020

Thanks for the performance tests!

Do you know why the memory usage increases?

And what are the units here - are 30'000 = 30 GB? That's a lot (of course also before this PR)

@chrisjsewell (Member, Author) commented Nov 5, 2020

And what are the units here - are 30'000 = 30 GB? That's a lot (of course also before this PR)

I think you're missing a zero there lol; it's ~300'000, and the units are kilobytes, so ~0.3 GB.

@chrisjsewell (Member, Author) commented:

Do you know why the memory usage increases?

Could not say offhand

@ltalirz (Member) left a comment

thanks @chrisjsewell; looks good to me

@@ -47,7 +47,7 @@ class ArchiveMetadata:
# optional data
graph_traversal_rules: Optional[Dict[str, bool]] = dataclasses.field(default=None)
# Entity type -> UUID list
entities_starting_set: Optional[Dict[str, Set[str]]] = dataclasses.field(default=None)
entities_starting_set: Optional[Dict[str, List[str]]] = dataclasses.field(default=None)
@ltalirz (Member) commented:

I guess there is a good reason for this change; just mentioning that "set" is also in the name of the attribute...

@chrisjsewell (Member, Author) replied:

yeah, I went back and forth on this, because it has to be converted to a list before storing as JSON, so it was a little easier to convert before passing to the writer.
But you're right that the naming is now a little off.
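For context: Python sets are not JSON-serializable, so the UUIDs have to become lists somewhere, e.g.:

```python
import json

uuids = {"3a9a...", "7f2b..."}  # hypothetical set of entity UUIDs
# json.dumps({"Node": uuids})   # TypeError: Object of type set is not JSON serializable
json.dumps({"Node": sorted(uuids)})  # fine once converted to a (sorted) list
```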

@ltalirz (Member) commented Nov 5, 2020

I think you're missing a zero there lol; it's ~300'000, and the units are kilobytes, so ~0.3 GB.

Ah, right. Thanks, that is much more sensible.

Could not say off hand

Ok. Might be worth a quick look, since memory can be a hard limit, i.e. it may prevent people from exporting.

@chrisjsewell (Member, Author) commented Nov 10, 2020

I've reviewed the latest commits - some minor comments

thanks, I've responded to the comments

@chrisjsewell chrisjsewell linked an issue Nov 11, 2020 that may be closed by this pull request
@chrisjsewell (Member, Author) commented:

archive-path has just been merged into conda-forge 🎉

@chrisjsewell (Member, Author) commented Nov 12, 2020

FYI, in case it was not clear: the test coverage % is reduced because I've actually removed more code than I've added (including all the code for compression, which moved to archive-path), so fewer code lines are now tested.

@ltalirz (Member) commented Nov 12, 2020

FYI, in case it was not clear: the test coverage % is reduced because I've actually removed more code than I've added (including all the code for compression, which moved to archive-path), so fewer code lines are now tested.

Ok! Just a last suggestion to accept and we can merge

Co-authored-by: Leopold Talirz <leopold.talirz@gmail.com>
@chrisjsewell (Member, Author) replied:

Ok! Just a last suggestion to accept and we can merge

done, cheers

ltalirz previously approved these changes Nov 12, 2020
@ltalirz (Member) commented Nov 12, 2020

approved!

Successfully merging this pull request may close these issues.

Export file size. Big exports.