Float and sort all imports #84

Merged (17 commits, Sep 12, 2019)
Conversation

ndevenish
Collaborator

This patch floats all inline imports in dxtbx to the top of the file, then sorts them with isort. The imports are sorted with a custom ordering such that all cctbx_project imports are clearly separated into their own block. This has been configured in the root pyproject.toml with the aim of integrating isort into the pre-commit hook in the future. isort now runs cleanly on the repository.
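For illustration, this is the kind of import ordering the configuration produces (a hand-written sketch, not a file taken verbatim from the repository): standard library first, then third-party packages, then a dedicated cctbx_project block, then dxtbx itself.

from __future__ import absolute_import, division, print_function

import os
import sys

import h5py

import libtbx.phil
from scitbx.array_family import flex

from dxtbx.format.Format import Format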

The imports were floated with a custom script (in the commit "Sort and Float Imports") which ignores imports that are commented, inside an if block or inside a try block. The cases that didn't get automatically floated were then manually inspected and moved if appropriate.
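As a rough sketch of the approach (this is not the actual script from the "Sort and Float Imports" commit), the floatable imports can be found by walking the AST and keeping only inline imports that are not guarded by an if or try block; commented-out imports never appear in the AST, so they are skipped automatically:

import ast

def inline_imports_to_float(source):
    """Return inline import nodes that are safe candidates for floating."""
    tree = ast.parse(source)
    # Record each node's parent so we can inspect the enclosing blocks.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            child._parent = parent

    def ancestors(node):
        while hasattr(node, "_parent"):
            node = node._parent
            yield node

    candidates = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.Import, ast.ImportFrom)):
            continue
        chain = list(ancestors(node))
        if isinstance(chain[0], ast.Module):
            continue  # already a top-level import
        if any(isinstance(a, (ast.If, ast.Try)) for a in chain):
            continue  # guarded import: may be deliberate, leave in place
        candidates.append(node)
    return candidates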

There are a few highly circular imports in dxtbx that would be more complicated to remove, mainly between the ImageSet and Format modules, so I've manually reinserted those for now, with an explicit note on why they are inline. There are also a couple of other cases that I fixed by hand and made robust, e.g. handling optional imports like lz4 and bitshuffle, which aren't present in conda.
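For the optional-dependency case, the pattern is roughly the following (a hedged sketch; the exact code in the patch may differ, and ensure_decompression_available is a hypothetical helper name):

try:
    import lz4
except ImportError:
    lz4 = None

try:
    import bitshuffle
except ImportError:
    bitshuffle = None

def ensure_decompression_available():
    # Fail with a clear message only when the optional packages are actually
    # needed, rather than at import time.
    if lz4 is None or bitshuffle is None:
        raise RuntimeError("lz4 and bitshuffle are required to read this data")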

Tests run and pass on Python 2 and Python 3 installs.

@Anthchirp
Member

Looks alright, apart from the change in dxtbx/ext.py, which I take it you are reverting right now.

Good effort.

@ndevenish
Collaborator Author

ndevenish commented Sep 11, 2019

Yes, ext.py was reverted in 93962b8 because it's not the right place to fiddle about with that.

@phyy-nx
Contributor

phyy-nx commented Sep 11, 2019

This pull request increases the number of imports required to read a file (in this case a full cbf) from 407 to 439:

$ cd dials_regression/image_examples/LCLS_cspad_nexus
$ libtbx.python -v -c "import dxtbx; dxtbx.load('idx-20130301060858401.cbf')" 2> tmp1
$ grep import tmp1 | sed 's/\#.*$//' | sort -u > t1
$ wc -l t1
407 t1
$ <checkout this branch>
$ libtbx.python -v -c "import dxtbx; dxtbx.load('idx-20130301060858401.cbf')" 2> tmp2
$ grep import tmp2 | sed 's/\#.*$//' | sort -u > t2
$ wc -l t2
439 t2

Given how important reducing the number of imports on large computing clusters is, I'd like to know if there is a way we can satisfy PEP 8 without increasing the number of imports.

For completeness, here is the set of new imports:

$ diff --changed-group-format='%>' --unchanged-group-format='' t1 t2
import cbflib_adaptbx
import cbflib_ext
import dxtbx.sweep_filenames
import gettext
import iotbx
import iotbx.detectors
import iotbx.detectors.adsc
import iotbx.detectors.adsc_module
import iotbx.detectors.bruker
import iotbx_detectors_bruker_ext
import iotbx.detectors.cbf
import iotbx.detectors.detectorbase
import iotbx.detectors.dtrek
import iotbx.detectors.edf
import iotbx.detectors.eiger
import iotbx.detectors.eiger_minicbf
import iotbx_detectors_ext
import iotbx.detectors.hamamatsu
import iotbx.detectors.macscience
import iotbx.detectors.mar
import iotbx.detectors.marIP
import iotbx.detectors.noir
import iotbx.detectors.pilatus_minicbf
import iotbx.detectors.pilatus_slice
import iotbx.detectors.raxis
import iotbx.detectors.raxisbase
import iotbx.detectors.raxis_nonsquare
import iotbx.detectors.saturn
import libtbx.option_parser
import libtbx.test_utils
import optparse
import Queue

@graeme-winter
Collaborator

graeme-winter commented Sep 11, 2019

Do we know that this has anything more than a trivial increase in time taken to access the data? It is noteworthy that this is not a usual way of loading data, though it is a neat way of demonstrating imports.

This also highlights to me that we carry a lot of "historical baggage" around, namely the iotbx.detectorbase, which I don't think we really use or have ever used in DIALS. This has frequently caused us issues when dealing with new instruments.

I would finally ask whether we can demonstrate that these imports are a significant fraction of the run time of performing the full analysis of a real data set, i.e. we don't (to my knowledge) scatter many thousands of jobs that independently spin up to load a single frame, as your example demonstrates; we usually do more substantial analysis across multiple frames, in which case I would suggest the presence or absence of imports becomes a trivial contribution to the overall time.

Looking at this more thoroughly by performing an actual task (in this case, find_spots), we see that the list of differences is smaller:

dxtbx.format.FormatCBFMultiTile
dxtbx.serialize.imageset
iotbx.detectors
iotbx.detectors.adsc
iotbx.detectors.adsc_minicbf
iotbx.detectors.adsc_module
iotbx.detectors.bruker
iotbx_detectors_bruker_ext
iotbx.detectors.cbf
iotbx.detectors.detectorbase
iotbx.detectors.dtrek
iotbx.detectors.edf
iotbx.detectors.eiger
iotbx.detectors.eiger_minicbf
iotbx_detectors_ext
iotbx.detectors.hamamatsu
iotbx.detectors.macscience
iotbx.detectors.mar
iotbx.detectors.marIP
iotbx.detectors.noir
iotbx.detectors.pilatus_minicbf
iotbx.detectors.pilatus_slice
iotbx.detectors.raxis
iotbx.detectors.raxisbase
iotbx.detectors.raxis_nonsquare
iotbx.detectors.saturn
iotbx.xds
iotbx.xds.xds_inp
iotbx.xds.xparm
rstbx.cftbx.coordinate_frame_converter

and is dominated by iotbx imports. When it comes down to it, the key point you are getting at is about timing, and it appears that the real effect on the wall clock is very modest indeed:

Old:
real    0m2.509s
user    0m1.230s
sys    0m0.481s
New:
real    0m2.452s
user    0m1.248s
sys    0m0.479s

doing

echo 3 | sudo /usr/bin/tee /proc/sys/vm/drop_caches

before running the find spots a couple of times, to ensure no compilation time is recorded.

So, while the number of files inspected has gone up, I would assert that the real-life effect of this is close to unmeasurable when compared with performing any analysis at all, and that the benefits of clearly written code outweigh it. Thanks to @Anthchirp for preparing this response.

@phyy-nx
Contributor

phyy-nx commented Sep 11, 2019

It's probably fine. At small scale it probably doesn't matter. Just FYI, we have had issues with imports at large scale (thousands of cores, huge amounts of data). We've even had live processing killed by this issue, as all the cores tried to hit the Python files at once. To be fair, probably no amount of slimming down the number of imports would have helped (we eventually had to distribute the filesystem metadata to the compute nodes). But because of that experience, I tend to inline my imports as much as possible.

@ndevenish
Collaborator Author

I mean, without being able to analyse the specific situation it's a guess, but that sort of thing is probably better dealt with by other methods to lighten the central load: random delays at startup for flood control so that not every job hits at once, zipped packages, or, as you say you solved the problem, architecting the cluster to match the use case so you can forget about it. Large parts of the code are pretty interdependent, so without large active efforts to disentangle modules I'm not convinced that there's much to be bought on that front.
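As a purely illustrative sketch of the flood-control idea, a job could sleep for a random fraction of a configurable window before paying for its heavy imports; the STARTUP_JITTER environment variable below is made up for the example:

import os
import random
import time

jitter = float(os.environ.get("STARTUP_JITTER", "0"))
if jitter > 0:
    # Stagger start-up so thousands of simultaneous jobs do not all hit the
    # shared filesystem with import traffic in the same instant.
    time.sleep(random.uniform(0.0, jitter))

import dxtbx  # heavy imports only begin once the staggered delay has passed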

Faster startup time for very short-lived or user-facing jobs that need to turn around and give error feedback quickly is, IMO, a reasonable reason to care about this, but it's probably easier to do that closer to the root, e.g. directly in the entry module; it's a lot harder to cut down a tree if you start by pulling off every leaf.
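To illustrate the "closer to the root" point, a hypothetical entry module might defer the heavy imports until after cheap argument handling, so usage errors and --help come back immediately (a sketch, not code from dxtbx):

import argparse
import sys

def main():
    parser = argparse.ArgumentParser(description="illustrative entry point")
    parser.add_argument("image", help="image file to process")
    args = parser.parse_args()

    # The expensive imports are only paid for once the arguments are valid.
    import dxtbx

    print(dxtbx.load(args.image))

if __name__ == "__main__":
    sys.exit(main())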

@graeme-winter
Collaborator

graeme-winter commented Sep 11, 2019 via email

@phyy-nx
Contributor

phyy-nx commented Sep 11, 2019 via email

@Anthchirp
Member

Minor note on import time and file operations in parallel environments:
If you are in a multiprocessing environment then (assuming a POSIX model) you will want to get your necessary imports done as early as possible. The reason for this is that imports before the fork() simply persist, but imports after fork() have to be done separately in each process.
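A minimal illustration of this (assuming the POSIX fork start method; numpy stands in here for a heavy dxtbx/cctbx import):

import multiprocessing

import numpy  # imported once in the parent, inherited by every forked worker

def work(value):
    # If numpy had only been imported inside this function, each worker
    # process would perform the import (and its filesystem lookups) itself.
    return numpy.sqrt(value)

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        print(pool.map(work, range(8)))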

@phyy-nx
Contributor

phyy-nx commented Sep 11, 2019 via email

Commit messages from the pull request timeline:

This configures cctbx packages to be treated as a separate block - they will be sorted after third-party, but before e.g. dxtbx and dials. This also uses the black-recommended isort configuration otherwise.

Done automatically with custom import float script (ignoring anything within an if, try or separately commented). Sorted afterwards with isort.
e.g. inside if, try, commented (in which case might be deliberate)

Although it doesn't _currently_ seem to be causing test failures, historically this was doing something important. Because it's core, leave behaviour as-is and leave until later to investigate.