Float and sort all imports #84
Conversation
Looks alright, apart from the change in … Good effort. |
Yes, … |
This pull request increases the number of imports required to read a file (in this case a full CBF image) from 407 to 439:
Given how important reducing the number of imports is on large computing clusters, I'd like to know whether there is a way we can satisfy PEP 8 without increasing the number of imports. For completeness, here is the set of new imports:
|
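The import count quoted above can be reproduced with a small snippet (a sketch, not from the PR): the difference in `sys.modules` before and after an import shows how many new modules one import statement pulls in. Here `decimal` stands in for the real dxtbx entry point, which is not assumed to be installed.

```python
import sys

before = set(sys.modules)
import decimal  # stand-in for "import dxtbx"; substitute the package under test
new = sorted(set(sys.modules) - before)

print(f"{len(new)} new modules imported")
for name in new:
    print(" ", name)
```

Running this in a fresh interpreter gives the per-import module count; comparing the totals before and after a change gives exactly the 407 vs. 439 style numbers quoted above.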
Do we know that this causes anything more than a trivial increase in the time taken to access the data? It is worth noting that this is not a usual way of loading data, though it is a neat way of demonstrating the imports. It also highlights to me that we carry a lot of "historical baggage" around, namely the …

I would finally ask whether we can demonstrate that these imports are a significant fraction of the run time of a full analysis of a real data set. To my knowledge we don't scatter many thousands of jobs that each independently spin up to load a single frame, as your example demonstrates; we usually do more substantial analysis across multiple frames, in which case the presence or absence of imports becomes a trivial contribution to the overall time.

Looking at this more thoroughly by performing an actual task (in this case, …), the time is dominated by …, after doing … before running the find-spots step a couple of times to ensure no compilation time is recorded. So, while the number of files inspected has gone up, I would assert that the real-life effect is close to unmeasurable compared with performing any analysis at all, and the benefits of clearly written code outweigh it. Thanks to @Anthchirp for preparing this response. |
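As a side note on methodology (a sketch, not part of the discussion above): CPython's `-X importtime` flag logs every import with its self and cumulative time, which makes it easy to count and time the modules a single import touches without instrumenting any code.

```python
import subprocess
import sys

# run a child interpreter so the measurement starts from a cold sys.modules
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True, text=True, check=True,
)
# each report line on stderr looks like:
#   "import time: self [us] | cumulative | imported package"
rows = [l for l in result.stderr.splitlines() if l.startswith("import time:")]
print(f"'import json' touched {len(rows)} modules")
```

Sorting those rows by the cumulative column quickly shows which imports actually dominate the startup cost.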
It's probably fine; at small scale it probably doesn't matter. Just FYI, we have had issues with imports at large scale (thousands of cores, huge amounts of data). We've even had live processing killed by this issue, as all the cores tried to hit the Python files at once. To be fair, probably no amount of slimming down the number of imports would have helped (we eventually had to distribute the filesystem metadata to the compute nodes). But because of that experience, I tend to inline my imports as much as possible. |
Without being able to analyse the specific situation it's a guess, but that sort of thing is probably better dealt with by other methods that lighten the central load: random delays at startup for flood control so that not every job hits the filesystem at once, zipped packages, or, as you say you solved the problem, architecting the cluster to match the use case so you can forget about it. Large parts of the code are quite interdependent, so without a large active effort to disentangle modules I'm not convinced there is much to be gained on that front. Faster startup for very short-lived or user-facing jobs that need to turn around and give error feedback quickly is, in my opinion, a reasonable reason to care about this, but it's probably easier to do that closer to the root, e.g. directly in the entry module; it's a lot harder to cut down a tree if you start by pulling off every leaf. |
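The flood-control idea mentioned here can be as simple as a randomised delay before the heavy imports, so thousands of simultaneously scheduled jobs don't all stat the same Python files in the same second. A minimal sketch (the jitter ceiling is an arbitrary assumption, not a recommendation from this thread):

```python
import random
import time

MAX_JITTER = 0.01  # seconds; a real cluster job might use 30 or more
time.sleep(random.uniform(0, MAX_JITTER))  # stagger shared-filesystem access

# ... heavy imports and the actual work go here ...
print("started after jitter")
```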
It’s certainly true that there are things we could do to help the job start up problem. Actually sorting the build system to produce a small number of library files containing all the object code would help, which I think would also simplify conda distribution. Thinking along the lines of libdials.so etc.
Having zip files full of the python would also help, so that the overall number of file system operations required to start any program would be massively reduced. We started looking at this here then found that absolute imports helped enough that we could move on.
While both of these would help in massively parallel environments fixing them up is a fair chunk of work and a long game. Possibly one worth considering but will require a long hard look at our build system at least!
Anyhow it’s good we understand the sensitivities behind the problem. Sounds like this is not such a problem today though.
|
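The "zip files full of the python" idea is already supported by stock CPython: any zip archive placed on `sys.path` becomes importable through the built-in zipimport machinery, turning many small file opens into one archive read. A minimal sketch (the `greet` module name is made up for the demo):

```python
import os
import sys
import tempfile
import zipfile

# bundle a tiny module into a zip archive
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "bundle.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("greet.py", "def hello():\n    return 'hi'\n")

# putting the archive on sys.path makes its contents importable
sys.path.insert(0, archive)
import greet

print(greet.hello())  # -> hi
```

This is the same mechanism that `zipapp` and many deployment tools build on; at cluster scale the win is that starting a program costs one file open instead of hundreds.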
👍
|
Minor note on import time and file operations in parallel environments: if you are in a multiprocessing environment then (assuming a POSIX model) you will want to get your necessary imports done as early as possible. The reason is that imports made before the fork() simply persist, but imports made after fork() have to be done separately in each process. |
Sure, that makes sense. However most of our big work, where I think imports are an issue, is on MPI, and I don't think that applies: every MPI rank loads its own imports.
|
This configures cctbx packages to be treated as a separate block: they will be sorted after third-party packages, but before e.g. dxtbx and dials. Otherwise this uses the black-recommended isort configuration.
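The separate-block behaviour described here maps onto isort's custom-sections mechanism. A sketch of what the root `pyproject.toml` entry could look like with modern isort (the exact package list is an assumption, not copied from the PR; `profile = "black"` is the isort ≥5 shorthand for the black-recommended settings):

```toml
[tool.isort]
profile = "black"
# treat cctbx packages as their own block, between third-party and first-party
known_cctbx = ["cctbx", "iotbx", "libtbx", "scitbx"]
sections = ["FUTURE", "STDLIB", "THIRDPARTY", "CCTBX", "FIRSTPARTY", "LOCALFOLDER"]
```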
Done automatically with a custom import-float script (ignoring anything within an `if` block, a `try` block, or separately commented). Sorted afterwards with isort.
e.g. imports inside an `if` block, a `try` block, or commented out (in which case they might be deliberate)
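A minimal sketch of how such a float script could find candidates with the standard `ast` module (an illustration, not the PR's actual script): it records inline imports but skips any whose ancestors include an `if` or `try`, since those may be deliberate guards.

```python
import ast

SOURCE = """\
import os

def load():
    import json        # inline: candidate to float to the top
    return json.loads("{}")

try:
    import lz4         # guarded: may be deliberately optional
except ImportError:
    lz4 = None
"""

tree = ast.parse(SOURCE)

# record each node's parent so we can walk upwards from an import
parents = {}
for node in ast.walk(tree):
    for child in ast.iter_child_nodes(node):
        parents[child] = node

def floatable(node):
    # skip anything nested under an if or try; everything else can float
    p = parents.get(node)
    while p is not None:
        if isinstance(p, (ast.If, ast.Try)):
            return False
        p = parents.get(p)
    return True

candidates = [
    node for node in ast.walk(tree)
    if isinstance(node, (ast.Import, ast.ImportFrom))
    and parents[node] is not tree      # already top-level: nothing to do
    and floatable(node)
]
for node in candidates:
    print("float line", node.lineno, [alias.name for alias in node.names])
```

On this sample only the `json` import inside the function is reported; the top-level `os` import and the `try`-guarded `lz4` import are left alone, matching the exclusions described above.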
Although it doesn't _currently_ seem to be causing test failures, historically this was doing something important. Because it's core, leave the behaviour as-is and investigate later.
This patch floats all inline imports in dxtbx to the top of the file, then sorts them with `isort`. The imports are sorted with a custom ordering such that all `cctbx_project` imports are clearly separated into their own block. This has been configured in the root `pyproject.toml`, with the aim of integrating `isort` into the pre-commit hook in the future. `isort` now runs cleanly on the repository.

The imports were floated with a custom script (in the commit "Sort and Float Imports") which ignores imports that are commented, inside an `if` block, or inside a `try` block. The cases that didn't get automatically floated were then manually inspected and moved if appropriate. There are a few highly circular imports in dxtbx that would be more complicated to remove, mainly involving the ImageSet and Format modules, so I've manually reinserted them for now, with an explicit note why they are inline. There are also a couple of other cases that I manually fixed and made robust to e.g. missing imports like `lz4` and `bitshuffle`, which aren't present in conda.

Tests run and pass on Python 2 and Python 3 installs.
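Handling missing imports like these usually follows the standard optional-dependency pattern (a sketch; the function name is invented for illustration, not taken from dxtbx): attempt the import at the top, record the failure, and raise a clear error only when the feature is actually used.

```python
# optional compression backend; absent in some conda environments
try:
    import lz4.frame as lz4_frame
except ImportError:
    lz4_frame = None  # noted at import time, reported at use time

def decompress_lz4(blob):
    """Decompress an LZ4-framed chunk, failing clearly if lz4 is missing."""
    if lz4_frame is None:
        raise RuntimeError("reading this data requires the lz4 package")
    return lz4_frame.decompress(blob)

print("lz4 available:", lz4_frame is not None)
```

This keeps the import at the top of the file (so the sorter is happy) while still letting the module load in environments where the backend isn't installed.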