New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement unique experiment identifiers #902
Conversation
Update flattening of data and multi-dataset handling
Allow -1 id entries in the table
Tidy up scaling files, update selecting function name.
Looking at this now. Will try side-by-side comparison to get a sense of the real differences... |
certainly looks like a UUID 🙂 |
OK, hypothesis test:
Gives:
(actually:
whilst
i.e. in order
gives
i.e. more to do... the scans are not assigned id's when imported by |
Also that xia2 job failed on the second sweep with
=> will try some manual runs |
Sorry, should have made it clearer that I've only added the order resolving to dials.scale at the moment, but appreciate that is not very useful for testing, so will add to all programs now. The second thing with integrate is probably a real issue, will look into. |
Indexing should now be fixed, unsure about the integration issue at the moment |
OK, @jbeilstenedmands we are now cooking on gas 🙂 |
@jbeilstenedmands thanks for the updates - now behaves in the way I would expect... properly testing now |
Worked through some data - joint indexing works as it used to behaviour wise, everything else makes sense so e.g.
looks right |
A thought, do we ever use an imported.expt or strong.refl with a post-indexing datafile, as this will not work as the identifiers are different? |
You can re-index a strong reflection file with an indexed experiment...
which apparently still works |
Okay, perhaps index is a bit special for this kind of use as it assigns new identifiers anyway, do the identifiers in the indexed.{expt,refl} files match in this case? |
Time passes....
computer says yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the recent changes this now behaves like I would expect (without challenging by e.g. switching order of input files around)
Unexpectedly it also works when I switch around the order of input files:
input {
experiments = ../one/imported.expt
experiments = ../two/imported.expt
reflections = ../two/strong.refl
reflections = ../one/strong.refl
}
and gives numerically identical output:
Grey-Area shuffle :) $ diff dials.index.log ../dials.index.log
4a5,7
> output {
> split_experiments = True
> }
6,9c9,12
< experiments = ../one/imported.expt
< experiments = ../two/imported.expt
< reflections = ../two/strong.refl
< reflections = ../one/strong.refl
---
> experiments = one/imported.expt
> experiments = two/imported.expt
> reflections = one/strong.refl
> reflections = two/strong.refl
720a724
> Splitting experiments before output
So to me this delivers on the original promise of using unique identifiers.
I would like to also see at the least feedback from @ndevenish and @dagewa for technical and API user feedback. As an end user this does what I want and makes sense => 👍
... that said
then
Only gives you one set of indexed reflections in So I think |
On this last one... I get the same behaviour with |
Putative fix for the last one in #906 |
reflections = flatten_reflections(params.input.reflections) | ||
experiments = flatten_experiments(params.input.experiments) | ||
reflections, experiments = reflections_and_experiments_from_files( | ||
params.input.reflections, params.input.experiments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering about this pattern.
Is there any benefit in exposing params.input.reflections/
params.input.experimentsto the application at all, or could/should this be integrated into
params.input/the
parse_args` call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good suggestion but more wide-ranging - possibly a follow up pull request?
If |
dials/command_line/combine_experiments.py Line 529 in 7a239f2
but does nothing with experiment_identifiers . Is this a problem?
|
@graeme-winter I see. I think we need to figure out this interesting/unexpected behaviour and exercise it in some new tests. I wonder if some of these problems go away if reflection tables always kept their I'd like it if we could work with reflection tables the way we always have done with the additional bookkeeping all being done behind the scenes. I'm not worried about the replacement of |
Codecov Report
@@ Coverage Diff @@
## master #902 +/- ##
==========================================
- Coverage 67.33% 67.28% -0.06%
==========================================
Files 608 605 -3
Lines 69294 69166 -128
Branches 8465 8454 -11
==========================================
- Hits 46659 46537 -122
+ Misses 20798 20793 -5
+ Partials 1837 1836 -1
Continue to review full report at Codecov.
|
@dagewa yes I think we should decorate the reflection table select method to discreetly update the identifiers, good idea |
This is a noninvasive section from previously-approved PR #902
Overview
This pull request implements the generation and use of unique experiment identifiers throughout the dials processing workflow. The motivation behind this is to allow matching of experiments and reflection tables without relying on the ordering of these objects (either on loading data or during processing. This feature has already proved useful in
dials.scale
anddials.cosym
).These identifiers (a uuid string by default) are stored in
experiment.identifier
, and thereflection_table.experiment_identifiers()
map is used to record the numerical_id -> identifier mapping.When a new experiment is created, a unique identifier is generated - this only happens in
dials.import
anddials.index
. When a reflection table is created, the experiment_identifiers map is populated - this happens in spot-finding, indexing and integration. Other programs in the standard dials workflow add to the existing table, so the mapping is preserved.So far I have only implemented and tested the uuid identifier generation for the sweeps case. For the stills case, I have added in function calls in the relevant locations in
stills_indexer
(which would currently generate a uuid); I invite @asmit3 and collaborators to add to this PR to modify this as you wish as I am unfamiliar with the ins and outs of the stills_process workflow.People can probably ignore the changes to the scaling files if reviewing this. I shall continue to review the dials programs for the 'potential issue' (see below). However, the tests pass locally for me and wanted to get this PR out there for feedback.
Detailed description on behaviour changes
The behavioural changes are primarily to help programs which handle the input of multiple datasets in separate files - which will likely have duplicate numerical ids in the reflection tables.
To implement safe handling of data on loading, regardless of order, there is a new function to perform the role of
flatten_experiments
andflatten_refections
, which also reorders the reflection tables to match the experiments order using the experiment identifiers;refls, expts = reflections_and_experiments_from_files(refl_file_objs, expt_file_objs)
(note that the id columns in the tables are not renumbered at this point. This was something I introduced recently in
flatten_reflections
for flatten_reflections should wrangle experiment identifiers gracefully #656 but agree that this functionality should occur elsewhere, see below).So far I have only replaced the flatten functions in
dials.scale
, but this can be rolled out to other programs if people are happy about this.To ensure that multiple reflection tables can be joined together, one must ensure that the column id values do not clash. To do this, one can use the
renumber_table_id_columns(reflection_tables)
function indials.util.multi_dataset_handling
. There is also asplit_reflection_tables_on_ids(reflection_tables)
function for splitting multi-dataset tables on ids.Therefore for programs that handle reflection tables with multiple datasets, the following calls are the suggested way to load the data, to provide a list of experiments and reflections with different numerical ids:
Potential issue
One problematic aspect is that performing a selection on a multi-dataset reflection table based on the "id" column may no longer be safe, as the identifier map is not updated (which could be an issue if trying to recombine the tables). Therefore code like this, to select data from the 0th experiment;
new_refl = refl.select(refl["id"] == 0)
should be replaced with one of
or should be modified to
I'm continuing to check the repository for places where this could be an issue.