-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate config file and dict paths for substitution config #3887
Consolidate config file and dict paths for substitution config #3887
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just need to finish making the tests pass and rebase the "Fix formatting" commit :)
5b831ee
to
2a18f6c
Compare
src/clib/lib/enkf/queue_config.cpp
Outdated
@@ -55,7 +55,7 @@ static queue_config_type *queue_config_alloc_empty() { | |||
queue_config->driver_type = NULL_DRIVER; | |||
queue_config->user_mode = false; | |||
queue_config->max_submit = 2; // Default value | |||
queue_config->num_cpu = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change would be the correct if this code was following normal conventions. However, I think something non-standard is going on. num_cpu=0 is clearly not correct, but perhaps this is signalling some sort of "don't force the number of cpus" behavior. I think we will have to dig a bit to see what is going on before we make a change here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm gonna try to investigate a bit then! thanks for pointing it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found the first hint in src/ert/ensemble_evaluator/builder/_ensemble_builder.py
, in the function from_legacy
, lines 124 ff:
num_cpu = res_config.queue_config.num_cpu
if num_cpu == 0:
num_cpu = res_config.ecl_config.num_cpu
... with this change, we will never get the number of CPUs from the ECL config. the ecl config uses ecl_util_get_num_cpu
found in lib/ecl/ecl_util.cpp
, which tries to read the # of cpus from the DATA file, and returns a default value of 1 if it doesn't find nothing in the DATA file.
under what circumstances would num_cpu
be zero in queue config? the line I changed is in queue_config_alloc_empty
, which is called in queue_config_alloc_default
. after queue_config_alloc_empty
, there is a call to queue_config_init
, which actually tries to parse CPU_NUM
from the config content. So this is only the case if CPU_NUM
is not given in the config file!
as a short aside, the config dict path is different, it tries to assign the config dict value for NUM_CPU
directly. Wonder what happens if that's None
🤔
so let's say NUM_CPU
is not present in the config content, and thus in the config file, and say it is set in a DATA file, using whatever mechanism is expected there, so that ecl_config.num_cpu
is actually greater than 1. Then we have a problem, as the old behaviour was to prefer the ecl_config
number, but after this change, we would just go with 1
. So perhaps I should change the above lines into something like
queue_config_num_cpu = res_config.queue_config.num_cpu
res_config_num_cpu = res_config.ecl_config.num_cpu
if queue_config_num_cpu == 1 and res_config_num_cpu >= queue_config_num_cpu:
num_cpu = res_config_num_cpu
else:
num_cpu = queue_config_num_cpu
even though this is kind of verbose and maybe too complicated? but it might reflect the intended logic better?
this also makes me wonder why so far at least three different configs (subst, queue, ecl) need / use a number of CPUs. perhaps this particular issue is something that can be resolved down the line, during our current milestone, in a more meaningful way (then guessing what the logic should be here)...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we have the same pattern in src/ert/_c_wrappers/job_queue/queue.py
, in add_job_from_run_arg
:
num_cpu = res_config.queue_config.num_cpu
if num_cpu == 0:
num_cpu = res_config.ecl_config.num_cpu
what do you think about the solution I proposed above, @eivindjahren ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that seem to be all instances of usages of queue_config
's num_cpu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this code is signaling that 0 means use res_config.ecl_config.num_cpu
. That is not great, but I think we should clean that up in a separate PR. Isn't it possible to just recreate how this number is instantiated in this PR and then clean this particular behavior up separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll see if i can get away in this PR without those changes. i think something made me introduce them, but perhaps it was just confusion ; )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed that commit, and rather pushed it with changes as discussed here in #3919
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh and i fixed the test that was failing and triggered me to do the changes in question to work without those changes - basically just needed to make sure that the same value for NUM_CPU
was present both in config file and config dict (the former was not the case)
I believe we can merge everything except bb97391 which we will have to investigate further before we do anything. |
2a18f6c
to
78e8fea
Compare
@eivindjahren i think i have resolved the problematic commit (removed it and fixed the failing test), and all checks have passed (yeahy), so I would appreciate a hopefully final review - you may focus on the last four commits, if you wish, which can also be rebased to make things more compact |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm sorry that's the same commit twice, probably you meant two different commits here? ping @eivindjahren |
So I am thinking we should have "Write whole config dict for tests" and "Correct test config file writing" in the same commit and we drop "Refactoring in test" and |
78e8fea
to
9d8eb38
Compare
9d8eb38
to
0078a08
Compare
Codecov Report
@@ Coverage Diff @@
## main #3887 +/- ##
==========================================
+ Coverage 58.56% 58.59% +0.03%
==========================================
Files 545 545
Lines 41463 41447 -16
Branches 3765 3765
==========================================
+ Hits 24281 24287 +6
+ Misses 16181 16160 -21
+ Partials 1001 1000 -1
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
@@ -349,6 +349,7 @@ def test_res_config_dict_constructor(setup_case): | |||
# add a missing entries to config file | |||
with open("user_config.ert", "a+") as ert_file: | |||
ert_file.write("JOB_SCRIPT ../../../script.sh\n") | |||
ert_file.write("NUM_CPU 0\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entry is optional. Why do we need to add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because in the config dict, it's set to zero, and the config file path sets that value to 1 if it's not given. so it's to make the test pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want the config_dict path to also set it to 1 then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably, but what i'm trying to say is that there's a config dict config_data_new
which literally contains the NUM_CPU key, with value 1, in this test - see line 373 / 374. So since the test sets a specific value in the config dict, it seemed appropriate to set an explicit value in the file as well (in particular since the specific value differs from the config file path default value).
honestly off the top of my head right now i don't know what the config dict path does by default, and i think it's outside the scope of this PR (it's actually in the scope of #3918 and #3919)
i believe fixing these defaults was part of changes that we agreed to put in their own PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
0078a08
to
2cd7d1d
Compare
Test needed to be fixed up as a result of changes related to equinor#3802
2cd7d1d
to
561d2db
Compare
Issue
Resolves #3802
Approach
Small changes here and there to make the acceptance test in question pass.
I recommend reading commit by commit, it will make the review easierer (I suppose)
Pre review checklist
Adding labels helps the maintainers when writing release notes. This is the list of release note labels.