auto-calculate ADDED_MOS based on basissets #131
Conversation
I guess this PR will be ignored like #105, but maybe someone else can use it some day.
(force-pushed from 005a7ad to 4c417df)
No-no, I will have a look at it. Sorry about #105: I kept postponing it and never had time to look.
Thanks, I think it is very nice. I have two minor comments/questions, which I posted below.
aiida_cp2k/utils/input_generator.py (outdated)

    def get_keyword_value(self, kwpath):
        return self._get_section_or_kw(kwpath, False)

    def _get_section_or_kw(self, kwpath, section_requested):
I wonder, why do we need to have this `section_requested` parameter? Why don't we return whatever is under the specified path?
If the structure were already validated, in the sense that we could be sure the user gave a single value where a keyword is allowed and a dict only for a section, I would agree. But this way we can directly do some validation in a central place. That said, we can move this validation into `get_section_dict` and `get_keyword_value`, of course!
My point is: the `Cp2kInput` class is not performing validation of the sections' content. Its current responsibilities are browsing through the dictionary, modifying it, and converting it to the CP2K input.
That is why I was thinking that instead of `_get_section_or_kw` we could rename it to `get_keyword` (to be consistent with `add_keyword`) and make it return whatever is under the path. It would then become the user's responsibility to understand whether the data they got makes sense.
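For illustration, a unified `get_keyword` along the lines proposed here could be as small as the following sketch. This assumes, purely for the example, that the parameters live in a plain nested dict called `self._params`; it is not the actual implementation from the PR:

```python
class Cp2kInput:
    """Toy stand-in for the real class: just a nested dict of sections/keywords."""

    def __init__(self, params):
        self._params = params

    def get_keyword(self, kwpath):
        """Return whatever lives under the '/'-separated path, section or keyword."""
        node = self._params
        for part in kwpath.strip("/").split("/"):
            node = node[part]  # raises KeyError if the path does not exist
        return node


inp = Cp2kInput({"FORCE_EVAL": {"DFT": {"SCF": {"ADDED_MOS": 20}}}})
inp.get_keyword("FORCE_EVAL/DFT/SCF/ADDED_MOS")  # -> 20 (a keyword value)
inp.get_keyword("FORCE_EVAL/DFT/SCF")            # -> {'ADDED_MOS': 20} (a section)
```

The caller then has to check the type of whatever comes back itself, which is exactly the trade-off being debated in this thread.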
Ok, I've reworked the functions and added more tests.
I see your point about keeping it lean (possibly until we include the input-file abstraction from cp2k-input-tools), but since I am using those functions right in the input generation, not having those checks here would mean cluttering unrelated parts of the code (a part where pylint is already complaining about too-many-everything).
After the rework, the difference is now also a bit clearer (a default argument for the section), beyond just the returned type.
> I see your point of keeping it lean

Good, but instead of less code, we got even more code (why?). My ultimate goal is to make the code as Simple and as Explicit as possible.
> possibly until we include the input file abstraction from the cp2k-input-tools

I would be very happy if you manage to inject `cp2k-input-tools` here and delegate all the input checking to that package. I think you did a great job with it, so we should just reuse it.
> but since I am using those functions right in the input generation, not having those checks here would mean to clutter unrelated parts of the code (a part in which pylint is already complaining about too-many everything).

I have two comments here:

1) The check you need to add is something like:

       if not isinstance(section, Mapping):
           complain

   and it appears to be super clear and super explicit;

2) The pylint-related issues should probably be fixed differently: I think we should create a couple of subfunctions and move the structure, the kpoints, and the basissets insertion parts there. I am happy to make a PR on this once the current one is merged. Do you agree?
Following the aforementioned arguments, I still find it hard to understand why we need to split `get_section_dict` and `get_keyword_value`. At the same time, I like very much the implementation of `_get_section_or_kw` and `_stringify_path`; I think the latter two are great additions.

Additionally, I see that you are returning a deep copy of the subdictionary in `get_section_dict`. I would, honestly, not do that, because returning the subdictionary itself would allow directly modifying the subparts of the "main dictionary".
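The aliasing behaviour under discussion can be shown with a toy nested dict standing in for the real parameters (nothing here is taken from the PR's code):

```python
from copy import deepcopy

params = {"FORCE_EVAL": {"DFT": {"SCF": {"SMEAR": {"_": "ON"}}}}}

# Returning the subsection directly aliases the internal state:
scf = params["FORCE_EVAL"]["DFT"]["SCF"]
scf["ADDED_MOS"] = 100  # silently mutates `params` as well

# Returning a deep copy keeps the internal state isolated:
scf_copy = deepcopy(params["FORCE_EVAL"]["DFT"]["SCF"])
scf_copy["EPS_SCF"] = 1e-7  # `params` is unchanged by this
```

Whether that mutability is a feature (direct editing of the input) or a leak of internals is exactly the point of disagreement here.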
> if not isinstance(section, Mapping): complain
> and it appears to be super clear and super explicit;

It's the same check, but with less duplication when that function gets reused.
> The pylint related issues should probably be fixed differently, I think we should create a couple of subfunctions and move there the structure, the kpoints, and the basissets insertion parts. I am happy to make a PR on this once the current one is merged. Do you agree?
Sure. I just haven't found a way to decouple things such that we get nice little functions without many side effects on the class state: most of the code needs most of the data in the class and changes different states, so just factoring it out into different functions didn't have much benefit so far. If we're doing this, I'd recommend moving all functionality out of `prepare_for_submission` except for calling the factored-out functions.
> Following the aforementioned arguments, I still find it hard to understand why we need to split `get_section_dict` and `get_keyword_value`. At the same time, I like very much the implementation of `_get_section_or_kw` and `_stringify_path`. I think the latter two are great additions.
To avoid every user of it having to duplicate said check. And to anticipate the corner case where a path can be both a keyword and a section (yes, there are sections and keywords in CP2K with the same name): if we know the user wants the keyword and both are present, we return the keyword; if the user meant the section, we can return the section.
Also: we may have a richer structure behind it rather than a simple dictionary (see cp2k-input-tools again). This way we can keep the function the same and simply convert that richer structure to a dict for the user, for which it again helps to know the user's intention.
> Additionally, I see that you are returning a deep copy of the subdictionary in `get_section_dict`. I would, honestly, not do that because this would allow directly modifying the subparts of the "main dictionary".
This is intentional. Otherwise, why are we hiding `self._params` if we give the user full access to it? And as soon as we switch to cp2k-input-tools, it may not be a dictionary anymore in the backend.
Okay, let us start from the beginning.
One of the basic principles of the Python language is duck typing: "If it walks like a duck...", you know it. The cool feature about that is being able to take an object, pretend that it is what we want, and deal with it the way we want, right?
Now let's consider our use case: we access a subpart (I intentionally don't call it a sub-dictionary, because it can be anything) of the whole thing and we interact with it the way we want. In case of a problem, we except.
Now let's have a look at your code:
    if 'basissets' in self.inputs and 'structure' in self.inputs:
        try:
            scf_section = inp.get_section_dict('FORCE_EVAL/DFT/SCF')
            if 'SMEAR' in scf_section and 'ADDED_MOS' not in scf_section:
                # now is our time to shine!
                added_mos = estimate_added_mos(self.inputs.basissets, self.inputs.structure)
                ...
        except (KeyError, TypeError):
            pass
If the function `inp.get_section_dict` is returning a non-dictionary object, then the code below it will except, and this will be nicely taken care of by the `try-except` you provided.
What problem does it solve: whatever you represent the input object as (a dictionary, a custom dictionary, whatever-thing-you-want), as soon as it behaves as expected, the code will work. As soon as it doesn't, the exception will be raised and caught by the try-except.
> Okay, let us start from the beginning.
>
> One of the base principles of the Python language is duck typing: "If it walks like a duck..", you know it. The cool feature about that is being able to take an object, pretend that this is what we want and deal with it the way we want, right?
A little bit condescending, but yes, I am aware of duck typing, as well as of the fact that exceptions are first-class citizens in Python and do not incur the same runtime performance overhead as, let's say, exceptions in C++. As can be seen in the rest of the PR.
> Now let's consider our use case: we access a subpart (I intentionally don't call it sub-dictionary, because it can be anything) of the whole thing and we interact with it the way we want. In case of a problem, we except.
>
> now let's have a look at your code [...]
>
> If the function `inp.get_section_dict` is returning a non-dictionary object, then the code below it will except and this will be nicely taken care of by the `try-except` you provided.
The question is where it will except and why. Since we don't do any validation beforehand, the user could have set it to a string or a list. In both cases the `if ...` may succeed given the right values, at which point we do some calculation (in our case mostly negligible, I guess, but others might not be) before we figure out that we can't add it in the end because the structure is invalid.
And since I don't know what `inp.add_keyword` will raise in that case, I can't anticipate whether I have caught all the exceptions, and the user might get a meaningless exception.
So, while duck typing is useful in some places, it can quickly make it difficult to reason about why a specific part of the code raises an exception, and that's not all (see below). Meaning: at some point you need it to be a full duck, not one just mimicking the quack.
> What problem does it solve: whatever way you represent the input object as: as a dictionary, as a custom-dictionary, as whatever-thing-you-want - as soon as it behaves as expected, the code will work. As soon as it doesn't - the exception will be raised and will be caught by try-except.
The guarantee given by a function `get_section_dict(...)` is a bit stronger than that: it tells its user that, whatever we do in the background, it will return a (possibly nested) dictionary. And that is the point of an interface: to make the code more future-proof towards change while the user does not have to touch their code.
A more generalized `get_keyword`, on the other hand, would give the user no guarantee whatsoever BUT (following your argument that everything should behave like before) still limit us to returning a Mapping-like object, because the user's code will start to break if we change it. Only that with functions like `get_keyword_value` and `get_section_dict` it is quite clear which guarantees we've given the user in the first place.
Given a choice, I usually prefer to limit myself to writing a compatible function if I have to make a change (here: to return a dict no matter what we have in the backend) to match the user's expectation.
But I've taken your criticism into account and am now only handling the exception where I expect one (the `TypeError` should be resolved once we support multiple force-evals and do full validation).
There is even a name for this: the Robustness principle, or, as the saying goes: be contravariant in the input type and covariant in the output type. Besides, since I just worked with Python typings and a library returning a Union of things (like the `get_keyword` you proposed would do), I can say that it is just ugly having to use `cast(...)`.
(force-pushed from c1129d0 to bbfe248)
(force-pushed from bbfe248 to 2362f08)
(force-pushed from 2362f08 to a8d4572)
@@ -160,6 +160,31 @@ def validate_basissets(inp, basissets, structure):

            kind_sec["ELEMENT"] = bset.element


    def estimate_added_mos(basissets, structure, fraction=0.3):
        """Calculate an estimate for ADDED_MOS based on used basis sets"""
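The diff shows only the signature; judging from the thread, the estimate is an empirical fraction of the total number of basis functions. In spirit (only!) that is something like the following toy sketch; the real function takes AiiDA basis-set and structure nodes, while the flattened inputs here are purely illustrative:

```python
def estimate_added_mos(n_basis_per_kind, kind_counts, fraction=0.3):
    """Illustrative only: estimate ADDED_MOS as an empirical fraction of the
    total number of basis functions in the system.

    n_basis_per_kind: basis functions per atomic kind, e.g. {"H": 5, "O": 14}
    kind_counts: number of atoms of each kind, e.g. {"H": 2, "O": 1} for water
    """
    total_basis_functions = sum(
        n_basis_per_kind[kind] * count for kind, count in kind_counts.items()
    )
    # at least one extra MO, otherwise the empirical fraction of the total
    return max(1, int(total_basis_functions * fraction))


estimate_added_mos({"H": 5, "O": 14}, {"H": 2, "O": 1})  # -> 7 (30% of 24 functions)
```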
Thanks @dev-zero! I think adding a guess for `ADDED_MOS` for smearing calculations is a great idea.
Could you perhaps document the reasoning behind the choices made in the documentation of this function?
In particular, I suspect that you may want to make the value of the smearing an input of this function, since the number of orbitals that are necessary is very much determined by it.
Do you have some test data on what energy range above the highest occupied MO will be spanned as a function of the number of added MOs (measured in % of the basis functions)?
I can for sure explain it a bit better, but in the end it is an empirical fraction.
Sure, you could most likely come up with a better one (or even a better motivation for the current one) given the type of smearing and the temperature, but I have no plans or time for that at the moment. This might also be a good idea for the common-workflow project, since there you do the initial structure before doing the rest; hence you could gather some info from that first calculation to narrow it down for the rest of them (with the risk of skewing your result if done wrong).
No test data, unfortunately (unless you want to start parsing the kp-wfn files), but if you want I can gather it when doing some more runs.
> I can for sure explain it a bit better, but in the end it is an empirical fraction.

I think that would make sense.
By the way, if I remember well, Quantum ESPRESSO by default adds 10% of the number of electrons (or occupied orbitals). While this is also approximate, of course, to me the number of electrons is the more physically meaningful quantity here: as long as the basis sets are good enough to represent the electronic states in question, the number of ADDED_MOS to use should be independent of the basis set size.
> This might also be a good idea for the common-workflow project

Indeed. For a given system, smearing value, and smearing method, the guess for ADDED_MOS should be independent of the code.
> No test data unfortunately (unless you want to start parsing the kp-wfn files), but if you want I can gather it when doing some more runs.

I think that would make sense.
Intuitively I'd have said that making the number of added MOs dependent on the number of basis functions, rather than just the number of electrons, is due to the fact that we're using a localized basis set. But any value is empirical and might be wrong.
To make it a bit more systematic, I've added a new keyword to the relevant `&PRINT` section in CP2K which outputs the number of occupied MOs (aggregated as the max over all k-points) on each SCF cycle (one could already print the occupation, but only per k-point and spin), plus the option to directly add all MOs (set ADDED_MOS to -1). The idea is that for something like the aiida-common-workflows, the first calculation will be run with all MOs and reporting turned on, and the scaled structures will then run with that max.
The question is what to do for aiida-cp2k alone: we can leave it completely to the user, or with a recent enough version of CP2K add the -1 if undefined, or add some empirical value as proposed here.
> Intuitively I'd have said that making the number of added MOS dependent on the basis functions rather than just the number of electrons is due to the fact that we're using a localized basis set.

May I ask why? To make my argument as clear as possible, let's consider the case where your calculation is already converged as a function of the basis set size.
If you now go ahead and double the size of your basis set, the number of MOs to add should remain exactly the same: it depends only on the smearing function you use and on the electronic structure of your system.
However, if you double the size of your system (and thus also the number of electrons), the number of added MOs should double.
> the number of occupied MOs (aggregated as max over all kpoints)
Occupied here means: occupation above some relevance cutoff (say 1e-3), correct?
Yes, I agree this is very useful to know.
> the option to directly add all MOs (set ADDED_MOS to -1). The idea is that for something like the aiida-common-workflows the first calculation will be run with all MOs and reporting turned on and the scaled structures will run with that max.
>
> The question is what to do for aiida-cp2k alone: we can leave it completely to the user, or with a recent enough version of CP2K add the -1 if undefined, or add some empirical value as proposed here.
I see. Taking all possible orbitals allowed by the basis is certainly a "safe" solution. For systems where it comes essentially for free, do it, but considering that the number of electrons is typically quite substantially below the number of basis functions, I suspect that for large systems the overhead both in compute time and max. memory needed will be prohibitive.
I strongly suspect that an empirical rule of adding some percentage of the number of (fully) occupied orbitals is going to be enough for most cases (after all, the occupied states tend to span several eVs while half an eV would already be a very high smearing).
As mentioned before, you may want to make the smearing value a parameter in the guess for the number of added MOs.
If you want, I can check the QE source to see exactly what they ended up with.