auto-calculate ADDED_MOS based on basissets #131
Conversation
I guess this PR will be ignored like #105, but maybe someone else can use it some day.
(force-pushed from 005a7ad to 4c417df)
No-no, I will have a look at it. Sorry about #105: I kept postponing it and never had time to look.
Thanks, I think it is very nice. I have two minor comments/questions, which I posted below.
aiida_cp2k/utils/input_generator.py (outdated)

    def get_keyword_value(self, kwpath):
        return self._get_section_or_kw(kwpath, False)

    def _get_section_or_kw(self, kwpath, section_requested):
I wonder, why do we need to have this `section_requested` parameter? Why don't we return whatever is under the specified path?
If the structure were already validated, in the sense that we could be sure the user gave a single value where a keyword is allowed and a dict only for a section, I would agree. But this way we can directly do some validation in a central place. That said, we can move this validation into `get_section_dict` and `get_keyword_value`, of course!
My point is: the `Cp2kInput` class is not performing validation of the sections' content. Its current responsibilities are browsing through the dictionary, modifying it, and converting it to the CP2K input.
That is why I was thinking that instead of `_get_section_or_kw` we could rename it to `get_keyword` (to be consistent with `add_keyword`) and make it return whatever is under the path. It would then become the user's responsibility to understand whether the data they got makes sense.
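For illustration, a unified `get_keyword` along the lines proposed here could be as small as the following sketch. This assumes, purely for the example, that the parameters live in a plain nested dict called `self._params`; it is not the actual implementation from the PR:

```python
class Cp2kInput:
    """Toy stand-in for the real class: just a nested dict of sections/keywords."""

    def __init__(self, params):
        self._params = params

    def get_keyword(self, kwpath):
        """Return whatever lives under the '/'-separated path, section or keyword."""
        node = self._params
        for part in kwpath.strip("/").split("/"):
            node = node[part]  # raises KeyError if the path does not exist
        return node


inp = Cp2kInput({"FORCE_EVAL": {"DFT": {"SCF": {"ADDED_MOS": 20}}}})
inp.get_keyword("FORCE_EVAL/DFT/SCF/ADDED_MOS")  # -> 20 (a keyword value)
inp.get_keyword("FORCE_EVAL/DFT/SCF")            # -> {'ADDED_MOS': 20} (a section)
```

The caller then has to check the type of whatever comes back itself, which is exactly the trade-off being debated in this thread.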
Ok, I've reworked the functions and added more tests.
I see your point about keeping it lean (possibly until we include the input-file abstraction from cp2k-input-tools), but since I am using those functions right in the input generation, not having those checks here would mean cluttering unrelated parts of the code (a part where pylint is already complaining about too-many-everything).
After the rework, the difference is now also a bit clearer (a default argument for the section), beyond just the returned type.
> I see your point of keeping it lean

Good, but instead of less code, we got even more code (why?). My ultimate goal is to make the code as Simple and as Explicit as possible.
> possibly until we include the input file abstraction from the cp2k-input-tools

I would be very happy if you manage to inject `cp2k-input-tools` here and delegate all the input checking to that package. I think you did a great job with it, so we should just reuse it.
> but since I am using those functions right in the input generation, not having those checks here would mean to clutter unrelated parts of the code (a part in which pylint is already complaining about too-many everything).

I have two comments here:

1) The check you need to add is something like:

       if not isinstance(section, Mapping):
           complain

   and it appears to be super clear and super explicit;

2) The pylint-related issues should probably be fixed differently: I think we should create a couple of subfunctions and move the structure, the kpoints, and the basissets insertion parts there. I am happy to make a PR on this once the current one is merged. Do you agree?
Following the aforementioned arguments, I still find it hard to understand why we need to split `get_section_dict` and `get_keyword_value`. At the same time, I like very much the implementation of `_get_section_or_kw` and `_stringify_path`; I think the latter two are great additions.

Additionally, I see that you are returning a deep copy of the subdictionary in `get_section_dict`. I would, honestly, not do that, because returning the subdictionary itself would allow directly modifying the subparts of the "main dictionary".
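The aliasing behaviour under discussion can be shown with a toy nested dict standing in for the real parameters (nothing here is taken from the PR's code):

```python
from copy import deepcopy

params = {"FORCE_EVAL": {"DFT": {"SCF": {"SMEAR": {"_": "ON"}}}}}

# Returning the subsection directly aliases the internal state:
scf = params["FORCE_EVAL"]["DFT"]["SCF"]
scf["ADDED_MOS"] = 100  # silently mutates `params` as well

# Returning a deep copy keeps the internal state isolated:
scf_copy = deepcopy(params["FORCE_EVAL"]["DFT"]["SCF"])
scf_copy["EPS_SCF"] = 1e-7  # `params` is unchanged by this
```

Whether that mutability is a feature (direct editing of the input) or a leak of internals is exactly the point of disagreement here.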
> if not isinstance(section, Mapping): complain
> and it appears to be super clear and super explicit;

It's the same check, but with less duplication when that function gets reused.
> The pylint related issues should probably be fixed differently, I think we should create a couple of subfunctions and move there the structure, the kpoints, and the basissets insertion parts. I am happy to make a PR on this once the current one is merged. Do you agree?
Sure. I just haven't found a way to decouple things such that we get nice little functions without many side effects on the class state: most of the code needs most of the data in the class and changes different states, so just factoring it out into different functions didn't have much benefit so far. If we're doing this, I'd recommend moving all functionality out of `prepare_for_submission` except for calling the factored-out functions.
> Following the aforementioned arguments, I still find it hard to understand why we need to split `get_section_dict` and `get_keyword_value`. At the same time, I like very much the implementation of `_get_section_or_kw` and `_stringify_path`. I think the latter two are great additions.
To avoid every user of it having to duplicate said check. And to anticipate the corner case where a path can be both a keyword and a section (yes, there are sections and keywords in CP2K with the same name): if we know the user wants the keyword and both are present, we return the keyword; if the user meant the section, we can return the section.
Also: we may have a richer structure behind it rather than a simple dictionary (see cp2k-input-tools again). This way we can keep the function the same and simply convert that richer structure to a dict for the user, for which it again helps to know the user's intention.
> Additionally, I see that you are returning a deep copy of the subdictionary in `get_section_dict`. I would, honestly, not do that because this would allow directly modifying the subparts of the "main dictionary".
This is intentional. Otherwise, why are we hiding `self._params` if we give the user full access to it? And as soon as we switch to cp2k-input-tools, it may not be a dictionary anymore in the backend.
Okay, let us start from the beginning.
One of the basic principles of the Python language is duck typing: "If it walks like a duck...", you know it. The cool feature about that is being able to take an object, pretend that it is what we want, and deal with it the way we want, right?
Now let's consider our use case: we access a subpart (I intentionally don't call it a sub-dictionary, because it can be anything) of the whole thing and we interact with it the way we want. In case of a problem, we except.
Now let's have a look at your code:
    if 'basissets' in self.inputs and 'structure' in self.inputs:
        try:
            scf_section = inp.get_section_dict('FORCE_EVAL/DFT/SCF')
            if 'SMEAR' in scf_section and 'ADDED_MOS' not in scf_section:
                # now is our time to shine!
                added_mos = estimate_added_mos(self.inputs.basissets, self.inputs.structure)
                ...
        except (KeyError, TypeError):
            pass
If the function `inp.get_section_dict` is returning a non-dictionary object, then the code below it will except, and this will be nicely taken care of by the `try-except` you provided.
What problem does it solve: whatever you represent the input object as (a dictionary, a custom dictionary, whatever-thing-you-want), as soon as it behaves as expected, the code will work. As soon as it doesn't, the exception will be raised and caught by the try-except.
> Okay, let us start from the beginning.
>
> One of the base principles of the Python language is duck typing: "If it walks like a duck..", you know it. The cool feature about that is being able to take an object, pretend that this is what we want and deal with it the way we want, right?
A little bit condescending, but yes, I am aware of duck typing, as well as of the fact that exceptions are first-class citizens in Python and do not incur the same runtime performance overhead as, let's say, exceptions in C++. As can be seen in the rest of the PR.
> Now let's consider our use case: we access a subpart (I intentionally don't call it sub-dictionary, because it can be anything) of the whole thing and we interact with it the way we want. In case of a problem, we except.
>
> now let's have a look at your code [...]
>
> If the function `inp.get_section_dict` is returning a non-dictionary object, then the code below it will except and this will be nicely taken care of by the `try-except` you provided.
The question is where it will except and why. Since we don't do any validation beforehand, the user could have set it to a string or a list. In both cases the `if ...` may succeed given the right values, at which point we do some calculation (in our case mostly negligible, I guess, but others might not be) before we figure out that we can't add it in the end because the structure is invalid.
And since I don't know what `inp.add_keyword` will raise in that case, I can't anticipate whether I have caught all the exceptions, and the user might get a meaningless exception.
So, while duck typing is useful in some places, it can quickly make it difficult to reason about why a specific part of the code raises an exception, and that's not all (see below). Meaning: at some point you need it to be a full duck, not one just mimicking the quack.
> What problem does it solve: whatever way you represent the input object as: as a dictionary, as a custom-dictionary, as whatever-thing-you-want - as soon as it behaves as expected, the code will work. As soon as it doesn't - the exception will be raised and will be caught by try-except.
The guarantee given by a function `get_section_dict(...)` is a bit stronger than that: it tells its user that, whatever we do in the background, it will return a (possibly nested) dictionary. And that is the point of an interface: to make the code more future-proof towards change while the user does not have to touch their code.
A more generalized `get_keyword`, on the other hand, would give the user no guarantee whatsoever BUT (following your argument that everything should behave like before) still limit us to returning a Mapping-like object, because the user's code will start to break if we change it. Only that with functions like `get_keyword_value` and `get_section_dict` it is quite clear which guarantees we've given the user in the first place.
Given a choice, I usually prefer to limit myself to writing a compatible function if I have to make a change (here: to return a dict no matter what we have in the backend) to match the user's expectation.
But I've taken your criticism into account and am now only handling the exception where I expect one (the `TypeError` should be resolved once we support multiple force-evals and do full validation).
There is even a name for this: the Robustness principle, or, as the saying goes: be contravariant in the input type and covariant in the output type. Besides, since I just worked with Python typings and a library returning a Union of things (like the `get_keyword` you proposed would do), I can say that it is just ugly having to use `cast(...)`.
(force-pushed from c1129d0 to bbfe248)
(force-pushed from bbfe248 to 2362f08)
(force-pushed from 2362f08 to a8d4572)
@@ -160,6 +160,31 @@ def validate_basissets(inp, basissets, structure):

            kind_sec["ELEMENT"] = bset.element


    def estimate_added_mos(basissets, structure, fraction=0.3):
        """Calculate an estimate for ADDED_MOS based on used basis sets"""
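The diff shows only the signature; judging from the thread, the estimate is an empirical fraction of the total number of basis functions. In spirit (only!) that is something like the following toy sketch; the real function takes AiiDA basis-set and structure nodes, while the flattened inputs here are purely illustrative:

```python
def estimate_added_mos(n_basis_per_kind, kind_counts, fraction=0.3):
    """Illustrative only: estimate ADDED_MOS as an empirical fraction of the
    total number of basis functions in the system.

    n_basis_per_kind: basis functions per atomic kind, e.g. {"H": 5, "O": 14}
    kind_counts: number of atoms of each kind, e.g. {"H": 2, "O": 1} for water
    """
    total_basis_functions = sum(
        n_basis_per_kind[kind] * count for kind, count in kind_counts.items()
    )
    # at least one extra MO, otherwise the empirical fraction of the total
    return max(1, int(total_basis_functions * fraction))


estimate_added_mos({"H": 5, "O": 14}, {"H": 2, "O": 1})  # -> 7 (30% of 24 functions)
```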
Thanks @dev-zero! I think adding a guess for `ADDED_MOS` for smearing calculations is a great idea.
Could you perhaps document the reasoning behind the choices made in the documentation of this function?
In particular, I suspect that you may want to make the value of the smearing an input of this function, since the number of orbitals that are necessary is very much determined by it.
Do you have some test data on what energy range above the highest occupied MO will be spanned as a function of the number of added MOs (measured in % of the basis functions)?
I can for sure explain it a bit better, but in the end it is an empirical fraction.
Sure, you could most likely come up with a better one (or even a better motivation for the current one) given the type of smearing and the temperature, but I have no plans or time for that at the moment. This might also be a good idea for the common-workflow project, since there you do the initial structure before doing the rest; hence you could gather some info from that first calculation to narrow it down for the rest of them (with the risk of skewing your result if done wrong).
No test data, unfortunately (unless you want to start parsing the kp-wfn files), but if you want I can gather it when doing some more runs.
> I can for sure explain it a bit better, but in the end it is an empirical fraction.

I think that would make sense.
By the way, if I remember well, Quantum ESPRESSO by default adds 10% of the number of electrons (or occupied orbitals). While this is also approximate, of course, to me the number of electrons is the more physically meaningful quantity here: as long as the basis sets are good enough to represent the electronic states in question, the number of ADDED_MOS to use should be independent of the basis set size.
> This might also be a good idea for the common-workflow project

Indeed. For a given system, smearing value, and smearing method, the guess for ADDED_MOS should be independent of the code.
> No test data unfortunately (unless you want to start parsing the kp-wfn files), but if you want I can gather it when doing some more runs.

I think that would make sense.
Intuitively I'd have said that making the number of added MOs dependent on the number of basis functions, rather than just the number of electrons, is due to the fact that we're using a localized basis set. But any value is empirical and might be wrong.
To make it a bit more systematic, I've added a new keyword to the relevant `&PRINT` section in CP2K which outputs the number of occupied MOs (aggregated as the max over all k-points) on each SCF cycle (one could already print the occupation, but only per k-point and spin), plus the option to directly add all MOs (set ADDED_MOS to -1). The idea is that for something like the aiida-common-workflows, the first calculation will be run with all MOs and reporting turned on, and the scaled structures will then run with that max.
The question is what to do for aiida-cp2k alone: we can leave it completely to the user, or with a recent enough version of CP2K add the -1 if undefined, or add some empirical value as proposed here.
> Intuitively I'd have said that making the number of added MOS dependent on the basis functions rather than just the number of electrons is due to the fact that we're using a localized basis set.

May I ask why? To make my argument as clear as possible, let's consider the case where your calculation is already converged as a function of the basis set size.
If you now go ahead and double the size of your basis set, the number of MOs to add should remain exactly the same: it depends only on the smearing function you use and on the electronic structure of your system.
However, if you double the size of your system (and thus also the number of electrons), the number of added MOs should double.
> the number of occupied MOs (aggregated as max over all kpoints)
Occupied here means: occupation above some relevance cutoff (say 1e-3), correct?
Yes, I agree this is very useful to know.
> the option to directly add all MOs (set ADDED_MOS to -1). The idea is that for something like the aiida-common-workflows the first calculation will be run with all MOs and reporting turned on and the scaled structures will run with that max.
>
> The question is what to do for aiida-cp2k alone: we can leave it completely to the user, or with a recent enough version of CP2K add the -1 if undefined, or add some empirical value as proposed here.
I see. Taking all possible orbitals allowed by the basis is certainly a "safe" solution. For systems where it comes essentially for free, do it, but considering that the number of electrons is typically quite substantially below the number of basis functions, I suspect that for large systems the overhead both in compute time and max. memory needed will be prohibitive.
I strongly suspect that an empirical rule of adding some percentage of the number of (fully) occupied orbitals is going to be enough for most cases (after all, the occupied states tend to span several eVs while half an eV would already be a very high smearing).
As mentioned before, you may want to make the smearing value a parameter in the guess for the number of added MOs.
If you want, I can check the QE source to see exactly what they ended up with.