`PpCalculation`: add support for retrieving and parsing multiple files #533

sphuber · 2020-06-25T16:20:49Z

Fixes #530

The implementation of the PpCalculation and PpParser assumed that
the code would always produce only one output file with pre-processed
data and one output file with the data formatted in a custom plot
format. However, for certain input parameter combinations, such as
INPUTPP.plot_num = 7 where more than a single band is requested, a
pair of output files is produced for each band. The filename of the data
file has INPUTPP.filplot as a prefix, with a suffix that distinguishes
it from the others. The corresponding plot file will use
that filename as a prefix with the PLOT.fileout value as a suffix.
For example, for the inputs:

INPUTPP
    filplot = 'aiida.filplot'
PLOT
    fileout = 'aiida.fileout'

The data files will have names like aiida.filplot_K1_B1 and the plot
files names like aiida.filplot_K1_B1aiida.fileout.

To support this use case, the PpCalculation is updated to not just
retrieve a single file, but to add a directive to the retrieve_list or
retrieve_temporary_list that contains the corresponding globbing
pattern. The PpParser then simply loops over the content of the
retrieved (temporary) folder and parses each file whose filename matches
the pattern described above. We assume that if there is more than one
file, they all have the exact same format and so can be parsed with the
same logic.
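
As an illustration of the naming scheme and the matching step, here is a
minimal sketch in plain Python. It is not the plugin code: the helper name
`split_retrieved_filenames`, the glob pattern and the second band's
filenames are assumptions made up for this example.

```python
import fnmatch

# Values taken from the example inputs above.
FILPLOT = 'aiida.filplot'
FILEOUT = 'aiida.fileout'


def split_retrieved_filenames(filenames, filplot=FILPLOT, fileout=FILEOUT):
    """Split filenames into data files and plot files (hypothetical helper).

    Plot files are assumed to match `<filplot>*<fileout>`; every other file
    that starts with `<filplot>` is treated as a data file.
    """
    plot_pattern = f'{filplot}*{fileout}'
    plot_files = sorted(
        name for name in filenames if fnmatch.fnmatchcase(name, plot_pattern)
    )
    data_files = sorted(
        name for name in filenames
        if fnmatch.fnmatchcase(name, f'{filplot}*') and name not in plot_files
    )
    return data_files, plot_files


# Illustrative folder content; the `_K1_B2` pair is invented for the example.
retrieved = [
    'aiida.out',
    'aiida.filplot_K1_B1',
    'aiida.filplot_K1_B1aiida.fileout',
    'aiida.filplot_K1_B2',
    'aiida.filplot_K1_B2aiida.fileout',
]
data_files, plot_files = split_retrieved_filenames(retrieved)
```

Unrelated files such as `aiida.out` fall through both filters, so only the
plot files would be handed to the parsing logic.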

Since there is now potentially more than one parsed output ArrayData
node, the output_data port, which is not a namespace, cannot be used.
To keep backwards compatibility, we add the output_data_multiple
namespace, which is used if more than one output plot file is parsed.
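
The choice between the two outputs can be sketched as follows. This is a
simplified illustration with plain dictionaries rather than AiiDA nodes and
ports; the function name `select_output_ports`, the labels used as keys and
the error raised for an empty input are assumptions.

```python
def select_output_ports(parsed_arrays):
    """Map parsed plot data onto `output_data` or `output_data_multiple`.

    `parsed_arrays` maps a label (hypothetical, e.g. 'K1_B1') to the data
    parsed from one plot file.
    """
    if not parsed_arrays:
        # Assumption: nothing parsed is treated as an error by the caller.
        raise ValueError('no plot files were parsed')
    if len(parsed_arrays) == 1:
        # A single file keeps using the original, non-namespace port.
        (single,) = parsed_arrays.values()
        return {'output_data': single}
    # More than one file: everything goes into the namespace.
    return {'output_data_multiple': dict(parsed_arrays)}


single_output = select_output_ports({'K1_B1': [0.1, 0.2]})
multiple_output = select_output_ports({'K1_B1': [0.1, 0.2], 'K1_B2': [0.3, 0.4]})
```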

sphuber · 2020-06-25T16:23:31Z

@yakutovicha I created an alternative implementation including unit tests that now actually also test the retrieved temporary folder functionality. I also tested this locally, at least for plot_num=7 which produces multiple files. But please try and give this a go.

@cpignedoli if you have the time, would be great if you could maybe give this branch a go as well to see if it fixes your use case, thanks.

cpignedoli · 2020-06-25T16:40:22Z

I will for sure give it a try next week. Thanks a lot

greschd · 2020-06-25T16:51:49Z

aiida_quantumespresso/parsers/pp.py

+
+        # How to get the output filenames and how to open them, depends on whether they will have been retrieved in the
+        # `retrieved` output node, or in the `retrieved_temporary_folder`. Instead of having a conditional with almost
+        # the same loop logic in each branch, we apply a somewhat dirty trick to define an `opener` which is a callable


Dirty trick? That's fantastic! 👍

greschd · 2020-06-25T16:54:16Z

aiida_quantumespresso/parsers/pp.py

+        # `retrieved` output node, or in the `retrieved_temporary_folder`. Instead of having a conditional with almost
+        # the same loop logic in each branch, we apply a somewhat dirty trick to define an `opener` which is a callable
+        # that will open a handle to the output file given a certain filename.
+        if retrieve_temporary_list:


Might be worth mentioning here that the files are either all temporary or all retrieved, never a mix of the two.

I think this is a decent choice, but I had to check in the calculation that this is what's implemented there, too.
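
For readers following along, the `opener` idea from the diff comment can be
sketched in isolation. This is a minimal, self-contained illustration and
not the code in the PR: `InMemoryFolder` is a hypothetical stand-in for the
stored `retrieved` folder, and `get_filenames_and_opener` is an invented
name. All files come from exactly one of the two sources.

```python
import io
import os
import tempfile


class InMemoryFolder:
    """Hypothetical stand-in for a stored folder: maps filenames to text."""

    def __init__(self, files):
        self._files = dict(files)

    def list_names(self):
        return sorted(self._files)

    def open(self, name):
        return io.StringIO(self._files[name])


def get_filenames_and_opener(stored_folder, temporary_folder=None):
    """Return the filenames to parse and a callable that opens one by name."""
    if temporary_folder is not None:
        filenames = sorted(os.listdir(temporary_folder))

        def opener(name):
            return open(os.path.join(temporary_folder, name))
    else:
        filenames = stored_folder.list_names()
        opener = stored_folder.open
    return filenames, opener


def read_all(filenames, opener):
    """Single loop that works regardless of where the files live."""
    contents = {}
    for name in filenames:
        with opener(name) as handle:
            contents[name] = handle.read()
    return contents


stored = InMemoryFolder({'a.txt': 'stored'})
from_stored = read_all(*get_filenames_and_opener(stored))

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'b.txt'), 'w') as handle:
        handle.write('temporary')
    from_temporary = read_all(*get_filenames_and_opener(stored, tmp))
```

The branch is taken once, when the opener is chosen, so the parsing loop
itself does not need to be duplicated.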

yakutovicha

I tested with the nanoribbon work chain and it worked fine. There is still the parsing error reported in #531, but this is another story.

ConradJohnston

Looks good!

Is there a case where multiple sub-output files are written but there is a single overall output file?
If so, we may want to offer some user control over whether all sub-outputs are parsed.

yakutovicha · 2020-06-27T09:25:57Z

I tested with the fix you made in #534 and everything went smoothly. Thanks a lot @sphuber!

sphuber · 2020-06-27T12:42:11Z

> Is there a case where multiple sub-output files are written but there is a single overall output file?
> If so, we may want to offer some user control over whether all sub-outputs are parsed.

I wasn't able to figure this out from the docs, but it looks like it should always be a pair of output files. So I guess this should work for now, until we find another edge case.

yakutovicha · 2020-06-29T12:58:33Z

I believe this code should definitely be changed:

    datalist = []
    for line in data_lines:
        for i in range(0, len(line), 13):
            data_point = line[i:i + 13].strip()
            if data_point != '':
                datalist.append(float(data_point))
    # Unpack the list and repack as a 3D array
    # Note the unusual indexing: cube files run over the z index first, then y and x.
    # E.g. The first volumetric data point is x,y,z = (0,0,0) and the second is (0,0,1)
    for i in range(0, xdim):
        for j in range(0, ydim):
            for k in range(0, zdim):
                data_array[i, j, k] = (datalist[(i * ydim * zdim) + (j * zdim) + k])

If the pp.x developers decide overnight to change the number formatting, the parser will fail, because it relies on a fixed column width of 13 characters.

A better approach would look something like:

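    import numpy as np

    # Fill a flat array from whitespace-separated values, independent of the column width
    data_array = np.empty(xdim * ydim * zdim, dtype=float)
    cursor = 0
    for line in data_lines:
        ls = line.split()
        data_array[cursor:cursor + len(ls)] = [float(value) for value in ls]
        cursor += len(ls)
    # Row-major reshape: z runs fastest, then y, then x, matching the cube file ordering
    data_array = data_array.reshape((xdim, ydim, zdim))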
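
The difference between the two approaches can be reproduced without pp.x.
The sketch below is only an illustration: the 13-character width is taken
from the quoted parser code, while the wider 15-character format is a
hypothetical change invented to show the failure mode.

```python
def parse_fixed_width(line, width=13):
    """Slice the line into fixed-width fields, as the quoted parser does."""
    values = []
    for start in range(0, len(line), width):
        field = line[start:start + width].strip()
        if field:
            values.append(float(field))
    return values


def parse_whitespace(line):
    """Split on whitespace, independent of the field width."""
    return [float(token) for token in line.split()]


def fails(parser, line):
    """Return True if the parser raises ValueError on the given line."""
    try:
        parser(line)
    except ValueError:
        return True
    return False


# Two values written with 13-character fields, then with 15-character fields.
narrow_line = '{:13.5E}{:13.5E}'.format(1.0, 2.0)
wide_line = '{:15.7E}{:15.7E}'.format(1.0, 2.0)

narrow_fixed = parse_fixed_width(narrow_line)
narrow_split = parse_whitespace(narrow_line)
wide_split = parse_whitespace(wide_line)
wide_fixed_fails = fails(parse_fixed_width, wide_line)
```

With the wider fields, the fixed-width slices cut through the middle of a
number, so the conversion to float fails, whereas the whitespace split is
unaffected.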

Update: fixed in #535.

sphuber · 2020-06-30T08:47:34Z

@greschd this is ready for re-review

yakutovicha · 2020-06-30T13:33:18Z

> @greschd this is ready for re-review

I am not @greschd, but I believe this can be merged. I tested the plugin quite extensively with the nanoribbon work chain and it works nicely. Maybe merge #535 first (the changes there do not conflict with the current PR), as without it the nanoribbon work chain wouldn't work.

sphuber requested review from greschd, yakutovicha and ConradJohnston June 25, 2020 16:20

greschd reviewed Jun 25, 2020

greschd reviewed Jun 25, 2020

yakutovicha reviewed Jun 25, 2020

ConradJohnston approved these changes Jun 26, 2020

sphuber force-pushed the fix/530/pp-multiple-files branch from 02979f1 to efd41ad Compare June 27, 2020 12:46

sphuber requested a review from greschd June 27, 2020 13:12

yakutovicha self-requested a review June 29, 2020 12:49

yakutovicha approved these changes Jun 30, 2020

sphuber merged commit a52266d into develop Jun 30, 2020

sphuber deleted the fix/530/pp-multiple-files branch June 30, 2020 13:35
