Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PpCalculation: add support for retrieving and parsing multiple files #533

Merged
merged 1 commit into from Jun 30, 2020

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jun 25, 2020

Fixes #530

The implementation of the PpCalculation and PpParser assumed that
the code would always only produce on output file with pre-processed
data and one output file with the data formatted in a custom plot
format. However, for certain input parameter combinations, such as
INPUTPP.plot_num = 7 where more than a single band is requested, a
pair of outputfiles is produced for each band. The filename of the data
file has INPUTPP.filplot as a prefix, with some kind of suffix to
distinguish it from the others. The corresponding plot file will use
that filename as a prefix with the PLOT.fileout value as a suffix.
For example, for the inputs:

INPUTPP
    filplot = 'aiida.filplot'
PLOT
    fileout = 'aiida.fileout'

The data files will be named aiida.filplot_K1_B1 and the plot files
are formatted as aiida.filplot_K1_B1aiida.fileout.

To support this use case, the PpCalculation is updated to not just
retrieve a single file, but add a directive to the retrieve_files or
retrieve_temporary_list that contains the corresponding globbing
pattern. The PpParser then simply loops over the content of the
retrieved (temporary) folder and parses each file whose filename matches
the pattern described above. We assume that if there are more than one
file, they all have the exact same format and so can be parsed with the
same logic.

Since now there are potentially more than one parsed output ArrayData
node, the output_data port, which is not a namespace can not be used.
To keep backwards compatibility, we add the output_data_multiple
namespace, which is used if more than one output plot file is parsed.

@sphuber
Copy link
Contributor Author

sphuber commented Jun 25, 2020

@yakutovicha I created an alternative implementation including unit tests that now actually also test the retrieved temporary folder functionality. I also tested this locally, at least for plot_num=7 which produces multiple files. But please try and give this a go.

@cpignedoli if you have the time, would be great if you could maybe give this branch a go as well to see if it fixes your use case, thanks.

@cpignedoli
Copy link
Contributor

cpignedoli commented Jun 25, 2020 via email


# How to get the output filenames and how to open them, depends on whether they will have been retrieved in the
# `retrieved` output node, or in the `retrieved_temporary_folder`. Instead of having a conditional with almost
# the same loop logic in each branch, we apply a somewhat dirty trick to define an `opener` which is a callable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dirty trick? That's fantastic! 👍

# `retrieved` output node, or in the `retrieved_temporary_folder`. Instead of having a conditional with almost
# the same loop logic in each branch, we apply a somewhat dirty trick to define an `opener` which is a callable
# that will open a handle to the output file given a certain filename.
if retrieve_temporary_list:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth mentioning here that there can either be all temporary files or all retrieved, not a mix of the two.

I think this is a decent choice, but had to go check in the calculation that this is what's implemented there, too.

Copy link
Collaborator

@yakutovicha yakutovicha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with the nanoribbon work chain and it worked fine. There is still the parsing error reported in #531, but this is another story.

Copy link
Contributor

@ConradJohnston ConradJohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Is there a case where multiple sub-output files are written but there is a single overall output file?
If so, we may want to offer some user control over whether all sub-outputs are parsed.

@yakutovicha
Copy link
Collaborator

yakutovicha commented Jun 27, 2020

I tested with the fix you made in #534 and everything went smooth. Thanks a lot @sphuber!

@sphuber
Copy link
Contributor Author

sphuber commented Jun 27, 2020

Is there a case where multiple sub-output files are written but there is a single overall output file?
If so, we may want to offer some user control over whether all sub-outputs are parsed.

I wasn't able to figure this out from the docs, but it looks like it should always be a pair of output files. So I guess for now this should work until we find another edge case I guess

The implementation of the `PpCalculation` and `PpParser` assumed that
the code would always only produce on output file with pre-processed
data and one output file with the data formatted in a custom plot
format. However, for certain input parameter combinations, such as
`INPUTPP.plot_num = 7` where more than a single band is requested, a
pair of outputfiles is produced for each band. The filename of the data
file has `INPUTPP.filplot` as a prefix, with some kind of suffix to
distinguish it from the others. The corresponding plot file will use
that filename as a prefix with the `PLOT.fileout` value as a suffix.
For example, for the inputs:

    INPUTPP
        filplot = 'aiida.filplot'
    PLOT
        fileout = 'aiida.fileout'

The data files will be named `aiida.filplot_K1_B1` and the plot files
are formatted as `aiida.filplot_K1_B1aiida.fileout`.

To support this use case, the `PpCalculation` is updated to not just
retrieve a single file, but add a directive to the `retrieve_files` or
`retrieve_temporary_list` that contains the corresponding globbing
pattern. The `PpParser` then simply loops over the content of the
retrieved (temporary) folder and parses each file whose filename matches
the pattern described above. We assume that if there are more than one
file, they all have the exact same format and so can be parsed with the
same logic.

Since now there are potentially more than one parsed output `ArrayData`
node, the `output_data` port, which is not a namespace can not be used.
To keep backwards compatibility, we add the `output_data_multiple`
namespace, which is used if more than one output plot file is parsed.
@sphuber sphuber requested a review from greschd June 27, 2020 13:12
@yakutovicha yakutovicha self-requested a review June 29, 2020 12:49
@yakutovicha
Copy link
Collaborator

yakutovicha commented Jun 29, 2020

I believe this code should definitely be changed:

datalist = []
for line in data_lines:
    for i in range(0, len(line), 13):
        data_point = line[i:i + 13].strip()
        if data_point != '':
            datalist.append(float(data_point))
# Unpack the list and repack as a 3D array
# Note the unusual indexing: cube files run over the z index first, then y and x.
# E.g. The first volumetric data point is x,y,z = (0,0,0) and the second is (0,0,1)
for i in range(0, xdim):
    for j in range(0, ydim):
        for k in range(0, zdim):
            data_array[i, j, k] = (datalist[(i * ydim * zdim) + (j * zdim) + k])

If pp.x developers from day to night decide to change the number formatting, the parser will fail.

A better approach would look something like:

    data_array = np.empty(xdim*ydim*zdim, dtype=float)
    cursor = 0
    for line in lines:
        ls = line.split()
        data_array[cursor:cursor+len(ls)] = ls
        cursor += len(ls)
    data_array = data_array.reshape((xdim, ydim, zdim))

Update.

fixed in #535

@sphuber
Copy link
Contributor Author

sphuber commented Jun 30, 2020

@greschd this is ready for re-review

@yakutovicha
Copy link
Collaborator

@greschd this is ready for re-review

I am not @greschd but I believe this can be merged. I tested the plugin quite extensively with the nanoribbon work chain and it works nicely. Maybe merge #535 before (the changes there do not conflict with the current PR), as without it nanoribbon work chain wouldn't work.

@sphuber sphuber merged commit a52266d into develop Jun 30, 2020
@sphuber sphuber deleted the fix/530/pp-multiple-files branch June 30, 2020 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PP plugin: retrieve multiple files.
5 participants