Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/cello/io #105

Closed
wants to merge 5 commits into from
Closed

Conversation

jameslarrue
Copy link

@jameslarrue jameslarrue commented May 21, 2021

Problem A
I have the following configuration in my Output group:

  data {
    type = "data";
    dir = "output/Hydro/hydro_static_2d";
    name = ["hydro_static_2d-data-%03d-%03d.h5", "cycle", "proc"];
    field_list = [ "velocity_x", "velocity_y" ];
    schedule {
      var = "cycle";
      step = 1;
      stop = 2;
    }
  }

The dir causes an incorrect output file for the .parameters file (etc.):

ERROR Error opening parameter file './output/Hydro/hydro_static_2d/./output/Hydro/hydro_static_2d.parameters' for writing

Cause
The issue appears to be in Cello/io_OutputData.cpp (line 141 by Andrew Emerick, 2020/03/05):

  if (name_dir == "") {
    // output block list and parameters to work directory
    name_dir  = ".";
    // strip extension, use this for name
    name_file = name_out_file.substr(0, name_out_file.rfind("."));
  } else {
    // output block list and parameters to subdirectory
    name_file = name_dir;
  }

  ...

    std::string param_file_name = name_dir+"/"+name_file+".parameters";

In the else, the name_file is being set to name_dir, resulting in the doubling of name_dir as the output file.

Solution
My solution is to remove the else and set the name_file to the same value for all cases:

  if (name_dir == "") {
    // output block list and parameters to work directory
    name_dir  = ".";
  }
  // strip extension, use this for name
  name_file = name_out_file.substr(0, name_out_file.rfind("."));

  ...

    std::string param_file_name = name_dir+"/"+name_file+".parameters";

Result
With the change, the parameters (and similar files) are created successfully, in my case, at:

output/Hydro/hydro_static_2d/hydro_static_2d-data-000-000.parameters

@fearmayo fearmayo self-requested a review May 21, 2021 09:23
@jameslarrue
Copy link
Author

It seems that some tests are testing that dirname/dirname.block_list exists, when it seems they should be checking for dirname/filename.block_list.

I am hunting down those tests and will add them here for review.

@jameslarrue jameslarrue added the bug Something isn't working label May 21, 2021
@fearmayo
Copy link
Contributor

It seems that some tests are testing that dirname/dirname.block_list exists, when it seems they should be checking for dirname/filename.block_list.

I am hunting down those tests and will add them here for review.

Indeed. That makes sense. We should also add @jobordner to the review in case there was a reason he set it up like this.

@fearmayo fearmayo requested a review from jobordner May 21, 2021 10:25
@jameslarrue
Copy link
Author

The file tools/l1_error_norm.py is assuming the block_list is in dirname/dirname.block_list (line 179):

    fname = os.path.join(dir_name,
                         '.'.join((os.path.basename(dir_name), 'block_list')))

That script is by @mabruzzo . I suspect he did that because that is what the code was doing.

@mabruzzo , do you have any insight into why l1_error_norm.py is looking for dirname/dirname.block_list rather than dirname/filename.block_list?

@mabruzzo
Copy link
Contributor

mabruzzo commented May 21, 2021

All of the VL+CT tests (including the ones using l1_error_norm.py) all look for dirname/dirname.block_list because that's the way the code worked.

I have encountered the bug that you are talking about before. I recently found that there is a (undocumented?) parameter for the Output section called dir_global that might address this use case. But I agree it's really annoying, and I'm really happy to see someone addressing it.

In the past I thought about trying to address this problem by modifying the else case such that name_file is assigned the equivalent value to the result of the python command: os.path.basename(name_dir). But that could take some work to implement. And your solution is an interesting idea.

I just wanted to express a few related minor concerns. Imagine we were running a simulation distributed over N processors and we used the output formatting options for our data dumps:

data {
    type = "data";
    dir = ["hydro_static_2d_%03d", "cycle"];
    name = ["hydro_static_2d-data-%03d-%03d.h5", "cycle", "proc"];
    field_list = [ "velocity_x", "velocity_y" ];
    schedule {
      var = "cycle";
      step = 1;
      #stop = 2;
    }
  }

Under the current system, the outputs for cycle 1 are organized as follows:

hydro_static_2d_001/
- hydro_static_2d_001.block_list
- hydro_static_2d_001.file_list
- hydro_static_2d_001.parameters
- hydro_static_2d_001.libconfig
- hydro_static_2d-data-001-000.h5
- hydro_static_2d-data-001-001.h5
- ...
- hydro_static_2d-data-001-{N-2}.h5
- hydro_static_2d-data-001-{N-1}.h5

Under the proposed changes, it's not so obvious to me what the metadata filename would look like; it's not obvious to me what value the "proc" variable will have while determining the filename (is it always 0?). Additionally, is there machinery in place to prevent each process from trying to write their own copies of the metadata files? (I would assume the answer is yes) These are both definitely addressable (@jobordner probably could answer these questions).

My last concern is more subjective: it seems like it would take more work to "discover" the block_list files under the new system (assuming you don't know it ahead of time). However, my opinion is probably informed by the fact that I'm accustomed to the current system. So, if people think this is better, I'd be all for it. Plus, the new scalable IO might change things anyway.

@jameslarrue
Copy link
Author

I modified how the tests are looking for the block_list files, so now the tests are passing. @mabruzzo, please review the test changes, as I believe you wrote that code.

In my scenario, I was given the following block_list files:

output/Hydro/hydro_static_2d/hydro_static_2d-data-000-000.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-000-001.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-000-002.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-000-003.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-001-000.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-001-001.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-001-002.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-001-003.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-002-000.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-002-001.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-002-002.block_list
output/Hydro/hydro_static_2d/hydro_static_2d-data-002-003.block_list

Each processor is printing its own block_list. The content of each block_list follows the same pattern, giving the name of the matching .h5 file. For example, cycle 0, processor 1 has:

B1_0 hydro_static_2d-data-000-001.h5

I would expect block_list issues with multiple processors using the dirname/dirname.block_list file. It seems odd that this was never an issue before, though it is possible that the OutputData.cpp changes were only merged this week.

@jameslarrue
Copy link
Author

jameslarrue commented May 21, 2021

Would we not want the .block_list data to be in a single file? Something more like:

B0_0 hydro_static_2d-data-002-000.h5
B0_1 hydro_static_2d-data-002-002.h5
B1_1 hydro_static_2d-data-002-003.h5
B1_0 hydro_static_2d-data-002-001.h5

@mabruzzo
Copy link
Contributor

I suspect that the individual processors might take turns editing the .block_list file to record the information about which blocks they just wrote to disk. If that is indeed the case, then you would need to make sure that all of the individual processors agree about the name of the .block_list file.

@jobordner will know more.

@jameslarrue
Copy link
Author

I feel that what I have is a good fix for parameters and libconfig, but a bad fix for block_list. I am working on an "agreed" name for the block_list file that all the processors can use.

@jameslarrue jameslarrue marked this pull request as draft May 21, 2021 18:35
@jameslarrue
Copy link
Author

Problem B

If the proc value is used in the name of data output, then the block list (and file list) will be written to multiple files. We want a single block_list file. An example of the Output to produce this is:

  data {
    type = "data";
    name = ["hydro_static_2d-data-%03d-%03d.h5", "cycle", "proc"];
    field_list = [ "velocity_x", "velocity_y" ];
    schedule {
      var = "cycle";
      step = 1;
      stop = 2;
    }
  }

Cause
The cause is the same code as the cause for Problem A above. When the file name is expanded, the proc value is replaced with the index of the processing element, and each processing element writes to its own files. This does not impact parameters or libconfig because those files are only written by the root processing element (element 0).

Solution
I implemented a solution where the name used for block_list, file_list, parameters, and libconfig is the expanded name for the root processing element, i.e. using 0 for proc.

An alternative solution could be to hard-code the output file names for the 4 "root" files, e.g. enzo-cycle-000.block_list, but that makes the system less configurable for the users, and increases the chance of a user getting confused about which block_list is for which results.

Result
The block list is written to hydro_static_2d-data-000-000.block_list for all processing elements in cycle 0. The file list is written to a similarly named file.

@jameslarrue
Copy link
Author

jameslarrue commented May 22, 2021

My solution to Problem B touches more of the code than I wished (e.g. io_Output.cpp). If somebody has a different idea, please share.

@jameslarrue jameslarrue marked this pull request as ready for review May 22, 2021 10:45
@mabruzzo
Copy link
Contributor

Since you're asking for a different ideas, I would like to gently re-express my support for a solution similar to the following snippet. But again, my preference comes from the fact that I'm already accustomed to the current system (and that's not necessarily a good reason for adopting this solution).

  // sanitized_name_dir is identical to name_dir, but any trailing `/`s have been removed
  std:string sanitized_name_dir = sanitize_dir_path(name_dir);

  if (sanitized_name_dir == "") {
    // output block list and parameters to work directory
    name_dir  = ".";
    // strip extension, use this for name
    name_file = name_out_file.substr(0, name_out_file.rfind("."));
  } else {
    // output block list and parameters to subdirectory. The basename of the subdirectory 
    // serves as the prefix for the names of the block list and parameters files
    
    std::size_t basename_start = sanitized_name_dir.rfind('/');
    if (basename_start == std::string::npos){ // no '/' was found
      name_file = sanitized_name_dir;
    } else {
      name_file = sanitized_name_dir.substr(basename_start);
    }
  }

This addresses Problem A without introducing Problem B. The sanitized_dir_path function is provided below. It's not strictly necessary to do this sanitizing, but it allows for robust handling of certain edge cases.

std::string sanitize_dir_path(std::string name_dir){ 
  std::string out = "";
  for (std::size_t i_plus_1 = name_dir.size(); i_plus_1 > 0; --i_plus_1){
    std::size_t i = i_plus_1 - 1;
    if (name_dir[i] == '/'){
      continue;
    } else {
      out = name_dir.substr(0,i+1);
      break;
    }
  }
  return out;
}

@jameslarrue
Copy link
Author

@mabruzzo , your solution would overwrite the block_list when you have outputs on more than one cycle in the same directory. For example, with the Output configuration from Problem A, you will have output for cycles 0, 1, and 2, but your solution will use the same block_list file each time, so you will only have the block_list for cycle 2. You will be overwriting your block_list for cycles 0 and 1.

Problem B was not created by the solution for Problem A. It just happens to have a related cause.

The current code would work with a limited number of configurations, such as:

  1. a single processor, or
  2. write the different outputs to a different directory each time and those directories are only 1 level from the current directory.

Those configurations would hide both Problem A and Problem B.

@mabruzzo
Copy link
Contributor

Sorry, I guess I hadn't fully understood (I've only ever used the code in one of the two configurations that you mentioned).

To support writing outputs for multiple cycles to a single non-local directory, your approach makes a lot of sense!

@jobordner
Copy link
Contributor

I'm deferring looking at this in detail since there is a new HDF5 output method in PR #97 which will make the current IO obsolete. The new IO handles the metadata files issue by not generating metadata files. The block_list files are actually much easier to generate outside of Enzo-E using simple scripts that call e.g. h5ls or h5dump. The parameters file is already generated at startup, I don't see the need to re-generate the same file at each data dump.

@peeples peeples requested a review from mabruzzo January 25, 2022 16:13
@peeples
Copy link
Contributor

peeples commented Feb 1, 2022

I am closing this PR as it is likely no longer valid now that PR #97 referenced above has been merged. @jameslarrue if there is still an issue, please resubmit the PR. Thanks!

@peeples peeples closed this Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants