Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing indent in string stored in hdf5 #1228

Merged
merged 1 commit into from
Nov 3, 2023

Conversation

momchil-flex
Copy link
Collaborator

No description provided.

@momchil-flex momchil-flex force-pushed the momchil/hdf5_string_indent branch 2 times, most recently from b2ba1cb to 413f0c9 Compare October 31, 2023 23:36
@@ -563,7 +563,7 @@ def to_hdf5(self, fname: str, custom_encoders: List[Callable] = None) -> None:

with h5py.File(fname, "w") as f_handle:

f_handle[JSON_TAG] = self._json_string
f_handle[JSON_TAG] = self._json()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this indent=None by default? maybe it's more clear to pass it explicitly?

@momchil-flex
Copy link
Collaborator Author

I've updated this PR to use indent=None internally, and I've also introduced the splitting of the string in the hdf5 file. The one thing I'm not sure is what to do with web/core, where I find separate functions for reading the string from the hdf5:

def _read_simulation_from_hdf5(file_name: str):
"""read simulation str from hdf5"""
with h5py.File(file_name, "r") as f_handle:
json_string = f_handle[JSON_TAG][()]
return json_string
, and there's one more instance.

I think the logic is that I shouldn't be importing anything from tidy3d/components here, so I can't use something like Tidy3dBaseModel._json_string_from_hdf5? Then options seem to be:

  • Repeat the code
  • Introduce a function in something like tidy3d/utils.py that gets imported both in components/base.py and in web/core.py. But does that also go against the future possibility of completely separating web/core.py?

Copy link
Collaborator

@tylerflex tylerflex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string splitting Looks good, a few suggestions for making it a bit simpler.

f_handle[JSON_TAG] = self._json_string
json_str = self._json_string
for ind in range(ceil(len(json_str) / MAX_STRING_LENGTH)):
key = f"{JSON_TAG}_{ind}" if ind > 0 else JSON_TAG
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can write a static method or module level function for this to avoid duplicating the key as a function of index, incase we change this or need to use it elsewhere.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Tidy3DBaseModel

@staticmethod
def _json_string_key(index):
    if index:
        return f"{JSON_TAG}_{ind}"
    return JSON_TAG

to_hdf5

json_str = self._json_string
num_strings = ceil(len(json_str) / MAX_STRING_LENGTH)
for ind in range(num_strings):
    ind_start = int(ind * MAX_STRING_LENGTH)
    ind_stop = min(int(ind + 1) * MAX_STRING_LENGTH, len(json_str))
    key = self. _json_string_key(ind)
    f_handle[key] = json_str[ind_start:ind_stop]
@classmethod
def _json_string_from_hdf5(cls, fname: str) -> str:
         """Load the model json string from an hdf5 file."""
         with h5py.File(fname, "r") as f_handle:
               num_string_parts = len([key for key in f_handle.keys() if JSON_TAG in key])
               json_string = ""
               for ind in range(num_string_parts):
                     key = _json_string_key(ind)
                     json_string += f_handle[key][()]
           return json_string

f_handle[JSON_TAG] = self._json_string
json_str = self._json_string
for ind in range(ceil(len(json_str) / MAX_STRING_LENGTH)):
key = f"{JSON_TAG}_{ind}" if ind > 0 else JSON_TAG
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this creates JSON_STRING, JSON_STRING_1, ...?
I'm curious if it will make the logic simpler to just not perform any special handling for ind=0, like JSON_STRING_0, JSON_STRING_1, .... If the string is short enough to fit in JSON_STRING_0, we just leave it named like that. Will it break backwards compatibility with script loading maybe?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway, see comment and suggestion above, I think it makes things a bit simpler to keep the handling how it is but just have a single function to get the key, treating ind=0 differently.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, yeah not appending ind=0 is for backwards compatibility.

tests/test_components/test_IO.py Show resolved Hide resolved
@tylerflex
Copy link
Collaborator

Regarding web/core, that's MC-controlled and I think we're duplicating code in there until we can unify it at a later date (@dbochkov-flexcompute?) so I'd say maybe we can just duplicate but add a #TODO: to unify with this method later.

@dbochkov-flexcompute
Copy link
Contributor

I've updated this PR to use indent=None internally, and I've also introduced the splitting of the string in the hdf5 file. The one thing I'm not sure is what to do with web/core, where I find separate functions for reading the string from the hdf5:

def _read_simulation_from_hdf5(file_name: str):
"""read simulation str from hdf5"""
with h5py.File(file_name, "r") as f_handle:
json_string = f_handle[JSON_TAG][()]
return json_string

, and there's one more instance.
I think the logic is that I shouldn't be importing anything from tidy3d/components here, so I can't use something like Tidy3dBaseModel._json_string_from_hdf5? Then options seem to be:

  • Repeat the code
  • Introduce a function in something like tidy3d/utils.py that gets imported both in components/base.py and in web/core.py. But does that also go against the future possibility of completely separating web/core.py?

Regarding web/core, that's MC-controlled and I think we're duplicating code in there until we can unify it at a later date (@dbochkov-flexcompute?) so I'd say maybe we can just duplicate but add a #TODO: to unify with this method later.

So far, the way we have been dealing with importing tidy3d things into web api is through passing those things during initialization of web api here

core_config.set_config(log, get_logging_console(), __version__)

For example, currently we pass log, logging console (should change that to logging console getter eventually), and tidy3d version. So, perhaps _json_string_from_hdf5() could also be set in web api from tidy3d in the same way? That is:

core_config.set_config(log, get_logging_console(), __version__, Tidy3dBaseModel._json_string_from_hdf5)

@tylerflex
Copy link
Collaborator

tylerflex commented Nov 2, 2023

yea perhaps this is a way to do it, although it seems a bit strange potentially.

Maybe an alternative approach is to handle this in the SimulationTask, grab the class of the component for the task and then call cls._json_string_from_hdf5? Just not sure how to grab the class without downloading the whole object.

@dbochkov-flexcompute
Copy link
Contributor

yea perhaps this is a way to do it, although it seems a bit strange potentially.

Maybe an alternative approach is to handle this in the SimulationTask, grab the class of the component for the task and then call cls._json_string_from_hdf5? Just not sure how to grab the class without downloading the whole object.

Yeah, I think it would work there. However, the other place where json string is extracted from hdf5 is here

extension = _get_valid_extension(file_path)
if extension == ".json":
json_str = read_simulation_from_json(file_path)
elif extension == ".hdf5":
json_str = read_simulation_from_hdf5(file_path)
elif extension == ".hdf5.gz":
json_str = read_simulation_from_hdf5_gz(file_path)
data = json.loads(json_str)
type_ = data["type"]
if "SimulationData" == type_:
sim_data = SimulationData.from_file(file_path)
elif "ModeSolverData" == type_:
sim_data = ModeSolverData.from_file(file_path)
elif "HeatSimulationData" == type_:
sim_data = HeatSimulationData.from_file(file_path)
return sim_data
and the purpose there is to actually figure our what simulation class it is

@tylerflex
Copy link
Collaborator

I see, that's a bit messy, maybe the easiest approach is to just duplicate but make a note and get qingeng to help unify things, but passing in the config also works (even though it seems like not the long term solution to me).

@momchil-flex
Copy link
Collaborator Author

Updated, with code duplicated in web.core.py.

# default indentation (# spaces) in files
INDENT = 4

INDENT_JSON_FILE = 4 # default indentation of json string in json files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could consider setting this to 2 to reduce the size of the file itself, but no preference either way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I'm not sure this will actually change the file size. depends on how they json handles indents I guess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested. It does change the size. I guess let's do 2 yeah? But this shouldn't matter for upload/download from the server because all this is done through hdf5 files now. So not as important potentially, could be left 4 since it's prettier?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine by me. I prefer 4 if the file size is not an issue.

Copy link
Collaborator

@tylerflex tylerflex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thank you! just a couple minor notes

@@ -430,6 +434,23 @@ def get_sub_model(cls, group_path: str, model_dict: dict | list) -> dict:
model_dict = model_dict[key]
return model_dict

@staticmethod
def _json_string_key(index):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add some type annotations just while we're at this

def _json_string_key(index: int) -> str:

@momchil-flex momchil-flex merged commit c5aec95 into pre/2.5 Nov 3, 2023
14 checks passed
@momchil-flex momchil-flex deleted the momchil/hdf5_string_indent branch November 3, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants