Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Component Modeler Batch Path #1259

Merged
merged 1 commit into from
Nov 21, 2023
Merged

Fix Component Modeler Batch Path #1259

merged 1 commit into from
Nov 21, 2023

Conversation

tylerflex
Copy link
Collaborator

Uses deterministic hash for the batch path that doesn't change between sessions

@tylerflex tylerflex changed the base branch from develop to pre/2.5 November 20, 2023 17:12
@tylerflex tylerflex added the 2.5 label Nov 20, 2023
Copy link
Collaborator

@lucas-flexcompute lucas-flexcompute left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I'm a little puzzled why doesn't Tidy3DBaseModel.__hash__ work? Do you understand what makes it change in when starting a new session?

@tylerflex
Copy link
Collaborator Author

There's a random seed that gets set each session for python's built in hash(). This is done for security reasons that I don't fully understand but you can read about it here:
https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions

@momchil-flex
Copy link
Collaborator

momchil-flex commented Nov 20, 2023

So does this mean that more generally our simulations don't have the same hash across sessions? This would be a problem with simulation caching right?

@tylerflex
Copy link
Collaborator Author

So does this mean that more generally our simulations don't have the same hash across sessions? This would be a problem with simulation caching right?

Yea (just verified). We'll just have to make sure to use a different hash function than the builtin python one I guess.

@momchil-flex
Copy link
Collaborator

momchil-flex commented Nov 21, 2023

I think one (easier to implement?) option is Yannick's suggestion in #1262 . In that sense do you want to implement this here already? It feels like maybe such a _hash_self function, which takes into account everything rather than just the json string, and which does not depend on the session, should be added to Tidy3dBaseModel already. Then it can be used specifically for the component modeler file?

Also, how do we do hash for base models currently? If the new function won't be slower, then we may even completely replace the default hash?

@tylerflex
Copy link
Collaborator Author

I think one (easier to implement?) option is Yannick's suggestion in #1262 . In that sense do you want to implement this here already?

There are two suggestions, one of them involves hashing self.json() and the other involves hashing the hdf5 file. Which were you referring to? My fix here basically does the same thing as Yannick's first suggestion and only hashes the json_string. I suppose this might be an issue if any of the simulations contain custom medium, then we could get some file conflicts. Alternatively, I could try hashing self.dict().

It feels like maybe such a _hash_self function, which takes into account everything rather than just the json string, and which does not depend on the session, should be added to Tidy3dBaseModel already. Then it can be used specifically for the component modeler file?

You mean just move this to base model? I guess that works. I am hesitant to change the default hash until we are working more seriously on the simulation cache.

Also, how do we do hash for base models currently? If the new function won't be slower, then we may even completely replace the default hash?

We use pydantic.BaseModel.__hash__. It's not clear to me whether we want to change the default behavior from python. I feel like that's asking for trouble. I'd rather we provide an alternative hash function that uses hashlib and call that in the component modeler or let others use it if they want something that is consistent between sessions.

I can change the PR to reflect this discussion if you agree:

  • move hash_self to Tidy3DBaseModel and call it in the ComponentModeler.batch_path
  • have hash_self use the self.dict() so it has all info (not just json string containing stuff)

@tylerflex
Copy link
Collaborator Author

Just to clarify, this PR is to fix a pretty specific issue where the ComponentModeler batch filename was computed using a hash of the modeler. In this case, the default behavior of hash() changing between sessions is not ideal, but generally I dont think we really want to change the default behavior unless we have a good reason to. Also, we might want to use a different hashing strategy for different things (for the cache, we need to be very careful to hash the entire thing, for component modeler batch path, it's not as critical).

@lucas-flexcompute
Copy link
Collaborator

I agree with @tylerflex in leaving the default __hash__ alone, if only to respect python's default behavior.

For the issue at hand, I think it's important to include the full contents of the object for hashing in component modeler, because we don't verify that the file actually represents the desired model (by verifying equality of the whole contents). I can imagine using component models to store different devices that only differ in their custom medium (maybe steps in an optimization or sweep run) only to find out their files have been overwritten one after the other. It will be less likely to happen (but not impossible) with a full hash. More generally, the hash only exists to assert difference, not identity, so I think that's how we should be using it later for serious caching.

@momchil-flex
Copy link
Collaborator

Ok, yeah, I thought we had already manually overwritten the default hash, but if we haven't, then let's not change that. I think at some point we were manually using the json but that was porbably because for some reason default wasn't working.

Yeah, I was referring to the second approach proposed by Yannick, or at least some way like that for us to introduce a hash that is a) unique as long as the component (including custom data) is unique b) persistent among python sessions. Using the dict should also be fine I think? And yeah this is indeed what I'm suggesting for this PR:

I can change the PR to reflect this discussion if you agree:

move hash_self to Tidy3DBaseModel and call it in the ComponentModeler.batch_path
have hash_self use the self.dict() so it has all info (not just json string containing stuff)

@tylerflex tylerflex force-pushed the tyler/fix/smatrix/hash branch 2 times, most recently from 0228efe to 07e81cc Compare November 21, 2023 19:55
@tylerflex
Copy link
Collaborator Author

Ok made the changes discussed

  1. Tidy3dBaseModel._hash_self() -> str to use Yannick's approach to hashing.
  2. call this in ComponentModeler._batch_path

please take another look and let me know if there are any comments or merge it if it looks good, thanks

Copy link
Collaborator

@lucas-flexcompute lucas-flexcompute left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Better to have it in the base class to be reused as needed.

@tylerflex tylerflex merged commit af54303 into pre/2.5 Nov 21, 2023
14 checks passed
@tylerflex tylerflex deleted the tyler/fix/smatrix/hash branch November 21, 2023 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants