Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make validation optional #95

Closed
Tagl opened this issue Nov 1, 2022 · 15 comments
Closed

Make validation optional #95

Tagl opened this issue Nov 1, 2022 · 15 comments

Comments

@Tagl
Copy link
Contributor

Tagl commented Nov 1, 2022

I was trying to optimize some code of mine using the library (billions of DFA operations) so I ran a profiler and saw a lot of time was spent doing unions, specifically in PartitionRefinement.refine, cross product graph creation and then what surprised me was the __init__ function. And half of the time was due to validation.
I think validation should be optional, so the user could pass in validate=False as a keyword argument when the user is a 100% sure that the construction is valid.

Thoughts on this?

@caleb531
Copy link
Owner

caleb531 commented Nov 1, 2022

@Tagl I have definitely thought about the performance impact of the validation, and I do like the idea of adding an option to skip the validation. However, since the validation check is more of an implementation behavior rather than a property that is intrinsic to the automaton (per the theory), I would prefer not to make this an input parameter to __init__.

Rather, I think it would be better to make this a global setting as part of a new base module (see below). We could (and maybe should) also offer an option to disable automaton immutability.

Therefore, perhaps we could introduce an automata.base.config module with a Config class and a singleton global_config instance. Example code of working with this would look like the following:

from automata.base.config import global_config

global_config.should_freeze_automata = False
global_config.should_validate_automata = False

# proceed to instantiate DFAs, etc. ...

The PartitionRefinement.refine method is probably called so often because it is part of the DFA.minify method, which is used for all sorts of DFA operations.

@eliotwrobson What do you think about this?

@Tagl
Copy link
Contributor Author

Tagl commented Nov 1, 2022

A global config seems like a nice approach.

The unnecessary freezing was avoided by just passing in frozensets and frozendicts instead of sets and dicts, something the user already has control over. I think optional immutability might complicate implementations that rely on immutability.

Yes, refine is the main worker of the minify function, I'm not sure it can be optimized by much.

@eliotwrobson
Copy link
Collaborator

@caleb531 I also like the global config approach, that's a nice way to give users control over what's going on under the hood if necessary. And I agree with @Tagl , changing the freezing behavior using the global config could get confusing and it isn't a big deal as long as the user is providing frozensets in their input. I suppose we could do it just for fun to see if anyone uses it, but I think it's very unlikely to be something people are really taking advantage of in any substantial way.

And yes, the refine method is the key function used in the minify method. The current implementation of PartitionRefinement can't really be optimized much, but there may be a way to speed up that function by changing the way that data structure is implemented; I opted to use the current implementation because it was the only one I could find online in Python. There may be some that are written using C extensions or something that could be faster.

Also @Tagl out of curiosity, what is your code doing exactly? Even large artificial workloads I was using didn't have that many operations.

@Tagl
Copy link
Contributor Author

Tagl commented Nov 2, 2022

Also @Tagl out of curiosity, what is your code doing exactly? Even large artificial workloads I was using didn't have that many operations.

Examining permutation classes and determining whether they have an infinite or finite amount of simple permutations.
There are a total of 2^(n!) such sets for permutations of size n, but I do a lot of filtering to avoid unnecessary work. Basically combinatorial explosion. Each DFA for a permutation class is made of unions of DFAs made from permutations, which are each made from unions of DFAs made from pinwords. Combinatorial explosion is basically my problem 😅

@caleb531
Copy link
Owner

caleb531 commented Nov 3, 2022

The unnecessary freezing was avoided by just passing in frozensets and frozendicts instead of sets and dicts, something the user already has control over. I think optional immutability might complicate implementations that rely on immutability.

@Tagl Isn't there still a performance impact, though? Because I am still recursively processing the transitions, etc. regardless of the type of iterable you pass in, even if it's a frozen* type.

@eliotwrobson
Copy link
Collaborator

@caleb531 I think the issue persists even without the freezing, as you make a copy of all of the sets / dicts given as input, instead of just keeping aliases to them (even though the copies are themselves mutable). It would be great if there were a way in Python to transfer ownership of a data structure to the class (something move-constructor-esque, like in C++). If you want to allow for this though, you could add a global flag that turns off copying these data structures with the burden on the user not to modify them anymore.

As an aside, I think that the recursive processing doesn't quite work as intended. For example, I ran into issues trying to pass in an iterable directly as the set of final_states. It would be nice if that behavior were supported (so client code doesn't have to turn iterables into sets, only to have them immediately copied again into frozen sets).

@Tagl
Copy link
Contributor Author

Tagl commented Nov 3, 2022

Tuples, frozensets and frozendicts are just passed through as objects (by reference) since freezeValue just returns the input object in those cases. List, set and dict are copied when constructing the tuple, frozenset and frozendict, respectively. Other iterables may not work with current freezeValue.

@eliotwrobson
Copy link
Collaborator

That makes sense actually, the automaton constructor doesn't need to make a copy because the given structure itself is immutable (just needs a reference to the immutable data). In that case it probably doesn't make sense to have a flag that controls freezing or whether the constructors make copies, just inform the user they should pass in things as frozensets (or iterables) when possible.

@caleb531
Copy link
Owner

caleb531 commented Nov 3, 2022

@eliotwrobson @Tagl The fact that my freezeValue() function doesn't copy frozensets and frozendicts is more of an optimization than anything else. The primary purpose of that function is to convert the input parameters to immutable structures, and if a value is already immutable, I don't need to waste extra time or code to process it.

However, it does raise an interesting question. Before we introduced immutability, every input parameter was copied so that pass-by-reference wouldn't get in the way (e.g. if you passed the same transitions dict to two different NFA constructors, nfa1['transitions']['q3'] = {} wouldn't impact the other NFA instance).

However, we have an inconsistency in this new immutability-based implementation where some input params are copied (e.g. if you pass a set), but other params are not copied / pass-by-reference (if you pass a frozenset). I think the behavior should be one or the other for all parameters.

And what if the user passes a custom object (which doesn't inherit from dict) as the transitions for a new DFA? Because freezeValue doesn't include a case for that instance type, it will just skip over the instance when processing, such that you can pass-by-reference and totally get away with it.

Therefore, because of all this, I'm actually starting to wonder if we should strip out freezeValue and forgo processing input parameters altogether (i.e. every non-primitive input parameter is implicitly pass-by-reference). This would solve the performance problem of recursively copying the input parameters. It also shouldn't affect __hash__ because the hashing algorithm will be based on the theoretical equivalence and not the hash-ability of the instance's members.

Of course, the above would mean we just wasted all that time refactoring the tests to not mutate the automaton's attributes (e.g. self.dfa1.states.add('q4') would work again). And it also means that the only characteristics which make the automaton immutable are the overrides on __setattr__ and __delattr__, so what's the point? I think my original intention with immutability was achieving hash-ability, but if we can still achieve hash-ability with an equivalence-based algo, then there's not as much incentive to immutability.

What do you guys think?

@eliotwrobson
Copy link
Collaborator

I'm kindof in favor of keeping things as they are. Enforcing the immutability (in my mind) is less about the ability to hash the automatons and really about removing a way of accidentally modifying / breaking the language they represent somewhere in your code.

I'm actually ok leaving the freezing behavior as-is (even though there's some inconsistency with the way that references get passed in). I'm not sure what the standard behavior for objects like this is, but it's generally hard (impossible) to prevent users from doing things they aren't supposed to in Python (even with current safeguards, the user can just call the object setattr function themselves and make changes).

Also, I think that even using the copy module, immutable objects simply return references to themselves, since it makes no difference for the end user whether a shallow copy or reference to the original is returned for the final object: https://stackoverflow.com/questions/37100944/i-think-immutable-types-like-frozenset-and-tuple-not-actually-copied-what-is-th

Although it's technically an inconsistency to use the same reference to the given input sets, these are really under-the-hood type performance details that don't make a huge difference to the end user in terms of functionality. The main benefit is giving people fewer ways of accidentally causing problems for themselves.

@caleb531
Copy link
Owner

caleb531 commented Nov 3, 2022

@eliotwrobson You have some very good points. And I agree with keeping immutability from the validation perspective.

Running a quick test in the Python interpreter, I did also confirm that the copy module does indeed preserve references for immutable structures:

>>> f1 = frozenset({'a', 'b', 'c'})
>>> f2 = frozenset({'a', 'b', 'c'})
>>> f1 is f1
True
>>> f1 is f2
False
>>> import copy
>>> copy.copy(f1) is f1
True

As far as this issue goes, I think we've established that we should neither add a global configuration setting for immutability nor for copying (and as you set, just recommend passing it frozensets/frozendicts). So the only configuration setting we would have at this point is to globally disable validation, correct?

@eliotwrobson
Copy link
Collaborator

@caleb531 Yes, based on the discussion it seems like the only global config that makes sense is for the validation.

@Tagl
Copy link
Contributor Author

Tagl commented Nov 3, 2022

I agree, keep immutability. This could be noted in documentation for performance as a recommendation. A flag for validation is necessary in my opinion, as validation is one of the most demanding computation and is mainly a quality of life check when constructing the automata.

@caleb531
Copy link
Owner

caleb531 commented Nov 3, 2022

@Tagl @eliotwrobson Great, thank you for the feedback. I already have a separate branch live for the global configuration, and will submit a PR for that within the next few days.

@caleb531
Copy link
Owner

Closing this ticket since the recent v7 release has added this configuration functionality that enables optional validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants