Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copying mutable defaults #3

Closed
ericvsmith opened this issue May 22, 2017 · 47 comments
Closed

Copying mutable defaults #3

ericvsmith opened this issue May 22, 2017 · 47 comments

Comments

@ericvsmith
Copy link
Owner

Guido and I discussed this yesterday, and we decided we'd just copy.copy() the default values when creating a new instance.

The __init__ code I'm currently generating for:

@dataclass
class C:
    x: list = []

Looks something like:

def __init__(self, x=[]):
    self.x = copy.copy(x)

But I don't think this is what we really want. I don't think we want to call copy.copy() if passed in an unrelated list, like C(x=mylist). Maybe __init__ should check and only call copy.copy() if x is C.x?

So:

def __init__(self, x=C.x):
    self.x = copy.copy(x) if x is C.x else x

?

(I haven't checked that this actually works as written, but the idea is to only copy the argument if it's the same object as the default that was assigned in the class creation statement.)

@ericvsmith
Copy link
Owner Author

I should note that attrs (and others) use a wrapper around a factory function to solve this:

...(default=attr.Factory(list),...)

@gvanrossum
Copy link

How do you generate the __init__? Do you generate source code that you them compile, or do you have a generic __init__ that looks for the data in a special place on the class (e.g. C.__data_class_extras__)?

If the former, maybe

def __init__(self, x=None):
    if x is None:
        self.x = []
    else:
        self.x = x

You'd have to special-case at list list, dict and set. I'd like to avoid actually importing copy, it's an annoying extra import.

@ericvsmith
Copy link
Owner Author

I'm generating the code as text and then compiling it.

@dataclass
class C:
    x: int
    y: list = []

currently produces something like:

def __init__(_self, x, y=[]):
    _self.x = x
    _self.y = copy.copy(y)

That is, if there's a default value, copy whatever is passed in. I could change this to use a sentinel value so I can detect if they supplied something (don't copy) or are using the default (do the copy).

def __init__(_self, x, y=_sentinel):
    _self.x = x
    _self.y = copy.copy(C.y) if y is _sentinel else y

My concern with not using copy.copy and instead special casing a few types are two-fold:

  1. What about a field specification ofy: list = [1]? I still need to copy this, although knowing it's a list helps and avoids copy.copy: _self.y = C.y.copy().
  2. What about user-defined mutable classes or list, set, dict subclasses that should also be copied?

Having a Factory marker that contains a callable would get us out of the business of copying, and instead making the class definer deal with it.

I'm thinking I should also add type annotations to the generated __init__, but I'll open another issue for that.

@ericvsmith
Copy link
Owner Author

Would it be too confusing to special case exact dicts, sets, and lists to call default.copy(); and for all other cases provide the Factory() functionality?

I suspect this would mean that 95% of cases would just work, but would it be a trap for the cases that don't work?

@gvanrossum
Copy link

That might work, or it might not. We'll have to ask more opinions.

@ilevkivskyi
Copy link
Contributor

Sometimes people might actually want the mutable default, so that we can support both

@dataclass
class C:
    x: int
    y: list = []

and something like

@dataclass
class C:
    x: int
    y: List[int] = make(list)

or in some other way make the decision explicit. An important question here is what should be the default behavior.

@gvanrossum
Copy link

The default should be safe -- it's a common bug that people make by mistake. (And actually I'm not even getting the use case for wanting a mutable instance variable that's shared between instances except when overridden as an explicit constructor argument.)

@ericvsmith
Copy link
Owner Author

The version of dataclass.py tagged with 'v0' implements the behavior of calling default.copy() if the default is list, set, or dict. That lets you write code like:

from dataclass import dataclass, field

@dataclass
class C:
    x: int
    y: list=[]

c = C(3)
print(c)          # C(x=3,y=[])
c.y.append(10)

d = C(42)
print(d)          # C(x=42,y=[])
d.y.append(20)

print(c)          # C(x=3,y=[10])
print(d)          # C(x=42,y=[20])

Which I think satisfies the "default should be safe", at least for by far the most common cases.

I think if we want to override this, we could add a "non-copy factory" sort of function, like:

y: list = nocopy([])

We could also add a factory function marker to force a new instance for types we're not aware of. That is, something other than str, list, or dict:

y: mytype = factory(mytype)

I'm thinking we could also add a way to specify a custom copy function, but I haven't thought it through yet.

So I guess the question is: if the common cases are safe and we have ways of dealing with non-common cases, does that make describing this too complicated, or too hard to reason about?

@ilevkivskyi
Copy link
Contributor

ilevkivskyi commented May 24, 2017

I agree that the default should be safe. Although on a second thought I have seen people who want a mutable default very rarely. Maybe we need to support only:

@auto
class C:
    x: int
    y: list=[]

that will provide a safe behavior, but also allow manually override __init__ (while keeping other methods like __repr__ etc) for people who wants something fancy?

@gvanrossum
Copy link

I'd launch the PEP without support for mutable defaults. Let people argue for them if they really need them.

@ilevkivskyi
Copy link
Contributor

I'd launch the PEP without support for mutable defaults. Let people argue for them if they really need them.

Agreed.

@ericvsmith
Copy link
Owner Author

I read this to mean: don't treat mutable default specially. If a default is mutable, then it's effectively shared among instances. But that seems to fail the "default should be safe" argument, so maybe that's not what you mean.

There's no way to detect if a default value is mutable, so unless we always copy, we're effectively supporting mutable defaults.

@ilevkivskyi
Copy link
Contributor

I think Guido means exactly opposite - we should not have any special support for creating mutable defaults.

@ericvsmith
Copy link
Owner Author

I'm not sure what that means.

I think no special support means we get this behavior:

    >>> @dataclass
    ... class C:
    ...   x: list = []
    ...
    >>> c = C()
    >>> c.x.append(3)
    >>> c
    C(x=[3])
    >>> c1 = C()
    >>> c1
    C(x=[3])

Did you have something else in mind?

@ilevkivskyi
Copy link
Contributor

Did you have something else in mind?

Yes, sorry, I was not very clear. The idea is to always copy, and by "no special support for mutable default" I meant that it will be not possible to create a "shared" mutable default with always copy strategy (as I mentioned in #3 (comment) some people may want this, and as Guido proposed, if someone will actually ask for this, then this can be allowed by e.g. manually overriding __init__ or in other ways).

@gvanrossum
Copy link

I definitely do not want that behavior.

I think @DataClass ought to special-case at least list, dict and set defaults, and add copying code for these these to the generated __init__. It should also special-case bool, int, float, complex, str, tuple and frozenset and not insert copying code for those. For initializers that are instances of data classes created with 'frozen=True' it could also avoid copying. For initializers it doesn't recognize it should invoke copy.copy().

@warsaw
Copy link

warsaw commented Jun 2, 2017

Are you worried that the special case rules will be hard to remember, either when writing new code or reading existing code? Safety's a concern and I agree that @ericvsmith 's example above is a natural way to write it, but it's also important that someone reading existing dataclass code will be able to immediately and unambiguously be able to reason about the semantics.

Maybe the answer is that the __init__() always calls x.copy() and does not fall back to copy.copy() if that attribute doesn't exist? Then a factory could be used for objects that implement other copying protocols. That's a rule that's easy to describe, remember, and participate in.

(I suppose you could also specify an __copy__() protocol, but then do you also need __deepcopy__()?)

@gvanrossum
Copy link

But only list, dict and set have .copy() methods -- what about int, float, str etc.? I think I meant for the rule to be simple: we always use copy.copy() except when we can prove that that always returns the original object (as a slight optimization and perhaps to avoid having to import the copy module). The special-casing already exists in copy.copy() and copy._copy_dispatch, so we won't have to list it again.

@warsaw
Copy link

warsaw commented Jun 2, 2017

Right, I meant specifically that because int, float, str, etc. didn't have a copy() method, they wouldn't be copied. But I see what you're saying: all the logic for doing that, including the dispatch tables for built-in types, already exists in the copy module, so the special casing to avoid importing the copy method would only duplicate a small bit of that information.

That seems reasonable, if the emphasis in the docs is that copy.copy() is always used except in a small number of optimizations. (Of course, if copy is already globally imported the avoiding the import in dataclass doesn't save much, but it might not be worth worrying about in the implementation.)

@ericvsmith
Copy link
Owner Author

How about something like this for the generated code (but less verbose):

    class C:
        i : int = default_i
        j : list = default_j

    def __init__(self, i=default_i, j=default_j):
        if type(i) in (int, float, complex, str, tuple, frozenset):
            # copy won't do anything
            self.i = i
        elif i is not default_i:
            # no need to copy, we're not sharing a default value
            self.i = i
        elif type(i) in (list, set, dict):
            # i is default_i fast path
            self.i = i.copy()
        else:
            # is is default_i: do a general purpose copy
            import copy
            self.i = copy.copy(i)

        # same logic for j

So the logic is: "if you're using the default value, and it's not a known builtin immutable type, we'll copy it the best way we know how". Which from an external viewpoint is: "if you're using the default value, we'll copy it".

Which means if you really wanted to share state, you could do:

    @dataclass
    class C:
        j: list = []

    lst = []
    a = C(lst)
    b = C(lst)
    # a and b share state because a.j is lst and b.j is lst

Then a and b would share mutable state, because we're not copying the value passed to __init__.

Or, using the default not share state, with the same class definition:

    c = C()
    d = C()
    # c and d don't share state

I'm not sure the check for specific types calling i.copy() instead of copy.copy(i) is worthwhile. This can also be optimized some: for example, we know that if i is default_i then we can pre-compute type(i) in (list, set, dict) and know if we need to do the import or not. But if you want to delay import copy, you'll have to do it in each __init__. We'd have to time that to see the cost versus possibly never importing copy.

@ilevkivskyi
Copy link
Contributor

@ericvsmith Yes, this looks right.

@gvanrossum
Copy link

I would hope that the tests for type(i) could be done at "compile time" (i.e. in the class decorator) since the type is declared. I don't find the "if i is default_i" test very nice, maybe we can use None for the default in __init__ and test for that instead? And if the type is known to be immutable we could just use it as the generated default and stick to that. So, given this definition:

class C:
    i: int = 42
    j: list = [1, 2]

the generated __init__ would look more like this:

def __init__(self, i=42, j=None):
    self.i = i
    if j is None:
        self.j = [1, 2]
    else:
        self.j = j

@ericvsmith
Copy link
Owner Author

Agreed on the "compile time" aspect. That's what I meant when I said we could optimize it.

Agreed on the default value tests, and changing the __init__ signature for the defaults. I think we need to use some sentinel other than None, but I already have one that I use for detecting the absence of a default value in a field(), so I'll use that.

Given:

@dataclass
class C:
    i: int = 42
    j: list = [1, 2]
    k: mytype = mytype()

Then the __init__ would look something like:

def __init__(self, i=_MISSING, j=_MISSING, k=_MISSING):
    self.i = type(self).i if i is _MISSING else i
    self.j = type(self).j.copy() if j is _MISSING else j
    self.k = copy.copy(type(self).k) if k is _MISSING else k

As you show, we could probably do something smarter with the defaults for types we know are immutable like int, str, etc.

@gvanrossum
Copy link

You could check if the dispatch table for the indicated type maps to "immutable" to tell whether a type can be treated as such.

ericvsmith added a commit that referenced this issue Jun 3, 2017
Removed unused Factory class. The way we're handling mutable defaults doesn't require it (see #3).
@hynek
Copy link

hynek commented Jun 4, 2017

This special-casing seems like adding complexity to our code and potentially confusing to the users.

IME we need a way to define factories anyway so there’s that. I agree that a core API of the stdlib should be as unsurprising as possible so my idea special case (in accordance with explicit > implicit) would be to raise a helpful error, if e.g. [] is passed instead of silently copying it.

In the best case that might educate the user on the perils of mutable default arguments.

@gvanrossum
Copy link

This special-casing seems like adding complexity to our code and potentially confusing to the users.

What special-casing? I think the documentation should emphasize that the default specified in the class will be used to initialize the instance variable via copy.copy(), and any special-casing should be a transparent optimization (as long as people don't introspect the generated code) in the code generator.

The only true semantic special case is default values produced by field(), which will be treated differently -- this is reasonable given that it's testing for a single class (or a subclass).

@hynek
Copy link

hynek commented Jun 5, 2017

While I don’t have a conceptual complaint about that, it does sound really slow? Is good performance a goal at all? Not being slower than handwritten code is one of the tenants of attrs so this is a bit of a paradigm shift for me I’ll have to get used to.

@glyph
Copy link

glyph commented Jun 5, 2017

This seems like it could go very badly wrong with examples like this:

@dataclass
class ObjectRelationalMappedRow:
    database_connection: DatabaseConnection = global_database_connection

i.e. there are many cases where a shared, mutable default is what is intended, and copy.copy will produce deeply surprising behaviors, especially given the fact that its behavior is opt out and not opt in, so objects like this hypothetical DatabaseConnection now need to go to some additional trouble to be non-copyable if they don't happen to directly contain a non-copyable object like a socket (since this isn't deepcopy, if the socket is one attribute "hop" away, you won't notice the silent implied state corruption).

@hynek
Copy link

hynek commented Jun 5, 2017

My comment from before was under the impression that there will also be a way to define factories. But the more I think about it, the more I agree with glyph that it’s not a good idea and a landmine. The average Python programmer has not a full grasp of the implications of copy.copy() and this would be just a huge landmine to avoid another one.

[redacted, Tin’s examples are better]

is so nice and explicit, I strongly believe that’s the correct route here.

@Tinche
Copy link

Tinche commented Jun 5, 2017

The shallow copying approach smells to me. That's a lot of contortions to get

@dataclass
class A:
    a: list = [1, 2]

to work verbatim. What about:

@dataclass
class A:
    a: List = [{1: 1}, {2: 2}]

? The dicts will be shared, but I don't think this is what's implied at first glance. And as @glyph mentioned, the situation with arbitrary user defined classes is even worse.

The current attrs way is a little verbose, we have an open issue for a nicer API. My suggestion here would be something like:

@dataclass
class A:
    a: List = call(list)  # Empty list.
    b: List = call(lambda: [1, 2])  # List with members.

@gvanrossum
Copy link

Here's a possible alternative.

We don't ever copy default values, but we do raise an exception when the default value's top-level type is list, dict or set. To create a value with a default that's an empty list, you have to write something like

    x: list = field(default_factory=list)

The error message (as well as the documentation) should guide the user in this direction. The docs should also show that to create a non-empty list you'd have to use a lambda, e.g.

    x: list = field(default_factory=lambda: [0 ,1, 2])

I'm not particular about the exact idiom used for default_factory -- we could also use

    x: list = field(default=Factory(list, [0, 1, 2])  # Assumes Factory is imported

(One thing I don't want is default=list -- a default might well be a function.)

If we adopt this proposal, I think we should not do anything about other types of mutable values (@glyph's example of a global db connection should just work) nor about e.g. a tuple containing a list. (Users are still expected to learn that defaults are shared between instances.)

An alternative solution for Glyph's example would be for the db connection class to define a __copy__ method that just returns self, but that feels unsatisfactory because it would require a change to the package implementing db connections. Alternatively the user would have to know to use a factory that just returns the global db connection, e.g.

    connection: DbConnection = field(default_factory=lambda: global_db_connection)

All in all I'm not sure I like this better than the shallow-copy-everything proposal, but at least it's easy to explain the first approximation (without the exceptions for list/dict/set) and guides users around the most common mistakes (surely an empty list or dict as a default must be a very common beginner's mistake). I'm also not sure that it is more important to support Glyph's global-db-connection example than it is to have a concise notation for defaults with type list/dict/set -- it's hard to come up with a spec that does the right thing for both without using a factory for at least one of them.

I don't worry about the performance, the code generator can omit copy calls for common immutable types and we can hint at that in the docs.

@ilevkivskyi
Copy link
Contributor

I am inclined more towards the factory idiom, it does not actually look too verbose:

class Example:
    shared_names: List[str] = []
    keys: List[int] = field(factory=list)

However I am not sure that we need to raise an exception for the first field above. Potentially, we can still allow plain lists, sets, dicts for class variables:

   shared: ClassVar[List[str]] = [] # This can be OK (if we not include this in __init__)

however this will require importing typing which may induce some speed penalty.

@gvanrossum
Copy link

You need to import typing anyways to write List[str]. ClassVar was made for this purpose, we should use it.

@hynek
Copy link

hynek commented Jun 6, 2017

As someone who thought about this stuff before, I prefer having default=some_marker(list) over having two ways of defining defaults.

Would it be technically possible to allow for x: List[int] = Factory(list)?

@ilevkivskyi
Copy link
Contributor

Would it be technically possible to allow for x: List[int] = Factory(list)?

It is possible but I think it may create a wrong impression of creating something else than a normal field. I like the syntax proposed by Guido more:

class C:
    x: int = field(default=0, repr=False)
    y: List[int] = field(factory=list)

field(default=factory(list)) is also OK, but looks a bit verbose.

@warsaw
Copy link

warsaw commented Jun 6, 2017

@ilevkivskyi That actually doesn't look to bad to me either. As a beginner, I understand that factories are things that create things, so factory=list tells me that it's going to create new list objects. field() will be a concept I'll learn early on with express classes <wink> so using field(factory=list) isn't much of a stretch conceptually.

@gvanrossum
Copy link

I prefer having default=some_marker(list) over having two ways of defining defaults.

Why? default=Factory(list) looks uglier than default_factory=list (it has more punctuation) and requires an extra import. Is your worry that people might provide both default= and default_factory=? That's easily rejected with a friendly error message. Or are there use cases that I'm missing?

@hynek
Copy link

hynek commented Jun 6, 2017

Yes, and I prefer consistency but this is absolutely nothing I would waste anyone’s energy with. I’m fine with either version (although something more inline like x: List[int] = factory(list) still looks cooler – but that’s just syntactic sugar which I believe could be built on top of field if I really wanted to).

@warsaw
Copy link

warsaw commented Jun 6, 2017 via email

@gvanrossum
Copy link

Can we please call that factory instead of default_factory?

I suppose, though then it would lack the connection with this being how the default value is constructed. I don't feel strongly about it, I agree that default_factory is a lot to type.

@ericvsmith
Copy link
Owner Author

It sounds like we're moving away from doing something automatically, and instead using a factory.

Do we want to force you to specify a field if all you want is a factory? Or can we have just a factory as a default value (at the expense of exposing another function/object)?

@dataclass
class C:
   x: list = field(default_factory=list)

@dataclass
class C:
   x: list = factory(list)

?

I do like the idea of raising an exception if using a list, dict, or set as the default value. It think it should be possible to make a factory that works around this, although I haven't thought it through. We could always relax the restriction later, in the unlikely event that people are clamoring for mutable defaults.

@gvanrossum
Copy link

This sounds reasonable. Although I still like my original proposal of using copy.copy() by default (but only when the argument is not specified), and maybe optimizing that out if it's a common immutable type (e.g. bool, int, float, complex, str, bytes, tuple, frozenset).

@ericvsmith
Copy link
Owner Author

Here's an example where copy by default does something surprising. I ran in to this when testing default values:

        sentinel = object()
        @dataclass
        class C:
            x: int = 0
            y: object = sentinel
        c = C()
        self.assertIs(c.y, sentinel)

This fails because c.y is a copy of sentinel, not sentinel itself.

@ericvsmith
Copy link
Owner Author

My goal is to get this resolved during the core sprint so I can make some progress on other dataclass issues. I think the options we've been discussing for non-ClassVar mutable defaults are:

  1. Do nothing special with defaults: if you provide a default such as [], it gets the normal python mutable default parameter behavior. Depending on your knowledge of python, this may or may not be the least surprising behavior.

  2. Prohibit known mutable defaults, and maybe allow them in the future with special behavior. TBD: how to detect mutable values (known whitelist of types? presence of .copy?, etc.).

  3. Detect mutable defaults, and copy them. TBD: how to detect mutable values and how to copy them (deep or shallow).

  4. When using a default, always copy it. Subject to some optimizations, but logically everything is copied. See above Copying mutable defaults #3 (comment) for one problem with this. TBD: how to copy.

Any others?

@gvanrossum
Copy link

gvanrossum commented Sep 4, 2017 via email

@ilevkivskyi
Copy link
Contributor

ilevkivskyi commented Sep 4, 2017

Option (2) seems reasonable as Guido formulated it, as I understand this will be accompanied by a way to make a factory like

@data
class C:
    bad_attr: List[int] = []  # Error
    attr: List[int] = field(factory=list)  # OK

and a clear error message like ValueError: using [] as a default value is unsafe. Use factory=list instead.

@ericvsmith
Copy link
Owner Author

Yes, exactly. I don't currently have the factory option implemented, so the initial checkin will just disallow the first line above. I'll make sure the factory option works at a later time.

ericvsmith added a commit that referenced this issue Sep 4, 2017
disallow default values of this type (since they're mutable and a source
of bugs). In the future, we might do something like allow them, or
automatically copy them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants