Copying mutable defaults #3

ericvsmith · 2017-05-22T15:15:30Z

Guido and I discussed this yesterday, and we decided we'd just copy.copy() the default values when creating a new instance.

The __init__ code I'm currently generating for:

@dataclass
class C:
    x: list = []

Looks something like:

def __init__(self, x=[]):
    self.x = copy.copy(x)

But I don't think this is what we really want. I don't think we want to call copy.copy() if passed in an unrelated list, like C(x=mylist). Maybe __init__ should check and only call copy.copy() if x is C.x?

So:

def __init__(self, x=C.x):
    self.x = copy.copy(x) if x is C.x else x

?

(I haven't checked that this actually works as written, but the idea is to only copy the argument if it's the same object as the default that was assigned in the class creation statement.)

The text was updated successfully, but these errors were encountered:

ericvsmith · 2017-05-22T15:40:56Z

I should note that attrs (and others) use a wrapper around a factory function to solve this:

...(default=attr.Factory(list),...)

gvanrossum · 2017-05-22T18:23:47Z

How do you generate the __init__? Do you generate source code that you them compile, or do you have a generic __init__ that looks for the data in a special place on the class (e.g. C.__data_class_extras__)?

If the former, maybe

def __init__(self, x=None):
    if x is None:
        self.x = []
    else:
        self.x = x

You'd have to special-case at list list, dict and set. I'd like to avoid actually importing copy, it's an annoying extra import.

ericvsmith · 2017-05-22T19:42:10Z

I'm generating the code as text and then compiling it.

@dataclass
class C:
    x: int
    y: list = []

currently produces something like:

def __init__(_self, x, y=[]):
    _self.x = x
    _self.y = copy.copy(y)

That is, if there's a default value, copy whatever is passed in. I could change this to use a sentinel value so I can detect if they supplied something (don't copy) or are using the default (do the copy).

def __init__(_self, x, y=_sentinel):
    _self.x = x
    _self.y = copy.copy(C.y) if y is _sentinel else y

My concern with not using copy.copy and instead special casing a few types are two-fold:

What about a field specification ofy: list = [1]? I still need to copy this, although knowing it's a list helps and avoids copy.copy: _self.y = C.y.copy().
What about user-defined mutable classes or list, set, dict subclasses that should also be copied?

Having a Factory marker that contains a callable would get us out of the business of copying, and instead making the class definer deal with it.

I'm thinking I should also add type annotations to the generated __init__, but I'll open another issue for that.

ericvsmith · 2017-05-22T22:24:41Z

Would it be too confusing to special case exact dicts, sets, and lists to call default.copy(); and for all other cases provide the Factory() functionality?

I suspect this would mean that 95% of cases would just work, but would it be a trap for the cases that don't work?

gvanrossum · 2017-05-22T23:05:41Z

That might work, or it might not. We'll have to ask more opinions.

ilevkivskyi · 2017-05-23T17:54:45Z

Sometimes people might actually want the mutable default, so that we can support both

@dataclass
class C:
    x: int
    y: list = []

and something like

@dataclass
class C:
    x: int
    y: List[int] = make(list)

or in some other way make the decision explicit. An important question here is what should be the default behavior.

gvanrossum · 2017-05-23T18:12:49Z

The default should be safe -- it's a common bug that people make by mistake. (And actually I'm not even getting the use case for wanting a mutable instance variable that's shared between instances except when overridden as an explicit constructor argument.)

ericvsmith · 2017-05-24T05:04:00Z

The version of dataclass.py tagged with 'v0' implements the behavior of calling default.copy() if the default is list, set, or dict. That lets you write code like:

from dataclass import dataclass, field

@dataclass
class C:
    x: int
    y: list=[]

c = C(3)
print(c)          # C(x=3,y=[])
c.y.append(10)

d = C(42)
print(d)          # C(x=42,y=[])
d.y.append(20)

print(c)          # C(x=3,y=[10])
print(d)          # C(x=42,y=[20])

Which I think satisfies the "default should be safe", at least for by far the most common cases.

I think if we want to override this, we could add a "non-copy factory" sort of function, like:

y: list = nocopy([])

We could also add a factory function marker to force a new instance for types we're not aware of. That is, something other than str, list, or dict:

y: mytype = factory(mytype)

I'm thinking we could also add a way to specify a custom copy function, but I haven't thought it through yet.

So I guess the question is: if the common cases are safe and we have ways of dealing with non-common cases, does that make describing this too complicated, or too hard to reason about?

ilevkivskyi · 2017-05-24T13:02:26Z

I agree that the default should be safe. Although on a second thought I have seen people who want a mutable default very rarely. Maybe we need to support only:

@auto
class C:
    x: int
    y: list=[]

that will provide a safe behavior, but also allow manually override __init__ (while keeping other methods like __repr__ etc) for people who wants something fancy?

gvanrossum · 2017-05-24T22:20:41Z

I'd launch the PEP without support for mutable defaults. Let people argue for them if they really need them.

ilevkivskyi · 2017-05-24T22:21:40Z

I'd launch the PEP without support for mutable defaults. Let people argue for them if they really need them.

Agreed.

ericvsmith · 2017-06-02T15:23:53Z

I read this to mean: don't treat mutable default specially. If a default is mutable, then it's effectively shared among instances. But that seems to fail the "default should be safe" argument, so maybe that's not what you mean.

There's no way to detect if a default value is mutable, so unless we always copy, we're effectively supporting mutable defaults.

ilevkivskyi · 2017-06-02T15:37:15Z

I think Guido means exactly opposite - we should not have any special support for creating mutable defaults.

ericvsmith · 2017-06-02T16:12:17Z

I'm not sure what that means.

I think no special support means we get this behavior:

    >>> @dataclass
    ... class C:
    ...   x: list = []
    ...
    >>> c = C()
    >>> c.x.append(3)
    >>> c
    C(x=[3])
    >>> c1 = C()
    >>> c1
    C(x=[3])

Did you have something else in mind?

ilevkivskyi · 2017-06-02T16:18:24Z

Did you have something else in mind?

Yes, sorry, I was not very clear. The idea is to always copy, and by "no special support for mutable default" I meant that it will be not possible to create a "shared" mutable default with always copy strategy (as I mentioned in #3 (comment) some people may want this, and as Guido proposed, if someone will actually ask for this, then this can be allowed by e.g. manually overriding __init__ or in other ways).

gvanrossum · 2017-06-02T16:19:59Z

I definitely do not want that behavior.

I think @DataClass ought to special-case at least list, dict and set defaults, and add copying code for these these to the generated __init__. It should also special-case bool, int, float, complex, str, tuple and frozenset and not insert copying code for those. For initializers that are instances of data classes created with 'frozen=True' it could also avoid copying. For initializers it doesn't recognize it should invoke copy.copy().

warsaw · 2017-06-02T16:37:32Z

Are you worried that the special case rules will be hard to remember, either when writing new code or reading existing code? Safety's a concern and I agree that @ericvsmith 's example above is a natural way to write it, but it's also important that someone reading existing dataclass code will be able to immediately and unambiguously be able to reason about the semantics.

Maybe the answer is that the __init__() always calls x.copy() and does not fall back to copy.copy() if that attribute doesn't exist? Then a factory could be used for objects that implement other copying protocols. That's a rule that's easy to describe, remember, and participate in.

(I suppose you could also specify an __copy__() protocol, but then do you also need __deepcopy__()?)

gvanrossum · 2017-06-02T18:03:21Z

But only list, dict and set have .copy() methods -- what about int, float, str etc.? I think I meant for the rule to be simple: we always use copy.copy() except when we can prove that that always returns the original object (as a slight optimization and perhaps to avoid having to import the copy module). The special-casing already exists in copy.copy() and copy._copy_dispatch, so we won't have to list it again.

warsaw · 2017-06-02T18:57:50Z

Right, I meant specifically that because int, float, str, etc. didn't have a copy() method, they wouldn't be copied. But I see what you're saying: all the logic for doing that, including the dispatch tables for built-in types, already exists in the copy module, so the special casing to avoid importing the copy method would only duplicate a small bit of that information.

That seems reasonable, if the emphasis in the docs is that copy.copy() is always used except in a small number of optimizations. (Of course, if copy is already globally imported the avoiding the import in dataclass doesn't save much, but it might not be worth worrying about in the implementation.)

ericvsmith · 2017-06-02T20:01:29Z

How about something like this for the generated code (but less verbose):

    class C:
        i : int = default_i
        j : list = default_j

    def __init__(self, i=default_i, j=default_j):
        if type(i) in (int, float, complex, str, tuple, frozenset):
            # copy won't do anything
            self.i = i
        elif i is not default_i:
            # no need to copy, we're not sharing a default value
            self.i = i
        elif type(i) in (list, set, dict):
            # i is default_i fast path
            self.i = i.copy()
        else:
            # is is default_i: do a general purpose copy
            import copy
            self.i = copy.copy(i)

        # same logic for j

So the logic is: "if you're using the default value, and it's not a known builtin immutable type, we'll copy it the best way we know how". Which from an external viewpoint is: "if you're using the default value, we'll copy it".

Which means if you really wanted to share state, you could do:

    @dataclass
    class C:
        j: list = []

    lst = []
    a = C(lst)
    b = C(lst)
    # a and b share state because a.j is lst and b.j is lst

Then a and b would share mutable state, because we're not copying the value passed to __init__.

Or, using the default not share state, with the same class definition:

    c = C()
    d = C()
    # c and d don't share state

I'm not sure the check for specific types calling i.copy() instead of copy.copy(i) is worthwhile. This can also be optimized some: for example, we know that if i is default_i then we can pre-compute type(i) in (list, set, dict) and know if we need to do the import or not. But if you want to delay import copy, you'll have to do it in each __init__. We'd have to time that to see the cost versus possibly never importing copy.

ilevkivskyi · 2017-06-02T21:44:36Z

@ericvsmith Yes, this looks right.

gvanrossum · 2017-06-02T22:32:50Z

I would hope that the tests for type(i) could be done at "compile time" (i.e. in the class decorator) since the type is declared. I don't find the "if i is default_i" test very nice, maybe we can use None for the default in __init__ and test for that instead? And if the type is known to be immutable we could just use it as the generated default and stick to that. So, given this definition:

class C:
    i: int = 42
    j: list = [1, 2]

the generated __init__ would look more like this:

def __init__(self, i=42, j=None):
    self.i = i
    if j is None:
        self.j = [1, 2]
    else:
        self.j = j

ericvsmith · 2017-06-02T23:12:24Z

Agreed on the "compile time" aspect. That's what I meant when I said we could optimize it.

Agreed on the default value tests, and changing the __init__ signature for the defaults. I think we need to use some sentinel other than None, but I already have one that I use for detecting the absence of a default value in a field(), so I'll use that.

Given:

@dataclass
class C:
    i: int = 42
    j: list = [1, 2]
    k: mytype = mytype()

Then the __init__ would look something like:

def __init__(self, i=_MISSING, j=_MISSING, k=_MISSING):
    self.i = type(self).i if i is _MISSING else i
    self.j = type(self).j.copy() if j is _MISSING else j
    self.k = copy.copy(type(self).k) if k is _MISSING else k

As you show, we could probably do something smarter with the defaults for types we know are immutable like int, str, etc.

gvanrossum · 2017-06-02T23:27:27Z

You could check if the dispatch table for the indicated type maps to "immutable" to tell whether a type can be treated as such.

Removed unused Factory class. The way we're handling mutable defaults doesn't require it (see #3).

hynek · 2017-06-04T06:56:59Z

This special-casing seems like adding complexity to our code and potentially confusing to the users.

IME we need a way to define factories anyway so there’s that. I agree that a core API of the stdlib should be as unsurprising as possible so my idea special case (in accordance with explicit > implicit) would be to raise a helpful error, if e.g. [] is passed instead of silently copying it.

In the best case that might educate the user on the perils of mutable default arguments.

gvanrossum · 2017-06-04T22:56:16Z

This special-casing seems like adding complexity to our code and potentially confusing to the users.

What special-casing? I think the documentation should emphasize that the default specified in the class will be used to initialize the instance variable via copy.copy(), and any special-casing should be a transparent optimization (as long as people don't introspect the generated code) in the code generator.

The only true semantic special case is default values produced by field(), which will be treated differently -- this is reasonable given that it's testing for a single class (or a subclass).

hynek · 2017-06-05T06:01:42Z

While I don’t have a conceptual complaint about that, it does sound really slow? Is good performance a goal at all? Not being slower than handwritten code is one of the tenants of attrs so this is a bit of a paradigm shift for me I’ll have to get used to.

glyph · 2017-06-05T07:47:48Z

This seems like it could go very badly wrong with examples like this:

@dataclass
class ObjectRelationalMappedRow:
    database_connection: DatabaseConnection = global_database_connection

i.e. there are many cases where a shared, mutable default is what is intended, and copy.copy will produce deeply surprising behaviors, especially given the fact that its behavior is opt out and not opt in, so objects like this hypothetical DatabaseConnection now need to go to some additional trouble to be non-copyable if they don't happen to directly contain a non-copyable object like a socket (since this isn't deepcopy, if the socket is one attribute "hop" away, you won't notice the silent implied state corruption).

hynek · 2017-06-05T08:54:10Z

My comment from before was under the impression that there will also be a way to define factories. But the more I think about it, the more I agree with glyph that it’s not a good idea and a landmine. The average Python programmer has not a full grasp of the implications of copy.copy() and this would be just a huge landmine to avoid another one.

[redacted, Tin’s examples are better]

is so nice and explicit, I strongly believe that’s the correct route here.

Tinche · 2017-06-05T08:56:47Z

The shallow copying approach smells to me. That's a lot of contortions to get

@dataclass
class A:
    a: list = [1, 2]

to work verbatim. What about:

@dataclass
class A:
    a: List = [{1: 1}, {2: 2}]

? The dicts will be shared, but I don't think this is what's implied at first glance. And as @glyph mentioned, the situation with arbitrary user defined classes is even worse.

The current attrs way is a little verbose, we have an open issue for a nicer API. My suggestion here would be something like:

@dataclass
class A:
    a: List = call(list)  # Empty list.
    b: List = call(lambda: [1, 2])  # List with members.

gvanrossum · 2017-06-05T16:01:59Z

Here's a possible alternative.

We don't ever copy default values, but we do raise an exception when the default value's top-level type is list, dict or set. To create a value with a default that's an empty list, you have to write something like

    x: list = field(default_factory=list)

The error message (as well as the documentation) should guide the user in this direction. The docs should also show that to create a non-empty list you'd have to use a lambda, e.g.

    x: list = field(default_factory=lambda: [0 ,1, 2])

I'm not particular about the exact idiom used for default_factory -- we could also use

    x: list = field(default=Factory(list, [0, 1, 2])  # Assumes Factory is imported

(One thing I don't want is default=list -- a default might well be a function.)

If we adopt this proposal, I think we should not do anything about other types of mutable values (@glyph's example of a global db connection should just work) nor about e.g. a tuple containing a list. (Users are still expected to learn that defaults are shared between instances.)

An alternative solution for Glyph's example would be for the db connection class to define a __copy__ method that just returns self, but that feels unsatisfactory because it would require a change to the package implementing db connections. Alternatively the user would have to know to use a factory that just returns the global db connection, e.g.

    connection: DbConnection = field(default_factory=lambda: global_db_connection)

All in all I'm not sure I like this better than the shallow-copy-everything proposal, but at least it's easy to explain the first approximation (without the exceptions for list/dict/set) and guides users around the most common mistakes (surely an empty list or dict as a default must be a very common beginner's mistake). I'm also not sure that it is more important to support Glyph's global-db-connection example than it is to have a concise notation for defaults with type list/dict/set -- it's hard to come up with a spec that does the right thing for both without using a factory for at least one of them.

I don't worry about the performance, the code generator can omit copy calls for common immutable types and we can hint at that in the docs.

ilevkivskyi · 2017-06-05T16:42:18Z

I am inclined more towards the factory idiom, it does not actually look too verbose:

class Example:
    shared_names: List[str] = []
    keys: List[int] = field(factory=list)

However I am not sure that we need to raise an exception for the first field above. Potentially, we can still allow plain lists, sets, dicts for class variables:

   shared: ClassVar[List[str]] = [] # This can be OK (if we not include this in __init__)

however this will require importing typing which may induce some speed penalty.

gvanrossum · 2017-06-05T21:52:00Z

You need to import typing anyways to write List[str]. ClassVar was made for this purpose, we should use it.

hynek · 2017-06-06T06:01:46Z

As someone who thought about this stuff before, I prefer having default=some_marker(list) over having two ways of defining defaults.

Would it be technically possible to allow for x: List[int] = Factory(list)?

ilevkivskyi · 2017-06-06T08:20:28Z

Would it be technically possible to allow for x: List[int] = Factory(list)?

It is possible but I think it may create a wrong impression of creating something else than a normal field. I like the syntax proposed by Guido more:

class C:
    x: int = field(default=0, repr=False)
    y: List[int] = field(factory=list)

field(default=factory(list)) is also OK, but looks a bit verbose.

warsaw · 2017-06-06T14:21:56Z

@ilevkivskyi That actually doesn't look to bad to me either. As a beginner, I understand that factories are things that create things, so factory=list tells me that it's going to create new list objects. field() will be a concept I'll learn early on with express classes <wink> so using field(factory=list) isn't much of a stretch conceptually.

gvanrossum · 2017-06-06T15:41:27Z

I prefer having default=some_marker(list) over having two ways of defining defaults.

Why? default=Factory(list) looks uglier than default_factory=list (it has more punctuation) and requires an extra import. Is your worry that people might provide both default= and default_factory=? That's easily rejected with a friendly error message. Or are there use cases that I'm missing?

hynek · 2017-06-06T15:50:09Z

Yes, and I prefer consistency but this is absolutely nothing I would waste anyone’s energy with. I’m fine with either version (although something more inline like x: List[int] = factory(list) still looks cooler – but that’s just syntactic sugar which I believe could be built on top of field if I really wanted to).

warsaw · 2017-06-06T17:12:23Z

On Jun 06, 2017, at 08:41 AM, Guido van Rossum wrote: Why? default=Factory(list) looks uglier than default_factory=list

Can we please call that `factory` instead of `default_factory`?

gvanrossum · 2017-06-06T18:11:45Z

Can we please call that factory instead of default_factory?

I suppose, though then it would lack the connection with this being how the default value is constructed. I don't feel strongly about it, I agree that default_factory is a lot to type.

ericvsmith · 2017-06-07T01:19:13Z

It sounds like we're moving away from doing something automatically, and instead using a factory.

Do we want to force you to specify a field if all you want is a factory? Or can we have just a factory as a default value (at the expense of exposing another function/object)?

@dataclass
class C:
   x: list = field(default_factory=list)

@dataclass
class C:
   x: list = factory(list)

?

I do like the idea of raising an exception if using a list, dict, or set as the default value. It think it should be possible to make a factory that works around this, although I haven't thought it through. We could always relax the restriction later, in the unlikely event that people are clamoring for mutable defaults.

gvanrossum · 2017-06-07T02:30:16Z

This sounds reasonable. Although I still like my original proposal of using copy.copy() by default (but only when the argument is not specified), and maybe optimizing that out if it's a common immutable type (e.g. bool, int, float, complex, str, bytes, tuple, frozenset).

ericvsmith · 2017-06-15T14:24:21Z

Here's an example where copy by default does something surprising. I ran in to this when testing default values:

        sentinel = object()
        @dataclass
        class C:
            x: int = 0
            y: object = sentinel
        c = C()
        self.assertIs(c.y, sentinel)

This fails because c.y is a copy of sentinel, not sentinel itself.

ericvsmith · 2017-09-04T17:26:11Z

My goal is to get this resolved during the core sprint so I can make some progress on other dataclass issues. I think the options we've been discussing for non-ClassVar mutable defaults are:

Do nothing special with defaults: if you provide a default such as [], it gets the normal python mutable default parameter behavior. Depending on your knowledge of python, this may or may not be the least surprising behavior.
Prohibit known mutable defaults, and maybe allow them in the future with special behavior. TBD: how to detect mutable values (known whitelist of types? presence of .copy?, etc.).
Detect mutable defaults, and copy them. TBD: how to detect mutable values and how to copy them (deep or shallow).
When using a default, always copy it. Subject to some optimizations, but logically everything is copied. See above Copying mutable defaults #3 (comment) for one problem with this. TBD: how to copy.

Any others?

gvanrossum · 2017-09-04T17:33:22Z

I would go for one that doesn't preclude picking one of the others later. For that reason (2) wins in my book. I'd just prohibit the three common mutable types (list, dict, set) and their subclasses.

ilevkivskyi · 2017-09-04T19:06:10Z

Option (2) seems reasonable as Guido formulated it, as I understand this will be accompanied by a way to make a factory like

@data
class C:
    bad_attr: List[int] = []  # Error
    attr: List[int] = field(factory=list)  # OK

and a clear error message like ValueError: using [] as a default value is unsafe. Use factory=list instead.

ericvsmith · 2017-09-04T20:13:27Z

Yes, exactly. I don't currently have the factory option implemented, so the initial checkin will just disallow the first line above. I'll make sure the factory option works at a later time.

disallow default values of this type (since they're mutable and a source of bugs). In the future, we might do something like allow them, or automatically copy them.

ericvsmith added a commit that referenced this issue Jun 2, 2017

If using a default value in __init__, always copy the default. See #3.

683eaa1

ericvsmith added a commit that referenced this issue Jun 3, 2017

Removed a TODO item that's been resolved (see #9).

daf3696

Removed unused Factory class. The way we're handling mutable defaults doesn't require it (see #3).

ericvsmith closed this as completed Sep 4, 2017

ericvsmith mentioned this issue Sep 4, 2017

How to specify factory functions #24

Closed

Copying mutable defaults #3

Copying mutable defaults #3

Comments

ericvsmith commented May 22, 2017

ericvsmith commented May 22, 2017

gvanrossum commented May 22, 2017

ericvsmith commented May 22, 2017

ericvsmith commented May 22, 2017

gvanrossum commented May 22, 2017

ilevkivskyi commented May 23, 2017

gvanrossum commented May 23, 2017

ericvsmith commented May 24, 2017

ilevkivskyi commented May 24, 2017 • edited

gvanrossum commented May 24, 2017

ilevkivskyi commented May 24, 2017

ericvsmith commented Jun 2, 2017

ilevkivskyi commented Jun 2, 2017

ericvsmith commented Jun 2, 2017

ilevkivskyi commented Jun 2, 2017

gvanrossum commented Jun 2, 2017

warsaw commented Jun 2, 2017

gvanrossum commented Jun 2, 2017

warsaw commented Jun 2, 2017

ericvsmith commented Jun 2, 2017

ilevkivskyi commented Jun 2, 2017

gvanrossum commented Jun 2, 2017

ericvsmith commented Jun 2, 2017

gvanrossum commented Jun 2, 2017

hynek commented Jun 4, 2017

gvanrossum commented Jun 4, 2017

hynek commented Jun 5, 2017

glyph commented Jun 5, 2017

hynek commented Jun 5, 2017 • edited

Tinche commented Jun 5, 2017

gvanrossum commented Jun 5, 2017

ilevkivskyi commented Jun 5, 2017

gvanrossum commented Jun 5, 2017

hynek commented Jun 6, 2017

ilevkivskyi commented Jun 6, 2017

warsaw commented Jun 6, 2017 • edited

gvanrossum commented Jun 6, 2017

hynek commented Jun 6, 2017

warsaw commented Jun 6, 2017 via email

gvanrossum commented Jun 6, 2017

ericvsmith commented Jun 7, 2017

gvanrossum commented Jun 7, 2017

ericvsmith commented Jun 15, 2017

ericvsmith commented Sep 4, 2017

gvanrossum commented Sep 4, 2017 via email

ilevkivskyi commented Sep 4, 2017 • edited

ericvsmith commented Sep 4, 2017

ilevkivskyi commented May 24, 2017 •

edited

hynek commented Jun 5, 2017 •

edited

warsaw commented Jun 6, 2017 •

edited

ilevkivskyi commented Sep 4, 2017 •

edited