Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path towards better repr behavior #1395

Closed
kwmsmith opened this issue Jan 22, 2016 · 6 comments
Closed

Path towards better repr behavior #1395

kwmsmith opened this issue Jan 22, 2016 · 6 comments

Comments

@kwmsmith
Copy link
Member

There's widespread agreement regarding removing the implicit computation behavior in blaze when repring interactive expressions.

Related to #1304 and #1356.

This thread is to discuss how to get there, given that this is a significant API change to Blaze.

One proposal:

  • For the 0.10 release of Blaze, provide a blaze.evaluate_in_repr flag (open to better names for the flag) that's True by default to preserve current behavior. If set to False, then Blaze's interactive expression repr does not evaluate or compute anything, instead returning an more standard representation.
  • For the 0.11 release of Blaze, the flag is removed entirely and the implicit-computation-in-repr behavior is removed entirely, and the new repr behavior is the default.
  • For 0.11 release, the only way to compute anything in Blaze is to ask for it explicitly via compute().
  • Candidates for the new repr:
    1. boring <blaze.expr.arithmetic.Add object at 0xdeadbeef> style reprs.
    2. Repr with some metadata attached, perhaps including the dshape for the result.
    3. Full repr of the expression with symbols, similar to what str(expr) produces.

I favor a simple repr, for the following reasons:

  • It's easy to implement and is not surprising to end users.
  • As soon as an expression grows to a certain size / level of nesting, any attempt to provide a "nice", "readable" repr of the expression's contents becomes quite difficult.
  • Similarly, an expression's datashape can become quite large when joining many tables with lots of fields, foreign keys, etc. which is an argument against including the datashape of an expression in its repr in all cases.
  • Separate methods / functions can be provided that take an expression and yield more sophisticated presentations of that expression.
    • perhaps a preview() function / method could be provided that does a pandas-like compute(expr.head(10))
    • we could find a method / function name that takes an expression and provides the full nested representation of it, and incorporates ideas from Repr for partially bound expressions #1356.
@llllllllll
Copy link
Member

I will dump some thoughts I have here:

For the repr of expressions, I think the minimum should be the type and the dshape. The dshape is what tells me what methods and functions I can call on this. Just knowing <add object at %p> doesn't inform me how I can use this. Maybe something like:

def __repr__(self):
    name = self._name
    return '<%s %s:: %.200s>' % (
        type(self).__name__,
        name + ' ' if name is not None else '',
        self.dshape,
    )

This generates reprs like:

In [1]: s = bz.symbol('s', 'var * {a: int32, b: float64}')

In [2]: s
Out[2]: <Symbol s :: var * {a: int32, b: float64}>

In [3]: s.a
Out[3]: <Field a :: var * int32>

In [4]: s.a + 1
Out[4]: <Add a :: var * int64>

In [5]: s.a + 1 - 2
Out[5]: <Sub a :: var * int64>

In [6]: (s.a + 1).mean()
Out[6]: <mean a_mean :: float64>

In [7]: (s.a + 1).mean() + 1
Out[7]: <Add a_mean :: float64>

In [8]: bz.Data([1, 2, 3])
Out[8]: <InteractiveSymbol :: 3 * int64>

I like this because it won't grow with the size of the expression, and we already have the str which prints the whole expression in a "human readable" form.

I don't love the idea of having the global setting to turn on reprs. How would you feel about making it default to False and the function that toggles it to True raises a deprecation warning?
If we want it to be True by default maybe the first implicit repr should warn that this behavior is going away. I fear that we will not be sufficiently warning users that this change is happening.

I think we should have a preview method that does all of our current repr logic but stops before coercing to a string. A lot of our coerce_scalar stuff is useful, it would be great to be able to add the result of some previous to something like:

In [1]: ds = bz.Data([1, 2, 3])

In [2]: ds.min.preview()
Out[2]: 1 

In [3]: _2 + 5
Out[3]: 6

Also noting that we will need to remove all of these:

# in blaze.interactive
Expr.__repr__ = expr_repr
Expr._repr_html_ = lambda x: to_html(x)
Expr.__len__ = table_length
Expr.__array__ = intonumpy
Expr.__int__ = lambda x: convert_base(int, x)
Expr.__float__ = lambda x: convert_base(float, x)
Expr.__complex__ = lambda x: convert_base(complex, x)
Expr.__bool__ = lambda x: convert_base(bool, x)
Expr.__nonzero__ = lambda x: convert_base(bool, x)
Expr.__iter__ = into(Iterator)

@kwmsmith
Copy link
Member Author

For the repr of expressions, I think the minimum should be the type and the dshape. The dshape is what tells me what methods and functions I can call on this.

Agreed that the dshape is useful when small, and I agree with you if we can find out something to do when the dshape is large. By "large", I've seen usecases where there are several joins and the resulting dshape has a dozen or more fields. How do you propose to handle things in that case?

I'd be in favor of not including the dshape in the repr after some (arbitrary) size cutoff. Perhaps the cutoff could be based on whether the dshape can fit comfortably on one line, but I'm interested to know your thoughts here.

I'm -1 on multi-line reprs, regardless of what's being repr'd, especially since the repr of an object is used in many contexts outside the REPL.

I like your repr proposal, provided we handle the case when the dshape is large. I'm on the fence on the syntax, but I could go with your version here. Another option would be something like:

In [10]: s.a
Out[10]: <blaze.expr.Field a; dshape=var * int32>

One question I have is with unnamed expressions like s.a + 1; if the repr is <Add a :: var * int64>, it's somewhat misleading since it looks like the repr of the expression is saying that the Add expression is named a. I'd prefer

>>> s.a + 1
<blaze.expr.arithmetic.Add object; dshape=var * int64>

Or similar.

For your other examples, like

In [7]: (s.a + 1).mean() + 1
Out[7]: <Add a_mean :: float64>

What if we're combining multiple symbols, like:

In [7]: (s.a + t.b + u.c).mean() + v
Out[7]: <???>

I also have to think about the <Add a_mean :: float64> output somewhat, since it's really an add of a mean of an add, whereas Add a_mean makes it seem like an add of a mean.

Again, my preference in these cases would be to take the simpler road of something like:

In [7]: (s.a + t.b + u.c).mean() + v
Out[7]: <blaze.expr.arithmetic.Add object; dshape=float64>

I don't love the idea of having the global setting to turn on reprs.

I'm with you -- I don't want global state in Blaze if we can avoid it. The global setting was my attempt to provide user control of the repr behavior that's backwards compatible. This flag would be there for the full 0.10 release and then, in 0.11, we'd remove the setting altogether and switch over to the new repr behavior entirely.

I admit this is heavyweight, but it's provided as a courtesy to end users. If you think that we should just switch over to the new repr behavior entirely, then I'm willing to give that thought.

How would you feel about making it default to False and the function that toggles it to True raises a deprecation warning?

Yes perhaps; either way I'd want the global setting to be around only for one minor release cycle, and after that we'd have the new behavior only.

If we want it to be True by default maybe the first implicit repr should warn that this behavior is going away.

Yes, I'm willing to do that.

I fear that we will not be sufficiently warning users that this change is happening.

With a warning and calling it out clearly in the documentation that "before blaze version 0.10, repr behaved like ...", I think we'll have done due diligence.

I think we should have a preview method that does all of our current repr logic but stops before coercing to a string.

+1.

What about the case when calling preview leads to a lot of computation? In particular, I'm thinking of calling join(a, b, 'col').preview(), for example. Do you think it's tractable for us to detect cases when preview might lead to a lot of computation and not doing computation in those cases, and instead point users to a full compute() call?

@llllllllll
Copy link
Member

I think preview should just be listed as one of three options for explicit computations.

  1. compute
  2. odo
  3. preview

The original format I have for dshape cuts it off after it get's too long with %.200s, we could make that more restrictive like %.60s or something.

if the repr is <Add a :: var * int64>, it's somewhat misleading since it looks like the repr of the expression is saying that the Add expression is named a

The a appears because the add expression is named a. The name is taken from the expression's _name field.

@kwmsmith
Copy link
Member Author

The original format I have for dshape cuts it off after it get's too long with %.200s, we could make that more restrictive like %.60s or something.

Thanks for pointing that out -- I missed that when glancing at your code originally.

I agree that if we include the dshape, we need some sort of summarization for dshapes once they are beyond a certain size. My preference would be for the summarized dshape to include the top-most levels, while the nested levels are collapsed with ellipses or something:

# original full dshape
var * {foobar: var * {col0: int32, col1: float, col3: string}, bar: int32, foo: (string, int, string)
# First level of summaraization
var * {foobar: var * {...}, bar: int32,  foo: (...)}
# Second level of summaraization
var * {foobar: ..., bar: ...,  foo: (...)}

Etc.

That way the user can see the top-level field names and is able to explore to the next level down.

@kwmsmith
Copy link
Member Author

The a appears because the add expression is named a. The name is taken from the expression's _name field.

I'd make the case that taking the a out of a + 1 and using that for the expression name is broken :)

It also breaks down as soon as you combine symbols:

In [11]: x = bz.symbol('x', 'int')

In [12]: y = bz.symbol('y', 'int')

In [13]: (x + 1)._name
Out[13]: 'x'

In [14]: print (x + 1)._name
x

In [15]: print (x + y)._name
None

Automatically assigning meaningful names to expressions is a hard problem, probably intractable. We can always generate names like we do automatically with interactive symbols, but since that's an internal thing, there's more opportunity to confuse rather than help users.

We're starting to talk about how to allow users to give names to expressions explicitly -- in that case, using the user-provided expression name would be great.

Thoughts?

@kwmsmith
Copy link
Member Author

kwmsmith commented Apr 1, 2016

Closed by #1414. The deprecation introduced in that PR needs to be fully removed in version 0.11.

@kwmsmith kwmsmith closed this as completed Apr 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants