Use broadcast_to instead of broadcast_like #288

Status: Closed · wants to merge 10 commits

Conversation

@eigenfoo (Contributor) commented Jan 30, 2021

Closes #159. WIP.

Two questions:

  1. Changing broadcast_like to broadcast_to in aesara/tensor/math_opt.py gives me a lot of BadOptimization and AssertionErrors, which I'm not knowledgeable enough to debug - can someone help me understand what's happening?
  2. I've also changed broadcast_like to broadcast_to in aesara/tensor/basic_opt.py - is this something we want to do?
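
(For context, a rough sketch of the substitution under discussion; this assumes `broadcast_to` lives in `aesara.tensor.extra_ops` alongside the `BroadcastTo` `Op`, and uses a concrete shape in place of the template-derived shape the rewrites would supply.)

```python
import numpy as np
import aesara
import aesara.tensor as at
from aesara.tensor.extra_ops import broadcast_to

x = at.vector("x")
# broadcast_like built an Alloc-based graph that fills a template-shaped
# tensor; broadcast_to instead wraps NumPy's view-based broadcasting.
y = broadcast_to(x, (3, 4))

f = aesara.function([x], y)
print(f(np.arange(4, dtype=aesara.config.floatX)))  # a (3, 4) broadcast of x
```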

@eigenfoo (Contributor, Author) commented Jan 30, 2021

Hmm - more tests are failing than I anticipated. Running pytest tests/tensor/test_math_opt.py gives me the 9 errors I was originally referring to: https://gist.github.com/eigenfoo/8c2d5e7867fc23b82c98553c5300b054

The failures don't actually call broadcast_to, but most of them involve toposort or optimize using local_pow_canonicalize, which does call broadcast_to. Does that mean broadcast_to affected the graph optimizations, which is what these failed tests are testing for? Apologies for rambling, I'm trying to understand the codebase better.

tests/tensor/test_basic.test_alloc_constant_folding (the test that failed the current test run) also calls toposort.

@brandonwillard (Member) commented Jan 30, 2021

The failures don't actually call broadcast_to, but most of them involve toposort or optimize using local_pow_canonicalize, which does call broadcast_to. Does that mean broadcast_to affected the graph optimizations, which is what these failed tests are testing for?

It looks like they're failing because those unit tests make unnecessarily strong assumptions, aren't isolated/self-contained enough, don't assert the relevant changes directly, etc. In other words, you're seeing the long-standing deficiencies of our test suite.

If you could fix those issues as you address this PR's issue, you would be doing significantly more for this project than the change requested in the issue itself.

@brandonwillard (Member) commented

Aside from strong test assumptions, there are some optimization errors that are probably worth addressing first; those may actually involve bugs.

For instance, the first optimization error implies that the type generated by the newly introduced BroadcastTo Op doesn't match the type of the graph it's replacing in the optimization.

@eigenfoo (Contributor, Author) commented Jan 30, 2021

@brandonwillard the latest commit pymc-devs/pytensor@a00ee11 gets rid of all BadOptimization errors, as far as I can tell - all that remains are AssertionErrors.

However, I suspect I'm doing something ugly by adding the dtype parameter like this - can you take a look at the new parameter? If it all looks good, I'll move on to fixing the tests: I suspect I'll need a lot of help along the way, though 😬

@brandonwillard (Member) left a review comment

Can't we simply cast the result of broadcast_to to the desired dtype (e.g. using TensorVariable.astype)?

Adding a dtype argument to the Op doesn't add much in the way of functionality, and it causes the Op to deviate from the underlying numpy.broadcast_to that it models/wraps.
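
(A minimal sketch of that suggestion, with a concrete shape and dtype standing in for the values a rewrite would supply, and assuming `broadcast_to` is importable from `aesara.tensor.extra_ops`:)

```python
import aesara.tensor as at
from aesara.tensor.extra_ops import broadcast_to

x = at.vector("x")
# Broadcast first, then cast the result, instead of threading a `dtype`
# argument through the BroadcastTo Op itself.
y = broadcast_to(x, (3, 4)).astype("float64")
```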

@eigenfoo (Contributor, Author) commented Jan 30, 2021

Sure, that actually makes it easier. I assumed we wanted to mimic the broadcast_like signature, which took a dtype parameter. Sorry if that was a bad assumption.

The previous two commits revert the dtype commit, and use TensorVariable.astype to cast dtypes.

Also, do we want to remove the broadcast_like function now? The only remaining references are in its own definition:

$ rg broadcast_like
aesara/tensor/basic_opt.py
170:def broadcast_like(value, template, fgraph, dtype=None):
182:            "broadcast_like currently requires the "

@eigenfoo (Contributor, Author) commented Feb 7, 2021

@brandonwillard I think I've fixed all tests but two. I'm confused about the remaining failures - I've attached the output of pytest test_math_opt.py here: https://gist.github.com/eigenfoo/ddb7d2aad30550c0efbf64a1841881f9

  1. test_stacktrace seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for? https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/aesara/graph/opt.py#L3188-L3189
  2. What is test_local_reduce_join testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure... https://github.com/pymc-devs/aesara/blob/0344f1a844731eea980ada631fe180e09f397a51/tests/tensor/test_math_opt.py#L3472-L3473

@brandonwillard (Member) commented

  1. test_stacktrace seems to be a real failure that I don't understand - what exactly is the "trace" that we're testing for?

That's probably referring to the stack traces carried by each variable in their tag. Those traces show where each variable was created, and, during optimizations, they're carried over to newly created replacement variables. Their only real purpose is for user-level debugging.
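
(A sketch of how rewrites typically carry those traces over; the `aesara.graph.opt` location of `copy_stack_trace` is an assumption about the Aesara version this PR targets:)

```python
import aesara.tensor as at
from aesara.graph.opt import copy_stack_trace

x = at.vector("x")
old_out = at.exp(x)        # stand-in for the variable a rewrite replaces
new_out = at.exp(x) * 1.0  # stand-in for the rewrite's replacement

# Each variable records where it was created in `tag.trace`; rewrites are
# expected to copy that over so debugging still points at the user's code.
copy_stack_trace(old_out, new_out)
print(len(new_out.tag.trace))
```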

@brandonwillard (Member) commented

2. What is test_local_reduce_join testing for? I can tell that it's only these two lines that are failing the test, but I don't know if it's safe to remove them, or if this is a real failure...

Those look like brittle test conditions. Instead of coming up with a direct check for something specific and relevant in the transformed graph, many tests simply assert the number of nodes in the expected output.

In almost all cases, these kinds of asserts are simply bad, so don't take them too seriously. If you changed an optimization, an Op, and/or one of an Op's helper functions, and any of those things are used in one of these tests, then a change in the number of nodes could be easily justified.
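
(For example, a check like the following targets the property the rewrite is actually supposed to guarantee rather than a node count; the `pow(x, 1)` case mirrors the `local_pow_canonicalize` rewrite mentioned earlier in the thread:)

```python
import aesara
import aesara.tensor as at
from aesara.scalar.basic import Pow
from aesara.tensor.elemwise import Elemwise

x = at.vector("x")
f = aesara.function([x], x ** 1)
topo = f.maker.fgraph.toposort()

# Brittle: `assert len(topo) == 1` breaks whenever an unrelated rewrite adds
# or removes a node. Asserting the specific property is far more robust:
assert not any(
    isinstance(n.op, Elemwise) and isinstance(n.op.scalar_op, Pow) for n in topo
)
```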

@brandonwillard (Member) commented

I'll try running these tests locally within the next few days and see if I can spot any genuine issues (i.e. ones that aren't due to brittle/overly-restrictive tests).

@eigenfoo (Contributor, Author) commented

Thanks for helping, @brandonwillard! Sorry for being so slow-moving on this PR: something has popped up in the real world (it's not a lack of interest in finishing this work).

I've loosened the test conditions on test_local_reduce_join so that it passes now, and I've removed test_stacktrace entirely. I'm unsure why test_stacktrace fails whereas a similar test, test_local_sum_prod_mul_by_scalar_stack_trace, succeeds. I'd be happy to put the test back if I could get a pointer on what's gone wrong.

I'll wait for the test suite to finish, but I'd also appreciate a quick triage of whether there are other real test failures - otherwise I'll assume that everything that fails is a flaky test.

@brandonwillard (Member) commented

Sorry for being so slow-moving on this PR

I only commented so that you knew I hadn't forgotten about this PR; there's absolutely no rush, though!

@eigenfoo (Contributor, Author) commented

@brandonwillard test_basic/test_stack_hessian is failing - this seems like a real failure, but I'm unsure what's going wrong. Could you give some pointers?

https://github.com/pymc-devs/aesara/pull/288/checks?check_run_id=1884844321

@brandonwillard (Member) commented Feb 17, 2021

test_basic/test_stack_hessian is failing - this seems like a real failure

Yes, that does look like a real issue. My first impression was that it had to do with in-place operations.

The new broadcast_to uses NumPy's underlying view-based broadcasting via the BroadcastTo Op, and its view_map is supposed to indicate the view relationship between the Op's first input, a, and its output. Given that the first value is 2, as expected, and all the remaining values are effectively 0, it seemed like an optimization might've been ignoring this view information and "in-placing" things that it shouldn't.

Unfortunately, the evaluated graph doesn't contain any BroadcastTo Ops, so, if the problem is related to BroadcastTo-generated views, it's manifesting in an indirect way.

Also, with aesara.config.compute_test_value turned on, the test values for the Ha and Hb graphs are incorrect, and setting aesara.config.optimizer_verbose = True shows some validation failures during optimization.
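
(For reference, the view relationship is declared via the Op's `view_map`. Below is a toy sketch of that convention, not BroadcastTo's actual definition, assuming the `aesara.graph` module layout of this Aesara version:)

```python
import aesara.tensor as at
from aesara.graph.basic import Apply
from aesara.graph.op import Op
from aesara.tensor.type import TensorType


class ViewFirstRow(Op):
    """Toy Op returning a view of its input's first row (illustration only)."""

    # Output 0 aliases input 0, so destructive (in-place) rewrites must not
    # write into the shared memory. Per the description above, BroadcastTo
    # declares the same relationship between its output and its input `a`.
    view_map = {0: [0]}

    def make_node(self, x):
        x = at.as_tensor_variable(x)
        out = TensorType(x.dtype, (False,))()
        return Apply(self, [x], [out])

    def perform(self, node, inputs, output_storage):
        (x,) = inputs
        output_storage[0][0] = x[0]  # a NumPy view, not a copy
```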

@brandonwillard (Member) commented Feb 17, 2021

OK, the issue does appear to involve bad in-placing. Those FunctionGraph validation failures are specifically caused by aesara.tensor.basic_opt.local_inplace_setsubtensor producing the following exception:

InconsistencyError('Attempting to destroy indestructible variables: [TensorConstant{(1,) of 0.0}]')

Actually, those aren't the exact problem. We probably need to find out how the BroadcastTo Ops are being replaced.

@brandonwillard (Member) commented

All right, I believe that the source of this issue is the change from broadcast_like to broadcast_to in aesara.tensor.basic_opt.local_fill_to_alloc. That's the rewrite that introduces the BroadcastTo Op in the first place.

In this case, a fill really shouldn't be replaced with a broadcasted view of the fill value; however, we might want a rewrite that turns unmodified (in-place) allocs and/or fills into BroadcastTos.

@brandonwillard (Member) commented

Now that I think about it, we could probably use a simple optimization that removes useless BroadcastTos.
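
(A sketch of what such a rewrite could look like; the name, the `(fgraph, node)` local-optimizer signature, and the omitted registration step are assumptions, not the eventual implementation:)

```python
from aesara.graph.opt import local_optimizer
from aesara.tensor.extra_ops import BroadcastTo


@local_optimizer([BroadcastTo])
def local_useless_broadcast_to(fgraph, node):
    # broadcast_to(x, shape) whose output type is no more general than x's
    # own type is a no-op; replace it with x directly.
    # (A real version would also be registered, e.g. in the canonicalize pass.)
    x = node.inputs[0]
    if node.outputs[0].type == x.type:
        return [x]
    return None
```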

@brandonwillard (Member) commented Feb 18, 2021

Looks like there are a few more brittle tests (e.g. TestCrossEntropyCategorical1Hot.test_get_rid_of_advanced_indexing_version_of_xent and TestMinMax.test_optimization_max do the same len(topo) == x nonsense).

@brandonwillard (Member) commented

Also, some of these optimizations might be having trouble because BroadcastTo isn't being lifted.

In the case of TestMinMax.test_optimization_min, it looks like the optimizations being tested aren't applied because they're looking for neg(max(neg(...))), but now the graph is neg(max(broadcast_to(neg(...)))). If we lift the BroadcastTo Ops, it should fix that.
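
(The kind of lift meant here, sketched for unary Elemwise Ops only; the `op(a, *shape)` call convention for BroadcastTo and the `(fgraph, node)` signature are assumptions about the current implementation:)

```python
from aesara.graph.opt import local_optimizer
from aesara.tensor.elemwise import Elemwise
from aesara.tensor.extra_ops import BroadcastTo


@local_optimizer([Elemwise])
def local_elemwise_broadcast_to_lift(fgraph, node):
    # Rewrite elemwise(broadcast_to(x, shape)) -> broadcast_to(elemwise(x), shape)
    # so that patterns such as neg(max(neg(...))) become visible to the
    # existing rewrites again.
    if len(node.inputs) != 1:
        return None
    (bcast,) = node.inputs
    if not (bcast.owner and isinstance(bcast.owner.op, BroadcastTo)):
        return None
    x, *shape = bcast.owner.inputs
    return [bcast.owner.op(node.op(x), *shape)]
```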

@brandonwillard changed the title from "Use broadcast_to intead of broadcast_like" to "Use broadcast_to instead of broadcast_like" on Aug 1, 2021
@ricardoV94 force-pushed the broadcast_to branch 2 times, most recently from 3307a5d to 4211281, on May 11, 2022
@ricardoV94 (Contributor) commented

Just rebased this PR

Commit messages from the rebased branch:

- A lot of these tests test for strict equality of the toposorted graph. We should only test that the toposorted graph _contains_ the expected nodes/Ops.
- This change allows `get_scalar_constant_value` to "dig through" `BroadcastTo` `Op`s.
- This change removes an assertion on `len(topo)` and loosens the strict node type requirement of the last node in `topo`.
@rlouf closed this on Oct 14, 2022
@rlouf deleted the broadcast_to branch on Oct 14, 2022