Is opt_einsum.paths.optimal algorithm correct? #167

Open
igtrnt opened this issue Aug 24, 2021 · 7 comments

igtrnt commented Aug 24, 2021

Is opt_einsum.paths.optimal supposed to support non-standard Einstein summations (where an index occurs in more than two tensors)?

Consider contracting 3 tensors: ij,j,jk->:

First, the DFS looks at the ij,j contraction. The resulting tensor is j (since jk is still in the remaining list).
So it caches {ij,j} -> [j, cost1].

Now remaining contains jk and j. Contracting them results in an empty tensor.
So it caches {j,jk} -> [, cost2].

Now the DFS goes into another branch, starting with the j,jk contraction. It should result in the tensor j, because ij is still in the remaining list.
But the cache contains {j,jk} -> [, cost2] instead of {j,jk} -> [j, *].

This simple example shows that caching with a key based on the two tensors participating in the contraction may not work correctly.
The cache key would also have to contain the contraction result (e.g., {j, jk, j} and {j, jk, } would be two different cache keys). But that defeats the purpose of caching - the contraction result would need to be computed every time...

For larger examples, it's easy to end up with a non-optimal "optimal" path and, moreover, a completely wrong cost for that path.
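
A minimal sketch of the toy case above, comparing what 'optimal' reports against 'dp' (the array sizes here are my own arbitrary choice):

# Sketch: compare the path and cost reported by 'optimal' vs 'dp'
# for the ij,j,jk-> example (sizes are arbitrary placeholders).
import numpy as np
import opt_einsum as oe

a, b, c = np.ones((10, 3)), np.ones(3), np.ones((3, 4))

path_opt, info_opt = oe.contract_path('ij,j,jk->', a, b, c, optimize='optimal')
path_dp, info_dp = oe.contract_path('ij,j,jk->', a, b, c, optimize='dp')

# with the stale cache entry described above, the 'optimal' cost and/or
# path can come out wrong relative to an independent recount
print(path_opt, info_opt.opt_cost)
print(path_dp, info_dp.opt_cost)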

jcmgray (Collaborator) commented Aug 24, 2021

Hi @igtrnt, yes, I think you're right - the algorithm was written with 'standard' einsums in mind, where two different intermediates sharing the exact same indices is not possible. Out of interest, do you have any more complex examples where this might be important? Inputs with single indices that can/should be immediately summed over (i and k here) are a known suboptimal edge case - see #112, #114.

A workaround is just to use the 'dp' path, which is optimal up to outer products (and scales better!). Or, if you really want outer products, you can instantiate optimize=DynamicProgramming(search_outer=True).
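
For completeness, a quick sketch of both workarounds (shapes are arbitrary placeholders; I'm assuming the top-level DynamicProgramming export, which also lives in opt_einsum.paths):

# Sketch of the two suggested workarounds.
import numpy as np
import opt_einsum as oe

a, b, c = np.ones((10, 3)), np.ones(3), np.ones((3, 4))

# 'dp' is optimal up to outer products and scales better than 'optimal'
path, info = oe.contract_path('ij,j,jk->', a, b, c, optimize='dp')

# or explicitly allow outer products in the dynamic programming search
opt = oe.DynamicProgramming(search_outer=True)
path_outer, info_outer = oe.contract_path('ij,j,jk->', a, b, c, optimize=opt)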


@dgasmith maybe this would be a good time to replace optimal with dynamic programming, either with search_outer=True or (preferably, in my opinion) with outer products automatically switched off for larger contractions - sizes which optimal can't handle in any case. The only slight difference is that 'dp' doesn't assume real FLOPs, but practically that doesn't matter... and maybe even makes slightly more sense.

igtrnt (Author) commented Aug 24, 2021

@jcmgray Here is a bigger example:

inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
output = ''
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}

optimal ends up with the following contraction path (SSA indexing):

[(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]

Inside optimal, the cost of each contraction is the following:

[48, 1386, 112, 32, 10, 660, 60, 1]

total: 2309 FLOP.

The correct costs of contracting along the above path are:

[48, 1386, 1232, 352, 110, 330, 330, 660]

total: 4448 FLOP.

So starting from the third contraction, optimal gets wrong costs due to stale entries in the cache.
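
Here is a small standalone checker (a sketch, not opt_einsum API) that walks the SSA path above and recounts each contraction with opt_einsum's own helpers.flop_count, using the find_contraction convention for inner:

# Walk the SSA path and recompute each contraction's FLOPs.
from opt_einsum import helpers

inputs = ["ABC", "D", "EF", "CE", "G", "H", "BH", "D", "BH"]
output = ''
sizes = {'A': 9, 'B': 11, 'C': 7, 'D': 24, 'E': 8, 'F': 2, 'G': 5, 'H': 30}
ssa_path = [(1, 7), (0, 9), (3, 10), (2, 11), (4, 12), (6, 8), (5, 13), (14, 15)]

terms = [set(t) for t in inputs]   # SSA id -> index set
live = set(range(len(terms)))      # SSA ids not yet consumed
costs = []
for i, j in ssa_path:
    live -= {i, j}
    idx_contract = terms[i] | terms[j]
    # an index must be kept if it appears in the output or in any live term
    keep = set(output).union(*(terms[k] for k in live))
    k12 = idx_contract & keep
    inner = bool(idx_contract - keep)   # i.e. bool(idx_removed)
    costs.append(helpers.flop_count(idx_contract, inner, 2, sizes))
    terms.append(k12)
    live.add(len(terms) - 1)

print(costs, sum(costs))   # [48, 1386, 1232, 352, 110, 330, 330, 660] 4448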

BTW, I think you also have a bug in computing the inner flag that is passed to the helpers.flop_count function.
In optimal, it is bool(shared - keep).
In contract_path, idx_removed is used for inner:

        contract_tuple = helpers.find_contraction(contract_inds, input_sets, output_set)
        out_inds, input_sets, idx_removed, idx_contract = contract_tuple

        # Compute cost, scale, and size
        cost = helpers.flop_count(idx_contract, idx_removed, len(contract_inds), size_dict)

helpers.find_contraction essentially computes idx_removed as either - keep (in optimal's terminology).

So there is a discrepancy between how the inner value is computed in optimal and in contract_path. The FLOP counts above were computed without this discrepancy (I fixed optimal to match find_contraction), i.e. apples to apples.

dgasmith (Owner) commented Sep 2, 2021

Apologies, I'm a bit swamped at work at the moment and slow to respond here. @igtrnt, is there a quick proposed fix? Otherwise we can look at deprecating optimal, which only has a small performance edge for <8 tensors.

igtrnt (Author) commented Sep 2, 2021

I don't see how the cache can be fixed. The purpose of the cache is to avoid recomputing the resulting k12 tensor. But since k12 depends not only on k1 and k2, but also on what's left in remaining, there is no way to cache it.

However, I ported optimal to C++ and discovered that in the C++ code recomputing k12 every time is faster than fetching it from the cache (at least for up to 12 tensors - I did not try more). So in my C++ code, getting rid of the cache fixed the error as well as made the code faster. I should say that I use bitsets to represent sets of tensor indices, so set intersection/union operations are very fast.
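
The bitset idea, sketched here in Python with plain ints standing in for C++ std::bitset (the index-to-bit mapping is illustrative):

# Map each index to a bit, so set intersection/union become single int ops.
bit = {'i': 0, 'j': 1, 'k': 2}
mask = lambda term: sum(1 << bit[ix] for ix in term)

k1, k2 = mask('ij'), mask('jk')
shared = k1 & k2    # bit for 'j'
either = k1 | k2    # bits for 'i', 'j', 'k'
print(bin(shared), bin(either))   # 0b10 0b111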

I also feel that you might have a similar issue with greedy, but it's more subtle. You put (cost, k1, k2, k12) into the priority queue. Again, k12 may change depending on when the contraction happens. However, in greedy you will put another (cost_new, k1, k2, k12_new) into the priority queue, and it will be in front of (cost, k1, k2, k12), because for the default cost = size(k12) - size(k1) - size(k2) the cost can only decrease as remaining shrinks, and size(k12_new) is smaller than size(k12).

If a user-provided cost function is not monotone (i.e., size(k12_new) may be > size(k12)), greedy will fail (k12 may still contain indices that have been contracted out and are not in k12_new).

jcmgray (Collaborator) commented Sep 2, 2021

I also feel that you might have a similar issue with greedy.

I think with greedy, the optimizer actively performs 'deduplication' as the first step, i.e. repeated terms "ab,ab,..." are immediately reduced into a single term with a hadamard product. I suspect that fixes most (maybe all?) issues where k12 as an intermediate is not unique. Possibly adding this preprocessing to optimal would be a fix - but my preference is just to switch to 'dp'.

I also think that k1, k2 -> k12 cannot change depending on the order of other contractions - since an index is only removed if the contraction contains the last two instances of it. However, if you find a counter-example, let us know!


I'd definitely be interested if you have any numbers on the C++ version vs optimal and the 'dp' algorithm. A compiled version of the 'dp' algorithm would be most useful, but not so simple to rewrite.

igtrnt (Author) commented Sep 2, 2021

Yes, deduplication is done, but only when k1 == k2 - those are contracted first.
But I am talking about something else. Here is an example:

'ij,jz,jk,z->',
sizes = {'i': 10, 'j': 3, 'k': 1, 'z': 2}

At the first iteration, greedy chooses to contract ij and jz.
Here is how the queue looks at the start of the second iteration (printing it in the debugger):

for (cost, i1, i2), k1, k2, k12 in queue: print(cost, k1, k2, k12)

-7 frozenset({'j', 'k'}) frozenset({'j', 'z'}) frozenset({'z'})
-3 frozenset({'j', 'z'}) frozenset({'j', 'k'}) frozenset({'j', 'z'})
-5 frozenset({'j', 'z'}) frozenset({'z'}) frozenset({'j'})

The first and second queue entries have the same k1, k2 (in a different order) and different k12. Greedy could choose the wrong contraction (it does not have a better cost in this example) if the user-supplied cost were computed differently.
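
The two costs above follow directly from greedy's default heuristic, cost = size(k12) - size(k1) - size(k2) (a quick sketch; requires Python 3.8+ for math.prod):

# Check the queue costs: same (k1, k2) pair enqueued with two different k12.
from math import prod

sizes = {'i': 10, 'j': 3, 'k': 1, 'z': 2}
size = lambda term: prod(sizes[ix] for ix in term)

print(size({'z'}) - size({'j', 'k'}) - size({'j', 'z'}))       # -7 (fresh k12)
print(size({'j', 'z'}) - size({'j', 'k'}) - size({'j', 'z'}))  # -3 (stale k12)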

jcmgray (Collaborator) commented Sep 2, 2021

Ah I see, you are right! I think I was overestimating greedy's preprocessing / conflating it with dp's. dp does 'single term' reductions, so in this case ij->j, jk->j, and then deduplication would make j,j->j. Though dp doesn't actually need to do the deduplication, as it uses bitsets to represent terms as well.

A preprocessor that does both - so that the inputs are generally unique and contain only shared indices - might be useful, and has been discussed before in #114; see the rough sketch below.
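
A rough sketch of what such a preprocessor could look like (hypothetical helper, not opt_einsum API; it ignores indices repeated within a single term):

# Single-term index reductions, then deduplication.
from collections import Counter

def preprocess(terms, output):
    out = set(output)
    counts = Counter(ix for term in terms for ix in term)
    # single-term reductions: sum away indices that appear in only one
    # term and not in the output (e.g. ij->j, jk->j)
    reduced = [''.join(ix for ix in term if ix in out or counts[ix] > 1)
               for term in terms]
    # deduplication: identical terms combine into one via a hadamard product
    return list(dict.fromkeys(reduced))

print(preprocess(['ij', 'jk'], ''))   # ['j'], i.e. ij->j, jk->j, then j,j->j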

Else, for greedy specifically, one would need to change the representation so that terms were e.g. frozensets of ints representing initial tensor positions; this is what cotengra does. One would have to keep a close eye on performance, however, as speed is the main advantage of greedy.
