Bug: RepartitionExec sometimes incorrectly reports "Error" when output is not completely consumed #575

Closed
alamb opened this issue Jun 16, 2021 · 0 comments · Fixed by #576
Labels: bug

alamb commented Jun 16, 2021

Describe the bug
If one of the repartition operator's output streams is not completely consumed, RepartitionExec may return an error on one of the other output streams.

So roughly the picture looks like:

                   ┌───────────────┐
                   │   Consumer    │
                   └───────────────┘
                           │
              ┌────────────┴─────────────┐
              │                          │
              ▼                          ▼
   ┌────────────────────┐     ┌────────────────────┐
   │RepartitionStream 0 │     │RepartitionStream 1 │
   └────────────────────┘     └────────────────────┘
              │                          │
              │                          ├───────────────┐
              │                          │               ▼
           ┌──┤                          │    ┌────────────────────┐
           │  └──────────────────────────┼───▶│   InputStream B    │
           │                             │    └────────────────────┘
           ▼                             │
┌────────────────────┐                   │
│   InputStream A    │◀──────────────────┘
└────────────────────┘

If RepartitionStream 0 is dropped before both InputStream A and InputStream B complete, the repartition exec may still try to send a batch to RepartitionStream 0, find the channel closed, and report an error, which will be seen by RepartitionStream 1.
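To make the failure mode concrete, here is a minimal sketch of the round-robin distribution step. It is not the RepartitionExec code and not necessarily how #576 addresses the problem: it assumes `std::sync::mpsc` channels as a stand-in for the unbounded channels RepartitionExec actually uses, and `Batch`, `distribute_strict`, `distribute_tolerant`, and `demo` are hypothetical names invented for illustration. The strict variant treats a closed output channel as a fatal error (the behavior described above); the tolerant variant skips outputs whose receiver has been dropped.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical stand-in for a record batch.
type Batch = Vec<i32>;

/// Round-robin the batches over the outputs, treating ANY failed send as fatal.
/// This mirrors the problem described above: one dropped consumer aborts
/// distribution for every other output as well.
fn distribute_strict(batches: Vec<Batch>, outputs: &[Sender<Batch>]) -> Result<usize, String> {
    let mut sent = 0;
    for (i, batch) in batches.into_iter().enumerate() {
        let idx = i % outputs.len();
        outputs[idx]
            .send(batch)
            .map_err(|_| format!("output {} is closed", idx))?;
        sent += 1;
    }
    Ok(sent)
}

/// Round-robin the batches over the outputs, silently skipping outputs whose
/// receiver has been dropped (assumption: a dropped consumer no longer wants data).
fn distribute_tolerant(batches: Vec<Batch>, outputs: &[Sender<Batch>]) -> usize {
    let mut sent = 0;
    for (i, batch) in batches.into_iter().enumerate() {
        let idx = i % outputs.len();
        if outputs[idx].send(batch).is_ok() {
            sent += 1;
        }
    }
    sent
}

/// Drops the receiver for output 0 before any batch is sent, then runs both
/// variants. Returns (did the strict variant fail?, batches delivered to output 1
/// by the tolerant variant).
fn demo() -> (bool, usize) {
    let (tx0, rx0) = channel::<Batch>();
    let (tx1, rx1) = channel::<Batch>();
    drop(rx0); // "RepartitionStream 0" goes away early

    let batches: Vec<Batch> = vec![vec![1], vec![2], vec![3], vec![4]];

    // The first batch targets output 0, so the strict variant fails immediately.
    let strict_failed = distribute_strict(batches.clone(), &[tx0.clone(), tx1.clone()]).is_err();

    // The tolerant variant still delivers the batches destined for output 1.
    distribute_tolerant(batches, &[tx0, tx1]);
    let delivered_to_output_1 = rx1.try_iter().count();

    (strict_failed, delivered_to_output_1)
}
```

In this sketch the receiver is dropped before anything is sent, so the error is deterministic. In the real operator it depends on whether the drop happens before or after the remaining batches are sent.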

To Reproduce
I am working on a reproducer.

Reproducing this error is made more challenging by the fact that the repartition stream uses unbounded channels, so it is very timing dependent.

Expected behavior
No errors should be produced

Additional context
We have a test that fails intermittently: https://github.com/influxdata/influxdb_iox/issues/1735

In the plan below, 'ExecutionPlan(PlaceHolder)' is an extension node that behaves like LIMIT, in that it may decide to stop consuming its input after producing some output.

The plan being run looks like:

ExecutionPlan(PlaceHolder)
  ProjectionExec: expr=[borough, city, state]
    CoalesceBatchesExec: target_batch_size=500
      FilterExec: 1 <= time AND time < 550 AND CAST(state AS Utf8) = NY
        RepartitionExec: partitioning=RoundRobinBatch(4)
          IOxReadFilterNode: table_name=o2, chunks=1 predicate=Predicate exprs: [TimestampNanosecond(1) LtEq #time, #time Lt TimestampNanosecond(550), #state Eq Utf8("NY")]

While I have recently been messing with RepartitionExec as part of #521, it appears the error behavior predates that change. However, the error is now passed up to the caller.

@alamb alamb added the bug Something isn't working label Jun 16, 2021
@alamb alamb self-assigned this Jun 16, 2021