[C++] Create a ForEach library function that runs on an iterator of futures #26190
Comments
Antoine Pitrou / @pitrou: |
Weston Pace / @westonpace: The current implementation is:
|
Weston Pace / @westonpace: |
Antoine Pitrou / @pitrou: |
Weston Pace / @westonpace: More significantly though, the AsCompletedIterator still does waits, which is what I'm trying to avoid. I'm attaching a diagram to this sub task that will hopefully provide a bit more explanation. |
Weston Pace / @westonpace: |gzip/cache|6.291222|0.095669|6.467804|0.035468|6229|6.262252|0.056097|4149| |
Weston Pace / @westonpace: | |2.0.0 (Mean)|2.0.0 (StdDev)|Async (Mean)|Async (StdDev)| |
Weston Pace / @westonpace: An AsyncGenerator's Next() function should never need to be called by more than one thread at once, the way an iterator's might be. Instead, the question is whether you can call a generator's Next() function before the future returned by the previous call has completed. The two questions are similar but not the same, so this is less a matter of "thread safe" than of "reentrant".

AsyncGenerators come in both flavors. A decompressing node, the CSV block reader, and the CSV chunker are all quite stateful and must finish processing one block before they can begin consuming the next, so you should not call Next() again until the returned future has resolved. Parsing and converting, on the other hand, are free to run in parallel. In addition, any queuing stage (AddReadahead and BackgroundIterator) can be called in parallel, which allows pipeline parallelism. Since everything is pull driven, this "parallel pull" is driven from the AddReadahead nodes and could be driven from a "visit" or "collect" sink as well.

So, just summarizing what we have today...

Sources:
Intermediate Nodes:
Sinks:

Today we have...

BackgroundGenerator -> AddReadahead -> Transformer -> Transformer -> Visit(1)

It would be an error, for example, to do...

BackgroundGenerator -> AddReadahead -> Transformer -> Transformer -> Visit(N)

...or...

BackgroundGenerator -> AddReadahead -> Transformer -> Transformer -> AddReadahead -> Visit(1)

...because both of those would cause Transformer (which is not reentrant) to be pulled reentrantly. I am wondering if there is some merit in encoding these rules into the types themselves so that something like that would fail to compile.
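As a rough illustration of what "encoding these rules into the types" might look like, here is a minimal compile-time sketch. It is not Arrow's API: the struct names only mirror the node names used in the comment above, and the `kReentrant` / `kValid` members and the `RequireValid` helper are hypothetical names invented for this example. The assumption is that a source or queuing stage may be pulled reentrantly, a Transformer may not, and that AddReadahead and Visit(N) with N > 1 pull their source reentrantly.

```cpp
// Hypothetical sketch: each node type records whether it may be pulled
// reentrantly (kReentrant) and whether the pipeline up to it is legal (kValid).

struct BackgroundGenerator {
  static constexpr bool kReentrant = true;  // assumed: queuing stage
  static constexpr bool kValid = true;
};

template <typename Source>
struct AddReadahead {
  static constexpr bool kReentrant = true;
  // Readahead asks for several items at once, so its source must be reentrant.
  static constexpr bool kValid = Source::kValid && Source::kReentrant;
};

template <typename Source>
struct Transformer {
  static constexpr bool kReentrant = false;  // stateful, one item at a time
  // A transformer pulls serially, so any valid source is acceptable.
  static constexpr bool kValid = Source::kValid;
};

template <int N, typename Source>
struct Visit {
  // Visit(1) pulls serially; Visit(N > 1) pulls reentrantly.
  static constexpr bool kValid =
      Source::kValid && (N == 1 || Source::kReentrant);
};

// Instantiating this with an illegal pipeline is a compile error.
template <typename Pipeline>
constexpr bool RequireValid() {
  static_assert(Pipeline::kValid, "a non-reentrant node is pulled reentrantly");
  return true;
}

// The three pipelines from the comment above.
using Base = Transformer<Transformer<AddReadahead<BackgroundGenerator>>>;
using Today = Visit<1, Base>;
using ParallelVisit = Visit<4, Base>;
using LateReadahead = Visit<1, AddReadahead<Base>>;

static_assert(Today::kValid, "serial visit is fine");
static_assert(!ParallelVisit::kValid, "Visit(N) pulls Transformer reentrantly");
static_assert(!LateReadahead::kValid, "AddReadahead pulls Transformer reentrantly");
```

Calling `RequireValid<Today>()` compiles, while `RequireValid<ParallelVisit>()` or `RequireValid<LateReadahead>()` would be rejected by the compiler. A real design would also need to carry the item types and the actual generator objects, which this sketch leaves out.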
Weston Pace / @westonpace: Mapper (just a flatMap; fits the Transformer model but should be able to be made reentrant; emits 8 times for each input). All of these could be "transformers", but the existing transform model would make them non-reentrant, even though all of them "should" be able to be reentrant. I think this points to gaps in the transformer model. I have an improved model in mind, but it was not compatible with synchronous iterators, so I abandoned it. I may have to revisit it.
Ben Kietzman / @bkietz: |
This method should take in an iterator of futures and a callback and pull an item off the iterator, "await" it, run the callback on it, and then fetch the next item from the iterator.
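The loop described above can be sketched without any blocking waits by chaining callbacks. The following is a minimal, single-threaded illustration in C++17, not the implementation that landed in Arrow: `SimpleFuture`, `ForEachState`, `Step`, and this `ForEach` signature are all hypothetical stand-ins (in particular, `SimpleFuture` is not `arrow::Future`), and a `std::vector` of futures stands in for the iterator.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical single-threaded stand-in for a future; copies share state.
template <typename T>
class SimpleFuture {
 public:
  SimpleFuture() : state_(std::make_shared<State>()) {}

  // Complete the future and run the registered callback, if any.
  void MarkFinished(T value) {
    assert(!state_->value.has_value() && "future completed twice");
    state_->value = std::move(value);
    if (state_->callback) {
      auto cb = std::move(state_->callback);
      state_->callback = nullptr;
      cb(*state_->value);
    }
  }

  // Run `cb` once the value is available (immediately if already finished).
  void AddCallback(std::function<void(const T&)> cb) {
    if (state_->value.has_value()) {
      cb(*state_->value);
    } else {
      state_->callback = std::move(cb);
    }
  }

 private:
  struct State {
    std::optional<T> value;
    std::function<void(const T&)> callback;
  };
  std::shared_ptr<State> state_;
};

template <typename T>
struct ForEachState {
  std::vector<SimpleFuture<T>> futures;  // stands in for the iterator
  std::size_t next = 0;
  std::function<void(const T&)> visitor;
  std::function<void()> on_done;
};

// Pull one future, and only once it completes visit it and pull the next.
template <typename T>
void Step(std::shared_ptr<ForEachState<T>> state) {
  if (state->next == state->futures.size()) {
    if (state->on_done) state->on_done();
    return;
  }
  SimpleFuture<T> fut = state->futures[state->next++];
  fut.AddCallback([state](const T& value) {
    state->visitor(value);
    Step(state);
  });
}

template <typename T>
void ForEach(std::vector<SimpleFuture<T>> futures,
             std::function<void(const T&)> visitor,
             std::function<void()> on_done) {
  auto state = std::make_shared<ForEachState<T>>();
  state->futures = std::move(futures);
  state->visitor = std::move(visitor);
  state->on_done = std::move(on_done);
  Step<T>(state);
}
```

A call such as `ForEach<int>(futures, visitor, on_done)` returns immediately; the visitor then runs in iterator order as each future is completed, and `on_done` runs after the last one. Two simplifications are worth noting: nothing here is thread safe, and a long run of already-completed futures recurses through `Step`, so a production version would need a loop or trampoline to bound stack depth.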
Reporter: Weston Pace / @westonpace
Assignee: Weston Pace / @westonpace
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-10183. Please see the migration documentation for further details.