Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a task executor when executor is not provided #12606

Merged
merged 9 commits into from
Oct 3, 2023

Conversation

javanna
Copy link
Contributor

@javanna javanna commented Sep 28, 2023

As we introduce more places where we add concurrency (there are currently three) there is a common pattern around checking whether there is an executor provided, and then going sequential on the caller thread or parallel relying on the executor.

That can be improved by internally creating a TaskExecutor that relies on an executor that executes tasks on the caller thread, which ensures that the task executor is never null, hence the common conditional is no longer needed, as the concurrent path that uses the task executor would be the default and only choice for operations that can be parallelized.

As we introduce more places where we add concurrency (there are
currently three) there is a common pattern around checking whether there
is an executor provided, and then going sequential on the caller thread
or parallel relying on the executor.

That can be improved by internally creating a TaskExecutor that relies
on an executor that executes tasks on the caller thread, which ensures
that the task executor is never null, hence the common conditional is no
longer needed, as the concurrent path that uses the task executor would
be the default and only choice for operations that can be parallelized.
@javanna javanna added this to the 9.9.0 milestone Sep 28, 2023
*
* @lucene.experimental
*/
public LeafSlice[] getSlices() {
return (executor == null) ? null : leafSlicesSupplier.get();
return leafSlicesSupplier.get();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this method is still needed, perhaps it's fine but we could make it final as a follow-up? It could be confusing otherwise for users to figure which of the two methods needs to be overridden between slices and getSlices ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1....I think we should keep this but make final here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, do you have usecases to call this method as a consumer?

Copy link
Contributor

@shubhamvishu shubhamvishu Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the method #slices(List<LeafReaderContext> leaves) is protected in IndexSearcher, I thought #getSlices could be used by other classes(not subclass of IndexSearcher and in different package) to get all the slices without having to pass maxDocsPerSlice and maxSegmentsPerSlice to make use in concurrency?. But giving it a thought again, I feel maybe we should remove/deprecate #getSlices and make slices(List<LeafReaderContext> leaves) public because #getSlices is not doing any extra work other than returning those leaf slices(which is obtained from #slices)? Let me know what do you think?

Copy link
Contributor

@shubhamvishu shubhamvishu Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I see the other #slices method is currently static so I think we should make the other one static too if incase we are pursuing this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sohami I think you could provide the context here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javanna getSlices() was kept as a convenience method to handle the null executor case such that usage in IndexSearcher doesn't have to explicitly perform check for both executor being null and return value of leafSlices.
This method will be useful if consumer wants to know the slices created by the provider or the count of slices. For example: In OpenSearch, we are using this method to emit a metric around slice count per searcher which can give indication around if a search request is getting too many slices and if we need to adjust that by providing some other slice provider.

Also slices() method was kept protected such that extensions can provide their own implementation of it to control how slices are generated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks are you ok if we make getSlices final? I'd like to clarify confusion around the two methods and how they should be used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javanna yes that should be fine. Thanks for checking!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #12718

@shubhamvishu
Copy link
Contributor

I really like this! This looks much more cleaner.

for (LeafReaderContext context : reader.leaves()) {
tasks.add(taskExecutor.createTask(() -> searchLeaf(context, filterWeight)));
}
TopDocs[] perLeafResults = taskExecutor.invokeAll(tasks).toArray(TopDocs[]::new);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a trade-off here: we create multiple tasks even if we are not parallelizing. That is good for simplicity, yet it makes the assumption that concurrency is not a corner case, but rather not having an executor is. We could make the task executor API more complex to factor in whether a real executor is provided or not, but I am not sure that's a good trade-off. I'd like to assume that we are evolving Lucene to make use of concurrency more and more, and at some point having an executor to parallelize is the default.

@javanna
Copy link
Contributor Author

javanna commented Sep 29, 2023

Thanks for looking @shubhamvishu !

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to assume that we are evolving Lucene to make use of concurrency more and more, and at some point having an executor to parallelize is the default.

++

@javanna javanna merged commit 2106bf5 into apache:main Oct 3, 2023
4 checks passed
javanna added a commit that referenced this pull request Oct 3, 2023
As we introduce more places where we add concurrency (there are
currently three) there is a common pattern around checking whether there
is an executor provided, and then going sequential on the caller thread
or parallel relying on the executor.

That can be improved by internally creating a TaskExecutor that relies
on an executor that executes tasks on the caller thread, which ensures
that the task executor is never null, hence the common conditional is no
longer needed, as the concurrent path that uses the task executor would
be the default and only choice for operations that can be parallelized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants