This repository has been archived by the owner on May 12, 2021. It is now read-only.
[PIO-106] Elasticsearch 5.x StorageClient should reuse RestClient #420
Implements PIO-106
This PR moves to a singleton Elasticsearch RestClient, which has built-in HTTP keep-alive and TCP connection pooling. Running on this branch, we've seen a 2x speed-up in predictions from the Universal Recommender with ES5, and the feared "cannot assign requested address" 😱 Elasticsearch connection errors have completely disappeared. Running `pio batchpredict` for 160K queries results in only 7 total TCP connections to Elasticsearch. Previously that would escalate to ~25,000 connections before denying further connections.

This fundamentally changes the interface for the new Elasticsearch 5.x REST client introduced with PredictionIO 0.11.0-incubating. With this changeset, the `client` is a single instance of `org.elasticsearch.client.RestClient`.

🚨 As a result of this change, any engine templates that directly use the Elasticsearch 5 StorageClient would require an update for compatibility. The change is this:
Original
With this PR
No more balancing `open` & `close`, as this is handled by a new `CleanupFunctions` hook added to the framework in this PR.

Universal Recommender is the only template that I know of which directly uses the ES StorageClient outside of PredictionIO core. See example UR changes for compatibility with this PR.
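
For readers unfamiliar with the pattern, here is a minimal sketch of the "build once, share everywhere" idea behind the singleton client. It is not the code from this PR: `FakeRestClient` and `SharedClient` are hypothetical names, and `FakeRestClient` is only a stand-in for `org.elasticsearch.client.RestClient` so that the example runs without the Elasticsearch dependency.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for org.elasticsearch.client.RestClient.
// It only tracks how many instances were built and whether it was closed.
final class FakeRestClient(val hosts: Seq[String]) extends AutoCloseable {
  FakeRestClient.created.incrementAndGet()
  @volatile private var closed = false
  def isClosed: Boolean = closed
  override def close(): Unit = { closed = true }
}

object FakeRestClient {
  // Counts constructions, to show that the holder below builds only one client.
  val created = new AtomicInteger(0)
}

// Hypothetical holder: builds the client on first use and hands the
// same instance to every later caller, so connections can be reused.
object SharedClient {
  private var client: Option[FakeRestClient] = None

  def get(hosts: Seq[String]): FakeRestClient = synchronized {
    client.getOrElse {
      val c = new FakeRestClient(hosts)
      client = Some(c)
      c
    }
  }

  // Meant to be called once, at the very end of a workflow.
  def close(): Unit = synchronized {
    client.foreach(_.close())
    client = None
  }
}
```

Callers no longer open and close a client per request; they ask the holder for the shared instance, and a single `close()` at the end of the workflow releases it.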
Elasticsearch StorageClient changes
See StorageClient
Core changes
A new `CleanupFunctions` hook has been added which enables developers of storage modules to register anonymous functions with `CleanupFunctions.add { … }` to be executed after Spark-related commands/workflows. The hook is called in a `finally { CleanupFunctions.run() }` from within:

- `pio import`
- `pio export`
- `pio train`
- `pio batchpredict`
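
The hook described above can be sketched roughly as follows. Only the `CleanupFunctions.add { … }` and `CleanupFunctions.run()` call shapes come from this description; the internals shown here (a synchronized buffer of thunks) and the `runWithCleanup` helper are assumptions made for illustration, not the implementation in this PR.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of a cleanup registry; the internals are assumed, not taken from the PR.
object CleanupFunctions {
  private val functions = ArrayBuffer.empty[() => Unit]

  // Register a block of code to be executed later by run().
  def add(f: => Unit): Unit = synchronized {
    functions += (() => f)
    ()
  }

  // Execute every registered block, in registration order.
  def run(): Unit = synchronized {
    functions.foreach(f => f())
  }
}

// Hypothetical helper showing the try-finally shape used around a workflow:
// the cleanup functions run whether the workflow succeeds or throws.
def runWithCleanup(workflow: => Unit): Unit =
  try workflow
  finally CleanupFunctions.run()
```

A storage module would register something like `CleanupFunctions.add { client.close() }` when it creates its client, and each command would wrap its Spark workflow in the try-finally shown in `runWithCleanup`.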
Apologies for the huge indentation shifts from the requisite try-finally blocks: