Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-3857][Streaming Connectors]Add reconnect attempt to Elasticsearch host #1962

Closed
wants to merge 1 commit into from

Conversation

sbcd90
Copy link
Contributor

@sbcd90 sbcd90 commented May 4, 2016

  • General
    • The pull request references the related JIRA issue ("[FLINK-3857] Add reconnect attempt to Elasticsearch host")
  • Documentation
    • Documentation added based on the changes made.
  • Tests & Build
    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

@sbcd90
Copy link
Contributor Author

sbcd90 commented May 4, 2016

Hello @fhueske , The tests do not fail because of the changes made in the PR. I tested the Junits for elasticsearch connector & all of them runs fine.
Can you kindly have a look?

@@ -86,6 +88,7 @@
public static final String CONFIG_KEY_BULK_FLUSH_MAX_ACTIONS = "bulk.flush.max.actions";
public static final String CONFIG_KEY_BULK_FLUSH_MAX_SIZE_MB = "bulk.flush.max.size.mb";
public static final String CONFIG_KEY_BULK_FLUSH_INTERVAL_MS = "bulk.flush.interval.ms";
public static final String CONFIG_NO_OF_CONN_RETRIES = "conn.retries";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually don't abbreviate configuration keys. Can you rename it to "CONFIG_KEY_CONNECTION_RETRIES" ?

@rmetzger
Copy link
Contributor

rmetzger commented May 5, 2016

Thank you for opening the pull request.

I made some inline comments.
I don't think the proposed changes fix the issue described in the JIRA.
I would check on each invoke() if hasFailure is set. If that's the case, you can reconnect to EL.

@rmetzger
Copy link
Contributor

rmetzger commented May 5, 2016

The change is missing documentation updates & test cases.

@sbcd90 sbcd90 force-pushed the elasticSearchRetryIssue branch 2 times, most recently from d24aa6e to 9f5df39 Compare May 5, 2016 22:24
@sbcd90
Copy link
Contributor Author

sbcd90 commented May 5, 2016

Hello @rmetzger ,

Thanks a lot for reviewing the PR.
I have made all the changes mentioned by you as inline comments as well as added some documentation.
Kindly have a look now.

@sbcd90
Copy link
Contributor Author

sbcd90 commented May 6, 2016

Hello @rmetzger ,

Looking at the test case ElasticsearchSinkItCase.testTransportClient, I think to test the re-connect scenario the hasFailure may need to be made public so that the test-method can set it.
Can you kindly provide some suggestions?

@rmetzger
Copy link
Contributor

rmetzger commented May 6, 2016

How did you test/try out the code you've implemented in this pull request?

@sbcd90
Copy link
Contributor Author

sbcd90 commented May 6, 2016

Hello @rmetzger ,

I added a testcase now to the ElasticsearchSinkITCase.java list of tests. Can you kindly have a look once?

@StephanEwen
Copy link
Contributor

I took a quick look at this. I am wondering if this actually needs an extra timer service for retries.

Can this be solved without a timer? The failures could be detected in the invoke(...) method, and the retry done directly there (with some minimal backoff or so).

Triggering asynchronous timers is very complex and easily creates leaks, races, or leftover work / tasks at shutdown.

@sbcd90
Copy link
Contributor Author

sbcd90 commented Jun 9, 2016

Hello @StephanEwen ,

I have removed the asynchronous timer & doing the retry logic directly now. The backoff is 3s. Please have a look.


if (params.has(CONFIG_KEY_CONNECTION_RETRIES)) {
this.connectionRetries = params.getInt(CONFIG_KEY_CONNECTION_RETRIES);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to have a default value set if not specified by user? Otherwise null exception in invoke().

@tzulitai
Copy link
Contributor

tzulitai commented Jun 9, 2016

Hi @sbcd90,

Gave the changes a quick review and commented. please let me know your opinion on them. Hope they'll be helpful to get you going.

@sbcd90
Copy link
Contributor Author

sbcd90 commented Jun 10, 2016

Hello @tzulitai ,

I think default value for int in Java is 0.
The check if connection is lost or not & then retry for connection is a good suggestion. Made the change.
separated the methods for connection creation & connection status check.

@tzulitai
Copy link
Contributor

Hi @sbcd90,
Sorry for the late reply, as I'm currently busy some other things. I'll be happy to help review again within the next 2~3 days.

LOG.info("Created Elasticsearch TransportClient {}", client);
if (LOG.isInfoEnabled()) {
LOG.info("Created Elasticsearch TransportClient {}", client);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log message here doesn't match the purpose of this method

@tzulitai
Copy link
Contributor

Hi @sbcd90,
I think to address "add reconnect attempt" alone, checking whether or not the transport client is connected to nodes and retry connect if lost connection in invoke() before processing the element should be fine.

On the other hand, another problem that raises if we are to add reconnect attempt for the ES sink is that failing records due to connection errors also need to be caught in the BulkProcessor afterBulk() callback and re-processed. I wonder if we should be solving this together to resolve this issue. @rmetzger, what's your opinion?

@HungUnicorn
Copy link
Contributor

HungUnicorn commented Aug 25, 2016

Using the sniffing feature of transport client can achieve this with the current ES2 connector in master branch.
The client will connect to all existing nodes and the connected list is updated every 5 seconds. It can fit our case because we will only have to specify one ip, and we will obtain a list of ips which updated periodically. It's done by
Settings settings = Settings.settingsBuilder().put(userConfig).put("client.transport.sniff", true).build();

Explanation:

The Transport client comes with a cluster sniffing feature which allows it to dynamically add new hosts and remove old ones. When sniffing is enabled the the transport client will connect to the nodes in its internal node list, which is built via calls to addTransportAddress. After this, the client will call the internal cluster state API on those nodes to discover available data nodes. The internal node list of the client will be replaced with those data nodes only. This list is refreshed every five seconds by default.

Source:
https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.3//transport-client.html

@tzulitai
Copy link
Contributor

Thanks @HungUnicorn, thats useful info. I wonder though if this config should be set by the user, instead of letting the connector internally set this.

@tzulitai
Copy link
Contributor

tzulitai commented Jan 23, 2017

Hi @sbcd90, will you like to continue working on this PR?

There's going to be a restructuring of the ES connectors (#3112) perhaps soon after the 1.2 release, and this PR will very likely need a rebase. I'd like to include this fix after the restructuring, so please let me know on how you'd like to proceed with this contribution :-) Thank you!

@sbcd90
Copy link
Contributor Author

sbcd90 commented Jan 26, 2017

Hi @tzulitai , thanks for the updates. I'll refactor the code & will rebase the PR.

@tzulitai
Copy link
Contributor

Thank you for picking this up again @sbcd90. I would wait until #3112 is merged before rebasing.

@sbcd90
Copy link
Contributor Author

sbcd90 commented Feb 17, 2017

Hello @tzulitai ,

I have rebased the changes. Can you please review?

Copy link
Contributor

@tzulitai tzulitai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the rebase, and really sorry for the long delay on the review here @sbcd90.
I finally have some time to revisit the Elasticsearch connectors :) I have some comments, could you please let me know what you think?

elasticsearchSinkFunction.process(value, getRuntimeContext(), requestIndexer);
} catch (Exception ex) {
if (client instanceof TransportClient && !callBridge.isConnected(((TransportClient) client))) {
TimeUnit.SECONDS.sleep(3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be configurable?
Also, could you explain a bit on why you've chosen 3 seconds?


// if there is a connectivity failure, then retry
if (failureThrowable.get() != null &&
client instanceof TransportClient &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exactly does the client need to be a TransportClient?

@@ -208,6 +222,13 @@ public void invoke(T value) throws Exception {
checkErrorAndRethrow();

elasticsearchSinkFunction.process(value, getRuntimeContext(), requestIndexer);

// if there is a connectivity failure, then retry
if (failureThrowable.get() != null &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether or not checking the exception type would be enough for verifying the connectivity failure.

}

try {
open(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This open() call seems a bit odd to me. I don't think its a good practice to call that here, since essentially its a life cycle method used by the system.

@aljoscha
Copy link
Contributor

I'm closing this as "Abandoned", since there is no more activity and the code base has moved on quite a bit. Please re-open this if you feel otherwise and work should continue.

@aljoscha aljoscha closed this Oct 14, 2019
@aljoscha aljoscha self-assigned this Oct 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants