@balashashanka
Contributor

  • Replaced the Transport Client with the REST client.
  • The test cases only pass when an Elasticsearch server is set up locally.

@lewismc
Member

lewismc commented Nov 7, 2019

Thank you @balashashanka for this PR.
I would be in favor of deprecating indexer-elastic-rest and adopting this approach, since that would enable us to close (mark as won't fix) both NUTCH-2304 and NUTCH-2677.
We could then also clean up ivy.xml by removing the unnecessary dependencies associated with JEST.

@lewismc
Member

lewismc commented Nov 7, 2019

Is anyone aware of any trade-offs between indexer-elastic and indexer-elastic-rest?

Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @balashashanka, looks promising. I hope to be able to test it soon. The plugin.xml needs to be updated first; it's best to look at how this is done in the parse-tika plugin.

<dependencies>
<dependency org="org.elasticsearch" name="elasticsearch" rev="5.3.0" conf="*->default"/>
<dependency org="org.elasticsearch.client" name="transport" rev="5.3.0"/>
<dependency org="org.elasticsearch.client" name="elasticsearch-rest-high-level-client" rev="7.3.0"/>
Contributor

The updated dependencies must also be "registered" in indexer-elastic/plugin.xml. The description of how to upgrade ES is somewhat outdated, so it's better to look at parse-tika for both the description and build-ivy.xml.

Contributor Author

Sure @sebastian-nagel, I will add the changes to plugin.xml and remove the unnecessary dependencies.

import java.util.UUID;

/* The test works only when an Elasticsearch server is already set up,
 * because the REST client, unlike the Transport client, requires a server connection to function. */
Contributor

Understood. But then we need to deactivate the tests by default. Otherwise the nightly builds will fail.

Contributor

But maybe there is a way to mock the client, see CustomRestHighLevelClientTests.

Contributor Author

Hi Sebastian, I went through CustomRestHighLevelClientTests, and it seems to be meant for testing changes to the REST High Level Client. This will mock the client itself. But we need to mock the server with requests and responses. So can we go ahead and not do the tests at all?

@sebastian-nagel
Contributor

This will mock the client itself. But we need to mock the server with requests and responses. So can we go ahead and not do the tests at all?

Well, the previous non-REST test implemented a client which did not send anything to the server but just returned a successful response or (if clusterSaturated was set to true) a temporary failure.
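
As a rough illustration of that kind of stub (not the code of the old test), a test double along the following lines avoids any round trip to a server. `BulkResult`, `BulkSender` and `StubBulkSender` are hypothetical names invented for this sketch and are not Elasticsearch or Nutch classes; only the `clusterSaturated` flag is taken from the description above. The sketch assumes the stub keeps failing for as long as the flag is set.

```java
import java.util.List;

/** Hypothetical outcome of a bulk call; not an Elasticsearch type. */
enum BulkResult { SUCCESS, TEMPORARY_FAILURE }

/** Hypothetical seam between the index writer and the real client. */
interface BulkSender {
  BulkResult send(List<String> jsonDocs);
}

/** Stub that never contacts a server and only reports a canned result. */
class StubBulkSender implements BulkSender {
  private final boolean clusterSaturated;
  private int calls = 0;

  StubBulkSender(boolean clusterSaturated) {
    this.clusterSaturated = clusterSaturated;
  }

  @Override
  public BulkResult send(List<String> jsonDocs) {
    calls++; // record the call so a test can check that batching happened
    return clusterSaturated ? BulkResult.TEMPORARY_FAILURE : BulkResult.SUCCESS;
  }

  /** Number of bulk calls received so far. */
  int calls() {
    return calls;
  }
}
```

A test could then hand a `StubBulkSender` to the code under test and assert on the returned result and on `calls()`, without a running cluster.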

But I'm ok to remove the Test class if it's too much work to rewrite it for the REST client.

I've tested the PR but the initial rounds failed for about 50% of the pages/documents:

[2019-11-18T12:56:46,803][DEBUG][o.e.a.b.TransportShardBulkAction] [vagran] [nutch][0] failed to execute bulk item (index) index {[nutch][_doc][http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html], source[{"{date=Mon Jun 09 15:03:28 CEST 2014, type=[text/html, text, html], title=apache-nutch 2.2.1 API, url=http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html, content=apache-nutch 2.2.1 API\n<H2> Frame Alert</H2> <P> This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client. <BR> Link to<A HREF=\"overview-summary.html\">Non-frame version.</A>\n, search=apache-nutch 2.2.1 API, tstamp=Thu Jul 26 16:50:11 CEST 2018, segment=20180726164932, digest=8b8785f9cec87c0376a7fa940e0e3a6c, host=nutch.apache.org, boost=1.0, id=http://nutch.apache.org/apidocs/apidocs-2.2.1/index.html, lastModified=Mon Jun 09 15:03:28 CEST 2014}":"doc"}]}

I got it fixed by using XContentBuilder to pass the document as JSON to the ES client; you'll find the necessary changes in this branch. Also:

  • updated the description of how to upgrade the dependencies in the plugin.xml and added a few exclusions of dependencies already provided by Nutch core.
  • changed the default properties in index-writers.xml.template so that the indexer-elastic plugin works out-of-the-box with default settings.
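
To illustrate the serialization problem: the source logged in the failed bulk item has the `{key=value, ...}` shape of Java's `Map.toString()`, which suggests the document reached Elasticsearch as one string rather than as a JSON object. The JDK-only sketch below contrasts the two forms. It does not use the Elasticsearch client; `toJson` is a deliberately tiny, hypothetical stand-in for what a JSON builder such as XContentBuilder does, and it handles only strings and lists.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class DocSerializationSketch {

  /** Minimal JSON string quoting: escapes only quote, backslash and newline.
   *  A real encoder also covers the remaining control characters. */
  static String quote(String s) {
    StringBuilder sb = new StringBuilder("\"");
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        default: sb.append(c);
      }
    }
    return sb.append('"').toString();
  }

  /** Hypothetical stand-in for a JSON builder: values are strings or lists. */
  static String toJson(Map<String, Object> doc) {
    StringJoiner obj = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, Object> e : doc.entrySet()) {
      Object v = e.getValue();
      String json;
      if (v instanceof List) {
        StringJoiner arr = new StringJoiner(",", "[", "]");
        for (Object item : (List<?>) v) {
          arr.add(quote(String.valueOf(item)));
        }
        json = arr.toString();
      } else {
        json = quote(String.valueOf(v));
      }
      obj.add(quote(e.getKey()) + ":" + json);
    }
    return obj.toString();
  }

  /** Small sample document; field names and values are taken from the log above. */
  static Map<String, Object> sampleDoc() {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("title", "apache-nutch 2.2.1 API");
    doc.put("type", Arrays.asList("text/html", "text", "html"));
    return doc;
  }

  public static void main(String[] args) {
    Map<String, Object> doc = sampleDoc();
    System.out.println(doc);         // Map.toString() form, not valid JSON
    System.out.println(toJson(doc)); // JSON object with one field per key
  }
}
```

In the plugin itself, building the source with the client's own JSON builder does this job, so no hand-written encoder like `toJson` is needed.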

So far, I haven't run any tests at scale. That should be done to make sure we are able to index millions of documents with the given settings.

Please have a look at my changes. Can you integrate them into your branch?

@sebastian-nagel
Contributor

Thanks, @balashashanka! I'll run another round of tests, and will merge soon.

sebastian-nagel added a commit that referenced this pull request Nov 22, 2019
NUTCH-2739 indexer-elastic: Upgrade ES and migrate to REST client
- upgrade to Elasticsearch 7.3.0
- use Java High Level REST Client instead of deprecated TransportClient
@sebastian-nagel
Contributor

Rebased and merged into master in c23afa8. Thanks, @balashashanka!
