This repository has been archived by the owner on May 12, 2021. It is now read-only.

[PIO-49] Add support for Elasticsearch 5.x #336

Closed
wants to merge 5 commits

Conversation

haginot
Member

@haginot haginot commented Jan 16, 2017

PredictionIO / PIO-49

We are working on meta/event storage support for Elasticsearch 5.x.
Elasticsearch 2.x does not allow dots in field names, but Elasticsearch
5.x supports them, so it is worth upgrading to the ES 5.x release.
Since ES 5.x provides a Java REST API client, we replaced the
Transport communication with HTTP. Therefore, our fix
uses HTTP (port 9200) only.
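
As a rough illustration, an HTTP-only ES 5.x source in pio-env.sh might look like the fragment below. The setting names follow the existing PIO_STORAGE_SOURCES_* convention, but the exact keys depend on the final code in this PR, so treat them as an assumption:

```shell
# Hypothetical pio-env.sh fragment: the REST client talks to the
# Elasticsearch HTTP endpoint (9200), not the transport port (9300)
# used by the old native client.
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
```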

@ftopan80

+1 pretty please

@marevol
Member

marevol commented Jan 16, 2017

We use the following template.
https://github.com/marevol/incubator-predictionio-template-recommender/tree/0.11.0
(We commented out "data.ratings.take(1)" because elasticsearch-hadoop does not seem to support it.)

@dszeto
Contributor

dszeto commented Jan 17, 2017

This is really great. Thanks a lot @haginot and @marevol !

Officially, ES 1.7 and earlier are deprecated. I will start a thread on user@ and dev@ to gauge whether we need to maintain backward compatibility. If it is not too much effort, it would be nice to have two different sets of configuration and code for ES 1.x and ES 5.x.

@marevol
Member

marevol commented Jan 17, 2017

Thank you for your comment.
We will check whether we can keep the code for ES 1.x.

@dszeto
Contributor

dszeto commented Jan 17, 2017

@marevol One way to do so is to rename the existing ES configuration and code to something like ES1Events, PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE=elasticsearch1, and keep your new code as is. This may require build profiles (#295) though. Please let us know how it goes.

@marevol
Member

marevol commented Jan 18, 2017

Added Elasticsearch 1.x support!
The Elasticsearch 5.x support uses a JAR file for the REST API and does not conflict with the ES 1.x JAR file.
Therefore, both 1.x and 5.x support are available in one distribution.

@pferrel
Contributor

pferrel commented Jan 18, 2017

@marevol Thanks, this looks very promising! How do you configure PIO to use 1.x vs 5.x? Templates may need to look at the config or supply their own.

The UR uses the REST API for 1.x and can easily switch to 5.X so I'd like to coordinate.

Which config flags the Elasticsearch version? Is it PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE?

@marevol
Member

marevol commented Jan 18, 2017

To use the existing code for Elasticsearch 1.x, please use PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH1 and the PIO_STORAGE_SOURCES_ELASTICSEARCH1_* settings.

The UR uses the REST API for 1.x and can easily switch to 5.X so I'd like to coordinate.

The o.a.p.data.storage.elasticsearch.* classes contain index mappings, and these differ between 1.x and 5.x.
So, to use the new 5.x code with 1.x, the mappings would need to be modified.
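
Putting those settings together, a legacy ES 1.x source in pio-env.sh might look like this sketch. Host and port values are illustrative (9300 is the usual transport port for the 1.x native client):

```shell
# Hypothetical pio-env.sh fragment for the legacy ES 1.x backend.
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH1
PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE=elasticsearch1
PIO_STORAGE_SOURCES_ELASTICSEARCH1_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH1_PORTS=9300
```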

@dszeto
Contributor

dszeto commented Jan 18, 2017

Reviewing the code right now. It would be nice if we could include a migration guide in the docs if possible.

Contributor

@dszeto dszeto left a comment


The rest is great! Thanks a lot @haginot and @marevol !

"org.clapper" %% "grizzled-slf4j" % "1.0.2",
"org.elasticsearch.client" % "rest" % elasticsearchVersion.value,
"org.elasticsearch" % "elasticsearch" % elasticsearch1Version.value,
"org.elasticsearch" % "elasticsearch-spark-13_2.10" % elasticsearchVersion.value % "provided",
Contributor


We could do

"org.elasticsearch"       %% "elasticsearch-spark-13" % elasticsearchVersion.value % "provided",

We do not need to change this now because PIO-30 will need to be extended to take care of cross-building, and the whole artifact name will need to change.

val typeExistResponse = indices.prepareTypesExists(index).setTypes(estype).get
if (!typeExistResponse.isExists) {
val json =
val restClient = client.open()
Contributor


Should we reuse the same client after it's opened? According to https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_initialization.html the official recommendation is for the REST client to have the same lifecycle as the application. I realize this pattern is used extensively in this PR, so please let us know what you think before you go ahead and change your code.

Right now PredictionIO does not provide a cleanup step in the storage layer when the application shuts down. This is probably a good opportunity to add. Would highly appreciate if you would like to take a stab at that.
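
A minimal sketch of that lifecycle, assuming a hypothetical SharedClient holder (this is not the PR's actual code; it assumes the ES 5.x low-level REST client is on the classpath, and a real PIO cleanup hook would differ):

```scala
// Illustrative only: one REST client per application, closed once at
// shutdown, rather than one client opened per operation.
import org.apache.http.HttpHost
import org.elasticsearch.client.RestClient

object SharedClient {
  @volatile private var client: Option[RestClient] = None

  // Lazily open a single shared client on first use.
  def get(): RestClient = synchronized {
    client.getOrElse {
      val c = RestClient.builder(new HttpHost("localhost", 9200)).build()
      client = Some(c)
      c
    }
  }

  // A storage-layer cleanup step would call this once when the app exits.
  def close(): Unit = synchronized {
    client.foreach(_.close())
    client = None
  }
}
```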

Member


This is probably a good opportunity to add.

+1
We could not reuse the RestClient because it is never closed.
I think it is better to add a "close" step for storage at the end of the application, and then reuse the RestClient.


class StorageClient(val config: StorageClientConfig) extends BaseStorageClient
with Logging {
override val prefix = "ES"
Contributor


Open discussion: should we rename this to ES1 (and subsequently all class names in this package)? I'm having this thought mainly because the universal recommender template uses the ES client directly, and it would be nice to have a way to tell which version the client is. Another approach could be modifying Storage to provide a discovery mechanism based on package name. @pferrel you may want to chime in.

@dszeto
Contributor

dszeto commented Jan 20, 2017

If there's no objection, I suggest merging this first and do the other suggested optimizations in a separate JIRA/PR later.

@pferrel
Contributor

pferrel commented Jan 21, 2017

  1. The effect on PIO. IMO ES 1.x should be the default build, requiring no config change anywhere. We should mark ES 1.x as "being deprecated" for a release or two and support 5.x with new config. Over time, ES 5.x becomes the default and ES 1.x is removed entirely, keeping old cruft to a minimum.
  2. The effect on Templates. This seems problematic since few of the Templates are actually maintained by the PIO team. Personally I'm for making some fairly radical changes, but IMO they should be done all at once in a major version change like 0.x to 1.x. Until then we risk most of the Templates becoming dead-end zombies. Too many haven't even been moved to the new namespace.

I agree that in all likelihood only the UR will need a code change to support 5.x, and I am willing to commit to that as long as the other Templates remain untouched, ideally.

My $0.02 worth

@dszeto
Contributor

dszeto commented Jan 24, 2017

I'm going to do some more manual testing. I'll report back, and let's merge this if things look good.

@dszeto
Contributor

dszeto commented Feb 9, 2017

@haginot @marevol @pferrel After putting more thought into this, we should probably rename elasticsearch1 back to elasticsearch, and put ES5 under the elasticsearch5 package name. This is to make sure old configurations don't break. After some deprecation period, we can drop ES1 support.

If that makes sense, please feel free to modify the PR and fix the conflicts. I can also take care of this over the weekend if you are busy.

@marevol
Member

marevol commented Feb 13, 2017

@dszeto Have you started working on this issue? I think I'm available this week, so I'll do it if you haven't started yet.

@dszeto
Contributor

dszeto commented Feb 13, 2017 via email

@marevol
Member

marevol commented Feb 13, 2017

@dszeto Could you replace ELASTICSEARCH1 with ELASTICSEARCH5?
https://github.com/apache/incubator-predictionio/blob/feature/es5/bin/pio-start-all#L37,L38

The feature/es5 branch works in my ES5 environment once the above is fixed.

@dszeto
Contributor

dszeto commented Feb 14, 2017

Thanks @marevol ! @pferrel do you want to push your update on the start and stop scripts?

@pferrel
Contributor

pferrel commented Feb 21, 2017

This may be ok to stage to the feature/es5 branch but IMO should not be merged with master for release yet.

The technique of linking both ES1 and ES5 in and creating an assembly with both is IMO not ideal. The reasoning is that ES refactored es-hadoop-mr and es-spark at some point between the two versions, so if you pull them into a Template (the UR does this) you get duplicate classes. It is not clear yet how to solve this, but if we had two PIO build profiles, one for ES1 and one for ES5, as well as two for any template using the ES Spark integration, this problem would very likely go away and the setup would be cleaner.

Another reason to do this is that pio-env.sh would be simplified to configure only ELASTICSEARCH, not differently named stores.

Most of this is fine, and though it would require a non-trivial change to the UR, I think it is doable and I would commit to supporting it if the above issues can be solved. To summarize, I suggest we:

  1. Create the equivalent of a Maven build profile for ES5, and leave ES1 as the default for now with a warning that ES5 will become the default in some future release.
  2. Leave pio-env.sh with only the ES config for whichever version is installed, under the name ELASTICSEARCH. If the scheme needs to be there for both, that's ok with me.

This will make any Template that uses ES directly just work with ES1 and the default build, and it will leave the Template work of moving to ES5 to the new PIO build profile for ES5.

If this pushes the ES5 build profile to 0.12.0, I personally am ok with that. If it delays 0.11.0, I'm also ok with that.

@pferrel
Contributor

pferrel commented Feb 21, 2017

@dszeto I gave you a link to my merged version. To modify this PR I'd need permission from the PR author to collaborate on the branch, or I can create a new PR. Which do you want?

@marevol are you ok with the suggestions above? I've spent several days trying to get the UR to work with this, even when using ES1, and cannot. This worries me, so I made the suggestions above.

@marevol
Member

marevol commented Feb 22, 2017

To solve this issue, how about using a plugin directory?
In our branch (https://github.com/jpioug/incubator-predictionio), the ES1/ES5 code in data was extracted into data-elasticsearch and data-elasticsearch1 projects. Putting the JAR file into the plugin directory makes it work.
So, the default PIO distribution contains pio-data-elasticsearch5.jar in the plugins directory, and if you want to use ES1, you replace it with pio-data-elasticsearch1.jar and PIO works with ES1.
I think the plugin mechanism can keep the previous behavior for ES1.

@pferrel
Contributor

pferrel commented Feb 22, 2017

The current problem is linking them both in and putting them into the assembly. PIO, including its assemblies, should be built with one or the other, not both. I'm not sure how plugins fix this.

I'm no SBT expert, but this is pretty easy and common to solve with Maven profiles. You would build pulling in certain classes for ES1 and others for ES5.

This means templates will need to be changed if they use any of the things that differ between ES1 and ES5, and I maintain one. This is ok; it seems like the right way to do this.

@marevol
Member

marevol commented Feb 22, 2017

Please see our snapshot build: http://fess.codelibs.org/snapshot/apache-predictionio-0.11.0-v1-SNAPSHOT.zip
Storage implementations, such as HBase, were moved to plugins/data-* projects,
and the JAR files are put into the plugins directory in the distribution.

IMO, the plugin approach is better than a Maven profile.
With plugins, a PIO user can select features within one distribution without building PIO,
and can also add their own storage implementations.

In our build, if you want to use ES1 (I'm working on this now; for ES1, more testing might be needed):

  1. Remove plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
  2. Move extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar to plugins/
  3. Modify conf/pio-env.sh (you can use previous ES1 settings)
  4. Start PIO (That's all!)

Although I did not check it with the UR template, it may work with ES1 with no modification (or just a dependencies update).
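
The swap in steps 1 and 2 above is plain file manipulation. A dry run against a dummy layout (directory names mimic the snapshot build and are illustrative):

```shell
# Simulate the distribution layout, then perform the ES1 swap.
mkdir -p pio-demo/plugins pio-demo/extra
touch pio-demo/plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
touch pio-demo/extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar

# Step 1: remove the default ES5 plugin jar.
rm pio-demo/plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
# Step 2: move the ES1 jar into plugins/.
mv pio-demo/extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar pio-demo/plugins/

ls pio-demo/plugins
# → pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar
```

Steps 3 and 4 (editing conf/pio-env.sh and starting PIO) then proceed as before.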

@pferrel
Contributor

pferrel commented Feb 22, 2017

So you are using the old PIO plugins feature that basically puts any files in plugins onto the SparkSubmit --files param? The plugins are, in effect, JARs outside the PIO assembly that get included when running a job?

This might work since we are not linking in classes from more than one version of ES, but how do you move a JAR that isn't built? How do the JARs get built? It seems like some part of the build process has to change, even if only to create these alternative JARs.

@pferrel
Contributor

pferrel commented Feb 22, 2017

This would also imply that a single assembly is not enough and you'd have to deploy the entire PIO directory structure to any machine that runs it, right? This in turn means it may have implications for running in Yarn "cluster" mode. I'll defer to @dszeto on all this.

@marevol
Member

marevol commented Feb 22, 2017

Although I'm not sure about the old plugin features, in my fix I pass the plugin JAR files with --jars on spark-submit:
jpioug@70d117c#diff-fe08516b0f04bfd5f37f138874840766
To build the plugins, I also modified build.sbt.

@pferrel
Contributor

pferrel commented Feb 23, 2017

@marevol you might want to search for "plugins" in the PIO code; I think PIO may already put those jars in the SparkSubmit params.

This does seem like a better solution than building both into PIO, right @dszeto ?

Can you create a new PR against the feature/es5 branch so we can discuss there?

@marevol
Member

marevol commented Feb 23, 2017

compute-classpath.sh deals with the plugins directory, but the existing code does not seem to put the jars on spark-submit.

If plugin feature looks good, I'll create PR for feature/es5.

@dszeto
Contributor

dszeto commented Feb 24, 2017

@marevol @pferrel I would never have expected the plugins mechanism to be found without documentation. It's a debt that I should pay. :)

The original intention is for implementing event server and prediction server plugins, so that you can put arbitrary filters in front by just dropping in JARs.

Using spark-submit --jars is ideal. I was actually going to propose moving away from submitting a PIO assembly and instead submitting a lib folder of JARs, or even taking advantage of Spark's dependency management (automatic downloads of all other dependencies). If we agree to go down this path, let's follow @marevol 's suggestion and rename plugins to lib instead. What do you think?
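
For reference, the --jars approach amounts to something like the following invocation. The main class and jar paths here are illustrative, not the PR's actual launch command:

```shell
# Hypothetical spark-submit invocation: instead of baking storage code
# into one fat assembly, pass the storage jars via --jars so Spark ships
# them to the driver and executors alongside the main PIO jar.
spark-submit \
  --class org.apache.predictionio.workflow.CreateWorkflow \
  --jars lib/spark/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar \
  assembly/pio-assembly-0.11.0-SNAPSHOT.jar
```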

@pferrel
Contributor

pferrel commented Feb 24, 2017

Well, I like passing in only what is used, so the general idea seems good. But does the use of lib fit conventions? As I recall, lib is used for managed dependencies, but in this case it is more like a loose assembly. If this is not an issue, then +1.

@marevol
Member

marevol commented Feb 24, 2017

+1
As @pferrel commented, I think it may be better to use a directory (e.g. lib/spark) other than lib if we want to add arbitrary JARs to spark-submit.
If we put them all into lib, it is difficult to tell which JARs should be submitted to Spark.

@dszeto
Contributor

dszeto commented Feb 25, 2017

What I meant was the lib folder in the binary distribution tarball after running make-distribution.sh. There isn't really a convention, but I have seen many Apache binary distributions put all the JARs there.

The lib folder in the source code is an SBT convention for unmanaged dependencies.

I agree that we should do something like lib/spark, since some pio commands do not require spark-submit anyway. @marevol can you make a pull request following these ideas, please?

@marevol
Member

marevol commented Feb 26, 2017

Please see #352

  • Move storage implementation to storage directory
  • Put storage assembly JARs into lib/spark in distribution ZIP file
  • To check ES1 support, the default ES JAR in the ZIP is for ES1 (please replace it with ES5 before release)

If you have a better location (than lib/spark and lib/extra), please change them.

@pferrel
Contributor

pferrel commented Feb 27, 2017

@dszeto I meant lib = unmanaged, mistyped.

This still sounds good.

What will populate lib/spark? Maybe a flag in some sbt file, or can a param be passed through pio build? If the latter, we have the equivalent of a Maven build profile, sorta ad hoc but all good. The default for 0.11 is ES1, with a param for ES5.

@marevol this means the params in pio-env.sh do not need to flag ES5; there should be only ELASTICSEARCH, and it applies to whichever version is in lib/spark and installed, right?

@marevol
Member

marevol commented Feb 27, 2017

Right. To use ES5, please replace pio-data-elasticsearch1-assembly-*.jar with pio-data-elasticsearch-assembly-*.jar in the lib/spark directory.

To change the JAR location (e.g. lib/spark), we can modify the JAR lookup handling in compute-classpath.sh and Common.scala for #352.

The default for 0.11 is ES1, with a param for ES5.

ES 1.x has been EOLed.
https://www.elastic.co/jp/support/eol

@asfgit asfgit closed this in ba46e2c Apr 3, 2017
5 participants