This repository has been archived by the owner on May 12, 2021. It is now read-only.

[PIO-49] Add support for Elasticsearch 5.x #336

Closed
wants to merge 5 commits

Conversation

haginot
Member

@haginot haginot commented Jan 16, 2017

PredictionIO / PIO-49

We are working on meta/event storage support for Elasticsearch 5.x.
Elasticsearch 2.x does not allow dots in field names, but Elasticsearch
5.x supports them, so it is worth upgrading to the ES 5.x release.
Since ES 5.x provides a Java REST API client, we replaced the
Transport communication with HTTP. Therefore, our fix
uses HTTP (port 9200) only.
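
As a rough illustration, an HTTP-only ES 5.x source in pio-env.sh might look like the fragment below. The setting names follow the existing PIO_STORAGE_SOURCES_* convention, but the exact keys depend on the final code in this PR, so treat them as an assumption:

```shell
# Hypothetical pio-env.sh fragment: the REST client talks to the
# Elasticsearch HTTP endpoint (9200), not the transport port (9300)
# used by the old native client.
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9200
```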

@ftopan80

+1 pretty please

@marevol
Member

marevol commented Jan 16, 2017

We use the following template.
https://github.com/marevol/incubator-predictionio-template-recommender/tree/0.11.0
(We commented out "data.ratings.take(1)" because elasticsearch-hadoop does not seem to support it.)

@dszeto
Contributor

dszeto commented Jan 17, 2017

This is really great. Thanks a lot @haginot and @marevol !

Officially, ES 1.7 and earlier are deprecated. I will start a thread on user@ and dev@ to gauge whether we need to maintain backward compatibility. If it is not too much effort, it would be nice to have two different sets of configuration and code for ES 1.x and ES 5.x.

@marevol
Member

marevol commented Jan 17, 2017

Thank you for your comment.
We will check whether we can keep the code for ES 1.x.

@dszeto
Contributor

dszeto commented Jan 17, 2017

@marevol One way to do so is to rename the existing ES configuration and code to something like ES1Events, PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE=elasticsearch1, and keep your new code as is. This may require build profiles (#295) though. Please let us know how it goes.

@marevol
Member

marevol commented Jan 18, 2017

Added Elasticsearch 1.x support!
The Elasticsearch 5.x support uses a JAR file for the REST API and does not conflict with the ES 1.x JAR file.
Therefore, both 1.x and 5.x support are available in one distribution.

@pferrel
Contributor

pferrel commented Jan 18, 2017

@marevol Thanks, this looks very promising! How do you configure PIO to use 1.x vs 5.x? Templates may need to look at the config or supply their own.

The UR uses the REST API for 1.x and can easily switch to 5.X so I'd like to coordinate.

Which config flags the Elasticsearch version? Is it PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE?

@marevol
Member

marevol commented Jan 18, 2017

To use the existing code for Elasticsearch 1.x, please use PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH1 and the PIO_STORAGE_SOURCES_ELASTICSEARCH1_* settings.

The UR uses the REST API for 1.x and can easily switch to 5.X so I'd like to coordinate.

The o.a.p.data.storage.elasticsearch.* classes contain index mappings, and these differ between 1.x and 5.x.
So, to use the new 5.x code with 1.x, the mappings would need to be modified.
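
Putting those settings together, a legacy ES 1.x source in pio-env.sh might look like this sketch. Host and port values are illustrative (9300 is the usual transport port for the 1.x native client):

```shell
# Hypothetical pio-env.sh fragment for the legacy ES 1.x backend.
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH1
PIO_STORAGE_SOURCES_ELASTICSEARCH1_TYPE=elasticsearch1
PIO_STORAGE_SOURCES_ELASTICSEARCH1_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH1_PORTS=9300
```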

@dszeto
Contributor

dszeto commented Jan 18, 2017

Reviewing the code right now. It would be nice if we could include a migration guide in the docs if possible.

Contributor

@dszeto dszeto left a comment


The rest is great! Thanks a lot @haginot and @marevol !

"org.clapper" %% "grizzled-slf4j" % "1.0.2",
"org.elasticsearch.client" % "rest" % elasticsearchVersion.value,
"org.elasticsearch" % "elasticsearch" % elasticsearch1Version.value,
"org.elasticsearch" % "elasticsearch-spark-13_2.10" % elasticsearchVersion.value % "provided",
Contributor


We could do

"org.elasticsearch"       %% "elasticsearch-spark-13" % elasticsearchVersion.value % "provided",

We do not need to change this now because PIO-30 will need to be extended to take care of cross-building, and the whole artifact name will need to change.

val typeExistResponse = indices.prepareTypesExists(index).setTypes(estype).get
if (!typeExistResponse.isExists) {
val json =
val restClient = client.open()
Contributor


Should we reuse the same client after it's opened? According to https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_initialization.html the official recommendation is for the REST client to have the same lifecycle as the application. I realize this pattern is used extensively in this PR, so please let us know what you think before you go ahead and change your code.

Right now PredictionIO does not provide a cleanup step in the storage layer when the application shuts down. This is probably a good opportunity to add. Would highly appreciate if you would like to take a stab at that.
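
A minimal sketch of that lifecycle, assuming a hypothetical SharedClient holder (this is not the PR's actual code; it assumes the ES 5.x low-level REST client is on the classpath, and a real PIO cleanup hook would differ):

```scala
// Illustrative only: one REST client per application, closed once at
// shutdown, rather than one client opened per operation.
import org.apache.http.HttpHost
import org.elasticsearch.client.RestClient

object SharedClient {
  @volatile private var client: Option[RestClient] = None

  // Lazily open a single shared client on first use.
  def get(): RestClient = synchronized {
    client.getOrElse {
      val c = RestClient.builder(new HttpHost("localhost", 9200)).build()
      client = Some(c)
      c
    }
  }

  // A storage-layer cleanup step would call this once when the app exits.
  def close(): Unit = synchronized {
    client.foreach(_.close())
    client = None
  }
}
```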

Member


This is probably a good opportunity to add.

+1
We could not reuse the RestClient because it is never closed.
I think it is better to add a "close" step for storage at the end of the application, and then reuse the RestClient.


class StorageClient(val config: StorageClientConfig) extends BaseStorageClient
with Logging {
override val prefix = "ES"
Contributor


Open discussion: should we rename this to ES1 (and subsequently all class names in this package)? I'm having this thought mainly because the universal recommender template uses the ES client directly, and it would be nice to have a way to tell which version the client is. Another approach could be modifying Storage to provide a discovery mechanism based on package name. @pferrel you may want to chime in.

@dszeto
Contributor

dszeto commented Jan 20, 2017

If there's no objection, I suggest merging this first and do the other suggested optimizations in a separate JIRA/PR later.

@pferrel
Contributor

pferrel commented Jan 21, 2017

  1. The effect on PIO. IMO ES 1.x should be the default build, requiring no config change anywhere. We should mark ES 1.x as "being deprecated" for a release or two and support 5.x with new config. Over time, ES 5.x becomes the default and ES 1.x is removed entirely, keeping old cruft to a minimum.
  2. The effect on Templates. This seems problematic since few of the Templates are actually maintained by the PIO team. Personally I'm for making some fairly radical changes, but IMO they should be done all at once in a major version change like 0.x to 1.x. Until then we risk most of the Templates becoming dead-end zombies. Too many haven't even been moved to the new namespace.

I agree that in all likelihood only the UR will need a code change to support 5.x, and I am willing to commit to that as long as the other Templates remain untouched, ideally.

My $0.02 worth

@dszeto
Contributor

dszeto commented Jan 24, 2017

I'm going to do some more manual testing. I'll report back, and let's merge this if things look good.

@dszeto
Contributor

dszeto commented Feb 9, 2017

@haginot @marevol @pferrel After putting more thought into this, we should probably rename elasticsearch1 back to elasticsearch, and put ES5 under the elasticsearch5 package name. This is to make sure old configurations don't break. After some deprecation period, we can drop ES1 support.

If that makes sense, please feel free to modify the PR and fix the conflicts. I can also take care of this over the weekend if you are busy.

@marevol
Member

marevol commented Feb 13, 2017

@dszeto Have you started working on this issue? I think I'm available this week, so I'll do it if you haven't started yet.

@dszeto
Contributor

dszeto commented Feb 13, 2017 via email

@marevol
Member

marevol commented Feb 13, 2017

@dszeto Could you replace ELASTICSEARCH1 with ELASTICSEARCH5?
https://github.com/apache/incubator-predictionio/blob/feature/es5/bin/pio-start-all#L37,L38

The feature/es5 branch works in my ES5 environment once the above is fixed.

@dszeto
Contributor

dszeto commented Feb 14, 2017

Thanks @marevol ! @pferrel do you want to push your update on the start and stop scripts?

@pferrel
Contributor

pferrel commented Feb 21, 2017

This may be ok to stage to the feature/es5 branch but IMO should not be merged with master for release yet.

The technique of linking both ES1 and ES5 in and creating an assembly with both is IMO not ideal. The reasoning is that ES refactored es-hadoop-mr and es-spark at some point between the two versions, so if you pull them into a Template (the UR does this) you get duplicate classes. It is not clear yet how to solve this, but if we had two PIO build profiles, one for ES1 and one for ES5, as well as two for any template using the ES Spark integration, this problem would very likely go away and the setup would be cleaner.

Another reason to do this is that pio-env.sh would be simplified to configure only ELASTICSEARCH, not differently named stores.

Most of this is fine, and though it would require a non-trivial change to the UR, I think it is doable and I would commit to supporting it if the above issues can be solved. To summarize, I suggest we:

  1. Create the equivalent of a Maven build profile for ES5, and leave ES1 as the default for now with a warning that ES5 will become the default in some future release.
  2. Leave pio-env.sh with only the ES config for whichever version is installed, under the name ELASTICSEARCH. If the scheme needs to be there for both, that's ok with me.

This will make any Template that uses ES directly just work with ES1 and the default build, and it will leave the Template work of moving to ES5 to the new PIO build profile for ES5.

If this pushes the ES5 build profile to 0.12.0, I personally am ok with that. If it delays 0.11.0, I'm also ok with that.

@pferrel
Contributor

pferrel commented Feb 21, 2017

@dszeto I gave you a link to my merged version. To modify this PR I'd need permission from the PR author to collaborate on the branch, or I can create a new PR. Which do you want?

@marevol are you ok with the suggestions above? I've spent several days trying to get the UR to work with this, even when using ES1, and cannot. This worries me, so I made the suggestions above.

@marevol
Member

marevol commented Feb 22, 2017

To solve this issue, how about using a plugin directory?
In our branch (https://github.com/jpioug/incubator-predictionio), the ES1/ES5 code in data was extracted into data-elasticsearch and data-elasticsearch1 projects. Putting the JAR file into the plugin directory makes it work.
So, the default PIO distribution contains pio-data-elasticsearch5.jar in the plugins directory, and if you want to use ES1, you replace it with pio-data-elasticsearch1.jar and PIO works with ES1.
I think the plugin mechanism can keep the previous behavior for ES1.

@pferrel
Contributor

pferrel commented Feb 22, 2017

The current problem is linking them both in and putting them into the assembly. PIO, including its assemblies, should be built with one or the other, not both. I'm not sure how plugins fix this.

I'm no SBT expert, but this is pretty easy and common to solve with Maven profiles. You would build pulling in certain classes for ES1 and others for ES5.

This means templates will need to be changed if they use any of the things that differ between ES1 and ES5, and I maintain one. This is ok; it seems like the right way to do this.

@marevol
Member

marevol commented Feb 22, 2017

Please see our snapshot build: http://fess.codelibs.org/snapshot/apache-predictionio-0.11.0-v1-SNAPSHOT.zip
Storage implementations, such as HBase, were moved to plugins/data-* projects,
and the JAR files are put into the plugins directory in the distribution.

IMO, the plugin approach is better than a Maven profile.
With plugins, a PIO user can select features within one distribution without building PIO,
and can also add their own storage implementations.

In our build, if you want to use ES1 (I'm working on this now; for ES1, more testing might be needed):

  1. Remove plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
  2. Move extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar to plugins/
  3. Modify conf/pio-env.sh (you can use previous ES1 settings)
  4. Start PIO (That's all!)

Although I did not check it with the UR template, it may work with ES1 with no modification (or just a dependencies update).
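
The swap in steps 1 and 2 above is plain file manipulation. A dry run against a dummy layout (directory names mimic the snapshot build and are illustrative):

```shell
# Simulate the distribution layout, then perform the ES1 swap.
mkdir -p pio-demo/plugins pio-demo/extra
touch pio-demo/plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
touch pio-demo/extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar

# Step 1: remove the default ES5 plugin jar.
rm pio-demo/plugins/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar
# Step 2: move the ES1 jar into plugins/.
mv pio-demo/extra/pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar pio-demo/plugins/

ls pio-demo/plugins
# → pio-data-elasticsearch1-assembly-0.11.0-v1-SNAPSHOT.jar
```

Steps 3 and 4 (editing conf/pio-env.sh and starting PIO) then proceed as before.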

@pferrel
Contributor

pferrel commented Feb 22, 2017

So you are using the old PIO plugins feature that basically puts any files in plugins onto the SparkSubmit --files param? The plugins are, in effect, JARs outside the PIO assembly that get included when running a job?

This might work since we are not linking in classes from more than one version of ES, but how do you move a JAR that isn't built? How do the JARs get built? It seems like some part of the build process has to change, even if only to create these alternative JARs.

@pferrel
Contributor

pferrel commented Feb 22, 2017

This would also imply that a single assembly is not enough and you'd have to deploy the entire PIO directory structure to any machine that runs it, right? This in turn means it may have implications for running in Yarn "cluster" mode. I'll defer to @dszeto on all this.

@marevol
Member

marevol commented Feb 22, 2017

Although I'm not sure about the old plugin features, in my fix I pass the plugin JAR files with --jars on spark-submit:
jpioug@70d117c#diff-fe08516b0f04bfd5f37f138874840766
To build the plugins, I also modified build.sbt.

@pferrel
Contributor

pferrel commented Feb 23, 2017

@marevol you might want to search for "plugins" in the PIO code; I think PIO may already put those jars in the SparkSubmit params.

This does seem like a better solution than building both into PIO, right @dszeto ?

Can you create a new PR against the feature/es5 branch so we can discuss there?

@marevol
Member

marevol commented Feb 23, 2017

compute-classpath.sh deals with the plugins directory, but the existing code does not seem to put the jars on spark-submit.

If plugin feature looks good, I'll create PR for feature/es5.

@dszeto
Contributor

dszeto commented Feb 24, 2017

@marevol @pferrel I would never have expected the plugins mechanism to be found without documentation. It's a debt that I should pay. :)

The original intention is for implementing event server and prediction server plugins, so that you can put arbitrary filters in front by just dropping in JARs.

Using spark-submit --jars is ideal. I was actually going to propose moving away from submitting a PIO assembly and instead submitting a lib folder of JARs, or even taking advantage of Spark's dependency management (automatic downloads of all other dependencies). If we agree to go down this path, let's follow @marevol 's suggestion and rename plugins to lib instead. What do you think?
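
For reference, the --jars approach amounts to something like the following invocation. The main class and jar paths here are illustrative, not the PR's actual launch command:

```shell
# Hypothetical spark-submit invocation: instead of baking storage code
# into one fat assembly, pass the storage jars via --jars so Spark ships
# them to the driver and executors alongside the main PIO jar.
spark-submit \
  --class org.apache.predictionio.workflow.CreateWorkflow \
  --jars lib/spark/pio-data-elasticsearch-assembly-0.11.0-v1-SNAPSHOT.jar \
  assembly/pio-assembly-0.11.0-SNAPSHOT.jar
```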

@pferrel
Contributor

pferrel commented Feb 24, 2017

Well, I like passing in only what is used, so the general idea seems good. But does the use of lib fit conventions? As I recall, lib is used for managed dependencies, but in this case it is more like a loose assembly. If this is not an issue, then +1.

@marevol
Member

marevol commented Feb 24, 2017

+1
As @pferrel commented, I think it may be better to use a directory (e.g. lib/spark) other than lib if we want to add arbitrary JARs to spark-submit.
If we put them all into lib, it is difficult to tell which JARs should be submitted to Spark.

@dszeto
Contributor

dszeto commented Feb 25, 2017

What I meant was the lib folder in the binary distribution tarball after running make-distribution.sh. There isn't really a convention, but I have seen many Apache binary distributions put all the JARs there.

The lib folder in the source code is an SBT convention for unmanaged dependencies.

I agree that we should do something like lib/spark, since some pio commands do not require spark-submit anyway. @marevol can you make a pull request following these ideas, please?

@marevol
Member

marevol commented Feb 26, 2017

Please see #352

  • Move storage implementation to storage directory
  • Put storage assembly JARs into lib/spark in distribution ZIP file
  • To check ES1 support, the default ES JAR in the ZIP is for ES1 (please replace it with ES5 before release)

If you have a better location (than lib/spark and lib/extra), please change them.

@pferrel
Contributor

pferrel commented Feb 27, 2017

@dszeto I meant lib = unmanaged, mistyped.

This still sounds good.

What will populate lib/spark? Maybe a flag in some sbt file, or can a param be passed through pio build? If the latter, we have the equivalent of a Maven build profile, sorta ad hoc but all good. The default for 0.11 is ES1, with a param for ES5.

@marevol this means the params in pio-env.sh do not need to flag ES5; there should be only ELASTICSEARCH, and it applies to whichever version is in lib/spark and installed, right?

@marevol
Member

marevol commented Feb 27, 2017

Right. To use ES5, please replace pio-data-elasticsearch1-assembly-*.jar with pio-data-elasticsearch-assembly-*.jar in the lib/spark directory.

To change the JAR location (e.g. lib/spark), we can modify the JAR lookup handling in compute-classpath.sh and Common.scala for #352.

The default for 0.11 is ES1, with a param for ES5.

ES 1.x has been EOLed.
https://www.elastic.co/jp/support/eol

@asfgit asfgit closed this in ba46e2c Apr 3, 2017
5 participants