[HUDI-91][HUDI-12] Migrate to Spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types #1005
Conversation
@n3nash: Can you and Modi review this PR and see if it is aligned with what you have done internally?
Hi Udit! Thanks for making this PR! I've been working on upgrading HUDI to Spark 2.4 internally at Uber! So I'll list out a few things that I had to do, so that you're not trying to re-discover these things yourself :)
Btw - I also went to UIUC! Great to meet new Illini!
Hi, any plans to migrate Hudi to Scala 2.12? I don't see any JIRA issue regarding this. I would volunteer after this PR is merged. I already looked around; it seems that the biggest problem will be migrating to
@ezhux I have migrated to spark-streaming-kafka-0-10_2.11 internally at my organisation. Let me know if I can help you in any way. :)
@ezhux https://issues.apache.org/jira/browse/HUDI-238 tracks this, I think. Yes, we want to put up a 2.12 bundle as well. Please engage on the ticket for timelines.
I understand we can't bundle this since it's tied to a Spark version now. spark-avro is still a package and the user must explicitly include it using --packages? So, if I upgrade Hudi in the next release, then as a user I need to change something? Should/how do we document this?
Yeah, this will be an additional jar the user would have to pass while starting the spark-shell. We would have to document it. I don't see any documentation for spark.serializer=org.apache.spark.serializer.KryoSerializer either, which is also a prerequisite, right? Shall we update it in the README?
The KryoSerializer command line is provided on the quickstart page. Adding it there could be a good thing. Do a follow-on update?
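For context, the quickstart prerequisite being discussed amounts to a single config flag; a minimal sketch of how a user would set it when launching a shell (the same setting can also go into spark-defaults.conf):

```shell
# Hudi requires Kryo serialization; pass it when starting the shell.
spark-shell --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```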
@modi95 good to hear from a fellow Illini! About the points you raised:
Yeah, any guidance on this front would be appreciated on how we want to go about getting the tests working. I will look into the docker setup.
Mistakenly added a comment for another PR here. Deleted it.
@umehrot2: If you want to give it a shot, it's better to open a new PR. You would need to update docker.spark.version in the pom files below docker/hadoop/... and also update the Spark version in the Dockerfile in these directories: spark_base, sparkadhoc, sparkworker and sparkmaster. You can then run docker/build_local_docker_images.sh to build new docker images locally and then run integration tests. We would have to push these docker images so that the Travis integration can pick them up (we will help on this). @umehrot2: If you need help, let us know. Either me or @bhasudha can get the docker images built and pushed.
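Condensing the steps above into a rough local sequence (the final integration-test command is an assumption; use whatever the repo's docs prescribe):

```shell
# Rough sketch of the local docker rebuild flow described above.
# 1. Bump docker.spark.version in the pom files under docker/hadoop/...
# 2. Bump the Spark version in the Dockerfile of each of:
#    spark_base, sparkadhoc, sparkworker, sparkmaster
# 3. Rebuild images locally, then run the integration tests:
./docker/build_local_docker_images.sh
mvn verify   # hypothetical; substitute the project's actual integ-test command
```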
@bvaradar thanks for the suggestions. Will give it a shot tomorrow, and reach out in case of any doubts. |
Force-pushed 6f68984 to d195559.
Re-kicked integration tests.
Looks like the integration tests are failing with dependency version mismatches.
@bvaradar yeah, I have been looking into them. Like you mentioned, there are multiple dependency-related issues going on. Working on a solution.
I was able to fix the integration test dependency issues, on my local at least. Hoping that things run fine on Travis too. To give an overview, there were 3 major failures happening:
This is happening because in Hudi, even for bits running through Spark, we are using What I propose here is that we should use a version of Hive that is compatible with Spark, at least for the bits running inside Spark, so that compatible versions of Hive end up in class paths.
By making the above changes, the integration tests work now. Let me know your thoughts on these changes, if there are concerns.
@bvaradar can we re-trigger the tests? I think this time it failed due to flaky timeouts.
@umehrot2: Comparing this run's logs with that of #1009, I can see that somehow the Spark logging level became INFO with this PR (hence so much logging). You can look at the logs corresponding to org.apache.hudi.integ.ITTestHoodieSanity. Both runs use the same docker image (spark-2.4.4). Can you check how the Spark log level became INFO? I didn't dig further than that, but let me know if you need help.
@umehrot2: Regarding your comment about moving to the spark.hive version for hudi-spark-bundle and hudi-utilities-bundle, I am ok as long as hive sync works with hiveserver/HMS 2.x. Regarding realtime queries, if we shade avro in hudi-hadoop-mr-bundle, we would need to follow the same shading policy in the user's jar when they use a custom HoodieRecordPayload. AFAIK, there is no good solution here to resolve this dependency hell. If we go this approach, we would need to document this caveat and make it easy for users to perform this shading (with a boiler-plate pom). @vinothchandar: Any suggestions? @modi95 (cc @n3nash): Please note that there will be similar issues with realtime queries and Spark 2.4.4 when we eventually migrate RT tables to Spark 2.4.4.
On RecordPayload and avro bundling, I expect most people (especially if you have record level indexes, as we plan to) would be happy with the OverwriteWithLatest payload, which is internal to the project. I think it may be fair to simplify things to supporting these default payloads well, while providing a guide for authoring payloads (need a JIRA for documenting such a guide). Longer term, I am wondering if we should move away from Avro as the standard object (I did not want to invent our own object, for the obvious reasons of incurring additional conversion cost and code maintenance) or allow Payload to be written using all or some of Row/Avro/ArrayWritable. I know it's not a clear answer, but it's as concrete as I could get. @umehrot2 is shading needed to avoid conflicts with Hive's avro? How does this fit into this PR?
@umehrot2: Sorry for the back-and-forth on this. Issue 1 (as mentioned in #1005 (comment)) is due to the fat jar hive-exec. @n3nash proposed a solution at Uber which won't require moving to spark-hive. Instead of the test dependency hive-exec, can you try depending on the non-fat version of the jar, called hive-exec-core? Hopefully, we can control the parquet/avro versions getting loaded for the tests.
@bvaradar That's fine, we should take the time and solve it the right way. In Also I don't see any artifact like
@vinothchandar Yes, updating the quickstart page makes sense. Will you guys be doing that?
@umehrot2 Thanks for enumerating your thoughts. Let me add some more context here. Firstly, hive-exec has a classifier. Secondly, there is no support for Spark's fork of Hive (1.2.1.spark.2). This was forked by the Spark community to solve the exact issue of hive jars not bundling the correct dependencies that I described above; read more here: https://issues.apache.org/jira/browse/HIVE-16391?focusedCommentId=16032497&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16032497. Some more changes were then added to the fork which are NOT necessary, according to the comments in the same JIRA. In fact, there is a strong push in the Spark community to move away from this forked version to the regular Hive version; see here: http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html. But I see your point on having the spark modules depend on the spark-hive version; this way it's clear and we don't have to solve this issue ourselves. I have a few hesitations about introducing Spark's forked hive version: a) this means we have 2 hive versions across the project; b) Spark's forked version of hive doesn't have anything more apart from solving the hive-exec jar mess.
Thanks @n3nash for your thoughts. @umehrot2: If it is possible to achieve the Spark 2.4 upgrade cleanly without moving to the spark-hive version, it makes sense to me to retain the native hive version. I think it is better not to get locked down on the spark version of hive. As we are using custom code (non-spark) to do hive syncing, theoretically speaking we may run into some hive issues which would need an upgrade, but as the issue is not seen in Spark, they may be unwilling to patch their hive jars. We can use spark-hive as a last resort if we cannot upgrade to Spark 2.4 any other way :) In that spirit, to your concern related to transitive dependencies in the hudi-spark module: as maven honors dependency ordering, can we list hive-exec (with classifier as "core") in the dependency section before hive-service, and add exclusions in the dependency section for hive-service to exclude hive-exec? I am not sure if this would work but don't have time to try this out myself. Something along the lines of: @umehrot2: If we cannot make it work any other way, I am ok with using spark-hive.
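The ordering-plus-exclusion idea above can be sketched in pom terms as follows; this is a hypothetical fragment (version property and surrounding pom structure assumed), not the exact snippet from the PR:

```xml
<!-- Illustrative sketch: list hive-exec with the "core" classifier first,
     then exclude the fat hive-exec that hive-service pulls in transitively. -->
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Because Maven resolves version conflicts by nearest-first and honors declaration order for equal depth, the core classifier jar should win on the classpath if this works at all.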
What Balaji suggested makes sense to me. The spark.hive version has its own issues, but we can live with it if there is no other way.
What is the progress of this PR? I think not every user uses Spark 2.4. Can we combine the two approaches (databricks-avro and spark-avro) into our own internal implementation, which can be compatible with most versions of Spark 2?
My suggestion is to freeze code by the 15th, test the RC for a week, and cut one by late Jan/early Feb. @leesf is the release manager though, so he can share plans.
My thought is to get the code ready by the 15th to get some buffer, and release at the end of this month.
Ack, working with that deadline in mind.
Force-pushed 768f03d to 7b3d943.
@vinothchandar @bvaradar @n3nash I have updated the PR to now use the
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").master("local[1]").getOrCreate();
JavaSparkContext jssc = new JavaSparkContext(spark.sparkContext());
spark.sparkContext().setLogLevel("WARN");
jssc.setLogLevel("WARN");
I guess just the first one is fine. Will remove the second line.
<include>org.apache.hive:hive-metastore</include>
<include>org.apache.hive:hive-jdbc</include>
<include>com.databricks:spark-avro_2.11</include>
If we bundled org.apache.spark:spark-avro, wouldn't that make life simpler for everyone?
I can give it a shot, but we need to carefully understand the consequences of shading a Spark library inside a jar which is being run on Spark. I remember earlier we had some issue on EMR, but don't have the exact details. Nevertheless, let me try and see if tests pass.
I assume the tests will pass, but I realize what you are saying: the user could be running on a higher Spark version, say, and we would be bundling 2.4.4. Let's just open a JIRA to tackle this usability issue and keep it as-is for now. We can clearly document the need for --packages ... when using spark-submit or spark-shell, and move on.
Yeah that can be one of the problems. Created a JIRA for this issue: https://issues.apache.org/jira/browse/HUDI-516
About documentation of --packages, are you guys going to take care of that?
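The documentation ask above boils down to one launch-time flag; a sketch, with the coordinate shown for Scala 2.11 / Spark 2.4.4 (adjust to the Spark version actually in use):

```shell
# spark-avro is no longer bundled, so it must be supplied explicitly at launch:
spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 ...
```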
<profiles>
<profile>
<id>mr-bundle-shade-avro</id>
@umehrot2 (cc @vinothchandar ) I will get back on this by today EOD.
(cc @n3nash ) Yeah, this would mean that we need to employ the same package relocation in the jar carrying custom record payloads. As discussed in the earlier threads, there is no way around it. @umehrot2 : We would need to document this caveat in Release Notes and add documentation on how to shade it. Can you create a ticket to track this ?
@vinothchandar @bvaradar Yes, this will affect the custom payload implementation on the reader side. But we are anyway going to make some changes in how the payload packages are loaded, so we should be able to absorb this change as part of those considerations.
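For reference, the relocation at issue would look something like the fragment below inside the bundle's maven-shade-plugin configuration; a jar carrying a custom HoodieRecordPayload would need to apply the same pattern. The shaded prefix shown is an assumption for illustration, not necessarily the exact one in the PR:

```xml
<!-- Illustrative maven-shade relocation; the shadedPattern prefix is assumed. -->
<relocations>
  <relocation>
    <pattern>org.apache.avro.</pattern>
    <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
  </relocation>
</relocations>
```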
Force-pushed 07d9f3f to a3fb7d4.
@bvaradar Created a JIRA to track documentation of the Avro shading caveat: https://issues.apache.org/jira/browse/HUDI-519
@umehrot2 typically the author of the PR also does the doc changes, to keep things in sync. Are you able to make the changes in the docs? Mainly, it should be the quickstart, demo, and writing_data pages.
@vinothchandar Glad to see this PR can be merged. Does it mean we finally need to use Spark 2.4 and Avro 1.8 in Hudi 0.5.1?
Yes, it is merged and should be available in 0.5.1.
Sending this PR out early to get feedback. Have not yet looked into what changes are required for tests, but in general these changes have been working for us on AWS EMR without any issues so far. This PR implements the following: