
[HUDI-91][HUDI-12] Migrate to spark 2.4.4, migrate to spark-avro library instead of databricks-avro, add support for Decimal/Date types #1005

Merged
bvaradar merged 1 commit into apache:master from umehrot2:spark-2.4.4-migration
Jan 12, 2020

Conversation

@umehrot2
Contributor

@umehrot2 umehrot2 commented Nov 9, 2019

Sending this PR out early to get feedback. Have not yet looked into what changes are required for tests. But in general these changes have been working for us on AWS EMR without any issues so far. This PR implements the following:

  • Migrates Hudi to Spark 2.4.4
  • Migrates Hudi to use spark-avro module instead of the deprecated databricks-avro
  • Adds support for Decimal/Date types correctly through Hive
  • Timestamp still needs to be supported as it is blocked on https://issues.apache.org/jira/browse/HUDI-83
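To illustrate what Decimal/Date support means at the serialization layer, here is a minimal Python sketch (illustrative only, not Hudi code) of how Avro's logical types represent these values: `date` as days since the Unix epoch, `decimal` as the unscaled integer in two's-complement big-endian bytes, with the scale carried in the schema.

```python
from datetime import date
from decimal import Decimal

def encode_avro_date(d: date) -> int:
    # Avro's 'date' logical type stores days since the Unix epoch as an int.
    return (d - date(1970, 1, 1)).days

def encode_avro_decimal(value: Decimal, scale: int) -> bytes:
    # Avro's 'decimal' logical type stores the unscaled integer as
    # two's-complement big-endian bytes; the scale lives in the schema.
    unscaled = int(value.scaleb(scale))
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, "big", signed=True)

print(encode_avro_date(date(1970, 1, 2)))        # 1
print(encode_avro_decimal(Decimal("12.34"), 2))  # b'\x04\xd2' (unscaled 1234)
```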


@umehrot2 umehrot2 closed this Nov 9, 2019
@umehrot2 umehrot2 reopened this Nov 9, 2019
@bvaradar
Contributor

bvaradar commented Nov 9, 2019

@n3nash : Can you and Modi review this PR and see if it is aligned with what you have done internally?

@modi95
Contributor

modi95 commented Nov 10, 2019

Hi Udit! Thanks for making this PR!

I've been working on upgrading HUDI to Spark 2.4 internally at Uber! So I'll list out a few things that I had to do, so that you're not trying to re-discover these things yourself :)

  1. Some of the create functions in HoodieWrapperFileSystem don't fully work with Parquet 1.10+. See here. We'll need to make sure that all the create functions correctly call wrapOutputStream.
  2. hive-exec is a fat JAR. It might be causing unit test failures, as it may introduce an older version of Avro into the classpath. We're currently trying to figure out how to address this. Let us know if you have any suggestions!
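The wrapping pattern behind the first point can be sketched language-agnostically. This is illustrative Python with made-up names, not Hudi's actual Java API: every `create` overload must route its raw stream through the wrapper, otherwise writes bypass the tracking logic.

```python
import io

class TrackingOutputStream:
    """Counts bytes written; stands in for what wrapOutputStream adds."""
    def __init__(self, inner):
        self.inner = inner
        self.bytes_written = 0

    def write(self, data: bytes) -> int:
        n = self.inner.write(data)
        self.bytes_written += len(data)
        return n

class WrapperFileSystem:
    def create(self, raw):
        # Correct: route through the wrapper. The bug class described above
        # is a create() overload that returns `raw` directly, untracked.
        return TrackingOutputStream(raw)

stream = WrapperFileSystem().create(io.BytesIO())
stream.write(b"hello")
print(stream.bytes_written)  # 5
```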

Btw - I also went to UIUC! Great to meet new Illini!

@ezzz01

ezzz01 commented Nov 10, 2019

Hi,

any plans to migrate Hudi to Scala 2.12? I don't see any Jira issue regarding this.

I would volunteer after this PR is merged. I already looked around; it seems that the biggest problem will be migrating to spark-streaming-kafka-0-10_2.12, since there is no spark-streaming-kafka-0-8 version for Scala 2.12. And the tests in hudi-utilities use the low-level Kafka API that is gone in spark-streaming-kafka-0-10.

@pratyakshsharma
Contributor

@ezhux I have migrated to spark-streaming-kafka-0-10_2.11 internally at my organisation. Let me know if I can help you in any way. :)

@vinothchandar
Member

@ezhux https://issues.apache.org/jira/browse/HUDI-238 tracks this I think.. yes we want to put up a 2.12 bundle as well. Please engage on the ticket for timelines

Member

@vinothchandar vinothchandar left a comment


Change looks good. But we need to get the tests working reliably and think about changes to bundling.. Our docker images are still on 2.3.x. Should we upgrade and push them as well? @bvaradar do you want to shepherd this since you are working closely with @modi95 as well?

Comment thread hudi-spark/pom.xml Outdated
Member


I understand we can't bundle this since it's tied to a Spark version now. spark-avro is still a package and the user must explicitly include it using --packages? So, if I upgrade Hudi in the next release, then as a user I need to change something? Should/how do we document this?

Contributor Author


Yeah, this will be an additional jar the user would have to pass while starting the spark-shell. We would have to document it. I don't see any documentation for spark.serializer=org.apache.spark.serializer.KryoSerializer either, which is also a prerequisite, right? Shall we update it in the README?

Member


The KryoSerializer command line is provided on the quickstart page. Adding it there could be a good thing. Do a follow-on update?
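For reference, the two launch-time requirements discussed in this thread travel together. A small sketch (the package coordinates are an assumption tied to Spark 2.4.4 / Scala 2.11; match them to the release you actually use) that assembles the command line:

```python
def spark_shell_command(spark_avro="org.apache.spark:spark-avro_2.11:2.4.4"):
    # Assembles the launch flags discussed above: the external spark-avro
    # package plus the KryoSerializer prerequisite. The coordinate default
    # is illustrative, not authoritative.
    return " ".join([
        "spark-shell",
        "--packages", spark_avro,
        "--conf", "spark.serializer=org.apache.spark.serializer.KryoSerializer",
    ])

print(spark_shell_command())
```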

@umehrot2
Contributor Author

> Hi Udit! Thanks for making this PR!
>
> I've been working on upgrading HUDI to Spark 2.4 internally at Uber! So I'll list out a few things that I had to do, so that you're not trying to re-discover these things yourself :)
>
> 1. Some of the `create` functions in `HoodieWrapperFileSystem` don't fully work with Parquet 1.10+. See [here](https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-common/src/main/java/org/apache/hudi/common/io/storage/HoodieWrapperFileSystem.java#L146). We'll need to make sure that all the `create` functions correctly call `wrapOutputStream`.
>
> 2. `hive-exec` is a fat JAR. It might be causing unit test failures, as it may introduce an older version of Avro into the classpath. We're currently trying to figure out how to address this. Let us know if you have any suggestions!
>
> Btw - I also went to UIUC! Great to meet new Illini!

@modi95 good to hear from a fellow Illini!

About the points you raised:

  1. I guess I am not quite sure yet where this comes into the picture. Did you run into an actual issue? Is there an easy way to reproduce this?

  2. Yeah, I need to look into the test failures. Not sure of the reasons yet. But the first thing I need to do, I guess, is upgrade the docker images to use Spark 2.4.4 and probably Hive 2.3.6, right?

@umehrot2
Contributor Author

> Change looks good. But we need to get the tests working reliably and think about changes to bundling.. Our docker images are still on 2.3.x. Should we upgrade and push them as well? @bvaradar do you want to shepherd this since you are working closely with @modi95 as well?

Yeah any guidance on this front would be appreciated. On how we want to go about it for getting the tests working. I will look into the docker setup.

@vinothchandar
Member

@umehrot2 I believe we can first push spark 2.4 docker images and get current master working (since com.databricks/spark-avro is shaded, it will work).. @bhasudha or @bvaradar have a handle on this.

@bvaradar
Contributor

Mistakenly added a comment for another PR here. Deleted it.

@bvaradar
Contributor

@umehrot2 : If you want to give it a shot, it's better to open a new PR. You would need to update docker.spark.version in the pom files below docker/hadoop/... and also update the Spark version in the Dockerfile in these directories: spark_base, sparkadhoc, sparkworker and sparkmaster.

You can then run docker/build_local_docker_images.sh to build new docker images locally and then run the integration tests. We would have to push these docker images so that the Travis integration can pick them up (we will help on this).

@umehrot2 : If you need help, let us know. Either me or @bhasudha can get the docker images built and pushed.

@umehrot2
Contributor Author

> @umehrot2 : If you want to give it a shot, it's better to open a new PR. You would need to update docker.spark.version in the pom files below docker/hadoop/... and also update the Spark version in the Dockerfile in these directories: spark_base, sparkadhoc, sparkworker and sparkmaster.
>
> You can then run docker/build_local_docker_images.sh to build new docker images locally and then run the integration tests. We would have to push these docker images so that the Travis integration can pick them up (we will help on this).
>
> @umehrot2 : If you need help, let us know. Either me or @bhasudha can get the docker images built and pushed.

@bvaradar thanks for the suggestions. Will give it a shot tomorrow, and reach out in case of any doubts.

@bvaradar
Contributor

Re-kicked Integration tests

@bvaradar
Contributor

Looks like the integration tests are failing with dependency version mismatches.

@umehrot2
Contributor Author

> Looks like the integration tests are failing with dependency version mismatches.

@bvaradar yeah, have been looking into them. Like you mentioned, there are multiple dependency-related issues going on. Working on a solution.

@umehrot2
Contributor Author

umehrot2 commented Nov 14, 2019

@modi95 @bvaradar

I was able to fix the integration test dependency issues, on my local at least. Hoping that things run fine on Travis too. To give an overview, there were 3 major failures happening:

  1. The ITTestHoodieSanity tests were first failing because of this error:
17:15:31.995 [pool-21-thread-2] ERROR org.apache.hudi.io.HoodieCreateHandle - Error writing record HoodieRecord{key=HoodieKey { recordKey=98ea14b7-b318-4b0b-9f14-0115900a10e0 partitionPath=2016/03/15}, currentLocation='null', newLocation='null'}
java.lang.NoSuchMethodError: org.apache.parquet.io.api.Binary.fromCharSequence(Ljava/lang/CharSequence;)Lorg/apache/parquet/io/api/Binary;
	at org.apache.parquet.avro.AvroWriteSupport.fromAvroString(AvroWriteSupport.java:371) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:346) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121) ~[hive-exec-2.3.1.jar:1.10.1]
	at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:288) ~[hive-exec-2.3.1.jar:1.10.1]
	at org.apache.hudi.io.storage.HoodieParquetWriter.writeAvroWithMetadata(HoodieParquetWriter.java:91) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.hudi.io.HoodieCreateHandle.write(HoodieCreateHandle.java:101) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.hudi.io.HoodieWriteHandle.write(HoodieWriteHandle.java:150) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:142) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38) ~[hudi-spark-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]

This is happening because in Hudi, even for bits running through Spark, we are using Hive 2.3.1, which is not really compatible with Spark. So hive-exec 2.3.1 ends up in the HoodieJavaApp classpath while running the example, and that has its own shaded parquet version which is old and conflicts with parquet 1.10.1.

What I propose here is that we should use a version of Hive that is compatible with Spark, at least for the bits running inside Spark, so that compatible versions of Hive end up in class paths. Now, hive-exec 1.2.1.spark2 does not cause this issue as it does not shade parquet. Also, we have removed Hive shading in master now, so anyway we are dependent on the runtime Hive version, which is Spark's Hive version. So from the code's perspective also, I think it makes sense to depend on Spark's Hive version for the code which is running inside of Spark, to avoid such issues.
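This kind of classpath conflict has a simple analogue in any language with path-ordered module resolution. A purely illustrative Python sketch (module name `binarylib` is made up): an old copy of a module shadows a newer one, so code expecting the newer API fails, which is the moral equivalent of the NoSuchMethodError above.

```python
import os
import sys
import tempfile

# Two "versions" of the same module on the path; whichever directory comes
# first wins, just as the first jar on the classpath wins in the JVM.
root = tempfile.mkdtemp()
old_dir = os.path.join(root, "old")
new_dir = os.path.join(root, "new")
for d in (old_dir, new_dir):
    os.makedirs(d)
with open(os.path.join(old_dir, "binarylib.py"), "w") as f:
    f.write("VERSION = '1.8'\n")  # old version: no from_char_sequence
with open(os.path.join(new_dir, "binarylib.py"), "w") as f:
    f.write("VERSION = '1.10'\ndef from_char_sequence(s):\n    return s.encode()\n")

sys.path.insert(0, new_dir)
sys.path.insert(0, old_dir)  # old copy now shadows the new one

import binarylib

print(binarylib.VERSION)                          # '1.8'
print(hasattr(binarylib, "from_char_sequence"))   # False -> ~NoSuchMethodError
```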

  2. Post that, all the ITTestHoodieSanity _rt tests were failing because, now that our code is using Avro 1.8.2 and Hive is still on older versions, we need to shade Avro in hudi-hadoop-mr-bundle, which we had done internally for EMR through an optional profile. Now that we are migrating Hudi itself to Avro 1.8.2, we need to always apply this shading to get around the issue. More details on https://issues.apache.org/jira/browse/HUDI-268

  3. Finally, some tests were failing because spark-avro was not being passed while starting the spark-shell, and it was not finding the classes. So, I switched over to downloading spark-avro instead of databricks-avro.

By making the above changes, the integration tests work now. Let me know your thoughts about these changes, if there are concerns.

@umehrot2
Contributor Author

@bvaradar can we re-trigger the tests? I think this time it failed due to flaky timeouts.

@bvaradar
Contributor

> @bvaradar can we re-trigger the tests? I think this time it failed due to flaky timeouts.

@umehrot2 : I re-triggered anyway, but looks like the log length reached max limits.

@umehrot2
Contributor Author

> > @bvaradar can we re-trigger the tests? I think this time it failed due to flaky timeouts.
>
> @umehrot2 : I re-triggered anyway, but looks like the log length reached max limits.

@bvaradar you are right, it failed again because of the log length. Any suggestions?

@bvaradar
Contributor

@umehrot2 : Comparing this run's logs with those of #1009, I can see that somehow the Spark logging level became INFO with this PR (hence so much logging). You can look at the logs corresponding to org.apache.hudi.integ.ITTestHoodieSanity. Both runs use the same docker image (spark-2.4.4). Can you check how the Spark log level became INFO? I didn't dig further than that, but let me know if you need help.

@bvaradar
Contributor

@umehrot2 : Regarding your comment about moving to the spark.hive version for hudi-spark-bundle and hudi-utilities-bundle, I am ok as long as hive sync works with hiveserver/HMS 2.x.

Regarding realtime queries, if we shade Avro in hudi-hadoop-mr-bundle, we would need to follow the same shading policy in the user's jar when they use a custom HoodieRecordPayload. AFAIK, there is no good solution here to resolve this dependency hell. If we go with this approach, we would need to document this caveat and make it easy for users to perform this shading (with a boilerplate pom).

@vinothchandar : Any suggestions ?

@modi95 (cc @n3nash) : Please note that there will be similar issues with realtime queries and spark-2.4.4 when we eventually migrate RT tables to Spark 2.4.4

@vinothchandar
Member

On RecordPayload and avro bundling, I expect most people (esp. if you have record level indexes as we plan to) would be happy with the OverwriteWithLatest payload, which is internal to the project? I think it may be fair to simplify things to supporting these default payloads well, while providing a guide for authoring Payloads.. (Need a JIRA for documenting such a guide).. Longer term, I am wondering if we should move away from Avro as the standard object (I did not want to invent our own object for obvious reasons of incurring additional conversion cost and code maintenance) or allow Payload to be written using all or some of Row/Avro/ArrayWritable? I know it's not a clear answer, but it's as concrete as I could get;

@umehrot2 is shading needed to avoid conflicts with Hive's Avro? How does this fit into this PR?
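For context on the payload discussion, the default OverwriteWithLatest behavior boils down to "the record with the higher ordering value wins". An illustrative Python sketch of that semantics (not Hudi's actual Java API; the tie-breaking choice here, incoming wins, is an assumption for the sketch):

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    ordering_val: int
    data: dict

def overwrite_with_latest(existing: Record, incoming: Record) -> Record:
    # Sketch of OverwriteWithLatest semantics: higher ordering value wins;
    # ties go to the incoming record (an assumption for illustration).
    return incoming if incoming.ordering_val >= existing.ordering_val else existing

a = Record("k1", 1, {"v": "old"})
b = Record("k1", 2, {"v": "new"})
print(overwrite_with_latest(a, b).data["v"])  # new
```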

Member

@vinothchandar vinothchandar left a comment


LGTM

Comment thread hudi-spark/pom.xml Outdated
Member


KryoSerializer command line is provided on the quickstart page. Adding it there could be a good thing. Do a follow on update?

@bvaradar
Contributor

@umehrot2 : Sorry for the back-and-forth on this. Issue 1 (as mentioned in #1005 (comment)) is due to the fat jar hive-exec. @n3nash proposed a solution at Uber which won't require moving to spark-hive. Instead of the test dependency hive-exec, can you try depending on the non-fat version of the jar, called hive-exec-core? Hopefully, we can control the parquet/avro versions getting loaded for the tests.

@umehrot2
Contributor Author

> @umehrot2 : Sorry for the back-and-forth on this. Issue 1 (as mentioned in #1005 (comment)) is due to the fat jar hive-exec. @n3nash proposed a solution at Uber which won't require moving to spark-hive. Instead of the test dependency hive-exec, can you try depending on the non-fat version of the jar, called hive-exec-core? Hopefully, we can control the parquet/avro versions getting loaded for the tests.

@bvaradar That's fine, we should take time and solve the right way.

In hudi-utilities it is a test dependency, but not in hudi-spark. hudi-spark depends on hive-service, which is bringing in hive-exec as a transitive dependency. And that is the reason it's ending up in the test classpath. We would have to get rid of hive-service if we were to do that.

Also, I don't see any artifact like hive-exec-core; Maven does not recognize it. But besides that, when we are depending on the runtime Spark's version of Hive, is there any reason why we want to build with Hive 2.x instead? Maybe hudi-hive, since that is an independent module, makes sense to build with Hive 2.x. But for anything running within Spark, why are we inclined to build with a version of Hive not supported by Spark?

@umehrot2
Contributor Author

@vinothchandar Yes, updating the quick start page makes sense. Will you guys be doing that?

@n3nash
Contributor

n3nash commented Nov 19, 2019

@umehrot2 Thanks for enumerating your thoughts. Let me add some more context here.

Firstly, hive-exec has a classifier core that allows you to get a dependency-reduced version of the jar. Although this allows us to work around the fat jar problem, there is another problem with this dependency-reduced version of the jar: it doesn't package some of the required transitive dependencies needed by classes in this jar. There are ways to fix this as well, by including those relocated dependencies directly in Hudi (@modi95 was trying it at Uber).
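The classifier point is easy to trip over, since `hive-exec-core` is not a separate artifactId but `hive-exec` with a `core` classifier. A tiny illustrative parser of Maven coordinate forms (not Maven's own code) showing where the classifier sits:

```python
def parse_gav(coord: str) -> dict:
    # Maven coordinate forms: g:a:v, g:a:packaging:v, g:a:packaging:classifier:v
    parts = coord.split(":")
    if len(parts) == 3:
        group, artifact, version = parts
        packaging, classifier = "jar", None
    elif len(parts) == 4:
        group, artifact, packaging, version = parts
        classifier = None
    elif len(parts) == 5:
        group, artifact, packaging, classifier, version = parts
    else:
        raise ValueError(f"not a Maven coordinate: {coord}")
    return {"group": group, "artifact": artifact, "packaging": packaging,
            "classifier": classifier, "version": version}

# 'core' is a classifier on hive-exec, not part of the artifactId:
print(parse_gav("org.apache.hive:hive-exec:jar:core:2.3.1")["classifier"])  # core
```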

Secondly, there is no support for Spark's fork of Hive (1.2.1.spark2). This was forked by the Spark community to solve the exact issue of Hive jars not bundling the correct dependencies that I described above; read more here: https://issues.apache.org/jira/browse/HIVE-16391?focusedCommentId=16032497&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16032497. Then some more changes were added to the fork which are NOT necessary, according to the comments in the same jira.

In fact, there is a strong need in the spark community to move away from this forked version to the regular hive version. See here : http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Upgrade-built-in-Hive-to-2-3-4-td26153.html.

But I see your point on having the spark-modules depend on the spark-hive version, this way it's clear and we don't have to solve this issue ourselves.

I have a few hesitations about introducing Spark's forked Hive version: a) this means we have 2 Hive versions across the project; b) Spark's forked version of Hive doesn't have anything more apart from solving the hive-exec jar mess.
I'm actually okay with (b). @vinothchandar @bvaradar If you're okay with (a) and don't see any issues, I'm fine with taking this approach. Personally, I don't have much foresight on the side-effects of having different versions across the project in the future.

@bvaradar
Contributor

bvaradar commented Nov 19, 2019

Thanks @n3nash for your thoughts.

@umehrot2 : If it is possible to achieve the Spark 2.4 upgrade cleanly without moving to the spark-hive version, it makes sense to me to retain the native Hive version. I think it is better to not get locked into the Spark version of Hive.

As we are using custom code (non-Spark) to do Hive syncing, theoretically speaking we may run into some Hive issues which would need an upgrade; but as the issue is not seen in Spark, they may be unwilling to patch their Hive jars. We can use spark-hive as a last resort if we cannot upgrade to Spark 2.4 any other way :)

In that spirit, to your concern related to transitive dependencies in the hudi-spark module: as Maven honors dependency ordering, can we list hive-exec (with classifier "core") in the dependency section before hive-service, and add exclusions in the dependency section for hive-service to exclude hive-exec? I am not sure if this would work, but don't have time to try this out myself.

Something along the lines of :

+    <dependency>
+      <groupId>${hive.groupid}</groupId>
+      <artifactId>hive-exec</artifactId>
+      <version>${hive.version}</version>
+      <classifier>core</classifier>
+    </dependency>
+
     <dependency>
       <groupId>${hive.groupid}</groupId>
       <artifactId>hive-service</artifactId>
       <version>${hive.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>${hive.groupid}</groupId>
+          <artifactId>hive-exec</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
     <dependency>

@umehrot2 : If we cannot make it to work any other way, I am ok with using spark-hive.
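The ordering-plus-exclusion idea above hinges on two Maven behaviors: for a given artifact, the declaration encountered first wins, and an exclusion prunes a transitive edge. A toy model of that resolution (illustrative Python, not Maven's actual resolver):

```python
def resolve(dependencies):
    """Toy Maven-like resolution: first declaration of an artifact wins,
    and exclusions stop a transitive dependency from being pulled in."""
    resolved = {}
    for dep in dependencies:
        # Direct declaration: first one encountered wins.
        resolved.setdefault(dep["artifact"], dep)
        for t in dep.get("transitive", []):
            if t["artifact"] in dep.get("exclusions", ()):
                continue  # pruned, like a <exclusion> entry
            resolved.setdefault(t["artifact"], t)
    return resolved

deps = [
    # Declared first, so this hive-exec (classifier "core") wins.
    {"artifact": "hive-exec", "classifier": "core"},
    # hive-service would pull the fat hive-exec transitively; excluded here.
    {"artifact": "hive-service",
     "transitive": [{"artifact": "hive-exec"}],
     "exclusions": ("hive-exec",)},
]

print(resolve(deps)["hive-exec"].get("classifier"))  # core
```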

@vinothchandar
Member

What Balaji suggested makes sense to me. The spark.hive version has its own issues, but we can live with it if there is no other way.

@ezzz01 ezzz01 mentioned this pull request Dec 17, 2019
@cdmikechen
Contributor

Want to know the progress of this PR now. I think not every user uses Spark 2.4. Can we combine the two approaches (databricks-avro and spark-avro) and simplify them into our own internal implementation, which could be compatible with most versions of Spark 2?

@vinothchandar
Member

@umehrot2 are you still driving this? We would like to merge this asap, giving us enough time for the next release to be cut..

cc @leesf

@umehrot2
Contributor Author

umehrot2 commented Jan 7, 2020

> @umehrot2 are you still driving this? We would like to merge this asap, giving us enough time for the next release to be cut..
>
> cc @leesf

@vinothchandar just got back from my time off a couple of days back. Let me catch up on this PR and try to get it merged soon. When are we targeting the next release to be cut?

@vinothchandar
Member

My suggestion is to freeze code by 15th, test the RC for a week and cut one by jan last/feb first week. @leesf is the release manager though.. So he can share plans..

@leesf
Contributor

leesf commented Jan 7, 2020

> > @umehrot2 are you still driving this? We would like to merge this asap, giving us enough time for the next release to be cut..
> > cc @leesf
>
> @vinothchandar just got back from my time off a couple of days back. Let me catch up on this PR and try to get it merged soon. When are we targeting the next release to be cut?

My thought is to get code ready by 15th to get some buffer, and release at the end of this month.

@umehrot2
Contributor Author

umehrot2 commented Jan 8, 2020

Ack working with that deadline in mind.

@umehrot2 umehrot2 force-pushed the spark-2.4.4-migration branch from 768f03d to 7b3d943 Compare January 8, 2020 19:25
@umehrot2
Contributor Author

umehrot2 commented Jan 8, 2020

@vinothchandar @bvaradar @n3nash I have updated the PR to now use hive-exec with the core classifier, to solve the unit test issues that were occurring because of Hive. Removed the usage of spark.hive.version as desired. Let me know if this looks good.

Member

@vinothchandar vinothchandar left a comment


LGTM overall. Left two comments around bundling and shading.. we can wait for one of @bvaradar and @n3nash to chime in and merge if they agree as well

.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").master("local[1]").getOrCreate();
JavaSparkContext jssc = new JavaSparkContext(spark.sparkContext());
spark.sparkContext().setLogLevel("WARN");
jssc.setLogLevel("WARN");
Member


we need both?

Contributor Author


I guess just the first one is fine. Will remove the second line.

<include>org.apache.hive:hive-metastore</include>
<include>org.apache.hive:hive-jdbc</include>

<include>com.databricks:spark-avro_2.11</include>
Member


if we bundled org.apache.spark:spark-avro wouldn't that make life simpler for everyone?

Contributor Author


I can give it a shot, but we need to carefully understand the consequences of shading a Spark library inside a jar which is being run on Spark. I remember earlier we had some issue on EMR, but don't have the exact details. Nevertheless, let me try and see if the tests pass.

Member


I assume the tests will pass.. but I realize what you are saying.. the user could be running on a higher Spark version, say, and we would be bundling 2.4.4. Let's just open a JIRA to tackle this usability issue and keep it as-is for now.. We can document the need for --packages ... when using spark-submit or spark-shell clearly for now.. and move on.

Contributor Author


Yeah that can be one of the problems. Created a JIRA for this issue: https://issues.apache.org/jira/browse/HUDI-516

About documentation of --packages, are you guys going to take care of that?


<profiles>
<profile>
<id>mr-bundle-shade-avro</id>
Member


@bvaradar @n3nash : calling this to your attention. Is removing the profile going to affect the custom payload implementations at Uber?

Contributor


@umehrot2 (cc @vinothchandar ) I will get back on this by today EOD.

Contributor


(cc @n3nash ) Yeah, this would mean that we need to employ the same package relocation in the jar carrying custom record payloads. As discussed in the earlier threads, there is no way around it. @umehrot2 : We would need to document this caveat in the Release Notes and add documentation on how to shade it. Can you create a ticket to track this?

Contributor


@vinothchandar @bvaradar Yes, this will affect the custom payload implementation on the reader side. But we are anyways going to make some changes in how the payload packages are loaded so we should be able to absorb this change as part of those considerations.

Contributor

@bvaradar bvaradar left a comment


@umehrot2 : Looks good otherwise. Once you update the LICENSE file, we can merge. While you are at it, can you also squash your commits? If not, we will do it when merging.

Comment thread LICENSE


…bricks-avro, add support for Decimal/Date types
@umehrot2 umehrot2 force-pushed the spark-2.4.4-migration branch from 07d9f3f to a3fb7d4 Compare January 10, 2020 22:33
@umehrot2
Contributor Author

@bvaradar Created a JIRA to track documentation of the Avro shading caveat: https://issues.apache.org/jira/browse/HUDI-519

@n3nash n3nash self-requested a review January 11, 2020 00:27
Contributor

@n3nash n3nash left a comment


LGTM

@vinothchandar
Member

> About documentation of --packages, are you guys going to take care of that?

@umehrot2 typically the author of the PR also does the doc changes to keep things in sync.. Are you able to make the changes in the docs? Mainly, it should be the quickstart, demo, and writing_data pages.

@bvaradar bvaradar merged commit ad50008 into apache:master Jan 12, 2020
@cdmikechen
Contributor

cdmikechen commented Jan 13, 2020

@vinothchandar Glad to see this PR merged. Does it mean we finally need to use Spark 2.4 and Avro 1.8 in Hudi 0.5.1?
I think I should close my PR about spark-avro then.

@bvaradar
Contributor

> @vinothchandar Glad to see this PR merged. Does it mean we finally need to use Spark 2.4 and Avro 1.8 in Hudi 0.5.1?
> I think I should close my PR about spark-avro then.

Yes, it is merged and should be available in 0.5.1.

@cdmikechen
Contributor

@bvaradar
OK~ I will close my PR #770 right now.
The problem with the timestamp type can maybe be discussed and solved in another JIRA in a later version.


9 participants