
[SPARK-17920][SPARK-19580][SPARK-19878][SQL] Support writing to Hive table which uses Avro schema url 'avro.schema.url' #19779

Closed
wants to merge 1 commit

Conversation

vinodkc
Contributor

@vinodkc vinodkc commented Nov 18, 2017

What changes were proposed in this pull request?

SPARK-19580 Support for avro.schema.url while writing to hive table
SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

Support writing to Hive table which uses Avro schema url 'avro.schema.url'
For example:
create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException

WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)

Changes proposed in this fix

Currently a null value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.
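The failure mode can be illustrated with a small self-contained sketch. This is plain Java, not Hive's actual classes: `Conf` and `MiniAvroSerDe` are hypothetical stand-ins for Hadoop's `Configuration` and Hive's `AvroSerDe`, reduced to the one behavior that matters here, namely that `initialize` dereferences the configuration before it ever reads the schema URL.

```java
import java.util.Map;

// Stand-in for a Hadoop Configuration (illustrative, not Hadoop's class).
class Conf {
    final Map<String, String> props;
    Conf(Map<String, String> props) { this.props = props; }
}

// Stand-in for Hive's AvroSerDe: initialize() dereferences the configuration,
// mirroring FileSystem.get(conf) in the real stack trace above.
class MiniAvroSerDe {
    private String schemaLocation;

    void initialize(Conf conf, Map<String, String> tableProps) {
        // A null conf fails here, before 'avro.schema.url' is ever resolved.
        String fsRoot = conf.props.getOrDefault("fs.defaultFS", "file:///");
        schemaLocation = fsRoot + tableProps.get("avro.schema.url");
    }

    String resolvedSchema() { return schemaLocation; }
}

public class NullConfDemo {
    public static void main(String[] args) {
        MiniAvroSerDe serde = new MiniAvroSerDe();
        Map<String, String> tableProps = Map.of("avro.schema.url", "/avro-schema/avro.avsc");

        try {
            serde.initialize(null, tableProps);    // before the fix: null configuration
        } catch (NullPointerException e) {
            System.out.println("NPE with null configuration");
        }

        // After the fix: pass the job's Hadoop configuration through.
        serde.initialize(new Conf(Map.of("fs.defaultFS", "hdfs://nn:8020")), tableProps);
        System.out.println(serde.resolvedSchema());
    }
}
```

The actual patch applies the same idea inside `HiveFileFormat`: hand the serializer the `JobConf` that is already available in the writer factory rather than `null`.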

How was this patch tested?

Added a new test case in VersionsSuite.

@SparkQA

SparkQA commented Nov 18, 2017

Test build #83984 has finished for PR 19779 at commit 034b246.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 18, 2017

Test build #83985 has finished for PR 19779 at commit a59bd09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -89,6 +90,8 @@ class HiveFileFormat(fileSinkConf: FileSinkDesc)
val fileSinkConfSer = fileSinkConf
new OutputWriterFactory {
private val jobConf = new SerializableJobConf(new JobConf(conf))
private val broadcastHadoopConf = sparkSession.sparkContext.broadcast(

Is it possible to use jobConf as hive serde initialize param directly?

Contributor Author

Thanks for the comment; I'll change the code to use jobConf.

@SparkQA

SparkQA commented Nov 20, 2017

Test build #84011 has finished for PR 19779 at commit 9beb53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @vinodkc .
Thank you for taking a look at these issues. Since this PR contains multiple issues, could you add the following to the PR description for reviewers and the commit log?

  • SPARK-19580 Support for avro.schema.url while writing to hive table
  • SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
  • SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

@@ -841,6 +841,76 @@ class VersionsSuite extends SparkFunSuite with Logging {
}
}

test(s"$version: Insert into/overwrite external avro table") {
Member

Is there a reason to have this in VersionSuite?

Member

Could you add SPARK-19878 in a test case name?

Member

I'm fine with keeping it in VersionsSuite, since it is related to Hive.

""".stripMargin
)
versionSpark.sql(
s"""insert overwrite table $destTableName select * from $srcTableName""".stripMargin)
Member

nit.

INSERT OVERWRITE TABLE $destTableName SELECT * FROM $srcTableName

Contributor Author

Thank you for your review comments; I'll address them all.

assert(versionSpark.table(destTableName).count() ===
versionSpark.table(srcTableName).count())
versionSpark.sql(
s"""insert into table $destTableName select * from $srcTableName""".stripMargin)
Member

ditto.

versionSpark.sql(
s"""insert into table $destTableName select * from $srcTableName""".stripMargin)
assert(versionSpark.table(destTableName).count()/2 ===
versionSpark.table(srcTableName).count())
Member

If possible, can we check values instead of count?

@dongjoon-hyun
Member

Ping @cloud-fan and @gatorsmile .

@gatorsmile
Member

The fix looks good to me. You can address the comments left by @dongjoon-hyun.

@@ -800,7 +800,7 @@ class VersionsSuite extends SparkFunSuite with Logging {
}
}

test(s"$version: read avro file containing decimal") {
test(s"$version: SPARK-17920: read avro file containing decimal") {
Contributor

do you mean SPARK-17920 is already fixed because this test passes?

Contributor Author

@vinodkc vinodkc Nov 21, 2017

@cloud-fan, sorry, I updated the wrong test case; I'll change it.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84065 has finished for PR 19779 at commit f92f44c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vinodkc vinodkc force-pushed the br_Fix_SPARK-17920 branch 2 times, most recently from d09211f to b7d7a3c Compare November 21, 2017 14:22
@SparkQA

SparkQA commented Nov 21, 2017

Test build #84069 has finished for PR 19779 at commit d09211f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84071 has finished for PR 19779 at commit b7d7a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test(s"$version: SPARK-17920: Insert into/overwrite external avro table") {
withTempDir { dir =>
val path = dir.getAbsolutePath
val schemaPath = s"""$path${File.separator}avroschemadir"""
Contributor

@cloud-fan cloud-fan Nov 21, 2017

nit:

val schemaFile = new File(dir, "avroDecimal.avsc")
val writer = new PrintWriter(schemaFile)
writer.write(avroSchema)
writer.close()
...

|STORED AS
| INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
| OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
|LOCATION '$destLocation'
Contributor

do we have to provide a location for an empty table?

Contributor Author

Will change to 'CREATE EXTERNAL TABLE'

Contributor

Is this bug for external tables only? How about managed tables?

Contributor Author

@vinodkc vinodkc Nov 22, 2017

@cloud-fan, this bug affects both external and managed tables.
I've added a new test case for managed tables too. However, to avoid code duplication, should I include both tests in the same test method? Please suggest.

Contributor

We can just test the managed table, to avoid creating a temp directory for the external table.

Contributor Author

Thanks, I've updated the test case to test only managed tables and avoided creating a temp directory.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84076 has finished for PR 19779 at commit 68ee79d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2017

Test build #84081 has finished for PR 19779 at commit a2560b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84090 has finished for PR 19779 at commit 083e1b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""".stripMargin
)
versionSpark.sql(
s"""INSERT OVERWRITE TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin)
Member

stripMargin is useless

Contributor Author

Sure, I'll update it

val result = versionSpark.table(srcTableName).collect()
assert(versionSpark.table(destTableName).collect() === result)
versionSpark.sql(
s"""INSERT INTO TABLE $destTableName SELECT * FROM $srcTableName""".stripMargin)
Member

The same here.

Contributor Author

Updated

@gatorsmile
Member

LGTM

@gatorsmile
Member

cc @felixcheung. This sounds critical for Spark 2.2 too.

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84092 has finished for PR 19779 at commit 51999d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@vinodkc Could you submit a separate PR for backporting it to 2.2?

@asfgit asfgit closed this in e0d7665 Nov 22, 2017
@vinodkc
Contributor Author

vinodkc commented Nov 22, 2017

@gatorsmile , @cloud-fan and @dongjoon-hyun
Thanks for the review comments and guidance
Sure, I'll submit a separate PR for backporting it to 2.2

@SparkQA

SparkQA commented Nov 22, 2017

Test build #84094 has finished for PR 19779 at commit e3651ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Nov 22, 2017
…nch-2.2 - Support writing to Hive table which uses Avro schema url 'avro.schema.url'

## What changes were proposed in this pull request?

> Backport #19779 to branch-2.2

SPARK-19580 Support for avro.schema.url while writing to hive table
SPARK-19878 Add hive configuration when initialize hive serde in InsertIntoHiveTable.scala
SPARK-17920 HiveWriterContainer passes null configuration to serde.initialize, causing NullPointerException in AvroSerde when using avro.schema.url

Support writing to Hive table which uses Avro schema url 'avro.schema.url'
For example:
create external table avro_in (a string) stored as avro location '/avro-in/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

create external table avro_out (a string) stored as avro location '/avro-out/' tblproperties ('avro.schema.url'='/avro-schema/avro.avsc');

insert overwrite table avro_out select * from avro_in; // fails with java.lang.NullPointerException

WARN AvroSerDe: Encountered exception determining schema. Returning signal schema to indicate problem
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:182)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:174)
## Changes proposed in this fix

Currently a null value is passed to the serializer, which causes an NPE during the insert operation; instead, pass the Hadoop configuration object.

## How was this patch tested?

Added a new test case in VersionsSuite.

Author: vinodkc <vinod.kc.in@gmail.com>

Closes #19795 from vinodkc/br_Fix_SPARK-17920_branch-2.2.
asfgit pushed a commit that referenced this pull request Nov 23, 2017
## What changes were proposed in this pull request?

a followup of #19779 , to simplify the file creation.

## How was this patch tested?

test only change

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19799 from cloud-fan/minor.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
@vinodkc vinodkc deleted the br_Fix_SPARK-17920 branch May 25, 2021 07:57