[SPARK-24768][SQL] Have a built-in AVRO data source implementation #21742
Conversation
.avro(AvroFileGenerator.outputDir)
  .select("string")
  .count()
val endTime = System.nanoTime
Could you use the Benchmark API for the benchmarks:
private[spark] class Benchmark(
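For illustration, a minimal sketch of what that could look like, assuming the org.apache.spark.util.Benchmark utility of that era; spark, outputDir, and the row count are placeholders, not names from this PR:

import org.apache.spark.util.Benchmark

// Sketch only: numRows drives the Rate(M/s) and Per Row(ns) columns in the report.
val numRows = 1000000L
val benchmark = new Benchmark("Avro read", numRows)
benchmark.addCase("select one string column and count") { _ =>
  spark.read.format("avro").load(outputDir).select("string").count()
}
benchmark.run()  // prints Best/Avg Time(ms), Rate(M/s), Per Row(ns), Relative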
This PR is for the initial import. I have created a sub-task:
https://issues.apache.org/jira/browse/SPARK-24777
In that case, shall we add the whole benchmark separately?
makes sense to me.
val endTime = System.nanoTime
val executionTime = TimeUnit.SECONDS.convert(endTime - startTime, TimeUnit.NANOSECONDS)

println(s"\n\n\nFinished benchmark test - result was $executionTime seconds\n\n\n")
Could you leave the benchmark results here, like in JsonBenchmarks for example:
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmarks.scala
Lines 76 to 79 in bd14da6
JSON schema inferring:               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
No encoding                              38902 / 39282          2.6         389.0       1.0X
UTF-8 is set                             56959 / 57261          1.8         569.6       0.7X
This PR is for the initial import. I have created a sub-task:
https://issues.apache.org/jira/browse/SPARK-24777
"is empty. First you should generate some files to run a benchmark with (see README)") | ||
} | ||
|
||
val spark = SparkSession.builder().master("local[2]").appName("AvroReadBenchmark") |
I don't think it makes sense to read from 2 tasks in parallel. That just introduces unnecessary fluctuations in the results. I would do the measurements in one task.
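A sketch of the suggested change, reusing the names from the snippet above:

// Single task: avoids scheduling noise in the measurements.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("AvroReadBenchmark")
  .getOrCreate()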
 * Adds a method, `avro`, to DataFrameReader that allows you to read avro files using
 * the DataFileReade
 */
implicit class AvroDataFrameReader(reader: DataFrameReader) {
I think it is better to extend DataFrameReader with a new method avro, as for the other supported data sources.
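For illustration, a hypothetical sketch of that shape, mirroring how orc()/csv() are defined on DataFrameReader (not code from this PR):

// Hypothetical method on DataFrameReader itself, following the orc()/csv() pattern:
def avro(path: String): DataFrame = format("avro").load(path)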
Test build #92824 has finished for PR 21742 at commit
/**
 * Adds a method, `avro`, to DataFrameReader that allows you to read avro files using
 * the DataFileReade
typo: DataFileReade
 * the DataFileWriter
 */
implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
  def avro: String => Unit = writer.format("avro").save
Does this mean avro must be at the end of the call chain of DataFrameWriter?
Yes, this is the same as orc, csv, etc.
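A usage sketch of the implicit; the import path and df are assumed here, not taken from this thread:

import org.apache.spark.sql.avro._  // assumed import path for the implicits

// avro(...) is format("avro").save, so it both sets the format and triggers
// the write, which is why it has to terminate the call chain:
df.write.mode("overwrite").avro("/tmp/avro-out")
// equivalent to: df.write.mode("overwrite").format("avro").save("/tmp/avro-out")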
But it doesn't support Java and Python though.
I think in most cases, users will directly use df.write.format("avro"), which should be good enough.
Yup, in that case, I think we wouldn't even need this on the Scala side.
I am not sure about this. This has existed in spark-avro for a long time, and it should be used by some users. I can't see any downside of keeping it.
In that case, can we move this into DataFrameReader and DataFrameWriter for Python and Java usage too?
I think this was just a workaround to resemble Spark 2.0.0's API shape. spark-avro as a third party would probably keep source and binary compatibility, but I don't think we keep them here, although we will probably keep the behaviours. So, I think it's better to minimise the exposed APIs when we are in doubt.
On the other hand, I can't see any particular advantage of keeping it as an implicit here.
For instance, CSV didn't bring in all of its other APIs at the initial implementation, such as its parser API: https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala
I'd remove the Scala API.
/**
 * This object runs a simple benchmark test on the avro files in benchmarkFilesDir. It measures
 * how long does it take to convert them into DataFrame and run count() method on them. See
 * README on how to invoke it.
ditto.
/**
 * This object allows you to generate large avro files that can be used for speed benchmarking.
 * See README on how to use it.
Is this README located in the original spark-avro project? Shall we copy it here?
 * Adds a method, `avro`, to DataFrameWriter that allows you to write avro files using
 * the DataFileWriter
 */
implicit class AvroDataFrameWriter[T](writer: DataFrameWriter[T]) {
I think this was just ported as-is from spark-avro. Shall we expose these in Spark's DataFrameReader and Writer?
I am not sure about this. If we put the package under /external, I don't think we should expose it in DataFrameReader and Writer.
I think we could do that separately, though, like we did in CSV.
@@ -170,6 +170,16 @@ def __hash__(self):
    ]
)

avro = Module(
Why does this need a separate module, unlike other data sources?
This is much cleaner, like what we did for kafka, which is also a built-in data source. Ideally, we should separate parquet, orc, and the other built-in data sources from the sql module. We can do the refactoring in the future, if needed.
  dataFileWriter.append(avroRec)
}

dataFileWriter.close()
nit: can we put this in finally?
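i.e., roughly (a sketch; avroRecords stands in for whatever the loop above appends):

try {
  avroRecords.foreach { rec => dataFileWriter.append(rec) }
} finally {
  dataFileWriter.close()  // close the writer even if append throws
}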
  write(internalRowConverter(row))
}

// api in spark 2.0 - 2.1
This will probably be new to Spark 2.4.0. Do we need this?
@@ -0,0 +1,35 @@
{
Seems we missed deleting it.
Nice catch!
Test build #92848 has finished for PR 21742 at commit
Test build #92854 has finished for PR 21742 at commit
since we decided to follow sql-kafka here, I think we should not add
}.getRecordWriter(context)

def write(internalRow: InternalRow): Unit = {
nit: add override
Test build #92937 has finished for PR 21742 at commit
LGTM. Thanks! Merged to master. @gengliangwang Please submit the follow-up PRs to resolve the sub-tasks and the good comments from the PR review.
…roDataFrameReader
As per Reynold's comment: apache#21742 (comment)
It makes sense to remove the implicit classes AvroDataFrameWriter/AvroDataFrameReader, since the Avro package is an external module.
Unit test
Author: Gengliang Wang <gengliang.wang@databricks.com>
Closes apache#21841 from gengliangwang/removeImplicit.
(cherry picked from commit f59de52)
What changes were proposed in this pull request?
Apache Avro (https://avro.apache.org) is a popular data serialization format. It is widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. Using the external package https://github.com/databricks/spark-avro, Spark SQL can read and write Avro data. Making spark-avro built-in can provide a better experience for first-time users of Spark SQL and Structured Streaming. We expect the built-in Avro data source to further improve the adoption of Structured Streaming.
The proposal is to inline the code from the spark-avro package (https://github.com/databricks/spark-avro). The target release is Spark 2.4.
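With the data source built in, reading and writing Avro needs no external package (a usage sketch; paths are placeholders):

val df = spark.read.format("avro").load("/path/to/input")
df.write.format("avro").save("/path/to/output")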
Built-in AVRO Data Source In Spark 2.4.pdf
How was this patch tested?
Unit test