
[SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON #16895

Closed
wants to merge 6 commits into apache:master from pjfanning:SPARK-15615

Conversation

pjfanning
Contributor

What changes were proposed in this pull request?

SPARK-15615 proposes replacing sqlContext.read.json(rdd) with a Dataset-based equivalent.
SPARK-15463 added a CSV API for reading from Dataset[String], so this keeps the API consistent.
I am deprecating the existing RDD-based APIs.
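
For illustration, a minimal sketch of the two call styles on Spark 2.x (the session setup and sample data here are assumed for the example, not part of the PR):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("json-api-sketch").getOrCreate()
import spark.implicits._

// New API from this PR: read JSON from a Dataset[String]
val jsonLines: Dataset[String] = Seq("""{"a":123}""").toDS()
val df = spark.read.json(jsonLines)

// Old RDD-based API, now deprecated
val rdd = spark.sparkContext.parallelize(Seq("""{"a":123}"""))
val dfOld = spark.read.json(rdd)
```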

How was this patch tested?

There are existing tests. I left most tests using the existing APIs, since those now delegate to the new json API.

Please review http://spark.apache.org/contributing.html before opening a pull request.

…storing JSON, deprecating existing RDD APIs
def json(jsonRDD: RDD[String]): DataFrame = {
  import sparkSession.sqlContext.implicits._
  // The deprecated RDD overload now delegates to the new Dataset[String] overload
  json(sparkSession.createDataset(jsonRDD))
}
Contributor

nit: sparkSession.createDataset(jsonRDD)(Encoders.STRING)
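
For context, a runnable sketch contrasting the two ways to build the Dataset[String]; the object and value names are illustrative, not the PR's code:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

object EncoderNitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val jsonRDD = spark.sparkContext.parallelize(Seq("""{"a":123}"""))

    // As in the diff: the String encoder arrives via the implicits import
    import spark.implicits._
    val viaImplicits: Dataset[String] = spark.createDataset(jsonRDD)

    // As suggested in the nit: pass the encoder explicitly, no import needed
    val viaEncoder: Dataset[String] = spark.createDataset(jsonRDD)(Encoders.STRING)

    spark.stop()
  }
}
```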

conf.asInstanceOf[JobConf],
classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text]).map(_._2.toString) // get the text line
import sparkSession.sqlContext.implicits._
sparkSession.createDataset(rdd)
Contributor

same here
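
To make the fragment above concrete, a hedged sketch of reading text lines through Hadoop's old TextInputFormat API and wrapping them in a Dataset[String]. The helper name is hypothetical, and the JobConf is assumed to already carry its input paths; only the shape of the call mirrors the diff:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextInputFormat}
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical helper: conf is assumed to have input paths set
// (e.g. via FileInputFormat.setInputPaths) before this is called.
def readTextLines(spark: SparkSession, conf: JobConf): Dataset[String] = {
  val rdd = spark.sparkContext.hadoopRDD(
    conf,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]).map(_._2.toString) // keep the line text, drop the byte offset
  // Explicit encoder, matching the reviewer's suggestion above
  spark.createDataset(rdd)(Encoders.STRING)
}
```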

def dataset(rdd: RDD[String]): Dataset[String] = {
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  spark.createDataset(rdd)
}
Contributor

same here

// This is really a test that it doesn't throw an exception
- val emptySchema = JsonInferSchema.infer(empty, "", new JSONOptions(Map.empty[String, String]))
+ val emptySchema = JsonInferSchema.infer(dataset(empty), "", new JSONOptions(Map.empty[String, String]))
Contributor

I think we can just write empty.toDS

@@ -231,4 +231,10 @@ private[json] trait TestJsonData {
lazy val singleRow: RDD[String] = spark.sparkContext.parallelize("""{"a":123}""" :: Nil)

def empty: RDD[String] = spark.sparkContext.parallelize(Seq[String]())

def dataset(rdd: RDD[String]): Dataset[String] = {
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  spark.createDataset(rdd)
}
Contributor

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72747 has finished for PR 16895 at commit 3c477c1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72749 has finished for PR 16895 at commit bb304de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2017

Test build #72960 has finished for PR 16895 at commit 580b4e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class TransportChannelHandler extends ChannelInboundHandlerAdapter
  • class LinearSVCWrapperWriter(instance: LinearSVCWrapper) extends MLWriter
  • class LinearSVCWrapperReader extends MLReader[LinearSVCWrapper]
  • class NoSuchDatabaseException(val db: String) extends AnalysisException(s"Database '$db' not found")
  • class ResolveBroadcastHints(conf: CatalystConf) extends Rule[LogicalPlan]
  • case class JsonToStruct(
  • case class StructToJson(
  • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode
  • case class InnerOuterEstimation(conf: CatalystConf, join: Join) extends Logging
  • case class LeftSemiAntiEstimation(conf: CatalystConf, join: Join)
  • case class NumericRange(min: JDecimal, max: JDecimal) extends Range
  • class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging

// This is really a test that it doesn't throw an exception
val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
Contributor

doesn't empty.toDS work?

Contributor Author

I can double-check, but the toDS call appears to require the Spark implicits import.

Contributor Author
@pjfanning pjfanning Feb 16, 2017

RDD only gains a toDS() method when SQLImplicits applies an implicit conversion that wraps the RDD in a DatasetHolder, which requires:
import sparkSession.sqlContext.implicits._

I switched to the spark.createDataset approach based on your first comment on 3c477c1
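
A small runnable sketch of the mechanics being described (object and value names assumed for the example):

```scala
import org.apache.spark.sql.SparkSession

object ToDSSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val empty = spark.sparkContext.parallelize(Seq.empty[String])

    // empty.toDS() only compiles after this import: SQLImplicits provides
    // rddToDatasetHolder, which wraps the RDD in a DatasetHolder exposing toDS().
    import spark.implicits._
    val ds = empty.toDS()
    assert(ds.count() == 0)

    spark.stop()
  }
}
```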

Contributor

I think the implicits are already imported at the beginning of this test suite

@SparkQA

SparkQA commented Feb 16, 2017

Test build #73015 has finished for PR 16895 at commit 731951a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, I'll merge it in 1 or 2 days if no one is against this API change.

pj.fanning added 2 commits February 18, 2017 01:48
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@SparkQA

SparkQA commented Feb 18, 2017

Test build #73086 has finished for PR 16895 at commit 82561c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LSHParams(Params):
  • class LSHModel(JavaModel):
  • class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
  • class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable):
  • class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
  • class MinHashLSHModel(LSHModel, JavaMLReadable, JavaMLWritable):
  • case class StreamingExplainCommand(
  • case class SaveIntoDataSourceCommand(
  • abstract class JsonDataSource[T] extends Serializable

@SparkQA

SparkQA commented Feb 18, 2017

Test build #73087 has finished for PR 16895 at commit cdf53bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in d314750 Feb 23, 2017
@pjfanning pjfanning deleted the SPARK-15615 branch February 25, 2017 20:50
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
…storing JSON

## What changes were proposed in this pull request?

SPARK-15615 proposes replacing sqlContext.read.json(rdd) with a Dataset-based equivalent.
SPARK-15463 added a CSV API for reading from Dataset[String], so this keeps the API consistent.
I am deprecating the existing RDD-based APIs.

## How was this patch tested?

There are existing tests. I left most tests using the existing APIs, since those now delegate to the new json API.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: pj.fanning <pj.fanning@workday.com>
Author: PJ Fanning <pjfanning@users.noreply.github.com>

Closes apache#16895 from pjfanning/SPARK-15615.