
[SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON #16895

Closed
wants to merge 6 commits into apache:master from pjfanning:SPARK-15615

Conversation

pjfanning
Contributor

What changes were proposed in this pull request?

SPARK-15615 proposes replacing sqlContext.read.json(rdd) with a Dataset-based equivalent.
SPARK-15463 added a CSV API for reading from Dataset[String], so this keeps the API consistent.
I am deprecating the existing RDD-based APIs.
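
For illustration, a minimal sketch of the two call styles on Spark 2.x (the session setup and sample data here are assumed for the example, not part of the PR):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("json-api-sketch").getOrCreate()
import spark.implicits._

// New API from this PR: read JSON from a Dataset[String]
val jsonLines: Dataset[String] = Seq("""{"a":123}""").toDS()
val df = spark.read.json(jsonLines)

// Old RDD-based API, now deprecated
val rdd = spark.sparkContext.parallelize(Seq("""{"a":123}"""))
val dfOld = spark.read.json(rdd)
```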

How was this patch tested?

There are existing tests. I left most tests using the existing APIs, since those now delegate to the new json API.

Please review http://spark.apache.org/contributing.html before opening a pull request.

…storing JSON, deprecating existing RDD APIs
def json(jsonRDD: RDD[String]): DataFrame = {
  import sparkSession.sqlContext.implicits._
  // The deprecated RDD overload now delegates to the new Dataset[String] overload
  json(sparkSession.createDataset(jsonRDD))
}
Contributor

nit: sparkSession.createDataset(jsonRDD)(Encoders.STRING)
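
For context, a runnable sketch contrasting the two ways to build the Dataset[String]; the object and value names are illustrative, not the PR's code:

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

object EncoderNitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val jsonRDD = spark.sparkContext.parallelize(Seq("""{"a":123}"""))

    // As in the diff: the String encoder arrives via the implicits import
    import spark.implicits._
    val viaImplicits: Dataset[String] = spark.createDataset(jsonRDD)

    // As suggested in the nit: pass the encoder explicitly, no import needed
    val viaEncoder: Dataset[String] = spark.createDataset(jsonRDD)(Encoders.STRING)

    spark.stop()
  }
}
```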

conf.asInstanceOf[JobConf],
classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text]).map(_._2.toString) // get the text line
import sparkSession.sqlContext.implicits._
sparkSession.createDataset(rdd)
Contributor

same here
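
To make the fragment above concrete, a hedged sketch of reading text lines through Hadoop's old TextInputFormat API and wrapping them in a Dataset[String]. The helper name is hypothetical, and the JobConf is assumed to already carry its input paths; only the shape of the call mirrors the diff:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextInputFormat}
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical helper: conf is assumed to have input paths set
// (e.g. via FileInputFormat.setInputPaths) before this is called.
def readTextLines(spark: SparkSession, conf: JobConf): Dataset[String] = {
  val rdd = spark.sparkContext.hadoopRDD(
    conf,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]).map(_._2.toString) // keep the line text, drop the byte offset
  // Explicit encoder, matching the reviewer's suggestion above
  spark.createDataset(rdd)(Encoders.STRING)
}
```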

def dataset(rdd: RDD[String]): Dataset[String] = {
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  spark.createDataset(rdd)
}
Contributor

same here

// This is really a test that it doesn't throw an exception
- val emptySchema = JsonInferSchema.infer(empty, "", new JSONOptions(Map.empty[String, String]))
+ val emptySchema = JsonInferSchema.infer(dataset(empty), "", new JSONOptions(Map.empty[String, String]))
Contributor

I think we can just write empty.toDS

@@ -231,4 +231,10 @@ private[json] trait TestJsonData {
lazy val singleRow: RDD[String] = spark.sparkContext.parallelize("""{"a":123}""" :: Nil)

def empty: RDD[String] = spark.sparkContext.parallelize(Seq[String]())

def dataset(rdd: RDD[String]): Dataset[String] = {
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  spark.createDataset(rdd)
}
Contributor

@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72747 has finished for PR 16895 at commit 3c477c1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72749 has finished for PR 16895 at commit bb304de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2017

Test build #72960 has finished for PR 16895 at commit 580b4e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class TransportChannelHandler extends ChannelInboundHandlerAdapter
  • class LinearSVCWrapperWriter(instance: LinearSVCWrapper) extends MLWriter
  • class LinearSVCWrapperReader extends MLReader[LinearSVCWrapper]
  • class NoSuchDatabaseException(val db: String) extends AnalysisException(s"Database '$db' not found")
  • class ResolveBroadcastHints(conf: CatalystConf) extends Rule[LogicalPlan]
  • case class JsonToStruct(
  • case class StructToJson(
  • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode
  • case class InnerOuterEstimation(conf: CatalystConf, join: Join) extends Logging
  • case class LeftSemiAntiEstimation(conf: CatalystConf, join: Join)
  • case class NumericRange(min: JDecimal, max: JDecimal) extends Range
  • class FileStreamOptions(parameters: CaseInsensitiveMap[String]) extends Logging

// This is really a test that it doesn't throw an exception
val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
Contributor

doesn't empty.toDS work?

Contributor Author

I can double-check, but the toDS call appears to require the Spark implicits import.

Contributor Author
@pjfanning pjfanning Feb 16, 2017

RDD only gains a toDS() method when SQLImplicits applies an implicit conversion that wraps the RDD in a DatasetHolder, which requires:
import sparkSession.sqlContext.implicits._

I switched to the spark.createDataset approach based on your first comment on 3c477c1
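
A small runnable sketch of the mechanics being described (object and value names assumed for the example):

```scala
import org.apache.spark.sql.SparkSession

object ToDSSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val empty = spark.sparkContext.parallelize(Seq.empty[String])

    // empty.toDS() only compiles after this import: SQLImplicits provides
    // rddToDatasetHolder, which wraps the RDD in a DatasetHolder exposing toDS().
    import spark.implicits._
    val ds = empty.toDS()
    assert(ds.count() == 0)

    spark.stop()
  }
}
```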

Contributor

I think the implicits are already imported at the beginning of this test suite

@SparkQA

SparkQA commented Feb 16, 2017

Test build #73015 has finished for PR 16895 at commit 731951a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, I'll merge it in 1 or 2 days if no one is against this API change.

pj.fanning added 2 commits February 18, 2017 01:48
# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@SparkQA

SparkQA commented Feb 18, 2017

Test build #73086 has finished for PR 16895 at commit 82561c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class LSHParams(Params):
  • class LSHModel(JavaModel):
  • class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
  • class BucketedRandomProjectionLSHModel(LSHModel, JavaMLReadable, JavaMLWritable):
  • class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
  • class MinHashLSHModel(LSHModel, JavaMLReadable, JavaMLWritable):
  • case class StreamingExplainCommand(
  • case class SaveIntoDataSourceCommand(
  • abstract class JsonDataSource[T] extends Serializable

@SparkQA

SparkQA commented Feb 18, 2017

Test build #73087 has finished for PR 16895 at commit cdf53bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in d314750 Feb 23, 2017
@pjfanning pjfanning deleted the SPARK-15615 branch February 25, 2017 20:50
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
…storing JSON

## What changes were proposed in this pull request?

SPARK-15615 proposes replacing sqlContext.read.json(rdd) with a Dataset-based equivalent.
SPARK-15463 added a CSV API for reading from Dataset[String], so this keeps the API consistent.
I am deprecating the existing RDD-based APIs.

## How was this patch tested?

There are existing tests. I left most tests using the existing APIs, since those now delegate to the new json API.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: pj.fanning <pj.fanning@workday.com>
Author: PJ Fanning <pjfanning@users.noreply.github.com>

Closes apache#16895 from pjfanning/SPARK-15615.