[SPARK-15615][SQL] Add an API to load DataFrame from Dataset[String] storing JSON #16895
Conversation
…storing JSON, deprecating existing RDD APIs
def json(jsonRDD: RDD[String]): DataFrame = {
  import sparkSession.sqlContext.implicits._
  json(sparkSession.createDataset(jsonRDD))
nit: sparkSession.createDataset(jsonRDD)(Encoders.STRING)
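A minimal standalone sketch of the suggestion, assuming a local SparkSession and a Spark build that includes the Dataset[String] overload this PR adds; the object name and sample records are illustrative, not from the PR:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}

object EncoderNitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("encoder-nit").getOrCreate()
    val jsonRDD: RDD[String] = spark.sparkContext.parallelize(Seq("""{"a": 1}""", """{"a": 2}"""))

    // Passing the encoder explicitly avoids importing sparkSession.sqlContext.implicits._
    // inside the deprecated RDD overload.
    val jsonDs: Dataset[String] = spark.createDataset(jsonRDD)(Encoders.STRING)

    // The Dataset[String] can then be handed to the Dataset-based json overload.
    val df: DataFrame = spark.read.json(jsonDs)
    df.show()
    spark.stop()
  }
}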
  conf.asInstanceOf[JobConf],
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text]).map(_._2.toString) // get the text line
import sparkSession.sqlContext.implicits._
sparkSession.createDataset(rdd)
same here
def dataset(rdd: RDD[String]): Dataset[String] = {
  val sqlContext = spark.sqlContext
  import sqlContext.implicits._
  spark.createDataset(rdd)
same here
// This is really a test that it doesn't throw an exception
- val emptySchema = JsonInferSchema.infer(empty, "", new JSONOptions(Map.empty[String, String]))
+ val emptySchema = JsonInferSchema.infer(dataset(empty), "", new JSONOptions(Map.empty[String, String]))
I think we can just write empty.toDS
@@ -231,4 +231,10 @@ private[json] trait TestJsonData {
  lazy val singleRow: RDD[String] = spark.sparkContext.parallelize("""{"a":123}""" :: Nil)

  def empty: RDD[String] = spark.sparkContext.parallelize(Seq[String]())

  def dataset(rdd: RDD[String]): Dataset[String] = {
actually we don't need this, see https://github.com/apache/spark/pull/16895/files#r100681578
ok to test
Test build #72747 has finished for PR 16895 at commit
Test build #72749 has finished for PR 16895 at commit
Test build #72960 has finished for PR 16895 at commit
// This is really a test that it doesn't throw an exception
val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
doesn't empty.toDS work?
I can double check but the toDS call appears to require the spark implicits import
RDD only gets the toDS() function when SQLImplicits applies an implicit conversion that wraps the RDD in a DatasetHolder, which requires import sparkSession.sqlContext.implicits._ to be in scope.
I switched to this spark.createDataset approach based on your first comment on 3c477c1
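A short sketch of the point being made here, assuming a local SparkSession; the object and variable names are illustrative. toDS() becomes available on the RDD only once the implicits import brings the SQLImplicits conversion into scope:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

object ToDSSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("toDS-sketch").getOrCreate()
    val rdd: RDD[String] = spark.sparkContext.parallelize(Seq("""{"a":123}"""))

    // toDS() is not defined on RDD itself; the implicit conversion in
    // SQLImplicits wraps the RDD in a DatasetHolder, so the implicits
    // import must be in scope before rdd.toDS() compiles.
    import spark.implicits._
    val ds: Dataset[String] = rdd.toDS()

    ds.show()
    spark.stop()
  }
}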
I think the implicits are already imported at the beginning of this test suite
Test build #73015 has finished for PR 16895 at commit
LGTM, I'll merge it in 1 or 2 days if no one is against this API change.
# Conflicts:
#   sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonInferSchema.scala
#   sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
Test build #73086 has finished for PR 16895 at commit
Test build #73087 has finished for PR 16895 at commit
thanks, merging to master!
…storing JSON

## What changes were proposed in this pull request?
SPARK-15615 proposes replacing the sqlContext.read.json(rdd) with a dataset equivalent. SPARK-15463 adds a CSV API for reading from Dataset[String] so this keeps the API consistent. I am deprecating the existing RDD based APIs.

## How was this patch tested?
There are existing tests. I left most tests to use the existing APIs as they delegate to the new json API.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: pj.fanning <pj.fanning@workday.com>
Author: PJ Fanning <pjfanning@users.noreply.github.com>

Closes apache#16895 from pjfanning/SPARK-15615.
What changes were proposed in this pull request?
SPARK-15615 proposes replacing sqlContext.read.json(rdd) with a Dataset[String] equivalent.
SPARK-15463 adds a CSV API for reading from Dataset[String], so this keeps the API consistent.
I am deprecating the existing RDD-based APIs.
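A minimal usage sketch of the proposed Dataset[String] overload, assuming a Spark build that includes this change; the object name, session setup, and sample records are illustrative:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object JsonFromDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json-from-dataset").getOrCreate()
    import spark.implicits._

    // JSON records held in a Dataset[String] rather than an RDD[String].
    val jsonLines: Dataset[String] = Seq("""{"a": 1, "b": "x"}""", """{"a": 2, "b": "y"}""").toDS()

    // The Dataset-based overload proposed here, mirroring the CSV API added by SPARK-15463.
    val df: DataFrame = spark.read.json(jsonLines)
    df.printSchema()
    df.show()
    spark.stop()
  }
}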
How was this patch tested?
There are existing tests. I left most tests to use the existing APIs as they delegate to the new json API.
Please review http://spark.apache.org/contributing.html before opening a pull request.