Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-15615] [SQL] Support Json input from Dataset[String] #13460

Closed
wants to merge 8 commits into from
Closed

[SPARK-15615] [SQL] Support Json input from Dataset[String] #13460

wants to merge 8 commits into from

Conversation

pjfanning
Copy link
Contributor

What changes were proposed in this pull request?

[SPARK-15615] add new json function that takes Dataset[String] as input and deprecate the existing RDD based functions

How was this patch tested?

Changed the existing unit tests

PJ Fanning added 3 commits May 28, 2016 08:07
…o json-dataset

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/TestJsonData.scala
@HyukjinKwon
Copy link
Member

I guess it would be nicer if the title has the form such as [SPARK-15615][SQL] ... according to https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.

@@ -335,16 +336,32 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* input once to determine the input schema.
*
* @param jsonRDD input RDD with one JSON object per record
* @since 1.4.0
* @since 1.4.0*
Copy link
Member

@HyukjinKwon HyukjinKwon Jun 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is a typo here, *.

@pjfanning pjfanning changed the title Json dataset [SPARK-15615] [SQL] Json dataset Jun 2, 2016
@pjfanning pjfanning changed the title [SPARK-15615] [SQL] Json dataset [SPARK-15615] [SQL] Support Json input from Dataset[String] Jun 2, 2016
def complexFieldAndType1: RDD[String] =
spark.sparkContext.parallelize(
def complexFieldAndType1: Dataset[String] =
sqlContext.createDataset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is preferred to use spark rather than sqlContext..

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jun 2, 2016

I just fetched this PR and run tests. It seems "SPARK-7565 MapType in JsonRDD" test is being failed. Maybe this should be solved. Also, if json(jsonRDD: JavaRDD[String]) is being deprecated, then I think its usages should be changed to the new ones in this PR or in a follow-up.

BTW, I guess it is arguable to add a new API. For me, I feel it is a bit questionable to add this API because there is already rdd one, json(jsonRDD: RDD[String]). Dataset one might be able to be easily done with this API like the blow:

json(jsonDataset.rdd)

I guess APIs would not be added only for consistency.

Maybe I think we should wait for a committer's call.

@pjfanning
Copy link
Contributor Author

@HyukjinKwon this change in input parameter relates to #13300. There was a request there to treat Dataset[String] as a preferred input to RDD[String].

@pjfanning
Copy link
Contributor Author

@HyukjinKwon all the JsonSuite tests pass for me on my laptop - would it be feasible to get this reviewed again?

# Conflicts:
#	sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants