User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions. #24724
swapnilushinde wants to merge 2 commits into apache:master
Conversation
Updating the fork with latest
…t specifying explicit schema using structType.
Can one of the admins verify this patch?
Hi, @swapnilushinde. Thank you for making a PR, but do you know the following? It's a one-liner.
scala> spark.version
res0: String = 2.4.3
scala> spark.read.schema("id int, name string, subject string, marks int, result boolean").load("/tmp/csv").printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- subject: string (nullable = true)
|-- marks: integer (nullable = true)
|-- result: boolean (nullable = true)
Hi, @dongjoon-hyun Thanks for the reply. Yes, I use this API sometimes as well. Passing the schema as a DDL string is a one-liner, but it would still require defining a case class for Dataset creation anyway. So creating a Dataset would require defining the schema as both a DDL string and a case class, for instance (see the sketch below). The above change would need to define the schema just once with a Product class, and Datasets/DataFrames can be created easily.
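A minimal sketch of that duplication, assuming the Person schema used elsewhere in this thread (the path is illustrative):
case class Person(name: String, age: Long)   // schema defined once as a case class...
val ds = spark.read
  .schema("name string, age long")           // ...and again as a DDL string
  .csv("/tmp/csv")
  .as[Person]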
First of all, the following are the most frequent use cases (and the recommended way):
scala> case class Person(name: String, age: Long)
scala> spark.read.option("header", true).option("inferSchema", true).csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
scala> spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
res0: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
I believe the above two are more natural. Anyway, cc @HyukjinKwon and @MaxGekk
The API itself is two lines. Whether it's a one-liner or a two-liner, the workaround is easy. I don't think we need this, and I would like to avoid introducing other variants like this.
There's virtually no diff between:
case class Person(name: String, age: Long)
val df = spark.createDataFrame[Person]("/tmp/csv")
vs.
case class Person(name: String, age: Long)
spark.read.schema("name string, age long").csv("/tmp/csv").as[Person]
and it's super confusing that …
Hello @HyukjinKwon @MaxGekk - The proposed API gives a single way to define the schema using a case class and to load CSV without StructType or DDL definitions. As for the Parquet and JSON formats: it is already easy to load those formats with a schema, so there is no need (or confusion) to create an equivalent API like the one proposed.
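For context, a rough sketch of how a schema can already be applied to those formats today (the Person case class and the /tmp paths are illustrative):
case class Person(name: String, age: Long)
// JSON: the same reader API accepts a schema, e.g. as a DDL string
val fromJson = spark.read.schema("name string, age long").json("/tmp/json").as[Person]
// Parquet: the files carry their own schema, so none needs to be supplied
val fromParquet = spark.read.parquet("/tmp/parquet").as[Person]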
The idea looks interesting, especially getting the schema from a case class. How about a new variant:
case class Person(name: String, age: Long)
spark.read.schema[Person].csv("/tmp/csv").as[Person]
Why don't we just call:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Person].schema
spark.read.schema(schema).csv("/tmp/csv").as[Person]
? Once we allow this, we would have to consider allowing it everywhere else as well.
I just think users are not aware of it. It would be nice to add an example to http://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
Yes, then, I would leave a note for …
+1 for @HyukjinKwon's comment about adding a note instead of this code.
@swapnilushinde Will you open a PR for the comments? Please let us know if you don't have time for that; I will do it.
Hello @MaxGekk, I am not able to see an option to reopen the PR for comments. Could you please do it from your end and let me know how to proceed?
@MaxGekk - Following up. Please let me know how to proceed on this.
@swapnilushinde Open a new PR with updated comments, and add an example of using …
What changes were proposed in this pull request?
Many users frequently load structured data from CSV datasources. With the current APIs, it is very common to load CSV as a DataFrame whose schema needs to be defined as a StructType object. Many users then convert the DataFrame to a Dataset of Product objects (case classes).
Loading CSV files this way is relatively complex and can easily be simplified. This change makes working with CSV files more user friendly.
Input -
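The sample input was not preserved in this thread; a hypothetical CSV matching the id/name/subject/marks/result schema mentioned in the comments above could look like:
id,name,subject,marks,result
1,Alice,math,85,true
2,Bob,physics,47,false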
Current approach -
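The original snippet was not preserved; a sketch of the current approach (illustrative path and names):
import org.apache.spark.sql.types._
// The schema has to be spelled out as a StructType...
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("subject", StringType),
  StructField("marks", IntegerType),
  StructField("result", BooleanType)))
// ...and duplicated as a case class for the Dataset conversion.
case class Record(id: Int, name: String, subject: String, marks: Int, result: Boolean)
val ds = spark.read.option("header", true).schema(schema).csv("/tmp/csv").as[Record]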
Proposed change -
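The original snippet was also not preserved; a sketch of what the proposed API would look like, following the createDataFrame[A]("/tmp/csv") form quoted earlier in the discussion (a proposed API, not an existing Spark one):
case class Record(id: Int, name: String, subject: String, marks: Int, result: Boolean)
// Proposed: the schema is derived from the Product type,
// so no StructType or DDL string is needed.
val df = spark.createDataFrame[Record]("/tmp/csv")  // hypothetical, per this PR
val ds = spark.createDataset[Record]("/tmp/csv")    // hypothetical, per this PR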
How was this patch tested?
This change is manually tested. I didn't see similar createDataset/createDataFrame unit test cases. Please let me know the best place to add unit tests for this and the existing similar APIs.