
[SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames #19429

Closed
wants to merge 8 commits

Conversation

@jomach (Contributor) commented Oct 4, 2017

What changes were proposed in this pull request?

Added documentation for loading CSV files into DataFrames.

How was this patch tested?

/dev/run-tests

 -Added documentation for loading csv files into Dataframes
@felixcheung (Member) left a comment:

there is no people.csv though. these examples need to be runnable

 - Some examples on how to create a dataframe with a csv file

(cherry picked from commit e8ca1dc)
 -Added documentation for loading csv files into Dataframes
 - Some examples on how to create a dataframe with a csv file

(cherry picked from commit a546421)
# Conflicts:
#	docs/sql-programming-guide.md
#	examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
#	examples/src/main/r/RSparkSQLExample.R
@jomach (Contributor, Author) commented Oct 5, 2017

@felixcheung Sorry for that. It should be there now. Can you test? Thanks.

@@ -115,7 +115,20 @@ private static void runBasicDataSourceExample(SparkSession spark) {
Dataset<Row> peopleDF =
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
Member:

You still need to keep

// $example off:manual_load_options$

# $example on:manual_load_options_csv$
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
# $example off:manual_load_options_csv
Member:

This needs to be corrected to

$example off:manual_load_options_csv$

@@ -0,0 +1,3 @@
name;age;job
Jorge;30;Developer
Bob;32;Developer
Member:

Add an empty line.

.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv");
// $example off:manual_load_options_csv$
Member:

Lines 125-131 are a duplicate.

</div>
</div>

To load a csv file you can use:
Member:

This is also a duplicate.

 - Some examples on how to create a dataframe with a csv file
@jomach (Contributor, Author) commented Oct 10, 2017

@gatorsmile I addressed your comments. Still, I cannot run the jekyll build...
SKIP_API=1 jekyll build --incremental
Configuration file: /Users/jorge/Downloads/spark/docs/_config.yml
Deprecation: The 'gems' configuration option has been renamed to 'plugins'. Please update your config file accordingly.
Source: /Users/jorge/Downloads/spark/docs
Destination: /Users/jorge/Downloads/spark/docs/_site
Incremental build: enabled
Generating...
Liquid Exception: invalid byte sequence in US-ASCII in _layouts/redirect.html
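(The "invalid byte sequence in US-ASCII" Liquid error is usually a locale problem on the local machine rather than anything in this PR; a common workaround, assumed here and not verified against this environment, is to run the build under a UTF-8 locale, e.g. with LC_ALL=en_US.UTF-8 and LANG=en_US.UTF-8 set before invoking jekyll.)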

Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
Member:

Could you change the indents of lines 121-123 to 2 spaces?

.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
Member:

Could you change the indents of lines 54-57 to 2 spaces?
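For reference, a sketch (not the PR's final diff) of the Scala example block with the requested two-space continuation indents, assuming the SparkSession named spark that SQLDataSourceExample already provides:

// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")
// $example off:manual_load_options_csv$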

@gatorsmile (Member):
ok to test

@jomach (Contributor, Author) commented Oct 11, 2017

@gatorsmile PR comments fixed. Sorry, it's my first time.

@SparkQA commented Oct 11, 2017

Test build #82629 has finished for PR 19429 at commit 68799ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<div data-lang="r" markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
</div>
</div>
Member:

Let's add another newline here. It breaks rendering.

@@ -461,6 +461,8 @@ name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can al
names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
source type can be converted into other types using this syntax.

To load a json file you can use:
Member:

I'd say JSON instead of json.

@@ -112,6 +112,11 @@ namesAndAges <- select(df, "name", "age")
write.df(namesAndAges, "namesAndAges.parquet", "parquet")
# $example off:manual_load_options$

# $example on:manual_load_options_csv$
Member:

I'd add a newline above here to keep this file consistent.

@@ -49,6 +49,14 @@ object SQLDataSourceExample {
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
Member:

Two-space indents here (no tabs, of course).

@@ -116,6 +116,13 @@ private static void runBasicDataSourceExample(SparkSession spark) {
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
Member:

ditto

@HyukjinKwon (Member):
When I opened the JIRA, I had in mind a chapter such as https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. This chapter, Manually Specifying Options, looks like it describes how to specify options, BTW.

@SparkQA commented Oct 11, 2017

Test build #82628 has finished for PR 19429 at commit cd69fa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -479,6 +481,25 @@ source type can be converted into other types using this syntax.
</div>
</div>

To load a csv file you can use:
Member:

ditto

@jomach (Contributor, Author) commented Oct 11, 2017

@gatorsmile PR comments fixed. The problem with the current docs is that people who start with Spark usually don't start with JSON files but with CSV files, to "see" something...

@HyukjinKwon (Member):
I intended to explain multiLine, inferSchema and header, which are arguably commonly used, rather than just show the examples. The JSON one explains multiLine and each line of the examples with detailed comments.

Another point is, I'd like to let users see the options (without duplication) rather than have to check the API documentation, as I am quite sure newbies often misunderstand this. For example, I happen to see newbies setting inferSchema to true on non-CSV data sources from time to time, or setting com.databricks.spark.csv instead of csv.
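For context, a minimal Scala sketch (not part of the PR text) of the commonly used CSV read options mentioned above, assuming the people.csv sample added in this PR (name;age;job, ';'-separated):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvOptionsSketch")
  .master("local[*]")
  .getOrCreate()

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")            // field delimiter used by people.csv
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // extra pass over the data to guess column types
  .option("multiLine", "true")   // allow quoted values to span multiple lines
  .load("examples/src/main/resources/people.csv")

peopleDFCsv.show()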

@HyukjinKwon (Member):
I wasn't even quite sure when I opened the JIRA. That's why I asked one of the PMC members, who might have better insight. I am okay with going ahead with this as a small improvement to the docs if any committer likes it (though I don't support it), but please leave the JIRA open. I think this PR does not fully solve the issue.

@SparkQA commented Oct 11, 2017

Test build #82630 has finished for PR 19429 at commit 7ff1d84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member):
+1 for more detailed documentation (we should steer away from inferSchema)
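As an aside (not part of this PR), steering away from inferSchema usually means supplying an explicit schema; a minimal Scala sketch, assuming an existing SparkSession named spark and the people.csv layout from this PR:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema instead of inferSchema: no extra pass over the data,
// and the resulting column types are deterministic.
val peopleSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("job", StringType, nullable = true)
))

val peopleWithSchema = spark.read.format("csv")
  .schema(peopleSchema)
  .option("sep", ";")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")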

@gatorsmile (Member):
@jomach is a new contributor to Apache Spark. It might be hard for him to address the above comments. Please submit a separate PR for addressing it. Will review it. Thanks!

@gatorsmile (Member) commented Oct 12, 2017

LGTM

Thanks! Merged to master.

@asfgit closed this in ccdf21f Oct 12, 2017
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}

</div>
</div>
### Run SQL on files directly
Member:

Yup, that's okay. BTW, what I initially meant in #19429 (comment) was a newline between </div> and ### Run .. (not between ...ample.R %} and </div>). This breaks rendering:

Let's not forget to fix this up before the release if the follow-up can't be made ahead of it.

@jomach (Contributor, Author):

@HyukjinKwon should I add a new line between lines 503 and 504?
For example:

{% include_example generic_load_save_functions r/RSparkSQLExample.R %}

</div>
</div>

### Manually Specifying Options

Member:

Yup, a newline between 503 and 504.
