
[SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames #19429

Closed
wants to merge 8 commits

Conversation

@jomach (Contributor) commented Oct 4, 2017

What changes were proposed in this pull request?

Added documentation for loading CSV files into DataFrames.

How was this patch tested?

/dev/run-tests

 -Added documentation for loading csv files into Dataframes
@felixcheung (Member) left a comment:

there is no people.csv though. these examples need to be runnable

 - Some examples on how to create a dataframe with a csv file

(cherry picked from commit e8ca1dc)
 -Added documentation for loading csv files into Dataframes
 - Some examples on how to create a dataframe with a csv file

(cherry picked from commit a546421)
# Conflicts:
#	docs/sql-programming-guide.md
#	examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java
#	examples/src/main/r/RSparkSQLExample.R
@jomach (Contributor, Author) commented Oct 5, 2017

@felixcheung Sorry for that. It should be there now. Can you test? Thanks.

@@ -115,7 +115,20 @@ private static void runBasicDataSourceExample(SparkSession spark) {
Dataset<Row> peopleDF =
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
Member:

You still need to keep

// $example off:manual_load_options$

# $example on:manual_load_options_csv$
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
# $example off:manual_load_options_csv
Member:

This needs to be corrected to

$example off:manual_load_options_csv$

@@ -0,0 +1,3 @@
name;age;job
Jorge;30;Developer
Bob;32;Developer
Member:

Add an empty line.

.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv");
// $example off:manual_load_options_csv$
Member:

Lines 125-131 are a duplicate.

</div>
</div>

To load a csv file you can use:
Member:

This is also a duplicate.

 - Some examples on how to create a dataframe with a csv file
@jomach (Contributor, Author) commented Oct 10, 2017

@gatorsmile I addressed your comments. Still, I cannot run the jekyll build...
SKIP_API=1 jekyll build --incremental
Configuration file: /Users/jorge/Downloads/spark/docs/_config.yml
Deprecation: The 'gems' configuration option has been renamed to 'plugins'. Please update your config file accordingly.
Source: /Users/jorge/Downloads/spark/docs
Destination: /Users/jorge/Downloads/spark/docs/_site
Incremental build: enabled
Generating...
Liquid Exception: invalid byte sequence in US-ASCII in _layouts/redirect.html
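(The "invalid byte sequence in US-ASCII" Liquid error is usually a locale problem on the local machine rather than anything in this PR; a common workaround, assumed here and not verified against this environment, is to run the build under a UTF-8 locale, e.g. with LC_ALL=en_US.UTF-8 and LANG=en_US.UTF-8 set before invoking jekyll.)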

Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
Member:

Could you change the indents of lines 121-123 to 2 spaces?

.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
Member:

Could you change the indents of lines 54-57 to 2 spaces?
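For reference, a sketch (not the PR's final diff) of the Scala example block with the requested two-space continuation indents, assuming the SparkSession named spark that SQLDataSourceExample already provides:

// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")
// $example off:manual_load_options_csv$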

@gatorsmile (Member):
ok to test

@jomach (Contributor, Author) commented Oct 11, 2017

@gatorsmile PR comments fixed. Sorry, it's my first time.

@SparkQA commented Oct 11, 2017

Test build #82629 has finished for PR 19429 at commit 68799ed.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

<div data-lang="r" markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
</div>
</div>
Member:

Let's add another newline here. It breaks rendering.

@@ -461,6 +461,8 @@ name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can al
names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
source type can be converted into other types using this syntax.

To load a json file you can use:
Member:

I'd say JSON instead of json.

@@ -112,6 +112,11 @@ namesAndAges <- select(df, "name", "age")
write.df(namesAndAges, "namesAndAges.parquet", "parquet")
# $example off:manual_load_options$

# $example on:manual_load_options_csv$
Member:

I'd add a newline above here to keep this file consistent.

@@ -49,6 +49,14 @@ object SQLDataSourceExample {
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
Member:

Two-space indents here (no tabs, of course).

@@ -116,6 +116,13 @@ private static void runBasicDataSourceExample(SparkSession spark) {
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
Member:

ditto

@HyukjinKwon (Member):
When I opened the JIRA, I had in mind a chapter such as https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. This chapter, Manually Specifying Options, looks like it describes how to specify options, BTW.

@SparkQA commented Oct 11, 2017

Test build #82628 has finished for PR 19429 at commit cd69fa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -479,6 +481,25 @@ source type can be converted into other types using this syntax.
</div>
</div>

To load a csv file you can use:
Member:

ditto

@jomach (Contributor, Author) commented Oct 11, 2017

@gatorsmile PR comments fixed. The problem with the current docs is that people who start with Spark usually don't start with JSON files but with CSV files, to "see" something...

@HyukjinKwon (Member):
I intended to explain multiLine, inferSchema and header, which are arguably commonly used, rather than just show the examples. The JSON one explains multiLine and each line of the examples with detailed comments.

Another point is, I'd like to let users see the options (without duplication) rather than have to check the API documentation, as I am quite sure newbies often misunderstand this. For example, I happen to see newbies setting inferSchema to true on non-CSV data sources from time to time, or setting com.databricks.spark.csv instead of csv.
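For context, a minimal Scala sketch (not part of the PR text) of the commonly used CSV read options mentioned above, assuming the people.csv sample added in this PR (name;age;job, ';'-separated):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvOptionsSketch")
  .master("local[*]")
  .getOrCreate()

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")            // field delimiter used by people.csv
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // extra pass over the data to guess column types
  .option("multiLine", "true")   // allow quoted values to span multiple lines
  .load("examples/src/main/resources/people.csv")

peopleDFCsv.show()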

@HyukjinKwon (Member):
I wasn't even quite sure when I opened the JIRA. That's why I asked one of the PMC members, who might have better insight. I am okay with going ahead with this as a small improvement to the docs if any committer likes it (though I don't support it), but please leave the JIRA open. I think this PR does not fully solve the issue.

@SparkQA commented Oct 11, 2017

Test build #82630 has finished for PR 19429 at commit 7ff1d84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member):
+1 for more detailed documentation (we should steer away from inferSchema)
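As an aside (not part of this PR), steering away from inferSchema usually means supplying an explicit schema; a minimal Scala sketch, assuming an existing SparkSession named spark and the people.csv layout from this PR:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Explicit schema instead of inferSchema: no extra pass over the data,
// and the resulting column types are deterministic.
val peopleSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("job", StringType, nullable = true)
))

val peopleWithSchema = spark.read.format("csv")
  .schema(peopleSchema)
  .option("sep", ";")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")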

@gatorsmile (Member):
@jomach is a new contributor to Apache Spark. It might be hard for him to address the above comments. Please submit a separate PR for addressing it. Will review it. Thanks!

@gatorsmile (Member) commented Oct 12, 2017

LGTM

Thanks! Merged to master.

@asfgit closed this in ccdf21f Oct 12, 2017
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}

</div>
</div>
### Run SQL on files directly
Member:

Yup, that's okay. BTW, what I initially meant in #19429 (comment) was a newline between </div> and ### Run .. (not between ...ample.R %} and </div>). This breaks rendering:

Let's not forget to fix this up before the release if the follow-up can't be made ahead of it.

@jomach (Contributor, Author):

@HyukjinKwon should I add a new line between lines 503 and 504?
For example:

{% include_example generic_load_save_functions r/RSparkSQLExample.R %}

</div>
</div>

### Manually Specifying Options

Member:

Yup, a newline between 503 and 504.
