[SPARK-20055] [Docs] Added documentation for loading csv files into DataFrames #19429
- Added documentation for loading CSV files into DataFrames
There is no people.csv, though. These examples need to be runnable.
- Some examples on how to create a dataframe with a csv file (cherry picked from commit e8ca1dc)
- Added documentation for loading CSV files into DataFrames
- Some examples on how to create a dataframe with a csv file (cherry picked from commit a546421)
# Conflicts: # docs/sql-programming-guide.md # examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java # examples/src/main/r/RSparkSQLExample.R
@felixcheung Sorry about that. It should be there now. Can you test? Thanks.
@@ -115,7 +115,20 @@ private static void runBasicDataSourceExample(SparkSession spark) {
Dataset<Row> peopleDF =
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
You still need to keep `// $example off:manual_load_options$`.
# $example on:manual_load_options_csv$
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
# $example off:manual_load_options_csv
This needs to be corrected to `$example off:manual_load_options_csv$`.
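To illustrate why the missing trailing `$` matters, here is a minimal sketch of marker-based snippet extraction. It is not Spark's actual `include_example` plugin; the function name `extract_example` and the sample strings are hypothetical, and the matching rule (a label must be closed by a trailing `$`) is an assumption based on the marker format shown in this diff.

```python
import re
from typing import List


def extract_example(source: str, label: str) -> List[str]:
    """Return the lines between `$example on:<label>$` and `$example off:<label>$`.

    Hypothetical helper: a marker only counts if it ends with a closing `$`.
    """
    on = re.compile(r"\$example on:" + re.escape(label) + r"\$")
    off = re.compile(r"\$example off:" + re.escape(label) + r"\$")
    collected: List[str] = []
    inside = False
    for line in source.splitlines():
        if on.search(line):
            inside = True
        elif off.search(line):
            inside = False
        elif inside:
            collected.append(line)
    if inside:
        # The opening marker was seen but never closed.
        raise ValueError(f"no closing marker for example {label!r}")
    return collected


# Illustrative inputs only; the code line is treated as plain text here.
GOOD = "\n".join([
    "# $example on:manual_load_options_csv$",
    'df = spark.read.load("people.csv", format="csv")',
    "# $example off:manual_load_options_csv$",
])
BROKEN = GOOD[:-1]  # drops the trailing "$", as in the reviewed diff
```

Under these assumptions, `extract_example(GOOD, "manual_load_options_csv")` returns the single code line, while the same call on `BROKEN` raises `ValueError` because the closing marker is never recognized.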
@@ -0,0 +1,3 @@
name;age;job
Jorge;30;Developer
Bob;32;Developer
Add an empty line.
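The data file in this diff is semicolon-delimited, while the Python snippet in this PR passes `sep=":"`, so it may be worth confirming that the separator matches the file. The sketch below uses Python's standard-library `csv` module, not Spark's CSV reader, to show what a delimiter mismatch generally does to such a file. The helper `read_rows` and the constant `PEOPLE_CSV` are hypothetical names introduced for illustration.

```python
import csv
import io

# Same three lines as the people.csv added in this PR.
PEOPLE_CSV = "name;age;job\nJorge;30;Developer\nBob;32;Developer\n"


def read_rows(text, sep):
    """Parse CSV text with a header row, using `sep` as the field delimiter."""
    return list(csv.DictReader(io.StringIO(text), delimiter=sep))


rows_ok = read_rows(PEOPLE_CSV, ";")   # delimiter matches the file
rows_bad = read_rows(PEOPLE_CSV, ":")  # delimiter does not occur in the file

print(list(rows_ok[0].keys()))   # three separate column names
print(list(rows_bad[0].keys()))  # one column holding the whole header line
```

With the matching delimiter each row has the three fields `name`, `age`, and `job`. With the mismatched one, every line ends up in a single column named `name;age;job`.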
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv");
// $example off:manual_load_options_csv$
Line 125-131 is a duplicate.
docs/sql-programming-guide.md
Outdated
</div>
</div>

To load a csv file you can use:
This is also a duplicate.
- Some examples on how to create a dataframe with a csv file
@gatorsmile I addressed your comments. I still cannot run the Jekyll build...
Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
Could you change the indents of line 121-123 to 2 spaces?
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.load("examples/src/main/resources/people.csv")
Could you change the indents of line 54-57 to 2 spaces?
ok to test
- PR comments
@gatorsmile PR comments fixed. Sorry, but this is my first time.
Test build #82629 has finished for PR 19429 at commit
|
docs/sql-programming-guide.md
Outdated
<div data-lang="r" markdown="1">
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}
</div>
</div>
Let's add another newline here. It breaks rendering.
docs/sql-programming-guide.md
Outdated
@@ -461,6 +461,8 @@ name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can al
names (`json`, `parquet`, `jdbc`, `orc`, `libsvm`, `csv`, `text`). DataFrames loaded from any data
source type can be converted into other types using this syntax.

To load a json file you can use:
I'd say `JSON` instead of `json`.
@@ -112,6 +112,11 @@ namesAndAges <- select(df, "name", "age")
write.df(namesAndAges, "namesAndAges.parquet", "parquet")
# $example off:manual_load_options$

# $example on:manual_load_options_csv$
I'd add a newline above this line to stay consistent with the rest of this file.
@@ -49,6 +49,14 @@ object SQLDataSourceExample {
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
val peopleDFCsv = spark.read.format("csv")
.option("sep", ";")
Two-space indentation here (no tabs, of course).
@@ -116,6 +116,13 @@ private static void runBasicDataSourceExample(SparkSession spark) {
spark.read().format("json").load("examples/src/main/resources/people.json");
peopleDF.select("name", "age").write().format("parquet").save("namesAndAges.parquet");
// $example off:manual_load_options$
// $example on:manual_load_options_csv$
Dataset<Row> peopleDFCsv = spark.read().format("csv")
.option("sep", ";")
ditto
When I opened a JIRA, I thought a chapter such as https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. This chapter, |
Test build #82628 has finished for PR 19429 at commit
|
docs/sql-programming-guide.md
Outdated
@@ -479,6 +481,25 @@ source type can be converted into other types using this syntax.
</div>
</div>

To load a csv file you can use:
ditto
- PR comments
@gatorsmile PR comments fixed. The problem with the current docs is that people who are starting out with Spark usually don't begin with JSON files but with CSV files, to "see" something...
I intended to explain Another point is, I'd like to let users see the options (without duplications) rather than checking API documentation as I am quite sure newbies often misunderstand this. For example, I happen to see newbies setting |
I wasn't even quite sure when I opened the JIRA. That's why I asked one of the PMC members, who might have better insight. I am okay with going ahead with this as a small improvement to the docs if any committer likes it (though I don't support it), but please leave the JIRA open. I think this PR does not fully solve the issue.
Test build #82630 has finished for PR 19429 at commit
|
+1 for more detailed documentation (we should steer away from |
@jomach is a new contributor to Apache Spark. It might be hard for him to address the above comments. Please submit a separate PR to address them. I will review it. Thanks!
LGTM. Thanks! Merged to master.
{% include_example manual_load_options_csv r/RSparkSQLExample.R %}

</div>
</div>
### Run SQL on files directly
Yup, that's okay. BTW, what I initially meant in #19429 (comment) was a newline between `</div>` and `### Run ..` (not between `...ample.R %}` and `</div>`). This breaks rendering.
Let's not forget to fix this up before the release if the follow-up can't be made before then.
@HyukjinKwon Should I add a newline between lines 503 and 504?
For example:
{% include_example generic_load_save_functions r/RSparkSQLExample.R %}
</div>
</div>
### Manually Specifying Options
Yup, a newline between 503 and 504.
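As a rough way to catch this class of problem before review, the sketch below flags Markdown headings that are not preceded by a blank line, which is the situation the reviewer describes for `</div>` followed directly by `### Run SQL on files directly`. The function name `headings_missing_blank_line` is hypothetical, and the check is deliberately simple: it does not skip fenced code blocks, so `#` comments inside code samples could be reported as well.

```python
import re
from typing import List

# An ATX heading: one to six "#" characters followed by whitespace.
HEADING = re.compile(r"#{1,6}\s")


def headings_missing_blank_line(markdown_text: str) -> List[int]:
    """Return 1-based line numbers of headings whose previous line is not blank."""
    lines = markdown_text.splitlines()
    hits: List[int] = []
    for index, line in enumerate(lines):
        if index > 0 and HEADING.match(line) and lines[index - 1].strip():
            hits.append(index + 1)
    return hits


# Illustrative inputs modeled on the snippet discussed in this thread.
BEFORE = "</div>\n</div>\n### Run SQL on files directly\n"
AFTER = "</div>\n</div>\n\n### Run SQL on files directly\n"

print(headings_missing_blank_line(BEFORE))  # heading on line 3 is flagged
print(headings_missing_blank_line(AFTER))   # nothing flagged
```

Running something like this over `docs/sql-programming-guide.md` would only be a heuristic. Building the docs with Jekyll remains the real check.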
What changes were proposed in this pull request?
Added documentation for loading csv files into Dataframes
How was this patch tested?
/dev/run-tests