
[SPARK-28801][DOC] Document SELECT statement in SQL Reference (Main page) #27216

Closed
wants to merge 3 commits into from

Contributor

@dilipbiswal dilipbiswal commented Jan 15, 2020

What changes were proposed in this pull request?

Document the SELECT statement in the SQL Reference Guide. This PR includes the main
entry page for SELECT. I will open follow-up PRs for the different clauses.

Why are the changes needed?

Currently, Spark lacks documentation on the supported SQL constructs, causing
confusion among users, who sometimes have to look at the code to understand the
usage. This PR is aimed at addressing that issue.

Does this PR introduce any user-facing change?

Yes.

Before:
There was no documentation for this.

After:
Screen Shot 2020-01-19 at 11 20 41 PM
Screen Shot 2020-01-19 at 11 21 55 PM
Screen Shot 2020-01-19 at 11 22 16 PM

How was this patch tested?

Tested using jekyll build --serve


SparkQA commented Jan 15, 2020

Test build #116770 has finished for PR 27216 at commit de627aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

limitations under the License.
---

**This page is under construction**
Member

Is there value in adding these placeholders, vs just adding/linking them when available?

Contributor Author

@srowen Thanks. Actually, that's the approach we followed when we started this work a few months back. I guess we didn't want to have a broken link at any point. Also, it makes it easier for others to contribute without having to rebase?

Member

Yeah, I know, but I'm not sure that was great, as not all of them are filled out. Unless it would be hard to link them everywhere later, I wonder why they need to be here now? I don't see a difference w.r.t. merging. I don't feel strongly, but I think we are just going to end up with more dummy pages that nobody fills out.

Contributor Author

@srowen OK. Let me remove the links and push.


### Related clauses
- [FROM clause](sql-ref-syntax-qry-select-from.html)
- [WHERE clause](sql-ref-syntax-qry-select-where.html)
Contributor Author

@srowen Are you okay with keeping these, or do we want to remove the related clauses section and add to it as we go? Please let me know.

<dd>
Hints can be specified to help spark optimizer make better planning decisions. Currently spark supports hints
that influence selection of join strategies and repartitioning of the data. For a detailed explanation, please
refer to.
Member

refer to where?

Contributor Author

@maropu Thanks a lot for reviewing and catching this. I am going to remove that sentence for now. I wanted to refer to the Hints page from here. I will add it when I have it ready.

Member

ok

</dd>
<dt><code><em>ORDER BY</em></code></dt>
<dd>
Specifies an ordering of the rows of the complete result set of the query. The output rows are ordered
Member

How about writing the default behaviour here (e.g., direction and null order)?

Contributor Author

@maropu I have a page written up for ORDER BY where I have explained the sort direction and null order in detail with examples. On this page, I wanted to just briefly introduce the params. What do you think?

Member

ok

@@ -18,8 +18,132 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
Spark supports `SELECT` statement and conforms to ANSI SQL standard. Queries are
Member

SELECT statement => a SELECT statement?

Member

ANSI SQL standard -> the ANSI SQL standard?

@@ -18,8 +18,132 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
Spark supports `SELECT` statement and conforms to ANSI SQL standard. Queries are
used to retrieve result sets from one or more table. The following section
Member

more table -> more tables?

</dd>
<dt><code><em>boolean_expression</em></code></dt>
<dd>
Specifies a expression with a return type of boolean.
Member

a expression -> an expression

<dt><code><em>LIMIT</em></code></dt>
<dd>
Specifies the maximum number of rows that can be returned by a statement or subquery. This clause
is mostly used in the conjunction with <code>ORDER BY</code> to produce deterministic result.
Member

deterministic result -> a deterministic result?
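
For readers following the LIMIT wording above, here is a minimal sketch of why `LIMIT` is usually paired with `ORDER BY`. It uses Python's built-in `sqlite3` module as a stand-in engine rather than Spark (an assumption, made only so the example runs without a cluster), and the `person` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database used as a stand-in for a Spark session (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Mike", 25), ("Anil", 18), ("Zen", 30), ("Dan", 50)],
)

# Without ORDER BY, which two rows LIMIT returns is up to the engine.
# With ORDER BY, the two returned rows are fully determined by the data.
top_two = conn.execute(
    "SELECT name, age FROM person ORDER BY age DESC LIMIT 2"
).fetchall()
print(top_two)
```

Here the ordering column has no ties, so the two oldest people are returned on every run; with ties, a secondary sort key would be needed for the same guarantee.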

<dt><code><em>SORT BY</em></code></dt>
<dd>
Specifies an ordering by which the rows are ordered within each partition. This parameter is mutually
exclusive with <code>ORDER BY</code>, <code>CLUSTER BY</code> and can not be specified together.
Member

exclusive with <code>ORDER BY</code>, <code>CLUSTER BY</code> and => exclusive with <code>ORDER BY</code> or <code>CLUSTER BY</code>, and?

Contributor Author

@dilipbiswal dilipbiswal Jan 17, 2020

@maropu yeah.. sounds good. perhaps say "and" as opposed to "or" ?

Member

yea, it looks fine to me. (but I'm not a good English writer, so better to follow the others, hahaha)

Contributor Author

@maropu Same here. Hopefully @srowen will keep us honest :-)

@dilipbiswal
Contributor Author

@maropu @srowen I have incorporated the comments. Also, I have removed the "Related Sections" and the links.


SparkQA commented Jan 17, 2020

Test build #116903 has finished for PR 27216 at commit bd4c0ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile changed the title from "[SPARK-28588][DOC] Document SELECT statement in SQL Reference (Main page)" to "[SPARK-28801][DOC] Document SELECT statement in SQL Reference (Main page)" on Jan 19, 2020

Specifies a set of expressions that is used to repartition and sort the rows. Using this clause has
the same effect of using <code>DISTRIBUTE BY</code> and <code>SORT BY</code> together.
</dd>
<dt><code><em>DISTRIBUTE BY</em></code></dt>
Member

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy I think we also need a dedicated file for these clauses [SORT BY, CLUSTER BY and DISTRIBUTE BY].

Member

These three clauses are very special. It is from Hive. Could we have a simple SELECT and then a full SELECT?

Contributor Author

@dilipbiswal dilipbiswal Jan 19, 2020

@gatorsmile The way I have it at the moment is to have a separate link for each of the clauses, each with its own syntax, parameters and examples. So yes, I will have separate links for the clauses you have mentioned. What do you think?

Member

Could you file jiras (sub-tickets) for planned dedicated files having these syntaxes?

Contributor Author

@maropu Sure.

Member

Thanks!
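
To make the relationship discussed in this thread concrete, the sketch below models "CLUSTER BY has the same effect as DISTRIBUTE BY and SORT BY together" in plain Python. It is a conceptual model only, not Spark's implementation: the function names are hypothetical, partitions are ordinary lists, and a modulo on an integer key stands in for whatever partitioning function the engine actually uses.

```python
from collections import defaultdict


def distribute_by(rows, key, num_partitions):
    """Route rows with equal key values to the same partition (no sorting)."""
    partitions = defaultdict(list)
    for row in rows:
        # Modulo is a stand-in for the engine's partitioning function (assumption).
        partitions[key(row) % num_partitions].append(row)
    return [partitions[i] for i in range(num_partitions)]


def sort_by(partitions, key):
    """Sort rows inside each partition; no ordering across partitions."""
    return [sorted(part, key=key) for part in partitions]


def cluster_by(rows, key, num_partitions):
    """Model of CLUSTER BY: repartition on the key, then sort each partition."""
    return sort_by(distribute_by(rows, key, num_partitions), key)


rows = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (2, "bb")]


def first_column(row):
    return row[0]


clustered = cluster_by(rows, first_column, 2)
flattened_keys = [first_column(r) for part in clustered for r in part]
print(clustered)
print(flattened_keys == sorted(flattened_keys))  # False: no total order
```

Each partition is sorted on its own, but reading the partitions back to back does not give a globally sorted result, which is the property that separates `SORT BY` from `ORDER BY` in the text under review.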

Member

maropu commented Jan 20, 2020

Also, can you update the screenshot in the PR description? That looks stale.

<dd>
Specifies the common table expressions (CTEs) before the main <code>SELECT</code> query block.
These table expressions are allowed to be referenced later in the main query. This is useful to abstract
out repeated sub query blocks in the main query and improves readability of the query.
Contributor

I guess it is OK to use either sub query or subquery, but it might be better to pick one and keep consistent.
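
As an illustration of the CTE description quoted above, here is a minimal sketch in which one named table expression is referenced twice by the main query instead of repeating the subquery. It runs on Python's built-in `sqlite3` module as a stand-in for Spark (an assumption), and the `orders` table, the `totals` CTE name, and the data are hypothetical.

```python
import sqlite3

# In-memory database used as a stand-in for a Spark session (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10), ("a", 30), ("b", 5), ("c", 50)],
)

# `totals` is defined once in the WITH clause and referenced twice below:
# once as the main input and once inside the scalar subquery.
query = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM totals
WHERE total > (SELECT AVG(total) FROM totals)
ORDER BY customer
"""
above_average = conn.execute(query).fetchall()
print(above_average)
```

Without the `WITH` clause, the aggregation over `orders` would have to be written out in both places.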

</dd>
<dt><code><em>from_item</em></code></dt>
<dd>
Specifies a source of input for the query. It can be one of the following.
Contributor

Use `:` instead of `.` at the end of the sentence.

<dt><code><em>GROUP BY</em></code></dt>
<dd>
Specifies the expressions that are used to group the rows. This is used in conjunction with aggregate functions
(MIN, MAX, COUNT, SUM, AVG) to group rows bsed on the grouping expressions.
Contributor

bsed typo

<dt><code><em>HAVING</em></code></dt>
<dd>
Specifies the predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to
filter rows after the grouping is performed
Contributor

. in the end of the sentence.
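
The GROUP BY and HAVING descriptions quoted above can be sketched as follows: `WHERE` filters individual rows before grouping, while `HAVING` filters the groups produced by `GROUP BY`. The example uses Python's built-in `sqlite3` module as a stand-in for Spark (an assumption), with a hypothetical `dealer` table.

```python
import sqlite3

# In-memory database used as a stand-in for a Spark session (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dealer (city TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO dealer VALUES (?, ?)",
    [
        ("Fremont", 10),
        ("Fremont", 5),
        ("Dublin", 20),
        ("San Jose", 3),
        ("San Jose", 4),
    ],
)

# WHERE drops single rows (quantity 3) before the groups are formed;
# HAVING then drops whole groups whose aggregate does not qualify.
query = """
SELECT city, SUM(quantity) AS total
FROM dealer
WHERE quantity > 3
GROUP BY city
HAVING SUM(quantity) > 10
ORDER BY city
"""
big_cities = conn.execute(query).fetchall()
print(big_cities)
```

San Jose survives the `WHERE` filter with one row but is removed by `HAVING`, because its group total is only 4.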

along with usage examples when applicable.
### Syntax
{% highlight sql %}
[WITH with_query [, ...]]
Contributor

I guess always have a space between symbol and text, and between symbol and symbol?

[ WITH with_query [ , ... ] ] 


SparkQA commented Jan 20, 2020

Test build #117091 has finished for PR 27216 at commit 7a643d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

</dd>
<dt><code><em>named_expression</em></code></dt>
<dd>
A expression with an assigned name. In general, it denotes a column expression.<br><br>
Contributor

minor: An expression


SparkQA commented Jan 20, 2020

Test build #117131 has finished for PR 27216 at commit 54dd2da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

srowen commented Jan 23, 2020

Merged to master

@srowen srowen closed this in 38f4e59 Jan 23, 2020
8 participants