Skip to content

Commit

Permalink
re-format
Browse files Browse the repository at this point in the history
  • Loading branch information
huaxingao committed Apr 10, 2020
1 parent 19c87ee commit 36cbcb7
Showing 1 changed file with 120 additions and 102 deletions.
222 changes: 120 additions & 102 deletions docs/sql-ref-functions-builtin-window.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,36 +20,41 @@ license: |
---

Similarly to aggregate functions, window functions operate on a group of rows. However, unlike aggregate functions, window functions perform aggregation without reducing, calculating a return value for each row in the group. Window functions are useful for processing tasks such as calculating a moving average, computing a cumulative, or accessing the value of rows given the relative position of the current row. Spark SQL supports three types of window functions:
* Ranking Functions
* Analytic Functions
* Aggregate Functions
* Ranking Functions
* Analytic Functions
* Aggregate Functions

### Types of Window Functions

#### Ranking Functions

<table class="table">
<thead>
<tr><th style="width:25%">Function</th><th>Description</th></tr>
<tr><th style="width:25%">Function</th><th>Argument Type(s)</th><th>Description</th></tr>
</thead>
<tr>
<td><b> rank </b></td>
<td>Returns the rank of rows within a window partition. This is equivalent to the RANK function in SQL. </td>
<td width="20%"><b> rank </b></td>
<td width="20%"> none </td>
<td width="60%">Returns the rank of rows within a window partition. This is equivalent to the RANK function in SQL. </td>
</tr>
<tr>
<td><b> dense_rank </b></td>
<td> none </td>
<td>Returns the rank of rows within a window partition, without any gaps. This is equivalent to the DENSE_RANK function in SQL.</td>
</tr>
<tr>
<td><b> percent_rank </b></td>
<td> none </td>
<td>Returns the relative rank (i.e. percentile) of rows within a window partition. This is equivalent to the PERCENT_RANK function in SQL.</td>
</tr>
<tr>
<td><b> ntile </b></td>
<td><b> ntile </b>(<i>n</i>)</td>
<td> Int </td>
<td>Returns the ntile group id (from 1 to `n` inclusive) in an ordered window partition. This is equivalent to the NTILE function in SQL.</td>
</tr>
<tr>
<td><b> row_number </b></td>
<td> none </td>
<td>Returns a sequential number starting from 1 within a window partition. This is equivalent to the ROWNUMBER function in SQL.</td>
</tr>
</table>
Expand All @@ -58,19 +63,32 @@ Similarly to aggregate functions, window functions operate on a group of rows. H

<table class="table">
<thead>
<tr><th style="width:25%">Function</th><th>Description</th></tr>
<tr><th style="width:25%">Function</th><th>Argument Type(s)</th><th>Description</th></tr>
</thead>
<tr>
<td><b> cume_dist </b></td>
<td>Returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row. This is equivalent to the CUMEDIST function in SQL. </td>
<td width="20%"><b> cume_dist </b></td>
<td width="20%"> none </td>
<td width="60%">Returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row. This is equivalent to the CUMEDIST function in SQL. </td>
</tr>
<tr>
<td><b> lag </b></td>
<td>Returns the value that is <code>offset</code> rows before the current row, and <code>null</code> if there are less than <code>offset</code> rows before the current row.</td>
<td><b> lag </b>( <i>expression, offset [ , defaultValue ]</i> )</td>
<td> Column, Int [ , Any ]</td>
<td>Returns the value that is <code>offset</code> rows before the current row, and <code>null</code> or <code> defaultValue</code> if there are less than <code>offset</code> rows before the current row.</td>
</tr>
<tr>
<td><b> lag </b>( <i>columnName, offset [ , defaultValue ]</i> )</td>
<td> String, Int [ , Any ]</td>
<td>Returns the value that is <code>offset</code> rows before the current row, and <code>null</code> or <code> defaultValue</code> if there are less than <code>offset</code> rows before the current row.</td>
</tr>
<tr>
<td><b> lead </b>( <i>expression, offset [ , defaultValue ]</i> )</td>
<td> Column, Int [ , Any ] </td>
<td>Returns the value that is <code>offset</code> rows after the current row, and <code>null</code> or <code> defaultValue</code> if there are less than <code>offset</code> rows after the current row. This is equivalent to the LEAD function in SQL.</td>
</tr>
<tr>
<td><b> lead </b></td>
<td>Returns the value that is <code>offset</code> rows after the current row, and <code>null</code> if there are less than <code>offset</code> rows after the current row. This is equivalent to the LEAD function in SQL.</td>
<td><b> lead </b>( <i>columnName, offset [ , defaultValue ]</i> )</td>
<td> String, Int [ , Any ] </td>
<td>Returns the value that is <code>offset</code> rows after the current row, and <code>null</code> or <code> defaultValue</code> if there are less than <code>offset</code> rows after the current row. This is equivalent to the LEAD function in SQL.</td>
</tr>
</table>

Expand All @@ -81,18 +99,18 @@ Any of the aggregate functions can be used as a window function. Please refer to
### How to Use Window Functions

* Mark a function as window function by using `over`.
- SQL: Add an OVER clause after the window function, e.g. avg (...) OVER (...);
- DataFrame API: Call the window function's `over` method, e.g. rank().over(...)
- SQL: Add an OVER clause after the window function, e.g. avg ( ... ) OVER ( ... );
- DataFrame API: Call the window function's `over` method, e.g. rank ( ).over ( ... )
* Define the window specification associated with this function. A window specification includes partitioning specification, ordering specification, and frame specification.
- Partitioning Specification:
- SQL: PARTITION BY
- DataFrame API: Window.partitionBy (...)
- DataFrame API: Window.partitionBy ( ... )
- Ordering Specification:
- SQL: Order BY
- DataFrame API: Window.orderBy (...)
- DataFrame API: Window.orderBy ( ... )
- Frame Specification:
- SQL: ROWS (for ROW frame), RANGE (for RANGE frame)
- DataFrame API: WindowSpec.rowsBetween (for ROW frame), WindowSpec.rangeBetween (for RANGE frame)
- SQL: ROWS ( for ROW frame ), RANGE ( for RANGE frame )
- DataFrame API: WindowSpec.rowsBetween ( for ROW frame ), WindowSpec.rangeBetween ( for RANGE frame )

### Examples

Expand All @@ -113,103 +131,103 @@ Any of the aggregate functions can be used as a window function. Please refer to
)
val df = data.toDF("name", "dept", "salary")
df.show()
// +-----+-----------+------+
// | name| dept|salary|
// +-----+-----------+------+
// | Lisa| Sales| 10000|
// | Evan| Sales| 32000|
// | Fred|Engineering| 21000|
// |Helen| Marketing| 29000|
// | Alex| Sales| 30000|
// | Tom|Engineering| 23000|
// | Jane| Marketing| 29000|
// | Jeff| Marketing| 35000|
// | Paul|Engineering| 29000|
// |Chloe|Engineering| 23000|
// +-----+-----------+------+
+-----+-----------+------+
| name| dept|salary|
+-----+-----------+------+
| Lisa| Sales| 10000|
| Evan| Sales| 32000|
| Fred|Engineering| 21000|
|Helen| Marketing| 29000|
| Alex| Sales| 30000|
| Tom|Engineering| 23000|
| Jane| Marketing| 29000|
| Jeff| Marketing| 35000|
| Paul|Engineering| 29000|
|Chloe|Engineering| 23000|
+-----+-----------+------+

val windowSpec = Window.partitionBy("dept").orderBy("salary")
windowSpec.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Using Ranking Functions
df.withColumn("rank", rank().over(windowSpec)).show()
// +-----+-----------+------+----+
// | name| dept|salary|rank|
// +-----+-----------+------+----+
// |Helen| Marketing| 29000| 1|
// | Jane| Marketing| 29000| 1|
// | Jeff| Marketing| 35000| 3|
// | Fred|Engineering| 21000| 1|
// | Tom|Engineering| 23000| 2|
// |Chloe|Engineering| 23000| 2|
// | Paul|Engineering| 29000| 4|
// | Lisa| Sales| 10000| 1|
// | Alex| Sales| 30000| 2|
// | Evan| Sales| 32000| 3|
// +-----+-----------+------+----+
+-----+-----------+------+----+
| name| dept|salary|rank|
+-----+-----------+------+----+
|Helen| Marketing| 29000| 1|
| Jane| Marketing| 29000| 1|
| Jeff| Marketing| 35000| 3|
| Fred|Engineering| 21000| 1|
| Tom|Engineering| 23000| 2|
|Chloe|Engineering| 23000| 2|
| Paul|Engineering| 29000| 4|
| Lisa| Sales| 10000| 1|
| Alex| Sales| 30000| 2|
| Evan| Sales| 32000| 3|
+-----+-----------+------+----+

df.withColumn("dense_rank", dense_rank().over(windowSpec)).show()
// +-----+-----------+------+----------+
// | name| dept|salary|dense_rank|
// +-----+-----------+------+----------+
// |Helen| Marketing| 29000| 1|
// | Jane| Marketing| 29000| 1|
// | Jeff| Marketing| 35000| 2|
// | Fred|Engineering| 21000| 1|
// | Tom|Engineering| 23000| 2|
// |Chloe|Engineering| 23000| 2|
// | Paul|Engineering| 29000| 3|
// | Lisa| Sales| 10000| 1|
// | Alex| Sales| 30000| 2|
// | Evan| Sales| 32000| 3|
// +-----+-----------+------+----------+
+-----+-----------+------+----------+
| name| dept|salary|dense_rank|
+-----+-----------+------+----------+
|Helen| Marketing| 29000| 1|
| Jane| Marketing| 29000| 1|
| Jeff| Marketing| 35000| 2|
| Fred|Engineering| 21000| 1|
| Tom|Engineering| 23000| 2|
|Chloe|Engineering| 23000| 2|
| Paul|Engineering| 29000| 3|
| Lisa| Sales| 10000| 1|
| Alex| Sales| 30000| 2|
| Evan| Sales| 32000| 3|
+-----+-----------+------+----------+

// Using Analytic Functions
df.withColumn("cume_dist", cume_dist().over(windowSpec)).show()
// +-----+-----------+------+------------------+
// | name| dept|salary| cume_dist|
// +-----+-----------+------+------------------+
// |Helen| Marketing| 29000|0.6666666666666666|
// | Jane| Marketing| 29000|0.6666666666666666|
// | Jeff| Marketing| 35000| 1.0|
// | Fred|Engineering| 21000| 0.25|
// | Tom|Engineering| 23000| 0.75|
// |Chloe|Engineering| 23000| 0.75|
// | Paul|Engineering| 29000| 1.0|
// | Lisa| Sales| 10000|0.3333333333333333|
// | Alex| Sales| 30000|0.6666666666666666|
// | Evan| Sales| 32000| 1.0|
// +-----+-----------+------+------------------+
+-----+-----------+------+------------------+
| name| dept|salary| cume_dist|
+-----+-----------+------+------------------+
|Helen| Marketing| 29000|0.6666666666666666|
| Jane| Marketing| 29000|0.6666666666666666|
| Jeff| Marketing| 35000| 1.0|
| Fred|Engineering| 21000| 0.25|
| Tom|Engineering| 23000| 0.75|
|Chloe|Engineering| 23000| 0.75|
| Paul|Engineering| 29000| 1.0|
| Lisa| Sales| 10000|0.3333333333333333|
| Alex| Sales| 30000|0.6666666666666666|
| Evan| Sales| 32000| 1.0|
+-----+-----------+------+------------------+

df.withColumn("lag", lag("salary", 2).over(windowSpec)).show()
// +-----+-----------+------+-----+
// |Helen| Marketing| 29000| null|
// | Jane| Marketing| 29000| null|
// | Jeff| Marketing| 35000|29000|
// | Fred|Engineering| 21000| null|
// | Tom|Engineering| 23000| null|
// |Chloe|Engineering| 23000|21000|
// | Paul|Engineering| 29000|23000|
// | Lisa| Sales| 10000| null|
// | Alex| Sales| 30000| null|
// | Evan| Sales| 32000|10000|
// +-----+-----------+------+-----+
+-----+-----------+------+-----+
|Helen| Marketing| 29000| null|
| Jane| Marketing| 29000| null|
| Jeff| Marketing| 35000|29000|
| Fred|Engineering| 21000| null|
| Tom|Engineering| 23000| null|
|Chloe|Engineering| 23000|21000|
| Paul|Engineering| 29000|23000|
| Lisa| Sales| 10000| null|
| Alex| Sales| 30000| null|
| Evan| Sales| 32000|10000|
+-----+-----------+------+-----+

// Using Aggregate Functions
df.withColumn("min", min(col("salary")).over(windowSpec)).show()
// +-----+-----------+------+-----+
// | name| dept|salary| min|
// +-----+-----------+------+-----+
// |Helen| Marketing| 29000|29000|
// | Jane| Marketing| 29000|29000|
// | Jeff| Marketing| 35000|29000|
// | Fred|Engineering| 21000|21000|
// | Tom|Engineering| 23000|21000|
// |Chloe|Engineering| 23000|21000|
// | Paul|Engineering| 29000|21000|
// | Lisa| Sales| 10000|10000|
// | Alex| Sales| 30000|10000|
// | Evan| Sales| 32000|10000|
// +-----+-----------+------+-----+
+-----+-----------+------+-----+
| name| dept|salary| min|
+-----+-----------+------+-----+
|Helen| Marketing| 29000|29000|
| Jane| Marketing| 29000|29000|
| Jeff| Marketing| 35000|29000|
| Fred|Engineering| 21000|21000|
| Tom|Engineering| 23000|21000|
|Chloe|Engineering| 23000|21000|
| Paul|Engineering| 29000|21000|
| Lisa| Sales| 10000|10000|
| Alex| Sales| 30000|10000|
| Evan| Sales| 32000|10000|
+-----+-----------+------+-----+

{% endhighlight %}

0 comments on commit 36cbcb7

Please sign in to comment.