## Additional Imports

In [1]:
from datetime import datetime

from pyspark.sql.window import Window
from pyspark.sql import functions as F
from starscream.utils.time_utils import parse

## Initial Dataset

In [2]:
df = sc.sql.createDataFrame(
    [
        (datetime(2017, 1, 11, 1, 21), 'http://www.anotherstore.com/page1', '123abc', 0), 
        (datetime(2017, 1, 11, 1, 25), 'http://www.anotherstore.com/page2', '123abc', 1), 
        (datetime(2017, 1, 11, 1, 36), 'http://www.anotherstore.com/checkout', '123abc', 2), 
    ], ['viewed_at', 'url', 'session_token', 'is_foo']
)

df.toPandas()

| | viewed_at | url | session_token | is_foo |
|---|---|---|---|---|
| 0 | 2017-01-11 01:21:00 | http://www.anotherstore.com/page1 | 123abc | 0 |
| 1 | 2017-01-11 01:25:00 | http://www.anotherstore.com/page2 | 123abc | 1 |
| 2 | 2017-01-11 01:36:00 | http://www.anotherstore.com/checkout | 123abc | 2 |


## Window Functions

### Scenario #1

No `orderBy` specified for `window` object.

In [3]:
window_1 = Window\
    .partitionBy('session_token')

df_1 = df.withColumn('is_atleast_one_foo', (F.sum(F.col('is_foo')).over(window_1)))

df_1.toPandas()

| | viewed_at | url | session_token | is_foo | is_atleast_one_foo |
|---|---|---|---|---|---|
| 0 | 2017-01-11 01:36:00 | http://www.anotherstore.com/checkout | 123abc | 2 | 3 |
| 1 | 2017-01-11 01:25:00 | http://www.anotherstore.com/page2 | 123abc | 1 | 3 |
| 2 | 2017-01-11 01:21:00 | http://www.anotherstore.com/page1 | 123abc | 0 | 3 |


### Scenario #2

`orderBy` with no `rowsBetween` specified for `window` object.

In [4]:
window_2 = Window\
    .partitionBy('session_token')\
    .orderBy(F.col('viewed_at'))

df_2 = df.withColumn('is_atleast_one_foo', (F.sum(F.col('is_foo')).over(window_2)))

df_2.toPandas()

| | viewed_at | url | session_token | is_foo | is_atleast_one_foo |
|---|---|---|---|---|---|
| 0 | 2017-01-11 01:21:00 | http://www.anotherstore.com/page1 | 123abc | 0 | 0 |
| 1 | 2017-01-11 01:25:00 | http://www.anotherstore.com/page2 | 123abc | 1 | 1 |
| 2 | 2017-01-11 01:36:00 | http://www.anotherstore.com/checkout | 123abc | 2 | 3 |


### Scenario #3

`orderBy` with a `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` specified for `window` object.

In [5]:
window_3 = Window\
    .partitionBy('session_token')\
    .orderBy(F.col('viewed_at'))\
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df_3 = df.withColumn('is_atleast_one_foo', (F.sum(F.col("is_foo")).over(window_3)))

df_3.toPandas()

| | viewed_at | url | session_token | is_foo | is_atleast_one_foo |
|---|---|---|---|---|---|
| 0 | 2017-01-11 01:21:00 | http://www.anotherstore.com/page1 | 123abc | 0 | 3 |
| 1 | 2017-01-11 01:25:00 | http://www.anotherstore.com/page2 | 123abc | 1 | 3 |
| 2 | 2017-01-11 01:36:00 | http://www.anotherstore.com/checkout | 123abc | 2 | 3 |


## Why is This?

In [6]:
df_1.explain()

== Physical Plan ==
Window [sum(is_foo#3L) windowspecdefinition(session_token#2, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS is_atleast_one_foo#10L], [session_token#2]
+- *Sort [session_token#2 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(session_token#2, 200)
      +- Scan ExistingRDD[viewed_at#0,url#1,session_token#2,is_foo#3L]


In [7]:
df_2.explain()

== Physical Plan ==
Window [sum(is_foo#3L) windowspecdefinition(session_token#2, viewed_at#0 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS is_atleast_one_foo#18L], [session_token#2], [viewed_at#0 ASC NULLS FIRST]
+- *Sort [session_token#2 ASC NULLS FIRST, viewed_at#0 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(session_token#2, 200)
      +- Scan ExistingRDD[viewed_at#0,url#1,session_token#2,is_foo#3L]


In [8]:
df_3.explain()

== Physical Plan ==
Window [sum(is_foo#3L) windowspecdefinition(session_token#2, viewed_at#0 ASC NULLS FIRST, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS is_atleast_one_foo#26L], [session_token#2], [viewed_at#0 ASC NULLS FIRST]
+- *Sort [session_token#2 ASC NULLS FIRST, viewed_at#0 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(session_token#2, 200)
      +- Scan ExistingRDD[viewed_at#0,url#1,session_token#2,is_foo#3L]


### TL;DR

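Looking at the **Physical Plan**, a `Window.partitionBy('col_1').orderBy('col_2')` with no explicit frame defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, i.e. `.rangeBetween(Window.unboundedPreceding, Window.currentRow)`. Note that this is a range frame, not a `.rowsBetween()` frame. With no `orderBy` at all, the default is `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, which covers the whole partition.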

Looking at the Scala source, we can see that this is indeed the default and intended behavior: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala#L36-L38

```scala
 * @note When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding,
 *       unboundedFollowing) is used by default. When ordering is defined, a growing window frame
 *       (rangeFrame, unboundedPreceding, currentRow) is used by default.
```
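To make the two defaults in that note concrete, here is a minimal plain-Python emulation of a windowed sum over a single partition. It is a sketch, not Spark code: `window_sum`, its parameters, and the frame names are hypothetical and exist only for this illustration. It reproduces the three scenarios on the sample session and also shows how a range frame differs from a rows frame when the ordering column has ties.

```python
from datetime import datetime


def window_sum(rows, value_idx, order_idx=None, frame=None):
    """Emulate SUM(value) OVER (...) for ONE partition (hypothetical helper).

    frame is one of:
      None             -> use the default described in the Scala note
      "rows_unbounded" -> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      "range_running"  -> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
      "rows_running"   -> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    """
    if frame is None:
        # No ordering: whole partition. Ordering: growing range frame.
        frame = "rows_unbounded" if order_idx is None else "range_running"
    if frame == "range_running" and order_idx is None:
        raise ValueError("a range frame needs an ordering column")
    if order_idx is not None:
        rows = sorted(rows, key=lambda r: r[order_idx])

    sums = []
    for i, row in enumerate(rows):
        if frame == "rows_unbounded":
            in_frame = rows
        elif frame == "rows_running":
            in_frame = rows[: i + 1]
        elif frame == "range_running":
            # Every row whose ordering value is <= the current one, ties included.
            in_frame = [r for r in rows if r[order_idx] <= row[order_idx]]
        else:
            raise ValueError("unknown frame: %r" % frame)
        sums.append(sum(r[value_idx] for r in in_frame))
    return sums


# (viewed_at, is_foo) for the single session in the sample dataset.
session = [
    (datetime(2017, 1, 11, 1, 21), 0),
    (datetime(2017, 1, 11, 1, 25), 1),
    (datetime(2017, 1, 11, 1, 36), 2),
]

scenario_1 = window_sum(session, value_idx=1)               # no orderBy
scenario_2 = window_sum(session, value_idx=1, order_idx=0)  # orderBy, default frame
scenario_3 = window_sum(session, value_idx=1, order_idx=0, frame="rows_unbounded")

# Ties on the ordering column: (order_key, value).
tied = [(1, 10), (1, 20), (2, 30)]
range_running = window_sum(tied, value_idx=1, order_idx=0, frame="range_running")
rows_running = window_sum(tied, value_idx=1, order_idx=0, frame="rows_running")

print(scenario_1, scenario_2, scenario_3)
print(range_running, rows_running)
```

In the tied example the range frame gives both rows with key `1` the same sum, because each sees the other as part of its frame. The rows frame gives them different sums that depend on the order in which the tied rows happen to be processed; this emulation uses Python's stable sort, so that order is fixed here.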

**Problem:**
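This will cause problems if you care about all the rows in the partition. Once an `orderBy` is present, each row's aggregate covers only the rows up to and including it (plus any rows tied with it on the ordering column), so only the last row sees the partition-wide total, as in Scenario #2. When the aggregate should span the whole partition, specify the frame explicitly, as in Scenario #3.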
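As a rough illustration of how this can bite, assume the `is_atleast_one_foo` column is later turned into a boolean flag meaning "this session has at least one foo". The sketch below is plain Python, not Spark, and its variable names are hypothetical. `itertools.accumulate` stands in for the default growing frame, which is only equivalent here because the three `viewed_at` values are distinct.

```python
from itertools import accumulate

# is_foo values of the sample session, already ordered by viewed_at.
is_foo = [0, 1, 2]

# Default frame once orderBy is present: a running sum per row.
running_sums = list(accumulate(is_foo))

# Explicit unbounded frame: every row sees the partition-wide sum.
full_sums = [sum(is_foo)] * len(is_foo)

# Derive the "session has at least one foo" flag from each variant.
flag_from_running = [total > 0 for total in running_sums]
flag_from_full = [total > 0 for total in full_sums]

print(flag_from_running)
print(flag_from_full)
```

With the running sums, the first page view is flagged as belonging to a session with no foo, even though later rows in the same session do have one. With the full-partition sums, every row of the session gets the same flag.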
