There are use cases where we would like to get the `first` or `last` of something within a `group` or particular `grain`.

It is natural to do something in SQL like:

```sql
select 
    col_1,
    first(col_2) as first_something,
    last(col_2) as last_something
from table
group by 1
order by 1
```

This leads us to write Spark code like `df.orderBy().groupBy().agg()`. Spark does not guarantee that the ordering survives the `groupBy()`, so the result can differ from run to run.

### Library Imports

In [1]:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window

Create a `SparkSession`. No need to create `SparkContext` as you automatically get it as part of the `SparkSession`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

### Initial Datasets

In [58]:
data = [
    (10, datetime(2018, 1, 1), 20),
    (10, datetime(2018, 2, 1), 45)
]

df = spark.createDataFrame(data, ['id', 'date', 'value'])
df.toPandas()

| | id | date | value |
|---|---|---|---|
| 0 | 10 | 2018-01-01 | 20 |
| 1 | 10 | 2018-02-01 | 45 |


## Situation 1: Order not Maintained within a `GroupBy`

### Option 1: Wrong Way

#### Result 1

In [56]:
df_1 = df \
    .orderBy('date') \
    .groupBy('id') \
    .agg(F.first('value'))

df_1.toPandas()

| | id | first(value, false) |
|---|---|---|
| 0 | 10 | 20 |


#### Result 2

In [57]:
df_2 = df \
    .orderBy('date') \
    .groupBy('id') \
    .agg(F.first('value'))

df_2.toPandas()

| | id | first(value, false) |
|---|---|---|
| 0 | 10 | 20 |


### Option 2: Window Object, Right Way

In [18]:
window = Window.partitionBy('id').orderBy('date')

df_3 = df \
  .withColumn('c5', F.first('value').over(window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))) \
  .withColumn('rn', F.row_number().over(window.rowsBetween(Window.unboundedPreceding, Window.currentRow)))

df_3.toPandas()

| | id | date | value | c5 | rn |
|---|---|---|---|---|---|
| 0 | 10 | 2018-01-01 | 20 | 20 | 1 |
| 1 | 10 | 2018-02-01 | 45 | 20 | 2 |


# TL;DR

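OK, so my example didn't reproduce the problem locally (most likely because two rows on a single local partition always come back in the same order), but trust me: `orderBy()` in a chain like `orderBy().groupBy()` is not guaranteed to maintain its order!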

reference: https://stackoverflow.com/a/50012355

For any aggregation that depends on ordering (i.e. `first`, `last`, etc.), we should avoid `groupBy()` and use a `Window` object instead.

## Situation 2: High Frequency Data

### Initial Datasets

In [62]:
data = [
    (2, 1, datetime(2018, 1, 1, 1 ,1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1 ,1, 1), 20),
]

df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])
df.toPandas()

| | id | item_id | date | value |
|---|---|---|---|---|
| 0 | 2 | 1 | 2018-01-01 01:01:01 | 45 |
| 1 | 1 | 1 | 2018-01-01 01:01:01 | 20 |


### Option 1: Only ordering by date column

#### Window Object

In [71]:
window = Window \
    .partitionBy('item_id') \
    .orderBy('date') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [73]:
data = [
    (1, 1, datetime(2018, 1, 1, 1 ,1, 1), 20),
    (2, 1, datetime(2018, 1, 1, 1 ,1, 1), 45),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

| | id | item_id | date | value | first_value |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2018-01-01 01:01:01 | 20 | 20 |
| 1 | 2 | 1 | 2018-01-01 01:01:01 | 45 | 20 |


In [74]:
data = [
    (2, 1, datetime(2018, 1, 1, 1 ,1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1 ,1, 1), 20),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

| | id | item_id | date | value | first_value |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 2018-01-01 01:01:01 | 45 | 45 |
| 1 | 1 | 1 | 2018-01-01 01:01:01 | 20 | 45 |


**What Happened:**
* By changing the order of the input rows (which happens naturally with larger amounts of data stored on different partitions), we got a different "first" value.
* Timestamps only have a finite resolution (in this data, one second). If rows arrive faster than that, they tie on the `date` column and ordering by it alone is ambiguous.
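The tie can be illustrated without Spark at all. The plain-Python sketch below is only an analogy: Python's `sorted` is stable, so tied rows keep their input order, whereas Spark makes no such promise. The helper `first_value` and the key functions are hypothetical names introduced for illustration.

```python
from datetime import datetime

ts = datetime(2018, 1, 1, 1, 1, 1)  # both rows share this timestamp

rows_a = [
    {"id": 1, "date": ts, "value": 20},
    {"id": 2, "date": ts, "value": 45},
]
rows_b = list(reversed(rows_a))  # same data, different arrival order


def first_value(rows, key):
    """Return the value of the first row after sorting by the given key."""
    return sorted(rows, key=key)[0]["value"]


def by_date(row):
    return row["date"]


def by_date_then_id(row):
    return (row["date"], row["id"])


# Sorting by date alone: the tie is resolved by input order.
ambiguous = (first_value(rows_a, by_date), first_value(rows_b, by_date))

# Adding id as a tiebreaker makes the result independent of input order.
stable = (first_value(rows_a, by_date_then_id), first_value(rows_b, by_date_then_id))

print(ambiguous)  # (20, 45)
print(stable)     # (20, 20)
```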

### Option 2: Order by `date` and `id` Column

#### Window Object

In [80]:
window = Window \
    .partitionBy('item_id') \
    .orderBy('date', 'id') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [81]:
data = [
    (1, 1, datetime(2018, 1, 1, 1 ,1, 1), 20),
    (2, 1, datetime(2018, 1, 1, 1 ,1, 1), 45),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

| | id | item_id | date | value | first_value |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2018-01-01 01:01:01 | 20 | 20 |
| 1 | 2 | 1 | 2018-01-01 01:01:01 | 45 | 20 |


In [82]:
data = [
    (2, 1, datetime(2018, 1, 1, 1 ,1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1 ,1, 1), 20),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

| | id | item_id | date | value | first_value |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 2018-01-01 01:01:01 | 20 | 20 |
| 1 | 2 | 1 | 2018-01-01 01:01:01 | 45 | 20 |


**What Happened:**
* We get the same "first" value in both cases, regardless of input row order, which is what we expect.

# TL;DR

In databases, the `id` (primary key) column of a table is usually monotonically increasing. Therefore, if we are dealing with frequently arriving data, we can sort by `id` in addition to the `date` column to break ties.