When dealing with big data, some datasets will have a much higher frequency of "events" than others.

Consider a table that tracks each pageview. It's not uncommon for two people to visit a site at the same moment, especially on a very popular site such as Google.

I will illustrate how you can deal with these types of events when you need to order by time.

### Library Imports

In [1]:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F, Window

Create a `SparkSession`. No need to create `SparkContext` as you automatically get it as part of the `SparkSession`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Joins") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

### Initial Datasets

In [3]:
data = [
    (2, 1, datetime(2018, 1, 1, 1, 1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1, 1, 1), 20),
]

df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])
df.toPandas()

Unnamed: 0,id,item_id,date,value
0,2,1,2018-01-01 01:01:01,45
1,1,1,2018-01-01 01:01:01,20


### Option 1: Only ordering by date column

### Window Object

In [4]:
window = Window \
    .partitionBy('item_id') \
    .orderBy('date') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [5]:
data = [
    (1, 1, datetime(2018, 1, 1, 1, 1, 1), 20),
    (2, 1, datetime(2018, 1, 1, 1, 1, 1), 45),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

Unnamed: 0,id,item_id,date,value,first_value
0,1,1,2018-01-01 01:01:01,20,20
1,2,1,2018-01-01 01:01:01,45,20


In [6]:
data = [
    (2, 1, datetime(2018, 1, 1, 1, 1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1, 1, 1), 20),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

Unnamed: 0,id,item_id,date,value,first_value
0,2,1,2018-01-01 01:01:01,45,45
1,1,1,2018-01-01 01:01:01,20,45


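**What Happened:**
* By changing the order of the input rows (which happens naturally with larger amounts of data stored on different partitions), we got a different "first" value.
* Both rows share the same `date`, so ordering by `date` alone leaves them tied, and Spark does not guarantee any particular order among tied rows. The timestamps here are recorded to the second, but any timestamp has a finite resolution: whenever data arrives faster than that resolution, ordering by the date column alone is ambiguous.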
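The same effect can be reproduced without Spark. The following is a minimal plain-Python sketch, not the notebook's own method: the tuples mirror the `(id, item_id, date, value)` columns above, and the helper names `first_value_by_date` and `tied_keys` are hypothetical. Python's `sorted` is stable, so tied rows keep their input order; Spark makes no such promise, but in both cases the "first" row ends up being decided by something other than the sort key.

```python
from collections import Counter
from datetime import datetime

# Rows mirror the notebook's columns: (id, item_id, date, value).
ts = datetime(2018, 1, 1, 1, 1, 1)
rows_a = [(1, 1, ts, 20), (2, 1, ts, 45)]
rows_b = [(2, 1, ts, 45), (1, 1, ts, 20)]  # same rows, different input order


def first_value_by_date(rows):
    """Sort by date only and return the value of the first row."""
    return sorted(rows, key=lambda r: r[2])[0][3]


def tied_keys(rows):
    """Return each (item_id, date) pair shared by more than one row."""
    counts = Counter((r[1], r[2]) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


print(first_value_by_date(rows_a))  # 20
print(first_value_by_date(rows_b))  # 45
print(tied_keys(rows_a))            # one (item_id, date) pair with count 2
```

A non-empty result from `tied_keys` is the warning sign: at least two rows in the same partition cannot be told apart by `date`.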

### Option 2: Order by `date` and `id` Columns

### Window Object

In [7]:
window = Window \
    .partitionBy('item_id') \
    .orderBy('date', 'id') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [8]:
data = [
    (1, 1, datetime(2018, 1, 1, 1, 1, 1), 20),
    (2, 1, datetime(2018, 1, 1, 1, 1, 1), 45),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

Unnamed: 0,id,item_id,date,value,first_value
0,1,1,2018-01-01 01:01:01,20,20
1,2,1,2018-01-01 01:01:01,45,20


In [9]:
data = [
    (2, 1, datetime(2018, 1, 1, 1, 1, 1), 45),
    (1, 1, datetime(2018, 1, 1, 1, 1, 1), 20),
]
df = spark.createDataFrame(data, ['id', 'item_id', 'date', 'value'])

df.withColumn("first_value", F.first("value").over(window)).toPandas()

Unnamed: 0,id,item_id,date,value,first_value
0,1,1,2018-01-01 01:01:01,20,20
1,2,1,2018-01-01 01:01:01,45,20


**What Happened**:
* We get the same "first" value in both cases, which is what we expect.

# TL;DR

In databases, the `id` (primary key) column of a table is usually monotonically increasing. Therefore, if we are dealing with frequently arriving data, we can sort by `id` in addition to the `date` column to break ties deterministically.
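The tie-breaker only helps if it is unique among the tied rows. As a rough sanity check, the plain-Python sketch below tests whether a sort key distinguishes every row; the helper names `is_total_order`, `by_date` and `by_date_then_id` are hypothetical, and the tuples again mirror the `(id, item_id, date, value)` columns used above. In Spark, a comparable check could be a `groupBy` on the partition and ordering columns followed by `count()`.

```python
from datetime import datetime


def is_total_order(rows, key):
    """True if no two rows share the same sort key, i.e. there are no ties."""
    keys = [key(r) for r in rows]
    return len(keys) == len(set(keys))


def first_value(rows, key):
    """Sort by the given key and return the value of the first row."""
    return sorted(rows, key=key)[0][3]


def by_date(row):
    # (item_id, date): the partition column plus the date ordering.
    return (row[1], row[2])


def by_date_then_id(row):
    # (item_id, date, id): id acts as the tie-breaker.
    return (row[1], row[2], row[0])


ts = datetime(2018, 1, 1, 1, 1, 1)
rows = [(2, 1, ts, 45), (1, 1, ts, 20)]
reversed_rows = rows[::-1]

print(is_total_order(rows, by_date))          # False
print(is_total_order(rows, by_date_then_id))  # True
print(first_value(rows, by_date_then_id))           # 20
print(first_value(reversed_rows, by_date_then_id))  # 20
```

If `is_total_order` still returns `False` with `id` included, the `id` column is not unique within the tied rows and a different tie-breaker is needed.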