
# Events-Based Features

To minimize the risk of conflating different ways to derive features, we are going to define the following categories of features:
 - **Point-in-Time Features**: A type of data join that ensures that features are derived using the most recent observation available at or before a given timestamp. Commonly, results are pivoted so that every lab has its own column.
 - **Sliding Window Features**: A way to aggregate events that happened within a specific rolling window (e.g., last 7 days, last 30 minutes). Typically these are numeric aggregates that return a scalar, such as `mean`, `min`, or `max`.
 - **Events-Based Features**: Features are computed based on occurrences (one or many) of specific events before the observation point.
 - **Cohort-Based Features**: Features are generated based on historical groupings within a fixed observation window. The difference between Events-Based and Cohort-Based features is that in the cohort-based case the observation timestamp is the same for every patient, whereas in the events-based case each patient has their own event time.

In this notebook, we'll explore writing a convenience function, `events_based_lab_features`, to retrieve **Events-Based Features** from the lab data, `main.default.patient_lab`, that we created in <a href="$./00_Data_Generation" target="_blank">00_Data_Generation</a>.

**NOTE**: Events-Based Features differ from Sliding Window Features in how they might be deployed into a production environment. Sliding Window Features commonly have their aggregates stored as scalars in feature serving, whereas Events-Based Features may return only an intermediate form, an array of structs, with further feature transformation done in the consuming model pipeline. This will become clearer when evaluating the outputs of each.
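
To make the "further feature transform" step concrete, here is a minimal plain-Python sketch of what a consuming pipeline might do with the array of structs. The function name `summarize_labs`, the chosen scalars, and the sample values are all hypothetical and not part of this notebook's pipeline; the point is only that the request time is an input to the calculation.

```python
from datetime import datetime, timedelta

def summarize_labs(labs, request_ts):
    """Reduce a list of (event_ts, lab_value) pairs to scalar features.

    Hypothetical helper: `labs` mimics the array of structs returned for one
    lab type, and `request_ts` is the time at which the features are requested.
    """
    if not labs:
        return {"count": 0, "last_value": None, "days_since_last": None}
    # Pick the most recent observation by timestamp
    latest_ts, latest_value = max(labs, key=lambda pair: pair[0])
    return {
        "count": len(labs),
        "last_value": latest_value,
        # Depends on the request time, so it cannot be pre-computed
        "days_since_last": (request_ts - latest_ts) / timedelta(days=1),
    }

# Made-up sample data for illustration only
request_ts = datetime(2024, 6, 1, 12, 0)
labs = [
    (datetime(2024, 3, 1, 12, 0), 0.2),
    (datetime(2024, 5, 22, 12, 0), 0.5),
]
features = summarize_labs(labs, request_ts)
empty_features = summarize_labs([], request_ts)
```

A value such as `days_since_last` changes with every request, which is why only the intermediate array is stored and the scalars are derived at inference time.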


### `events_based_lab_features`

Similar to sliding window features, there isn't a benefit to using Tempo. For building our convenience function, it is more straightforward to use a range join. However, unlike sliding window features, we will return a complex struct, and the consuming model pipeline will convert it to the scalars used as input for the model family. This approach is helpful when the time of request needs to be part of the feature calculation and the aggregates therefore can't be pre-calculated in the feature store.
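
For readers new to range joins, the semantics can be sketched in plain Python before we move to Spark. This is an illustration only, not the implementation used in this notebook: the helper name `range_join`, the tuple layouts, and the sample rows are assumptions made for the example. It behaves like a left outer join, so an event with no labs in its window is kept with an empty list.

```python
from datetime import datetime, timedelta

def range_join(events, labs, window_size_in_seconds):
    """Attach to each event the labs that fall inside its look-back window.

    events: list of (patient_id, event_ts)
    labs:   list of (patient_id, lab_ts, lab_value)
    """
    window = timedelta(seconds=window_size_in_seconds)
    joined = []
    for patient_id, end_window_ts in events:
        start_window_ts = end_window_ts - window
        matches = [
            (lab_ts, lab_value)
            for lab_patient_id, lab_ts, lab_value in labs
            # Same patient, and lab timestamp within [start, end] inclusive
            if lab_patient_id == patient_id
            and start_window_ts <= lab_ts <= end_window_ts
        ]
        # Left outer behavior: keep the event even when nothing matches
        joined.append((patient_id, end_window_ts, matches))
    return joined

# Made-up sample data for illustration only
events = [("p1", datetime(2024, 6, 1)), ("p2", datetime(2024, 6, 1))]
labs = [
    ("p1", datetime(2024, 5, 1), 1.0),   # inside the window
    ("p1", datetime(2023, 1, 1), 2.0),   # too old
    ("p1", datetime(2024, 6, 2), 3.0),   # after the event
]
result = range_join(events, labs, window_size_in_seconds=6 * 30 * 24 * 60 * 60)
```

Only the lab from May 1 is attached to `p1`, and `p2` is returned with an empty list of labs.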

**NOTE**: When writing our function, we are going to define the window in seconds. Offsets by hours, days, weeks, months, quarters, or years should have their own convenience functions, since a typical interpretation would be to look back to the start of the interval, which yields a different result than the equivalent seconds calculation.
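
A small plain-Python sketch of why the two interpretations differ, using a one-day look back. The "calendar" version shown here (go back one day, then snap to the start of that day) is just one possible interpretation, chosen for illustration; the timestamp is made up.

```python
from datetime import datetime, timedelta

end_window_ts = datetime(2024, 3, 15, 14, 30)

# Seconds-based look back: exactly 24 * 60 * 60 seconds before the event
seconds_start = end_window_ts - timedelta(seconds=24 * 60 * 60)

# Calendar-style look back (assumed interpretation): go back one day,
# then snap to the start of that day
calendar_start = (end_window_ts - timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

# How much more history the calendar-style window includes
extra_history = seconds_start - calendar_start
```

Here the seconds-based window starts at 14:30 on March 14, while the calendar-style window starts at midnight on March 14, so the second window covers an extra 14.5 hours of labs.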

In [0]:
# To do a range join, we define our look-back window size in seconds
from pyspark.sql.functions import col, expr

# This is an approximate 6-month window (6 x 30 days) in seconds
window_size_in_seconds = 6*30*24*60*60

# We'll use lab_types to filter for only the labs of interest
lab_types = ['ua_protein', 'ua_ketones']

patient_lab = spark.table("main.default.patient_lab").alias('pl')
patient_event = spark.table("main.default.patient_event").alias('pe')

window_labs = patient_event.withColumnRenamed("event_ts","end_window_ts") \
                           .withColumn("start_window_ts", expr(f"end_window_ts - INTERVAL {window_size_in_seconds} seconds")) \
                           .join(patient_lab.filter(col("pl.lab_type").isin(lab_types)),
                                 (patient_lab.patient_id == patient_event.patient_id) &
                                 (patient_lab.event_ts.between(col("start_window_ts"), col("end_window_ts"))),
                                 "leftouter") \
                           .drop(col("pl.patient_id"), "start_window_ts")
                   
display(window_labs.limit(8))

In [0]:
from pyspark.sql.functions import collect_list, struct

# Group by patient_id, lab_type, and end_window_ts, and collect event_ts and lab_value into an array of structs
grouped_labs = window_labs.groupBy("patient_id", "lab_type", "end_window_ts") \
                          .agg(collect_list(struct("event_ts", "lab_value")).alias("labs"))

display(grouped_labs)

In [0]:
from pyspark.sql.functions import first_value

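# Pivot so each lab type has its own column. Listing the pivot values (lab_types, defined above)
# fixes the output columns and skips the null lab_type that the left outer join produces
# for events with no labs in the window.
pivot_labs = grouped_labs.withColumnRenamed("end_window_ts","event_ts") \
                         .groupBy("patient_id", "event_ts") \
                         .pivot("lab_type", lab_types) \
                         .agg(first_value("labs"))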

display(pivot_labs)

In [0]:
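# Putting it all together, we can write an events_based_lab_features function such as:

from typing import List

from pyspark.sql import DataFrame
from pyspark.sql.functions import expr, col, collect_list, struct, first_value

def events_based_lab_features(patient_event_df: DataFrame,
                              lab_types: List[str],
                              window_size_in_seconds: int) -> DataFrame:
    """Return one row per (patient_id, event_ts) with one column per lab type.

    Each lab column holds an array of (event_ts, lab_value) structs for the labs
    observed within `window_size_in_seconds` before (or at) the patient event.
    """
    patient_lab = spark.table("main.default.patient_lab").alias('pl')

    # Range join: keep labs whose timestamp falls inside each event's look-back window
    window_labs = patient_event_df.alias('pe') \
                                  .withColumnRenamed("event_ts","end_window_ts") \
                                  .withColumn("start_window_ts", expr(f"end_window_ts - INTERVAL {window_size_in_seconds} seconds")) \
                                  .join(patient_lab.filter(col("pl.lab_type").isin(lab_types)),
                                        (patient_lab.patient_id == patient_event_df.patient_id) &
                                        (patient_lab.event_ts.between(col("start_window_ts"), col("end_window_ts"))),
                                        "leftouter") \
                                  .drop(col("pl.patient_id"), "start_window_ts")
    # Collect the labs for each event and lab type into an array of structs
    grouped_labs = window_labs.groupBy("patient_id", "lab_type", "end_window_ts") \
                              .agg(collect_list(struct("event_ts", "lab_value")).alias("labs"))
    # Pivot so each lab type has its own column. Listing the pivot values fixes the output
    # columns and skips the null lab_type produced for events with no labs in the window.
    pivot_labs = grouped_labs.withColumnRenamed("end_window_ts","event_ts") \
                             .groupBy("patient_id", "event_ts") \
                             .pivot("lab_type", lab_types) \
                             .agg(first_value("labs"))
    return pivot_labs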

In [0]:
dat = events_based_lab_features(patient_event_df=spark.table("main.default.patient_event"),
                                lab_types=['ua_protein', 'ua_ketones'],
                                window_size_in_seconds=6*30*24*60*60)

display(dat)