
# Point-in-time Lab Features

To minimize the risk of conflating different ways of deriving features, we define the following feature categories:
 - **Point-in-Time Features**: A type of data join that ensures features are derived from the most recent observation available at or before a given timestamp. Commonly, results are pivoted so that every lab has its own column.
 - **Sliding Window Features**: A way to aggregate events that happened within a specific rolling window (e.g., last 7 days, last 30 minutes). Typically these are numeric aggregates that return a scalar, such as `mean`, `min`, or `max`.
 - **Event-Based Features**: Features computed from occurrences (one or many) of specific events before the observation point.
 - **Cohort-Based Features**: Features generated from historical groupings within a fixed observation window. The difference between Event-Based and Cohort-Based is that cohort-based features use the same observation timestamp for every patient, whereas event-based features use an event time that is specific to each patient.
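
As a rough illustration of how the first two categories differ, here is a plain-Python sketch on a toy series for a single patient and lab type. The timestamps, values, and helper names (`point_in_time_value`, `sliding_window_mean`) are made up for this example and are not part of Tempo. It assumes "point-in-time" means the most recent observation at or before the timestamp.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical lab observations for one patient and one lab type: (timestamp, value)
observations = [
    (datetime(2024, 1, 1, 8), 1.0),
    (datetime(2024, 1, 3, 8), 3.0),
    (datetime(2024, 1, 9, 8), 5.0),
]

def point_in_time_value(obs, as_of):
    """Return the most recent value observed at or before `as_of`, or None."""
    eligible = [(ts, value) for ts, value in obs if ts <= as_of]
    if not eligible:
        return None
    return max(eligible, key=lambda row: row[0])[1]

def sliding_window_mean(obs, as_of, window):
    """Return the mean of values observed in (as_of - window, as_of], or None."""
    values = [value for ts, value in obs if as_of - window < ts <= as_of]
    return mean(values) if values else None

as_of = datetime(2024, 1, 4)
pit = point_in_time_value(observations, as_of)                      # latest single value
win = sliding_window_mean(observations, as_of, timedelta(days=7))   # aggregate over a window
```

Neither helper looks at the observation on January 9, because it happened after the `as_of` timestamp. Excluding future observations is the property that matters for avoiding leakage.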

In this notebook, we'll write a convenience function, `point_in_time_lab_features` (defined below as `lab_as_of_features`), that retrieves **Point-in-Time Features** from the lab data we created in <a href="$./00_Data_Generation" target="_blank">00_Data_Generation</a>, `main.default.patient_lab`.

**NOTE**: This notebook demonstrates an approach for quick feature discovery using [Tempo](https://databrickslabs.github.io/tempo/about/user-guide.html). Features that perform well in a given model will not necessarily be saved into a feature store the same way. [Feature serving](https://docs.databricks.com/en/machine-learning/feature-store/feature-function-serving.html#what-is-databricks-feature-serving) is therefore necessary to productionize features, but it is out of scope for this notebook.

In [0]:
%pip install dbl-tempo


#### labs_tsdf

**TSDF** is a time-series wrapper around a Spark DataFrame. A TSDF carries additional metadata: the column used as the timestamp for time-series expressions, and the partition columns that together identify a single series. In our source table `main.default.patient_lab`, there is one series for every (**patient_id**, **lab_type**) pair.
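
To make the idea of "one series per partition key" concrete, here is a small plain-Python sketch. It only mimics the grouping concept and is not how Tempo stores data. The rows are hypothetical and assume the columns `patient_id`, `lab_type`, `event_ts`, and `lab_value`; `split_into_series` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical rows: (patient_id, lab_type, event_ts, lab_value)
rows = [
    ("p1", "ua_glucose", "2024-01-02T08:00", 2.0),
    ("p1", "ua_ketones", "2024-01-01T08:00", 1.0),
    ("p1", "ua_glucose", "2024-01-01T09:00", 4.0),
    ("p2", "ua_glucose", "2024-01-01T10:00", 3.0),
]

def split_into_series(rows):
    """Group rows by (patient_id, lab_type) and order each group by event_ts."""
    series = defaultdict(list)
    for patient_id, lab_type, event_ts, lab_value in rows:
        series[(patient_id, lab_type)].append((event_ts, lab_value))
    for key in series:
        # ISO-8601 strings of equal length sort chronologically
        series[key].sort()
    return dict(series)

series = split_into_series(rows)
```

Here `("p1", "ua_glucose")` and `("p1", "ua_ketones")` are separate series even though they belong to the same patient.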

In [0]:
from tempo.tsdf import TSDF

labs = spark.table("main.default.patient_lab")
labs_tsdf = TSDF(labs, ts_col="event_ts", partition_cols = ["patient_id", "lab_type"])

display(labs_tsdf.df.limit(8))

In [0]:
# For as-of joins to work, both tables need to be wrapped as TSDF objects
patient_event = spark.table("main.default.patient_event")
patient_event_tsdf = TSDF(patient_event, ts_col="event_ts", partition_cols = ["patient_id"])

display(patient_event_tsdf.df)


### `point_in_time_lab_features`

To retrieve the most recent lab as of a given time, we can use Tempo's [asofJoin](https://databrickslabs.github.io/tempo/references/tsdf.html#tempo.tsdf.TSDF.asofJoin), which is designed to perform this kind of join at scale.
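
For intuition about what an as-of join computes, here is a minimal plain-Python sketch using a binary search over a sorted series. It is not Tempo's implementation. The helper names `as_of_lookup` and `as_of_join` and the sample data are assumptions for illustration, and the choice to keep unmatched events with `None` values is a decision made in this sketch only.

```python
from bisect import bisect_right

def as_of_lookup(series, ts):
    """Return the latest (row_ts, value) in a ts-sorted series with row_ts <= ts, or None."""
    timestamps = [row_ts for row_ts, _ in series]
    idx = bisect_right(timestamps, ts)  # number of rows at or before ts
    return series[idx - 1] if idx > 0 else None

def as_of_join(left_rows, right_series):
    """For each (key, ts) on the left, attach the most recent right-side row for that key."""
    joined = []
    for key, ts in left_rows:
        match = as_of_lookup(right_series.get(key, []), ts)
        right_ts, value = match if match else (None, None)
        joined.append((key, ts, right_ts, value))
    return joined

# Hypothetical glucose series keyed by patient_id, sorted by timestamp
glucose = {"p1": [("2024-01-01T09:00", 4.0), ("2024-01-02T08:00", 2.0)]}
# Hypothetical patient events: (patient_id, event_ts)
events = [
    ("p1", "2024-01-01T12:00"),
    ("p1", "2024-01-03T00:00"),
    ("p2", "2024-01-01T12:00"),
]

result = as_of_join(events, glucose)
```

Each event is matched only against rows from the same key, and only against rows that are not later than the event itself.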

We'll go through a couple of code snippets before writing the convenience function so that we can see the result of each step.

In [0]:
# This returns only a single lab value for each patient. No tie-breaker is declared, so it just picks one lab_type.
# NOTE: This is not what we want.

dat = patient_event_tsdf.asofJoin(right_tsdf=labs_tsdf,
                                  left_prefix="patient",
                                  right_prefix="")
display(dat.df)

In [0]:
# To get all of the specific labs we are interested in, we can explode each event by our desired labs array and run the same asofJoin
from pyspark.sql.functions import explode, lit

lab_types = ['ua_ketones', 'ua_glucose']

patient_event_labs = spark.table("main.default.patient_event") \
                               .withColumn("lab_type", explode(lit(lab_types)))
patient_event_labs_tsdf = TSDF(patient_event_labs, ts_col="event_ts", partition_cols = ["patient_id", "lab_type"])

dat = patient_event_labs_tsdf.asofJoin(right_tsdf=labs_tsdf,
                                       left_prefix="patient",
                                       right_prefix="")
display(dat.df)
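
The explode step above amounts to a cross product between patient events and the requested lab types, so that each event has one row per lab to match against. A plain-Python sketch of the same reshaping, using hypothetical event rows:

```python
from itertools import product

# Hypothetical patient events
events = [
    {"patient_id": "p1", "event_ts": "2024-01-03T00:00"},
    {"patient_id": "p2", "event_ts": "2024-01-01T12:00"},
]
lab_types = ["ua_ketones", "ua_glucose"]

# One output row per (event, lab_type) pair, mirroring explode over a constant array
exploded = [
    {**event, "lab_type": lab_type}
    for event, lab_type in product(events, lab_types)
]
```

Because `lab_type` is now present on both sides, it can be used as a partition column, and the as-of join matches each event against one series per lab instead of an arbitrary one.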

In [0]:
# Often, we'll want our data pivoted so that each lab_type is a column

from pyspark.sql.functions import first_value

group_by_cols = ['patient_id']
group_by_cols += [ "patient_" + c for c in patient_event_labs.columns if c not in ["patient_id", "lab_type"]]

p_dat = dat.df.groupBy(group_by_cols) \
              .pivot("lab_type") \
              .agg(first_value("lab_value"))

display(p_dat)
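
The pivot above turns the long result (one row per event and lab type) into a wide one (one row per event, one column per lab type). A plain-Python sketch of that reshaping follows. The helper `pivot_labs` and the sample rows are hypothetical, and the `patient_event_ts` column name assumes the `patient_` prefix used in the join above.

```python
def pivot_labs(long_rows, lab_types):
    """Turn long rows into one wide row per (patient_id, patient_event_ts)."""
    wide = {}
    for row in long_rows:
        key = (row["patient_id"], row["patient_event_ts"])
        # setdefault keeps the first value seen per (group, lab_type), like a "first" aggregate
        wide.setdefault(key, {}).setdefault(row["lab_type"], row["lab_value"])
    return [
        {"patient_id": pid, "patient_event_ts": ts,
         **{lab_type: labs.get(lab_type) for lab_type in lab_types}}
        for (pid, ts), labs in wide.items()
    ]

# Hypothetical output of the as-of join in long format
long_rows = [
    {"patient_id": "p1", "patient_event_ts": "2024-01-03T00:00", "lab_type": "ua_glucose", "lab_value": 2.0},
    {"patient_id": "p1", "patient_event_ts": "2024-01-03T00:00", "lab_type": "ua_ketones", "lab_value": 1.0},
    {"patient_id": "p2", "patient_event_ts": "2024-01-01T12:00", "lab_type": "ua_glucose", "lab_value": None},
]

wide_rows = pivot_labs(long_rows, ["ua_ketones", "ua_glucose"])
```

Labs with no matching observation show up as `None` in the wide row rather than being dropped.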

In [0]:
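# Putting it all together, we can write a lab_as_of_features function such as:
from typing import List

from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, lit, first_value
from tempo.tsdf import TSDF

def lab_as_of_features(patient_event_df: DataFrame,
                       lab_types: List[str]) -> DataFrame:
    """Return one row per patient event with the most recent value of each lab in
    `lab_types` at or before the event's `event_ts`, pivoted to one column per lab type."""
    # Relies on the notebook's global `spark` session to read the lab table.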
    labs = spark.table("main.default.patient_lab")
    labs_tsdf = TSDF(labs, ts_col="event_ts", partition_cols = ["patient_id", "lab_type"])

    patient_event_labs = patient_event_df.withColumn("lab_type", explode(lit(lab_types)))
    patient_event_labs_tsdf = TSDF(patient_event_labs, ts_col="event_ts", partition_cols = ["patient_id", "lab_type"])

    patient_as_of_labs = patient_event_labs_tsdf.asofJoin(right_tsdf=labs_tsdf,
                                                          left_prefix="patient",
                                                          right_prefix="")

    group_by_cols = ['patient_id'] + \
                    [ "patient_" + c for c in patient_event_labs.columns if c not in ["patient_id", "lab_type"]]
    
    return patient_as_of_labs.df.groupBy(group_by_cols) \
                                .pivot("lab_type") \
                                .agg(first_value("lab_value"))

In [0]:
dat = lab_as_of_features(patient_event_df=spark.table("main.default.patient_event"),
                         lab_types=['ua_ketones', 'ua_glucose'])
display(dat)