
# Batch Model Inference

Once the model is registered in the model registry, we download it and use it for Spark batch inference.

This repo shows how to create time series (ts) aggregate functions that produce features from a source table of labs. Nothing guarantees that the feature transforms used during model training match the feature transforms used during inference, so the development team must monitor this for accuracy. Databricks does offer an enterprise solution for managing feature consistency: [Databricks Feature engineering and serving](https://docs.databricks.com/en/machine-learning/feature-store/index.html).
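
One lightweight way to monitor this is to compare the feature columns the model was trained on with the columns produced at inference time. The sketch below is a minimal illustration, not part of this repo: the helper `compare_feature_columns` and the column names are hypothetical. In the notebook, the expected names could likely be read from the model signature (for example via `mlflow.models.get_model_info(...)`), assuming the model was logged with one.

```python
def compare_feature_columns(expected, actual):
    """Return (missing, unexpected) column names, preserving input order."""
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    unexpected = [c for c in actual if c not in expected_set]
    return missing, unexpected


# Hypothetical column names, for illustration only.
training_columns = ["ua_ph_min_270d", "ua_ph_max_270d", "ua_ph_mean_270d"]
inference_columns = ["ua_ph_min_270d", "ua_ph_max_270d", "ua_ph_mean_360d"]

missing, unexpected = compare_feature_columns(training_columns, inference_columns)
print("missing from inference features:", missing)
print("not seen during training:", unexpected)
```

Failing the batch job when either list is non-empty is one possible policy. Note that a check like this catches renamed or dropped columns, but not a transform whose logic changed while its name stayed the same.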

Aggregating time series data as we do in this repo is more complicated than the examples shown in the Databricks documentation, for a few reasons:
 - Time series data usually includes late-arriving data, which requires inserts and updates as opposed to the typical insert-only pattern.
 - Sparse feature entries (creating a feature observation only when a feature changes) complicate the addition of new feature windows. When a new feature window is introduced, a record must be updated at the ts_event when an observation enters the window **and** when it leaves the window.
 - There is a Databricks private preview for ts aggregate features. However, it is currently written to create a dense feature table in which features are rewritten for every possible interval, e.g. each day. This is still an efficient feature serving strategy, since the tables compress well.
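
To make the sparse-entry point concrete, here is a minimal pure-Python sketch. It is not the repo's `sliding_window_numeric_aggregates` implementation; the function `sparse_window_change_points` and the sample dates are hypothetical. It assumes an observation made on day `d` is inside a window of `window_days` from `d` until it leaves on `d + window_days`.

```python
from datetime import date, timedelta


def sparse_window_change_points(observation_dates, window_days):
    """Return the sorted dates on which a windowed aggregate can change."""
    change_points = set()
    for d in observation_dates:
        change_points.add(d)  # observation enters the window
        change_points.add(d + timedelta(days=window_days))  # observation leaves it
    return sorted(change_points)


# Hypothetical lab observation dates.
labs = [date(2024, 1, 1), date(2024, 3, 1)]

points_270 = sparse_window_change_points(labs, 270)
points_360 = sparse_window_change_points(labs, 360)

# Dates that exist only because the 360-day window was added.
new_rows = sorted(set(points_360) - set(points_270))
print(new_rows)
```

Under these assumptions, each observation contributes two change points per window. Adding the 360-day window therefore introduces exit dates that have no row in a sparse table built only for the 270-day window, so those rows have to be inserted or updated. A dense per-day table avoids this at the cost of rewriting every interval.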

 **Recommendation**: If only batch inference is required and near-real-time (NRT) inference is not, consider using the following batch inference pattern. Once the aggregate time series functionality is GA and there is a business requirement for NRT inference, invest in building out a solution with Databricks feature serving.

 **NOTE**: The script below pulls the model by registered model version. However, the intended governance pattern is to apply an alias to the version of interest and pull via that alias. See [deploy model aliases](https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html?utm_source=chatgpt.com#deploy-models-using-aliases) when you are ready to adopt an alias to identify the current production model.
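
As a rough sketch of the difference, MLflow model URIs take the form `models:/<name>/<version>` for a version and `models:/<name>@<alias>` for an alias. The helper `build_model_uri` and the alias name `champion` below are assumptions made for illustration; they are not defined anywhere in this repo.

```python
def build_model_uri(name, version=None, alias=None):
    """Build an MLflow registered-model URI from a version or an alias."""
    if (version is None) == (alias is None):
        raise ValueError("Pass exactly one of `version` or `alias`.")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


uc_model_name = "main.default.patient_lab_sick"

uri_by_version = build_model_uri(uc_model_name, version="3")
uri_by_alias = build_model_uri(uc_model_name, alias="champion")  # hypothetical alias

# On a cluster, the alias would be assigned once and then used when loading:
#   from mlflow import MlflowClient
#   MlflowClient().set_registered_model_alias(uc_model_name, "champion", "3")
#   model = mlflow.pyfunc.spark_udf(spark, model_uri=uri_by_alias)
```

With an alias, promoting a new version means reassigning the alias in the registry; the batch inference notebook itself does not need to change.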

In [0]:
%run ./_setup/setup_patient_features

In [0]:
import mlflow
from pyspark.sql.functions import struct, col

mlflow.set_registry_uri("databricks-uc")

uc_model_name = "main.default.patient_lab_sick"
uc_model_version = "3"

registered_model_uri = f"models:/{uc_model_name}/{uc_model_version}"

# Load the model as a Spark UDF so that batch prediction can be distributed across the cluster
model = mlflow.pyfunc.spark_udf(spark, model_uri=registered_model_uri)

In [0]:
from patient_features.agg_func import sliding_window_numeric_aggregates
# Import under the F namespace so that min and max do not shadow Python's built-ins
from pyspark.sql import functions as F

patient_event_df = spark.read.table("main.default.patient_event")
patient_lab = spark.read.table("main.default.patient_lab")

# Aggregate the ua_ph lab over 360-day and 270-day windows
batch_features = sliding_window_numeric_aggregates(
    patient_event_df=patient_event_df,
    patient_lab=patient_lab,
    agg_funcs=[F.min, F.max, F.mean],
    lab_types=["ua_ph"],
    windows_in_days=[12 * 30, 9 * 30],
)

In [0]:
batch_predict = batch_features.withColumn('predict', model(struct(*map(col, batch_features.columns))))
display(batch_predict)