
# Model training from aggregate features

This notebook walks through an example of training a model with a popular [MLflow built-in model flavor](https://mlflow.org/docs/latest/models.html#built-in-model-flavors), [xgboost](https://mlflow.org/docs/latest/models.html#xgboost-xgboost).

The modules that we explored during discovery are now loaded from a Python module, which helps ensure reproducibility. An even better pattern is to package and version the Python modules as a wheel (`.whl`) and import them from the installed wheel.
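
As a minimal sketch of the wheel-based pattern, a notebook can record which version of the installed package it ran against. This assumes the wheel is published under the distribution name `patient_features`, which is hypothetical: this notebook only shows the import name, and the helper `installed_version` is introduced here for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "patient_features" is an assumed distribution name, used only for illustration
version = installed_version("patient_features")
print(f"patient_features version: {version or 'not installed as a wheel'}")
```

Logging this value alongside a training run makes it possible to tie a model back to the exact feature code that produced its inputs.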

The following sections walk through training a model. The data science here is somewhat nonsensical, but the model is genuinely trained and uses the same Python class that production deployments will use.

In [0]:
%run ./_setup/setup_patient_features

In [0]:
from patient_features.agg_func import sliding_window_numeric_aggregates
# Import under a namespace so Spark's min/max do not shadow Python's built-ins
from pyspark.sql import functions as F

patient_response = spark.table("main.default.patient_response")
patient_lab = spark.table("main.default.patient_lab")

# Build windowed lab aggregates per patient event, join the patient_response
# columns back on, drop the join keys, then split 80/20 into train and test
train_data, test_data = (
    sliding_window_numeric_aggregates(
        patient_event_df=patient_response,
        patient_lab=patient_lab,
        agg_funcs=[F.min, F.max, F.mean],
        lab_types=['ua_ph'],
        windows_in_days=[12 * 30, 9 * 30],
    )
    .join(patient_response, on=['patient_id', 'event_ts'], how='left')
    .drop('patient_id', 'event_ts')
    .randomSplit([0.8, 0.2], seed=42)
)

In [0]:
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Turn off mlflow autologging
mlflow.autolog(disable=True)

# Convert Spark DataFrames to Pandas DataFrames
train_data_pd = train_data.toPandas()
test_data_pd = test_data.toPandas()

# Separate features and target variable
X_train = train_data_pd.drop('is_sick', axis=1)
y_train = train_data_pd['is_sick']
X_test = test_data_pd.drop('is_sick', axis=1)
y_test = test_data_pd['is_sick']

# Train the XGBoost model
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
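
The metrics above use `average='weighted'`, which averages the per-class score weighted by each class's support (its number of true samples). The following is a pure-Python sketch written for illustration, not scikit-learn's implementation; the function name `weighted_precision` and the toy labels are assumptions.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Support-weighted mean of per-class precision."""
    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # Precision is taken as 0 for a class that was never predicted
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        score += (n_true / total) * precision
    return score


# Class 0: precision 2/2 with support 3; class 1: precision 1/2 with support 1
example = weighted_precision([0, 0, 0, 1], [0, 0, 1, 1])
print(example)  # 0.875
```

With imbalanced classes, the weighted average is dominated by the majority class, so it can look healthy even when the minority class is predicted poorly.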



## Create Experiment

While it is possible to log models directly to a notebook's experiment, in this example we'll log to a workspace experiment, which creates an experiment at a workspace path. This is helpful when runs from multiple notebooks are all candidates for the best model.

In [0]:
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="databricks",
                      registry_uri="databricks-uc")

model_name = "patient_lab_sick"
experiment_name = f"/Workspace/experiments/{model_name}"
# Strip the leading "/Workspace" prefix to get the path used as the experiment name
experiment_path = experiment_name[len("/Workspace"):]

# Reuse the experiment if it already exists; otherwise create it
experiment = client.get_experiment_by_name(experiment_path)
if experiment is None:
    experiment_id = client.create_experiment(name=experiment_path)
    experiment = client.get_experiment(experiment_id=experiment_id)
experiment_id = experiment.experiment_id


## Log and Register Model

In MLflow, you can separate logging a model to an experiment run from registering it. This is typically done when you want to retain many experiments but promote only the best model as a registered model. Because the two entities have different permissions, this is a good governance pattern: the DS team stays aware of all experiments, while only the registered model is available outside the DS team.

That topic is outside the scope of this repo. To find out more, see [Manage model lifecycle in Unity Catalog](https://docs.databricks.com/en/machine-learning/manage-model-lifecycle/index.html#manage-model-lifecycle-in-unity-catalog).

In [0]:
from mlflow.models.signature import infer_signature

mlflow.set_registry_uri("databricks-uc")

# Infer the model signature
signature = infer_signature(X_train, model.predict(X_train))

# Log the model and metrics to MLflow
with mlflow.start_run(experiment_id=experiment_id):
    mlflow.xgboost.log_model(xgb_model=model,
                             artifact_path="xgb_model", 
                             input_example=X_train.iloc[[0]],
                             model_format="ubj",
                             registered_model_name=f"main.default.{model_name}",
                             signature=signature)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)
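
The cell above logs and registers the model in a single call. As a rough sketch of the two-step alternative, runs could be logged without `registered_model_name`, and the best one registered afterwards. The helper names `model_uri_for_run` and `register_best_run` are hypothetical, and the sketch assumes each run logged its model under the `xgb_model` artifact path and an `f1_score` metric, as in this notebook.

```python
def model_uri_for_run(run_id: str, artifact_path: str = "xgb_model") -> str:
    """Build the runs:/ URI that points at a model artifact logged in a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_run(experiment_id: str, registered_name: str,
                      metric: str = "f1_score"):
    """Register the model from the run with the highest value of `metric`."""
    import mlflow  # imported lazily so the URI helper works without mlflow

    mlflow.set_registry_uri("databricks-uc")
    # search_runs returns a pandas DataFrame, sorted here best-first
    runs = mlflow.search_runs(
        experiment_ids=[experiment_id],
        order_by=[f"metrics.{metric} DESC"],
        max_results=1,
    )
    if runs.empty:
        raise ValueError(f"No runs found in experiment {experiment_id}")
    best_run_id = runs.iloc[0]["run_id"]
    return mlflow.register_model(model_uri_for_run(best_run_id), registered_name)
```

For example, `register_best_run(experiment_id, f"main.default.{model_name}")` would register the top run by F1 score. Whether F1 is the right selection metric depends on the use case.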