In [None]:
# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Fraud Detection MLOps - Feature Engineering and Model Evaluation

<table><tbody><tr>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2Fapache%2Fbeam%2Frefs%2Fheads%2Fmaster%2Fsdks%2Fpython%2Fapache_beam%2Fyaml%2Fexamples%2Ftransforms%2Fml%2Ffraud_detection%2Ffraud_detection_mlops_beam_yaml_sdk.ipynb">
      <img alt="Google Cloud Colab Enterprise logo" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" width="32px"><br> Run in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/yaml/examples/transforms/ml/fraud_detection/fraud_detection_mlops_beam_yaml_sdk.ipynb">
      <img alt="GitHub logo" src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px"><br> View on GitHub
    </a>
  </td>
</tr></tbody></table>


## Overview

This notebook demonstrates a fraud detection MLOps solution that uses the Apache
Beam YAML SDK for feature engineering and model evaluation to detect fraudulent transactions.

The dataset consists of generated credit card transactions and is available on Kaggle: https://www.kaggle.com/datasets/kartik2112/fraud-detection.

The training dataset is stored as Iceberg tables on GCS object storage for the feature engineering workflows in Beam. Once the features have been generated and stored in Iceberg, they are downloaded for model training and evaluation.

This example primarily uses the training dataset, but the workflow relies on Jinja [templatization](https://beam.apache.org/documentation/sdks/yaml/#jinja-templatization) in the YAML pipelines, which keeps it modular and easy to extend to additional datasets.
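
The pipelines in this notebook receive their template variables as a JSON object through the `--jinja_variables` flag. As a minimal sketch, the same command line can be assembled in Python so that quoting is handled by `json.dumps` and `shlex.quote` rather than by hand. The helper name `build_yaml_command` is hypothetical and not part of Beam, and only two of the template variables are shown.

```python
import json
import shlex


def build_yaml_command(pipeline_file, variables):
    """Build a shell-safe command line for a templated Beam YAML pipeline.

    Hypothetical helper for illustration only; it is not part of Beam.
    """
    args = [
        "python", "-m", "apache_beam.yaml.main",
        f"--yaml_pipeline_file={pipeline_file}",
        # The Jinja variables are passed as a single JSON object.
        f"--jinja_variables={json.dumps(variables)}",
    ]
    return " ".join(shlex.quote(arg) for arg in args)


command = build_yaml_command(
    "./yaml/iceberg_migration_train_dataset.yaml",
    {
        "DATASET_PATH": "./dataset/fraud-detection/fraudTrain.csv",
        "ICEBERG_TABLE": "fraud_detection.train_dataset_table",
    },
)
print(command)
```

Pointing the same template at another dataset would then only require a different dictionary of variables.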

## Outline
1. Setup
2. From dataset to Iceberg tables
3. Feature engineering
4. Training
5. Evaluation

## Setup

Install the necessary libraries and dependencies.

In [None]:
!pip3 install --quiet --upgrade \
  apache-beam[yaml,gcp] \
  opendatasets \
  scikit-learn \
  xgboost \
  datatable \
  pandas \
  poetry

!apt-get update
!apt-get install -y python3.10-venv

In [None]:
import os
import random
import time

import opendatasets as od
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

Setting up directories and environment variables.

In [None]:
!mkdir -p yaml dataset

PROJECT = 'apache-beam-testing' # @param {type:'string'}
REGION = 'us-central1' # @param {type:'string'}
WAREHOUSE = 'gs://apache-beam-testing-charlesng/mlops' # @param {type:'string'}

os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
os.environ['WAREHOUSE'] = WAREHOUSE

The feature engineering and model evaluation tasks require custom PTransforms. We use a [transform provider](https://beam.apache.org/documentation/sdks/yaml-providers/) to expose transforms written in Python so that they can be used in a YAML pipeline.

[Poetry](https://python-poetry.org/) and the following `pyproject.toml` file are used to manage dependencies and to build and package the custom PTransform implementations, which reside in `my_provider.py`.

In [None]:
%%writefile ./yaml/pyproject.toml

[tool.poetry]
name = "my_provider"
version = "0.1.0"
description = "A provider for custom transforms"
authors = ["Your Name <you@example.com>"]
license = "Apache License 2.0"
packages = [
    { include = "my_provider.py" },
]

[tool.poetry.dependencies]
python = "^3.10"
apache-beam = {extras = ["gcp", "yaml"], version = "^2.67.0"}
scikit-learn = "^1.7.0"
numpy = "^1.26.0"
pandas = "^2.2.0"
xgboost = "^3.0.0"
datatable = "^1.1.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

## From dataset to Iceberg tables

We use the `opendatasets` library to programmatically download the dataset from Kaggle.

We first need a Kaggle account. We also need an API key, which is stored in the `kaggle.json` file that is downloaded automatically when you create an API token. Go to your *Profile* picture -> *Settings* -> *API* -> *Create New Token*.

The dataset download will prompt you to enter your Kaggle username and key. Copy this information from `kaggle.json`.
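
As a small convenience, the two values can be read from the token file instead of being copied by eye. This is only a sketch: it assumes `kaggle.json` was saved in the current working directory, and `read_kaggle_credentials` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_kaggle_credentials(path="kaggle.json"):
    """Return (username, key) from a Kaggle API token file."""
    with Path(path).open() as token_file:
        token = json.load(token_file)
    return token["username"], token["key"]


# Assumption: kaggle.json sits in the current working directory.
if Path("kaggle.json").exists():
    username, key = read_kaggle_credentials()
    # Print only the username; keep the key out of the notebook output.
    print(f"Kaggle username: {username}")
```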

In [None]:
dataset_url = 'https://www.kaggle.com/datasets/kartik2112/fraud-detection'
od.download(dataset_url, data_dir='./dataset')

Read the dataset from the CSV file and write it to an Iceberg table.

For Iceberg tables in this workflow, GCS is used as the storage layer.
In a data lakehouse with Iceberg and GCS object storage, a natural choice
for the Iceberg catalog is [BigLake metastore](https://cloud.google.com/bigquery/docs/about-blms).
It is a managed, serverless metastore that doesn't require any setup.

In [None]:
%%writefile ./yaml/iceberg_migration_template.yaml
pipeline:
  transforms:
    - type: PyTransform
      name: ReadFromCsv
      input: {}
      config:
        constructor: apache_beam.io.ReadFromCsv
        kwargs:
            path: "{{ DATASET_PATH }}"
            index_col: False
            usecols: [
              'trans_date_trans_time',
              'cc_num',
              'merchant',
              'category',
              'amt',
              'first',
              'last',
              'gender',
              'street',
              'city',
              'state',
              'zip',
              'lat',
              'long',
              'city_pop',
              'job',
              'dob',
              'trans_num',
              'unix_time',
              'merch_lat',
              'merch_long',
              'is_fraud']

    - type: WriteToIceberg
      name: WriteToIceberg
      input: ReadFromCsv
      config:
        table: "{{ ICEBERG_TABLE }}"
        catalog_name: "my_catalog"
        catalog_properties:
          warehouse: "{{ WAREHOUSE }}"
          catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
          io-impl: "org.apache.iceberg.gcp.gcs.GCSFileIO"
          gcp_project: "{{ PROJECT }}"
          gcp_location: "{{ REGION }}"

In [None]:
%%writefile ./yaml/iceberg_migration_train_dataset.yaml
{% include './yaml/iceberg_migration_template.yaml' %}

In [None]:
!python -m apache_beam.yaml.main                                        \
  --yaml_pipeline_file=./yaml/iceberg_migration_train_dataset.yaml      \
  --jinja_variables='{                                                  \
    "DATASET_PATH": "./dataset/fraud-detection/fraudTrain.csv",         \
    "ICEBERG_TABLE": "fraud_detection.train_dataset_table",             \
    "WAREHOUSE": "'$WAREHOUSE'",                                        \
    "PROJECT": "'$PROJECT'",                                            \
    "REGION": "'$REGION'" }'

## Feature engineering

In this dataset, there's information on users' transactions over time.

We experiment by computing the following historical aggregate features from each user's earlier transactions:
- A user's average transaction amount over the past 1, 3, and 7 days.
- A user's total transaction count over the past 1, 3, and 7 days.
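
To make the window semantics concrete, here is a minimal pure-Python sketch of the same idea for a single user, without Beam. The function name `historical_features` and the toy transactions are assumptions for illustration. Each row only sees transactions that happened strictly earlier, so the current transaction never contributes to its own features.

```python
from collections import deque
from datetime import datetime, timedelta


def historical_features(transactions, windows=(1, 3, 7)):
    """Compute per-transaction aggregates over earlier transactions.

    `transactions` is a list of (iso_timestamp, amount) for one user.
    """
    max_window = timedelta(days=max(windows))
    history = deque()  # earlier (timestamp, amount) pairs, oldest first
    rows = []
    parsed = sorted((datetime.fromisoformat(t), a) for t, a in transactions)
    for ts, amt in parsed:
        # Drop transactions that fell out of the largest window.
        while history and history[0][0] < ts - max_window:
            history.popleft()
        row = {"amt": amt}
        for days in windows:
            cutoff = ts - timedelta(days=days)
            amounts = [a for t, a in history if t >= cutoff]
            row[f"count_past_{days}d"] = len(amounts)
            row[f"avg_amount_past_{days}d"] = (
                sum(amounts) / len(amounts) if amounts else 0)
        rows.append(row)
        # Only now does the current transaction become history.
        history.append((ts, amt))
    return rows


toy = [
    ("2020-01-01 10:00:00", 10.0),
    ("2020-01-01 18:00:00", 30.0),
    ("2020-01-03 09:00:00", 50.0),
    ("2020-01-09 12:00:00", 20.0),
]
for row in historical_features(toy):
    print(row)
```

In this toy run, the third transaction has no activity in the previous day but two transactions averaging 20.0 in the previous three days, while the fourth only sees the third within its 7-day window.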

As mentioned before, we implement our custom PTransform `ComputeHistoricalFeatures` in `my_provider.py` to generate these features, so that it can later be exposed to the YAML pipeline.

In [None]:
%%writefile ./yaml/my_provider.py

import apache_beam as beam
from collections import deque
from datetime import datetime, timedelta, timezone

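class ComputeHistoricalFeatures(beam.PTransform):
  """Adds per-user historical aggregates (average amount and transaction
  count over the past 1, 3, and 7 days) to each transaction row."""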

  class _ComputeFeaturesDoFn(beam.DoFn):
    """Processes all transactions for one user to compute features."""
    def process(self, element):
      cc_num, transactions = element

      # 1. Sort all transactions for the user chronologically.
      sorted_transactions = sorted(
          list(transactions),
          key=lambda t: datetime.fromisoformat(t.trans_date_trans_time).timestamp())

      # Store earlier transactions in a sliding 7-day window.
      history = deque()

      for tx in sorted_transactions:
        current_time = datetime.fromisoformat(tx.trans_date_trans_time).astimezone(timezone.utc)

        # 2. Remove transactions older than 7 days
        # relative to the current transaction's timestamp.
        while (history and
          datetime
            .fromisoformat(history[0].trans_date_trans_time)
            .astimezone(timezone.utc) < current_time - timedelta(days=7)):
            history.popleft()

        # 3. Calculate features by filtering the current history.
        def _avg(amounts):
            return sum(amounts) / len(amounts) if amounts else 0

        avg_past_1d = _avg([h.amt for h in history
                            if (datetime
                                .fromisoformat(h.trans_date_trans_time)
                                .astimezone(timezone.utc) >= current_time - timedelta(days=1))])
        avg_past_3d = _avg([h.amt for h in history
                            if (datetime
                                .fromisoformat(h.trans_date_trans_time)
                                .astimezone(timezone.utc) >= current_time - timedelta(days=3))])
        avg_past_7d = _avg([h.amt for h in history])

        count_past_1d = len([h for h in history
                             if (datetime
                                  .fromisoformat(h.trans_date_trans_time)
                                  .astimezone(timezone.utc) >= current_time - timedelta(days=1))])
        count_past_3d = len([h for h in history
                             if (datetime
                                  .fromisoformat(h.trans_date_trans_time)
                                  .astimezone(timezone.utc) >= current_time - timedelta(days=3))])
        count_past_7d = len(history)

        # 4. Yield a new row with the original data and the new features.
        yield beam.Row(
            **tx._asdict(),
            avg_amount_past_1d=avg_past_1d,
            avg_amount_past_3d=avg_past_3d,
            avg_amount_past_7d=avg_past_7d,
            count_past_1d=count_past_1d,
            count_past_3d=count_past_3d,
            count_past_7d=count_past_7d
        )

        # 5. Add the current transaction to history for the next iteration.
        history.append(tx)

  def expand(self, pcoll):
    return (
        pcoll
        | beam.WithKeys(lambda row: row.cc_num)
        | beam.GroupByKey()
        | beam.ParDo(self._ComputeFeaturesDoFn())
        | beam.Map(lambda row: beam.Row(
            **row.as_dict()
          ))
    )


Build and package the custom transform.

In [None]:
!poetry build -C ./yaml

Read the input training data stored in the Iceberg table, use our custom transform to compute the features, and write them to another Iceberg table.

In [None]:
%%writefile ./yaml/historical_aggregates_featurize_template.yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromIceberg
      name: ReadFromIceberg
      config:
        table: "{{ ICEBERG_TABLE_INPUT }}"
        catalog_name: "my_catalog"
        catalog_properties:
          warehouse: "{{ WAREHOUSE }}"
          catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
          io-impl: "org.apache.iceberg.gcp.gcs.GCSFileIO"
          gcp_project: "{{ PROJECT }}"
          gcp_location: "{{ REGION }}"

    - type: HistoricalAggregatesTransform
      name: HistoricalAggregatesTransform

    - type: MapToFields
      name: MapToFields
      config:
        language: python
        fields:
          cc_num:
            callable: "lambda row: row.cc_num"
            output_type: integer
          amt:
            callable: "lambda row: row.amt"
            output_type: number
          trans_date_trans_time:
            callable: "lambda row: row.trans_date_trans_time"
            output_type: string
          trans_num:
            callable: "lambda row: row.trans_num"
            output_type: string
          merchant:
            callable: "lambda row: row.merchant"
            output_type: string
          category:
            callable: "lambda row: row.category"
            output_type: string
          merch_lat:
            callable: "lambda row: row.merch_lat"
            output_type: number
          merch_long:
            callable: "lambda row: row.merch_long"
            output_type: number
          first:
            callable: "lambda row: row.first"
            output_type: string
          last:
            callable: "lambda row: row.last"
            output_type: string
          gender:
            callable: "lambda row: row.gender"
            output_type: string
          street:
            callable: "lambda row: row.street"
            output_type: string
          city:
            callable: "lambda row: row.city"
            output_type: string
          state:
            callable: "lambda row: row.state"
            output_type: string
          zip:
            callable: "lambda row: row.zip"
            output_type: integer
          lat:
            callable: "lambda row: row.lat"
            output_type: number
          long:
            callable: "lambda row: row.long"
            output_type: number
          city_pop:
            callable: "lambda row: row.city_pop"
            output_type: integer
          job:
            callable: "lambda row: row.job"
            output_type: string
          dob:
            callable: "lambda row: row.dob"
            output_type: string
          avg_amount_past_1d:
            callable: "lambda row: row.avg_amount_past_1d"
            output_type: number
          avg_amount_past_3d:
            callable: "lambda row: row.avg_amount_past_3d"
            output_type: number
          avg_amount_past_7d:
            callable: "lambda row: row.avg_amount_past_7d"
            output_type: number
          count_past_1d:
            callable: "lambda row: row.count_past_1d"
            output_type: integer
          count_past_3d:
            callable: "lambda row: row.count_past_3d"
            output_type: integer
          count_past_7d:
            callable: "lambda row: row.count_past_7d"
            output_type: integer
          is_fraud:
            callable: "lambda row: row.is_fraud"
            output_type: integer

    - type: WriteToIceberg
      name: WriteToIceberg
      config:
        table: "{{ ICEBERG_TABLE_OUTPUT }}"
        catalog_name: "my_catalog"
        catalog_properties:
          warehouse: "{{ WAREHOUSE }}"
          catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
          io-impl: "org.apache.iceberg.gcp.gcs.GCSFileIO"
          gcp_project: "{{ PROJECT }}"
          gcp_location: "{{ REGION }}"

providers:
  - type: pythonPackage
    config:
      packages:
        - ./dist/my_provider-0.1.0.tar.gz
    transforms:
      HistoricalAggregatesTransform: 'my_provider.ComputeHistoricalFeatures'


In [None]:
%%writefile ./yaml/historical_aggregates_featurize_train_dataset.yaml
{% include './yaml/historical_aggregates_featurize_template.yaml' %}

In [None]:
!python -m apache_beam.yaml.main                                                                    \
  --yaml_pipeline_file=./yaml/historical_aggregates_featurize_train_dataset.yaml                    \
  --jinja_variables='{                                                                              \
    "ICEBERG_TABLE_INPUT": "fraud_detection.train_dataset_table",                                   \
    "ICEBERG_TABLE_OUTPUT": "fraud_detection.historical_aggregates_featurized_train_dataset_table", \
    "WAREHOUSE": "'$WAREHOUSE'",                                                                    \
    "PROJECT": "'$PROJECT'",                                                                        \
    "REGION": "'$REGION'" }'

## Training

Download the dataset with all the computed features from the Iceberg table and load it into a pandas DataFrame.
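
Beam file sinks write sharded output, so the CSV lands in files whose names end in a suffix such as `-00000-of-00001`. A minimal sketch for locating every shard of an output path is shown here. The helper name `find_shards` is hypothetical, and the sketch assumes the shard naming seen in this notebook.

```python
import glob


def find_shards(output_path):
    """Return all shard files written for `output_path`, sorted by name."""
    # glob.escape protects any special characters in the path itself.
    return sorted(glob.glob(glob.escape(output_path) + "-*-of-*"))


shards = find_shards(
    "./dataset/historical_aggregates_featurized_train_dataset.csv")
print(shards)
```

If more than one shard is found, each file would need to be read and the results concatenated.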

In [None]:
%%writefile ./yaml/download_featurized_dataset_template.yaml
pipeline:
  type: chain
  transforms:
    - type: ReadFromIceberg
      config:
        table: "{{ ICEBERG_TABLE }}"
        catalog_name: "my_catalog"
        catalog_properties:
          warehouse: "{{ WAREHOUSE }}"
          catalog-impl: "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog"
          io-impl: "org.apache.iceberg.gcp.gcs.GCSFileIO"
          gcp_project: "{{ PROJECT }}"
          gcp_location: "{{ REGION }}"

    - type: WriteToCsv
      config:
        path: "{{ OUTPUT_PATH }}"

In [None]:
%%writefile ./yaml/download_featurized_train_dataset.yaml
{% include './yaml/download_featurized_dataset_template.yaml' %}

In [None]:
!python -m apache_beam.yaml.main                                                                  \
  --yaml_pipeline_file=./yaml/download_featurized_train_dataset.yaml                              \
  --jinja_variables='{                                                                            \
  "ICEBERG_TABLE": "fraud_detection.historical_aggregates_featurized_train_dataset_table",        \
  "WAREHOUSE": "'$WAREHOUSE'",                                                                    \
  "PROJECT": "'$PROJECT'",                                                                        \
  "REGION": "'$REGION'",                                                                          \
  "OUTPUT_PATH": "./dataset/historical_aggregates_featurized_train_dataset.csv" }'

In [None]:
df = pd.read_csv(
    './dataset/historical_aggregates_featurized_train_dataset.csv-00000-of-00001',
    header=0,
    parse_dates=['trans_date_trans_time', 'dob']
)

df.head(5)

ML models usually accept only numerical or categorical data. Columns of type string (transaction date and time, date of birth, first and last name, city, state, etc.) require some preprocessing.

For datetime columns, we break each one into multiple numerical feature columns.
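
As a standalone illustration of that decomposition, the sketch below splits a single timestamp string with the standard library only. The function name `date_parts` and the sample values are made up for illustration. Weekdays are numbered from Monday as 0, and a date-only value such as a date of birth always yields an hour of 0.

```python
from datetime import datetime


def date_parts(value, prefix):
    """Split an ISO-formatted date or datetime string into numeric parts."""
    dt = datetime.fromisoformat(value)
    return {
        f"{prefix}_year": dt.year,
        f"{prefix}_month": dt.month,
        f"{prefix}_day": dt.day,
        f"{prefix}_weekday": dt.weekday(),  # Monday is 0, Sunday is 6
        f"{prefix}_hour": dt.hour,
    }


print(date_parts("2019-01-01 00:00:18", "trans_date_trans_time"))
print(date_parts("1988-03-09", "dob"))
```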

In [None]:
def add_dateparts(df, col):
    """
    Split a datetime column into separate columns for
    year, month, day, weekday, and hour. The DataFrame is modified in place.
    :param df: DataFrame to add the columns to
    :param col: name of the column with datetime values
    :return: None
    """
    df[col + '_year'] = df[col].dt.year
    df[col + '_month'] = df[col].dt.month
    df[col + '_day'] = df[col].dt.day
    df[col + '_weekday'] = df[col].dt.weekday
    df[col + '_hour'] = df[col].dt.hour

add_dateparts(df, 'trans_date_trans_time')
add_dateparts(df, 'dob')

For the other feature columns with string data, we explicitly cast them to the `category` dtype.

In [None]:
categorical_feature_columns = [
    'merchant',
    'category',
    'first',
    'last',
    'gender',
    'city',
    'state',
    'job'
]
for col in categorical_feature_columns:
    df[col] = df[col].astype('category')

In [None]:
df.info()

`baseline_feature_columns` lists the original columns of the dataset, which are the input columns for our first model. `full_feature_columns` adds the computed feature columns to that list and is the input for our second model.

The target/label column for training is the `is_fraud` column.

In [None]:
baseline_feature_columns = [
    'cc_num',
    'amt',
    'trans_date_trans_time_year',
    'trans_date_trans_time_month',
    'trans_date_trans_time_day',
    'trans_date_trans_time_weekday',
    'trans_date_trans_time_hour',
    'merchant',
    'category',
    'merch_lat',
    'merch_long',
    'first',
    'last',
    'gender',
    'city',
    'state',
    'zip',
    'lat',
    'long',
    'city_pop',
    'job',
    'dob_year',
    'dob_month',
    'dob_day',
    'dob_weekday',
    'dob_hour'
]
full_feature_columns = baseline_feature_columns + [
    'avg_amount_past_1d',
    'avg_amount_past_3d',
    'avg_amount_past_7d',
    'count_past_1d',
    'count_past_3d',
    'count_past_7d',
]
target_column = 'is_fraud'

X = df[full_feature_columns]
y = df[target_column]

# Split data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Detecting fraudulent transactions is a supervised, binary classification problem, and our dataset is tabular. [Gradient-boosted decision tree (GBDT) models](https://en.wikipedia.org/wiki/Gradient_boosting) are known to perform well on this kind of problem and dataset.

We use the XGBoost library which implements the GBDT machine learning algorithm in a scalable, distributed manner.

We train two models, one `baseline_model` with `baseline_feature_columns`, and one `experimental_model` with `full_feature_columns`.

In [None]:
# Train the baseline XGBoost classifier on the original feature columns only.
# use_label_encoder is omitted because recent XGBoost releases no longer use it.
baseline_model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    enable_categorical=True,
    tree_method='hist'
)
baseline_model.fit(X_train[baseline_feature_columns], y_train)
baseline_model.save_model("baseline_model.bst")

In [None]:
!gcloud storage cp baseline_model.bst {WAREHOUSE}

In [None]:
# Train the experimental XGBoost classifier on the full feature columns.
# use_label_encoder is omitted because recent XGBoost releases no longer use it.
experimental_model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    enable_categorical=True,
    tree_method='hist'
)
experimental_model.fit(X_train, y_train)
experimental_model.save_model("experimental_model.bst")

In [None]:
!gcloud storage cp experimental_model.bst {WAREHOUSE}

## Evaluation

Save the test datasets and labels to be used later in the model evaluation YAML pipeline.

In [None]:
X_test[baseline_feature_columns].to_pickle("X_test_baseline_feature_columns.pkl")
X_test[full_feature_columns].to_pickle("X_test_full_feature_columns.pkl")

y_test.to_pickle("y_test.pkl")

We again implement custom PTransforms: `RunInferenceTransform` performs inference on the input data, and `ComputeMetricsTransform` calculates several model evaluation metrics.

In [None]:
%%writefile -a ./yaml/my_provider.py

from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.xgboost_inference import XGBoostModelHandlerPandas
import xgboost
import pandas as pd
import sklearn.metrics

class RunInferenceTransform(beam.PTransform):
  """Runs XGBoost inference on a pickled pandas DataFrame."""
  def __init__(self, model_path, df_dataset_path):
    self._model_path = model_path
    self._df_dataset_path = df_dataset_path
    self._model_handler = XGBoostModelHandlerPandas(
        model_class=xgboost.XGBClassifier,
        model_state=self._model_path,
    )

  def expand(self, pcoll):
    return (
        pcoll
        | beam.Create([pd.read_pickle(self._df_dataset_path)])
        | RunInference(self._model_handler)
        | beam.Map(lambda row: beam.Row(inferences=row.inference))
    )

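class ComputeMetricsTransform(beam.PTransform):
  """Computes accuracy and macro-averaged precision, recall, and F1
  against pickled target labels."""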
  def __init__(self, df_targets_path):
    self.targets = pd.read_pickle(df_targets_path).to_list()

  def expand(self, pcoll):

    def compute_metrics(row):
      true_labels = self.targets
      predicted_labels = row.inferences

      accuracy = sklearn.metrics.accuracy_score(true_labels, predicted_labels)
      precision = sklearn.metrics.precision_score(true_labels, predicted_labels, average='macro', zero_division=0)
      recall = sklearn.metrics.recall_score(true_labels, predicted_labels, average='macro', zero_division=0)
      f1 = sklearn.metrics.f1_score(true_labels, predicted_labels, average='macro', zero_division=0)

      yield beam.Row(
          accuracy=float(accuracy),
          precision=float(precision),
          recall=float(recall),
          f1=float(f1),
      )

    return pcoll | 'CalculateMetrics' >> beam.FlatMap(compute_metrics)


Build and package the custom transforms.

In [None]:
!poetry build -C ./yaml

Use our custom transforms to read in the dataset, perform inference, and evaluate the predictions against the target labels.

In [None]:
%%writefile ./yaml/model_evaluation_template.yaml
pipeline:
  transforms:
    - type: RunInferenceTransform
      name: RunInferenceTransform
      input: {}
      config:
        model_path: "{{ MODEL_PATH }}"
        df_dataset_path: "{{ DF_DATASET_PATH }}"

    - type: ComputeMetricsTransform
      name: ComputeMetricsTransform
      input: RunInferenceTransform
      config:
        df_targets_path: "{{ DF_TARGETS_PATH }}"

    - type: LogForTesting
      name: LogForTesting
      input: ComputeMetricsTransform

providers:
  - type: pythonPackage
    config:
      packages:
        - ./dist/my_provider-0.1.0.tar.gz
    transforms:
      RunInferenceTransform: 'my_provider.RunInferenceTransform'
      ComputeMetricsTransform: 'my_provider.ComputeMetricsTransform'


Evaluate the baseline model on the baseline feature columns.

In [None]:
%%writefile ./yaml/baseline_model_evaluation.yaml
{% include './yaml/model_evaluation_template.yaml' %}

In [None]:
!python -m apache_beam.yaml.main                                              \
  --yaml_pipeline_file=./yaml/baseline_model_evaluation.yaml                  \
  --jinja_variables='{                                                        \
    "MODEL_PATH": "'$WAREHOUSE'/baseline_model.bst",                          \
    "DF_DATASET_PATH": "./X_test_baseline_feature_columns.pkl",               \
    "DF_TARGETS_PATH": "./y_test.pkl" }'

The model trained only on the original features achieves:

```
{
  "accuracy": 0.9984623532828757,
  "precision": 0.963115986169815,
  "recall": 0.8987887554423477,
  "f1": 0.9285234509618558
}
```
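
`ComputeMetricsTransform` requests macro-averaged precision, recall, and F1 (`average='macro'`), which give each class equal weight regardless of how many samples it has. The pure-Python sketch below re-derives these metrics on made-up labels to show how accuracy can look high while macro recall exposes missed positives. The function `macro_metrics` and the toy labels are illustrative assumptions and not part of the pipeline.

```python
def macro_metrics(y_true, y_pred, labels=(0, 1)):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios count as 0, mirroring zero_division=0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels: eight legitimate transactions, two frauds, one fraud missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.9 but macro recall is 0.75, because half of the positive class was missed.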

Evaluate the experimental model on the full feature columns.

In [None]:
%%writefile ./yaml/experimental_model_evaluation.yaml
{% include './yaml/model_evaluation_template.yaml' %}

In [None]:
!python -m apache_beam.yaml.main                                              \
  --yaml_pipeline_file=./yaml/experimental_model_evaluation.yaml              \
  --jinja_variables='{                                                        \
    "MODEL_PATH": "'$WAREHOUSE'/experimental_model.bst",                      \
    "DF_DATASET_PATH": "./X_test_full_feature_columns.pkl",                   \
    "DF_TARGETS_PATH": "./y_test.pkl" }'

The model trained on the full feature columns achieves a better result:

```
{
  "accuracy": 0.9989620884659411,
  "precision": 0.9643454567937234,
  "recall": 0.944328457850605,
  "f1": 0.9541138032867618
}
```