# Demo: Real-Time Feature Engineering with Feldera

## INTRODUCTION

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to an ML model. While raw data represents individual events, features summarize many raw events, e.g., average readings collected over a period of time. When raw input data arrives continuously in real-time, feature computation must also be performed in real-time in order to supply the ML model with up-to-date inputs. 

Building real-time feature pipelines using existing tools like Flink and Spark Structured Streaming is notoriously hard. **Feldera simplifies this task dramatically**. As a user, you simply express your features as SQL queries. Feldera runs these queries over historical data in order to compute feature vectors for model training and testing. Then, during inference, Feldera evaluates **the same queries** over streaming inputs in real-time.

In this notebook, we use Feldera to build a real-time credit card fraud detector. In particular, we: 
* Write SQL queries that define several interesting features, based on data enrichment and rolling aggregates.
* Compute feature vectors and train an ML model on a historical data set stored in a Delta Lake.
* Use the same queries to compute feature vectors over a real-time stream of credit card transactions.

## USE CASE: CREDIT CARD FRAUD DETECTION
Credit card fraud detection is a classic application of real-time feature engineering. Here, data comes in a stream of **transactions**, each with attributes like card number, purchase time, vendor, and amount. Additionally, the fraud detector has access to a slowly changing table with **demographics** information about cardholders, such as age and address.

<img src="./tables.png" alt="training" width="800"/>
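
To make the two inputs concrete, the sketch below joins a few made-up transactions against a made-up demographics table by card number. The sample values and the `enrich` helper are hypothetical and only illustrate the shape of the data; the field names follow the attributes described above.

```python
# Hypothetical demographics rows, keyed by card number.
demographics = {
    4000000000000001: {"zip": 94105, "city_pop": 870000},
    4000000000000002: {"zip": 10001, "city_pop": 8400000},
}

# Hypothetical stream of transactions.
transactions = [
    {"cc_num": 4000000000000001, "unix_time": 1700000000, "merchant": "grocer", "amt": 42.5},
    {"cc_num": 4000000000000002, "unix_time": 1700000060, "merchant": "cafe", "amt": 4.75},
    {"cc_num": 4000000000000003, "unix_time": 1700000120, "merchant": "kiosk", "amt": 9.0},
]

def enrich(transactions, demographics):
    """Inner-join each transaction with the cardholder's demographics row."""
    enriched = []
    for t in transactions:
        d = demographics.get(t["cc_num"])
        if d is None:
            continue  # inner join: skip transactions with no matching cardholder
        enriched.append({**t, **d})
    return enriched

enriched = enrich(transactions, demographics)
for row in enriched:
    print(row["cc_num"], row["amt"], row["zip"], row["city_pop"])
```

The third transaction has no matching cardholder, so an inner join drops it from the result.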


## INPUT DATASETS
We used a publicly available [Synthetic Credit Card Transaction Generator](https://github.com/namebrandon/Sparkov_Data_Generation) to generate two labeled datasets, each with 1000 user profiles. We use the first dataset for model training and testing, and the second for real-time inference. We stored the datasets in the [Delta Lake format](https://delta.io/) in a public S3 bucket:

* Training dataset:
  * Demographics table: `s3://feldera-fraud-detection-data/demographics_train/`
  * Transaction table: `s3://feldera-fraud-detection-data/transaction_train/`

* Inference dataset:
  * Demographics table: `s3://feldera-fraud-detection-data/demographics_infer/`
  * Transaction table: `s3://feldera-fraud-detection-data/transaction_infer/`


## MODEL TRAINING AND TESTING

Finding an optimal set of features to train a good ML model is an iterative process. At every step, the data scientist trains and tests a model using currently selected feature queries on labeled historical data. The results of each experiment drive the next refinement of feature queries.

<img src="training_animation.gif" alt="training" width="800"/>

Below we show one iteration of this process: we define a set of features, train a model using these features, and test its accuracy.

### Feature queries

We define several features over our input tables:

* Data enrichment:
  * We add demographic attributes, such as zip code, to each transaction
* Rolling aggregates:
  * average spending per transaction in the past day, week, and month
  * average spending per transaction over a 3-month timeframe on the same day of the week
  * number of transactions made with this credit card in the last 24 hours
* Other:
  * `is_weekend` - transaction took place on a weekend
  * `is_night` - transaction took place at night (hour of day is 6 or earlier, i.e., before 7am)
  * `d` - day of week
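
As a rough illustration of the rolling aggregates listed above, the following plain-Python sketch computes, for every transaction, the average amount and the number of transactions on the same card within the preceding 24 hours. It is not how Feldera evaluates the query; the `rolling_features` helper and the sample rows are assumptions made for this example. The window includes both endpoints, mirroring a `RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW` frame.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from itertools import accumulate

DAY_SECONDS = 86400

def rolling_features(transactions, window_seconds=DAY_SECONDS):
    """For each (cc_num, unix_time, amt) row, compute the average amount and
    the number of same-card transactions whose unix_time lies in
    [unix_time - window_seconds, unix_time], both bounds inclusive."""
    by_card = defaultdict(list)
    for cc_num, unix_time, amt in transactions:
        by_card[cc_num].append((unix_time, amt))

    features = []
    for cc_num, rows in by_card.items():
        rows.sort()
        times = [t for t, _ in rows]
        # prefix[i] holds the sum of the first i amounts for this card.
        prefix = [0.0] + list(accumulate(amt for _, amt in rows))
        for unix_time, amt in rows:
            lo = bisect_left(times, unix_time - window_seconds)
            hi = bisect_right(times, unix_time)
            count = hi - lo
            features.append({
                "cc_num": cc_num,
                "unix_time": unix_time,
                "amt": amt,
                "avg_spend": (prefix[hi] - prefix[lo]) / count,
                "trans_freq": count,
            })
    return features

# Hypothetical sample: three transactions on card 1, one on card 2.
sample = [(1, 0, 10.0), (1, 3600, 30.0), (1, 90000, 50.0), (2, 100, 5.0)]
features = rolling_features(sample)
for f in features:
    print(f)
```

For the last transaction on card 1, the window covers timestamps 3600 through 90000, so the first transaction has already dropped out of the average.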

The following Python function uses the Feldera Python SDK to create a SQL program consisting of two tables with raw input data (`TRANSACTION` and `DEMOGRAPHICS`) and the `FEATURE` view, which computes the above features over these tables.

In [1]:
from feldera import FelderaClient, SQLContext, SQLSchema

def build_program(client, pipeline_name):
    sql = SQLContext(pipeline_name, client).get_or_create()
    # Declare input table with raw credit card transaction data.
    sql.register_table(
        "TRANSACTION",
        SQLSchema(
            {
                "trans_date_trans_time": "TIMESTAMP",
                "cc_num": "BIGINT",
                "merchant": "STRING",
                "category": "STRING",
                "amt": "DOUBLE",
                "trans_num": "STRING",
                "unix_time": "BIGINT",
                "merch_lat": "DOUBLE",
                "merch_long": "DOUBLE",
                "is_fraud": "BIGINT",
            }
        ),
    )

    # Declare input table with demographics data.
    sql.register_table(
        "DEMOGRAPHICS",
        SQLSchema(
            {
                "cc_num": "BIGINT",
                "first": "STRING",
                "last": "STRING",
                "gender": "STRING",
                "street": "STRING",
                "city": "STRING",
                "state": "STRING",
                "zip": "BIGINT",
                "lat": "DOUBLE",
                "long": "DOUBLE",
                "city_pop": "BIGINT",
                "job": "STRING",
                "dob": "DATE",
            }
        ),
    )

    # Feature query written in the Feldera SQL dialect.
    query = """
        SELECT
           t.cc_num,
           -- Demographic attributes
           zip,
           city_pop,
           -- Day-of-week
           dayofweek(trans_date_trans_time) as d,
           -- is_weekend flag
           CASE
             WHEN dayofweek(trans_date_trans_time) IN(6, 7) THEN true
             ELSE false
           END AS is_weekend,
           -- hour of day
           hour(trans_date_trans_time) as hour_of_day,
           -- is_night flag
           CASE
             WHEN hour(trans_date_trans_time) <= 6 THEN true
             ELSE false
           END AS is_night,
           -- Average spending per day, per week, and per month.
           AVG(amt) OVER window_1_day AS avg_spend_pd,
           AVG(amt) OVER window_7_day AS avg_spend_pw,
           AVG(amt) OVER window_30_day AS avg_spend_pm,
           -- Average spending over the last three months for the same day of the week.
           COALESCE(
            AVG(amt) OVER (
              PARTITION BY t.cc_num, dayofweek(trans_date_trans_time)
              ORDER BY unix_time
              RANGE BETWEEN 7776000 PRECEDING and CURRENT ROW
            ), 0) AS avg_spend_p3m_over_d,
           -- Number of transactions in the last 24 hours.
           COUNT(*) OVER window_1_day AS trans_freq_24,
           -- Transaction amount
           amt,
           -- Transaction time
           unix_time,
           -- Ground truth label
           is_fraud
        -- Enrich transaction data with demographic data
        FROM transaction as t
        JOIN demographics as d
        ON t.cc_num = d.cc_num
        WINDOW
          window_1_day AS (PARTITION BY t.cc_num ORDER BY unix_time RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW),
          window_7_day AS (PARTITION BY t.cc_num ORDER BY unix_time RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW),
          window_30_day AS (PARTITION BY t.cc_num ORDER BY unix_time RANGE BETWEEN 2592000 PRECEDING AND CURRENT ROW);
      """

    sql.register_view("FEATURE", query)
    return sql


### Train & test the model

Overview of the following code:
- Connect to a Feldera service (we use [try.feldera.com](https://try.feldera.com))
- Create a pipeline to evaluate feature queries over transaction and demographics tables in S3
- Run the pipeline to process all input data **to completion**.
- Read the computed features into a Pandas dataframe
- Split the dataframe into train and test sets
- Train an XGBoost model and measure its accuracy.

In [2]:
# Helper functions for model training & testing

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Split input dataframe into train and test sets
def get_train_test_data(dataframe, feature_cols, target_col, train_test_split_ratio, random_seed):
    X = dataframe[feature_cols]
    y = dataframe[target_col]
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = train_test_split_ratio, random_state = random_seed)

    return X_train, X_test, y_train, y_test

# Train a gradient-boosted decision tree classifier using XGBoost.
# Other ML frameworks and types of ML models can be readily used with Feldera.
def train_model(dataframe, config):
    max_depth = 12
    n_estimators = 100

    X_train, X_test, y_train, y_test = get_train_test_data(
        dataframe,
        config['feature_cols'],
        config['target_col'],
        config['train_test_split_ratio'],
        config['random_seed'])

    model = XGBClassifier(
        max_depth=max_depth,
        n_estimators=n_estimators,
        objective="binary:logistic")

    model.fit(X_train, y_train.values.ravel())
    return model, X_test, y_test

# Evaluate prediction accuracy against ground truth.
def eval_metrics(y, predictions):
    cm = confusion_matrix(y, predictions)
    print("Confusion matrix:")
    print(cm)

    if len(cm) < 2 or cm[1][1] == 0:  # checking if there are no true positives
        print('No fraudulent transaction to evaluate')
        return
    else:
        precision = cm[1][1] / (cm[1][1] + cm[0][1])
        recall = cm[1][1] / (cm[1][1] + cm[1][0])
        f1 = (2 * (precision * recall) / (precision + recall))

    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")
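
To see what these metrics mean in practice, here is a small worked example on a made-up confusion matrix. It assumes the scikit-learn layout, in which rows are true labels and columns are predicted labels; the counts and the `precision_recall_f1` helper are hypothetical.

```python
def precision_recall_f1(cm):
    """Compute precision, recall and F1 for the positive (fraud) class from a
    2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 100 transactions, 5 of them fraudulent.
cm = [[90, 5],
      [2, 3]]
precision, recall, f1 = precision_recall_f1(cm)
accuracy = (cm[0][0] + cm[1][1]) / 100

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

In this example accuracy is 0.93 even though the detector catches only 3 of 5 fraudulent transactions, which is why precision and recall are more informative than plain accuracy on heavily imbalanced data such as fraud labels.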

In [None]:
import pandas as pd

DATA_URI = "s3://feldera-fraud-detection-data"

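# Connect to a Feldera instance. This notebook points at a local instance;
# to use the sandbox instead, pass the https://try.feldera.com URL together with
# an API key generated from the 'Settings' menu at try.feldera.com.
client = FelderaClient("http://localhost:8080")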

sql = build_program(client, "fraud_detection_training")

# Load DEMOGRAPHICS data from a Delta table stored in a public S3 bucket.
sql.connect_source_delta_table(
    "DEMOGRAPHICS",
    "demographics_train",
    {
        "uri": f"{DATA_URI}/demographics_train/",
        "mode": "snapshot",
        "aws_skip_signature": "true"
    }
)

# Load credit card TRANSACTION data.
sql.connect_source_delta_table(
    "TRANSACTION",
    "transaction_train",
    {
        "uri": f"{DATA_URI}/transaction_train/",
        "mode": "snapshot",
        "aws_skip_signature": "true",
        "timestamp_column": "unix_time"
    }
)

# sql.connect_sink_delta_table(
#     "FEATURE",
#     "feature_train",
#     {
#         "uri": "s3://feldera-fraud-detection-demo/feature_train",
#         "mode": "truncate",
#         "aws_access_key_id": dbutils.secrets.get("feature-engineering-demo", "AWS_ACCESS_KEY_ID"),
#         "aws_secret_access_key": dbutils.secrets.get("feature-engineering-demo", "AWS_SECRET_ACCESS_KEY"),
#         "aws_region": "us-east-1"

#     }
# )

hfeature = sql.listen("feature")

# Process full snapshot of the input tables and compute a dataset with feature vectors.
sql.run_to_completion()

# Read computed feature vectors into a Pandas dataframe.
features_pd = hfeature.to_pandas()
print(f"Computed {len(features_pd)} feature vectors")

print("Training the model")

feature_cols = list(features_pd.columns.drop('is_fraud'))

config={
        'feature_cols' : feature_cols,
        'target_col' : ['is_fraud'],
        'random_seed' : 45,
        'train_test_split_ratio' : 0.8
        }

trained_model, X_test, y_test = train_model(features_pd, config)

print("Testing the trained model")

y_pred = trained_model.predict(X_test)
eval_metrics(y_test, y_pred)

Computed 1300707 feature vectors
Training the model


## REAL-TIME INFERENCE

During real-time feature computation, raw data arrives from a streaming source like Kafka. Feldera can ingest data directly from such sources, but in this case we will assume that Kafka is connected to a Delta table, and configure Feldera to ingest the data by following the transaction log of the table.

<img src="inference_animation.gif" alt="inference" width="800"/>

Below, we create another Feldera pipeline to evaluate the feature query over streaming data:
- Build a pipeline identical to the training pipeline above, but using the inference dataset as input
- Configure the input connector for the `TRANSACTION` table to ingest transaction data in the `snapshot_and_follow` mode. In this mode, the connector reads the initial snapshot of the table before following the stream of changes in its transaction log. This **backfill** pattern is necessary to correctly evaluate features that depend on historical data such as rolling sums and averages.
- Run the pipeline for 30 seconds. For each batch of new feature vectors computed by the pipeline:
  - Read the data into a Pandas dataframe
  - Feed the dataframe to the trained ML model for inference
  - Measure model accuracy by comparing model prediction with the ground truth
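
The toy sketch below illustrates why the backfill step matters for rolling aggregates. It is plain Python rather than Feldera code; the `RollingAverage` class and the sample rows are assumptions made for this example. The same streamed transaction yields a different 24-hour average depending on whether the historical snapshot was loaded first.

```python
from collections import defaultdict, deque

class RollingAverage:
    """Per-card average over a trailing time window. Assumes that each card's
    transactions arrive in unix_time order."""

    def __init__(self, window_seconds=86400):
        self.window_seconds = window_seconds
        self.windows = defaultdict(deque)
        self.totals = defaultdict(float)

    def update(self, cc_num, unix_time, amt):
        window = self.windows[cc_num]
        window.append((unix_time, amt))
        self.totals[cc_num] += amt
        # Evict transactions that have fallen out of the window.
        while window[0][0] < unix_time - self.window_seconds:
            _, expired = window.popleft()
            self.totals[cc_num] -= expired
        return self.totals[cc_num] / len(window)

# Hypothetical data: two historical transactions, then one streamed transaction.
snapshot = [(1, 1000, 100.0), (1, 2000, 200.0)]
stream = [(1, 3000, 30.0)]

# With backfill: load the snapshot first, then follow the stream.
with_backfill = RollingAverage()
for row in snapshot:
    with_backfill.update(*row)
avg_with_backfill = [with_backfill.update(*row) for row in stream]

# Without backfill: start from an empty state.
without_backfill = RollingAverage()
avg_without_backfill = [without_backfill.update(*row) for row in stream]

print(avg_with_backfill, avg_without_backfill)
```

Without the snapshot, the average for the streamed transaction is just its own amount, so the model would see a feature value that does not match what it was trained on.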

In [None]:
# Helper function: feed a Pandas dataframe to the trained model for inference.
def inference(trained_model, df):
    print(f"\nReceived {len(df)} feature vectors.")
    if len(df) == 0:
        return

    feature_cols_inf = list(df.columns.drop('is_fraud'))
    X_inf = df[feature_cols_inf].values  # convert to numpy array
    y_inf = df["is_fraud"].values
    predictions_inf = trained_model.predict(X_inf)

    eval_metrics(y_inf, predictions_inf)

In [None]:
import time

# How long to run the inference pipeline for.
INFERENCE_TIME_SECONDS = 30

print("Running the inference pipeline")

sql = build_program(client, "fraud_detection_inference")

# Load DEMOGRAPHICS data from a Delta table.
sql.connect_source_delta_table(
    "DEMOGRAPHICS",
    "demographics_infer",
    {
        "uri": f"{DATA_URI}/demographics_infer",
        "mode": "snapshot",
        "aws_skip_signature": "true"
    }
)

# Read TRANSACTION data from a Delta table.
# Configure the Delta Lake connector to read the initial snapshot of
# the table before following the stream of changes in its transaction log.
sql.connect_source_delta_table(
    "TRANSACTION",
    "transaction_infer",
    {
        "uri": f"{DATA_URI}/transaction_infer",
        "mode": "snapshot_and_follow",
        "version": 10,
        "timestamp_column": "unix_time",
        "aws_skip_signature": "true"
    }
)

# sql.connect_sink_delta_table(
#     "FEATURE",
#     "feature_infer",
#     {
#         "uri": "s3://feldera-fraud-detection-demo/feature_infer",
#         "mode": "truncate",
#         "aws_access_key_id": dbutils.secrets.get("feature-engineering-demo", "AWS_ACCESS_KEY_ID"),
#         "aws_secret_access_key": dbutils.secrets.get("feature-engineering-demo", "AWS_SECRET_ACCESS_KEY"),
#         "aws_region": "us-east-1"

#     }
# )

sql.foreach_chunk("feature", lambda df, chunk : inference(trained_model, df))

# Start the pipeline to continuously process the input stream of credit card
# transactions and pass each batch of new feature vectors to `inference`.
sql.start()

time.sleep(INFERENCE_TIME_SECONDS)

print(f"Shutting down the inference pipeline after {INFERENCE_TIME_SECONDS} seconds")
sql.shutdown()

### Monitoring the inference pipeline in the Web Console

While the above code is running, you can monitor the pipeline in the Feldera Web Console:

<img src="web_console.gif" alt = "Web Console" width="1000"/>

## TAKEAWAYS

In this example we used Feldera to evaluate *the same feature queries* first over historical (batch) data and then over a combination of historical and streaming inputs.  Feldera's ability to operate on any combination of batch and streaming sources eliminates the need to develop multiple implementations of the same queries for development and production environments.  **In fact, Feldera does not distinguish between the two**.  It processes inputs in the same way and produces the same outputs, whether they arrive frequently in small groups (aka streaming) or occasionally in bigger groups (aka batch).

Upon receiving new inputs, Feldera updates its output views without full re-computation, by doing work proportional to the size of the new data rather than the size of the entire database.  This **incremental evaluation** makes Feldera efficient for both streaming and batch inputs.

Finally, we would like to emphasize that **Feldera is strongly consistent**. If
we pause our inference pipeline and inspect the contents of the output view
produced by Feldera so far, it will be **precisely the same as if we ran the
query on all the inputs received so far as one large batch**.  Unpause the
pipeline and run it a little longer.  The pipeline will receive some additional
inputs and produce additional outputs, but it still preserves the same
input/output guarantee.  This property, known as **strong consistency**, ensures
that the prediction accuracy of your ML model will not be affected by incorrect input.
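
The incremental evaluation and consistency properties described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Feldera's actual algorithm: the view is just the total spend (in cents) and transaction count per card, and the `batch_view` and `IncrementalView` names are hypothetical. Each update touches only the rows in the new chunk, yet after every chunk the view equals what a from-scratch batch computation over all inputs seen so far would produce.

```python
from collections import defaultdict

def batch_view(rows):
    """Recompute the view from scratch: (total cents, count) per card."""
    totals = defaultdict(lambda: [0, 0])
    for cc_num, amt_cents in rows:
        totals[cc_num][0] += amt_cents
        totals[cc_num][1] += 1
    return {cc: (total, count) for cc, (total, count) in totals.items()}

class IncrementalView:
    """Maintain the same view by applying only the newly arrived rows."""

    def __init__(self):
        self.totals = defaultdict(lambda: [0, 0])

    def apply(self, chunk):
        # Work done here is proportional to len(chunk), not to all history.
        for cc_num, amt_cents in chunk:
            self.totals[cc_num][0] += amt_cents
            self.totals[cc_num][1] += 1

    def contents(self):
        return {cc: (total, count) for cc, (total, count) in self.totals.items()}

# Hypothetical input: ten transactions spread over three cards.
rows = [(i % 3, 100 * (i + 1)) for i in range(10)]

view = IncrementalView()
seen = 0
consistent = True
for chunk_size in (1, 2, 3, 4):  # inputs arrive in chunks of varying size
    chunk = rows[seen:seen + chunk_size]
    view.apply(chunk)
    seen += len(chunk)
    # After every chunk, compare against a full batch recomputation.
    consistent = consistent and view.contents() == batch_view(rows[:seen])

print("Consistent after every chunk:", consistent)
```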
