# Feathr Fraud Detection Sample

This notebook illustrates how to use the Feathr Feature Store to create a model that predicts the fraud status of transactions based on user account data and transaction data. All the data used in the notebook can be found here: https://github.com/microsoft/r-server-fraud-detection


In this notebook, we:
1. Install the latest Feathr code (to include some unreleased features)
2. Define Environment Variables & `yaml_config` Settings
3. Create `FeathrClient` and Define `FeatureAnchor`
4. `build_features` and `get_offline_features`
5. Train Fraud Detection Model with `KNeighborsClassifier`
6. `materialize_features` and `multi_get_online_features`
7. `register_features` and `list_registered_features`

## Setup Feathr Developer Environment

***Prior to running the notebook, if you have not deployed all the required resources, please refer to the guide here and follow the steps to do so: https://feathr-ai.github.io/feathr/how-to-guides/azure-deployment-arm.html***

In [4]:
# Install feathr from the latest code in the repo. You may use `pip install feathr[notebook]` as well.
!pip install "git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project&egg=feathr[notebook]"  

Collecting feathr[notebook]
  Cloning https://github.com/feathr-ai/feathr.git to /tmp/pip-install-zsn6kl5o/feathr_36df02ae02da42a0a56397c6e9e058e7
  Running command git clone --filter=blob:none --quiet https://github.com/feathr-ai/feathr.git /tmp/pip-install-zsn6kl5o/feathr_36df02ae02da42a0a56397c6e9e058e7
  Resolved https://github.com/feathr-ai/feathr.git to commit 3ebaa49cf36dac90005fb2cbf5412d9103871e9b
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting py4j<=0.10.9.7
  Using cached py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
Collecting Jinja2<=3.1.2
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting pyspark>=3.1.2
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 7.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ...

In [1]:
from datetime import datetime, timedelta
import glob
from math import sqrt
import os
import tempfile

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from sklearn.model_selection import train_test_split

import feathr
from feathr import (
    FeathrClient,
    STRING, BOOLEAN, FLOAT, INT32, ValueType,
    Feature, DerivedFeature, FeatureAnchor,
    BackfillTime, MaterializationSettings,
    FeatureQuery, ObservationSettings,
    RedisSink,
    HdfsSource,
    WindowAggTransformation,
    TypedKey,
)
from feathr.utils.config import generate_config


print(f"Feathr version: {feathr.__version__}")

ModuleNotFoundError: No module named 'sklearn'

In [18]:
RESOURCE_PREFIX = None  # TODO fill the value used to deploy the resources via ARM template
PROJECT_NAME = "fraud_detection"

# Currently support: 'azure_synapse', 'databricks', and 'local' 
SPARK_CLUSTER = "local"

# TODO fill values to use databricks cluster:
DATABRICKS_CLUSTER_ID = None             # Set Databricks cluster id to use an existing cluster
DATABRICKS_URL = None                    # Set Databricks workspace url to use databricks
DATABRICKS_WORKSPACE_TOKEN_VALUE = None  # Set Databricks workspace token to use databricks

# TODO fill values to use Azure Synapse cluster:
AZURE_SYNAPSE_SPARK_POOL = None  # Set Azure Synapse Spark pool name
AZURE_SYNAPSE_URL = None         # Set Azure Synapse workspace url to use Azure Synapse
ADLS_KEY = None                  # Set Azure Data Lake Storage key to use Azure Synapse

# An existing Feathr config file path. If None, we'll generate a new config based on the constants in this cell.
FEATHR_CONFIG_PATH = None

# If True, authenticate through the Azure CLI (`az login` with a device code) to get the Redis password.
USE_CLI_AUTH = False

In [19]:
# TODO remove this cell
RESOURCE_PREFIX = "juntest"
USE_CLI_AUTH = True

In [12]:
if SPARK_CLUSTER == "azure_synapse" and not os.environ.get("ADLS_KEY"):
    os.environ["ADLS_KEY"] = ADLS_KEY
elif SPARK_CLUSTER == "databricks" and not os.environ.get("DATABRICKS_WORKSPACE_TOKEN_VALUE"):
    os.environ["DATABRICKS_WORKSPACE_TOKEN_VALUE"] = DATABRICKS_WORKSPACE_TOKEN_VALUE
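
Note that `os.environ` only accepts string values, so the cell above raises a `TypeError` if the selected cluster's constant is still `None`. A minimal sketch of a guard, assuming a hypothetical helper `set_env_if_missing` (it is not part of Feathr) and a made-up variable name:

```python
import os
from typing import Optional


def set_env_if_missing(name: str, value: Optional[str]) -> bool:
    """Set os.environ[name] only if it is unset or empty and a value was provided.

    Returns True when the variable is set after the call.
    """
    if os.environ.get(name):
        return True  # keep the value that is already exported
    if value is None:
        return False  # nothing to set; avoids `os.environ[name] = None` raising TypeError
    os.environ[name] = value
    return True


# Hypothetical variable name used only for this demonstration.
print(set_env_if_missing("FEATHR_DEMO_KEY", None))
print(set_env_if_missing("FEATHR_DEMO_KEY", "secret"))
```

A `False` return value is a convenient place to stop early with a clear message instead of failing later inside the Spark job submission.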

In [15]:
if USE_CLI_AUTH:
    !az login --use-device-code

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code R6F3RHUP2 to authenticate.
Failed to authenticate '{'additional_properties': {}, 'id': '/tenants/d3e49573-1ecc-4a43-adfb-7400029d7049', 'tenant_id': 'd3e49573-1ecc-4a43-adfb-7400029d7049', 'tenant_category': 'Home', 'country': None, 'country_code': None, 'display_name': None, 'domains': None}' due to error 'Get Token request returned http error: 400 and server response: {"error":"invalid_grant","error_description":"AADSTS50020: User account '{EmailHidden}' from identity provider 'https://sts.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47/' does not exist in tenant 'iconfitness.com' and cannot access the application '04b07795-8ddb-461a-bbee-02f9e1bf7b46'(Microsoft Azure CLI) in that tenant. The account needs to be added as an external user in the tenant first. Sign out and sign in again with a different Azure Active Directory user account.\r\nTrace ID: 0e29aa07-07fb-4843-a

## Permission
To run the cells below, you need additional permissions: your managed identity must be allowed to access the Key Vault, and your user must be allowed to access the Storage Blob. Run the following commands in the Cloud Shell to grant yourself access.

```
userId=<email_id_of_account_requesting_access>
resource_prefix=<resource_prefix>
synapse_workspace_name="${resource_prefix}syws"
keyvault_name="${resource_prefix}kv"
objectId=$(az ad user show --id $userId --query id -o tsv)
az keyvault update --name $keyvault_name --enable-rbac-authorization false
az keyvault set-policy -n $keyvault_name --secret-permissions get list --object-id $objectId
az role assignment create --assignee $userId --role "Storage Blob Data Contributor"
az synapse role assignment create --workspace-name $synapse_workspace_name --role "Synapse Contributor" --assignee $userId
```

In [16]:
# Create the credential up front: it is also passed to `FeathrClient` below,
# so it must exist even when REDIS_PASSWORD is already set.
from azure.identity import AzureCliCredential, DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

if USE_CLI_AUTH:
    credential = AzureCliCredential(additionally_allowed_tenants=['*'])
else:
    credential = DefaultAzureCredential(
        exclude_interactive_browser_credential=False,
        additionally_allowed_tenants=['*'],
    )

# Redis password
if 'REDIS_PASSWORD' not in os.environ:
    # Retrieve the online store connection string from Azure Key Vault.
    # TODO assume the resources are deployed by using the ARM template. If not, please set your vault url name.
    vault_url = f"https://{RESOURCE_PREFIX}kv.vault.azure.net"
    secret_client = SecretClient(vault_url=vault_url, credential=credential)
    retrieved_secret = secret_client.get_secret('FEATHR-ONLINE-STORE-CONN').value
    # The password is read from the second comma-separated field of the connection string.
    os.environ['REDIS_PASSWORD'] = retrieved_secret.split(",")[1].split("password=", 1)[1]

# feathr_output_path = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_output'
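
The cell above extracts the Redis password from the `FEATHR-ONLINE-STORE-CONN` secret with two `split` calls. A minimal sketch of the same idea on a made-up connection string follows. The `host:port,password=...,ssl=True` layout is an assumption inferred from that parsing code, and `parse_redis_password` is a hypothetical helper that looks the field up by name rather than by position:

```python
def parse_redis_password(conn_str: str) -> str:
    """Return the value of the `password=` field of a comma-separated connection string."""
    for part in conn_str.split(","):
        if part.startswith("password="):
            # Split only once so that '=' characters inside the password are preserved.
            return part.split("password=", 1)[1]
    raise ValueError("no 'password=' field found in the connection string")


# Made-up value for illustration only; not a real secret.
example = "myredis.redis.cache.windows.net:6380,password=abc123==,ssl=True"
print(parse_redis_password(example))  # abc123==
```

Looking the field up by name fails with an explicit error when the secret has an unexpected layout, instead of an `IndexError` from positional indexing.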

In [21]:
if FEATHR_CONFIG_PATH:
    config_path = FEATHR_CONFIG_PATH
else:
    config_path = generate_config(
        resource_prefix=RESOURCE_PREFIX,
        project_name=PROJECT_NAME,
        spark_config__spark_cluster=SPARK_CLUSTER,
        spark_config__azure_synapse__dev_url=AZURE_SYNAPSE_URL,
        spark_config__azure_synapse__pool_name=AZURE_SYNAPSE_SPARK_POOL,
        spark_config__databricks__workspace_instance_url=DATABRICKS_URL,
        databricks_cluster_id=DATABRICKS_CLUSTER_ID,
    )

with open(config_path, 'r') as f: 
    print(f.read())

api_version: 1
feature_registry:
  api_endpoint: https://juntestwebapp.azurewebsites.net/api/v1
offline_store:
  adls:
    adls_enabled: 'true'
  wasb:
    wasb_enabled: 'true'
online_store:
  redis:
    host: juntestredis.redis.cache.windows.net
    port: '6380'
    ssl_enabled: 'true'
project_config:
  project_name: fraud_detection
spark_config:
  spark_cluster: local
  spark_result_output_parts: '1'
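
The keyword arguments passed to `generate_config` use double underscores to address nested keys: `spark_config__spark_cluster` ends up under `spark_config:` as `spark_cluster:` in the YAML printed above. A minimal sketch of that naming convention with plain dictionaries; `nest_keys` is a hypothetical helper for illustration, not Feathr's implementation, and skipping `None` values is an assumption of this sketch:

```python
from typing import Any, Dict


def nest_keys(flat: Dict[str, Any], sep: str = "__") -> Dict[str, Any]:
    """Expand keys like 'a__b__c' into nested dictionaries {'a': {'b': {'c': value}}}."""
    nested: Dict[str, Any] = {}
    for key, value in flat.items():
        if value is None:
            continue  # skip options that were left unset
        node = nested
        *parents, leaf = key.split(sep)
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return nested


flat_options = {
    "spark_config__spark_cluster": "local",
    "spark_config__azure_synapse__pool_name": None,
    "project_config__project_name": "fraud_detection",
}
print(nest_keys(flat_options))
# {'spark_config': {'spark_cluster': 'local'}, 'project_config': {'project_name': 'fraud_detection'}}
```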



## Initialize `FeathrClient`
- `FeathrClient`

In [22]:
client = FeathrClient(config_path=config_path, credential=credential)

2022-12-06 12:39:40.576 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - secrets__azure_key_vault__name not found in the config file.
2022-12-06 12:39:40.600 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - offline_store__s3__s3_enabled not found in the config file.
2022-12-06 12:39:40.619 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - offline_store__jdbc__jdbc_enabled not found in the config file.
2022-12-06 12:39:40.624 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - offline_store__snowflake__snowflake_enabled not found in the config file.
2022-12-06 12:39:40.634 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - spark_config__local__feathr_runtime_location not found in the config file.
2022-12-06 12:39:40.640 | INFO     | feathr.utils._envvariableutil:get_environment_variable_with_default:51 - spark_

## Define Features
- `HdfsSource`
- `TypedKey`
- `Feature`
- `FeatureAnchor`
- `DerivedFeature`

In [14]:
from feathr.datasets.constants import FRAUD_DETECTION_ACCOUNT_INFO_URL, FRAUD_DETECTION_TRANSACTIONS_URL

from feathr.datasets.utils import maybe_download

from pathlib import Path

In [15]:
# Download the datasets if they are not already available locally

account_info_file_path = str(Path(PROJECT_NAME, "account_info.csv"))
transactions_file_path = str(Path(PROJECT_NAME, "transactions.csv"))
maybe_download(
    src_url=FRAUD_DETECTION_ACCOUNT_INFO_URL,
    dst_filepath=account_info_file_path,
)
maybe_download(
    src_url=FRAUD_DETECTION_TRANSACTIONS_URL,
    dst_filepath=transactions_file_path,
)

NameError: name 'PROJECT_NAME' is not defined

### Account Features

In [None]:
# Refer to <https://feathr.readthedocs.io/en/latest/feathr.html> to learn more about the details of each method
account_info = HdfsSource(
    name="AccountData",
    path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/account_out_small.csv",
    event_timestamp_column="transactionDate",
    timestamp_format="yyyyMMdd",
)

accountId = TypedKey(key_column="accountID",
                       key_column_type=ValueType.INT32,
                       description="account id")

account_country = Feature(name="account_country",
                           key=accountId,
                           feature_type=STRING, 
                           transform="accountCountry")

is_user_registered = Feature(name="is_user_registered",
                                    key=accountId,
                                    feature_type=BOOLEAN,
                                    transform="isUserRegistered==TRUE")

num_payment_rejects_1d_per_user = Feature(name="num_payment_rejects_1d_per_user",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform="numPaymentRejects1dPerUser")

account_age = Feature(name="account_age",
                      key=accountId,
                      feature_type=INT32,
                      transform="accountAge")
                                    
features = [
    account_country,
    account_age,
    is_user_registered,
    num_payment_rejects_1d_per_user
]

account_anchor = FeatureAnchor(name="account_features",
                               source=account_info,
                               features=features)

### Transaction Features

In [None]:
# Refer to <https://feathr.readthedocs.io/en/latest/feathr.html> to learn more about the details of each method

transaction_data = HdfsSource(name="transaction_data",
                          path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/transaction_out_small.csv",
                          event_timestamp_column="transactionDate",
                          timestamp_format="yyyyMMdd")

transaction_id = Feature(name="transaction_id",
                        key=accountId,
                        feature_type=STRING,
                        transform="transactionID")

transaction_currency_code = Feature(name="transaction_currency_code",
                                    key=accountId,
                                    feature_type=STRING,
                                    transform="transactionCurrencyCode")
                           
transaction_amount = Feature(name="transaction_amount",
                            key=accountId,
                            feature_type=FLOAT,
                            transform="transactionAmount")

transaction_device_id = Feature(name="transaction_device_id",
                                key=accountId,
                                feature_type=FLOAT,
                                transform="transactionDeviceId")

transaction_ip_address = Feature(name="transaction_ip_address",
                                key=accountId,
                                feature_type=FLOAT,
                                transform="transactionIPaddress")

transaction_time = Feature(name="transaction_time",
                            key=accountId,
                            feature_type=INT32,
                            transform="transactionTime")

fraud_status = Feature(name="fraud_status",
                       key=accountId,
                       feature_type=STRING,
                       transform="fraud_tag")

features = [
    transaction_id,
    transaction_amount,
    transaction_device_id,
    transaction_ip_address,
    transaction_time,
    transaction_currency_code,
    fraud_status
]

transaction_feature_anchor = FeatureAnchor(name="transaction_features",
                                            source=transaction_data,
                                            features=features)

### Transaction Aggregated Features

In [None]:
# Source for the windowed aggregations over the transaction data
transactions_aggr = HdfsSource(name="transactions_aggr",
                          path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/transaction_out_small.csv",
                          event_timestamp_column="transactionDate",
                          timestamp_format="yyyyMMdd")

# average transaction amount in the past week
avg_transaction_amount = Feature(name="avg_transaction_amount",
                                key=accountId,
                                feature_type=FLOAT,
                                transform=WindowAggTransformation(agg_expr="cast_float(transactionAmount)",
                                                            agg_func="AVG",
                                                            window="7d"))

# number of transactions that took place in the past day
num_trasaction_count_in_day = Feature(name="num_trasaction_count_in_day",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform=WindowAggTransformation(agg_expr="transactionID",
                                                                agg_func="COUNT",
                                                                window="1d"))

# total transaction amount in the past day
total_transaction_amount_in_day = Feature(name="total_transaction_amount_in_day",
                                    key=accountId,
                                    feature_type=FLOAT,
                                    transform=WindowAggTransformation(agg_expr="cast_float(transactionAmount)",
                                                                agg_func="SUM",
                                                                window="1d"))

# average time of transaction in the past week
avg_transaction_time = Feature(name="avg_transaction_time",
                            key=accountId,
                            feature_type=INT32,
                            transform=WindowAggTransformation(agg_expr="cast_float(transactionTime)",
                                                          agg_func="AVG",
                                                          window="7d"))                                                            

# number of currency code entries in the past week (COUNT counts rows, not distinct currencies)
num_currency_type_in_week = Feature(name="num_currency_type_in_week",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform=WindowAggTransformation(agg_expr="transactionCurrencyCode",
                                                                    agg_func="COUNT",
                                                                    window="7d"))

# number of IP address entries in the past week (COUNT counts rows, not distinct addresses)
num_ip_address_count = Feature(name="num_ip_address_count",
                        key=accountId,
                        feature_type=INT32,
                        transform=WindowAggTransformation(agg_expr="transactionIPaddress",
                                                          agg_func="COUNT",
                                                          window="7d"))

# number of device ID entries in the past week (COUNT counts rows, not distinct devices)
num_device_count = Feature(name="num_device_count",
                            key=accountId,
                            feature_type=INT32,
                            transform=WindowAggTransformation(agg_expr="transactionDeviceId",
                                                            agg_func="COUNT",
                                                            window="7d"))

# find the time of most recent transaction
time_most_recent_transaction = Feature(name="time_most_recent_transaction",
                                        key=accountId,
                                        feature_type=INT32,
                                        transform=WindowAggTransformation(agg_expr="transactionTime",
                                                                        agg_func="LATEST",
                                                                        window="7d"))

features = [
    avg_transaction_amount,
    avg_transaction_time,
    total_transaction_amount_in_day,
    num_trasaction_count_in_day,
    num_currency_type_in_week,
    num_ip_address_count,
    num_device_count,
    time_most_recent_transaction
]

aggr_anchor = FeatureAnchor(name="transaction_aggr_features",
                            source=transactions_aggr,
                            features=features)
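
Conceptually, a `WindowAggTransformation` aggregates the rows of one key that fall inside a time window ending at the observation timestamp. The following pure-Python sketch mimics a 7-day and a 1-day `AVG` on made-up rows. The boundary handling (the window here is `(t - window, t]`) is an assumption and may differ from Feathr's Spark implementation; `window_avg` is a hypothetical helper:

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (accountID, transactionDate, transactionAmount): made-up rows for illustration
rows: List[Tuple[str, datetime, float]] = [
    ("A", datetime(2013, 4, 1), 10.0),
    ("A", datetime(2013, 4, 5), 30.0),
    ("A", datetime(2013, 4, 7), 50.0),
    ("B", datetime(2013, 4, 6), 5.0),
]


def window_avg(key: str, as_of: datetime, window: timedelta) -> Optional[float]:
    """Average amount for `key` over rows with as_of - window < date <= as_of."""
    amounts = [
        amount
        for account, date, amount in rows
        if account == key and as_of - window < date <= as_of
    ]
    return sum(amounts) / len(amounts) if amounts else None


print(window_avg("A", datetime(2013, 4, 7), timedelta(days=7)))  # 30.0
print(window_avg("A", datetime(2013, 4, 7), timedelta(days=1)))  # 50.0
```

Because the window is anchored at each observation's timestamp, rows dated after the observation never contribute to its feature values.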

### Derived Features
- `DerivedFeature`

In [None]:
# derived features
feature_diff_current_and_avg_amount = DerivedFeature(name="feature_diff_current_and_avg_amount",
                                                    key=accountId,
                                                    feature_type=FLOAT,
                                                    input_features=[
                                                        transaction_amount, avg_transaction_amount],
                                                    transform="transaction_amount - avg_transaction_amount")

feature_time_pass_after_most_recent_transaction = DerivedFeature(name="feature_time_pass_after_most_recent_transaction",
                                                                key=accountId,
                                                                feature_type=INT32,
                                                                input_features=[
                                                                    transaction_time, time_most_recent_transaction],
                                                                transform="cast_int(transaction_time) - cast_int(time_most_recent_transaction)")
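
A derived feature is computed from the values of its input features for the same key and observation row. A minimal sketch of the two expressions above on a made-up row, using plain Python in place of Feathr's expression engine (the values are arbitrary and only illustrate the arithmetic):

```python
# Made-up input feature values for a single observation row (illustration only).
row = {
    "transaction_amount": 120.0,
    "avg_transaction_amount": 45.5,
    "transaction_time": "153000",
    "time_most_recent_transaction": "121500",
}

# transform="transaction_amount - avg_transaction_amount"
diff_current_and_avg_amount = row["transaction_amount"] - row["avg_transaction_amount"]

# transform="cast_int(transaction_time) - cast_int(time_most_recent_transaction)"
time_pass_after_most_recent = int(row["transaction_time"]) - int(row["time_most_recent_transaction"])

print(diff_current_and_avg_amount)  # 74.5
print(time_pass_after_most_recent)  # 31500
```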

## Build Defined Features
- `build_features`

In [None]:
client.build_features(anchor_list=[account_anchor, transaction_feature_anchor, aggr_anchor], 
                      derived_feature_list=[feature_time_pass_after_most_recent_transaction, feature_diff_current_and_avg_amount])

## Get Offline Features
- `FeatureQuery`
- `ObservationSettings`
- `get_offline_features`
- `feathr_spark_launcher.download_result`

In [None]:
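if client.spark_runtime == 'databricks':
    output_path = 'dbfs:/feathrfrauddetection_test.avro'
else:
    # NOTE: `feathr_output_path` only appears in a commented-out line earlier in this notebook.
    # Define it before running this cell, otherwise this line raises a NameError.
    output_path = feathr_output_path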

feature_query = FeatureQuery(
    feature_list=["account_country",
                  "transaction_time",
                  "num_currency_type_in_week",
                  "num_trasaction_count_in_day",
                  "total_transaction_amount_in_day",
                  "fraud_status",
                  "is_user_registered",
                  "avg_transaction_amount",
                  "num_ip_address_count",
                  "num_device_count",
                  "time_most_recent_transaction",
                  "feature_diff_current_and_avg_amount",
                  "feature_time_pass_after_most_recent_transaction"], key=accountId)
                    
settings = ObservationSettings(
    observation_path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/observation_out_small.csv",
    event_timestamp_column="transactionDate",
    timestamp_format="yyyyMMdd")
    
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path=output_path)
client.wait_job_to_finish(timeout_sec=10000000000)

In [None]:
import pandas as pd
import pandavro as pdx
import glob
from pathlib import Path
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

from feathr import BackfillTime, MaterializationSettings, RedisSink

In [None]:
def get_result_df(client: FeathrClient) -> pd.DataFrame:
    """Download the job result dataset from cloud as a Pandas dataframe."""
    res_url = client.get_job_result_uri(block=True, timeout_sec=600)
    # The temporary directory is removed automatically when the `with` block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        client.feathr_spark_launcher.download_result(result_path=res_url, local_folder=tmp_dir)
        # Assuming the results are in Avro format
        dataframe_list = [
            pdx.read_avro(file) for file in glob.glob(os.path.join(tmp_dir, '*.avro'))
        ]
    if not dataframe_list:
        raise FileNotFoundError(f"No .avro result files were downloaded from {res_url}")
    return pd.concat(dataframe_list, axis=0)

df_res = get_result_df(client)

## Feature Visualization

In [None]:
filepath = Path('./result_out.csv')
df_res.to_csv(filepath, index=False)
df_res = df_res.reset_index(drop=True)
df_res

## Train Fraud Detection Model with Calculated Features

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 
import seaborn as sns

final_df = df_res
final_df.drop(['accountID'], axis=1, inplace=True, errors='ignore')
final_df.drop(['transactionDate'], axis=1, inplace=True, errors='ignore')
final_df.drop(['account_country'], axis=1, inplace=True, errors='ignore')
final_df = final_df.fillna(0)

x_train, x_test, y_train, y_test = train_test_split(final_df.drop(["fraud_status"], axis=1),
                                                    final_df["fraud_status"],
                                                    test_size=0.20,
                                                    random_state=0)
  
K = []
training = []
test = []
scores = {}
  
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(x_train, y_train)
  
    training_score = clf.score(x_train, y_train)
    test_score = clf.score(x_test, y_test)
    K.append(k)
  
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]

for keys, values in scores.items():
    print(keys, ':', values)
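
The loop above stores `[training_score, test_score]` for each `k`. One simple way to choose a value is to take the `k` with the highest test accuracy. A minimal sketch with made-up scores (the numbers are illustrative, not results from this notebook):

```python
# Made-up scores in the same shape as the `scores` dict built above:
# {k: [training_score, test_score]}
example_scores = {
    2: [0.99, 0.95],
    3: [0.98, 0.97],
    4: [0.97, 0.96],
}

# Choose the k whose test score (index 1) is highest. On ties, max() returns the
# first maximal key in insertion order, which is the smallest k here.
best_k = max(example_scores, key=lambda k: example_scores[k][1])
print(best_k, example_scores[best_k])  # 3 [0.98, 0.97]
```

Selecting `k` on the same split that is used for the final evaluation tends to give an optimistic estimate, so a separate validation split or cross-validation may be preferable outside of a demo.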

## Materialize Features in Redis
- `BackfillTime`
- `RedisSink`
- `materialize_features`
- `multi_get_online_features`

In [None]:
backfill_time = BackfillTime(start=datetime(
    2013, 4, 7), end=datetime(2013, 4, 7), step=timedelta(days=1))
redisSink = RedisSink(table_name="fraudDetectionDemoFeature")
settings = MaterializationSettings("fraudDetectionDemoFeature",
                                   backfill_time=backfill_time,
                                   sinks=[redisSink],
                                   feature_names=["fraud_status"])

client.materialize_features(settings, allow_materialize_non_agg_feature=True)
client.wait_job_to_finish(timeout_sec=5000)
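
`BackfillTime` describes a range of cutoff dates for materialization. A minimal sketch of how `start`, `end`, and `step` could expand into individual dates, assuming both endpoints are inclusive (so `start == end` yields a single date, as in the cell above); `backfill_dates` is a hypothetical helper, not a Feathr API:

```python
from datetime import datetime, timedelta
from typing import List


def backfill_dates(start: datetime, end: datetime, step: timedelta) -> List[datetime]:
    """Enumerate cutoff dates from start to end (inclusive) in increments of step."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


print(backfill_dates(datetime(2013, 4, 7), datetime(2013, 4, 7), timedelta(days=1)))
# [datetime.datetime(2013, 4, 7, 0, 0)]
print(len(backfill_dates(datetime(2013, 4, 1), datetime(2013, 4, 7), timedelta(days=1))))  # 7
```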

In [None]:
client.multi_get_online_features('fraudDetectionDemoFeature', ['1759222192247110', '914800996051170'], [
                                 "fraud_status"])

In [None]:
client.multi_get_online_features('fraudDetectionDemoFeature', ['1759222192247110', '914800996051170', '844428033864668'], [
                                 "fraud_status"])

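`multi_get_online_features` takes a table name, a list of keys, and a list of feature names. Assuming the result is a dictionary that maps each key to a list of values in the same order as the requested feature names (an assumption about the return shape; the values below are made up), it can be turned into per-key records like this:

```python
from typing import Any, Dict, List

feature_names = ["fraud_status"]

# Hypothetical result in the assumed shape {key: [value_per_feature, ...]}.
example_result: Dict[str, List[Any]] = {
    "1759222192247110": ["0"],
    "914800996051170": ["1"],
}

# Pair each value with its feature name to get readable records.
records = {
    key: dict(zip(feature_names, values))
    for key, values in example_result.items()
}
print(records)
# {'1759222192247110': {'fraud_status': '0'}, '914800996051170': {'fraud_status': '1'}}
```
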
## Register Features with Registry APIs
- `register_features`
- `list_registered_features`
- The calls above are sent to a standard registry API service (both `Purview` and `SQL` backends are supported)
- A friendlier interface with detailed lineage is available in the [Feathr UI](https://feathr-sql-registry.azurewebsites.net/)

In [None]:
client.register_features()
client.list_registered_features(project_name=PROJECT_NAME)