# Feathr Fraud Detection Sample

This notebook illustrates how to use a feature store to build a model that predicts whether a transaction is fraudulent, based on user account data and transaction data. All the data used in the notebook can be found here: https://github.com/microsoft/r-server-fraud-detection


In this notebook, we:
1. Install the latest Feathr code (to include some unreleased features) 
2. Define Environment Variables & `yaml_config` Settings 
3. Create `FeathrClient` and Define `FeatureAnchor`
4. `build_features` and `get_offline_features` 
5. Train Fraud Detection Model with `KNeighborsClassifier`
6. `materialize_features` and `multi_get_online_features`
7. `register_features` and `list_registered_features`

## Setup Feathr Developer Environment

***Prior to running the notebook, if you have not deployed all the required resources, please refer to the guide here and follow the steps to do so: https://linkedin.github.io/feathr/how-to-guides/azure-deployment-arm.html***

In [None]:
! pip install feathr azure-cli

In [None]:
import glob
import os
import tempfile
from datetime import datetime, timedelta
from math import sqrt

from feathr import FeathrClient
from feathr import STRING, BOOLEAN, FLOAT, INT32, ValueType
from feathr import Feature, DerivedFeature, FeatureAnchor
from feathr import BackfillTime, MaterializationSettings
from feathr import FeatureQuery, ObservationSettings
from feathr import RedisSink
from feathr import INPUT_CONTEXT, HdfsSource
from feathr import WindowAggTransformation
from feathr import TypedKey
from sklearn.model_selection import train_test_split
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

In [None]:
! az login --use-device-code

In [None]:
# replace with your prefix
resource_prefix = "<your_prefix_here>"

## Permission
To run the cells below, you need two additional permissions: access to the Key Vault for your identity, and access to Storage Blob data for your user. Run the following commands in Cloud Shell to grant yourself access.

~~~ 
userId=<email_id_of_account_requesting_access>
resource_prefix=<resource_prefix>
synapse_workspace_name="${resource_prefix}syws"
keyvault_name="${resource_prefix}kv"
objectId=$(az ad user show --id $userId --query id -o tsv)
az keyvault update --name $keyvault_name --enable-rbac-authorization false
az keyvault set-policy -n $keyvault_name --secret-permissions get list --object-id $objectId
az role assignment create --assignee $userId --role "Storage Blob Data Contributor"
az synapse role assignment create --workspace-name $synapse_workspace_name --role "Synapse Contributor" --assignee $userId
~~~

In [None]:
# Get all the required credentials from Azure Key Vault
key_vault_name=resource_prefix+"kv"
synapse_workspace_url=resource_prefix+"syws"
adls_account=resource_prefix+"dls"
adls_fs_name=resource_prefix+"fs"
purview_name=resource_prefix+"purview"
key_vault_uri = f"https://{key_vault_name}.vault.azure.net"
credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)
client = SecretClient(vault_url=key_vault_uri, credential=credential)
secretName = "FEATHR-ONLINE-STORE-CONN"
retrieved_secret = client.get_secret(secretName).value

# Parse the Redis connection string (expected form: host:port,password=...,ssl=...)
redis_port=retrieved_secret.split(',')[0].split(":")[1]
redis_host=retrieved_secret.split(',')[0].split(":")[0]
redis_password=retrieved_secret.split(',')[1].split("password=",1)[1]
redis_ssl=retrieved_secret.split(',')[2].split("ssl=",1)[1]

# Set the resource link
os.environ['spark_config__azure_synapse__dev_url'] = f'https://{synapse_workspace_url}.dev.azuresynapse.net'
os.environ['spark_config__azure_synapse__pool_name'] = 'spark31'
os.environ['spark_config__azure_synapse__workspace_dir'] = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_project'
os.environ['feature_registry__purview__purview_name'] = f'{purview_name}'
os.environ['online_store__redis__host'] = redis_host
os.environ['online_store__redis__port'] = redis_port
os.environ['online_store__redis__ssl_enabled'] = redis_ssl
os.environ['REDIS_PASSWORD']=redis_password
feathr_output_path = f'abfss://{adls_fs_name}@{adls_account}.dfs.core.windows.net/feathr_output'
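
The Redis parsing in the cell above relies on the position of each comma-separated segment. As a minimal sketch, assuming the secret has the form `host:port,password=...,ssl=...` (the example string and the helper name `parse_redis_connection_string` are hypothetical), the same values can be looked up by key instead of by position:

```python
def parse_redis_connection_string(conn: str) -> dict:
    """Split 'host:port,password=...,ssl=...' into its parts."""
    endpoint, *options = conn.split(",")
    host, port = endpoint.split(":")
    # Each remaining segment is 'key=value'; split only on the first '='
    # because a password may itself contain '=' characters.
    parsed = dict(opt.split("=", 1) for opt in options)
    return {
        "host": host,
        "port": port,
        "password": parsed["password"],
        "ssl": parsed["ssl"],
    }


# Hypothetical connection string, for illustration only.
example = "myredis.redis.cache.windows.net:6380,password=abc123==,ssl=True"
parts = parse_redis_connection_string(example)
print(parts["host"], parts["port"], parts["ssl"])
```

Looking segments up by key means the order of the segments after the endpoint does not matter, which may be more forgiving if the stored secret carries extra options.
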

In [None]:
import tempfile
yaml_config = """
# Please refer to https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/data/feathr_user_workspace/feathr_config.yaml for explanations on the meaning of each field.
api_version: 1
project_config:
  project_name: 'fraud_detection_test'
  required_environment_variables:
    - 'REDIS_PASSWORD'
offline_store:
  adls:
    adls_enabled: true
  wasb:
    wasb_enabled: true
  s3:
    s3_enabled: false
    s3_endpoint: 's3.amazonaws.com'
  jdbc:
    jdbc_enabled: false
    jdbc_database: 'feathrtestdb'
    jdbc_table: 'feathrtesttable'
  snowflake:
    url: "dqllago-ol19457.snowflakecomputing.com"
    user: "feathrintegration"
    role: "ACCOUNTADMIN"
spark_config:
  spark_cluster: 'azure_synapse'
  spark_result_output_parts: '1'
  azure_synapse:
    dev_url: 'https://feathrazuretest3synapse.dev.azuresynapse.net'
    pool_name: 'spark3'
    workspace_dir: 'abfss://feathrazuretest3fs@feathrazuretest3storage.dfs.core.windows.net/fraud_detection_test'
    executor_size: 'Small'
    executor_num: 4
    feathr_runtime_location: wasbs://public@azurefeathrstorage.blob.core.windows.net/feathr-assembly-LATEST.jar
  databricks:
    workspace_instance_url: 'https://adb-2474129336842816.16.azuredatabricks.net'
    config_template: {'run_name':'','new_cluster':{'spark_version':'9.1.x-scala2.12','node_type_id':'Standard_D3_v2','num_workers':2,'spark_conf':{}},'libraries':[{'jar':''}],'spark_jar_task':{'main_class_name':'','parameters':['']}}
    work_dir: 'dbfs:/fraud_detection_test'
    feathr_runtime_location: https://azurefeathrstorage.blob.core.windows.net/public/feathr-assembly-LATEST.jar
online_store:
  redis:
    host: 'feathrazuretest3redis.redis.cache.windows.net'
    port: 6380
    ssl_enabled: True
feature_registry:
  api_endpoint: "https://feathr-sql-registry.azurewebsites.net/api/v1"
"""
# Write the config to a temporary file that persists after the handle is closed
with tempfile.NamedTemporaryFile(mode='w', delete=False) as tmp:
    tmp.write(yaml_config)
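
The YAML above still contains placeholder endpoints, while the real values were set earlier as environment variables such as `online_store__redis__host`. The sketch below illustrates the assumed override rule: nested YAML keys joined with a double underscore form the variable name, and the environment value wins over the file. The helper `resolve_setting` is hypothetical and is not Feathr's own code.

```python
import os


def resolve_setting(config: dict, *keys: str):
    """Return the env-var override for a nested key path, else the config value."""
    env_name = "__".join(keys)
    if env_name in os.environ:
        return os.environ[env_name]
    # Fall back to walking the nested dictionary one key at a time.
    value = config
    for key in keys:
        value = value[key]
    return value


# A trimmed-down stand-in for the parsed YAML (values are placeholders).
config = {"online_store": {"redis": {"host": "placeholder-host", "port": 6380}}}
os.environ["online_store__redis__host"] = "myredis.redis.cache.windows.net"

host = resolve_setting(config, "online_store", "redis", "host")
print(host)
```

Under this assumption, the hard-coded hosts in the YAML act only as defaults, and the notebook's `os.environ[...]` assignments decide which resources are actually used.
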


## Initialize `Feathr Client`
- `FeathrClient`

In [None]:
client = FeathrClient(config_path=tmp.name)

## Define Features
- `HdfsSource`
- `TypedKey`
- `Feature`
- `FeatureAnchor`
- `DerivedFeature`

### Account Features

In [None]:
# Refer to <https://feathr.readthedocs.io/en/latest/feathr.html> for details on each class and method
account_info = HdfsSource(name="AccountData",
                          path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/account_out_small.csv",
                          event_timestamp_column="transactionDate",
                          timestamp_format="yyyyMMdd")

accountId = TypedKey(key_column="accountID",
                       key_column_type=ValueType.INT32,
                       description="account id")

account_country = Feature(name="account_country",
                           key=accountId,
                           feature_type=STRING, 
                           transform="accountCountry")

is_user_registered = Feature(name="is_user_registered",
                                    key=accountId,
                                    feature_type=BOOLEAN,
                                    transform="isUserRegistered==TRUE")

num_payment_rejects_1d_per_user = Feature(name="num_payment_rejects_1d_per_user",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform="numPaymentRejects1dPerUser")

account_age = Feature(name="account_age",
                      key=accountId,
                      feature_type=INT32,
                      transform="accountAge")
                                    
features = [
    account_country,
    account_age,
    is_user_registered,
    num_payment_rejects_1d_per_user
]

account_anchor = FeatureAnchor(name="account_features",
                               source=account_info,
                               features=features)

### Transaction Features

In [None]:
# Refer to <https://feathr.readthedocs.io/en/latest/feathr.html> for details on each class and method

transaction_data = HdfsSource(name="transaction_data",
                          path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/transaction_out_small.csv",
                          event_timestamp_column="transactionDate",
                          timestamp_format="yyyyMMdd")

transaction_id = Feature(name="transaction_id",
                        key=accountId,
                        feature_type=STRING,
                        transform="transactionID")

transaction_currency_code = Feature(name="transaction_currency_code",
                                    key=accountId,
                                    feature_type=STRING,
                                    transform="transactionCurrencyCode")
                           
transaction_amount = Feature(name="transaction_amount",
                            key=accountId,
                            feature_type=FLOAT,
                            transform="transactionAmount")

transaction_device_id = Feature(name="transaction_device_id",
                                key=accountId,
                                feature_type=FLOAT,
                                transform="transactionDeviceId")

transaction_ip_address = Feature(name="transaction_ip_address",
                                key=accountId,
                                feature_type=FLOAT,
                                transform="transactionIPaddress")

transaction_time = Feature(name="transaction_time",
                            key=accountId,
                            feature_type=INT32,
                            transform="transactionTime")

fraud_status = Feature(name="fraud_status",
                       key=accountId,
                       feature_type=STRING,
                       transform="fraud_tag")

features = [
    transaction_id,
    transaction_amount,
    transaction_device_id,
    transaction_ip_address,
    transaction_time,
    transaction_currency_code,
    fraud_status
]

transaction_feature_anchor = FeatureAnchor(name="transaction_features",
                                            source=transaction_data,
                                            features=features)

### Transaction Aggregated Features

In [None]:
# source for the aggregated transaction features (same file as the transaction features)
transactions_aggr = HdfsSource(name="transactions_aggr",
                          path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/transaction_out_small.csv",
                          event_timestamp_column="transactionDate",
                          timestamp_format="yyyyMMdd")

# average transaction amount over the past 7 days
avg_transaction_amount = Feature(name="avg_transaction_amount",
                                key=accountId,
                                feature_type=FLOAT,
                                transform=WindowAggTransformation(agg_expr="cast_float(transactionAmount)",
                                                            agg_func="AVG",
                                                            window="7d"))

# number of transactions in the past day
num_trasaction_count_in_day = Feature(name="num_trasaction_count_in_day",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform=WindowAggTransformation(agg_expr="transactionID",
                                                                agg_func="COUNT",
                                                                window="1d"))

# Amount of transaction that took place in a day
total_transaction_amount_in_day = Feature(name="total_transaction_amount_in_day",
                                    key=accountId,
                                    feature_type=FLOAT,
                                    transform=WindowAggTransformation(agg_expr="cast_float(transactionAmount)",
                                                                agg_func="SUM",
                                                                window="1d"))

# average time of transaction in the past week
avg_transaction_time = Feature(name="avg_transaction_time",
                            key=accountId,
                            feature_type=INT32,
                            transform=WindowAggTransformation(agg_expr="cast_float(transactionTime)",
                                                          agg_func="AVG",
                                                          window="7d"))                                                            

# number of currency-code entries across transactions in the past week (a plain COUNT, not a distinct count)
num_currency_type_in_week = Feature(name="num_currency_type_in_week",
                                    key=accountId,
                                    feature_type=INT32,
                                    transform=WindowAggTransformation(agg_expr="transactionCurrencyCode",
                                                                    agg_func="COUNT",
                                                                    window="7d"))

# number of IP-address entries across transactions in the past week (a plain COUNT, not a distinct count)
num_ip_address_count = Feature(name="num_ip_address_count",
                        key=accountId,
                        feature_type=INT32,
                        transform=WindowAggTransformation(agg_expr="transactionIPaddress",
                                                          agg_func="COUNT",
                                                          window="7d"))

# number of device-ID entries across transactions in the past week (a plain COUNT, not a distinct count)
num_device_count = Feature(name="num_device_count",
                            key=accountId,
                            feature_type=INT32,
                            transform=WindowAggTransformation(agg_expr="transactionDeviceId",
                                                            agg_func="COUNT",
                                                            window="7d"))

# find the time of most recent transaction
time_most_recent_transaction = Feature(name="time_most_recent_transaction",
                                        key=accountId,
                                        feature_type=INT32,
                                        transform=WindowAggTransformation(agg_expr="transactionTime",
                                                                        agg_func="LATEST",
                                                                        window="7d"))

features = [
    avg_transaction_amount,
    avg_transaction_time,
    total_transaction_amount_in_day,
    num_trasaction_count_in_day,
    num_currency_type_in_week,
    num_ip_address_count,
    num_device_count,
    time_most_recent_transaction
]

aggr_anchor = FeatureAnchor(name="transaction_aggr_features",
                            source=transactions_aggr,
                            features=features)
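
To make the window semantics concrete, here is a rough plain-Python sketch of what an `AVG` over a `7d` window per `accountID` amounts to. The rows are made up, the helper `window_avg` is hypothetical, and the exact boundary handling (start exclusive, end inclusive) is an assumption rather than documented Feathr behaviour.

```python
from datetime import date, timedelta


def window_avg(rows, key, as_of, window_days):
    """Average `amount` for `key` over the `window_days` ending at `as_of`."""
    start = as_of - timedelta(days=window_days)
    amounts = [
        r["amount"]
        for r in rows
        if r["accountID"] == key and start < r["date"] <= as_of
    ]
    # No rows in the window means there is nothing to average.
    return sum(amounts) / len(amounts) if amounts else None


# Made-up transactions, for illustration only.
rows = [
    {"accountID": "A", "date": date(2013, 4, 1), "amount": 10.0},
    {"accountID": "A", "date": date(2013, 4, 5), "amount": 30.0},
    {"accountID": "A", "date": date(2013, 3, 20), "amount": 500.0},  # outside the window
    {"accountID": "B", "date": date(2013, 4, 6), "amount": 7.0},
]

avg_a = window_avg(rows, "A", as_of=date(2013, 4, 7), window_days=7)
print(avg_a)
```

Only the two April transactions of account `A` fall inside the window, so the large March transaction does not affect the average.
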

### Derived Features
- `DerivedFeature`

In [None]:
# derived features
feature_diff_current_and_avg_amount = DerivedFeature(name="feature_diff_current_and_avg_amount",
                                                    key=accountId,
                                                    feature_type=FLOAT,
                                                    input_features=[
                                                        transaction_amount, avg_transaction_amount],
                                                    transform="transaction_amount - avg_transaction_amount")

feature_time_pass_after_most_recent_transaction = DerivedFeature(name="feature_time_pass_after_most_recent_transaction",
                                                                key=accountId,
                                                                feature_type=INT32,
                                                                input_features=[
                                                                    transaction_time, time_most_recent_transaction],
                                                                transform="cast_int(transaction_time) - cast_int(time_most_recent_transaction)")

## Build Defined Features
- `build_features`

In [None]:
client.build_features(anchor_list=[account_anchor, transaction_feature_anchor, aggr_anchor], 
                      derived_feature_list=[feature_time_pass_after_most_recent_transaction, feature_diff_current_and_avg_amount])

## Get Offline Features
- `FeatureQuery`
- `ObservationSettings`
- `get_offline_features`
- `feathr_spark_launcher.download_result`

In [None]:
if client.spark_runtime == 'databricks':
    output_path = 'dbfs:/feathrfrauddetection_test.avro'
else:
    output_path = feathr_output_path

feature_query = FeatureQuery(
    feature_list=["account_country",
                  "transaction_time",
                  "num_currency_type_in_week",
                  "num_trasaction_count_in_day",
                  "total_transaction_amount_in_day",
                  "fraud_status",
                  "is_user_registered",
                  "avg_transaction_amount",
                  "num_ip_address_count",
                  "num_device_count",
                  "time_most_recent_transaction",
                  "feature_diff_current_and_avg_amount",
                  "feature_time_pass_after_most_recent_transaction"], key=accountId)
                    
settings = ObservationSettings(
    observation_path="wasbs://frauddata@feathrdatastorage.blob.core.windows.net/observation_out_small.csv",
    event_timestamp_column="transactionDate",
    timestamp_format="yyyyMMdd")
    
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path=output_path)
client.wait_job_to_finish(timeout_sec=10000000000)
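
`ObservationSettings` carries an `event_timestamp_column` so that features can be joined as of each observation's timestamp rather than using values from the future. The following is a simplified plain-Python sketch of that idea, not Feathr's Spark implementation; the data and the helper `point_in_time_lookup` are hypothetical.

```python
from datetime import date


def point_in_time_lookup(feature_rows, key, obs_date):
    """Return the latest feature value for `key` recorded on or before `obs_date`."""
    candidates = [
        r for r in feature_rows
        if r["accountID"] == key and r["date"] <= obs_date
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["date"])["value"]


# Made-up feature history for one account.
feature_rows = [
    {"accountID": "A", "date": date(2013, 4, 1), "value": 1},
    {"accountID": "A", "date": date(2013, 4, 5), "value": 2},
    {"accountID": "A", "date": date(2013, 4, 9), "value": 3},  # later than the observation
]
observations = [{"accountID": "A", "date": date(2013, 4, 7)}]

joined = [
    dict(obs, value=point_in_time_lookup(feature_rows, obs["accountID"], obs["date"]))
    for obs in observations
]
print(joined)
```

The observation dated 7 April picks up the value recorded on 5 April and ignores the one from 9 April, which is what keeps label leakage out of the training set.
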

In [None]:
import pandas as pd
import pandavro as pdx
import glob
from pathlib import Path
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

from feathr import BackfillTime, MaterializationSettings, RedisSink

In [None]:
def get_result_df(client: FeathrClient) -> pd.DataFrame:
    """Download the job result dataset from cloud as a Pandas dataframe."""
    res_url = client.get_job_result_uri(block=True, timeout_sec=600)
    # The temporary directory is removed automatically when the block exits
    with tempfile.TemporaryDirectory() as tmp_dir:
        client.feathr_spark_launcher.download_result(result_path=res_url, local_folder=tmp_dir)
        # Assumes the results are in Avro format
        dataframe_list = [pdx.read_avro(file)
                          for file in glob.glob(os.path.join(tmp_dir, '*.avro'))]
    return pd.concat(dataframe_list, axis=0)

df_res = get_result_df(client)

## Feature Visualization

In [None]:
filepath = Path('./result_out.csv')
df_res.to_csv(filepath, index=False) 
# reset_index returns a new dataframe, so assign the result back
df_res = df_res.reset_index(drop=True)
df_res

## Train Fraud Detection Model with Calculated Features

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 
import seaborn as sns

# Drop identifier and non-numeric columns on a copy so that df_res stays intact
final_df = df_res.drop(columns=['accountID', 'transactionDate', 'account_country'], errors='ignore')
final_df = final_df.fillna(0)

x_train, x_test, y_train, y_test = train_test_split(final_df.drop(["fraud_status"], axis=1),
                                                    final_df["fraud_status"],
                                                    test_size=0.20,
                                                    random_state=0)
  
K = []
training = []
test = []
scores = {}
  
for k in range(2, 21):
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(x_train, y_train)
  
    training_score = clf.score(x_train, y_train)
    test_score = clf.score(x_test, y_test)
    K.append(k)
  
    training.append(training_score)
    test.append(test_score)
    scores[k] = [training_score, test_score]

for keys, values in scores.items():
    print(keys, ':', values)
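
The loop above records a training and a test accuracy for every `k`. As a small sketch of how one might pick a value from a dictionary of that shape, the snippet below selects the `k` with the highest test accuracy. The numbers in `example_scores` are made up and the helper `best_k_by_test_score` is hypothetical.

```python
# Hypothetical scores in the same shape as the `scores` dict:
# k -> [training accuracy, test accuracy]. The numbers are made up.
example_scores = {2: [0.99, 0.90], 3: [0.97, 0.93], 4: [0.96, 0.92]}


def best_k_by_test_score(scores):
    """Return the k whose test accuracy (second entry) is highest."""
    return max(scores, key=lambda k: scores[k][1])


best_k = best_k_by_test_score(example_scores)
print(best_k)
```

Choosing `k` on the test split makes the reported test accuracy optimistic; a separate validation split or cross-validation would be a cleaner way to tune it.
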

## Materialize Features in Redis
- `BackfillTime`
- `RedisSink`
- `materialize_features`
- `multi_get_online_features`

In [None]:
backfill_time = BackfillTime(start=datetime(
    2013, 4, 7), end=datetime(2013, 4, 7), step=timedelta(days=1))
redisSink = RedisSink(table_name="fraudDetectionDemoFeature")
settings = MaterializationSettings("fraudDetectionDemoFeature",
                                   backfill_time=backfill_time,
                                   sinks=[redisSink],
                                   feature_names=["fraud_status"])

client.materialize_features(settings)
client.wait_job_to_finish(timeout_sec=5000)
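
`BackfillTime` takes a `start`, an `end`, and a `step`. The sketch below shows one plausible reading of those parameters, assuming that both ends are inclusive and that one materialization cutoff is produced per step; the helper `backfill_cutoffs` is hypothetical and is not part of Feathr.

```python
from datetime import datetime, timedelta


def backfill_cutoffs(start, end, step):
    """List the cutoff times from `start` to `end` inclusive, `step` apart."""
    cutoffs = []
    current = start
    while current <= end:
        cutoffs.append(current)
        current += step
    return cutoffs


# Same start and end as in the cell above: a single cutoff.
single = backfill_cutoffs(datetime(2013, 4, 7), datetime(2013, 4, 7), timedelta(days=1))
# A three-day range with a daily step: three cutoffs.
three = backfill_cutoffs(datetime(2013, 4, 5), datetime(2013, 4, 7), timedelta(days=1))
print(len(single), len(three))
```

Under that reading, setting `start` equal to `end` as the notebook does materializes the features for one date only.
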

In [None]:
client.multi_get_online_features('fraudDetectionDemoFeature', ['1759222192247110', '914800996051170'], [
                                 "fraud_status"])

In [None]:
client.multi_get_online_features('fraudDetectionDemoFeature', ['1759222192247110', '914800996051170', '844428033864668'], [
                                 "fraud_status"])

## Register Features with Registry APIs
- `register_features`
- `list_registered_features`
- The calls above are sent to a standard Registry API service (both `Purview` and `SQL` backends are supported)
- A friendlier interface with detailed lineage is available in the [Feathr UI](https://feathr-sql-registry.azurewebsites.net/)

In [None]:
client.register_features()
client.list_registered_features(project_name="fraud_detection_test")