# Financial Fraud Detection

- The objective of this notebook is to showcase the usage of the ___financial-fraud-training___ NIM (microservice) (NEED LINK) and how to deploy the resulting trained models on the Triton Inference Server.
- We use [IBM TabFormer](https://github.com/IBM/TabFormer) as an example dataset.
- That dataset is then preprocessed before being run through the training NIM.

NOTICE:
- This notebook assumes that you have followed the pre

NOTE: The preprocessing code is written specifically for the TabFormer dataset and will not work with other datasets.

In [None]:
!pip install -r "./requirements.txt"

#### Import libraries

In [None]:
import os
import sys
import json
import time

In [None]:
import cudf

----
# Step 1: Get and Prepare the data

___This example uses the IBM TabFormer dataset.
Unfortunately, it is not a simple process to download the dataset.
There are a few manual steps needed to access the data___

# How to get a link to the Test Data

The origin of the test data is this [TabFormer git repository](https://github.com/IBM/TabFormer/tree/main/data/credit_card). The problem is that the file is large and is stored and retrieved using git-lfs, and due to its popularity it generally cannot be downloaded from this location. As such, the owners of that repository have made the dataset available via [Box](https://ibm.box.com/v/tabformer-data).

While this makes it fairly easy to retrieve the file for local use, it is a little more difficult to retrieve it for testing in a container, as this application expects. So, there are a few steps to follow to get a download link that you can put into the download box.

## Step 1

Open a browser window and navigate to the box dataset [https://ibm.box.com/v/tabformer-data](https://ibm.box.com/v/tabformer-data). 

![Image 1](../docs/images/1.png)

Right click in the webpage and select `Inspect`. When the developer tools open, select the `Network` tab.

![Image 2](../docs/images/2.png)


## Step 2

In the webpage itself, click on the credit_card folder. 

![Image 3](../docs/images/3.png)

You will see `transactions.tgz`. When you point at the file, a 3-button menu will show up. Click that menu and a download dialog will pop up.

Click download and after a moment the file should start downloading locally (this is fine, you can delete it after). 

In the network panel an entry will show up by the name `download`. Click the name `download` and the `Headers` panel will show up. In there you will find `Request URL`. Highlight the entire value (**this is the download URL**) and `copy` the URL. This is a temporary link that is provided. It has a time limit. It will look something like this: `https://public.boxcloud.com/d/1/{LOTS OF RANDOM LOOKING CHARACTERS}/download`

![Image 4](../docs/images/4.png)

## Step 3

In the notebook set the value of `DOWNLOAD_URL` by pasting in the url you copied in step 2: `DOWNLOAD_URL="[the pasted url text goes in between these quotes, do not include these brackets]"`
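
Because the copied link is temporary and easy to truncate while copying, a quick sanity check before downloading can save a failed `wget`. The following is a minimal sketch: the helper name `looks_like_box_download_url` is hypothetical, and the expected host suffix and path ending are assumptions based only on the example URL shape shown in Step 2. It cannot tell whether the link has expired.

```python
from urllib.parse import urlparse


def looks_like_box_download_url(url: str) -> bool:
    """Return True if `url` has the general shape of the link from Step 2.

    Assumption: the link uses HTTPS, is served from a boxcloud.com host,
    and its path ends in "/download". Expiry is not checked.
    """
    parts = urlparse(url.strip())
    return (
        parts.scheme == "https"
        and parts.netloc.endswith("boxcloud.com")
        and parts.path.endswith("/download")
    )


# Hypothetical placeholder values, not real links
print(looks_like_box_download_url("https://public.boxcloud.com/d/1/abc123/download"))
print(looks_like_box_download_url("not a url"))
```

In the notebook you could call `looks_like_box_download_url(DOWNLOAD_URL)` right after setting the variable and re-copy the link if it returns `False`.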



# Download the Dataset

In [None]:
DOWNLOAD_URL = "https://public.boxcloud.com/d/1/b1!9-jRH0pMAAsmTZAVeRCo4LVZGHI-_oJVf-16dcRlDjRKv-p42cBf5kBHK47LphSeKcpZ7jZblLc00mymL9IFqd-ZCDbAuZbpuxNQzqcaPprhS_ib3OaCP6lT4Ko8Mw-693DnYvE1zcsez_pevVAnHPSn0K90RfoV3QPjN_ueIawzojSKW29Yq_EQ3FgXvj-30ZD0poiTWeOFTFDJiTQcDv9dabvYFZAMKLpzjRLE-tyr88Ktnoy225tkgNCo0dGl8FmHkcyI2tMaiAmcLCv54bbseTiDox-XcdHIrhs3OEr1A218GK0ATh5kQYR08mOaANZ57z9tdrIIM_2sS7XOurtqhIWkLMOuGuRjJNuwd-rQtgoUQZN-Ggo9OUR8MT-D4dWYvpuHnQZvepAFIL9Ihl074klSRrCU3T_zVu80rVSpt8WkUJ8QG6lKIl0lwuz7Jdc7IuTayhiIUyMrfeHtvkMRUFvl51FvoSdNXE4nb2brLTbjm0HJiXgAoCR8Bi0hkARZbFNLS4uwTVtNI9c8-UNpxV0W-YcW3XO33cl4DdlI3BdJ632xdnpQAKMXkWzFv4XVTG8whgCqaosiZrgig57KHAx-1mx7btaYt62kszuRm1WB3RFJjm9Lfu8Z-vO5OUnkwQx_r1Cxiz-ngxxZU2majLoWW3Z8xsqewz4ErOaoSeJzeIjIx38Vurup3JStPNRntT_1hqY5zl5aHYW8rCLAGoUpy3PYjblezQisLW5FZixoKWgPEFv2bRHuGwVn_WVdabFD-kJzo2eFenp38w1YNlTmyRCW0BCLPXHWJjWzo9HEE47uvufGNW7Wo8-RW_80A1oGWhsvhHae8dG9--I1zLsMUhKbPAOMIsh9X4es8qeEhU1x5r9e3OrbFvmSt75DksT7OaaO7ea1IilugnWzFBCaVJGxZRiafElawjP0kMf-dC03f2S96OdWXrmC6HTNUHjlkjNDLSvEPFE7Ie0_yIgqxo-Hs68tb3JzFsp91PRaNUL2tY3MkgwJHbpHnWuTxQMuIbFNULXs1egvUS7BCwATvo6fZMdEvxR7qRQftN1ko_GXkNldw_8PEhK0Wb5c3o59j_089u20tG0D0pNQlG5I7BQ-zUhO-KsiPiVZjNirP_HkQv_hISUvLp4APp7nAyQnagxg54IwQWFgsLKys-Qh0wVmfDuUo-0HB0-uBqfS0UHEm7et4gZSjK88e1ud39Mqepk3ohw6daunzqHky0YlcvhZ6lhMjgXGCZ1dX5-IneCCxGtV12MHsDcXuVMTnhe1NaxvPhwgPKgGpKUmu7RbROVLgfZUb_NoS_RsVX_dUmzqzh1ou4CQO32SFI2oVLScKcu3o-OVE8-G84cqR2a__fK9G-1obqhDph-TRbqvddoW7ZBcB-wWwcfzFF7GM6FD9gf4mP9FnhG6BkDYL1ueJJs./download"

In [None]:
# make sure we are in the "data" folder
%cd ../data
%pwd

In [None]:
!wget {DOWNLOAD_URL}

In [None]:
!mv download download.tgz

In [None]:
!tar xvzf download.tgz

In [None]:
!mv card_transaction.v1.csv ./TabFormer/raw
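
If shell magics are unavailable, the extract and move steps above can be approximated with the Python standard library. This is a sketch under assumptions: the helper name `extract_csv_to_raw` is hypothetical, and it assumes the archive holds `card_transaction.v1.csv` at its top level, as the `mv` command above implies.

```python
import os
import shutil
import tarfile


def extract_csv_to_raw(archive_path, raw_dir, csv_name="card_transaction.v1.csv"):
    """Copy `csv_name` out of a gzipped tar archive into `raw_dir`.

    Raises KeyError if the archive has no member called `csv_name`.
    """
    os.makedirs(raw_dir, exist_ok=True)
    target = os.path.join(raw_dir, csv_name)
    with tarfile.open(archive_path, "r:gz") as archive:
        source = archive.extractfile(csv_name)
        if source is None:
            raise ValueError(f"{csv_name} is not a regular file in the archive")
        # Stream the member to disk instead of loading it into memory
        with source, open(target, "wb") as out:
            shutil.copyfileobj(source, out)
    return target


# Example call (assumes the downloaded file was saved as "download.tgz"):
# extract_csv_to_raw("download.tgz", "./TabFormer/raw")
```

Streaming with `shutil.copyfileobj` avoids holding the whole CSV in memory, which matters for a file of this size.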

You should now have:

```sh
.
    data
    └── TabFormer
        └── raw
            └── card_transaction.v1.csv
```

In [None]:
# Once the raw data is placed as described above, set the path to the TabFormer directory
data_root_dir = os.path.abspath('../data/TabFormer/') # Change this path to point to TabFormer

In [None]:
# Check if the raw data has been placed properly
!tree {data_root_dir}

---
# Step 2: Preprocess the data
- Import the Python function that handles preprocessing of the TabFormer data
- Call the `proprocess_data` function from the `preprocess_TabFormer` module to prepare the data


In [None]:
# Add the "src" directory to the search path
src_dir = os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), 'src'))
sys.path.insert(0, src_dir)

# should be able to import from "src" folder now
from preprocess_TabFormer import proprocess_data

In [None]:
# Preprocess the data
proprocess_data(data_root_dir)

# this will output status as it correlates data and produces categorical types 

In [None]:
# You should now see files under a "gnn" folder and under an "xgb" folder
!tree {data_root_dir}

-----
# Step 3: Train the model using the financial-fraud-training NIM


### Create training configuration file
NOTE: The training configuration file must conform to the training schemas defined in the financial-fraud-training NIM (NOTE: NEED A LINK TO THE DOCS)

In [None]:
# Path to save the trained model
os.makedirs(os.path.join(data_root_dir, 'trained_models'), exist_ok=True)

__Important: Models and configuration files needed for deployment using the Triton Inference Server will be saved in trained_models/model_repository__

In [None]:
training_config = {
  "paths": {
    "data_dir": "/data", # Mount dataset root directory under /data in the container
    "output_dir": "/data/trained_models" # Mount path to save the trained models, NOTE: This path is inside the docker container 
  },

  "models": [
    {
      "kind": "GraphSAGE_XGBoost",
      "gpu": "single",
      "hyperparameters": {
        "gnn":{
          "hidden_channels": 16,
          "n_hops": 1,
          "dropout_prob": 0.1,
          "batch_size": 1024,
          "fan_out": 16,
          "num_epochs": 16
        },
        "xgb": {
          "max_depth": 6,
          "learning_rate": 0.2,
          "num_parallel_tree": 3,
          "num_boost_round": 512,
          "gamma": 0.0
        }

      }
    }
  ]
}
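
The NIM's actual schema is not reproduced in this notebook, so the following is only a minimal pre-flight sketch. It checks the keys that the `training_config` dictionary above happens to use. The function name `find_config_problems` and the choice of which keys count as required are assumptions for illustration, not the NIM's validation rules.

```python
def find_config_problems(config):
    """Return a list of human-readable problems; an empty list means none were found.

    Assumption: "paths" needs string entries "data_dir" and "output_dir", and
    each entry in "models" needs "kind", "gpu", and "hyperparameters".
    """
    problems = []
    paths = config.get("paths")
    if not isinstance(paths, dict):
        paths = {}
    for key in ("data_dir", "output_dir"):
        if not isinstance(paths.get(key), str):
            problems.append(f"paths.{key} should be a string")
    models = config.get("models")
    if not isinstance(models, list) or not models:
        problems.append("models should be a non-empty list")
        return problems
    for i, model in enumerate(models):
        for key in ("kind", "gpu", "hyperparameters"):
            if not isinstance(model, dict) or key not in model:
                problems.append(f"models[{i}] is missing '{key}'")
    return problems


# Trimmed-down example in the same shape as the configuration above
example = {
    "paths": {"data_dir": "/data", "output_dir": "/data/trained_models"},
    "models": [{"kind": "GraphSAGE_XGBoost", "gpu": "single", "hyperparameters": {}}],
}
print(find_config_problems(example))
print(find_config_problems({"paths": {"data_dir": "/data"}}))
```

Running `find_config_problems(training_config)` before saving the file catches missing keys early, but passing this check does not guarantee that the NIM will accept the configuration.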


#### Save the training configuration file as a json file

In [None]:
training_config_file_name = 'training_config.json'

with open(training_config_file_name, 'w') as json_file:
    json.dump(training_config, json_file, indent=4)

### Pull and run the financial-fraud-training NIM 


In [None]:
!docker login nvcr.io --username  '$oauthtoken' --password "ENTER YOUR NGC API KEY HERE"

#### Finally train the models according to above defined configuration file

In [None]:
! docker run --cap-add SYS_NICE -it --rm  --gpus all  -v {data_root_dir}:/data -v ./{training_config_file_name}:/app/config.json model_training_nim --config /app/config.json

#### Make sure that the `model_repository` has been created with the right contents in it
According to the configuration file defined above, the `model_repository`, the folder containing the models and configuration files to be deployed on the Triton Inference Server, will be created under 
{data_root_dir}/trained_models/ and its contents will look like

```sh
├── model
│   ├── 1
│   │   └── graph_sage_node_embedder.onnx
│   └── config.pbtxt
└── xgboost
    ├── 1
    │   └── xgboost_on_embeddings.json
    └── config.pbtxt

```
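
The same layout can also be checked programmatically. This is a minimal sketch that assumes exactly the file names shown in the listing above; the helper name `missing_repo_files` is hypothetical.

```python
import os

# Relative paths taken from the directory listing above
EXPECTED_FILES = [
    os.path.join("model", "1", "graph_sage_node_embedder.onnx"),
    os.path.join("model", "config.pbtxt"),
    os.path.join("xgboost", "1", "xgboost_on_embeddings.json"),
    os.path.join("xgboost", "config.pbtxt"),
]


def missing_repo_files(repo_path, expected=EXPECTED_FILES):
    """Return the expected files that are absent under `repo_path`."""
    return [
        rel for rel in expected
        if not os.path.isfile(os.path.join(repo_path, rel))
    ]


# Example call (uses the notebook's `data_root_dir`):
# missing_repo_files(os.path.join(data_root_dir, "trained_models", "model_repository"))
```

An empty list means every expected file was found; anything else names the files to investigate before starting Triton.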


In [None]:
!tree {data_root_dir}/trained_models/model_repository

----
# Step 4: Serve your model on Triton Inference Server

Important: Change MODEL_REPO_PATH to point to the `model_repository` folder if you used a different path in your training configuration file

#### Install tritonclient

In [None]:
##!pip install tritonclient[all]

In [None]:
import tritonclient.grpc as triton_grpc
import tritonclient.http as httpclient
from tritonclient import utils as triton_utils

In [None]:
# Set to False for remote/cloud deployment
run_locally = True 

##### Replace HOST with the actual server URL where your Triton Inference Server is hosted.


In [None]:
if run_locally:
    HOST = 'localhost'
else:
    HOST = '<SERVER_URL>' # Replace with your server URL or IP address

HTTP_PORT = 8000
GRPC_PORT = 8001

### If you are testing a local deployment
- Pull Triton inference server docker image
- Deploy server with  models and configuration files (produced by the training NIM)
- Double check that your model repository folder has the following structure
```sh
├── model
│   ├── 1
│   │   └── graph_sage_node_embedder.onnx
│   └── config.pbtxt
└── xgboost
    ├── 1
    │   └── xgboost_on_embeddings.json
    └── config.pbtxt
```

In [None]:
if run_locally:
    
    # Triton server image
    TRITON_IMAGE = 'nvcr.io/nvidia/tritonserver:25.01-py3'
    MODEL_REPO_PATH = os.path.join(data_root_dir, 'trained_models/model_repository')

    # Pull docker 
    !docker pull {TRITON_IMAGE}
    !docker stop tritonserver
    !docker rm tritonserver

    !docker run --gpus all -d -p {HTTP_PORT}:{HTTP_PORT} -p {GRPC_PORT}:{GRPC_PORT} -v {MODEL_REPO_PATH}:/models --name tritonserver {TRITON_IMAGE} tritonserver --model-repository=/models



### URLs for GRPC and HTTP request to the inference server

In [None]:
client_grpc = triton_grpc.InferenceServerClient(url=f'{HOST}:{GRPC_PORT}')
client_http = httpclient.InferenceServerClient(url=f'{HOST}:{HTTP_PORT}')

### Wait for the triton inference server to come online
NOTE: If the following cell runs for much longer than the timeout, interrupt execution and run it again.

In [None]:

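TIMEOUT = 60  # seconds
client_grpc = triton_grpc.InferenceServerClient(url=f'{HOST}:{GRPC_PORT}')
server_start = time.time()
# Check the timeout on every iteration, including when the server is unreachable
while time.time() - server_start <= TIMEOUT:
    try:
        if client_grpc.is_server_ready():
            break
    except triton_utils.InferenceServerException:
        pass  # server not reachable yet; retry until the timeout
    time.sleep(1)
else:
    print(f"Triton server was not ready after {TIMEOUT} seconds")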


### For local deployment, check if the triton inference server is running properly

In [None]:
if run_locally:
    !docker logs tritonserver

### Read preprocessed input transactions to query the Triton Inference Server

In [None]:
import pandas as pd
import numpy as np

test_path = os.path.join(data_root_dir, "xgb/test.csv") # already preprocessed data
test_df = pd.read_csv(test_path)
X = test_df.iloc[:, :-1].values.astype(np.float32)
y = test_df.iloc[:, -1].values
edge_index = np.array([[], []]).astype(np.int64) # empty edge_index
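
The slicing above assumes that the label is the last column of `test.csv` and that every other column is a feature. A small sketch with made-up numbers illustrates the resulting shapes, including the empty `edge_index` of shape `(2, 0)`, which passes no graph connections with the query. The toy array is hypothetical and stands in for the real file.

```python
import numpy as np

# Hypothetical stand-in for test.csv: three rows, two features, label last
table = np.array([[0.1, 1.0, 0], [0.7, 0.0, 1], [0.3, 0.5, 0]])

X_toy = table[:, :-1].astype(np.float32)  # all columns except the last
y_toy = table[:, -1].astype(int)          # last column holds the label
edge_index_toy = np.array([[], []]).astype(np.int64)  # two rows, zero edges

print(X_toy.shape, y_toy.tolist(), edge_index_toy.shape)
```

The shapes matter because `InferInput` is constructed from `X.shape` and `edge_index.shape`, so they have to describe the arrays that are actually sent.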

### Setup the HTTP request's inputs and output to retrieve embeddings for the input transactions

In [None]:
input_features = httpclient.InferInput("x", X.shape, datatype="FP32")
input_features.set_data_from_numpy(X)

input_edge_indices = httpclient.InferInput("edge_index", edge_index.shape, datatype="INT64")
input_edge_indices.set_data_from_numpy(edge_index)

outputs = httpclient.InferRequestedOutput("output")

### Send a query to retrieve embeddings

In [None]:
# Querying the server
results = client_http.infer(model_name="model", inputs=[input_features, input_edge_indices], outputs=[outputs])
node_embeddings = results.as_numpy('output')
# print(node_embeddings)


### Use the retrieved embeddings as inputs to predict the transactions' fraud scores

In [None]:
xgboost_input = httpclient.InferInput("input__0", node_embeddings.shape, datatype="FP32")
xgboost_input.set_data_from_numpy(node_embeddings)

xgboost_outputs = httpclient.InferRequestedOutput("output__0")

### Send a query to retrieve the fraud scores

In [None]:
results = client_http.infer(model_name="xgboost", inputs=[xgboost_input], outputs=[xgboost_outputs])
predictions = results.as_numpy('output__0')

### Evaluate performance

In [None]:
# Decision threshold to flag a transaction as fraud
# Change to trade off precision and recall
decision_threshold = 0.5
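
To make the trade-off concrete, the sketch below sweeps a few thresholds over made-up scores and labels. The data and the helper name `precision_recall_at` are hypothetical. The comparison uses `>` and returns 0 when a denominator is zero, mirroring the evaluation cell in this notebook.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores above `threshold` are flagged as fraud."""
    flagged = [s > threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fraud scores and ground-truth labels (1 = fraud)
scores = [0.05, 0.20, 0.45, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 lifts precision from 0.75 to 1.00 while recall falls from 1.00 to 0.67.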

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_pred = (predictions > decision_threshold).astype(int)


# Compute evaluation metrics
accuracy = accuracy_score(y, y_pred)
precision = precision_score(y, y_pred, zero_division=0)
recall = recall_score(y, y_pred, zero_division=0)
f1 = f1_score(y, y_pred, zero_division=0)

print("----Summary---")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")


### Compute confusion matrix 

In [None]:
import pandas as pd
# Create a DataFrame with labeled rows and columns
classes = ['Non-Fraud', 'Fraud']
columns = pd.MultiIndex.from_product([["Predicted"], classes])
index = pd.MultiIndex.from_product([["Actual"], classes])

conf_mat = confusion_matrix(y, y_pred)
cm_df = pd.DataFrame(conf_mat, index=index, columns=columns)
print(cm_df)

### Plot confusion matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix directly from predictions
disp = ConfusionMatrixDisplay.from_predictions(
    y, y_pred, display_labels=classes)
disp.ax_.set_title('Confusion Matrix')
plt.show()