Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: Apache-2.0


# Prepare the Financial Fraud Dataset for the Numenta Anomaly Benchmark (NAB)

The [Numenta Anomaly Benchmark (NAB)](https://github.com/numenta/NAB) includes multiple anomaly detection algorithms for time series. The algorithms identify anomalies contextually, that is, based on the previous values of the same time series. NAB therefore expects each time series to be in its own CSV file with `timestamp` and `value` as the columns.
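To make the expected layout concrete, here is a minimal sketch that writes a toy series with `timestamp` and `value` columns to an in-memory CSV and reads it back. The numbers are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Toy series (illustrative values): one row per observation, two columns only.
toy_series = pd.DataFrame({"timestamp": [1, 2, 3], "value": [10.0, 12.5, 11.0]})

buffer = io.StringIO()
# index=False keeps the DataFrame row index out of the CSV.
toy_series.to_csv(buffer, index=False)

# Read the CSV back to confirm the column layout.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))
```

Writing with `index=False` matters here: without it, pandas adds an unnamed index column, and the file no longer has just the two expected columns.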

## Steps in this notebook
1. Load raw data
2. Process raw data into independent time series for NAB
3. Save a JSON file specifying the anomalies

In [1]:
import pandas as pd
import numpy as np

In [2]:
import sys
sys.path.append('../../src/')

from anomaly_detection_spatial_temporal_data.utils import ensure_directory

## Load raw data

In [3]:
raw_data_path = '../../data/01_raw/financial_fraud/bs140513_032310.csv'

raw_trans_data = pd.read_csv(raw_data_path)

## Construct purchase time series for a single (customer, category) pair


In [4]:
# String values in the CSV are wrapped in single quotes, so the quotes are part of each value
example_c = """'C1001065306'"""
example_m = """'es_health'"""
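Because the quotes are part of the stored strings, they also end up in any file name built from these values. If that is not wanted, one option is to strip them after loading. The sketch below runs on a toy frame; the second category value is illustrative, and this step is not part of the original notebook.

```python
import pandas as pd

# Toy frame mimicking the quote-wrapped string columns of the raw CSV.
toy = pd.DataFrame({
    "customer": ["'C1001065306'", "'C1001065306'"],
    "category": ["'es_health'", "'es_food'"],  # second value is illustrative
    "amount": [37.1, 12.0],
})

# Strip leading and trailing single quotes from the string columns only.
cleaned = toy.copy()
for column in ["customer", "category"]:
    cleaned[column] = cleaned[column].str.strip("'")

print(cleaned)
```

If the quotes are stripped this way, the filter values (`example_c`, `example_m`) would have to be written without quotes as well.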

In [5]:
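# Select all transactions for the example customer and category,
# keeping only the time step, the amount, and the fraud flag
c_m_p_data_example = raw_trans_data.loc[
    (raw_trans_data.customer == example_c)
    & (raw_trans_data.category == example_m),
    ['step', 'amount', 'fraud']
]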

In [6]:
c_m_p_data_example

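| | step | amount | fraud |
|---|---|---|---|
| 56918 | 21 | 37.1 | 0 |
| 84525 | 31 | 108.32 | 0 |
| 84526 | 31 | 188.94 | 0 |
| 100311 | 36 | 906.87 | 1 |
| 106319 | 38 | 146.25 | 0 |
| 181718 | 63 | 31.41 | 0 |
| 181719 | 63 | 177.82 | 0 |
| 181720 | 63 | 106.47 | 0 |
| 383480 | 122 | 1024.36 | 1 |
| 400723 | 127 | 80.72 | 1 |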


In [7]:
example_ts_file_path = f"""../../data/02_intermediate/financial_fraud/ts_data/{example_c}_{example_m}_transaction_data.csv"""
print(example_ts_file_path)

ensure_directory(example_ts_file_path)

../../data/02_intermediate/financial_fraud/ts_data/'C1001065306'_'es_health'_transaction_data.csv
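The implementation of `ensure_directory` is not shown in this notebook. Assuming it creates the parent directory of the given file path when that directory is missing, a stand-in built on `pathlib` could look like the sketch below. The name `ensure_parent_directory` is hypothetical and is not part of the project.

```python
import tempfile
from pathlib import Path


def ensure_parent_directory(file_path: str) -> Path:
    """Hypothetical stand-in: create the parent directory of file_path if missing."""
    parent = Path(file_path).parent
    # parents=True creates intermediate directories; exist_ok=True makes reruns safe.
    parent.mkdir(parents=True, exist_ok=True)
    return parent


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ts_data" / "example_transaction_data.csv"
    created = ensure_parent_directory(str(target))
    parent_exists = created.is_dir()  # the directory now exists
    file_exists = target.exists()     # the file itself is not created
    print(parent_exists, file_exists)
```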


In [8]:
# Rename to the column names NAB expects and write only timestamp and value
c_m_p_data_example = c_m_p_data_example.rename(columns={'step': 'timestamp', 'amount': 'value', 'fraud': 'label'})
c_m_p_data_example[['timestamp', 'value']].to_csv(example_ts_file_path, index=False)

In [9]:
example_ts_label_file_path = f"""../../data/02_intermediate/financial_fraud/ts_label/{example_c}_{example_m}_transaction_label.csv"""
print(example_ts_label_file_path)

ensure_directory(example_ts_label_file_path)

../../data/02_intermediate/financial_fraud/ts_label/'C1001065306'_'es_health'_transaction_label.csv


In [10]:
c_m_p_data_example[['label']].to_csv(example_ts_label_file_path, index=False)
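The cells above handle a single pair. To build one series per pair, the same filtering and renaming could be applied with `groupby`. The sketch below uses a toy frame with made-up customers, categories, and amounts; only the column names follow the raw data.

```python
import pandas as pd

# Toy transactions (illustrative values) with the same column names as the raw data.
toy = pd.DataFrame({
    "step": [21, 31, 31, 5, 9],
    "customer": ["c1", "c1", "c1", "c2", "c2"],
    "category": ["health", "health", "health", "food", "health"],
    "amount": [37.1, 108.32, 188.94, 20.0, 15.5],
    "fraud": [0, 0, 0, 0, 1],
})

# One time series per (customer, category) pair.
series_by_pair = {}
for (customer, category), group in toy.groupby(["customer", "category"]):
    series = group[["step", "amount", "fraud"]].rename(
        columns={"step": "timestamp", "amount": "value", "fraud": "label"}
    )
    # A stable sort keeps the original order of transactions sharing a time step.
    series_by_pair[(customer, category)] = series.sort_values("timestamp", kind="stable")

for pair, series in series_by_pair.items():
    print(pair, len(series))
```

Each entry in `series_by_pair` could then be written to its own data and label CSV in the same way as the single example.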

## Generate the label dictionary required by NAB

In [11]:
from pathlib import Path
import json

def generate_dummy_labels(data_dir: str, label_dir: str) -> str:
    """Write a dummy label JSON file and return its absolute path.

    Every CSV found under data_dir is mapped to an empty list,
    meaning that no anomalies are recorded for that series.
    """
    data_dir_path = Path(data_dir)
    dummy_labels = dict()
    # Keys are file paths relative to the data directory
    for file_path in data_dir_path.rglob("*.csv"):
        file_path_relative = file_path.relative_to(data_dir_path)
        dummy_labels[str(file_path_relative)] = []
    # label_dir must already exist; it is not created here
    dummy_label_path = Path(label_dir) / "labels-combined.json"
    with dummy_label_path.open("w") as file:
        json.dump(dummy_labels, file, indent=4)
    return str(dummy_label_path.resolve())

In [12]:
label_dict_filepath = generate_dummy_labels(
    "../../data/02_intermediate/financial_fraud/ts_data/",
    "../../data/02_intermediate/financial_fraud/ts_label/"
)
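`generate_dummy_labels` maps every series to an empty list. If the known fraud flags should be recorded instead, the anomalous timestamps can be read from the `label` column. The sketch below does this for the example series shown earlier. The file name used as the key is hypothetical, and the exact timestamp format NAB expects in its label file is an assumption to check against the NAB documentation.

```python
import json

import pandas as pd

# Rows copied from the example series shown earlier in this notebook.
example = pd.DataFrame({
    "timestamp": [21, 31, 31, 36, 38, 63, 63, 63, 122, 127],
    "value": [37.1, 108.32, 188.94, 906.87, 146.25, 31.41, 177.82, 106.47, 1024.36, 80.72],
    "label": [0, 0, 0, 1, 0, 0, 0, 0, 1, 1],
})

# Hypothetical file name used as the dictionary key.
series_name = "example_transaction_data.csv"

# Collect the time steps at which a transaction is flagged as fraud.
anomalous_timestamps = example.loc[example["label"] == 1, "timestamp"].tolist()

labels = {series_name: anomalous_timestamps}
labels_json = json.dumps(labels, indent=4)
print(labels_json)
```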

## References

Edgar Alonso Lopez-Rojas and Stefan Axelsson. 2014. BankSim: A Bank Payments Simulator for Fraud Detection Research.

Alexander Lavin and Subutai Ahmad. 2015. Evaluating Real-Time Anomaly Detection Algorithms – The Numenta Anomaly Benchmark.