## BASELINE DETECTOR EVALUATION (MERCHANT)

Evaluate the simple threshold rule for detecting velocity spikes in the Sparkov + synthetic spikes dataset.

# Method

 - Count number of unique cards per merchant in 30s buckets
 - Raise a flag if the count is greater than the set threshold
 - Compare predictions to the ground truth
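
The first two steps above can be sketched in a few lines. This is a minimal illustration, not the project's `MerchantBaseline`: the class name `UniqueCardRule`, the helper `bucket_start`, and the returned `count` field are hypothetical, and the comparison is written as `>=` to match the `counts >= THRESHOLD` test used on the Caixa data later in this notebook. Whether the real detector uses `>` or `>=` is not shown here.

```python
from collections import defaultdict
from datetime import datetime

BUCKET_SECONDS = 30  # bucket width taken from the method above


def bucket_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 30-second bucket.

    Works on the seconds field only, which is valid because 30 divides 60.
    """
    return ts.replace(second=ts.second - ts.second % BUCKET_SECONDS, microsecond=0)


class UniqueCardRule:
    """Hypothetical stand-in for the merchant detector (not the project's class)."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        # (merchant_id, bucket) -> set of distinct card_ids seen so far
        self.cards = defaultdict(set)

    def update(self, merchant_id: str, ts: datetime, card_id: str):
        bucket = bucket_start(ts)
        seen = self.cards[(merchant_id, bucket)]
        seen.add(card_id)
        # ASSUMPTION: ">=" comparison; the real rule may use ">"
        flag = len(seen) >= self.threshold
        return flag, {"bucket": bucket, "count": len(seen)}


rule = UniqueCardRule(threshold=3)
t0 = datetime(2024, 1, 1, 12, 0, 5)
events = [
    ("m1", t0, "c1"),
    ("m1", t0.replace(second=10), "c1"),  # repeat card, count stays at 1
    ("m1", t0.replace(second=20), "c2"),
    ("m1", t0.replace(second=29), "c3"),  # third distinct card in the bucket
    ("m1", t0.replace(second=31), "c4"),  # next bucket, count restarts
]
flags = [rule.update(m, ts, c)[0] for m, ts, c in events]
print(flags)  # [False, False, False, True, False]
```

Note that this sketch never evicts old buckets, so its memory grows with the stream; a production detector would likely drop buckets once they close.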

# STREAMED DATA EVALUATION

Stream the dataset through once. On each transaction (Tx), update the per-merchant unique card count and flag the bucket once the count is greater than the threshold. Then calculate the confusion matrix over all merchant/bucket pairs.

In [1]:
import sys
import os
import pandas as pd

#create absolute path to the src directory in the notebook's parent directory
module_path = os.path.abspath(os.path.join('..', 'src'))
#add to sys.path if not already there
if module_path not in sys.path:
    sys.path.append(module_path)

#import detector wrapper functions
from baseline_detector import MerchantBaseline, CardBaseline
#import the merchant and card sets from truth tables
from truth_tables import MERCHANT_SET, CARD_SET
#import evaluation helper functions
from eval_funcs import threshold_predictions, per_bucket_confusion, precision_recall_f1, sweep_thresholds

#ensure consistent dtypes
DTYPES = {"merchant_id": str, "card_id": str}

In [2]:
#read in spiked dataset, parsing timestamp as a datetime and setting merchant and card ids to strings
#set low_memory=False to avoid incorrect data type inferences
df = pd.read_csv(
    "../data/processed/sparkov_spikes.csv",
    parse_dates=["timestamp"],
    dtype=DTYPES,
    low_memory=False
)

#define initial merchant spike baseline
THRESHOLD_M = 6

#create instance of merchant baseline class
mb = MerchantBaseline(threshold=THRESHOLD_M)

#create empty sets for predictions and coverage
predicted = set() #set of (merchant_id, bucket) that we flag
all_buckets = set() #set of all (merchant_id, buckets) streamed in

for row in df.itertuples():
    #call the baseline wrapper class' update function
    flag, info = mb.update(row.merchant_id, row.timestamp, row.card_id)
    bucket = info["bucket"]
    #add each pair to the all_buckets set
    all_buckets.add((row.merchant_id, bucket))
    #add to flagged set if over threshold
    if flag:
        predicted.add((row.merchant_id, bucket))

#find confusion matrix
cm = per_bucket_confusion(predicted, MERCHANT_SET, all_buckets)
#calc precision, recall, and f1 scores
m  = precision_recall_f1(cm["tp"], cm["fp"], cm["fn"])

#print results
print(f"TP {cm['tp']} FP {cm['fp']} FN {cm['fn']} TN {cm['tn']}")
print(f"precision {m['precision']:.3f} recall {m['recall']:.3f} F1 {m['f1']:.3f}")

TP 50 FP 0 FN 0 TN 1555392
precision 1.000 recall 1.000 F1 1.000
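
Since predictions, truth, and coverage are all sets of the same kind of key, the confusion counts reduce to set operations. The sketch below shows one way this could work; `confusion_from_sets` is a hypothetical name and is not taken from `eval_funcs`, whose implementation is not shown in this notebook.

```python
def confusion_from_sets(predicted: set, truth: set, universe: set) -> dict:
    """Confusion counts when predictions and truth are sets of the same kind of key."""
    return {
        "tp": len(predicted & truth),            # flagged and truly a spike
        "fp": len(predicted - truth),            # flagged but not a spike
        "fn": len(truth - predicted),            # spike that was missed
        "tn": len(universe - predicted - truth), # neither flagged nor a spike
    }


# toy (merchant_id, bucket) pairs
universe = {("m1", 0), ("m1", 30), ("m2", 0), ("m2", 30), ("m3", 0)}
truth = {("m1", 0), ("m2", 30)}
predicted = {("m1", 0), ("m3", 0)}

cm = confusion_from_sets(predicted, truth, universe)
print(cm)  # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 2}
```

A quick sanity check on the real output above: 50 + 0 + 0 + 1555392 = 1555442, which matches the merchant-bucket total printed by the threshold sweep.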


# Varied Threshold Test

Precompute the unique cards per merchant/bucket, then sweep through thresholds 2-8 and calculate and compare results for each threshold.

In [10]:
#add a bucket column to the dataframe (flooring same as detector/truth table)
df["bucket"] = df["timestamp"].dt.floor("30s")

#count unique cards per merchant/bucket
counts_m = df.groupby(["merchant_id", "bucket"])["card_id"].nunique()
#convert to a dictionary of key: (merchant, bucket), value: count of unique cards
counts_m_dict = {k: int(v) for k, v in counts_m.items()}
#all merchant/bucket pairs (used to find number of true negatives)
all_buckets = set(counts_m_dict.keys())

results = sweep_thresholds(counts_m_dict, MERCHANT_SET, all_buckets, start=2, stop=8)
for r in results:
    print(r)

print("truth merchant spikes:", len(MERCHANT_SET))
print("merchant-buckets total:", counts_m.size)

{'th': 2, 'tp': 50, 'fp': 2297, 'fn': 0, 'tn': 1553095, 'precision': 0.021303792074989347, 'recall': 1.0, 'f1': 0.04171881518564872}
{'th': 3, 'tp': 50, 'fp': 2, 'fn': 0, 'tn': 1555390, 'precision': 0.9615384615384616, 'recall': 1.0, 'f1': 0.9803921568627451}
{'th': 4, 'tp': 50, 'fp': 0, 'fn': 0, 'tn': 1555392, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'th': 5, 'tp': 50, 'fp': 0, 'fn': 0, 'tn': 1555392, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'th': 6, 'tp': 50, 'fp': 0, 'fn': 0, 'tn': 1555392, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'th': 7, 'tp': 50, 'fp': 0, 'fn': 0, 'tn': 1555392, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
{'th': 8, 'tp': 50, 'fp': 0, 'fn': 0, 'tn': 1555392, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
truth merchant spikes: 50
merchant-buckets total: 1555442
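
Because the unique-card counts are computed once, each threshold only needs a filter over the counts dictionary rather than another pass over the transactions. The sketch below illustrates the idea; `sweep_counts` is a hypothetical function, not the project's `sweep_thresholds`. It assumes an inclusive `stop` (the output above includes `th=8` for `stop=8`) and a `>=` comparison.

```python
def sweep_counts(counts: dict, truth: set, start: int, stop: int) -> list:
    """Apply each threshold in [start, stop] to precomputed per-key counts."""
    rows = []
    for th in range(start, stop + 1):  # ASSUMPTION: stop is inclusive
        # ASSUMPTION: a key is flagged when its count is >= the threshold
        predicted = {key for key, n in counts.items() if n >= th}
        rows.append({
            "th": th,
            "tp": len(predicted & truth),
            "fp": len(predicted - truth),
            "fn": len(truth - predicted),
        })
    return rows


# toy counts keyed by (merchant_id, bucket)
counts = {("m1", 0): 7, ("m1", 30): 2, ("m2", 0): 3, ("m3", 0): 1}
truth = {("m1", 0)}

rows = sweep_counts(counts, truth, start=2, stop=4)
for row in rows:
    print(row)
# {'th': 2, 'tp': 1, 'fp': 2, 'fn': 0}
# {'th': 3, 'tp': 1, 'fp': 1, 'fn': 0}
# {'th': 4, 'tp': 1, 'fp': 0, 'fn': 0}
```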


# Choose Threshold Going Forward

Choose the lowest threshold that still gives good results and is also likely to be safe against unseen data. In this case, every threshold from 4 to 8 gives the same perfect results, so we will keep 6 as the baseline rule, as it leaves a margin above 4 and should be safer against unseen data.

# Caixa Dataset FP Test

Run the real-life dataset (Caixa) through the baseline detector.
Keep or change the current threshold rule based on the results.

In [11]:
#read in Caixa dataset
dfc = pd.read_csv("../data/processed/caixa_pos_sorted.csv",
                  parse_dates=["timestamp"],
                  dtype={"merchant_id": str, "card_id": str},
                  low_memory=False)

#add a bucket column to the dataframe (flooring same as detector/truth table)
dfc["bucket"] = dfc["timestamp"].dt.floor("30s")
#count unique cards per merchant
counts_m_c = dfc.groupby(["merchant_id","bucket"])["card_id"].nunique()

#for various threshold values
for THRESHOLD in [4,5,6,7,8]:
    #count number of merchant/bucket pairs flagged as fraud
    num_flagged   = int((counts_m_c >= THRESHOLD).sum())
    #number of total merchant/bucket pairs
    total_buckets = int(counts_m_c.size)
    #calculate rate of false positives
    rate = num_flagged / total_buckets if total_buckets else 0.0
    #print results
    print(f"th={THRESHOLD}: {num_flagged}/{total_buckets}  ({rate:.5%})")

th=4: 254/6692434  (0.00380%)
th=5: 10/6692434  (0.00015%)
th=6: 1/6692434  (0.00001%)
th=7: 0/6692434  (0.00000%)
th=8: 0/6692434  (0.00000%)


A threshold of 6 gives almost perfect results, with only 1 false positive in 6,692,434 merchant/bucket pairs, so we will keep this threshold going forward.

## BASELINE DETECTOR EVALUATION (CARD BURST)
This mirrors what was done above, but for the card burst threshold rather than the merchant spike threshold.

# Method

 - Count number of unique merchants per card in 30s buckets
 - Raise a flag if the count is greater than the set threshold
 - Compare predictions to the ground truth

# STREAMED DATA EVALUATION

Stream the dataset through once. On each transaction (Tx), update the per-card unique merchant count and flag the card once the count in a bucket is greater than the threshold. Then calculate the confusion matrix at the card level, over the set of all card IDs.

In [13]:
#set static baseline threshold of 4
THRESHOLD_C = 4

#create instance of card baseline class
cb   = CardBaseline(threshold=THRESHOLD_C)

#create empty sets for predictions and coverage
predicted_cards = set() #set of card_ids that are flagged
all_cards = set(df["card_id"].astype(str)) #set of all card_ids


for row in df.itertuples():
    #call the baseline wrapper class' update function
    flag, info = cb.update(row.card_id, row.timestamp, row.merchant_id)
    #add to flagged set if over threshold
    if flag:
        predicted_cards.add(row.card_id)

#find confusion matrix
cm_c = per_bucket_confusion(predicted_cards, CARD_SET, all_cards)
#calc precision, recall, and f1 scores
m_c  = precision_recall_f1(cm_c["tp"], cm_c["fp"], cm_c["fn"])

#print results
print(f"TP {cm_c['tp']} FP {cm_c['fp']} FN {cm_c['fn']} TN {cm_c['tn']}")
print(f"precision {m_c['precision']:.3f} recall {m_c['recall']:.3f} F1 {m_c['f1']:.3f}")


TP 41 FP 0 FN 9 TN 1749
precision 1.000 recall 0.820 F1 0.901
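
The scores above follow directly from the confusion counts. As a worked check, the sketch below recomputes them from TP 41, FP 0, FN 9; `precision_recall_f1_sketch` is a hypothetical re-implementation, and returning 0.0 when a denominator is zero is an assumption about how the real helper behaves.

```python
def precision_recall_f1_sketch(tp: int, fp: int, fn: int) -> dict:
    """Standard precision, recall and F1 from confusion counts."""
    # ASSUMPTION: undefined ratios (zero denominator) are reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


m_check = precision_recall_f1_sketch(tp=41, fp=0, fn=9)
print(f"precision {m_check['precision']:.3f} recall {m_check['recall']:.3f} F1 {m_check['f1']:.3f}")
# precision 1.000 recall 0.820 F1 0.901
```

Recall is 41 / 50 = 0.82 because 9 of the 50 true card bursts were missed, and F1 is 2 × 1.0 × 0.82 / 1.82 ≈ 0.901.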


# Varied Threshold Test

Sweep through thresholds 2-7 and calculate and compare results for each threshold. Our `sweep_thresholds` function doesn't work here, as the truth table is keyed by card_id while the counts dictionary is keyed by (card_id, bucket).

In [None]:
#counts unique merchants per card_id/bucket
counts_c = df.groupby(["card_id", "bucket"])["merchant_id"].nunique()
#set up counts dictionary with key: (card_id, bucket) and value: unique merchant count
counts_c_dict = {k: int(v) for k, v in counts_c.items()}
#create set of all card_ids
all_cards = set(df["card_id"].astype(str))

#define new functions for sweep of card thresholds
def card_metrics_at(th):
    #get predicted pairs (card_id, bucket)
    pred_pairs = threshold_predictions(counts_c_dict, th)
    #collapse the predicted pairs to just the card ids (which ones have been flagged)
    pred_cards = {cid for (cid, _) in pred_pairs}

    #calculate confusion at the card level (sets of card_ids)
    cm = per_bucket_confusion(pred_cards, CARD_SET, all_cards)
    #calculate evaluation metrics
    m  = precision_recall_f1(cm["tp"], cm["fp"], cm["fn"])
    #return the threshold, confusion matrix and evaluation scores 
    #(** unpacks the dictionary to allow us to create a new one)
    return {"th": th, **cm, **m}

#print the results for each threshold
for th in range(2, 8):
    print(card_metrics_at(th))

{'th': 2, 'tp': 50, 'fp': 492, 'fn': 0, 'tn': 1257, 'precision': 0.09225092250922509, 'recall': 1.0, 'f1': 0.16891891891891891}
{'th': 3, 'tp': 49, 'fp': 0, 'fn': 1, 'tn': 1749, 'precision': 1.0, 'recall': 0.98, 'f1': 0.98989898989899}
{'th': 4, 'tp': 41, 'fp': 0, 'fn': 9, 'tn': 1749, 'precision': 1.0, 'recall': 0.82, 'f1': 0.9010989010989011}
{'th': 5, 'tp': 0, 'fp': 0, 'fn': 50, 'tn': 1749, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
{'th': 6, 'tp': 0, 'fp': 0, 'fn': 50, 'tn': 1749, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
{'th': 7, 'tp': 0, 'fp': 0, 'fn': 50, 'tn': 1749, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}


# Choose Threshold Going Forward

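Choose the lowest threshold that still gives good results. In this sweep, thresholds 5 to 7 detect none of the 50 true bursts (recall 0.0), and threshold 2 produces 492 false positives. Threshold 3 gives the best F1 (0.990: 49 of 50 bursts found, no false positives), and threshold 4 gives F1 0.901 with 9 misses. The sweep therefore supports 3 or 4 as the baseline rule, not 5; the streamed evaluation above used 4.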

# Caixa Dataset FP Test

Run the real-life dataset (Caixa) through the baseline detector.
Keep or change the current threshold rule based on the results.

In [None]:
#count how many unique merchants each card visited in each bucket
counts_c_c = dfc.groupby(["card_id","bucket"])["merchant_id"].nunique()

#test several card thresholds on the real data
for TH in [4,5,6,7]:
    #number of card/buckets that meet threshold
    pair_flags = int((counts_c_c >= TH).sum())
    #total number of card/bucket pairs
    pair_total = int(counts_c_c.size)
    #share of card/buckets flagged at this threshold
    pair_rate  = pair_flags / pair_total if pair_total else 0.0

    #count of how many unique cards would be flagged at this threshold
    #reset_index() turns card/bucket pairs into columns so we can extract card_id
    cards_flagged = counts_c_c[counts_c_c >= TH].reset_index()["card_id"].astype(str).nunique()
    #total count of unique cards
    total_cards   = dfc["card_id"].astype(str).nunique()
    #share of cards flagged at least once
    card_rate     = cards_flagged / total_cards if total_cards else 0.0

    #print results
    print(f"th={TH}: pair-rate {pair_flags}/{pair_total} ({pair_rate:.5%}); "
          f"card-rate {cards_flagged}/{total_cards} ({card_rate:.5%})")


th=4: pair-rate 0/6825428 (0.00000%); card-rate 0/4065 (0.00000%)
th=5: pair-rate 0/6825428 (0.00000%); card-rate 0/4065 (0.00000%)
th=6: pair-rate 0/6825428 (0.00000%); card-rate 0/4065 (0.00000%)
th=7: pair-rate 0/6825428 (0.00000%); card-rate 0/4065 (0.00000%)


## Results

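No card/bucket pair in the Caixa dataset reaches even 4 unique merchants, so there are no 'card bursts' to detect and every threshold tested (4 to 7) gives zero false positives. The Caixa data therefore does not help choose between these thresholds.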