# Example: creating data, preprocessing with label noise, and training statistical models
In this example we train statistical models for node classification. To this end, we first generate a transaction log with Swish-AMLsim and then split the data into two parts: a training set and a test set. The split is done in time, with an overlap. For each set we then build a graph of nodes and edges. The nodes represent bank accounts and the edges represent relations between two accounts. Both nodes and edges carry aggregated features of the transactions, and the labels are on the nodes. The label is negative (0) if the node did not participate in money laundering activities and positive (1) otherwise.
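The time split with overlap can be pictured with a small sketch. It is only an illustration under stated assumptions: the helper `split_window`, the `step` column, and the rule that both sets have equal length and share the fraction `overlap` of that length are hypothetical, and `cal_features` may define the overlap differently.

```python
import pandas as pd

def split_window(start, end, overlap):
    """Return (train_range, test_range) for one window.

    Assumption: both sets have the same length and share the
    fraction `overlap` of that length.
    """
    length = (end - start) / (2 - overlap)
    train_range = (start, start + length)
    test_range = (end - length, end)
    return train_range, test_range

# Hypothetical transaction log with one transaction per time step.
tx = pd.DataFrame({"step": range(30), "amount": [100.0] * 30})

train_range, test_range = split_window(0, 30, overlap=0.5)

# Half-open intervals: start <= step < end.
df_train = tx[(tx["step"] >= train_range[0]) & (tx["step"] < train_range[1])]
df_test = tx[(tx["step"] >= test_range[0]) & (tx["step"] < test_range[1])]
```

Under this assumption, the window (0, 30) with an overlap of 0.5 gives a training set on steps 0 to 19 and a test set on steps 10 to 29, so the two sets share steps 10 to 19.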

The features on the nodes are:
- sums, means, medians, stds, maxs, mins, counts of spending transactions (transactions to the sink).
- sums, means, medians, stds, maxs, mins of swish transactions (transactions within the network).
- number of incoming and outgoing swish transactions.
- number of unique accounts the node has sent to and received from.

The features of the edges are:
- sums, means, medians, stds, maxs, mins, counts of transactions between the two nodes.
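As a rough illustration of how such aggregated features can be computed, the sketch below uses pandas `groupby`. The column names `src`, `dst` and `amount` and the output names are assumptions made for the example; they are not necessarily what `cal_features` uses.

```python
import pandas as pd

# Hypothetical transaction log; the column names are assumptions.
tx = pd.DataFrame({
    "src": [1, 1, 2, 1],
    "dst": [2, 2, 3, 3],
    "amount": [10.0, 30.0, 5.0, 20.0],
})

aggs = ["sum", "mean", "median", "std", "max", "min", "count"]

# Edge features: one row per (src, dst) pair.
edge_features = tx.groupby(["src", "dst"])["amount"].agg(aggs).reset_index()

# Node features: aggregates of outgoing amounts per sending account,
# plus the number of unique accounts the node has sent to.
node_out = tx.groupby("src")["amount"].agg(aggs).add_suffix("_out")
node_out["n_unique_out"] = tx.groupby("src")["dst"].nunique()
```

The same pattern, grouped by `dst` instead of `src`, gives the incoming side. Note that `std` is NaN for groups with a single transaction.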

The labels are:
- 0 if the node did not participate in money laundering activities.
- 1 if the node participated in money laundering activities.

After building the graph, we add noise to the labels in different ways. In this example, there are four types of noise:
- Flipped labels: random nodes flip their labels.
- Missing labels: random nodes lose their label.
- Neighbour noise: random negative nodes have their labels flipped if they have a positive neighbour.
- Topology noise: random nodes in specific topologies have their labels flipped, simulating a bank that misses labels for specific money laundering strategies.

We will create four different datasets, one for each type of noise.
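The first two noise types can be sketched as follows. This is a minimal illustration, not the code in `preprocess.noise`: the function names, the label column `is_sar`, the `seed` argument, and the use of NaN to mark a missing label are all assumptions.

```python
import numpy as np
import pandas as pd

def _sample_per_label(df, labels, fracs, rng):
    """Pick a random fraction of the row indices for each label value."""
    chosen = []
    for label, frac in zip(labels, fracs):
        idx = df.index[df["is_sar"] == label]
        n = int(round(frac * len(idx)))
        chosen.extend(rng.choice(idx, size=n, replace=False))
    return chosen

def flip_labels_sketch(df, labels, fracs, seed=0):
    """Flip the binary label of the sampled nodes (0 -> 1, 1 -> 0)."""
    out = df.copy()
    chosen = _sample_per_label(df, labels, fracs, np.random.default_rng(seed))
    out.loc[chosen, "is_sar"] = 1 - out.loc[chosen, "is_sar"]
    return out

def missing_labels_sketch(df, labels, fracs, seed=0):
    """Replace the label of the sampled nodes with NaN."""
    out = df.copy()
    out["is_sar"] = out["is_sar"].astype(float)
    chosen = _sample_per_label(df, labels, fracs, np.random.default_rng(seed))
    out.loc[chosen, "is_sar"] = np.nan
    return out

# Hypothetical node table: 90 negative and 10 positive nodes.
nodes = pd.DataFrame({"is_sar": [0] * 90 + [1] * 10})
flipped = flip_labels_sketch(nodes, labels=[0, 1], fracs=[0.1, 0.5])
missing = missing_labels_sketch(nodes, labels=[0, 1], fracs=[0.1, 0.5])
```

With these fractions, 9 negative and 5 positive nodes are affected in each dataset. Sampling is done per label value so that the noise rate can differ between the two classes.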

Finally, we will train statistical models on the noisy data and evaluate the performance of the model on the test set. The models are:
- Logistic regression
- Random forest
- Gradient boosting
- Support vector machine
- K-nearest neighbors

The results are the TN, FP, FN and TP counts for 101 thresholds between 0 and 1. These can be used to calculate metrics such as ROC curves, AUC, F1, precision, recall, and accuracy.
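The sketch below shows one way to turn such counts into metrics. The layout of `counts` (a dict mapping each threshold to a `(tn, fp, fn, tp)` tuple) and the numbers in it are made up for illustration; the structure returned by `train_models` may differ.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute common metrics from one confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "fpr": fpr,
            "f1": f1, "accuracy": accuracy}

def roc_auc(points):
    """Area under the ROC curve from (fpr, recall) points, trapezoidal rule."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Hypothetical counts for three thresholds: (tn, fp, fn, tp).
counts = {0.0: (0, 90, 0, 10), 0.5: (80, 10, 2, 8), 1.0: (90, 0, 10, 0)}

per_threshold = {t: metrics_from_counts(*c) for t, c in counts.items()}
auc = roc_auc([(m["fpr"], m["recall"]) for m in per_threshold.values()])
```

With all 101 thresholds the ROC curve is sampled much more densely, so the trapezoidal estimate of the AUC becomes correspondingly more accurate.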

In [None]:
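# imports
import os
import sys

import pandas as pd  # used below to read alert_members.csv

# make the repository root importable from the notebook directory
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))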

from preprocess.feature_engineering import *
from preprocess.noise import *
from benchmark.stat_models import *

## Create data
TODO: write code to run AMLsim from a notebook.

For now we will use a pre-generated dataset. 

In [None]:
dataset_name = '10K_accts'
path_to_tx_log = f'../AMLsim/outputs/{dataset_name}/tx_log.csv'

## Preprocess data


In [None]:
df = load_data(path_to_tx_log)
bank = 'bank'
overlap = 0.5 # overlap of training and testing data
windows = [(0, 30)] # int or list of tuples - if int then the number of windows, if list of tuples then the start and end step for each window

datasets = cal_features(path_to_tx_log, [bank], windows, overlap)
trainset, testset = datasets[0]
df_nodes_train, df_edges_train = trainset
df_nodes_test, df_edges_test = testset

print('\ntrainset:\nnodes:')
display(df_nodes_train.head())
print('edges:')
display(df_edges_train.head())
print('\ntestset:\nnodes:')
display(df_nodes_test.head())
print('edges:')
display(df_edges_test.head())

## Add noise to the labels

### Flipped labels


In [None]:
labels = [0, 1] # labels that will be affected by noise
fracs = [0.01, 0.1] # fractions of nodes that will be affected by noise for each label
df_flipped_lables = flip_labels(df_nodes_train, labels, fracs)
display(df_flipped_lables.head())

### Missing labels

In [None]:
labels = [0, 1] # labels that will be affected by noise
fracs = [0.01, 0.1] # fractions of nodes that will be affected by noise for each label
df_missing_labels = missing_labels(df_nodes_train, labels, fracs)
display(df_missing_labels.head())

### Neighbour noise
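A minimal sketch of the idea, assuming hypothetical column names (`account`, `is_sar`, `src`, `dst`) and treating edges as undirected; `flip_neighbours` itself may be implemented differently.

```python
import numpy as np
import pandas as pd

def flip_neighbours_sketch(df_nodes, df_edges, frac, seed=0):
    """Flip a fraction of the negative neighbours of positive nodes to positive."""
    out = df_nodes.copy()
    label = df_nodes.set_index("account")["is_sar"]
    positives = set(label.index[label == 1])
    # Accounts on the other end of any edge that touches a positive node.
    nbrs = set(df_edges.loc[df_edges["src"].isin(positives), "dst"])
    nbrs |= set(df_edges.loc[df_edges["dst"].isin(positives), "src"])
    candidates = sorted(a for a in nbrs if label.get(a, 1) == 0)
    n = int(round(frac * len(candidates)))
    if n == 0:
        return out
    chosen = np.random.default_rng(seed).choice(candidates, size=n, replace=False)
    out.loc[out["account"].isin(chosen), "is_sar"] = 1
    return out

# Hypothetical graph: account 1 is positive and connected to accounts 2 and 3.
nodes = pd.DataFrame({"account": [1, 2, 3, 4, 5], "is_sar": [1, 0, 0, 0, 0]})
edges = pd.DataFrame({"src": [1, 3, 4], "dst": [2, 1, 5]})
noisy = flip_neighbours_sketch(nodes, edges, frac=0.5)
```

In this toy graph, accounts 2 and 3 are the only candidates, so a fraction of 0.5 flips exactly one of them, while accounts 4 and 5 are never touched.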

In [None]:
neighbour_frac = 0.5 # fraction of negative neighbours that will be flipped to positive
df_flipped_neighbours = flip_neighbours(df_nodes_train, df_edges_train, neighbour_frac)
display(df_flipped_neighbours.head())

### Topology noise
We add noise to the 'gather_scatter', 'scatter_gather' and 'stack' topologies while keeping 'fan_in', 'fan_out' and 'bipartite' untouched, to see whether the models can generalize to more complex patterns.
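One way to picture this kind of noise is sketched below. The column names `account`, `topology` and `is_sar` are hypothetical and need not match the real `alert_members.csv`, and `topology_noise` may select and flip nodes differently.

```python
import numpy as np
import pandas as pd

def topology_noise_sketch(df_nodes, alert_members, topologies, frac, seed=0):
    """Flip the label of a fraction of the nodes that belong to given topologies."""
    out = df_nodes.copy()
    in_topology = alert_members.loc[
        alert_members["topology"].isin(topologies), "account"
    ].unique()
    candidates = out.index[out["account"].isin(in_topology)]
    n = int(round(frac * len(candidates)))
    chosen = np.random.default_rng(seed).choice(candidates, size=n, replace=False)
    out.loc[chosen, "is_sar"] = 1 - out.loc[chosen, "is_sar"]
    return out

# Hypothetical data: accounts 1-2 form a 'stack', accounts 3-4 a 'fan_in'.
nodes = pd.DataFrame({"account": [1, 2, 3, 4, 5, 6],
                      "is_sar": [1, 1, 1, 1, 0, 0]})
members = pd.DataFrame({"account": [1, 2, 3, 4],
                        "topology": ["stack", "stack", "fan_in", "fan_in"]})
noisy = topology_noise_sketch(nodes, members, ["stack"], frac=0.5)
```

Here one of the two 'stack' members loses its positive label, while the 'fan_in' members keep theirs.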

In [None]:
topologies = ['gather_scatter', 'scatter_gather', 'stack'] 
topology_frac = 0.5 # fraction of labels in the topologies that will be affected by noise

# to add the topology noise we need to know the alert members, this will be changed in the future
alert_members = pd.read_csv(path_to_tx_log.replace('outputs', 'tmp').replace('tx_log.csv', 'alert_members.csv'))

df_flipped_topologies = topology_noise(df_nodes_train, alert_members, topologies, topology_frac)
display(df_flipped_topologies.head())

## Train models on the datasets


In [None]:
model_names = ['XGB','RF','SVM','KNN','LOG']
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

In [None]:

datasets = (df_nodes_train, df_nodes_test)
results = train_models(datasets, model_names)
for model_name, metrics in results.items():
    print(model_name)
    for metric, value in metrics.items():
        if metric in thresholds:
            print(f'  threshold: {metric}: {value}')

In [None]:
dataset = (df_flipped_lables, df_nodes_test)  # train on flipped labels, test on clean labels
results = train_models(dataset, model_names)
for model_name, metrics in results.items():
    print(model_name)
    for metric, value in metrics.items():
        if metric in thresholds:
            print(f'  threshold: {metric}: {value}')

In [None]:
dataset = (df_missing_labels, df_nodes_test)  # train on missing labels, test on clean labels
results = train_models(dataset, model_names)
for model_name, metrics in results.items():
    print(model_name)
    for metric, value in metrics.items():
        if metric in thresholds:
            print(f'  threshold: {metric}: {value}')

In [None]:
dataset = (df_flipped_neighbours, df_nodes_test)  # train on neighbour noise, test on clean labels
results = train_models(dataset, model_names)
for model_name, metrics in results.items():
    print(model_name)
    for metric, value in metrics.items():
        if metric in thresholds:
            print(f'  threshold: {metric}: {value}')

In [None]:
dataset = (df_flipped_topologies, df_nodes_test)  # train on topology noise, test on clean labels
results = train_models(dataset, model_names)
for model_name, metrics in results.items():
    print(model_name)
    for metric, value in metrics.items():
        if metric in thresholds:
            print(f'  threshold: {metric}: {value}')