# Neptune ML Unsupervised Learning for Fraud Analysis

This notebook uses Neptune ML to train a model on a fraud graph data model and extract embeddings for visualization.

## Checking that we are ready to run Neptune ML

The Neptune ML Toolkit is a Python package that provides an SDK for graph machine learning with Neptune ML. You can also use the Jupyter magic `%%neptune_ml` from a Jupyter notebook.

In [None]:
!pip install --pre neptuneml_toolkit
# umap-learn provides the `umap` module imported later in this notebook
!pip install scikit-learn umap-learn

In [None]:
from IPython.display import JSON
from neptuneml_toolkit import NeptuneMLClient


In [None]:
neptune_ml = NeptuneMLClient()
neptune_ml.check_enabled()

## Exporting the data to S3 for the machine learning workflow

Since we're using SageMaker as the ML infrastructure, our data has to be in an S3 bucket that SageMaker can access.


In [None]:
s3_bucket_uri="s3://<Replace-with-your-s3-bucket-name>"

In [None]:
export_params = {
    "command": "export-pg",
    "params": {
        "endpoint": neptune_ml.get_host(),
        "profile": "neptune_ml",
        "cloneCluster": False,
    },
    "outputS3Path": f"{s3_bucket_uri}/neptune-export",
    "additionalParams": {},
    "jobSize": "medium",
}

In [None]:
export_results = neptune_ml.create_data_export_job(params=export_params, wait=True)

In [None]:
training_data_config = neptune_ml.get_training_data_configuration(export_results["jobId"])
JSON(training_data_config)

###  Selecting Node Properties for ML Features

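We can choose which properties to use as features for machine learning and how those features are encoded. The configuration in the next cell encodes `component` on Account nodes as a category, applies min-max normalization with median imputation to the Transaction `amount`, and splits the datetime properties into year, month, weekday, and hour parts.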

In [None]:
# Update node features for training
nodes = [{'file_name': 'nodes/Account.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'Account'],
    'features': [
     {'feature': ['component', 'component', 'category']},
     ]},
   {'file_name': 'nodes/EmailAddress.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'EmailAddress']},
   {'file_name': 'nodes/Address.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'Address']},
   {'file_name': 'nodes/DateOfBirth.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'DateOfBirth'],
    'features': [{'feature': ['value', 'value', 'datetime'],
     'datetime_parts': ['year', 'month', 'weekday', 'hour']}]},
   {'file_name': 'nodes/Transaction.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'Transaction'],
    'features': [
        {'feature': ['amount', 'amount', 'numerical'],
         'norm': 'min-max',
         'imputer': 'median'},
        {'feature': ['created', 'created', 'datetime'],
         'datetime_parts': ['year', 'month', 'weekday', 'hour']}]},
   {'file_name': 'nodes/PhoneNumber.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'PhoneNumber']},
   {'file_name': 'nodes/IpAddress.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'IpAddress']},
   {'file_name': 'nodes/Merchant.consolidated.csv',
    'separator': ',',
    'node': ['~id', 'Merchant']}
]

training_data_config['graph']['nodes'] = nodes
JSON(nodes)
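
To make these encoding options more concrete, here is a minimal sketch of what a `'norm': 'min-max'` feature with `'imputer': 'median'` and a `datetime_parts` feature conceptually compute. It is not Neptune ML's implementation. The helper names `min_max_with_median_imputation` and `extract_datetime_parts`, the sample values, and the Monday=0 weekday convention are assumptions made for illustration.

```python
from datetime import datetime
from statistics import median


def min_max_with_median_imputation(values):
    """Fill missing values (None) with the median, then rescale to [0, 1]."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    lo, hi = min(filled), max(filled)
    if hi == lo:
        # Constant column: avoid division by zero and map everything to 0.0
        return [0.0 for _ in filled]
    return [(v - lo) / (hi - lo) for v in filled]


def extract_datetime_parts(timestamp, parts=("year", "month", "weekday", "hour")):
    """Split an ISO 8601 timestamp string into the requested parts."""
    dt = datetime.fromisoformat(timestamp)
    available = {
        "year": dt.year,
        "month": dt.month,
        "weekday": dt.weekday(),  # assumption: Monday is 0, as in Python's datetime
        "hour": dt.hour,
    }
    return {part: available[part] for part in parts}


# Hypothetical sample data, not taken from the fraud graph
amounts = [10.0, None, 30.0, 50.0]
scaled_amounts = min_max_with_median_imputation(amounts)
created_parts = extract_datetime_parts("2021-03-15T14:30:00")
print(scaled_amounts)
print(created_parts)
```

The missing amount is filled before scaling, so the imputed value takes part in the min-max range like any observed value.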

## ML Data Processing

Since we're using DGL as the graph ML framework, we process the exported data in the S3 bucket to create a graph representation in DGL format and apply the feature engineering we specified earlier.


In [None]:
data_processing_output = neptune_ml.create_data_processing_job(
    inputDataS3Location=export_results['outputS3Uri'],
    configFileName='training-data-configuration.json',
    processedDataS3Location='{}/preloading'.format(s3_bucket_uri),
    trainingDataConfiguration=training_data_config,
    wait=True)

## ML Model Training

In [None]:
model_training_output = neptune_ml.create_model_training_job(dataProcessingJobId=data_processing_output["id"],
                                     trainModelS3Location='{}/training'.format(s3_bucket_uri),
                                     wait=True)

In [None]:
model_training_output

## Get Trained Model Embeddings

In [None]:
embeddings = neptune_ml.get_embeddings(model_training_output["id"])

In [None]:
mapping, embedding_index_mapping = neptune_ml.get_node_index_mapping(model_training_output["id"])
account_embeddings = embeddings[embedding_index_mapping['Account']]

In [None]:
print(account_embeddings.shape)
account_embeddings


## Reduce embedding dimension for visualization

In [None]:
from umap import UMAP

# Project the Account embeddings down to two dimensions so they can be plotted
dim_reducer = UMAP(n_components=2)
account_embeddings_reduced_dim = dim_reducer.fit_transform(account_embeddings)

## Visualize generated embeddings

In [None]:
from sklearn.ensemble import IsolationForest

In [None]:
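# fit() infers the number of features from the data, so n_features_in_ does not need to be set by hand.
# predict() returns -1 for predicted anomalies and +1 for inliers.
iso = IsolationForest()
y_pred = iso.fit(account_embeddings).predict(account_embeddings)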

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot()
fig.suptitle("Visualization of node embeddings and predicted anomalies")
# Index 0 (blue) marks predicted anomalies (y_pred == -1); index 1 (orange) marks inliers (y_pred == 1)
colors = np.array(["#377eb8", "#ff7f00"])
axis = ax.scatter(account_embeddings_reduced_dim[:, 0], account_embeddings_reduced_dim[:, 1],
                  color=colors[(y_pred + 1) // 2])
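
`IsolationForest.predict` labels predicted anomalies as -1 and inliers as +1, so `(y_pred + 1) // 2` turns those labels into the indices 0 and 1 used to pick a color. A minimal sketch of that mapping, using a hand-written label array (`y_pred_example`, a hypothetical stand-in for the real predictions):

```python
import numpy as np

# Hypothetical labels in the format returned by IsolationForest.predict:
# -1 for a predicted anomaly, +1 for an inlier
y_pred_example = np.array([1, -1, 1, 1, -1])

colors = np.array(["#377eb8", "#ff7f00"])  # index 0: blue, index 1: orange

# (-1 + 1) // 2 == 0 and (1 + 1) // 2 == 1
color_index = (y_pred_example + 1) // 2
point_colors = colors[color_index]

# Positions of the points flagged as anomalies
anomaly_positions = np.flatnonzero(y_pred_example == -1)

print(color_index)
print(point_colors)
print(anomaly_positions)
```

In the scatter plot, the blue points are therefore the accounts that the isolation forest flags as anomalous.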

In [None]:
from sklearn.cluster import KMeans

In [None]:
kmeans = KMeans(n_clusters=12).fit(account_embeddings_reduced_dim)
clusters = kmeans.labels_


fraud_acc_index = mapping['Account']['account-4398046511937']

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot()
fig.suptitle("Visualization of node embeddings and clusters")
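# Draw every cluster in orange except cluster 4, which is drawn in blue.
# KMeans label numbering can change between runs unless random_state is fixed.
colors = np.array(["#ff7f00"] * 12)
colors[4] = "#377eb8"
ax.scatter(account_embeddings_reduced_dim[:, 0], account_embeddings_reduced_dim[:, 1], color=colors[clusters % 12])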
ax.scatter(account_embeddings_reduced_dim[fraud_acc_index, 0], account_embeddings_reduced_dim[fraud_acc_index, 1], color='r')
for index, key in enumerate(list(mapping['Account'].keys())[:15]):
    ax.annotate(key, (account_embeddings_reduced_dim[index, 0], account_embeddings_reduced_dim[index, 1])) 
ax.annotate('account-4398046511937', (account_embeddings_reduced_dim[fraud_acc_index, 0], account_embeddings_reduced_dim[fraud_acc_index, 1]))
axis = ax.set(xlim=(4, 11), ylim=(2, 9))
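
The plot highlights cluster 4 by index, but KMeans numbers its clusters arbitrarily, so it can help to look up the highlighted account's cluster programmatically. The sketch below assumes, as the code above does, that `mapping['Account']` maps each account ID to its row in `account_embeddings`. The helper name `accounts_in_same_cluster` and the toy data are hypothetical.

```python
import numpy as np


def accounts_in_same_cluster(cluster_labels, id_to_index, account_id):
    """Return the cluster of account_id and the IDs of the other accounts in it."""
    target_cluster = int(cluster_labels[id_to_index[account_id]])
    peers = [
        other_id
        for other_id, index in id_to_index.items()
        if other_id != account_id and cluster_labels[index] == target_cluster
    ]
    return target_cluster, peers


# Toy stand-ins for `clusters` and `mapping['Account']`
toy_clusters = np.array([0, 2, 2, 1, 2])
toy_mapping = {"account-a": 0, "account-b": 1, "account-c": 2,
               "account-d": 3, "account-e": 4}

cluster_id, peer_accounts = accounts_in_same_cluster(toy_clusters, toy_mapping, "account-b")
print(cluster_id, peer_accounts)
```

With the notebook's own variables, the call would look like `accounts_in_same_cluster(clusters, mapping['Account'], 'account-4398046511937')`, and the returned cluster index could replace the hard-coded `4` when assigning colors.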