<h1>
    Network Analysis for Anti-Money Laundering with NetworkX and gravis
</h1>

<h4>
    Importing necessary libraries
</h4>

In [4]:
import pandas as pd
import networkx as nx
import itertools
from distinctipy import distinctipy
import gravis as gv

<h3>
    For our analysis we use a dataset generated with IBM AMLSim. The same dataset is also available on Kaggle.
</h3>
<h4>
    The dataset has 8 columns in total.
</h4>
<br/>
<table style='float:left'>
    <tr>
        <th>Column Name</th>
        <th>Data Understanding</th>
    </tr>
    <tr>
        <td>TX_ID</td>
        <td>Transaction ID</td>
    </tr>
    <tr>
        <td>SENDER_ACCOUNT_ID</td>
        <td>Account ID of the person sending the fund</td>
    </tr>
    <tr>
        <td>RECEIVER_ACCOUNT_ID</td>
        <td>Account ID of the person receiving the fund</td>
    </tr>
    <tr>
        <td>TX_TYPE</td>
        <td>Type of Transaction. All are TRANSFER transactions.</td>
    </tr>
    <tr>
        <td>TX_AMOUNT</td>
        <td>Amount of the Transaction</td>
    </tr>
    <tr>
        <td>TIMESTAMP</td>
        <td>Simulated transaction timestamp, ranging from 0 to 199</td>
    </tr>
    <tr>
        <td>IS_FRAUD</td>
        <td>Label marking the transaction as a potential fraud. An investigator needs to verify these alerts through various methods, starting with network analysis</td>
    </tr>
    <tr>
        <td>ALERT_ID</td>
        <td>ID of ALERT that got triggered for the transaction</td>
    </tr>
</table>

In [6]:
pd.set_option("display.max_colwidth",None)
full_txn_df = pd.read_csv("transactions.csv")
full_txn_df

Unnamed: 0,TX_ID,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_TYPE,TX_AMOUNT,TIMESTAMP,IS_FRAUD,ALERT_ID
0,1,6456,9069,TRANSFER,465.05,0,False,-1
1,2,7516,9543,TRANSFER,564.64,0,False,-1
2,3,2445,9356,TRANSFER,598.94,0,False,-1
3,4,2576,4617,TRANSFER,466.07,0,False,-1
4,5,3524,1773,TRANSFER,405.63,0,False,-1
...,...,...,...,...,...,...,...,...
1323229,1323230,3733,8051,TRANSFER,112.98,199,False,-1
1323230,1323231,2536,8732,TRANSFER,459.64,199,False,-1
1323231,1323232,1466,8586,TRANSFER,468.60,199,False,-1
1323232,1323233,1451,3849,TRANSFER,562.36,199,False,-1


<h4>We keep only the columns needed for the next steps</h4>

In [8]:
full_txn_df = full_txn_df[["SENDER_ACCOUNT_ID","RECEIVER_ACCOUNT_ID","TX_AMOUNT","IS_FRAUD"]]
full_txn_df

Unnamed: 0,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_AMOUNT,IS_FRAUD
0,6456,9069,465.05,False
1,7516,9543,564.64,False
2,2445,9356,598.94,False
3,2576,4617,466.07,False
4,3524,1773,405.63,False
...,...,...,...,...
1323229,3733,8051,112.98,False
1323230,2536,8732,459.64,False
1323231,1466,8586,468.60,False
1323232,1451,3849,562.36,False


<h3>
    The next step is to determine the sample size for our analysis. One way is to iterate over several sample sizes and compute the size of the largest group (connected component) for each. This is worth doing up front, because without a manageable group size it is very difficult to visualize and analyse the fund flow traces.
</h3>
<h4>
    For this walkthrough we select a random sample of a given size. In a real investigation, the assessment (look-back) period should instead be chosen from the connected component size: try different coverage windows, such as weekly, monthly, quarterly or yearly, find one that keeps the largest group manageable, and use it for the rest of the analysis.
</h4>

In [10]:
max_group_sizes = list()
for i in range(500,10000,500):
    cur_max_group_size = dict()
    cur_max_group_size["sample_size"] = i
    # Reproducible random sample of i transactions
    sample_df = full_txn_df.sample(i,random_state=2)
    # Consolidate repeated transfers between the same sender and receiver into one row
    sample_df = sample_df.groupby(["SENDER_ACCOUNT_ID","RECEIVER_ACCOUNT_ID"],as_index=False).agg({"TX_AMOUNT":["count","sum"]})
    sample_df.columns = ["SENDER_ACCOUNT_ID","RECEIVER_ACCOUNT_ID","TXN_COUNT","TOTAL_TXN_AMOUNT"]
    sample_net = nx.from_pandas_edgelist(sample_df,source="SENDER_ACCOUNT_ID",target="RECEIVER_ACCOUNT_ID", create_using = nx.DiGraph())
    # Size of the largest weakly connected component (edge direction ignored)
    cur_max_group_size["max_group_size"] = len(max(nx.weakly_connected_components(sample_net),key=len))
    max_group_sizes.append(cur_max_group_size)
max_group_sizes = pd.DataFrame(max_group_sizes)
max_group_sizes

Unnamed: 0,sample_size,max_group_size
0,500,6
1,1000,9
2,1500,22
3,2000,33
4,2500,95
5,3000,246
6,3500,767
7,4000,1297
8,4500,1886
9,5000,2470
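
<h4>
    The look-back idea described above can be sketched in the same way, by sweeping over time windows instead of random sample sizes. This is only a minimal sketch under stated assumptions: the TIMESTAMP column must still be present (it is dropped from full_txn_df earlier), the helper names max_component_size and group_size_by_lookback are hypothetical, and the toy data below is made up rather than taken from the AMLSim file.
</h4>

```python
import networkx as nx
import pandas as pd


def max_component_size(txn_df):
    """Size of the largest weakly connected component in a transaction table."""
    # One edge per distinct sender -> receiver pair is enough for connectivity
    pairs = txn_df[["SENDER_ACCOUNT_ID", "RECEIVER_ACCOUNT_ID"]].drop_duplicates()
    net = nx.from_pandas_edgelist(
        pairs,
        source="SENDER_ACCOUNT_ID",
        target="RECEIVER_ACCOUNT_ID",
        create_using=nx.DiGraph(),
    )
    if net.number_of_nodes() == 0:
        return 0
    return len(max(nx.weakly_connected_components(net), key=len))


def group_size_by_lookback(txn_df, lookback_lengths):
    """Largest group size for look-back windows ending at the latest timestamp."""
    latest = txn_df["TIMESTAMP"].max()
    rows = []
    for length in lookback_lengths:
        # Keep only the most recent `length` timestamp steps
        window = txn_df[txn_df["TIMESTAMP"] > latest - length]
        rows.append({"lookback": length, "max_group_size": max_component_size(window)})
    return pd.DataFrame(rows)


# Toy data (hypothetical values, for illustration only)
toy_txn_df = pd.DataFrame(
    {
        "SENDER_ACCOUNT_ID": [1, 2, 3, 5, 6],
        "RECEIVER_ACCOUNT_ID": [2, 3, 4, 6, 1],
        "TIMESTAMP": [0, 1, 2, 3, 3],
    }
)
lookback_sizes = group_size_by_lookback(toy_txn_df, [1, 2, 4])
print(lookback_sizes)
```

<h4>
    On real data, the look-back lengths would correspond to candidate coverage periods, and the longest one whose max_group_size is still manageable would be a reasonable choice.
</h4>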


<h4>
    From the table above, a sample of 2000 transactions gives a max_group_size of 33. At 2500 transactions it jumps to 95, and it keeps growing as the sample size increases.
</h4>
<h4>We will proceed with a manageable sample size of 2000 for the rest of the analysis</h4>

In [12]:
analysis_df = full_txn_df.sample(2000,random_state=2)
analysis_df

Unnamed: 0,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_AMOUNT,IS_FRAUD
339046,9060,5397,5.34,False
512340,3503,3337,594.95,False
1015893,3452,352,519.40,False
29301,3578,9078,457.42,False
624805,7523,2425,498.02,False
...,...,...,...,...
329362,8425,9695,535.91,False
107287,6411,146,434.66,False
1196849,2175,8022,20.60,False
512501,3607,2098,169.00,False


<h4>
    The rows below are the transactions in our sample that the alerting system flagged as fraudulent. We will perform network analysis on these accounts of interest later in the process.
</h4>

In [14]:
analysis_df.loc[analysis_df["IS_FRAUD"]]

Unnamed: 0,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TX_AMOUNT,IS_FRAUD
380151,9070,9849,3.04,True
1108504,7803,8589,11.27,True
617788,8316,6794,4.74,True
1082414,9922,1149,3.97,True
589896,4155,7628,11.52,True
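
<h4>
    The accounts of interest can also be collected programmatically instead of being copied by hand. The sketch below assumes the sender of each flagged transaction is the account to investigate; toy_analysis_df and accounts_of_interest are hypothetical names, and the values are made up.
</h4>

```python
import pandas as pd

# Toy stand-in for analysis_df (hypothetical values)
toy_analysis_df = pd.DataFrame(
    {
        "SENDER_ACCOUNT_ID": [10, 11, 10, 12],
        "RECEIVER_ACCOUNT_ID": [20, 21, 22, 23],
        "TX_AMOUNT": [5.0, 400.0, 7.5, 3.2],
        "IS_FRAUD": [True, False, True, True],
    }
)

# Boolean mask: keep only the flagged rows
flagged = toy_analysis_df.loc[toy_analysis_df["IS_FRAUD"]]

# Senders of flagged transactions, de-duplicated, in order of first appearance
accounts_of_interest = flagged["SENDER_ACCOUNT_ID"].drop_duplicates().tolist()
print(accounts_of_interest)
```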


<h4>
    Consolidating all transactions between each Sender->Receiver pair gives a clearer picture of the total fund flow between two accounts, and the consolidated flow is easier to visualize and analyse.
</h4>

In [16]:
aml_df = analysis_df.groupby(["SENDER_ACCOUNT_ID","RECEIVER_ACCOUNT_ID"],as_index=False).agg({"TX_AMOUNT":["count","sum"]})
aml_df.columns = ["SENDER_ACCOUNT_ID","RECEIVER_ACCOUNT_ID","TXN_COUNT","TOTAL_TXN_AMOUNT"]
aml_df

Unnamed: 0,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TXN_COUNT,TOTAL_TXN_AMOUNT
0,28,9984,1,95.65
1,48,9256,1,61.93
2,53,9600,1,44.00
3,60,9328,1,60.48
4,70,4385,2,79.28
...,...,...,...,...
1898,9990,3759,1,209.22
1899,9995,1217,1,394.35
1900,9996,2428,1,344.98
1901,9996,9993,1,344.98
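
<h4>
    As a small worked example of this consolidation, the sketch below uses pandas named aggregation, which produces the final column names directly and so avoids renaming the columns afterwards. The toy values are made up, and this is an alternative spelling of the same groupby rather than a change to the approach.
</h4>

```python
import pandas as pd

# Toy transactions (hypothetical values): account 70 pays 4385 twice
toy_txn_df = pd.DataFrame(
    {
        "SENDER_ACCOUNT_ID": [70, 70, 28, 70],
        "RECEIVER_ACCOUNT_ID": [4385, 4385, 9984, 9999],
        "TX_AMOUNT": [40.0, 39.25, 95.5, 10.0],
    }
)

# One row per sender -> receiver pair, with transaction count and total amount
toy_aml_df = toy_txn_df.groupby(
    ["SENDER_ACCOUNT_ID", "RECEIVER_ACCOUNT_ID"], as_index=False
).agg(
    TXN_COUNT=("TX_AMOUNT", "count"),
    TOTAL_TXN_AMOUNT=("TX_AMOUNT", "sum"),
)
print(toy_aml_df)
```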


<h4>
    Populating two more columns, "hover" and "click", to use as edge attributes in our network. They make it easier to view the details of each consolidated transaction while visualizing.
</h4>

In [18]:
aml_df = aml_df.assign(hover=aml_df.apply(lambda x: f"From: {int(x['SENDER_ACCOUNT_ID'])}\nTo: {int(x['RECEIVER_ACCOUNT_ID'])}\nTransaction Count: {int(x['TXN_COUNT'])}\nTotal Transaction Amount: {x['TOTAL_TXN_AMOUNT']}",axis=1))
aml_df["click"] = aml_df["hover"]
aml_df

Unnamed: 0,SENDER_ACCOUNT_ID,RECEIVER_ACCOUNT_ID,TXN_COUNT,TOTAL_TXN_AMOUNT,hover,click
0,28,9984,1,95.65,From: 28\nTo: 9984\nTransaction Count: 1\nTotal Transaction Amount: 95.65,From: 28\nTo: 9984\nTransaction Count: 1\nTotal Transaction Amount: 95.65
1,48,9256,1,61.93,From: 48\nTo: 9256\nTransaction Count: 1\nTotal Transaction Amount: 61.93,From: 48\nTo: 9256\nTransaction Count: 1\nTotal Transaction Amount: 61.93
2,53,9600,1,44.00,From: 53\nTo: 9600\nTransaction Count: 1\nTotal Transaction Amount: 44.0,From: 53\nTo: 9600\nTransaction Count: 1\nTotal Transaction Amount: 44.0
3,60,9328,1,60.48,From: 60\nTo: 9328\nTransaction Count: 1\nTotal Transaction Amount: 60.48,From: 60\nTo: 9328\nTransaction Count: 1\nTotal Transaction Amount: 60.48
4,70,4385,2,79.28,From: 70\nTo: 4385\nTransaction Count: 2\nTotal Transaction Amount: 79.28,From: 70\nTo: 4385\nTransaction Count: 2\nTotal Transaction Amount: 79.28
...,...,...,...,...,...,...
1898,9990,3759,1,209.22,From: 9990\nTo: 3759\nTransaction Count: 1\nTotal Transaction Amount: 209.22,From: 9990\nTo: 3759\nTransaction Count: 1\nTotal Transaction Amount: 209.22
1899,9995,1217,1,394.35,From: 9995\nTo: 1217\nTransaction Count: 1\nTotal Transaction Amount: 394.35,From: 9995\nTo: 1217\nTransaction Count: 1\nTotal Transaction Amount: 394.35
1900,9996,2428,1,344.98,From: 9996\nTo: 2428\nTransaction Count: 1\nTotal Transaction Amount: 344.98,From: 9996\nTo: 2428\nTransaction Count: 1\nTotal Transaction Amount: 344.98
1901,9996,9993,1,344.98,From: 9996\nTo: 9993\nTransaction Count: 1\nTotal Transaction Amount: 344.98,From: 9996\nTo: 9993\nTransaction Count: 1\nTotal Transaction Amount: 344.98


<h4>Creating our network using NetworkX</h4>

In [20]:
aml_net = nx.from_pandas_edgelist(aml_df,source="SENDER_ACCOUNT_ID",target="RECEIVER_ACCOUNT_ID",edge_attr=["TOTAL_TXN_AMOUNT","hover","click"], create_using = nx.DiGraph())

<h4>
    Creating a group dict to add a group attribute to our nodes. Each group is a weakly connected component, which makes it easier to identify nodes belonging to the same community or, from an AML point of view, possibly to the same criminal syndicate.
</h4>

In [22]:
group_dict = dict()
for group, nodes in enumerate(nx.weakly_connected_components(aml_net)):
    for i in nodes:
        group_dict[i] = group

pd.DataFrame({"node": group_dict.keys(), "group": group_dict.values()}).groupby(["group"]).agg({"node":"count"}).sort_values(by=["node"],ascending=False)

group,node
25,33
50,23
158,21
237,20
105,19
...,...
432,2
435,2
436,2
1020,2
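
<h4>
    To make the grouping rule concrete: a weakly connected component ignores edge direction, so two accounts share a group whenever any chain of transfers links them, in either direction. The toy graph below (made-up edges) contrasts this with strongly connected components, which require a directed path both ways.
</h4>

```python
import networkx as nx

# Toy directed graph: a one-way chain 1 -> 2 -> 3 and a two-way pair 4 <-> 5
toy_net = nx.DiGraph()
toy_net.add_edges_from([(1, 2), (2, 3), (4, 5), (5, 4)])

# Direction ignored: the whole chain forms one group
weak = sorted(sorted(c) for c in nx.weakly_connected_components(toy_net))

# Direction respected: only the two-way pair stays together
strong = sorted(sorted(c) for c in nx.strongly_connected_components(toy_net))

print(weak)
print(strong)
```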


<h4>
    Generating a visually distinct color for each group, so that nodes belonging to different groups can be told apart by color
</h4>

In [24]:
color_dict = dict()

color_map = list(map(distinctipy.get_rgb256,distinctipy.get_colors(nx.number_weakly_connected_components(aml_net))))

for node, group in group_dict.items():
    rgb_int = color_map[group]
    color_dict[node] = f"rgb({rgb_int[0]},{rgb_int[1]},{rgb_int[2]})"

<h4>
    Setting the Node Attribute "color" using the previously generated color_dict
</h4>

In [26]:
nx.set_node_attributes(aml_net,color_dict,"color")

<h4>
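    Setting the "size" attribute of the nodes according to their degree. In this directed graph, degree is the sum of in-degree and out-degree, so it approximates the number of different accounts an account had transactions with; a counterparty that both sent and received funds is counted twice.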
</h4>

In [28]:
degree_dict = dict(aml_net.degree)

nx.set_node_attributes(aml_net,degree_dict,"size")
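
<h4>
    A small made-up example shows how degree behaves on a directed graph, and one possible way to count distinct counterparties instead. The name counterparties is hypothetical; whether degree or the distinct count is the better node size is a judgment call.
</h4>

```python
import networkx as nx

# Toy graph: A and B pay each other, and A also pays C
toy_net = nx.DiGraph()
toy_net.add_edges_from([("A", "B"), ("B", "A"), ("A", "C")])

# Degree on a DiGraph is in-degree plus out-degree
degree = dict(toy_net.degree)
in_degree = dict(toy_net.in_degree)
out_degree = dict(toy_net.out_degree)

# Distinct counterparties: de-duplicate predecessors and successors
counterparties = {n: len(set(nx.all_neighbors(toy_net, n))) for n in toy_net.nodes}

print(degree)
print(counterparties)
```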

<h4>
    Generating and setting the "hover" and "click" attributes of the nodes, which contain each node's id, group, degree and neighbors
</h4>

In [30]:
title_dict = dict()

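for i in aml_net.nodes:
    # Build the neighbor list first: reusing double quotes inside a double-quoted f-string is a syntax error before Python 3.12
    neighbors = ", ".join(map(str,nx.all_neighbors(aml_net,i)))
    title_dict[i] = f"id: {i}\ngroup: {group_dict[i]}\ndegree: {degree_dict[i]}\nneighbors: {neighbors}"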
nx.set_node_attributes(aml_net,title_dict,"hover")
nx.set_node_attributes(aml_net,title_dict,"click")

<h4>
    Creating a helper function show_network to visualise network using gravis
</h4>

In [32]:
def show_network(g_net):
    fig = gv.d3(g_net,graph_height=1000,many_body_force_strength=-300,node_hover_neighborhood=True,edge_curvature=.4,
                use_node_size_normalization=True,node_size_normalization_min=15,
                edge_size_data_source="TOTAL_TXN_AMOUNT",use_edge_size_normalization=True,edge_size_normalization_min=1)
    fig.display(inline=True)

<h3>
    Visualizing the Network.
</h3>
<h4>
    gravis renders the full network, but with this much information on one screen it is difficult to analyse. The next stages focus on specific subgraphs.
</h4>

In [34]:
show_network(aml_net)

<h4>
    Creating a subgraph for a few target syndicates and visualizing that
</h4>

In [36]:
target_groups = [25,50,158,237,37]
target_group_nodes = list()
for node, group in group_dict.items():
    if group in target_groups:
        target_group_nodes.append(node)
target_groups_network = aml_net.subgraph(target_group_nodes)
show_network(target_groups_network)

<h4>
    Creating a subgraph from a few target account IDs that we want to analyse. The target nodes are highlighted in the visualization with a distinguishing image. This way we can easily build the end-to-end fund flow network starting from any suspected account, which would otherwise take a lot of manual effort.
</h4>

In [38]:
target_nodes = [4155,9922,8316,7803,9070]

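# Groups that contain at least one target account (computed once, outside the loop)
target_node_groups = {group_dict[i] for i in target_nodes}

target_group_nodes = list()

for node, group in group_dict.items():
    if group in target_node_groups:
        target_group_nodes.append(node)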

target_node_network = aml_net.subgraph(target_group_nodes).copy()

for node, data in target_node_network.nodes(data=True):
    if node in target_nodes:
        data["image"] = "https://openmoji.org/data/color/svg/1F4B0.svg"
        data["opacity"] = 0.2
        data["size"] = 3
        data["color"] = "red"

show_network(target_node_network)
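
<h4>
    The subgraph above contains the whole group of each target account, regardless of edge direction. When the direction of the fund flow matters, a directed trace is one possible refinement. The sketch below uses a made-up toy graph: nx.descendants gives the accounts that funds from the start node can reach, and nx.ancestors gives the accounts whose funds can reach it.
</h4>

```python
import networkx as nx

# Toy fund flow: 1 -> 2 -> 3 -> 5, 4 -> 2, and an unrelated pair 6 -> 7
toy_net = nx.DiGraph()
toy_net.add_edges_from([(1, 2), (2, 3), (4, 2), (3, 5), (6, 7)])

start = 2

# Accounts reachable by following transfers forward from the start account
downstream = nx.descendants(toy_net, start)

# Accounts from which the start account can be reached
upstream = nx.ancestors(toy_net, start)

# Whole group of the start account, ignoring direction
whole_group = nx.node_connected_component(toy_net.to_undirected(), start)

print(sorted(upstream), sorted(downstream), sorted(whole_group))
```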

<h3>
    This is the most powerful use case in AML investigation: it allows you to identify all related accounts from an end-to-end perspective, starting with any given node.
</h3>