# Anti-Money Laundering

A brain dump of anti-money-laundering (AML) code snippets for the Data Science capstone project.


## Setup

Collects the Python imports the notebook requires and the constants used throughout the analysis.

In [1]:
import hashlib
import os
from pprint import pprint
from time import monotonic

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import ipywidgets as widgets
from google.colab import drive
from IPython import get_ipython
from IPython.display import display

content_base = "/content/drive"
data_dir = os.path.join(content_base, "My Drive/capstone/data")
data_file = os.path.join(data_dir, "LI-Small_Trans.csv")

# Some portions of the analysis are skipped because they are costly or
# only needed to be executed once
check_dataset_uniqueness = False

### Notebook Stuff

Not important to the project at all; this just modifies aspects of the notebook runtime for my own use.

In [2]:
class CellTimer:
    def __init__(self):
        self.start_time = None

    def start(self, *args, **kwargs):
        self.start_time = monotonic()

    def stop(self, *args, **kwargs):
        try:
            delta = round(monotonic() - self.start_time, 2)
            print(f"\n⏱️ Execution time: {delta}s")
        except TypeError:
            # The `stop` will be called when the cell that
            # defines `CellTimer` is executed, but `start`
            # was never called, leading to a `TypeError` in
            # the subtraction. Skip it
            pass


timer = CellTimer()
ipython = get_ipython()
ipython.events.register("pre_run_cell", timer.start)
ipython.events.register("post_run_cell", timer.stop)

### Load Data

In [3]:
drive.mount(content_base)
files = os.listdir(data_dir)
print("\nData files available:")
pprint(files)

Mounted at /content/drive

Data files available:
['HI-Large_Patterns.txt',
 'HI-Large_Trans.csv',
 'HI-Medium_Trans.csv',
 'HI-Medium_Patterns.txt',
 'HI-Small_Patterns.txt',
 'HI-Small_Trans.csv',
 'LI-Large_Patterns.txt',
 'LI-Large_Trans.csv',
 'LI-Medium_Patterns.txt',
 'LI-Medium_Trans.csv',
 'LI-Small_Patterns.txt',
 'LI-Small_Trans.csv',
 'SAML-D.csv']

⏱️ Execution time: 15.02s


In [4]:
df = pd.read_csv(data_file)


⏱️ Execution time: 13.73s


## Data Overview

Explore aspects of the data without applying any transformations or doing any feature engineering.

### Features

The selected data set has the following features.

In [5]:
df.dtypes

| Column | dtype |
|---|---|
| Timestamp | object |
| From Bank | int64 |
| Account | object |
| To Bank | int64 |
| Account.1 | object |
| Amount Received | float64 |
| Receiving Currency | object |
| Amount Paid | float64 |
| Payment Currency | object |
| Payment Format | object |



⏱️ Execution time: 0.01s


Rename the features to be more explicit, with the hope of avoiding common mistakes, e.g. confusing `Account` with `Account.1`. The new names use snake case because we're in Python.

In [6]:
df.rename(
    columns={
        "Timestamp": "timestamp",
        "From Bank": "from_bank",
        "Account": "from_account",
        "To Bank": "to_bank",
        "Account.1": "to_account",
        "Amount Received": "received_amount",
        "Receiving Currency": "received_currency",
        "Amount Paid": "sent_amount",
        "Payment Currency": "sent_currency",
        "Payment Format": "payment_type",
        "Is Laundering": "is_laundering",
    },
    inplace=True,
)


⏱️ Execution time: 0.0s
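
By default, `DataFrame.rename` ignores mapping keys that do not match any column, so a typo in one of the original names would pass silently. A minimal sketch of how `errors="raise"` surfaces that case, using a made-up two-column frame rather than the project data:

```python
import pandas as pd

# Made-up frame standing in for two of the raw CSV columns
toy = pd.DataFrame({"Account": ["A1"], "Account.1": ["B2"]})

# With errors="raise", every key in the mapping must match an existing column
renamed = toy.rename(
    columns={"Account": "from_account", "Account.1": "to_account"},
    errors="raise",
)
print(list(renamed.columns))

# A key that matches no column now raises instead of being ignored
try:
    toy.rename(columns={"Acount": "from_account"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)
```

Whether the stricter behaviour is wanted depends on whether every dataset is expected to carry all of the listed columns.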


### Data Description

Provides a general overview of the data: summary statistics for the numerical and categorical features, followed by the first few rows.

In [7]:
df.select_dtypes(include=["number"]).describe()

| | from_bank | to_bank | received_amount | sent_amount | is_laundering |
|---|---|---|---|---|---|
| count | 6924049.0 | 6924049.0 | 6924049.0 | 6924049.0 | 6924049.0 |
| mean | 59387.18 | 84417.02 | 6324067.0 | 4676036.0 | 0.0005148722 |
| std | 90517.0 | 90645.62 | 2105371000.0 | 1544099000.0 | 0.02268495 |
| min | 0.0 | 0.0 | 1e-06 | 1e-06 | 0.0 |
| 25% | 219.0 | 11255.0 | 174.21 | 175.38 | 0.0 |
| 50% | 14195.0 | 29640.0 | 1397.62 | 1399.44 | 0.0 |
| 75% | 110682.0 | 148040.0 | 12296.33 | 12226.87 | 0.0 |
| max | 376967.0 | 376967.0 | 3644854000000.0 | 3644854000000.0 | 1.0 |



⏱️ Execution time: 0.83s


In [8]:
df.select_dtypes(include=["object", "category"]).drop(columns="timestamp").describe()

| | from_account | to_account | received_currency | sent_currency | payment_type |
|---|---|---|---|---|---|
| count | 6924049 | 6924049 | 6924049 | 6924049 | 6924049 |
| unique | 681281 | 576176 | 15 | 15 | 7 |
| top | 10042B660 | 10042B660 | US Dollar | US Dollar | Cheque |
| freq | 222037 | 1553 | 2537242 | 2553887 | 2503158 |



⏱️ Execution time: 9.84s


In [9]:
df.head()

| | timestamp | from_bank | from_account | to_bank | to_account | received_amount | received_currency | sent_amount | sent_currency | payment_type | is_laundering |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022/09/01 00:08 | 11 | 8000ECA90 | 11 | 8000ECA90 | 3195403.0 | US Dollar | 3195403.0 | US Dollar | Reinvestment | 0 |
| 1 | 2022/09/01 00:21 | 3402 | 80021DAD0 | 3402 | 80021DAD0 | 1858.96 | US Dollar | 1858.96 | US Dollar | Reinvestment | 0 |
| 2 | 2022/09/01 00:00 | 11 | 8000ECA90 | 1120 | 8006AA910 | 592571.0 | US Dollar | 592571.0 | US Dollar | Cheque | 0 |
| 3 | 2022/09/01 00:16 | 3814 | 8006AD080 | 3814 | 8006AD080 | 12.32 | US Dollar | 12.32 | US Dollar | Reinvestment | 0 |
| 4 | 2022/09/01 00:00 | 20 | 8006AD530 | 20 | 8006AD530 | 2941.56 | US Dollar | 2941.56 | US Dollar | Reinvestment | 0 |



⏱️ Execution time: 0.02s
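
The `timestamp` column is loaded as `object`, i.e. plain strings. A minimal sketch of converting it to a datetime, assuming every value follows the `YYYY/MM/DD HH:MM` layout visible in the rows above; the short series is illustrative and not the project data:

```python
import pandas as pd

# Sample values in the layout shown by df.head()
raw = pd.Series(["2022/09/01 00:08", "2022/09/01 00:21", "2022/09/01 00:00"])

# An explicit format avoids format inference and fails loudly on values
# that do not match the expected layout
parsed = pd.to_datetime(raw, format="%Y/%m/%d %H:%M")

print(parsed.dtype)
print(parsed.dt.minute.tolist())
```

On the full frame the same call would be applied to `df["timestamp"]`, after which time-based features such as hour of day become available through the `.dt` accessor.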


### Missing Values

Determines whether there are missing values. The initial exploration found none, so the following cell deliberately fails the notebook if null values are present, since they would violate an assumption made by subsequent steps in the analysis.

In [10]:
# Count nulls per column once and reuse the result for the check
null_counts = df.isnull().sum()

if null_counts.any():
    print(null_counts[null_counts > 0])
    raise ValueError(
        "Initial analysis showed that there were no null values in the data "
        "set, and the subsequent work was done under this assumption. "
        "However, null values were detected. Does the dataset now need to be "
        "cleaned prior to analysis?"
    )


⏱️ Execution time: 4.49s


## Data Imbalance

Look at the laundering rate, and at how the provided data differs between the licit and illicit transactions.

In [11]:
print(f'Laundering rate: {round(100*(df["is_laundering"].sum() / len(df["is_laundering"])), 3)}%')

Laundering rate: 0.051%

⏱️ Execution time: 0.01s
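
Because `is_laundering` is a 0/1 flag, its mean is the laundering rate, which is why the figure above matches the `mean` row reported by `describe()`. A small sketch on made-up labels, which also illustrates why plain accuracy is a weak metric at this level of imbalance: a model that labels every transaction as licit is already right for all but the illicit rows.

```python
import pandas as pd

# Made-up labels: 2 illicit rows out of 4,000 (not the project data)
labels = pd.Series([1, 1] + [0] * 3998)

rate = labels.mean()              # fraction of illicit rows
baseline_accuracy = 1 - rate      # accuracy of always predicting "licit"

print(f"Laundering rate: {rate:.3%}")
print(f"Always-licit accuracy: {baseline_accuracy:.3%}")
```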


### Categorical Imbalance

Imbalance in categorical features can be plotted directly. Numerical features need to be binned before the imbalance can be shown.

In [12]:
def plot_column_imbalance(
    df: pd.DataFrame,
    column: str,
    label: str,
) -> None:
    if pd.api.types.is_numeric_dtype(df[column]):
        # Custom bins for numerical data. The binned values are kept in a
        # local Series so that the caller's data frame is not modified
        binned = pd.cut(
            df[column],
            bins=[0, 10, 100, 1000, 10000, np.inf],
            include_lowest=True,
        )
        bin_labels = {
            interval: f"{int(interval.left)} - {int(interval.right) if interval.right != np.inf else '∞'}"
            for interval in binned.cat.categories
        }

        values = binned.map(bin_labels)
        all_types = sorted(
            bin_labels.values(),
            key=lambda x: int(x.split(" - ")[0]),
        )
    else:
        values = df[column]
        all_types = df[column].unique()

    # Percentage of each category within the licit and illicit classes
    is_illicit = df["is_laundering"] == 1

    proportion_licit = values[~is_illicit].value_counts(normalize=True) * 100
    proportion_licit = proportion_licit.reindex(all_types, fill_value=0)

    proportion_illicit = values[is_illicit].value_counts(normalize=True) * 100
    proportion_illicit = proportion_illicit.reindex(all_types, fill_value=0)

    total_proportion = proportion_licit + proportion_illicit
    licit_normalized = (proportion_licit / total_proportion) * 100
    illicit_normalized = (proportion_illicit / total_proportion) * 100

    fig, ax = plt.subplots(figsize=(7, 4))
    y_pos = np.arange(len(all_types))

    ax.barh(y_pos, licit_normalized, color="#76c7c0", label="Licit")
    ax.barh(y_pos, illicit_normalized, left=licit_normalized, color="#f4a261", label="Illicit")
    ax.axvline(50, linestyle="--", color="gray", linewidth=1)

    ax.set_yticks(y_pos)
    ax.set_yticklabels(all_types)
    ax.set_xlabel("Proportion (%)")
    ax.set_title(f"{label}, Licit vs. Illicit")
    ax.legend(loc="upper left", bbox_to_anchor=(1, 1))
    ax.invert_yaxis()
    plt.show()


column_mapping_imbalance = {
    "Payment Type": "payment_type",
    "Sent Currency": "sent_currency",
    "Received Currency": "received_currency",
    "Sent Amount": "sent_amount",
    "Received Amount": "received_amount",
}
dropdown_imbalance = widgets.Dropdown(
    options=column_mapping_imbalance.keys(),
    description="Column:",
    style={"description_width": "initial"},
)

def update_plot(column_label: str) -> None:
    column = column_mapping_imbalance[column_label]
    plot_column_imbalance(df, column, column_label)

widgets.interactive(update_plot, column_label=dropdown_imbalance)

interactive(children=(Dropdown(description='Column:', options=('Payment Type', 'Sent Currency', 'Received Curr…


⏱️ Execution time: 2.17s
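
The bars do not show raw counts. Each class is first converted to its own within-class percentages, and the two percentages for a category are then rescaled to sum to 100. A bar that crosses the dashed 50% line therefore marks a category that is over-represented in one class relative to the other. A toy sketch of that arithmetic, with made-up counts:

```python
import pandas as pd

# Made-up frame: 8 licit and 2 illicit rows across two payment types
toy = pd.DataFrame({
    "payment_type": ["Cheque"] * 6 + ["ACH"] * 2 + ["ACH"] * 2,
    "is_laundering": [0] * 8 + [1] * 2,
})
all_types = ["Cheque", "ACH"]


def class_share(frame: pd.DataFrame, flag: int) -> pd.Series:
    # Percentage of each payment type within one class
    subset = frame[frame["is_laundering"] == flag]
    share = subset["payment_type"].value_counts(normalize=True) * 100
    return share.reindex(all_types, fill_value=0)


licit = class_share(toy, 0)      # Cheque 75, ACH 25
illicit = class_share(toy, 1)    # Cheque 0, ACH 100

# Rescale so that the two shares for each category sum to 100
total = licit + illicit
licit_normalized = licit / total * 100
illicit_normalized = illicit / total * 100

print(pd.DataFrame({"licit": licit_normalized, "illicit": illicit_normalized}))
```

In this toy case `ACH` makes up a quarter of the licit rows but all of the illicit rows, so most of its bar is illicit even though licit `ACH` rows are as numerous as illicit ones.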


## Bank+Account Uniqueness

Are account numbers unique across the datasets compared in the following subsection? Since they are not, a method for making them unique is provided.

### Inter-dataset Uniqueness

The goal is to determine whether the AMLworld synthetic data uses distinct account numbers in each dataset; if so, it might be possible to train on one dataset and test on another. The check streams the data files rather than loading them into a data frame, because the large files do not fit in memory in many computation environments. Even so, it takes a long time.

In [18]:
def check_pairwise_dataset_uniqueness(
    dataset_a: str,
    dataset_b: str,
) -> None:
    hash_map = {}
    hash_map_aggregate = {}
    poor_account_uniqueness = "Account Uniqueness header mismatch"
    for i, dataset in enumerate([dataset_a, dataset_b]):
        print(f"Hashing: {dataset}")
        with open(
            os.path.join(data_dir, dataset), "r", encoding="utf-8",
        ) as file:
            header = True
            for line in file:
                columns = line.strip().split(",")

                # Checks that each data set is formatted with the account data
                # in the same location
                if header:
                    header = False
                    if (
                        columns[1] != "From Bank" or
                        columns[2] != "Account" or
                        columns[3] != "To Bank" or
                        columns[4] != "Account"
                    ):
                        raise ValueError(poor_account_uniqueness)
                    continue

                # Hash on both the from and the to account, keeping track of an
                # enumerated dataset
                for account in [columns[2], columns[4]]:
                    if account not in hash_map:
                        hash_map[account] = [i]
                    elif i not in hash_map[account]:
                        hash_map[account].append(i)

                # Hash on a combination of bank and account, for both the from
                # and to account/bank
                for bank_account in [
                    f"{columns[1]}_{columns[2]}",
                    f"{columns[3]}_{columns[4]}",
                ]:
                    if bank_account not in hash_map_aggregate:
                        hash_map_aggregate[bank_account] = [i]
                    elif i not in hash_map_aggregate[bank_account]:
                        hash_map_aggregate[bank_account].append(i)

    # Checks for duplicate accounts
    count = 0
    for account, datasets in hash_map.items():
        if len(datasets) > 1:
            count += 1
    n = len(hash_map)
    print(f"Hash map by account: {n}, duplicate accounts: {count}")
    print(f"Uniqueness by account: {round(100*(n-count)/n, 3)}%")

    # Checks for duplicate account, bank pairs
    count = 0
    for account, datasets in hash_map_aggregate.items():
        if len(datasets) > 1:
            count += 1
    n = len(hash_map_aggregate)
    print(f"Hash map by bank_account: {n}, duplicates: {count}")
    print(f"Uniqueness by bank_account: {round(100*(n-count)/n, 3)}%")

if check_dataset_uniqueness:
    check_pairwise_dataset_uniqueness(
        "LI-Medium_Trans.csv",
        "LI-Small_Trans.csv",
    )
    print("")
    check_pairwise_dataset_uniqueness(
        "HI-Medium_Trans.csv",
        "LI-Medium_Trans.csv",
    )
    print("")
    check_pairwise_dataset_uniqueness(
        "LI-Large_Trans.csv",
        "LI-Medium_Trans.csv",
    )
else:
    print("Skipped potentially lengthy uniqueness check.")

    # Keeping a snapshot of a previous analysis
    print("Data from a previous run:")
    print("""
Hashing: LI-Medium_Trans.csv
Hashing: LI-Small_Trans.csv
Hash map by account: 2721565, duplicate accounts: 16399
Uniqueness by account: 99.397%
Hash map by bank_account: 2737985, duplicates: 17
Uniqueness by bank_account: 99.999%

Hashing: HI-Medium_Trans.csv
Hashing: LI-Medium_Trans.csv
Hash map by account: 4047087, duplicate accounts: 61973
Uniqueness by account: 98.469%
Hash map by bank_account: 4094704, duplicates: 14414
Uniqueness by bank_account: 99.648%

Hashing: LI-Large_Trans.csv
Hashing: LI-Medium_Trans.csv
Hash map by account: 2054565, duplicate accounts: 2031886
Uniqueness by account: 1.104%
Hash map by bank_account: 2071157, duplicates: 2031918
Uniqueness by bank_account: 1.895%
    """)

Skipped potentially lengthy uniqueness check.
Data from a previous run:

Hashing: LI-Medium_Trans.csv
Hashing: LI-Small_Trans.csv
Hash map by account: 2721565, duplicate accounts: 16399
Uniqueness by account: 99.397%
Hash map by bank_account: 2737985, duplicates: 17
Uniqueness by bank_account: 99.999%

Hashing: HI-Medium_Trans.csv
Hashing: LI-Medium_Trans.csv
Hash map by account: 4047087, duplicate accounts: 61973
Uniqueness by account: 98.469%
Hash map by bank_account: 4094704, duplicates: 14414
Uniqueness by bank_account: 99.648%

Hashing: LI-Large_Trans.csv
Hashing: LI-Medium_Trans.csv
Hash map by account: 2054565, duplicate accounts: 2031886
Uniqueness by account: 1.104%
Hash map by bank_account: 2071157, duplicates: 2031918
Uniqueness by bank_account: 1.895%
    

⏱️ Execution time: 0.0s
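
The same overlap measure can also be expressed with sets. The following is a minimal sketch, not the method used above: it assumes the column order verified by the header check (`Timestamp`, `From Bank`, `Account`, `To Bank`, `Account`), uses the standard `csv` module, and reads made-up rows from in-memory strings so that it runs without the data files. The helper name `bank_account_keys` is hypothetical.

```python
import csv
import io
from typing import Iterable, Set


def bank_account_keys(lines: Iterable[str]) -> Set[str]:
    """Collect every bank_account key seen as sender or receiver."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    keys = set()
    for row in reader:
        keys.add(f"{row[1]}_{row[2]}")  # from bank, from account
        keys.add(f"{row[3]}_{row[4]}")  # to bank, to account
    return keys


header = "Timestamp,From Bank,Account,To Bank,Account\n"
# Made-up rows; real files would be passed as open file handles instead
file_a = io.StringIO(
    header
    + "2022/09/01 00:08,11,A1,11,A1\n"
    + "2022/09/01 00:21,20,B2,11,A1\n"
)
file_b = io.StringIO(header + "2022/09/01 00:00,11,A1,30,C3\n")

keys_a = bank_account_keys(file_a)
keys_b = bank_account_keys(file_b)

shared = keys_a & keys_b      # keys present in both datasets
combined = keys_a | keys_b    # all distinct keys
uniqueness = 100 * (len(combined) - len(shared)) / len(combined)
print(f"Total: {len(combined)}, duplicates: {len(shared)}, uniqueness: {uniqueness:.3f}%")
```

This still holds one set per dataset in memory, so it trades the per-key lists of the cell above for set operations rather than reducing the memory footprint.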


### Create Uniqueness

For the given dataset name, create a unique identifier for each bank and account number pair. This ensures that models trained on one dataset can be transferred to, or tested on, other datasets without the risk that they learned identifiers that happen to be shared between the AMLworld datasets.

Before the following is relied on to create unique entity identifiers between datasets, it needs to be applied to two datasets and tested (applying it to one of the larger datasets will be computationally intensive).

In [15]:
def h(value: str, length: int = 8) -> str:
    # First `length` hex characters of the SHA-256 digest of `value`
    return hashlib.sha256(value.encode()).hexdigest()[:length]

def generate_unique_identifiers(dataset_name, df):
    d = h(dataset_name)

    df["from_bank_hash"] = df["from_bank"].astype(str).map(h)
    df["from_account_hash"] = df["from_account"].astype(str).map(h)
    df["to_bank_hash"] = df["to_bank"].astype(str).map(h)
    df["to_account_hash"] = df["to_account"].astype(str).map(h)

    df["from_unique"] = d + "_" + df["from_bank_hash"] + "_" + df["from_account_hash"]
    df["to_unique"] = d + "_" + df["to_bank_hash"] + "_" + df["to_account_hash"]

    # Drop intermediate hash columns
    df.drop(
        columns=[
            "from_bank_hash",
            "from_account_hash",
            "to_bank_hash",
            "to_account_hash",
        ],
        inplace=True,
    )

    return df

dataset_name = os.path.basename(data_file)
df = generate_unique_identifiers(dataset_name, df)
df[[
    "from_bank",
    "from_account",
    "from_unique",
    "to_bank",
    "to_account",
    "to_unique",
]].head()

| | from_bank | from_account | from_unique | to_bank | to_account | to_unique |
|---|---|---|---|---|---|---|
| 0 | 11 | 8000ECA90 | 2255a05e_4fc82b26_b8f0b6b3 | 11 | 8000ECA90 | 2255a05e_4fc82b26_b8f0b6b3 |
| 1 | 3402 | 80021DAD0 | 2255a05e_5dd0890c_f536a3a5 | 3402 | 80021DAD0 | 2255a05e_5dd0890c_f536a3a5 |
| 2 | 11 | 8000ECA90 | 2255a05e_4fc82b26_b8f0b6b3 | 1120 | 8006AA910 | 2255a05e_829f00a1_2f1cad32 |
| 3 | 3814 | 8006AD080 | 2255a05e_9b2bcad1_2b52951c | 3814 | 8006AD080 | 2255a05e_9b2bcad1_2b52951c |
| 4 | 20 | 8006AD530 | 2255a05e_f5ca38f7_ed7b0509 | 20 | 8006AD530 | 2255a05e_f5ca38f7_ed7b0509 |



⏱️ Execution time: 28.68s
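
Two properties seem worth checking before these identifiers are relied on. First, the same bank and account hashed under two dataset names should never produce the same identifier. Second, within one dataset, two different (bank, account) pairs should not collapse into one identifier, which is possible in principle because each hash is truncated to 8 hex characters. A minimal sketch on made-up rows; `unique_id` is a hypothetical helper that mirrors the concatenation above and is not the notebook's own function.

```python
import hashlib

import pandas as pd


def h(value: str, length: int = 8) -> str:
    return hashlib.sha256(value.encode()).hexdigest()[:length]


def unique_id(dataset_name: str, bank: pd.Series, account: pd.Series) -> pd.Series:
    # dataset hash + bank hash + account hash, joined with underscores
    return (
        h(dataset_name)
        + "_"
        + bank.astype(str).map(h)
        + "_"
        + account.astype(str).map(h)
    )


# Made-up rows; the same (bank, account) pairs appear in both "datasets"
toy = pd.DataFrame({"bank": [11, 3402, 11], "account": ["A1", "B2", "C3"]})
ids_small = unique_id("LI-Small_Trans.csv", toy["bank"], toy["account"])
ids_medium = unique_id("LI-Medium_Trans.csv", toy["bank"], toy["account"])

# 1. No identifier is shared between the two datasets
overlap = set(ids_small) & set(ids_medium)

# 2. Within a dataset, distinct (bank, account) pairs give distinct identifiers
distinct_pairs = len(toy.drop_duplicates(["bank", "account"]))
collision_free = ids_small.nunique() == distinct_pairs

print(len(overlap), collision_free)
```

On real data the second check would compare the number of distinct (bank, account) pairs in a dataset with the number of distinct identifiers generated for it; a mismatch would suggest increasing `length`.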


In [16]:
# Once categorical data is converted into numerical data, do:
#
#   df.corr()
#


⏱️ Execution time: 0.0s
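
One possible way to get there, sketched on made-up rows: one-hot encode the low-cardinality categorical columns with `pd.get_dummies` and then call `corr()`. This is an assumption about how the conversion might be done, not a decision made in this notebook, and high-cardinality columns such as the account identifiers would need a different treatment.

```python
import pandas as pd

# Made-up rows using column names from this notebook
toy = pd.DataFrame({
    "sent_amount": [10.0, 250.0, 30.0, 4000.0],
    "payment_type": ["Cheque", "ACH", "Cheque", "ACH"],
    "is_laundering": [0, 1, 0, 1],
})

# One-hot encode the categorical column; dtype=int gives 0/1 columns
encoded = pd.get_dummies(toy, columns=["payment_type"], dtype=int)

correlations = encoded.corr()
print(correlations["is_laundering"].round(3))
```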
