# 1. Introduction
---

Money laundering is a multi-billion-dollar problem, and detecting it is very difficult. Most automated algorithms have a high false positive rate: legitimate transactions incorrectly flagged as laundering. The converse is also a major problem -- false negatives, i.e. undetected laundering transactions. Naturally, criminals work hard to cover their tracks.

Access to real financial transaction data is highly restricted, for both proprietary and privacy reasons. Even when access is possible, it is hard to assign a correct label (laundering or legitimate) to each transaction, because of the false positives and false negatives noted above.

In this project we use a synthetic transaction dataset from IBM that avoids these problems (Altman et al., 2023).


**To read the paper that introduced this synthetic dataset, [click here!](https://arxiv.org/abs/2306.16424)**

The data provided here is based on a virtual world inhabited by individuals, companies, and banks. Individuals interact with other individuals and companies. Likewise, companies interact with other companies and with individuals. These interactions can take many forms, e.g. purchase of consumer goods and services, purchase orders for industrial supplies, payment of salaries, repayment of loans, and more. These financial transactions are generally conducted via banks, i.e. the payer and receiver both have accounts, with accounts taking multiple forms from checking to credit cards to bitcoin.

Some (small) fraction of the individuals and companies in the generator model engage in criminal behavior -- such as smuggling, illegal gambling, extortion, and more. Criminals obtain funds from these illicit activities, and then try to hide the source of these illicit funds via a series of financial transactions. Such financial transactions to hide illicit funds constitute laundering. Because the generator knows which funds are illicit, the data available here is labelled and can be used for training and testing AML (Anti-Money Laundering) models and for other purposes.

The data generator that created the data here not only models illicit activity, but also tracks funds derived from illicit activity through arbitrarily many transactions -- thus creating the ability to label laundering transactions many steps removed from their illicit source. With this foundation, it is straightforward for the generator to label individual transactions as laundering or legitimate.
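The idea of following illicit funds through many hops can be illustrated with a toy sketch. This is not the IBM generator's actual logic: the function `label_transactions`, the account names, and the rule "any payment from an account that already holds tainted funds is laundering" are all simplifying assumptions made for illustration.

```python
def label_transactions(transactions, illicit_sources):
    """Label each transaction 1 (laundering) or 0 (legitimate).

    Toy rule (an assumption, not the generator's real logic): a payment is
    laundering if the sender already holds tainted funds, and the receiver
    then becomes tainted as well.

    transactions: chronologically ordered (sender, receiver) pairs.
    illicit_sources: accounts that obtained funds from criminal activity.
    """
    tainted = set(illicit_sources)
    labels = []
    for sender, receiver in transactions:
        is_laundering = sender in tainted
        if is_laundering:
            # Taint follows the money to the next account.
            tainted.add(receiver)
        labels.append(int(is_laundering))
    return labels


# Hypothetical accounts: one smuggler, two shell accounts, and ordinary payers.
transactions = [
    ("smuggler", "shell_a"),
    ("employer", "worker"),
    ("shell_a", "shell_b"),
    ("shell_b", "dealer"),
    ("worker", "grocer"),
]
labels = label_transactions(transactions, illicit_sources={"smuggler"})
print(labels)  # [1, 0, 1, 1, 0]
```

The third and fourth payments are labelled as laundering even though neither involves the smuggler directly, which is the "many steps removed" property described above.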

Note that this IBM generator models the entire money laundering cycle:

*   **Placement**: Illicit funds enter the system from sources such as smuggling.
*   **Layering**: Mixing the illicit funds into the financial system.
*   **Integration**: Spending the illicit funds.


Synthetic data offers another capability that real data cannot. A real bank or other institution typically has access to only a portion of the transactions involved in laundering: the transactions involving that bank. Transactions happening at other banks or between other banks are not seen. Thus, models built on real transactions from one institution can have only a limited view of the world.

By contrast, these synthetic transactions contain an entire financial ecosystem. Thus it may be possible to create laundering detection models that understand the broad sweep of transactions across institutions, but apply those models to make inferences only about transactions at a particular bank.
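As a rough illustration of the difference between the full ecosystem and a single institution's view, the sketch below filters a small hand-made table down to the rows that one bank would see. The helper `bank_view` and the toy rows are hypothetical; only the column names `From Bank` and `To Bank` come from the dataset.

```python
import pandas as pd

# Toy transactions using the dataset's bank columns (values are made up).
toy = pd.DataFrame({
    "From Bank": [10, 10, 117, 29191],
    "To Bank": [22828, 117, 40653, 29191],
    "Amount Paid": [297.72, 1500.00, 4981.60, 32.90],
})


def bank_view(transactions, bank_id):
    """Return only the transactions a single bank would observe.

    A bank sees a transaction if it is on the sending or the receiving side.
    """
    involved = (transactions["From Bank"] == bank_id) | (transactions["To Bank"] == bank_id)
    return transactions[involved]


view_117 = bank_view(toy, 117)
print(len(toy), len(view_117))  # 4 2
```

A model could be trained on all rows of `toy` and then asked for predictions only on `view_117`, which is the setup the paragraph above describes.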

## 1.1. Importing Libraries
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pathlib
import zipfile


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, roc_auc_score, roc_curve
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

## 1.2. Verify if Data is Present
---

In [2]:
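# Make sure the data directory exists before looking for the CSV.
data_dir = pathlib.Path("data")
data_dir.mkdir(parents=True, exist_ok=True)
PATH = str(pathlib.Path.cwd())
# Check for the file that is actually read in section 2.1 (HI-Small_Trans.csv).
file_path = data_dir / "HI-Small_Trans.csv"

# Extract the archive only when the CSV is missing.
if not file_path.is_file():
    with zipfile.ZipFile("./data.zip", "r") as zf:
        zf.extractall(data_dir)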

# 2. Exploratory Data Analysis (EDA)
---

## 2.1. Reading the HI-Small_Trans file

In [3]:
import pandas as pd

full_df = pd.read_csv("./data/HI-Small_Trans.csv")

full_df.shape

(5078345, 11)

### 2.1.1. Sampling a Portion of the Original DataFrame
---

In [4]:
df = full_df.sample(n=500000, random_state=42)

df.shape

(500000, 11)
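Because laundering transactions are rare, a simple random sample can shift their share slightly by chance. One possible alternative, shown here on made-up data, is to sample the same fraction within each class so the label proportion is preserved. The toy frame and the 10% fraction are assumptions for illustration, and this is not what the cell above does.

```python
import pandas as pd

# Toy data: 10 laundering rows out of 1000 (a 1% positive rate, made up).
toy = pd.DataFrame({
    "Amount Paid": range(1000),
    "Is Laundering": [1] * 10 + [0] * 990,
})

# Draw 10% of each class separately, so the class ratio is kept.
stratified = toy.groupby("Is Laundering").sample(frac=0.1, random_state=42)

print(len(stratified))                     # 100
print(stratified["Is Laundering"].mean())  # 0.01
```

`DataFrameGroupBy.sample` requires pandas 1.1 or later.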

In [5]:
df.head()

| | Timestamp | From Bank | Account | To Bank | Account.1 | Amount Received | Receiving Currency | Amount Paid | Payment Currency | Payment Format | Is Laundering |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 298872 | 2022/09/01 00:29 | 117 | 80E50C3C0 | 40653 | 80FA8F490 | 4981.6 | Swiss Franc | 4981.6 | Swiss Franc | Cheque | 0 |
| 746726 | 2022/09/01 13:28 | 10 | 8001C6CC0 | 22828 | 8010A7DF0 | 297.72 | US Dollar | 297.72 | US Dollar | Cheque | 0 |
| 405190 | 2022/09/01 02:46 | 29191 | 80CAF3CE0 | 29191 | 80CAF3CE0 | 32.9 | Yuan | 32.9 | Yuan | Reinvestment | 0 |
| 1388703 | 2022/09/02 08:02 | 10 | 804DC2C20 | 14381 | 80597A020 | 194634.45 | Rupee | 194634.45 | Rupee | Cheque | 0 |
| 4713645 | 2022/09/09 18:01 | 16136 | 80A5EC8A0 | 16031 | 80C038E30 | 698940.91 | US Dollar | 698940.91 | US Dollar | ACH | 0 |


### 2.1.2. About the Features
---

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500000 entries, 298872 to 3845689
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Timestamp           500000 non-null  object 
 1   From Bank           500000 non-null  int64  
 2   Account             500000 non-null  object 
 3   To Bank             500000 non-null  int64  
 4   Account.1           500000 non-null  object 
 5   Amount Received     500000 non-null  float64
 6   Receiving Currency  500000 non-null  object 
 7   Amount Paid         500000 non-null  float64
 8   Payment Currency    500000 non-null  object 
 9   Payment Format      500000 non-null  object 
 10  Is Laundering       500000 non-null  int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 45.8+ MB
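The `info()` output shows `Timestamp` stored as `object`, i.e. plain strings. A minimal sketch of converting it to a datetime column and deriving time features follows. The format string is inferred from the sample rows shown earlier (`2022/09/01 00:29`), and the derived column names `hour` and `day_of_week` are hypothetical.

```python
import pandas as pd

# Toy frame with timestamps copied from the df.head() output.
toy = pd.DataFrame({
    "Timestamp": ["2022/09/01 00:29", "2022/09/01 13:28", "2022/09/09 18:01"],
})

# Parse the strings; an explicit format avoids per-row format inference.
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"], format="%Y/%m/%d %H:%M")

# Example derived features (names are illustrative only).
toy["hour"] = toy["Timestamp"].dt.hour
toy["day_of_week"] = toy["Timestamp"].dt.dayofweek  # Monday = 0

print(toy.dtypes)
```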


## 2.2. Basic Statistics of the Numerical Features
---

In [7]:
df.select_dtypes(exclude='object').describe()

| | From Bank | To Bank | Amount Received | Amount Paid | Is Laundering |
|---|---|---|---|---|---|
| count | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| mean | 45818.82191 | 65842.322062 | 8728360.0 | 4127212.0 | 0.001038 |
| std | 81937.835213 | 84214.998216 | 1554638000.0 | 418905400.0 | 0.032201 |
| min | 1.0 | 1.0 | 1e-06 | 1e-06 | 0.0 |
| 25% | 119.0 | 4403.0 | 182.37 | 183.5675 | 0.0 |
| 50% | 9679.0 | 21575.0 | 1418.135 | 1422.62 | 0.0 |
| 75% | 28663.0 | 122332.0 | 12324.78 | 12286.81 | 0.0 |
| max | 356302.0 | 356266.0 | 626035500000.0 | 140212400000.0 | 1.0 |
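Since `Is Laundering` is a 0/1 column, its mean in the table above is the share of laundering transactions in the sample. The short calculation below turns that share into counts. The two input numbers are copied from the `describe()` output, and the variable names are illustrative.

```python
# Values taken from the describe() output above.
n_rows = 500_000
laundering_rate = 0.001038  # mean of the 0/1 "Is Laundering" column

# Number of laundering rows implied by the mean.
n_laundering = round(n_rows * laundering_rate)
n_legit = n_rows - n_laundering

# Legitimate transactions per laundering transaction.
ratio = n_legit / n_laundering

# Accuracy of a trivial model that always predicts "legitimate".
baseline_accuracy = n_legit / n_rows

print(n_laundering)               # 519
print(round(ratio))               # 962
print(f"{baseline_accuracy:.6f}")  # 0.998962
```

With roughly 962 legitimate transactions for each laundering one, accuracy alone says little about a model, which is why precision, recall, and F1 are imported in section 1.1.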
