## Data Exploration  

Author: Calvin Chan

### Introduction  
Before we carry out any meaningful analysis, let us make sure that the data is clean and ready to use.

### Table of Contents  
1. [Import Data](#import)
2. [Data Cleaning](#cleaning)

In [53]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id='import'></a>
### Import AML dataset

In [54]:
# File path locally
path = '../data/HI-Small_Trans.csv'

# Load data 
df = pd.read_csv(path, delimiter=',')

# Read data head
df.head()

Unnamed: 0,Timestamp,From Bank,Account,To Bank,Account.1,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,2022/09/01 00:20,10,8000EBD30,10,8000EBD30,3697.34,US Dollar,3697.34,US Dollar,Reinvestment,0
1,2022/09/01 00:20,3208,8000F4580,1,8000F5340,0.01,US Dollar,0.01,US Dollar,Cheque,0
2,2022/09/01 00:00,3209,8000F4670,3209,8000F4670,14675.57,US Dollar,14675.57,US Dollar,Reinvestment,0
3,2022/09/01 00:02,12,8000F5030,12,8000F5030,2806.97,US Dollar,2806.97,US Dollar,Reinvestment,0
4,2022/09/01 00:06,10,8000F5200,10,8000F5200,36682.97,US Dollar,36682.97,US Dollar,Reinvestment,0


Let's rename the two account columns, which hold the hexadecimal codes of the accounts that each transaction originates from and goes to.

In [55]:
# Dictionary mapping old to new column name
rename_dict = {
    'Account': 'From Account', 
    'Account.1': 'To Account'
}

# Reassign dataframe with new names
df = df.rename(columns=rename_dict)

# Sanity check
df.head()

Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,2022/09/01 00:20,10,8000EBD30,10,8000EBD30,3697.34,US Dollar,3697.34,US Dollar,Reinvestment,0
1,2022/09/01 00:20,3208,8000F4580,1,8000F5340,0.01,US Dollar,0.01,US Dollar,Cheque,0
2,2022/09/01 00:00,3209,8000F4670,3209,8000F4670,14675.57,US Dollar,14675.57,US Dollar,Reinvestment,0
3,2022/09/01 00:02,12,8000F5030,12,8000F5030,2806.97,US Dollar,2806.97,US Dollar,Reinvestment,0
4,2022/09/01 00:06,10,8000F5200,10,8000F5200,36682.97,US Dollar,36682.97,US Dollar,Reinvestment,0
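
The `Account` and `Account.1` names that we renamed are what pandas produces when a CSV header repeats a column name. The minimal sketch below reproduces that behaviour on a small in-memory excerpt. It assumes the raw file has two columns both called `Account`; the `raw_csv` text and the `demo` frame are hypothetical stand-ins, not the real dataset.

```python
import io

import pandas as pd

# Hypothetical excerpt with a duplicated "Account" header,
# standing in for the raw CSV file (assumption about its header).
raw_csv = io.StringIO(
    "Timestamp,From Bank,Account,To Bank,Account\n"
    "2022/09/01 00:20,10,8000EBD30,10,8000EBD30\n"
    "2022/09/01 00:20,3208,8000F4580,1,8000F5340\n"
)

demo = pd.read_csv(raw_csv)

# pandas keeps the first "Account" and renames the second to "Account.1"
mangled_columns = list(demo.columns)

# Renaming with a dictionary gives both columns a descriptive name
demo = demo.rename(columns={"Account": "From Account", "Account.1": "To Account"})
renamed_columns = list(demo.columns)

print(mangled_columns)
print(renamed_columns)
```

Renaming by dictionary only touches the listed columns, so the remaining column names stay as they are.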


We renamed the columns to make the data easier to understand. Now let's look at the head and tail of the dataset, along with a random sample.

In [56]:
# Display head and tail of dataset
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    print("Head: ")
    display(df.head())
    print("Tails: ")
    display(df.tail())
    print("Sample of 5 rows: ")
    display(df.sample(5))

Head: 


Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
0,2022/09/01 00:20,10,8000EBD30,10,8000EBD30,3697.34,US Dollar,3697.34,US Dollar,Reinvestment,0
1,2022/09/01 00:20,3208,8000F4580,1,8000F5340,0.01,US Dollar,0.01,US Dollar,Cheque,0
2,2022/09/01 00:00,3209,8000F4670,3209,8000F4670,14675.57,US Dollar,14675.57,US Dollar,Reinvestment,0
3,2022/09/01 00:02,12,8000F5030,12,8000F5030,2806.97,US Dollar,2806.97,US Dollar,Reinvestment,0
4,2022/09/01 00:06,10,8000F5200,10,8000F5200,36682.97,US Dollar,36682.97,US Dollar,Reinvestment,0


Tails: 


Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
5078340,2022/09/10 23:57,54219,8148A6631,256398,8148A8711,0.154978,Bitcoin,0.154978,Bitcoin,Bitcoin,0
5078341,2022/09/10 23:35,15,8148A8671,256398,8148A8711,0.108128,Bitcoin,0.108128,Bitcoin,Bitcoin,0
5078342,2022/09/10 23:52,154365,8148A6771,256398,8148A8711,0.004988,Bitcoin,0.004988,Bitcoin,Bitcoin,0
5078343,2022/09/10 23:46,256398,8148A6311,256398,8148A8711,0.038417,Bitcoin,0.038417,Bitcoin,Bitcoin,0
5078344,2022/09/10 23:37,154518,8148A6091,256398,8148A8711,0.281983,Bitcoin,0.281983,Bitcoin,Bitcoin,0


Sample of 5 rows: 


Unnamed: 0,Timestamp,From Bank,From Account,To Bank,To Account,Amount Received,Receiving Currency,Amount Paid,Payment Currency,Payment Format,Is Laundering
1076263,2022/09/01 22:54,112064,805009500,113530,80524B1B0,46828400.0,Rupee,46828400.0,Rupee,Cheque,0
864335,2022/09/01 16:43,2053,8015DF9E0,2053,8015DF9E0,10.83,US Dollar,10.83,US Dollar,Reinvestment,0
1055255,2022/09/01 22:24,1501,80149D6C0,3881,8041A9610,755.81,Euro,755.81,Euro,Credit Card,0
1312947,2022/09/02 05:31,28611,80A98EF80,28416,80AD0FA70,2205.99,Australian Dollar,2205.99,Australian Dollar,Cheque,0
1870134,2022/09/02 23:32,254242,813E4B521,15,81474CC81,0.323689,Bitcoin,0.323689,Bitcoin,Bitcoin,0


No missing values appear in the rows displayed so far, but a handful of rows cannot rule them out across the full dataset. Let's check this explicitly through a thorough data cleaning process.

<a id='cleaning'></a>
### Data Cleaning  
Here, we take a more in-depth look at our data to ensure that it is clean and does not have any missing values.
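
As a rough illustration of what an explicit missing-value check can look like, the sketch below counts missing cells per column on a small hypothetical frame. The `toy` data is made up for the example; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing cells (not the real dataset)
toy = pd.DataFrame({
    "From Bank": [10, 3208, 12],
    "Amount Paid": [3697.34, np.nan, 2806.97],
    "Payment Format": ["Reinvestment", "Cheque", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# True if any cell anywhere in the frame is missing
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(has_missing)
```

A per-column count is usually more informative than a single yes/no flag, because it shows where any cleaning effort would be needed.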

#### Data Types
First, let's look into the data types of our columns.

In [57]:
# Check dataframe details
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5078345 entries, 0 to 5078344
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Timestamp           object 
 1   From Bank           int64  
 2   From Account        object 
 3   To Bank             int64  
 4   To Account          object 
 5   Amount Received     float64
 6   Receiving Currency  object 
 7   Amount Paid         float64
 8   Payment Currency    object 
 9   Payment Format      object 
 10  Is Laundering       int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 426.2+ MB


Notice that `Timestamp` is stored as type `object`, which means pandas read it as plain strings. Since the values are in a date-time format, we want to convert the column to its corresponding dtype, `datetime64[ns]`.

- `Timestamp`: object $\rightarrow$ datetime


In [58]:
# Convert column types accordingly
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
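
One possible variation, shown here only as a sketch, is to pass an explicit `format` string so that the expected layout is documented and any string that does not match raises an error. The format `%Y/%m/%d %H:%M` is inferred from the rows displayed above, and the `sample` series is a hypothetical stand-in for the `Timestamp` column.

```python
import pandas as pd

# Hypothetical sample of timestamp strings in the layout shown in the data
sample = pd.Series(["2022/09/01 00:20", "2022/09/10 23:57"])

# An explicit format states the expected layout; with the default
# errors="raise", a non-matching string raises instead of being guessed at
parsed = pd.to_datetime(sample, format="%Y/%m/%d %H:%M")

# Programmatic check that the result is a datetime column
is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)

print(parsed.dtype)
print(is_datetime)
```

A check such as `is_datetime64_any_dtype` can complement reading the `df.info()` output by eye.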

In [59]:
# Sanity check: confirm that Timestamp now has a datetime dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5078345 entries, 0 to 5078344
Data columns (total 11 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Timestamp           datetime64[ns]
 1   From Bank           int64         
 2   From Account        object        
 3   To Bank             int64         
 4   To Account          object        
 5   Amount Received     float64       
 6   Receiving Currency  object        
 7   Amount Paid         float64       
 8   Payment Currency    object        
 9   Payment Format      object        
 10  Is Laundering       int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(5)
memory usage: 426.2+ MB


#### Duplicated Data