# Synthetic Financial Datasets For Fraud Detection

PaySim is a synthetic dataset of mobile money transactions. Each step represents one hour of simulation. This dataset is scaled down to 1/4 of the original dataset, which is presented in the paper "PaySim: A financial mobile money simulator for fraud detection".

In [2]:
import pandas as pd

In [3]:
paysim = pd.read_csv('../dataset/kaggle_finance_short.csv')
paysim.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
paysim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649999 entries, 0 to 649998
Data columns (total 11 columns):
step              649999 non-null int64
type              649999 non-null object
amount            649999 non-null float64
nameOrig          649999 non-null object
oldbalanceOrg     649999 non-null float64
newbalanceOrig    649999 non-null float64
nameDest          649999 non-null object
oldbalanceDest    649999 non-null float64
newbalanceDest    649999 non-null float64
isFraud           649999 non-null int64
isFlaggedFraud    649999 non-null int64
dtypes: float64(5), int64(3), object(3)
memory usage: 54.6+ MB


## Get value counts per column: type

In [5]:
paysim['type'].value_counts()

CASH_OUT    230990
PAYMENT     219460
CASH_IN     141571
TRANSFER     53150
DEBIT         4828
Name: type, dtype: int64
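
Raw counts are easier to compare as shares of the total. A minimal sketch on toy data, assuming only that the column holds one type label per transaction; `toy_types` and `shares` are illustrative names, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for paysim['type'] (hypothetical values).
toy_types = pd.Series(['CASH_OUT', 'PAYMENT', 'CASH_OUT', 'TRANSFER'], name='type')

# normalize=True returns each label's fraction of all rows instead of a count.
shares = toy_types.value_counts(normalize=True)
print(shares)
```

On the real frame the same call would be `paysim['type'].value_counts(normalize=True)`.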

## Get the first day

Remember, each step represents one hour, so the first day covers steps 1 to 24.

In [6]:
mask = paysim['step']<=24
df_day1 = paysim[mask]
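
Since each step is one hour, the day and the hour within that day can also be derived directly from `step`, instead of writing one mask per day. A minimal sketch, assuming steps start at 1; the toy frame and the `day` and `hour_of_day` column names are illustrative and not part of the dataset:

```python
import pandas as pd

# Toy stand-in for the `step` column (hypothetical values).
toy = pd.DataFrame({'step': [1, 24, 25, 35, 48, 49]})

# Steps 1-24 belong to day 1, steps 25-48 to day 2, and so on.
toy['day'] = (toy['step'] - 1) // 24 + 1
# Hour within the day, numbered 1-24 to match the step convention.
toy['hour_of_day'] = (toy['step'] - 1) % 24 + 1

print(toy)
```

With such a column in place, a day can be selected with a single comparison such as `toy[toy['day'] == 1]`.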

In [7]:
df_day1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 574255 entries, 0 to 574254
Data columns (total 11 columns):
step              574255 non-null int64
type              574255 non-null object
amount            574255 non-null float64
nameOrig          574255 non-null object
oldbalanceOrg     574255 non-null float64
newbalanceOrig    574255 non-null float64
nameDest          574255 non-null object
oldbalanceDest    574255 non-null float64
newbalanceDest    574255 non-null float64
isFraud           574255 non-null int64
isFlaggedFraud    574255 non-null int64
dtypes: float64(5), int64(3), object(3)
memory usage: 52.6+ MB


## Get the number of transactions per hour for day 1

In [8]:
df_day1['step'].value_counts().sort_index()

1      2708
2      1014
3       552
4       565
5       665
6      1660
7      6837
8     21097
9     37628
10    35991
11    37241
12    36153
13    37515
14    41485
15    44609
16    42471
17    43361
18    49579
19    51352
20    40625
21    19152
22    12635
23     6144
24     3216
Name: step, dtype: int64
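
The busiest hour can be read off these counts programmatically rather than by eye. A small sketch that uses four of the counts shown above; `counts` is a hand-built stand-in for `df_day1['step'].value_counts().sort_index()`:

```python
import pandas as pd

# A few hourly counts copied from the output above (steps 17 to 20).
counts = pd.Series({17: 43361, 18: 49579, 19: 51352, 20: 40625}, name='step')

# idxmax returns the index label (the step) holding the largest count.
busiest_hour = counts.idxmax()
peak = counts.max()
print(busiest_hour, peak)
```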

## Get day 2

In [9]:
mask = (paysim['step']>24) & (paysim['step']<=48)
df_day2 = paysim[mask]
df_day2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 75744 entries, 574255 to 649998
Data columns (total 11 columns):
step              75744 non-null int64
type              75744 non-null object
amount            75744 non-null float64
nameOrig          75744 non-null object
oldbalanceOrg     75744 non-null float64
newbalanceOrig    75744 non-null float64
nameDest          75744 non-null object
oldbalanceDest    75744 non-null float64
newbalanceDest    75744 non-null float64
isFraud           75744 non-null int64
isFlaggedFraud    75744 non-null int64
dtypes: float64(5), int64(3), object(3)
memory usage: 6.9+ MB


## Get the number of transactions per hour for day 2

In [10]:
df_day2['step'].value_counts().sort_index()

25     1598
26      440
27       41
28        4
29        4
30        8
31       12
32       12
33    23616
34    30904
35    19105
Name: step, dtype: int64

## Select hour 19

In [12]:
mask = paysim['step']==19
df_19 = paysim[mask]
df_19.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51352 entries, 441131 to 492482
Data columns (total 11 columns):
step              51352 non-null int64
type              51352 non-null object
amount            51352 non-null float64
nameOrig          51352 non-null object
oldbalanceOrg     51352 non-null float64
newbalanceOrig    51352 non-null float64
nameDest          51352 non-null object
oldbalanceDest    51352 non-null float64
newbalanceDest    51352 non-null float64
isFraud           51352 non-null int64
isFlaggedFraud    51352 non-null int64
dtypes: float64(5), int64(3), object(3)
memory usage: 4.7+ MB


## Get value counts per origin and destination

In [13]:
df_19['type'].value_counts()

CASH_OUT    19822
PAYMENT     15289
CASH_IN     11875
TRANSFER     3946
DEBIT         420
Name: type, dtype: int64

In [14]:
df_19['nameOrig'].value_counts()

C570148238     2
C388263778     1
C1355340407    1
C175282911     1
C1432247813    1
C16360807      1
C1333908337    1
C967830348     1
C510524720     1
C2122207145    1
C1030919418    1
C1532848222    1
C77873697      1
C141612651     1
C1430340895    1
C1966395667    1
C1477060286    1
C2009780559    1
C1792963348    1
C1695751213    1
C1159813956    1
C406422969     1
C573946771     1
C1400584117    1
C1355660181    1
C418504952     1
C320891370     1
C955586475     1
C650348886     1
C513516892     1
              ..
C1407828263    1
C2064448467    1
C104867553     1
C943190306     1
C1506945312    1
C1113039690    1
C191535105     1
C717397299     1
C687566995     1
C912758357     1
C1826390360    1
C210514071     1
C1481948012    1
C1390297073    1
C473183357     1
C873950091     1
C439529939     1
C531232110     1
C1907994121    1
C575477114     1
C270268150     1
C666806136     1
C1470884452    1
C1805712657    1
C1369426492    1
C2082767516    1
C678219894     1
C920825128    

In [15]:
df_19['nameDest'].value_counts()

C272881310     6
C1183266411    6
C1706891711    6
C1233497086    6
C875222405     6
C1317067915    6
C1551456884    6
C1768818329    6
C768414110     6
C2065787664    6
C487569150     5
C1809716881    5
C2145319283    5
C845458098     5
C1847835159    5
C349350443     5
C1832323051    5
C1453244959    5
C1002469873    5
C1561708811    5
C219673488     5
C316486913     5
C849335532     5
C833472510     5
C338861738     5
C726834993     5
C849863109     5
C12526794      5
C1817319941    5
C1598190663    5
              ..
M201282879     1
M1505756798    1
M2056469112    1
C795168473     1
C689645295     1
M328774286     1
M1698791075    1
C1499514230    1
M640338945     1
C1186946484    1
C231206010     1
M859289912     1
M860908978     1
M2035119046    1
M1257350669    1
C1408864247    1
C1263041537    1
M335804110     1
C663267164     1
C1919930418    1
M1399169169    1
M203711921     1
M1466344007    1
M494535995     1
M113040008     1
C465764513     1
C1539333032    1
M1520374315   
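
The destination IDs above start with either `C` or `M`. Assuming these prefixes mark customers and merchants respectively (an assumption about the naming scheme, not something this notebook verifies), the first character can be counted to see how the destinations split. A sketch using a few IDs copied from the output above; `toy_dest` and `prefix_counts` are illustrative names:

```python
import pandas as pd

# A handful of destination IDs taken from the value counts above.
toy_dest = pd.Series(
    ['C272881310', 'M201282879', 'M1505756798', 'C795168473', 'C689645295'],
    name='nameDest',
)

# .str[0] extracts the first character of every ID.
prefix_counts = toy_dest.str[0].value_counts()
print(prefix_counts)
```

On the real data this would be `df_19['nameDest'].str[0].value_counts()`.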

### Let's check which transaction has the highest amount for each hour of the first day

In [16]:
hours = df_day1.groupby('step')

In [17]:
entries = pd.DataFrame(columns=df_day1.columns)

In [18]:
entries

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud


In [19]:
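for hour, value in hours:
    # Keep the single largest-amount row of this hour's group
    highest = value.nlargest(1, 'amount')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    entries = pd.concat([entries, highest])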

In [20]:
entries

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1153,1,TRANSFER,3776389.09,C197491520,0.0,0.0,C1883840933,10138670.86,16874643.09,0,0
3558,2,TRANSFER,2474181.78,C1238013097,0.0,0.0,C306206744,3799219.95,7173523.49,0,0
4155,3,TRANSFER,2837270.65,C1861416877,0.0,0.0,C97730845,9126860.4,12946583.44,0,0
4440,4,TRANSFER,10000000.0,C7162498,12930418.44,2930418.44,C945327594,0.0,0.0,1,0
5162,5,TRANSFER,1915470.93,C457660003,0.0,0.0,C1262822392,11595793.82,13511264.76,0,0
6457,6,TRANSFER,2062692.94,C1650415378,0.0,0.0,C1262822392,14791285.74,16853978.68,0,0
10395,7,TRANSFER,5460002.91,C666654362,5460002.91,0.0,C1726301214,0.0,0.0,1,0
16720,8,TRANSFER,5677662.29,C293394374,0.0,0.0,C1856036778,8427389.99,13688613.94,0,0
66040,9,TRANSFER,6072832.27,C2022065686,1344.0,0.0,C460989529,162174.09,9174785.39,0,0
84416,10,TRANSFER,6419835.27,C890128330,31784.0,0.0,C1192472312,0.0,6691744.85,0,0
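
The loop above grows the result one row at a time. The same selection can be sketched without a loop by asking each group for the index of its largest value. This is an alternative formulation on toy data, not the notebook's own method; `top_per_group` is a hypothetical helper, and it should pick the same rows as `nlargest(1, ...)` as long as the value column has no missing values:

```python
import pandas as pd

# Toy stand-in for df_day1 with only the columns needed here.
toy = pd.DataFrame({
    'step':           [1, 1, 2, 2, 2],
    'amount':         [10.0, 50.0, 7.0, 3.0, 5.0],
    'newbalanceOrig': [100.0, 0.0, 20.0, 90.0, 40.0],
})

def top_per_group(df, group_col, value_col):
    """Return, for each group, the row with the largest value in value_col."""
    # idxmax gives the index label of the maximum within each group.
    idx = df.groupby(group_col)[value_col].idxmax()
    return df.loc[idx]

top_amount = top_per_group(toy, 'step', 'amount')
top_balance = top_per_group(toy, 'step', 'newbalanceOrig')
print(top_amount)
print(top_balance)
```

The helper takes the value column as a parameter, so it works for `amount` and for `newbalanceOrig` alike.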


### Let's check which transaction led to the highest origin account balance for each hour of the first day

In [21]:
entries2 = pd.DataFrame(columns=df_day1.columns)

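for hour, value in hours:
    # Keep the row with the largest resulting origin balance in this hour
    highest = value.nlargest(1, 'newbalanceOrig')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    entries2 = pd.concat([entries2, highest])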


In [22]:
entries2

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1332,1,CASH_IN,143405.8,C2108708444,10102842.03,10246247.83,C667346055,195636.81,9291619.62,0,0
3677,2,CASH_IN,232769.1,C819680566,12225881.83,12458650.93,C2134357721,415675.3,182906.2,0,0
3959,3,CASH_IN,46516.42,C1568507411,7028238.22,7074754.63,C1504109395,976787.99,930271.58,0,0
4526,4,CASH_IN,161339.92,C1531506932,12197608.16,12358948.08,C1504109395,1512985.69,1640678.66,0,0
4977,5,CASH_IN,81126.42,C2002720253,12929376.35,13010502.78,C434091818,108733.31,27606.88,0,0
6695,6,CASH_IN,32078.13,C1489504599,12706854.77,12738932.9,C468154998,40392.82,8314.69,0,0
11721,7,CASH_IN,86755.93,C1536280423,12062822.48,12149578.41,C513145905,94666.0,0.0,0,0
29202,8,CASH_IN,70159.05,C1883677955,28547237.16,28617396.21,C1286885448,227710.07,0.0,0,0
64526,9,CASH_IN,211345.43,C1393868694,33797391.55,34008736.98,C299715257,765542.61,701842.35,0,0
100307,10,CASH_IN,6808.99,C1841909664,38939424.03,38946233.02,C734236179,97006.33,90197.34,0,0


## Explore the dataset yourself!